QIMMA: The Arabic LLM Leaderboard That Actually Checks Its Homework

If you’ve been watching the Arabic NLP space for a while, you’ve probably noticed the same thing I have: leaderboards keep multiplying, but nobody seems to be asking whether the benchmarks themselves are any good.

That’s the problem TII’s team set out to solve with QIMMA (قمّة, “summit”). Instead of just running models on existing Arabic benchmarks and calling it a day, they built a quality validation pipeline that checks each sample before any evaluation happens. What they found is… not great. Even well-known Arabic benchmarks have systematic quality issues that quietly skew results.

What’s broken about Arabic NLP evaluation

Arabic is spoken by over 400 million people across dozens of countries and dialects. You’d think the evaluation tools would be mature by now. They’re not.

Translation artifacts everywhere. A lot of Arabic benchmarks are just English benchmarks translated word-for-word. Questions that make sense in English become awkward or culturally irrelevant in Arabic. You’re not measuring Arabic language ability; you’re measuring how well a model can parse bad translations.

No quality checks. Even native Arabic benchmarks often ship without any serious validation. Annotation inconsistencies, wrong gold answers, encoding errors, cultural bias in labels — I’ve seen all of these in published benchmarks that people treat as ground truth.

You can’t reproduce anything. Evaluation scripts and per-sample outputs are rarely public. If you want to audit a result or build on someone’s work, good luck.

Fragmented coverage. Existing leaderboards usually cover one or two tasks in narrow domains. You can’t get a holistic picture of a model’s Arabic capability.

Here’s where QIMMA sits relative to the other platforms:

  • OALL v1: open source, but mixed native/translated content and no quality validation
  • OALL v2: mostly native, but still no validation
  • BALSAM: only partially open source, with 50% native content
  • AraGen: fully native, but no validation
  • SILMA ABL: fully native with validation, but no code eval
  • ILMAAM: fully native with validation, but not fully open source
  • HELM Arabic: mixed content with no validation

QIMMA is the only one that combines all five: fully open source, 99% native Arabic content, systematic quality validation, code evaluation, and public per-sample outputs. That’s a real differentiator.

What’s actually in QIMMA

The suite consolidates 109 subsets from 14 source benchmarks into over 52,000 samples across 7 domains:

  • Cultural: AraDiCE-Culture, ArabCulture, PalmX
  • STEM: ArabicMMLU, GAT, 3LM STEM
  • Legal: ArabLegalQA, MizanQA
  • Medical: MedArabiQ, MedAraBench
  • Safety: AraTrust
  • Poetry & Literature: FannOrFlop
  • Coding: 3LM HumanEval+, 3LM MBPP+
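
To make the grouping concrete, here is a minimal sketch of the domain list above as a plain Python mapping. The names `QIMMA_SUITE`, `suite_summary`, and `domain_of` are my own illustration, not identifiers from the QIMMA codebase; only the domain and benchmark names come from the list.

```python
# Domain -> source benchmarks, copied from the list above.
# The variable and function names here are hypothetical, for illustration only.
QIMMA_SUITE = {
    "Cultural": ["AraDiCE-Culture", "ArabCulture", "PalmX"],
    "STEM": ["ArabicMMLU", "GAT", "3LM STEM"],
    "Legal": ["ArabLegalQA", "MizanQA"],
    "Medical": ["MedArabiQ", "MedAraBench"],
    "Safety": ["AraTrust"],
    "Poetry & Literature": ["FannOrFlop"],
    "Coding": ["3LM HumanEval+", "3LM MBPP+"],
}


def suite_summary(suite):
    """Return (number of domains, number of source benchmarks)."""
    return len(suite), sum(len(benchmarks) for benchmarks in suite.values())


def domain_of(benchmark, suite=QIMMA_SUITE):
    """Look up which domain a source benchmark belongs to."""
    for domain, benchmarks in suite.items():
        if benchmark in benchmarks:
            return domain
    raise KeyError(benchmark)
```

Calling `suite_summary(QIMMA_SUITE)` is a quick sanity check that the list really does add up to 7 domains and 14 source benchmarks.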

A few things worth noting. 99% of the content is native Arabic — the only exception is code evaluation, which is inherently language-agnostic. It’s the first Arabic leaderboard with code evaluation, using Arabic-adapted versions of HumanEval+ and MBPP+ with Arabic problem statements. And the domain coverage is genuinely diverse: education, governance, healthcare, creative expression, software development.

The quality pipeline is the real story

Before any model gets evaluated, every single sample goes through a multi-stage validation pipeline. This is where QIMMA earns its keep.

Stage 1: Multi-model automated assessment. Each sample is independently evaluated by two strong LLMs: Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B. They chose two models with strong Arabic capability but different training data, so their combined judgment is more robust than either alone.

Each model scores a sample against a 10-point rubric with binary scores per criterion. A sample is flagged for elimination if either model scores it below 7/10. If both models flag it, it’s dropped immediately. If only one model flags it, it goes to human review.
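
The decision rule is simple enough to sketch in a few lines. This is my reading of the description, not the team’s actual code: the names `Decision`, `rubric_score`, and `triage` are hypothetical, and I’m assuming each judge returns ten pass/fail values, one per rubric criterion.

```python
from enum import Enum

NUM_CRITERIA = 10    # 10-point rubric, one binary score per criterion
PASS_THRESHOLD = 7   # a judge flags the sample when its score is below this


class Decision(Enum):
    KEEP = "keep"
    HUMAN_REVIEW = "human_review"
    DROP = "drop"


def rubric_score(criteria):
    """Sum the binary per-criterion scores from one judge model."""
    if len(criteria) != NUM_CRITERIA:
        raise ValueError(f"expected {NUM_CRITERIA} criteria, got {len(criteria)}")
    return sum(bool(c) for c in criteria)


def triage(judge_a, judge_b):
    """Combine two judges: both flag -> drop, one flags -> human review."""
    flags = sum(rubric_score(c) < PASS_THRESHOLD for c in (judge_a, judge_b))
    if flags == 2:
        return Decision.DROP
    if flags == 1:
        return Decision.HUMAN_REVIEW
    return Decision.KEEP
```

For example, `triage([True] * 10, [True] * 6 + [False] * 4)` returns `Decision.HUMAN_REVIEW`, because only the second judge scores the sample below 7. Sending disagreements to humans instead of auto-dropping them is what keeps one model’s blind spots from silently deleting good samples.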

Stage 2: Human annotation and review. Flagged samples get reviewed by native Arabic speakers with cultural and dialectal familiarity. They make final calls on cultural context, regional variation, dialectal nuance, and subjective interpretation. For culturally sensitive content, multiple perspectives are considered.

What they found: systematic quality problems

The pipeline revealed recurring quality issues across benchmarks — not isolated errors but systematic problems. I’m not going to list every finding here, but the pattern is clear: even widely-used benchmarks have issues that quietly corrupt evaluation results. Translation artifacts, wrong answers, cultural bias, encoding problems. The kind of stuff that makes you question every leaderboard ranking you’ve ever seen.

The rankings (once you clean things up)

I won’t reproduce the full leaderboard here (you can check that on the QIMMA page), but the interesting part is how rankings shift compared to uncleaned benchmarks. Some models that look strong on raw benchmarks drop significantly when you filter out bad samples. Others hold steady or improve. The correlation between raw and cleaned rankings isn’t as tight as you’d hope.
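
If you want to measure this kind of shift on your own runs, one common approach is to rescore each model on only the samples that survived filtering and compare the two rankings with Spearman’s rank correlation. The sketch below is mine, not QIMMA’s tooling: the function names are hypothetical, and the three models and five samples are made-up toy data, not real results.

```python
def accuracy(correct, keep=None):
    """Fraction of correct answers, optionally restricted to kept sample IDs."""
    ids = [i for i in correct if keep is None or i in keep]
    if not ids:
        raise ValueError("no samples to score")
    return sum(correct[i] for i in ids) / len(ids)


def ranking(scores):
    """Model names sorted best-first (ties broken alphabetically)."""
    return sorted(scores, key=lambda m: (-scores[m], m))


def spearman(rank_a, rank_b):
    """Spearman's rho for two rankings of the same models (assumes no ties)."""
    n = len(rank_a)
    if n < 2 or set(rank_a) != set(rank_b):
        raise ValueError("need two rankings of the same 2+ models")
    pos_b = {m: i for i, m in enumerate(rank_b)}
    d2 = sum((i - pos_b[m]) ** 2 for i, m in enumerate(rank_a))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Toy per-sample correctness (made up for illustration, NOT real QIMMA data).
results = {
    "model_a": {"s1": True, "s2": True, "s3": False, "s4": True, "s5": True},
    "model_b": {"s1": True, "s2": True, "s3": True, "s4": False, "s5": False},
    "model_c": {"s1": True, "s2": False, "s3": False, "s4": False, "s5": True},
}
kept = {"s1", "s2", "s3"}  # pretend s4 and s5 failed quality validation

raw_rank = ranking({m: accuracy(r) for m, r in results.items()})
clean_rank = ranking({m: accuracy(r, kept) for m, r in results.items()})
rho = spearman(raw_rank, clean_rank)
```

In this toy setup `model_a` leads on the raw set but falls behind `model_b` once the two bad samples are removed, which is exactly the kind of reordering that public per-sample outputs let you check for yourself.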

The share of flawed samples is higher than I expected. I thought the quality issues would be minor, maybe 5-10% of samples. The team found more than that in several benchmarks. It makes you wonder how many “state-of-the-art” Arabic models are actually just good at gaming noisy benchmarks.

What this means for the field

QIMMA is a step in the right direction, but it’s not a silver bullet. The pipeline itself depends on two LLMs for initial screening, which introduces its own biases. Human annotation is expensive and hard to scale. And the benchmark coverage, while broad, still leaves gaps — dialectal Arabic, speech, and multimodal tasks are mostly absent.

That said, the approach is overdue. If you’re building Arabic LLMs or evaluating them, you should be paying attention to data quality first. Running models on garbage benchmarks and publishing flashy numbers helps nobody.

The team has open-sourced everything: the leaderboard, the code, the paper. Go check it out, run your own models, and see how they actually perform on clean Arabic data. I suspect some reputations are about to get revised.
