Spaces: qimma/leaderboard

Shaikha710 committed on Mar 3

Commit

2f314ce

1 Parent(s): f8be5d4

updated info in about

Files changed (1)

frontend/about.html +70 -43

@@ -25,9 +25,9 @@
         <div class="text-center mb-12">
             <h1 class="text-4xl font-extrabold tracking-tight sm:text-5xl gradient-text mb-6">About QIMMA</h1>
-            <p class="text-lg text-slate-600 dark:text-slate-400 max-w-2xl mx-auto">
                 Understanding the methodology and metrics behind the Open Arabic LLM Leaderboard.
-            </p>
         </div>
         <div class="space-y-12 animate-fade-in">
@@ -43,15 +43,10 @@
                 </div>
                 <div class="prose dark:prose-invert max-w-none text-slate-600 dark:text-slate-300 leading-relaxed">
                     <p class="mb-4">
-                        QIMMA (Summit in Arabic) is a comprehensive leaderboard designed to evaluate and compare the
-                        performance of Large Language Models (LLMs) on Arabic language tasks. As the field of Arabic NLP
-                        grows, there is a pressing need for a standardized, transparent, and rigorous benchmark to
-                        assess model capabilities.
                     </p>
                     <p>
-                        Our goal is to foster innovation in the Arabic AI ecosystem by providing researchers and
-                        developers with clear insights into model performance across a diverse set of linguistic and
-                        reasoning challenges.
                     </p>
                 </div>
             </section>
@@ -67,21 +62,15 @@
                 </div>
                 <div class="prose dark:prose-invert max-w-none text-slate-600 dark:text-slate-300 leading-relaxed">
                     <p class="mb-6">
-                        The leaderboard utilizes a robust evaluation pipeline powered by <a
-                            href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"
-                            class="text-indigo-600 dark:text-indigo-400 hover:underline">EleutherAI's LM Evaluation
-                            Harness</a>. We ensure fair comparisons by running all models in a controlled environment
-                        with consistent settings.
                     </p>
                     <h3 class="text-lg font-bold text-slate-800 dark:text-slate-200 mb-3">Evaluation Pipeline</h3>
                     <ul class="space-y-3 list-disc list-inside marker:text-indigo-500">
                         <li><strong>Submission:</strong> Users submit their Hugging Face model ID.</li>
                         <li><strong>Queue:</strong> Models are added to a processing queue.</li>
-                        <li><strong>Execution:</strong> Our GPU cluster loads the model and runs the benchmark suite.
-                        </li>
-                        <li><strong>Scoring:</strong> Results are computed, aggregated, and published to the
-                            leaderboard.</li>
                     </ul>
                 </div>
             </section>
@@ -96,29 +85,67 @@
                     <h2 class="text-2xl font-bold text-slate-800 dark:text-slate-100">Benchmarks & Metrics</h2>
                 </div>
                 <div class="grid grid-cols-1 md:grid-cols-2 gap-6">
-                    <div
-                        class="p-4 bg-slate-50 dark:bg-slate-700/30 rounded-xl border border-slate-100 dark:border-slate-700">
-                        <h4 class="font-bold text-slate-800 dark:text-slate-200 mb-2">ArabicMMLU</h4>
-                        <p class="text-sm text-slate-600 dark:text-slate-400">Massive Multitask Language Understanding
-                            adapted for Arabic, covering diverse topics from STEM to humanities.</p>
                     </div>
-                    <div
-                        class="p-4 bg-slate-50 dark:bg-slate-700/30 rounded-xl border border-slate-100 dark:border-slate-700">
-                        <h4 class="font-bold text-slate-800 dark:text-slate-200 mb-2">EXAMS</h4>
-                        <p class="text-sm text-slate-600 dark:text-slate-400">A benchmark for evaluating model
-                            performance on high school examinations across various subjects.</p>
                     </div>
-                    <div
-                        class="p-4 bg-slate-50 dark:bg-slate-700/30 rounded-xl border border-slate-100 dark:border-slate-700">
-                        <h4 class="font-bold text-slate-800 dark:text-slate-200 mb-2">AlGhafa</h4>
-                        <p class="text-sm text-slate-600 dark:text-slate-400">A suite of multiple-choice QA tasks
-                            designed to test native Arabic reasoning and cultural knowledge.</p>
                     </div>
-                    <div
-                        class="p-4 bg-slate-50 dark:bg-slate-700/30 rounded-xl border border-slate-100 dark:border-slate-700">
-                        <h4 class="font-bold text-slate-800 dark:text-slate-200 mb-2">MadinahQA</h4>
-                        <p class="text-sm text-slate-600 dark:text-slate-400">A question answering dataset focusing on
-                            religious and historical knowledge relevant to the Arab world.</p>
                     </div>
                 </div>
             </section>
@@ -136,11 +163,11 @@
                 </div>
                 <pre id="citationCode"
                     class="bg-slate-100 dark:bg-slate-900/50 p-6 rounded-xl border border-slate-200 dark:border-slate-700 overflow-x-auto text-xs sm:text-sm text-slate-600 dark:text-slate-400 font-mono leading-relaxed">@misc{QIMMA,
-  author = {AlQadi, Leen and Alzubaidi, Ahmed and Alobeidli, Hamza and Alhammadi, Maitha and Alyafeai, Mohammed and Alswuidi, Shaikha and  Alkaabi, Omar and Boussaha, Basma El Amel and Hacid, Hakim},
-  title = {QIMMA Leaderboard},
-  year = {2026},
-  publisher = {QIMMA},
-  howpublished = {https://huggingface.co/spaces/qimma/Qimma-Leaderboard}
 }</pre>
             </div>
         </div>

         <div class="text-center mb-12">
             <h1 class="text-4xl font-extrabold tracking-tight sm:text-5xl gradient-text mb-6">About QIMMA</h1>
+            <!-- <p class="text-lg text-slate-600 dark:text-slate-400 max-w-2xl mx-auto">
                 Understanding the methodology and metrics behind the Open Arabic LLM Leaderboard.
+            </p> -->
         </div>
         <div class="space-y-12 animate-fade-in">
                 </div>
                 <div class="prose dark:prose-invert max-w-none text-slate-600 dark:text-slate-300 leading-relaxed">
                     <p class="mb-4">
+                        QIMMA قمّة (Summit in Arabic) is a quality-assured Arabic LLM evaluation leaderboard built on 13 carefully chosen benchmarks spanning STEM, legal reasoning, medical knowledge, poetry, cultural understanding, and code generation. QIMMA includes over 52,000 quality-validated samples across multiple-choice, generative, and code evaluation tracks. Over 99% of QIMMA's content is native Arabic, ensuring authentic linguistic and cultural assessment rather than relying on translated materials.
                     </p>
                     <p>
+                        QIMMA was constructed through a systematic benchmark curation process: candidate benchmarks were assessed using a multi-model quality validation pipeline that flagged problematic samples, including incorrect, missing, or invalid gold answers and text-encoding problems, among other issues. Only clean, validated samples made it into the final leaderboard. This process also revealed that quality problems are more pervasive across existing Arabic benchmarks than previously documented.
                     </p>
                 </div>
             </section>
                 </div>
                 <div class="prose dark:prose-invert max-w-none text-slate-600 dark:text-slate-300 leading-relaxed">
                     <p class="mb-6">
+                        QIMMA uses a multi-framework evaluation pipeline tailored to each task type. Multiple-choice and generative QA tasks are evaluated using <a href="https://github.com/huggingface/lighteval" target="_blank" class="text-indigo-600 dark:text-indigo-400 hover:underline">LightEval</a>, code generation is evaluated using <a href="https://github.com/evalplus/evalplus" target="_blank" class="text-indigo-600 dark:text-indigo-400 hover:underline">EvalPlus</a>, and poetry understanding is evaluated using an adapted version of the <a href="https://github.com/mbzuai-oryx/FannOrFlop" target="_blank" class="text-indigo-600 dark:text-indigo-400 hover:underline">FannOrFlop</a> evaluation framework. All models are run in a controlled environment with consistent prompt templates to ensure fair comparison.
                     </p>
                     <h3 class="text-lg font-bold text-slate-800 dark:text-slate-200 mb-3">Evaluation Pipeline</h3>
                     <ul class="space-y-3 list-disc list-inside marker:text-indigo-500">
                         <li><strong>Submission:</strong> Users submit their Hugging Face model ID.</li>
                         <li><strong>Queue:</strong> Models are added to a processing queue.</li>
+                        <li><strong>Execution:</strong> Our GPU cluster loads the model and runs the benchmark suite across all tracks.</li>
+                        <li><strong>Scoring:</strong> Per-subset scores are aggregated by domain and published alongside per-sample model predictions for full transparency.</li>
                     </ul>
                 </div>
             </section>
                     <h2 class="text-2xl font-bold text-slate-800 dark:text-slate-100">Benchmarks & Metrics</h2>
                 </div>
                 <div class="grid grid-cols-1 md:grid-cols-2 gap-6">
+                    <div class="p-4 bg-slate-50 dark:bg-slate-700/30 rounded-xl border border-slate-100 dark:border-slate-700">
+                        <div class="flex items-center justify-between mb-2">
+                            <h4 class="font-bold text-slate-800 dark:text-slate-200">Cultural Understanding</h4>
+                            <span class="text-xs font-medium px-2 py-0.5 rounded-full bg-indigo-100 dark:bg-indigo-900/50 text-indigo-600 dark:text-indigo-400">MCQ</span>
+                        </div>
+                        <p class="text-sm text-slate-600 dark:text-slate-400">AraDiCE-Culture, ArabCulture, PalmX — evaluating native Arabic cultural knowledge, social norms, and regional understanding.</p>
                     </div>
+                    <div class="p-4 bg-slate-50 dark:bg-slate-700/30 rounded-xl border border-slate-100 dark:border-slate-700">
+                        <div class="flex items-center justify-between mb-2">
+                            <h4 class="font-bold text-slate-800 dark:text-slate-200">STEM</h4>
+                            <span class="text-xs font-medium px-2 py-0.5 rounded-full bg-indigo-100 dark:bg-indigo-900/50 text-indigo-600 dark:text-indigo-400">MCQ</span>
+                        </div>
+                        <p class="text-sm text-slate-600 dark:text-slate-400">ArabicMMLU, 3LM STEM — covering science, mathematics, and technical reasoning across diverse academic subjects.</p>
                     </div>
+                    <div class="p-4 bg-slate-50 dark:bg-slate-700/30 rounded-xl border border-slate-100 dark:border-slate-700">
+                        <div class="flex items-center justify-between mb-2">
+                            <h4 class="font-bold text-slate-800 dark:text-slate-200">Language & Reasoning</h4>
+                            <span class="text-xs font-medium px-2 py-0.5 rounded-full bg-indigo-100 dark:bg-indigo-900/50 text-indigo-600 dark:text-indigo-400">MCQ</span>
+                        </div>
+                        <p class="text-sm text-slate-600 dark:text-slate-400">GAT (Saudi Aptitude Test) — assessing verbal reasoning, language comprehension, and mathematical aptitude.</p>
                     </div>
+                    <div class="p-4 bg-slate-50 dark:bg-slate-700/30 rounded-xl border border-slate-100 dark:border-slate-700">
+                        <div class="flex items-center justify-between mb-2">
+                            <h4 class="font-bold text-slate-800 dark:text-slate-200">Legal</h4>
+                            <div class="flex gap-1">
+                                <span class="text-xs font-medium px-2 py-0.5 rounded-full bg-indigo-100 dark:bg-indigo-900/50 text-indigo-600 dark:text-indigo-400">MCQ</span>
+                                <span class="text-xs font-medium px-2 py-0.5 rounded-full bg-cyan-100 dark:bg-cyan-900/50 text-cyan-600 dark:text-cyan-400">QA</span>
+                            </div>
+                        </div>
+                        <p class="text-sm text-slate-600 dark:text-slate-400">ArabLegalQA (Saudi law), MizanQA (Moroccan law) — testing legal reasoning and regulatory understanding across Arabic jurisdictions.</p>
+                    </div>
+                    <div class="p-4 bg-slate-50 dark:bg-slate-700/30 rounded-xl border border-slate-100 dark:border-slate-700">
+                        <div class="flex items-center justify-between mb-2">
+                            <h4 class="font-bold text-slate-800 dark:text-slate-200">Medical</h4>
+                            <div class="flex gap-1">
+                                <span class="text-xs font-medium px-2 py-0.5 rounded-full bg-indigo-100 dark:bg-indigo-900/50 text-indigo-600 dark:text-indigo-400">MCQ</span>
+                                <span class="text-xs font-medium px-2 py-0.5 rounded-full bg-cyan-100 dark:bg-cyan-900/50 text-cyan-600 dark:text-cyan-400">QA</span>
+                            </div>
+                        </div>
+                        <p class="text-sm text-slate-600 dark:text-slate-400">MedArabiQ, MedAraBench — evaluating healthcare knowledge, clinical reasoning, and medical terminology in Arabic.</p>
+                    </div>
+                    <div class="p-4 bg-slate-50 dark:bg-slate-700/30 rounded-xl border border-slate-100 dark:border-slate-700">
+                        <div class="flex items-center justify-between mb-2">
+                            <h4 class="font-bold text-slate-800 dark:text-slate-200">Poetry & Literature</h4>
+                            <span class="text-xs font-medium px-2 py-0.5 rounded-full bg-indigo-100 dark:bg-indigo-900/50 text-indigo-600 dark:text-indigo-400">MCQ</span>
+                        </div>
+                        <p class="text-sm text-slate-600 dark:text-slate-400">FannOrFlop — assessing understanding of classical and modern Arabic poetry, literary devices, and cultural context.</p>
+                    </div>
+                    <div class="p-4 bg-slate-50 dark:bg-slate-700/30 rounded-xl border border-slate-100 dark:border-slate-700">
+                        <div class="flex items-center justify-between mb-2">
+                            <h4 class="font-bold text-slate-800 dark:text-slate-200">Trust & Safety</h4>
+                            <span class="text-xs font-medium px-2 py-0.5 rounded-full bg-indigo-100 dark:bg-indigo-900/50 text-indigo-600 dark:text-indigo-400">MCQ</span>
+                        </div>
+                        <p class="text-sm text-slate-600 dark:text-slate-400">AraTrust — probing model behavior on sensitive content, bias, and safety-relevant scenarios in Arabic.</p>
+                    </div>
+                    <div class="p-4 bg-slate-50 dark:bg-slate-700/30 rounded-xl border border-slate-100 dark:border-slate-700">
+                        <div class="flex items-center justify-between mb-2">
+                            <h4 class="font-bold text-slate-800 dark:text-slate-200">Code Generation</h4>
+                            <span class="text-xs font-medium px-2 py-0.5 rounded-full bg-emerald-100 dark:bg-emerald-900/50 text-emerald-600 dark:text-emerald-400">CODE</span>
+                        </div>
+                        <p class="text-sm text-slate-600 dark:text-slate-400">3LM HumanEval+, 3LM MBPP+ — assessing code generation from Arabic instructions, evaluated via functional correctness (pass@1).</p>
                     </div>
                 </div>
             </section>
                 </div>
                 <pre id="citationCode"
                     class="bg-slate-100 dark:bg-slate-900/50 p-6 rounded-xl border border-slate-200 dark:border-slate-700 overflow-x-auto text-xs sm:text-sm text-slate-600 dark:text-slate-400 font-mono leading-relaxed">@misc{QIMMA,
+author = {AlQadi, Leen and Alzubaidi, Ahmed and Alobeidli, Hamza and Alhammadi, Maitha and Alyafeai, Mohammed and Alswuidi, Shaikha and  Alkaabi, Omar and Boussaha, Basma El Amel and Hacid, Hakim},
+title = {QIMMA Leaderboard},
+year = {2026},
+publisher = {QIMMA},
+howpublished = {https://huggingface.co/spaces/qimma/Qimma-Leaderboard}
 }</pre>
             </div>
         </div>