cvoegele-nv committed on
Commit 91dfedb · verified
1 Parent(s): 3c61e14

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -103,10 +103,10 @@ This model v1 and is NVFP4 quantized with nvidia-modelopt **v0.43.0**
 **Properties:** Undisclosed
 
 ## Evaluation Dataset:
-**Datasets:** MMLU-Pro, LiveCodebench, IFEval, GPQA Diamond, SciCode, AIME 2025, and IFBench<br>
+**Datasets:** MMLU-Pro, LiveCodebench, IFEval, GPQA Diamond, SciCode, AIME 2025, IFBench, and AA-LCR<br>
 **Data Collection Method by dataset:** Hybrid, Automated, Human <br>
 **Labeling Method by dataset:** Hybrid, Automated, Human<br>
-**Properties:** We evaluated the model on text-based reasoning and coding benchmarks: MMLU Pro is a multi-task language understanding benchmark with challenging multiple-choice questions across diverse academic domains; LiveCodeBench V6 contains competitive programming problems; SciCode evaluates scientific coding capabilities; IFEval is a benchmark that tests whether language models can follow explicit, verifiable formatting and structural constraints layered on top of content generation prompts; GPQA Diamond contains 448 graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry; AIME 2025 contains problems from the American Invitational Mathematics Examination; IFBench is a benchmark for evaluating instruction-following capabilities across diverse and structured task constraints.
+**Properties:** We evaluated the model on text-based reasoning and coding benchmarks: MMLU Pro is a multi-task language understanding benchmark with challenging multiple-choice questions across diverse academic domains; LiveCodeBench V6 contains competitive programming problems; SciCode evaluates scientific coding capabilities; IFEval is a benchmark that tests whether language models can follow explicit, verifiable formatting and structural constraints layered on top of content generation prompts; GPQA Diamond contains 448 graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry; AIME 2025 contains problems from the American Invitational Mathematics Examination; IFBench is a benchmark for evaluating instruction-following capabilities across diverse and structured task constraints; AA-LCR (Artificial Analysis Long Context Reasoning) is a long-context benchmark of 100 questions over documents ranging from 10k to 100k tokens that requires multi-step reasoning and synthesis across dispersed sections rather than simple retrieval.
 
 
 ## Inference: