Update README.md
README.md CHANGED
@@ -103,10 +103,10 @@ This model is v1 and is NVFP4 quantized with nvidia-modelopt **v0.43.0**
 **Properties:** Undisclosed

 ## Evaluation Dataset:
-**Datasets:** MMLU-Pro, LiveCodeBench, IFEval, GPQA Diamond, SciCode, AIME 2025, and IFBench<br>
+**Datasets:** MMLU-Pro, LiveCodeBench, IFEval, GPQA Diamond, SciCode, AIME 2025, IFBench, and AA-LCR<br>
 **Data Collection Method by dataset:** Hybrid, Automated, Human<br>
 **Labeling Method by dataset:** Hybrid, Automated, Human<br>
-**Properties:** We evaluated the model on text-based reasoning and coding benchmarks: MMLU Pro is a multi-task language understanding benchmark with challenging multiple-choice questions across diverse academic domains; LiveCodeBench V6 contains competitive programming problems; SciCode evaluates scientific coding capabilities; IFEval tests whether language models can follow explicit, verifiable formatting and structural constraints layered on top of content generation prompts; GPQA Diamond contains 198 graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry; AIME 2025 contains problems from the American Invitational Mathematics Examination; IFBench evaluates instruction-following capabilities across diverse and structured task constraints.
+**Properties:** We evaluated the model on text-based reasoning and coding benchmarks: MMLU Pro is a multi-task language understanding benchmark with challenging multiple-choice questions across diverse academic domains; LiveCodeBench V6 contains competitive programming problems; SciCode evaluates scientific coding capabilities; IFEval tests whether language models can follow explicit, verifiable formatting and structural constraints layered on top of content generation prompts; GPQA Diamond contains 198 graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry; AIME 2025 contains problems from the American Invitational Mathematics Examination; IFBench evaluates instruction-following capabilities across diverse and structured task constraints; AA-LCR (Artificial Analysis Long Context Reasoning) is a long-context benchmark of 100 questions over documents ranging from 10k to 100k tokens that requires multi-step reasoning and synthesis across dispersed sections rather than simple retrieval.


 ## Inference: