cvoegele-nv committed on
Commit 91dfedb · verified
1 Parent(s): 3c61e14

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -103,10 +103,10 @@ This model v1 and is NVFP4 quantized with nvidia-modelopt **v0.43.0**
 **Properties:** Undisclosed
 
 ## Evaluation Dataset:
-**Datasets:** MMLU-Pro, LiveCodebench, IFEval, GPQA Diamond, SciCode, AIME 2025, and IFBench<br>
+**Datasets:** MMLU-Pro, LiveCodebench, IFEval, GPQA Diamond, SciCode, AIME 2025, IFBench, and AA-LCR<br>
 **Data Collection Method by dataset:** Hybrid, Automated, Human <br>
 **Labeling Method by dataset:** Hybrid, Automated, Human<br>
-**Properties:** We evaluated the model on text-based reasoning and coding benchmarks: MMLU Pro is a multi-task language understanding benchmark with challenging multiple-choice questions across diverse academic domains; LiveCodeBench V6 contains competitive programming problems; SciCode evaluates scientific coding capabilities; IFEval is a benchmark that tests whether language models can follow explicit, verifiable formatting and structural constraints layered on top of content generation prompts; GPQA Diamond contains 448 graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry; AIME 2025 contains problems from the American Invitational Mathematics Examination; IFBench is a benchmark for evaluating instruction-following capabilities across diverse and structured task constraints.
+**Properties:** We evaluated the model on text-based reasoning and coding benchmarks: MMLU Pro is a multi-task language understanding benchmark with challenging multiple-choice questions across diverse academic domains; LiveCodeBench V6 contains competitive programming problems; SciCode evaluates scientific coding capabilities; IFEval is a benchmark that tests whether language models can follow explicit, verifiable formatting and structural constraints layered on top of content generation prompts; GPQA Diamond contains 448 graduate-level multiple-choice questions written by domain experts in biology, physics, and chemistry; AIME 2025 contains problems from the American Invitational Mathematics Examination; IFBench is a benchmark for evaluating instruction-following capabilities across diverse and structured task constraints; AA-LCR (Artificial Analysis Long Context Reasoning) is a long-context benchmark of 100 questions over documents ranging from 10k to 100k tokens that requires multi-step reasoning and synthesis across dispersed sections rather than simple retrieval.
 
 
 ## Inference: