Great job on this one!

#1
by ubergarm - opened

Heya @tarruda, great job cooking up this nice-looking quant!

I appreciate the details and am curious to hear more about how your benchmarking pans out, as you mentioned in this discussion: https://huggingface.co/AesSedai/Qwen3.5-397B-A17B-GGUF/discussions/6#69cd0e742213b5d30c30a419

I'll update my repo of this model to link here as well! Cheers!

Thanks @ubergarm !

After I finish benchmarking (probably by the beginning of next week) I will post all the results here.

I added lm-evaluation-harness results here: https://huggingface.co/tarruda/Qwen3.5-397B-A17B-GGUF/tree/main/IQ3_XXS/lm-evaluation-harness-results
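For anyone who wants to reproduce something similar, here is a minimal sketch of how results like the tables below can be produced with lm-evaluation-harness's Python API. The thread doesn't show the exact invocation used, so this assumes the GGUF quant is served through an OpenAI-compatible endpoint (e.g. a local llama.cpp server); the base URL, model name, and task list are placeholders, not the actual settings from these runs.

```python
# Hedged sketch, NOT the exact command used for the results in this thread.
# Assumes the quant is served by an OpenAI-compatible endpoint (e.g. llama-server)
# and uses lm-evaluation-harness's "local-completions" backend.
import lm_eval
from lm_eval.utils import make_table

results = lm_eval.simple_evaluate(
    model="local-completions",                      # OpenAI-compatible backend
    model_args=(
        "base_url=http://127.0.0.1:8080/v1/completions,"  # placeholder URL
        "model=Qwen3.5-397B-A17B-IQ3_XXS"                 # placeholder model name
    ),
    tasks=["mmlu", "gpqa_diamond_zeroshot",
           "gpqa_diamond_cot_zeroshot", "ifeval", "gsm8k_cot"],
    num_fewshot=0,                                  # matches the n-shot column below
)

# make_table renders the same markdown-style tables pasted below
print(make_table(results))
print(make_table(results, "groups"))
```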

Here's everything concatenated for convenience:

|                 Tasks                 |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu                                   |      2|none  |      |acc   |   |0.8796|±  |0.0026|
| - humanities                          |      2|none  |     0|acc   |↑  |0.8142|±  |0.0054|
|  - formal_logic                       |      1|none  |     0|acc   |↑  |0.7937|±  |0.0362|
|  - high_school_european_history       |      1|none  |     0|acc   |↑  |0.9091|±  |0.0224|
|  - high_school_us_history             |      1|none  |     0|acc   |↑  |0.9657|±  |0.0128|
|  - high_school_world_history          |      1|none  |     0|acc   |↑  |0.9662|±  |0.0118|
|  - international_law                  |      1|none  |     0|acc   |↑  |0.9421|±  |0.0213|
|  - jurisprudence                      |      1|none  |     0|acc   |↑  |0.8889|±  |0.0304|
|  - logical_fallacies                  |      1|none  |     0|acc   |↑  |0.9325|±  |0.0197|
|  - moral_disputes                     |      1|none  |     0|acc   |↑  |0.8902|±  |0.0168|
|  - moral_scenarios                    |      1|none  |     0|acc   |↑  |0.6559|±  |0.0159|
|  - philosophy                         |      1|none  |     0|acc   |↑  |0.9100|±  |0.0163|
|  - prehistory                         |      1|none  |     0|acc   |↑  |0.9444|±  |0.0127|
|  - professional_law                   |      1|none  |     0|acc   |↑  |0.7497|±  |0.0111|
|  - world_religions                    |      1|none  |     0|acc   |↑  |0.9298|±  |0.0196|
| - other                               |      2|none  |     0|acc   |↑  |0.9057|±  |0.0050|
|  - business_ethics                    |      1|none  |     0|acc   |↑  |0.8500|±  |0.0359|
|  - clinical_knowledge                 |      1|none  |     0|acc   |↑  |0.9434|±  |0.0142|
|  - college_medicine                   |      1|none  |     0|acc   |↑  |0.8844|±  |0.0244|
|  - global_facts                       |      1|none  |     0|acc   |↑  |0.6900|±  |0.0465|
|  - human_aging                        |      1|none  |     0|acc   |↑  |0.8430|±  |0.0244|
|  - management                         |      1|none  |     0|acc   |↑  |0.9320|±  |0.0249|
|  - marketing                          |      1|none  |     0|acc   |↑  |0.9701|±  |0.0112|
|  - medical_genetics                   |      1|none  |     0|acc   |↑  |1.0000|±  |0.0000|
|  - miscellaneous                      |      1|none  |     0|acc   |↑  |0.9655|±  |0.0065|
|  - nutrition                          |      1|none  |     0|acc   |↑  |0.9314|±  |0.0145|
|  - professional_accounting            |      1|none  |     0|acc   |↑  |0.8794|±  |0.0194|
|  - professional_medicine              |      1|none  |     0|acc   |↑  |0.9449|±  |0.0139|
|  - virology                           |      1|none  |     0|acc   |↑  |0.6024|±  |0.0381|
| - social sciences                     |      2|none  |     0|acc   |↑  |0.9363|±  |0.0043|
|  - econometrics                       |      1|none  |     0|acc   |↑  |0.8596|±  |0.0327|
|  - high_school_geography              |      1|none  |     0|acc   |↑  |0.9646|±  |0.0132|
|  - high_school_government_and_politics|      1|none  |     0|acc   |↑  |0.9793|±  |0.0103|
|  - high_school_macroeconomics         |      1|none  |     0|acc   |↑  |0.9564|±  |0.0104|
|  - high_school_microeconomics         |      1|none  |     0|acc   |↑  |0.9706|±  |0.0110|
|  - high_school_psychology             |      1|none  |     0|acc   |↑  |0.9725|±  |0.0070|
|  - human_sexuality                    |      1|none  |     0|acc   |↑  |0.9466|±  |0.0197|
|  - professional_psychology            |      1|none  |     0|acc   |↑  |0.9167|±  |0.0112|
|  - public_relations                   |      1|none  |     0|acc   |↑  |0.7818|±  |0.0396|
|  - security_studies                   |      1|none  |     0|acc   |↑  |0.8735|±  |0.0213|
|  - sociology                          |      1|none  |     0|acc   |↑  |0.9303|±  |0.0180|
|  - us_foreign_policy                  |      1|none  |     0|acc   |↑  |0.9700|±  |0.0171|
| - stem                                |      2|none  |     0|acc   |↑  |0.8960|±  |0.0053|
|  - abstract_algebra                   |      1|none  |     0|acc   |↑  |0.7800|±  |0.0416|
|  - anatomy                            |      1|none  |     0|acc   |↑  |0.8593|±  |0.0300|
|  - astronomy                          |      1|none  |     0|acc   |↑  |0.9605|±  |0.0158|
|  - college_biology                    |      1|none  |     0|acc   |↑  |0.9722|±  |0.0137|
|  - college_chemistry                  |      1|none  |     0|acc   |↑  |0.6800|±  |0.0469|
|  - college_computer_science           |      1|none  |     0|acc   |↑  |0.9000|±  |0.0302|
|  - college_mathematics                |      1|none  |     0|acc   |↑  |0.8400|±  |0.0368|
|  - college_physics                    |      1|none  |     0|acc   |↑  |0.8824|±  |0.0321|
|  - computer_security                  |      1|none  |     0|acc   |↑  |0.8700|±  |0.0338|
|  - conceptual_physics                 |      1|none  |     0|acc   |↑  |0.9319|±  |0.0165|
|  - electrical_engineering             |      1|none  |     0|acc   |↑  |0.9103|±  |0.0238|
|  - elementary_mathematics             |      1|none  |     0|acc   |↑  |0.9524|±  |0.0110|
|  - high_school_biology                |      1|none  |     0|acc   |↑  |0.9742|±  |0.0090|
|  - high_school_chemistry              |      1|none  |     0|acc   |↑  |0.8966|±  |0.0214|
|  - high_school_computer_science       |      1|none  |     0|acc   |↑  |0.9400|±  |0.0239|
|  - high_school_mathematics            |      1|none  |     0|acc   |↑  |0.7815|±  |0.0252|
|  - high_school_physics                |      1|none  |     0|acc   |↑  |0.9007|±  |0.0244|
|  - high_school_statistics             |      1|none  |     0|acc   |↑  |0.9028|±  |0.0202|
|  - machine_learning                   |      1|none  |     0|acc   |↑  |0.8482|±  |0.0341|

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |   |0.8796|±  |0.0026|
| - humanities     |      2|none  |     0|acc   |↑  |0.8142|±  |0.0054|
| - other          |      2|none  |     0|acc   |↑  |0.9057|±  |0.0050|
| - social sciences|      2|none  |     0|acc   |↑  |0.9363|±  |0.0043|
| - stem           |      2|none  |     0|acc   |↑  |0.8960|±  |0.0053|

|        Tasks        |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------------------|------:|------|-----:|--------|---|-----:|---|-----:|
|gpqa_diamond_zeroshot|      1|none  |     0|acc     |↑  |0.4949|±  |0.0356|
|                     |       |none  |     0|acc_norm|↑  |0.4949|±  |0.0356|

|          Tasks          |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gpqa_diamond_cot_zeroshot|      1|flexible-extract|     0|exact_match|↑  |0.8636|±  |0.0245|
|                         |       |strict-match    |     0|exact_match|↑  |0.8636|±  |0.0245|

|Tasks |Version|Filter|n-shot|        Metric         |   |Value |   |Stderr|
|------|------:|------|-----:|-----------------------|---|-----:|---|------|
|ifeval|      4|none  |     0|inst_level_loose_acc   |↑  |0.9269|±  |   N/A|
|      |       |none  |     0|inst_level_strict_acc  |↑  |0.9113|±  |   N/A|
|      |       |none  |     0|prompt_level_loose_acc |↑  |0.9316|±  |0.0109|
|      |       |none  |     0|prompt_level_strict_acc|↑  |0.9113|±  |0.0122|

|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     0|exact_match|↑  |0.9257|±  |0.0072|
|         |       |strict-match    |     0|exact_match|↑  |0.7043|±  |0.0126|

The difference in ifeval is too big, so big that I think I must have done something wrong when I ran the harness against your smol-IQ2_XS.

gsm8k is also different: against IQ3_XXS I ran with n-shot = 0, so it got a much worse score on strict match. Still, it was better on flexible match.
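For an apples-to-apples comparison, here is a minimal sketch of re-running only gsm8k_cot with a non-zero few-shot count, presumably because strict-match depends on the model copying the answer format that the few-shot examples establish. The few-shot value of 8, the endpoint, and the model name are assumptions, not what was actually used for these runs.

```python
# Hedged sketch: rerun only gsm8k_cot with few-shot examples so the strict-match
# score is comparable to the earlier run. All specifics below are assumptions.
import lm_eval
from lm_eval.utils import make_table

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "base_url=http://127.0.0.1:8080/v1/completions,"  # placeholder URL
        "model=Qwen3.5-397B-A17B-IQ3_XXS"                 # placeholder model name
    ),
    tasks=["gsm8k_cot"],
    num_fewshot=8,                                  # assumed value for the rerun
)
print(make_table(results))
```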
