Great job on this one!
Heya @tarruda , great job cooking up this nice looking quant!
I appreciate the details and am curious to hear how your benchmarking pans out, as you mentioned in this discussion: https://huggingface.co/AesSedai/Qwen3.5-397B-A17B-GGUF/discussions/6#69cd0e742213b5d30c30a419
I'll update my repo of this model to link here as well! Cheers!
I added lm-evaluation-harness results here: https://huggingface.co/tarruda/Qwen3.5-397B-A17B-GGUF/tree/main/IQ3_XXS/lm-evaluation-harness-results
Here's everything concatenated for convenience:
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|---------------------------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu                                   |      2|none  |      |acc   |↑  |0.8796|±  |0.0026|
| - humanities | 2|none | 0|acc |↑ |0.8142|± |0.0054|
| - formal_logic | 1|none | 0|acc |↑ |0.7937|± |0.0362|
| - high_school_european_history | 1|none | 0|acc |↑ |0.9091|± |0.0224|
| - high_school_us_history | 1|none | 0|acc |↑ |0.9657|± |0.0128|
| - high_school_world_history | 1|none | 0|acc |↑ |0.9662|± |0.0118|
| - international_law | 1|none | 0|acc |↑ |0.9421|± |0.0213|
| - jurisprudence | 1|none | 0|acc |↑ |0.8889|± |0.0304|
| - logical_fallacies | 1|none | 0|acc |↑ |0.9325|± |0.0197|
| - moral_disputes | 1|none | 0|acc |↑ |0.8902|± |0.0168|
| - moral_scenarios | 1|none | 0|acc |↑ |0.6559|± |0.0159|
| - philosophy | 1|none | 0|acc |↑ |0.9100|± |0.0163|
| - prehistory | 1|none | 0|acc |↑ |0.9444|± |0.0127|
| - professional_law | 1|none | 0|acc |↑ |0.7497|± |0.0111|
| - world_religions | 1|none | 0|acc |↑ |0.9298|± |0.0196|
| - other | 2|none | 0|acc |↑ |0.9057|± |0.0050|
| - business_ethics | 1|none | 0|acc |↑ |0.8500|± |0.0359|
| - clinical_knowledge | 1|none | 0|acc |↑ |0.9434|± |0.0142|
| - college_medicine | 1|none | 0|acc |↑ |0.8844|± |0.0244|
| - global_facts | 1|none | 0|acc |↑ |0.6900|± |0.0465|
| - human_aging | 1|none | 0|acc |↑ |0.8430|± |0.0244|
| - management | 1|none | 0|acc |↑ |0.9320|± |0.0249|
| - marketing | 1|none | 0|acc |↑ |0.9701|± |0.0112|
| - medical_genetics | 1|none | 0|acc |↑ |1.0000|± |0.0000|
| - miscellaneous | 1|none | 0|acc |↑ |0.9655|± |0.0065|
| - nutrition | 1|none | 0|acc |↑ |0.9314|± |0.0145|
| - professional_accounting | 1|none | 0|acc |↑ |0.8794|± |0.0194|
| - professional_medicine | 1|none | 0|acc |↑ |0.9449|± |0.0139|
| - virology | 1|none | 0|acc |↑ |0.6024|± |0.0381|
| - social sciences | 2|none | 0|acc |↑ |0.9363|± |0.0043|
| - econometrics | 1|none | 0|acc |↑ |0.8596|± |0.0327|
| - high_school_geography | 1|none | 0|acc |↑ |0.9646|± |0.0132|
| - high_school_government_and_politics| 1|none | 0|acc |↑ |0.9793|± |0.0103|
| - high_school_macroeconomics | 1|none | 0|acc |↑ |0.9564|± |0.0104|
| - high_school_microeconomics | 1|none | 0|acc |↑ |0.9706|± |0.0110|
| - high_school_psychology | 1|none | 0|acc |↑ |0.9725|± |0.0070|
| - human_sexuality | 1|none | 0|acc |↑ |0.9466|± |0.0197|
| - professional_psychology | 1|none | 0|acc |↑ |0.9167|± |0.0112|
| - public_relations | 1|none | 0|acc |↑ |0.7818|± |0.0396|
| - security_studies | 1|none | 0|acc |↑ |0.8735|± |0.0213|
| - sociology | 1|none | 0|acc |↑ |0.9303|± |0.0180|
| - us_foreign_policy | 1|none | 0|acc |↑ |0.9700|± |0.0171|
| - stem | 2|none | 0|acc |↑ |0.8960|± |0.0053|
| - abstract_algebra | 1|none | 0|acc |↑ |0.7800|± |0.0416|
| - anatomy | 1|none | 0|acc |↑ |0.8593|± |0.0300|
| - astronomy | 1|none | 0|acc |↑ |0.9605|± |0.0158|
| - college_biology | 1|none | 0|acc |↑ |0.9722|± |0.0137|
| - college_chemistry | 1|none | 0|acc |↑ |0.6800|± |0.0469|
| - college_computer_science | 1|none | 0|acc |↑ |0.9000|± |0.0302|
| - college_mathematics | 1|none | 0|acc |↑ |0.8400|± |0.0368|
| - college_physics | 1|none | 0|acc |↑ |0.8824|± |0.0321|
| - computer_security | 1|none | 0|acc |↑ |0.8700|± |0.0338|
| - conceptual_physics | 1|none | 0|acc |↑ |0.9319|± |0.0165|
| - electrical_engineering | 1|none | 0|acc |↑ |0.9103|± |0.0238|
| - elementary_mathematics | 1|none | 0|acc |↑ |0.9524|± |0.0110|
| - high_school_biology | 1|none | 0|acc |↑ |0.9742|± |0.0090|
| - high_school_chemistry | 1|none | 0|acc |↑ |0.8966|± |0.0214|
| - high_school_computer_science | 1|none | 0|acc |↑ |0.9400|± |0.0239|
| - high_school_mathematics | 1|none | 0|acc |↑ |0.7815|± |0.0252|
| - high_school_physics | 1|none | 0|acc |↑ |0.9007|± |0.0244|
| - high_school_statistics | 1|none | 0|acc |↑ |0.9028|± |0.0202|
| - machine_learning | 1|none | 0|acc |↑ |0.8482|± |0.0341|
| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.8796|±  |0.0026|
| - humanities | 2|none | 0|acc |↑ |0.8142|± |0.0054|
| - other | 2|none | 0|acc |↑ |0.9057|± |0.0050|
| - social sciences| 2|none | 0|acc |↑ |0.9363|± |0.0043|
| - stem | 2|none | 0|acc |↑ |0.8960|± |0.0053|
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------------------|------:|------|-----:|--------|---|-----:|---|-----:|
|gpqa_diamond_zeroshot| 1|none | 0|acc |↑ |0.4949|± |0.0356|
| | |none | 0|acc_norm|↑ |0.4949|± |0.0356|
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|-------------------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gpqa_diamond_cot_zeroshot| 1|flexible-extract| 0|exact_match|↑ |0.8636|± |0.0245|
| | |strict-match | 0|exact_match|↑ |0.8636|± |0.0245|
|Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------|------:|------|-----:|-----------------------|---|-----:|---|------|
|ifeval| 4|none | 0|inst_level_loose_acc |↑ |0.9269|± | N/A|
| | |none | 0|inst_level_strict_acc |↑ |0.9113|± | N/A|
| | |none | 0|prompt_level_loose_acc |↑ |0.9316|± |0.0109|
| | |none | 0|prompt_level_strict_acc|↑ |0.9113|± |0.0122|
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot| 3|flexible-extract| 0|exact_match|↑ |0.9257|± |0.0072|
| | |strict-match | 0|exact_match|↑ |0.7043|± |0.0126|
The difference on ifeval is too big. So big that I think I must have done something wrong when I ran the harness against your smol-IQ2_XS.
gsm8k also differs: against IQ3_XXS I ran with n-shot = 0, so it got a much worse score in strict match. Still, it did better in flexible match.
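For a rough sanity check on whether two scores actually differ beyond noise, one can compare the gap against the combined standard errors reported by the harness. This is just a back-of-envelope two-sided z-test sketch (the 1.96 threshold is my assumption, not something the harness does):

```python
import math

def scores_differ(v1, se1, v2, se2, z=1.96):
    """Return True if two benchmark scores are further apart
    than their combined standard errors allow (rough z-test)."""
    combined_se = math.sqrt(se1 ** 2 + se2 ** 2)
    return abs(v1 - v2) > z * combined_se

# gsm8k_cot flexible-extract vs strict-match from the table above
print(scores_differ(0.9257, 0.0072, 0.7043, 0.0126))  # True: real gap, not noise
```

By this check, the gsm8k strict/flexible gap is far outside the error bars, consistent with it being an effect of the n-shot setting rather than sampling noise.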