GPQA diamond
Thanks Christian! I did one benchmark run using evalscope (vLLM with 32K context):
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
|---|---|---|---|---|---|---|
| Qwen3-32B-AWQ | gpqa | AveragePass@1 | gpqa_diamond | 198 | 0.702 | default |
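For anyone unfamiliar with the metric: AveragePass@1 is, as I understand it, the per-question pass rate (fraction of sampled answers that are correct) averaged over all questions. A minimal sketch of that aggregation — a hypothetical helper, not evalscope's actual implementation:

```python
def average_pass_at_1(results):
    """Compute AveragePass@1 from per-question sample outcomes.

    results: list with one entry per question; each entry is a list of
    booleans, one per sampled answer (True = correct).
    """
    # pass@1 for a question = fraction of its samples that were correct
    per_question = [sum(samples) / len(samples) for samples in results]
    # dataset score = mean over all questions (198 for gpqa_diamond)
    return sum(per_question) / len(per_question)

# toy example: 4 questions, 1 sample each, 3 correct
print(round(average_pass_at_1([[True], [True], [True], [False]]), 3))  # 0.75
```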
```shell
evalscope eval \
  --eval-type service \
  --api-key dummy \
  --api-url http://localhost:8000/v1 \
  --model "Qwen3-32B-AWQ" \
  --generation-config temperature=0.6,top_p=0.95,top_k=20 \
  --datasets gpqa \
  --dataset-args '{"gpqa": {"subset_list": ["gpqa_diamond"], "few_shot_num": 0}}' \
  --eval-batch-size 2
```
With Qwen/QwQ-32B-AWQ I benchmarked 0.6313 on gpqa_diamond.
Nice. Now Qwen has also released its own AWQ quants. ;-) I assume they also used AutoAWQ, since they patched it themselves:
https://github.com/casper-hansen/AutoAWQ/pull/751
From what I have read, they are also about to add support for the MoE models, although the project has been adopted by the vllm-project and will not be continued.
Since I had the opportunity to run evalscope with your parameters against a non-quantized version on an H100, and I was curious, here are the results of my run:
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
|---|---|---|---|---|---|---|
| Qwen3-32B | gpqa | AveragePass@1 | gpqa_diamond | 198 | 0.697 | default |