Tags: GGUF · English · qwen3 · reasoning · distillation · claude-opus · full-finetune · conversational

No increased performance in HumanEval benchmark

#4
by Gogich77 - opened

UQ4-K-XL:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-----------|---------|-------------|--------|-----------|-------|----------|
| humaneval | 1 | create_test | 0 | pass@1 ↑ | 0.75 | ± 0.0339 |

This is Q4-K-M:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-----------|---------|-------------|--------|-----------|-------|----------|
| humaneval | 1 | create_test | 0 | pass@1 ↑ | 0.75 | ± 0.0339 |
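As a sanity check on the numbers above, the reported Stderr is consistent with the sample standard error of pass@1 over HumanEval's 164 problems (a sketch assuming lm-evaluation-harness's n-1 "sample" stderr form; figures taken from the tables above):

```python
import math

# pass@1 and problem count from the HumanEval runs above
p = 0.75   # reported pass@1 for both quants
n = 164    # number of problems in HumanEval

# Sample standard error of a 0/1 outcome: sqrt(p * (1 - p) / (n - 1))
# (assumption: the harness uses the n-1 "sample" form of the stderr)
stderr = math.sqrt(p * (1 - p) / (n - 1))
print(round(stderr, 4))  # 0.0339, matching the reported ± value
```

Note that identical pass@1 and stderr for both quants means both solved exactly the same number of problems (123 of 164), so at this sample size the two quantizations are indistinguishable on HumanEval.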

Thanks for running these benchmarks; this is really valuable feedback!

You're right that the Opus distillation doesn't improve (and slightly degrades) pure coding benchmarks like MBPP and HumanEval. The training data was focused on reasoning traces and tool-calling patterns rather than coding-specific tasks, so this result is expected.

Where the distillation does show improvement is in agentic workflows: tool calling went from 3.4 to 7.2 in our evaluations, and bug detection improved from 8.4 to 9.0. But that came at the cost of some raw coding ability, which your benchmarks clearly confirm.

This is helpful β€” we'll use this to steer future dataset curation toward including more coding-specific training data alongside the reasoning traces, so the next iteration doesn't trade off coding performance for agentic gains. Appreciate you taking the time to run these!

samuelcardillo changed discussion status to closed
