No increased performance on the HumanEval benchmark
UQ4-K-XL:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| humaneval | 1 | create_test | 0 | pass@1 | ↑ | 0.75 | ± | 0.0339 |
This is Q4-K-M:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| humaneval | 1 | create_test | 0 | pass@1 | ↑ | 0.75 | ± | 0.0339 |
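As a sanity check on the tables above, the reported stderr is consistent with the binomial sample standard error of pass@1 over HumanEval's 164 problems. This is a sketch assuming lm-evaluation-harness uses the usual sample formula sqrt(p(1-p)/(N-1)) for a 0/1 metric:

```python
import math

# pass@1 value from the tables above; HumanEval contains 164 problems.
p = 0.75
n = 164

# Sample standard error of the mean for a 0/1 metric
# (assumed formula: sqrt(p * (1 - p) / (n - 1))).
stderr = math.sqrt(p * (1 - p) / (n - 1))
print(round(stderr, 4))  # → 0.0339
```

Matching the reported 0.0339 suggests both quants were scored over the full 164-problem set with a single sample per problem.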
Thanks for running these benchmarks; this is really valuable feedback!
You're right that the Opus distillation doesn't improve (and slightly degrades) pure coding benchmarks like MBPP and HumanEval. The training data was focused on reasoning traces and tool-calling patterns rather than coding-specific tasks, so this result is expected.
Where the distillation does show improvement is in agentic workflows: tool calling went from 3.4 to 7.2 in our evaluations, and bug detection improved from 8.4 to 9.0. But that came at the cost of some raw coding ability, which your benchmarks confirm.
This is helpful: we'll use it to steer future dataset curation toward including more coding-specific training data alongside the reasoning traces, so the next iteration doesn't trade coding performance for agentic gains. Appreciate you taking the time to run these!