Tags: GGUF · English · qwen3 · reasoning · distillation · claude-opus · full-finetune · conversational

No increased performance in HumanEval benchmark

#4
by Gogich77 - opened

UQ4-K-XL:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-----------|---------|-------------|--------|-----------|-------|----------|
| humaneval | 1 | create_test | 0 | pass@1 ↑ | 0.75 | ± 0.0339 |

This is Q4-K-M:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-----------|---------|-------------|--------|-----------|-------|----------|
| humaneval | 1 | create_test | 0 | pass@1 ↑ | 0.75 | ± 0.0339 |
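As a sanity check on the numbers above, the reported Stderr is consistent with the sample standard error of pass@1 over HumanEval's 164 problems (a sketch assuming lm-evaluation-harness's n-1 "sample" stderr form; figures taken from the tables above):

```python
import math

# pass@1 and problem count from the HumanEval runs above
p = 0.75   # reported pass@1 for both quants
n = 164    # number of problems in HumanEval

# Sample standard error of a 0/1 outcome: sqrt(p * (1 - p) / (n - 1))
# (assumption: the harness uses the n-1 "sample" form of the stderr)
stderr = math.sqrt(p * (1 - p) / (n - 1))
print(round(stderr, 4))  # 0.0339, matching the reported ± value
```

Note that identical pass@1 and stderr for both quants means both solved exactly the same number of problems (123 of 164), so at this sample size the two quantizations are indistinguishable on HumanEval.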

Thanks for running these benchmarks; this is really valuable feedback!

You're right that the Opus distillation doesn't improve (and slightly degrades) pure coding benchmarks like MBPP and HumanEval. The training data was focused on reasoning traces and tool-calling patterns rather than coding-specific tasks, so this result is expected.

Where the distillation does show improvement is in agentic workflows: tool calling went from 3.4 to 7.2 in our evaluations, and bug detection improved from 8.4 to 9.0. But that came at the cost of some raw coding ability, which your benchmarks clearly confirm.

This is helpful β€” we'll use this to steer future dataset curation toward including more coding-specific training data alongside the reasoning traces, so the next iteration doesn't trade off coding performance for agentic gains. Appreciate you taking the time to run these!

samuelcardillo changed discussion status to closed
