benchmarks
hello, nice work. Are there any benchmarks?
The model doesn't fit on my GPU, so to benchmark it I'd have to use a cloud provider for the compute (aka spend more money I don't have lol).
If anyone in the community is able to run these benchmarks, that would be appreciated — otherwise I'll wait for my next paycheck before running them myself ;)
How do I run the benchmark? I can do it if you want.
It would be much appreciated if you could bench the base GLM 4.7 Flash model against this one.
Yes, the lm_eval harness is the way I do it as well.
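For anyone who wants to pitch in, here's a rough sketch of an lm_eval invocation. The model id and task list below are placeholders, not what I actually ran — swap in this repo's id and whatever tasks you care about:

```shell
# Install the eval harness (an assumed baseline setup; adjust for your env)
pip install lm-eval

# Run a few common tasks against a HF-hosted model.
# <your-model-repo> is a placeholder for the actual repo id.
lm_eval --model hf \
  --model_args pretrained=<your-model-repo>,dtype=bfloat16 \
  --tasks gsm8k,hellaswag \
  --batch_size auto \
  --output_path results/
```

Run the same command twice — once against the base model and once against the distill — and diff the scores in `results/`.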
benchmarks are live
I know this is closed, but I think it's really cool that such a small dataset gave several of the benchmarks a nice little bump!
Yep! This is exactly what we're trying to show with our distills.
Many people see a small dataset and immediately dismiss the model or the distillation process, but it's important to remember that our goal isn't to distill massive amounts of knowledge from teacher models. Instead of that daunting task, we are only distilling the chain of thought (CoT). The performance shifts we're seeing across benchmarks show just how important a well-formulated CoT is when it comes to language models that are, at their core, predicting the next token.
So yes, it's very interesting to see which areas certain CoTs show improvements and regressions in. Instead of trying to compete with the large AI labs by making the "smartest" model, we are distilling the step-by-step process that the teacher model (in this case Claude Opus 4.5) takes each time it is presented with a new task. This approach effectively unlocks the model's existing pre-training, which is exactly why Google and OpenAI don't reveal their raw reasoning traces.
Well, this is why CoT was so transformative for distillation.