benchmarks
hello, nice work. Are there any benchmarks?
The model doesn't fit on my GPU, so to benchmark it I'd have to use a cloud provider for the compute (aka spend more money I don't have lol).
If anyone in the community is able to run these benchmarks, that would be appreciated — otherwise I'll wait for my next paycheck before running them myself ;)
How do I run the benchmark? I can do it if you want.
It would be much appreciated if you could bench the base GLM 4.7 Flash model against this one.
Yes, the lm_eval harness is the way I do it as well.
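For anyone who wants to pitch in, here's a rough sketch of an lm_eval invocation. The model id and task list below are placeholders, not what I actually ran — swap in this repo's id and whatever tasks you care about:

```shell
# Install the eval harness (an assumed baseline setup; adjust for your env)
pip install lm-eval

# Run a few common tasks against a HF-hosted model.
# <your-model-repo> is a placeholder for the actual repo id.
lm_eval --model hf \
  --model_args pretrained=<your-model-repo>,dtype=bfloat16 \
  --tasks gsm8k,hellaswag \
  --batch_size auto \
  --output_path results/
```

Run the same command twice — once against the base model and once against the distill — and diff the scores in `results/`.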
benchmarks are live
I know this is closed, but I think it's really cool that such a small dataset gave several of the benchmarks a nice little bump!
Yep! This is exactly what we're trying to show with our distills.
Many people see a small dataset and immediately dismiss the model or the distillation process, but it's important to remember that our goal isn't to distill massive amounts of knowledge from teacher models. Instead of that daunting task, we are only distilling the chain of thought (CoT). The performance shifts we're seeing across benchmarks show just how important a well-formulated CoT is when it comes to language models that are, at their core, predicting the next token.
So yes, it's very interesting to see which areas certain CoTs show improvements and regressions in. Instead of trying to compete with the large AI labs by making the "smartest" model, we are distilling the step-by-step process that the teacher model (in this case Claude Opus 4.5) takes each time it is presented with a new task. This approach effectively unlocks the model's existing pre-training, which is exactly why Google and OpenAI don't reveal their raw reasoning traces.
Well, this is why CoT was so transformative for distillation.