How did you achieve such remarkable metrics?

#39
by Regrin - opened

Hello!
Gemma is my favorite model family! I'm so grateful that these models keep being published. One thing's a shame: there's no 100B-parameter version. Unfortunately, a 31B model can't be an encyclopedia. Oh well!
Could you please tell me how the developers managed to achieve such excellent metrics? How did they manage to cram so much power into 32 gigabytes?
That's my question. But this question is just a pretext for starting a thread.

My real question is this:
I'm planning to train a 16-megabyte language model myself.

So, where can I find a dataset large enough for the model to learn grammar without overfitting? And one that lets it produce meaningful output?

It would be ideal if the dataset came with testing tasks.

Which dataset would be suitable for me? The model has a non-standard architecture. Not a transformer, and not even a neural network. A Tsetlin machine.

Curiosity got the better of me...

A 16-megabyte language model? What's the actual parameter count/design?
Because, to half-answer your question, there are many datasets that would be plenty for a 16-megabyte model.

I am only very passingly familiar with Tsetlin machines, though, so apologies if the framing doesn't make sense in context.

Look, in a Tsetlin machine, one parameter is essentially one bit, in compressed form.
I don't know how much the uncompressed version will weigh, though. The uncompressed version is the one that can be trained; the compressed version is essentially one big logical formula.
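To put numbers on the "one parameter is one bit" claim above, here is a back-of-envelope sketch. The 8-bits-per-automaton-state figure for the trainable form is an assumption for illustration, not something stated in this thread:

```python
# Parameter budget for a 16 MB model, assuming one bit per parameter
# in the compressed form (as claimed above).
MB = 1024 * 1024
compressed_bits = 16 * MB * 8  # total one-bit parameters in 16 MB

# The uncompressed (trainable) form stores one Tsetlin automaton state
# per literal; 8 bits per state is an illustrative assumption.
state_bits = 8
uncompressed_mb = compressed_bits * state_bits // 8 // MB

print(compressed_bits)   # 134217728 -> roughly 134M one-bit parameters
print(uncompressed_mb)   # 128 -> ~128 MiB in trainable form
```

So even a 16 MB compressed model implies a nine-figure literal count, which is worth keeping in mind when sizing the dataset.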
I'd be very interested if you could give me links to one or more datasets. I just don't know how to search.

A finite state machine, basically?

Anyway, in an attempt to give a helpful answer, there are things like:
https://huggingface.co/datasets/Crownelius/Opus-4.6-Reasoning-2100x-formatted (and many such similar ones)
https://huggingface.co/datasets/amd/InstructGpt-educational
Good ol' fineweb:
https://huggingface.co/datasets/HuggingFaceFW/fineweb
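Whichever of these datasets you pick, note that a Tsetlin machine consumes boolean features, so the raw text has to be binarized first. Here is a minimal sketch of one naive scheme (one-hot bits per character over a small alphabet); the function name and alphabet are illustrative, and a real pipeline would stream the text from one of the datasets above instead of a hardcoded string:

```python
# Naive character-level binarization for a Tsetlin-style model:
# each character becomes a one-hot bit vector over a fixed alphabet.
# Characters outside the alphabet map to an all-zero row.

def binarize(text, alphabet="abcdefghijklmnopqrstuvwxyz "):
    """Map each character of `text` to a one-hot bit row over `alphabet`."""
    rows = []
    for ch in text.lower():
        rows.append([1 if ch == a else 0 for a in alphabet])
    return rows

features = binarize("hi")
print(len(features))      # 2 rows, one per character
print(sum(features[0]))   # 1 bit set per in-alphabet character
```

More realistic setups use thresholded n-gram or bag-of-words booleanization, but the point is the same: the dataset choice and the binarization scheme together decide how many input literals the machine sees.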

This isn't a finite state machine.
I recommend you read the original article to get a rough idea.

https://arxiv.org/abs/1804.01508
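To see why this is not a finite state machine, here is a toy sketch of a single clause, following the paper linked above: each input bit contributes two literals (the bit and its negation), and a clause is the AND of whichever literals its automata have chosen to include. Learning is omitted; the function and parameter names are illustrative:

```python
# Toy evaluation of one Tsetlin machine clause (no learning shown).
# A clause is a conjunction of included literals; the include_* masks
# stand in for the include/exclude decisions of the Tsetlin automata.

def clause(x, include_pos, include_neg):
    """x: list of 0/1 inputs; include_*: which literals are in the clause."""
    for k, bit in enumerate(x):
        if include_pos[k] and bit == 0:
            return 0  # an included positive literal is false
        if include_neg[k] and bit == 1:
            return 0  # an included negated literal is false
    return 1  # every included literal is true (an empty clause is true)

# Clause encoding "x0 AND NOT x1":
print(clause([1, 0], include_pos=[1, 0], include_neg=[0, 1]))  # 1
print(clause([1, 1], include_pos=[1, 0], include_neg=[0, 1]))  # 0
```

The full machine sums many such clause outputs (with positive and negative polarity) and thresholds the sum, so the learned model really is one big logical formula rather than a state-transition graph.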

Oh, thank you, I'll read that over sometime and try to learn more about the topic.

Take a look at https://github.com/arman-bd/guppylm; maybe it fulfills your need for MB-sized model training.
