Today we are releasing Supra Mini v4 2M: the fourth version of our Supra Mini series and our biggest leap yet. Trained on 3 billion tokens of Fineweb-Edu for 2 epochs, v4 pushes our parameter count to 2.6M while keeping the model light enough to run on any CPU.

What changed from v3?

Look at the numbers: v4 has roughly 5.6× as many parameters as v3, going from 467k to 2.6M. This is not just a bigger model: the entire config was rethought to fit more capacity while keeping the architecture clean and the training fast.

// supra mini v4 2m: model config
Parameters      → 2,623,104 (~2.6M)
Architecture    → Llama
Vocab size      → 8,192 (custom BPE)
Hidden size     → 128
Intermediate    → 512
Layers          → 6
Attention heads → 4
Context length  → 1,024 tokens
Trained in      → bfloat16
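
The parameter count can be reproduced by hand from the config. The sketch below assumes a standard Llama layout that the config does not spell out: input and output embeddings are tied, there are no bias terms, there are as many key/value heads as attention heads, and each norm holds one weight vector of hidden size. The function name is ours, for illustration only. Under those assumptions the arithmetic lands exactly on the reported figure.

```python
# Back-of-envelope parameter count for the config above.
# Assumptions (not stated in the post): tied embeddings, no biases,
# key/value heads equal to attention heads, one weight vector per RMSNorm.
VOCAB, HIDDEN, INTERMEDIATE, LAYERS = 8192, 128, 512, 6

def llama_param_count(vocab, hidden, intermediate, layers, tied_embeddings=True):
    embed = vocab * hidden              # token embedding matrix
    attn = 4 * hidden * hidden          # q, k, v, o projections
    mlp = 3 * hidden * intermediate     # gate, up, down projections
    norms = 2 * hidden                  # two RMSNorms per layer
    per_layer = attn + mlp + norms
    final_norm = hidden                 # norm before the output head
    lm_head = 0 if tied_embeddings else vocab * hidden
    return embed + layers * per_layer + final_norm + lm_head

total = llama_param_count(VOCAB, HIDDEN, INTERMEDIATE, LAYERS)
print(f"{total:,}")  # 2,623,104
```

Under these assumptions the embedding table alone accounts for 1,048,576 parameters, about 40% of the model.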

Training setup

We trained v4 on a single NVIDIA RTX 5060 Ti 16GB in approximately 3 hours for 2 epochs. The dataset is the first 3 billion tokens of Sample-10BT from Fineweb-Edu, streamed and tokenized on the fly with our custom BPE tokenizer.
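
For a rough sense of scale, the figures above imply the throughput below. This is a back-of-envelope estimate only: "approximately 3 hours" is the only timing we give, so the result is approximate too.

```python
# Rough throughput implied by the training setup above.
TOKENS_PER_EPOCH = 3_000_000_000
EPOCHS = 2
WALL_CLOCK_HOURS = 3          # approximate, taken from the post
PARAMETERS = 2_623_104        # from the model config

tokens_seen = TOKENS_PER_EPOCH * EPOCHS
tokens_per_second = tokens_seen / (WALL_CLOCK_HOURS * 3600)
tokens_per_parameter = tokens_seen / PARAMETERS

print(f"tokens seen:          {tokens_seen:,}")
print(f"tokens per second:    {tokens_per_second:,.0f}")
print(f"tokens per parameter: {tokens_per_parameter:,.0f}")
```

That works out to roughly 556k tokens per second and about 2,300 training tokens per parameter.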

The final training loss after 2 epochs came in at 4.618. The full training code (tokenizer, training loop, and inference script) is available directly in the model repo.
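
To put the loss in context, it can be converted to a perplexity and compared with a uniform guess over the vocabulary. The sketch assumes the loss is the mean cross-entropy per token in nats, which is the usual convention for causal language model training but is not stated explicitly here.

```python
import math

FINAL_LOSS = 4.618   # reported final training loss
VOCAB_SIZE = 8192    # from the model config

# Assumption: the loss is mean cross-entropy per token, measured in nats.
train_perplexity = math.exp(FINAL_LOSS)

# Loss of a model that guesses uniformly over the vocabulary, for reference.
uniform_loss = math.log(VOCAB_SIZE)

print(f"training perplexity: {train_perplexity:.1f}")
print(f"uniform-guess loss:  {uniform_loss:.3f}")
```

Under that assumption the training perplexity is about 101, against a uniform-guess loss of about 9.01. The figure is per token under our custom tokenizer, so it is not directly comparable to perplexities measured with other tokenizers or on other data.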

Benchmarks

We evaluated v4 using lm-eval on three tasks. The random baselines are included so you can judge fairly.

Task            Score    Random baseline   Delta
ARC_Easy        0.3152   0.25 (25%)        +6.5pp above random
Wikitext (PPL)  3.1652   n/a               lower is better
BLiMP           0.607    0.50 (50%)        +10.7pp above random

A 2.6M parameter model beating random by over 10 points on BLiMP, a test of grammatical knowledge, is a solid result at this scale. Not GPT-4, obviously, but that is never the point with Supra Mini.
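
The deltas are plain differences between the score and the random baseline, expressed in percentage points. A minimal sketch of that arithmetic, with the numbers copied from the table (the helper name is ours):

```python
# Scores and random baselines copied from the benchmark table.
results = {
    "ARC_Easy": {"score": 0.3152, "random": 0.25},
    "BLiMP": {"score": 0.607, "random": 0.50},
}

def delta_pp(score, random_baseline):
    """Gap to the random baseline in percentage points, rounded to one decimal."""
    return round((score - random_baseline) * 100, 1)

deltas = {task: delta_pp(r["score"], r["random"]) for task, r in results.items()}
for task, d in deltas.items():
    print(f"{task}: +{d}pp above random")
```

Wikitext is left out because perplexity has no random-guess baseline of this kind.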

Example outputs

Here is what v4 generates at temperature=0.5, top_k=25, top_p=0.9:

// prompt
"Artificial intelligence is "
// output
"Artificial intelligence is the idea of the theory that the world has a very high-performance technology, which is also more important to society's lives than people who are being able to find their own knowledge and understanding how it can be used for future generations..."
// prompt
"The main concept of physics is "
// output
"The main concept of physics is 'animi-hisi', and therefore the universe's own light. In this case, a theory that is not only used to explain what it can be called 'the universe' or 'two planets, which are exactly about the earth's gravitational energy, but also in reality, we know how much things do..."

The model clearly has a coherent sense of topic: it stays on subject and builds sentences. It hallucinates and drifts (as all base models at this scale do), but the fluency is real.
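
For readers new to the sampling settings used above, here is a simplified sketch of what temperature, top_k, and top_p do to a next-token distribution. It is an illustration in plain Python, not the Transformers implementation, and the toy logits and function names are made up for the example.

```python
import math
import random

def sampling_distribution(logits, temperature=0.5, top_k=25, top_p=0.9):
    """Return {token_id: probability} after temperature, top-k, and top-p filtering."""
    # Temperature: divide logits before the softmax; values below 1 sharpen it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # stable softmax numerators

    # Top-k: keep only the k highest-weight tokens.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    kept_mass = sum(weights[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalized cumulative mass reaches top_p.
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += weights[i] / kept_mass
        if cumulative >= top_p:
            break

    total = sum(weights[i] for i in nucleus)
    return {i: weights[i] / total for i in nucleus}

def sample(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = sampling_distribution(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

toy_logits = [2.0, 1.0, 0.0, -1.0]   # made-up logits for a 4-token vocabulary
dist = sampling_distribution(toy_logits)
print(dist)
print(sample(toy_logits, rng=random.Random(0)))
```

With the toy logits, only the two most likely tokens survive the top_p cut, which is why low temperature plus a tight nucleus keeps the outputs on topic.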

How to run it

Drop this into any Python environment with Transformers and PyTorch installed (device_map="auto" also requires the accelerate package):

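from transformers import pipeline
import torch

# Load the model from the Hub as a text-generation pipeline.
pipe = pipeline(
    "text-generation",
    model="SupraLabs/Supra-Mini-v4-2M",
    device_map="auto",  # use a GPU if one is available, otherwise the CPU
    # Half precision on GPU, full precision on CPU.
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
)

# Same sampling settings as the example outputs, plus a repetition penalty.
result = pipe(
    "The importance of education is",
    max_new_tokens=150,
    do_sample=True,
    temperature=0.5,
    top_k=25,
    top_p=0.9,
    repetition_penalty=1.2,
)
# The pipeline returns a list of dicts; by default generated_text includes the prompt.
print(result[0]["generated_text"])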

What's next?

v4 is a base model; it is not fine-tuned for instruction following or chat. The next experiments on our roadmap include fine-tuning on instruction datasets, exploring quantization at this new scale, and continuing to push the parameter count while keeping training accessible to everyone with a consumer GPU.

The model is live on HuggingFace. Go try it.

// links
Model   → huggingface.co/SupraLabs/Supra-Mini-v4-2M
License → Apache 2.0
Series  → Supra Mini collection on HuggingFace
#release #supra-mini-v4 #tinyml #llama #open-source #fineweb-edu #edge-ai