Today we are releasing Supra Mini v4 2M: the fourth version of our Supra Mini series and our biggest leap yet. Trained on 3 billion tokens of Fineweb-Edu for 2 epochs, v4 pushes our parameter count to 2.6M while keeping the model light enough to run on any CPU.
What changed from v3?
Look at the numbers: v4 has roughly 5.6× the parameters of v3, going from 467k to 2.6M. This is not just a bigger model. The entire config was rethought to fit more capacity while keeping the architecture clean and the training fast.
Architecture → Llama
Vocab size → 8,192 (custom BPE)
Hidden size → 128
Intermediate → 512
Layers → 6
Attention heads → 4
Context length → 1,024 tokens
Trained in → bfloat16
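As a quick sanity check on the 2.6M figure, here is a minimal sketch that counts the weights of a Llama-style decoder from the config above. It assumes tied input/output embeddings, full multi-head attention (as many key/value heads as query heads) and no bias terms; none of these is stated in the config list, so treat them as assumptions. The function name `llama_param_count` is ours, for illustration only.

```python
def llama_param_count(vocab, hidden, intermediate, layers, tie_embeddings=True):
    """Count weights of a Llama-style decoder (assumes no biases, no grouped-query attention)."""
    embed = vocab * hidden              # token embedding matrix
    attn = 4 * hidden * hidden          # q, k, v and o projections
    mlp = 3 * hidden * intermediate     # gate, up and down projections
    norms = 2 * hidden                  # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    final_norm = hidden                 # RMSNorm before the output head
    # Assumption: the output head reuses the embedding matrix when tied.
    lm_head = 0 if tie_embeddings else vocab * hidden
    return embed + layers * per_layer + final_norm + lm_head


total = llama_param_count(vocab=8192, hidden=128, intermediate=512, layers=6)
untied = llama_param_count(vocab=8192, hidden=128, intermediate=512, layers=6,
                           tie_embeddings=False)
print(f"tied:   {total:,} parameters")   # 2,623,104
print(f"untied: {untied:,} parameters")  # 3,671,680
```

Under these assumptions the tied count lands on about 2.6M, with the 8,192 × 128 embedding matrix alone accounting for roughly 40% of it. The number of attention heads does not change the total, since the heads only split the same projection matrices.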
Training setup
We trained v4 on a single NVIDIA RTX 5060 Ti 16GB in approximately 3 hours for 2 epochs. The dataset is the first 3 billion tokens of Sample-10BT from Fineweb-Edu, streamed and tokenized on the fly with our custom BPE tokenizer.
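The "first N tokens, tokenized on the fly" idea can be sketched as a small generator that stops once a token budget is reached. This is an illustration, not our actual training loop: `take_token_budget` and the toy whitespace encoder are hypothetical stand-ins, and any iterable of strings and any text-to-ids function can be plugged in.

```python
from typing import Callable, Iterable, Iterator, List


def take_token_budget(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    budget: int,
) -> Iterator[List[int]]:
    """Yield token-id lists from `texts` until exactly `budget` tokens were produced."""
    if budget <= 0:
        return
    seen = 0
    for text in texts:
        # Tokenize lazily and cut the last document so the budget is not exceeded.
        ids = encode(text)[: budget - seen]
        seen += len(ids)
        if ids:
            yield ids
        if seen >= budget:
            return  # stop pulling from the stream once the budget is spent


def toy_encode(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one id per whitespace-separated word."""
    return [len(word) for word in text.split()]


docs = ["a b c", "d e", "f g h i", "never reached"]
chunks = list(take_token_budget(docs, toy_encode, budget=6))
print(chunks)  # [[1, 1, 1], [1, 1], [1]]
```

In a real run, `texts` could be the text field of a dataset opened in streaming mode and `encode` the BPE tokenizer, with `budget` set to 3 billion.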
The final training loss after 2 epochs came in at 4.618. The full training code (tokenizer, training loop, and inference script) is available directly in the model repo.
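To put the loss in context, here is a small worked conversion. It assumes the reported 4.618 is the mean cross-entropy per token in nats (the usual default in PyTorch training loops), which the post does not state explicitly.

```python
import math

final_loss = 4.618   # reported training loss; assumed to be nats per token
vocab_size = 8192    # size of the custom BPE vocabulary

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(final_loss)

# Loss of a model that guesses uniformly over the whole vocabulary.
uniform_loss = math.log(vocab_size)

print(f"token-level training perplexity: {perplexity:.1f}")   # about 101.3
print(f"uniform-guess loss:              {uniform_loss:.3f}")  # about 9.011
```

Under that assumption, the model narrows its next-token guess from 8,192 equally likely options to the equivalent of roughly 100. This is a training-set figure under our own tokenizer, so it is not directly comparable to the Wikitext number in the benchmark table, which lm-eval computes on a different corpus with its own normalization.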
Benchmarks
We evaluated v4 using lm-eval on three tasks. The random baselines are included so you can judge fairly.
| Task | Score | Random baseline | Delta |
|---|---|---|---|
| ARC_Easy | 0.3152 | 0.25 (25%) | +6.5pp above random |
| Wikitext (PPL) | 3.1652 | — | lower is better |
| BLiMP | 0.607 | 0.50 (50%) | +10.7pp above random |
A 2.6M parameter model beating random by over 10 points on BLiMP, a test of grammatical knowledge, is a solid result at this scale. Not GPT-4, obviously, but that is never the point with Supra Mini.
Example outputs
Here is what v4 generates at temperature=0.5, top_k=25, top_p=0.9:
The model clearly has a coherent sense of topic: it stays on subject and builds sentences. It hallucinates and drifts (as all base models at this scale do), but the fluency is real.
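For readers unfamiliar with the sampling settings quoted above, here is a simplified sketch of what temperature, top_k and top_p do to a next-token distribution. It is a plain-Python illustration of the general technique, not the Transformers implementation, and `filter_distribution` is a hypothetical helper name.

```python
import math


def filter_distribution(logits, temperature=0.5, top_k=25, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, then top-p filtering."""
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring token indices.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, p in zip(order, probs):
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the remaining probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {index: p / norm for index, p in kept}


# Toy example with five candidate tokens and a small top_k.
candidates = filter_distribution([2.0, 1.0, 0.5, 0.0, -1.0],
                                 temperature=0.5, top_k=3, top_p=0.9)
print(candidates)  # only tokens 0 and 1 survive the filters
```

The final token is then drawn at random from the surviving candidates, which is why a low temperature and a modest top_k keep a tiny model like this one closer to its most confident continuations.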
How to run it
Drop this into any Python environment with Transformers installed:
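import torch
from transformers import pipeline  # high-level text-generation API
# Note: device_map="auto" below relies on the accelerate package being installed.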
pipe = pipeline(
"text-generation",
model="SupraLabs/Supra-Mini-v4-2M",
device_map="auto",
torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)
result = pipe(
"The importance of education is",
max_new_tokens=150,
do_sample=True,
temperature=0.5,
top_k=25,
top_p=0.9,
repetition_penalty=1.2
)
print(result[0]['generated_text'])
What's next?
v4 is a base model: it is not fine-tuned for instruction following or chat. The next experiments on our roadmap include fine-tuning on instruction datasets, exploring quantization at this new scale, and continuing to push the parameter count while keeping training accessible to everyone with a consumer GPU.
The model is live on HuggingFace. Go try it.
License → Apache 2.0
Series → Supra Mini collection on HuggingFace
SupraLabs