# ⚠️ Research Proof of Concept – NOT Production Ready
This is a research artifact demonstrating Multi-Head Latent Attention (MLA) in a multimodal VLM below 1B parameters. The architecture is novel and validated. The quality is not production-grade.
## What Actually Works
| Component | Status | Evidence |
|-----------|--------|----------|
| Architecture | ✅ Validated | 245.7M params, MLA reduces KV cache by 38.9% |
| Text understanding | ✅ Functional | Forward pass produces coherent logits |
| Image generation pipeline | ⚠️ Runs, but quality poor | CLIP score 0.11 (random ~0.15, good ~0.30; see scoring sketch below) |
| VQA / Vision understanding | ❌ Not trained | No VQA instruction tuning performed |
| TTS mel generation | ⚠️ Architecture sound | MSE ~2.0, trained on only 8 samples |
| ASR (Audio → Text) | ❌ Broken | 100% CER; the model never learned ASR |
| Speed | ✅ Fast | 17,140 tok/s AR, 0.30 s / 50 flow steps |
| Memory | ✅ Efficient | 510 MB peak VRAM, 109 MB NF4 quantized |
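For context, the CLIP score above is the cosine similarity between CLIP embeddings of a generated image and its prompt. A minimal scoring sketch using Hugging Face `transformers` (the `openai/clip-vit-base-patch32` checkpoint is an assumption; the exact scorer behind the 0.11 figure is not documented here):

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed CLIP variant; any checkpoint gives comparable relative numbers.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    img = F.normalize(clip.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt = F.normalize(clip.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"]), dim=-1)
    return (img * txt).sum(-1).item()
```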
## Production Readiness: What's Missing & Cost

**To make image generation good:**
| Gap | What's needed | Estimated cost |
|-----|---------------|----------------|
| Training data | LAION-5B or COYO-700M (50M+ image–text pairs) | $0 (download) |
| Training steps | 20,000–100,000 (vs. our 2,000) | $500–$5,000 on A100 |
| Latent normalization | Fix VAE scaling in generation code (see sketch below) | $0 (code fix) |
| CFG + null conditioning | Add classifier-free guidance training (see sketch below) | $100–$500 |
| **Total for good images** | | **$600–$5,500** |
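The two code-level fixes above are small. A minimal sketch of both, assuming a diffusers-style `AutoencoderKL` (whose latents are conventionally multiplied by a `scaling_factor`, e.g. 0.18215 for the SD VAE) and a standard classifier-free guidance setup; all names here are illustrative, not the repository's actual API:

```python
import torch

VAE_SCALING_FACTOR = 0.18215  # SD-VAE convention; this model's factor is an assumption
COND_DROP_PROB = 0.1          # typical null-conditioning rate for CFG training

def encode_latents(vae, images):
    """Latent normalization: scale VAE latents before the flow model sees them."""
    return vae.encode(images).latent_dist.sample() * VAE_SCALING_FACTOR

def decode_latents(vae, latents):
    """Undo the scaling before decoding back to pixels."""
    return vae.decode(latents / VAE_SCALING_FACTOR).sample

def drop_condition(text_emb, null_emb):
    """CFG training: randomly swap the text condition for a learned null embedding."""
    mask = torch.rand(text_emb.shape[0], device=text_emb.device) < COND_DROP_PROB
    return torch.where(mask[:, None, None], null_emb, text_emb)

def cfg_velocity(model, x_t, t, text_emb, null_emb, scale=4.0):
    """CFG sampling: extrapolate from the unconditional toward the conditional prediction."""
    v_cond = model(x_t, t, text_emb)
    v_uncond = model(x_t, t, null_emb)
    return v_uncond + scale * (v_cond - v_uncond)
```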
**To make VQA work:**
| Gap | What's needed | Estimated cost |
|-----|---------------|----------------|
| Instruction data | ChartQA train (28K) + VQAv2 (200K) + DocVQA | $0 (download) |
| Training | 2–5 epochs of VQA instruction tuning (see sketch below) | $5–$20 on L4 |
| Evaluation | VQAv2, TextVQA, ChartQA benchmarks | $0 (CPU after training) |
| **Total for working VQA** | | **$5–$20** |
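VQA instruction tuning is mostly a data-formatting exercise: each (image, question, answer) triple becomes a chat-style sample where the loss applies only to the answer tokens. A minimal sketch, assuming SmolVLM-style prompt conventions (the exact template this checkpoint expects is an assumption):

```python
def format_vqa_sample(question: str, answer: str) -> dict:
    """Turn a VQA triple into an instruction-tuning sample.

    The <image> placeholder and User/Assistant framing follow SmolVLM-style
    templates; the exact tokens this checkpoint expects are an assumption.
    """
    return {
        "prompt": f"<image>\nUser: {question}\nAssistant: ",  # loss-masked (labels = -100)
        "completion": answer,                                  # supervised tokens
    }

# Example VQAv2-style record:
sample = format_vqa_sample("What color is the bus?", "yellow")
```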
**To make ASR work (native):**
| Gap | What's needed | Estimated cost |
|-----|---------------|----------------|
| Speech encoder | Whisper-small encoder (frozen, 87M) | $0 (download) |
| Projector | Linear/MLP adapter (~20M trainable; see sketch below) | $0 (code) |
| Data | LibriSpeech-960 (960h) + optionally GigaSpeech (10Kh) | $0 (download) |
| Training | 5–10 epochs, projector only, LLM frozen | $10–$50 on A100 |
| **Total for native ASR** | | **$10–$50** |
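In this recipe the projector is the only trainable piece: it maps frozen Whisper encoder states into the LLM embedding space so speech frames can be prepended to the text tokens. A minimal sketch, assuming Whisper-small's 768-dim encoder output and a 576-dim LLM hidden size (the latter inferred from the SmolVLM-256M backbone, so treat it as an assumption):

```python
import torch.nn as nn

WHISPER_DIM = 768  # Whisper-small encoder width
LLM_DIM = 576      # assumed SmolVLM-256M hidden size

class SpeechProjector(nn.Module):
    """MLP adapter from frozen Whisper features to LLM embedding space.

    Stacking k consecutive frames shortens the speech sequence, a common
    trick in Whisper-to-LLM adapters. With hidden=4096 this is ~15M params,
    in the ballpark of the ~20M budget above.
    """
    def __init__(self, stack: int = 4, hidden: int = 4096):
        super().__init__()
        self.stack = stack
        self.net = nn.Sequential(
            nn.Linear(WHISPER_DIM * stack, hidden),
            nn.GELU(),
            nn.Linear(hidden, LLM_DIM),
        )

    def forward(self, feats):  # feats: (B, T, 768) from the frozen encoder
        b, t, d = feats.shape
        t = t - t % self.stack                                   # drop ragged tail
        feats = feats[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.net(feats)                                   # (B, T/stack, 576)
```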
**Easier alternative (already available):** use the Moonshine ASR integration in the Toolkit (27M params, 3.5% WER, 0.1 s latency). No training needed.
## What This Checkpoint Actually Does
| Capability | Status | Quality | Notes |
|------------|--------|---------|-------|
| Text → Text | ✅ Functional | Coherent | SVD-initialized from SmolVLM |
| Image → Text understanding | ⚠️ Pipeline runs | Untested | Needs VQA instruction tuning |
| Text → Image | ⚠️ Pipeline runs | Poor (CLIP 0.11) | Architecture valid, needs scale |
| Text → Mel | ⚠️ Architecture sound | Undertrained | Only 8 samples in training |
| Audio → Text | ❌ Broken | N/A | Deleted; use Moonshine instead |
## 📁 Files
| File | Size | Purpose |
|------|------|---------|
| `stage4_v2/final/model.pt` | 938 MB | Full checkpoint (text + image gen + TTS mel) |
| `model_nf4.safetensors` | 109 MB | 4-bit quantized (recommended for download) |
| `*.onnx` | ~1.3 GB | ONNX exports for understanding + generation |
## 📥 Download
```bash
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```
```python
import torch
from smolomni import SmolOmni
from transformers import AutoTokenizer

model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-256M",
    checkpoint="stage4_v2/final/model.pt",
    config="mla-hybrid-ar-flow-256M",
    device="cuda",
    dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
```
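Once loaded, plain text generation should work as with any causal LM. A hedged usage sketch (the forward signature `model(input_ids) -> logits` is an assumption about the Toolkit API, not documented here; adapt to the real interface):

```python
import torch

@torch.no_grad()
def greedy_generate(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    """Hypothetical greedy decoding loop over an assumed (B, T, vocab) logits output."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
    for _ in range(max_new_tokens):
        logits = model(ids)                                # assumed forward signature
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(greedy_generate(model, tokenizer, "Multi-head latent attention works by"))
```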
## 🏗️ Architecture

- Base: SmolVLM-256M-Instruct
- GQA layers: 0–9 (preserves vision knowledge)
- MLA layers: 10–29 (KV cache compression; sketched below)
- NoPE: every 4th layer
- SVD init: 294 of 464 weights copied from pretrained GQA via X-EcoMLA
- Parameters: 245.7M total
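The core MLA idea, for readers new to it: instead of caching full per-head K/V tensors, each MLA layer caches one small shared latent per token and re-expands it to K and V at attention time. A minimal single-layer sketch (dimensions are illustrative assumptions; the decoupled-RoPE path used by DeepSeek-style MLA is omitted for brevity):

```python
import torch
import torch.nn as nn

class MLAKVPath(nn.Module):
    """Simplified MLA key/value path.

    Only c_kv (d_c floats per token) is cached, versus 2 * n_kv_heads * head_dim
    per token for GQA -- that gap is the entire KV-cache saving.
    """
    def __init__(self, d_model=576, d_c=160, n_heads=9, head_dim=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_c, bias=False)             # compress to latent
        self.up_k = nn.Linear(d_c, n_heads * head_dim, bias=False)  # expand to keys
        self.up_v = nn.Linear(d_c, n_heads * head_dim, bias=False)  # expand to values

    def forward(self, h, cache=None):
        c_kv = self.down(h)  # (B, T, d_c): the only tensor that enters the cache
        cache = c_kv if cache is None else torch.cat([cache, c_kv], dim=1)
        k, v = self.up_k(cache), self.up_v(cache)  # recomputed from latent each step
        return k, v, cache
```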
## 📊 Benchmarks (Verified on L4)
| Metric | Value |
|--------|-------|
| KV cache / token | 7,040 floats (-38.9%; accounting sketch below) |
| AR throughput | 17,140 tok/s |
| Peak VRAM | 510 MB |
| Image gen (50 steps) | 0.30 s |
| NF4 size | 109 MB |
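For intuition, the per-token cache figure is consistent with a simple back-of-envelope count (per-layer dimensions are assumptions based on the SmolLM2-135M backbone: 3 KV heads × 64 head dim on GQA layers, a 160-float latent on MLA layers):

```python
# Back-of-envelope KV-cache accounting; all dimensions are assumptions.
GQA_LAYERS, MLA_LAYERS = 10, 20
GQA_PER_LAYER = 2 * 3 * 64   # K + V, 3 KV heads, head_dim 64 -> 384 floats/token
MLA_PER_LAYER = 160          # one cached latent per token

hybrid = GQA_LAYERS * GQA_PER_LAYER + MLA_LAYERS * MLA_PER_LAYER  # 7,040
baseline = (GQA_LAYERS + MLA_LAYERS) * GQA_PER_LAYER              # 11,520
print(hybrid, f"{1 - hybrid / baseline:.1%}")                     # 7040, 38.9%
```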
## 🔗 Links

- Model: https://huggingface.co/TinmanLabSL/SmolOmni-MLA-256M
- Toolkit: https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
- Base model: https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct
## 📖 Citation
```bibtex
@software{smolomni256m2025,
  title  = {Tinman-SmolOmni-MLA-256M: Research POC – MLA in Multimodal VLM},
  author = {TinmanLabSL},
  year   = {2025},
  url    = {https://huggingface.co/TinmanLabSL/SmolOmni-MLA-256M},
  note   = {Research proof of concept. Not production ready.}
}
```
## License
Apache 2.0