# ⚠️ Research Proof of Concept – NOT Production Ready

A 586M-parameter unified multimodal model with a validated Multi-Head Latent Attention (MLA) architecture. Image generation quality is low and there is no native ASR; use Moonshine (included in the Toolkit) for audio understanding.
## What Actually Works
| Component | Status | Evidence |
|---|---|---|
| Architecture | ✅ Validated | 585.8M params, verified GQA/MLA layer assignment |
| Checkpoint loading | ✅ Verified | `SmolOmni.from_hub()` downloads and loads |
| Text understanding | ✅ Functional | Coherent next-token predictions from prompts |
| Image generation pipeline | ⚠️ Runs but quality near-random | CLIP score 0.11 (random ~0.15, good ~0.30–0.35) |
| VQA / Vision QA | ❌ Not trained | 0% on ChartQA (10 samples); no VQA tuning ever done |
| Audio ASR (native) | ❌ Does NOT exist in checkpoint | Stage 3 deleted (144% WER failure) |
| Audio ASR (Moonshine) | ✅ Working via Toolkit | 27M params, 3.5% WER, 0.1 s latency |
| Speed | ✅ Fast | 15,901 tok/s AR, 0.40 s / 50 flow steps |
| KV cache | ✅ Compressed | 12,160 floats/token (-40.6% vs baseline) |
| Memory | ✅ Efficient | 1,239 MB peak VRAM |
## Production Readiness: What's Missing & Cost
### 🔴 Image Generation (requires significant investment)

| Gap | What's needed | Estimated cost |
|---|---|---|
| Training data | LAION-5B or COYO-700M (50M+ image-text pairs) | $0 (download) |
| Training steps | 20,000–100,000 (vs our 2,000 on a small subset) | $500–$5,000 on A100 |
| VAE latent normalization | Fix latent → VAE decoding scale mismatch | $0 (code fix, ~2 hours) |
| CFG + null conditioning | Train with 10% unconditional dropout (see the sketch below) | $50–$200 on A100 |
| Total for good images | | $550–$5,200 |
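For the CFG row: classifier-free guidance requires the flow head to also see unconditional examples during training, which is typically done by replacing the text conditioning with a learned null embedding for a fraction of each batch. Below is a minimal PyTorch sketch of that dropout step; the tensor names (`text_cond`, `null_embed`) and shapes are illustrative assumptions, not the Toolkit's actual training code.

```python
import torch

def apply_cfg_dropout(text_cond: torch.Tensor,
                      null_embed: torch.Tensor,
                      drop_prob: float = 0.10) -> torch.Tensor:
    """Replace the conditioning of ~drop_prob of the batch with a null embedding.

    text_cond:  (B, T, D) text-conditioning sequence fed to the flow head
    null_embed: (T, D) learned unconditional embedding
    """
    batch = text_cond.shape[0]
    # Bernoulli mask: True means "drop the condition for this sample"
    drop = torch.rand(batch, device=text_cond.device) < drop_prob
    null = null_embed.expand_as(text_cond)
    return torch.where(drop.view(-1, 1, 1), null, text_cond)
```

At sampling time the same null embedding produces the unconditional branch, and the guided prediction is `uncond + scale * (cond - uncond)`.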
### 🟡 VQA / Vision Understanding (low cost, straightforward)

| Gap | What's needed | Estimated cost |
|---|---|---|
| Instruction data | ChartQA train (28K) + VQAv2 (200K) + DocVQA + TextVQA | $0 (download) |
| Training | 2–5 epochs of LoRA instruction tuning (rank 8–16); see the sketch below | $5–$20 on L4 |
| Evaluation | Standard VQA benchmarks | $0 (CPU) |
| Total for working VQA | | $5–$20 |
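For the training row: a minimal LoRA setup with the `peft` library might look like the sketch below. The target module names are an assumption based on the projection names listed in the architecture section of this card, and the data loading / training loop is omitted; check the Toolkit for the actual fine-tuning entry point.

```python
from peft import LoraConfig, get_peft_model

# `model` is the SmolOmni instance loaded as in the Download and Run section below.
lora_config = LoraConfig(
    r=16,                                   # rank 8-16 per the table above
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "kv_a_proj", "o_proj"],  # assumed module names
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# Then train 2-5 epochs on ChartQA / VQAv2 / DocVQA / TextVQA with a standard
# causal-LM loss applied to the answer tokens only.
```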
### 🟢 Audio ASR (recommended: Moonshine, already working)

| Approach | WER | Effort | Cost |
|---|---|---|---|
| Moonshine + VLM (current) | ~3.5% | Zero (already implemented) | $0 |
| Native ASR (Whisper encoder + projector) | ~2–5% | Requires training | $10–$50 |
| Native ASR via discrete tokens (our attempt) | 144% | ❌ Failed, not recommended | N/A |
**Recommendation:** Use Moonshine. It's better than Whisper-tiny, 4× faster, and requires zero training.
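One way to wire this up, assuming Moonshine is consumed through the Hugging Face `transformers` ASR pipeline (the Toolkit may expose its own wrapper; the model id and audio filename below are illustrative):

```python
from transformers import pipeline

# Speech -> text with Moonshine (~3.5% WER per the table above).
asr = pipeline("automatic-speech-recognition", model="UsefulSensors/moonshine-tiny")
transcript = asr("meeting_clip.wav")["text"]

# Audio understanding = Moonshine transcript fed to the text decoder.
prompt = f"Summarize the following transcript:\n{transcript}"
# Tokenize `prompt` and run it through the model exactly as in the
# text-only usage sketch in the Download and Run section below.
```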
### 🟢 Text Understanding (already functional)
The model's text decoder was SVD-initialized from SmolVLM-500M and underwent Stage 1 KL distillation. It produces coherent text. For competitive chat quality, add instruction fine-tuning on UltraChat or OpenHermes (~$10 on L4).
## What This Checkpoint Actually Does
| Capability | Status | Quality | Notes |
|---|---|---|---|
| Text → Text | ✅ Functional | Coherent | SVD-initialized + KL-distilled from SmolVLM |
| Image → Text (VQA) | ❌ Not trained | N/A | Pipeline runs, answers are random |
| Text → Image | ⚠️ Pipeline runs | Near-random (CLIP 0.11) | Architecture valid, needs 10×–50× more data/steps |
| Audio → Text (native) | ❌ Deleted | N/A | Stage 3 failed; use Moonshine instead |
| Text → Speech | ❌ Not trained | N/A | Not present in this checkpoint |
## Architecture Verified From Checkpoint Weights
- Base: SmolVLM-500M-Instruct
- 32 layers (verified from 520 weight keys in the state dict)
- GQA layers 0–9 and 30–31 (verified: `q_proj.weight` keys present)
- MLA layers 10–29 (verified: `q_a_proj.weight` + `kv_a_proj.weight` keys present); see the verification sketch below
- NoPE layers: 0, 4, 8, 12, 16, 20, 24, 28
- SVD init: 312 of 520 weights copied from the pretrained GQA model via X-EcoMLA
- Flow head: 8-layer DiT, adaLN-Zero, 165M params
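The GQA/MLA split can be reproduced from the checkpoint itself by scanning the state-dict keys for `q_proj` (GQA) versus `q_a_proj`/`kv_a_proj` (MLA). A sketch, assuming a flat state dict with `layers.<i>.` in the key names (the exact key layout is an assumption):

```python
import re
import torch

state = torch.load("stage2_final/model.pt", map_location="cpu")
state = state.get("model", state)  # unwrap if the state dict is nested (assumption)

gqa_layers, mla_layers = set(), set()
for key in state:
    m = re.search(r"layers\.(\d+)\.", key)   # assumed key layout
    if not m:
        continue
    idx = int(m.group(1))
    if "q_a_proj" in key or "kv_a_proj" in key:
        mla_layers.add(idx)                  # MLA: low-rank latent projections
    elif "q_proj" in key:
        gqa_layers.add(idx)                  # GQA: standard query projection

print("GQA layers:", sorted(gqa_layers))     # expected: 0-9 and 30-31
print("MLA layers:", sorted(mla_layers))     # expected: 10-29
```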
## 📊 Benchmarks (Verified on NVIDIA L4)
| Metric | Baseline | Tinman-MLA-500M | Change |
|---|---|---|---|
| KV cache / token (floats) | 20,480 | 12,160 | -40.6% (see the check below) |
| AR throughput | 2,100 tok/s | 15,901 tok/s | +657% |
| Peak VRAM | 5,800 MB | 1,239 MB | -79% |
| Image gen (50 steps) | N/A | 0.40 s | New |
| Parameters | 507.5M | 585.8M | +15% (flow head) |
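The KV-cache reduction follows directly from the per-token figures: 12,160 / 20,480 ≈ 0.594, i.e. a 40.6% reduction.

```python
baseline = 20_480   # floats per token, baseline KV cache
mla = 12_160        # floats per token with MLA compression

print(f"{1 - mla / baseline:.1%}")  # 40.6%
```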
## 📥 Download and Run
```bash
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```

```python
import torch
from smolomni import SmolOmni
from transformers import AutoTokenizer

model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-500M",
    checkpoint="stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.bfloat16,
    strict=False,
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
```
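A minimal text-only usage sketch follows. It assumes the `SmolOmni` instance can be called like a standard Hugging Face causal LM (returning an output with `.logits`); that interface is an assumption here, so prefer the Toolkit's own generation helpers if they differ.

```python
prompt = "Explain Multi-Head Latent Attention in one sentence."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Greedy decoding loop; assumes model(input_ids) returns logits of shape
# (batch, seq_len, vocab_size), like a standard causal LM.
model.eval()
with torch.no_grad():
    for _ in range(64):
        logits = model(input_ids).logits      # interface assumption
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```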
## 📁 Files
| File | Size | Purpose |
|---|---|---|
| `stage2_final/model.pt` | 1.1 GB | Production checkpoint: text + image generation |
| `config.json` | 1.3 KB | Architecture metadata |
## 🔗 Links
## 📖 Citation
```bibtex
@software{smolomni500m2025,
  title  = {Tinman-SmolOmni-MLA-500M: Research POC – MLA Multimodal VLM},
  author = {TinmanLabSL},
  year   = {2025},
  url    = {https://huggingface.co/TinmanLabSL/SmolOmni-MLA-500M},
  note   = {Research proof of concept. Not production ready. Image generation quality is poor.}
}
```
## License
Apache 2.0