> ⚠️ **Research Proof of Concept: NOT Production Ready**

Drop-in toolkit for building unified any-to-any multimodal models at smol scale (245M-586M parameters).
Tinman-SmolOmni-MLA demonstrates Multi-Head Latent Attention (MLA) from DeepSeek-V2 adapted for multimodal VLMs. The architecture is novel, validated, and functional. The quality is not production-grade: image generation produces near-random outputs, VQA was never trained, and ASR was attempted and failed.

See `evaluation_results.json` for full validation data.
## What Actually Works
| Component | Status | Evidence |
|---|---|---|
| Architecture | ✅ Validated | GQA/MLA layer assignments verified from checkpoint weight keys |
| Checkpoint loading | ✅ Verified | `from_hub()` and `load_checkpoint()` both tested |
| Text understanding | ✅ Functional | Coherent next-token predictions |
| Image generation pipeline | ⚠️ Runs but poor quality | CLIP score 0.11 (random ~0.15, good ~0.30) |
| VQA | ❌ Not trained | 0% on ChartQA; no instruction tuning performed |
| ASR (native) | ❌ Failed | 144% WER. Use Moonshine instead. |
| Speed / Memory | ✅ Excellent | 15.9K tok/s, 1.2GB VRAM, 40% KV reduction (see the sketch below) |
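The 40% KV-cache reduction follows from the hybrid layer split described under Architecture. A back-of-envelope sketch; every dimension below is an illustrative assumption, not the checkpoint's actual config:

```python
# Back-of-envelope KV-cache arithmetic (all dims below are illustrative
# assumptions, NOT the actual checkpoint config).
n_layers   = 32
gqa_layers = 12                     # layers 0-9 and 30-31
mla_layers = 20                     # layers 10-29
n_kv_heads, head_dim = 4, 64        # assumed GQA KV geometry
kv_lora_rank, rope_dim = 128, 32    # assumed MLA latent + decoupled RoPE dims

gqa_per_tok = 2 * n_kv_heads * head_dim   # full K and V per GQA layer
mla_per_tok = kv_lora_rank + rope_dim     # one compressed latent per MLA layer

baseline = n_layers * gqa_per_tok                                 # all-GQA
hybrid   = gqa_layers * gqa_per_tok + mla_layers * mla_per_tok    # GQA/MLA mix

print(f"per-token KV cache: {baseline} -> {hybrid} elements "
      f"({1 - hybrid / baseline:.0%} smaller)")
```

Under these toy dimensions the script prints a ~43% reduction, in the same ballpark as the measured 40%.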
## Production Readiness: What's Missing & Cost
### Image Generation ($500-$5,000)
- Data: LAION-5B or COYO-700M (50M+ image-text pairs)
- Training: 20,000-100,000 steps (vs our 2,000)
- Fix: VAE latent normalization (std mismatch: 38× too large; see the sketch after this list)
- Add: Classifier-free guidance training
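A minimal sketch of the latent-normalization fix, assuming a diffusers-style `AutoencoderKL`; the 38× figure comes from the measurement above, and the wrapper functions here are hypothetical, not the toolkit's API:

```python
# Sketch of the latent-normalization fix (assumes a diffusers-style
# AutoencoderKL; these wrapper functions are assumptions, not the
# toolkit's API).
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
LATENT_STD = 38.0  # measured std mismatch from the bullet above

@torch.no_grad()
def encode_normalized(images: torch.Tensor) -> torch.Tensor:
    """Encode to VAE latents, rescaled to ~unit std for the flow head."""
    latents = vae.encode(images).latent_dist.sample()
    return latents / LATENT_STD

@torch.no_grad()
def decode_denormalized(latents: torch.Tensor) -> torch.Tensor:
    """Undo the rescaling before decoding back to pixel space."""
    return vae.decode(latents * LATENT_STD).sample
```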
### VQA ($5-$20 on an L4 GPU)
- Data: ChartQA train (28K) + VQAv2 (200K) + DocVQA
- Training: 2-5 epochs of LoRA instruction tuning (rank 8-16; see the sketch below)
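A minimal LoRA setup for that pass using HF PEFT; the `target_modules` list is an assumption based on the projection names in the Architecture section, not a verified config:

```python
# Minimal LoRA setup for the VQA instruction-tuning pass (a sketch using
# HF PEFT; target_modules is assumed from the projection names listed in
# the Architecture section, not a verified config).
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,               # rank 8-16 per the estimate above
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # GQA layers
        "q_a_proj", "q_b_proj",                   # MLA query path
        "kv_a_proj_with_mqa", "kv_b_proj",        # MLA KV path
    ],
)
model = get_peft_model(model, lora_cfg)  # `model` as loaded in the Quick Start below
model.print_trainable_parameters()
```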
### Audio ASR (recommended: Moonshine, $0)
- Moonshine-tiny: 27M params, 3.5% WER, 0.1s latency
- No training needed; already working in `moonshine_integration.py`
## 🚀 Quick Start
```bash
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```
```python
import torch
from smolomni import SmolOmni

# Load 500M (auto-downloads 1.1GB)
model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-500M",
    checkpoint="stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.float16,
    strict=False,
)

# Audio → Moonshine ASR → VLM
from moonshine_integration import SmolOmniAudio

audio = SmolOmniAudio()
text = audio.transcribe("meeting.wav")
```
## 📦 Structure
| File | What it does |
|---|---|
| `smolomni/model.py` | `SmolOmni.from_hub()`, `load_checkpoint()`, `from_pretrained()` |
| `smolomni/config.py` | 256M/500M presets with verified layer assignments |
| `moonshine_integration.py` | Production-ready audio pipeline |
| `test_production_readiness.py` | 6-test validation suite (all passing) |
## 🏗️ Architecture
Verified from 500M checkpoint weight keys:
- GQA layers: 0-9, 30-31 (`q_proj`, `k_proj`, `v_proj`, `o_proj`)
- MLA layers: 10-29 (`q_a_proj`, `q_b_proj`, `kv_a_proj_with_mqa`, `kv_b_proj`)
- SVD init: 312 of 520 weights copied from pretrained GQA (see the sketch after this list)
- Flow head: 8-layer DiT, adaLN-Zero, 165M params
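A sketch of how such an SVD init can work: factor a pretrained full-rank projection into the low-rank pair an MLA layer uses. The rank and shapes are illustrative; the toolkit's exact recipe may differ:

```python
# SVD initialization sketch: split a pretrained full-rank projection W
# into a low-rank down/up pair (rank and shapes are illustrative
# assumptions, not the toolkit's exact recipe).
import torch

def svd_lowrank_init(w: torch.Tensor, rank: int):
    """Factor W (out x in) into B @ A with A: rank x in, B: out x rank."""
    u, s, vh = torch.linalg.svd(w.float(), full_matrices=False)
    a = s[:rank].sqrt().unsqueeze(1) * vh[:rank]   # down-projection (a_proj)
    b = u[:, :rank] * s[:rank].sqrt()              # up-projection (b_proj)
    return a, b

w_q = torch.randn(2048, 2048)                 # stand-in for a pretrained q_proj weight
q_a, q_b = svd_lowrank_init(w_q, rank=256)
print((q_b @ q_a - w_q).norm() / w_q.norm())  # relative reconstruction error
```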
## 🔗 Related Models
- 256M (Research POC): TinmanLabSL/SmolOmni-MLA-256M
- 500M (Research POC): TinmanLabSL/SmolOmni-MLA-500M
- Evaluation Results: evaluation_results.json
## 📄 Citation
```bibtex
@software{smolomni2025,
  title  = {Tinman-SmolOmni-MLA Toolkit: Research POC for MLA in Multimodal VLM},
  author = {TinmanLabSL},
  year   = {2025},
  url    = {https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit},
  note   = {Research proof of concept. Not production ready.}
}
```
## License
Apache 2.0