> ⚠️ **Research Proof of Concept: NOT Production Ready**

Drop-in toolkit for building unified any-to-any multimodal models at smol scale (245M-586M parameters).
Tinman-SmolOmni-MLA demonstrates Multi-Head Latent Attention (MLA) from DeepSeek-V2 adapted for multimodal VLMs. The architecture is novel, validated, and functional. The quality is not production-grade: image generation produces near-random outputs, VQA was never trained, and ASR was attempted and failed.

See `evaluation_results.json` for full validation data.
## What Actually Works
| Component | Status | Evidence |
|---|---|---|
| Architecture | ✅ Validated | GQA/MLA layer assignments verified from checkpoint weight keys |
| Checkpoint loading | ✅ Verified | `from_hub()` and `load_checkpoint()` both tested |
| Text understanding | ✅ Functional | Coherent next-token predictions |
| Image generation pipeline | ⚠️ Runs but poor quality | CLIP score 0.11 (random ~0.15, good ~0.30) |
| VQA | ❌ Not trained | 0% on ChartQA; no instruction tuning performed |
| ASR (native) | ❌ Failed | 144% WER. Use Moonshine instead. |
| Speed / Memory | ✅ Excellent | 15.9K tok/s, 1.2GB VRAM, 40% KV reduction (see the sketch below) |
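The 40% KV-cache reduction follows from the hybrid layer split described under Architecture. A back-of-envelope sketch; every dimension below is an illustrative assumption, not the checkpoint's actual config:

```python
# Back-of-envelope KV-cache arithmetic (all dims below are illustrative
# assumptions, NOT the actual checkpoint config).
n_layers   = 32
gqa_layers = 12                     # layers 0-9 and 30-31
mla_layers = 20                     # layers 10-29
n_kv_heads, head_dim = 4, 64        # assumed GQA KV geometry
kv_lora_rank, rope_dim = 128, 32    # assumed MLA latent + decoupled RoPE dims

gqa_per_tok = 2 * n_kv_heads * head_dim   # full K and V per GQA layer
mla_per_tok = kv_lora_rank + rope_dim     # one compressed latent per MLA layer

baseline = n_layers * gqa_per_tok                                 # all-GQA
hybrid   = gqa_layers * gqa_per_tok + mla_layers * mla_per_tok    # GQA/MLA mix

print(f"per-token KV cache: {baseline} -> {hybrid} elements "
      f"({1 - hybrid / baseline:.0%} smaller)")
```

Under these toy dimensions the script prints a ~43% reduction, in the same ballpark as the measured 40%.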
## Production Readiness: What's Missing & Cost
### Image Generation ($500-$5,000)
- Data: LAION-5B or COYO-700M (50M+ image-text pairs)
- Training: 20,000-100,000 steps (vs our 2,000)
- Fix: VAE latent normalization (std mismatch: 38× too large; see the sketch after this list)
- Add: Classifier-free guidance training
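A minimal sketch of the latent-normalization fix, assuming a diffusers-style `AutoencoderKL`; the 38× figure comes from the measurement above, and the wrapper functions here are hypothetical, not the toolkit's API:

```python
# Sketch of the latent-normalization fix (assumes a diffusers-style
# AutoencoderKL; these wrapper functions are assumptions, not the
# toolkit's API).
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
LATENT_STD = 38.0  # measured std mismatch from the bullet above

@torch.no_grad()
def encode_normalized(images: torch.Tensor) -> torch.Tensor:
    """Encode to VAE latents, rescaled to ~unit std for the flow head."""
    latents = vae.encode(images).latent_dist.sample()
    return latents / LATENT_STD

@torch.no_grad()
def decode_denormalized(latents: torch.Tensor) -> torch.Tensor:
    """Undo the rescaling before decoding back to pixel space."""
    return vae.decode(latents * LATENT_STD).sample
```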
### VQA ($5-$20 on an L4 GPU)
- Data: ChartQA train (28K) + VQAv2 (200K) + DocVQA
- Training: 2-5 epochs of LoRA instruction tuning (rank 8-16; see the sketch below)
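A minimal LoRA setup for that pass using HF PEFT; the `target_modules` list is an assumption based on the projection names in the Architecture section, not a verified config:

```python
# Minimal LoRA setup for the VQA instruction-tuning pass (a sketch using
# HF PEFT; target_modules is assumed from the projection names listed in
# the Architecture section, not a verified config).
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,               # rank 8-16 per the estimate above
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # GQA layers
        "q_a_proj", "q_b_proj",                   # MLA query path
        "kv_a_proj_with_mqa", "kv_b_proj",        # MLA KV path
    ],
)
model = get_peft_model(model, lora_cfg)  # `model` as loaded in the Quick Start below
model.print_trainable_parameters()
```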
### Audio ASR (recommended: Moonshine, $0)
- Moonshine-tiny: 27M params, 3.5% WER, 0.1s latency
- No training needed; already working in `moonshine_integration.py`
## 🚀 Quick Start
```bash
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```
```python
import torch
from smolomni import SmolOmni

# Load 500M (auto-downloads 1.1GB)
model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-500M",
    checkpoint="stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.float16,
    strict=False,
)

# Audio → Moonshine ASR → VLM
from moonshine_integration import SmolOmniAudio

audio = SmolOmniAudio()
text = audio.transcribe("meeting.wav")
```
## 📦 Structure
| File | What it does |
|---|---|
| `smolomni/model.py` | `SmolOmni.from_hub()`, `load_checkpoint()`, `from_pretrained()` |
| `smolomni/config.py` | 256M/500M presets with verified layer assignments |
| `moonshine_integration.py` | Production-ready audio pipeline |
| `test_production_readiness.py` | 6-test validation suite (all passing) |
## 🏗️ Architecture
Verified from 500M checkpoint weight keys:
- GQA layers: 0-9, 30-31 (`q_proj`, `k_proj`, `v_proj`, `o_proj`)
- MLA layers: 10-29 (`q_a_proj`, `q_b_proj`, `kv_a_proj_with_mqa`, `kv_b_proj`)
- SVD init: 312 of 520 weights copied from pretrained GQA (see the sketch after this list)
- Flow head: 8-layer DiT, adaLN-Zero, 165M params
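A sketch of how such an SVD init can work: factor a pretrained full-rank projection into the low-rank pair an MLA layer uses. The rank and shapes are illustrative; the toolkit's exact recipe may differ:

```python
# SVD initialization sketch: split a pretrained full-rank projection W
# into a low-rank down/up pair (rank and shapes are illustrative
# assumptions, not the toolkit's exact recipe).
import torch

def svd_lowrank_init(w: torch.Tensor, rank: int):
    """Factor W (out x in) into B @ A with A: rank x in, B: out x rank."""
    u, s, vh = torch.linalg.svd(w.float(), full_matrices=False)
    a = s[:rank].sqrt().unsqueeze(1) * vh[:rank]   # down-projection (a_proj)
    b = u[:, :rank] * s[:rank].sqrt()              # up-projection (b_proj)
    return a, b

w_q = torch.randn(2048, 2048)                 # stand-in for a pretrained q_proj weight
q_a, q_b = svd_lowrank_init(w_q, rank=256)
print((q_b @ q_a - w_q).norm() / w_q.norm())  # relative reconstruction error
```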
## 🔗 Related Models
- 256M (Research POC): TinmanLabSL/SmolOmni-MLA-256M
- 500M (Research POC): TinmanLabSL/SmolOmni-MLA-500M
- Evaluation Results: evaluation_results.json
## 📄 Citation
```bibtex
@software{smolomni2025,
  title  = {Tinman-SmolOmni-MLA Toolkit: Research POC for MLA in Multimodal VLM},
  author = {TinmanLabSL},
  year   = {2025},
  url    = {https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit},
  note   = {Research proof of concept. Not production ready.}
}
```
## License
Apache 2.0