⚠️ Research Proof of Concept – NOT Production Ready

A 586M-parameter unified multimodal model with a validated Multi-Head Latent Attention (MLA) architecture. Image generation is currently low quality, and the checkpoint contains no native ASR; use Moonshine (included in the Toolkit) for audio understanding.

What Actually Works

| Component | Status | Evidence |
|---|---|---|
| Architecture | ✅ Validated | 585.8M params, verified GQA/MLA layer assignment |
| Checkpoint loading | ✅ Verified | `SmolOmni.from_hub()` downloads and loads |
| Text understanding | ✅ Functional | Coherent next-token predictions from prompts |
| Image generation pipeline | ⚠️ Runs, but quality near-random | CLIP score 0.11 (random ~0.15, good ~0.30–0.35) |
| VQA / Vision QA | ❌ Not trained | 0% on ChartQA (10 samples); no VQA tuning ever done |
| Audio ASR (native) | ❌ Does NOT exist in checkpoint | Stage 3 deleted (144% WER failure) |
| Audio ASR (Moonshine) | ✅ Working via Toolkit | 27M params, 3.5% WER, 0.1 s latency |
| Speed | ✅ Fast | 15,901 tok/s AR, 0.40 s / 50 flow steps |
| KV cache | ✅ Compressed | 12,160 floats/token (-40.6% vs baseline) |
| Memory | ✅ Efficient | 1,239 MB peak VRAM |

Production Readiness: What's Missing & Cost

🔴 Image Generation (Requires significant investment)

| Gap | What's needed | Estimated cost |
|---|---|---|
| Training data | LAION-5B or COYO-700M (50M+ image-text pairs) | $0 (download) |
| Training steps | 20,000–100,000 (vs. our 2,000 on a small subset) | $500–$5,000 on A100 |
| VAE latent normalization | Fix latent → VAE decoding scale mismatch | $0 (code fix, ~2 hours) |
| CFG + null conditioning | Train with 10% unconditional dropout | $50–$200 on A100 |
| Total for good images | | $550–$5,200 |
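
The latent-normalization row is the cheapest fix. A minimal sketch of that kind of fix, assuming the flow head was trained against a Stable-Diffusion-style `AutoencoderKL` from `diffusers` (the specific VAE repo and scaling factor are assumptions, not confirmed details of this checkpoint):

```python
import torch
from diffusers import AutoencoderKL

# Hypothetical VAE choice; the checkpoint's actual image decoder may differ.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float32)

def decode_latents(latents: torch.Tensor) -> torch.Tensor:
    """Undo the training-time latent scaling before decoding.

    If latents were scaled by vae.config.scaling_factor (0.18215 for
    SD-style VAEs) during training, decoding them raw produces the kind
    of washed-out, low-CLIP-score images described above.
    """
    latents = latents / vae.config.scaling_factor
    images = vae.decode(latents).sample            # (B, 3, H, W) in [-1, 1]
    return (images.clamp(-1, 1) + 1) / 2           # rescale to [0, 1]
```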

🟡 VQA / Vision Understanding (Low cost, straightforward)

| Gap | What's needed | Estimated cost |
|---|---|---|
| Instruction data | ChartQA train (28K) + VQAv2 (200K) + DocVQA + TextVQA | $0 (download) |
| Training | 2–5 epochs of LoRA instruction tuning (rank 8–16) | $5–$20 on L4 |
| Evaluation | Standard VQA benchmarks | $0 (CPU) |
| Total for working VQA | | $5–$20 |
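
The LoRA run in the table is standard `peft` usage. A sketch under assumptions: the target module names below are placeholders that must be swapped for the checkpoint's real projection names, and `model` is the object loaded in the Download and Run section.

```python
from peft import LoraConfig, get_peft_model

# Rank 8-16 as suggested above; target_modules are hypothetical and need to
# match the actual attention/MLP projection names in the state dict.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Assumes SmolOmni is compatible with peft's causal-LM wrapper.
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of the 585.8M params
```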

🟢 Audio ASR (Recommended: Moonshine – already working)

| Approach | WER | Effort | Cost |
|---|---|---|---|
| Moonshine + VLM (current) | ~3.5% | Zero; already implemented | $0 |
| Native ASR (Whisper encoder + projector) | ~2–5% | Requires training | $10–$50 |
| Native ASR via discrete tokens (our attempt) | 144% | ❌ Failed, not recommended | N/A |

Recommendation: Use Moonshine. It's better than Whisper-tiny, 4× faster, and requires zero training.
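
For reference, Moonshine can be driven through the standard `transformers` ASR pipeline; the Toolkit ships its own wrapper, so treat this as an equivalent sketch rather than the Toolkit API (the audio filename is a placeholder):

```python
from transformers import pipeline

# UsefulSensors/moonshine-tiny is the ~27M-parameter model referenced above.
asr = pipeline(
    "automatic-speech-recognition",
    model="UsefulSensors/moonshine-tiny",
    device="cuda",
)

result = asr("speech_sample.wav")  # placeholder path; 16 kHz mono audio works best
print(result["text"])
```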

🟢 Text Understanding (Already functional)

The model's text decoder was SVD-initialized from SmolVLM-500M and underwent Stage 1 KL distillation. It produces coherent text. For competitive chat quality, add instruction fine-tuning on UltraChat or OpenHermes (~$10 on L4).
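
A minimal data-prep sketch for that fine-tuning step, assuming UltraChat (`HuggingFaceH4/ultrachat_200k`) and a chat template on the SmolVLM tokenizer loaded below; both are assumptions about the setup, not part of this repo:

```python
from datasets import load_dataset

# UltraChat as suggested above; OpenHermes (teknium/OpenHermes-2.5) works the
# same way with a different column schema.
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

def to_text(example):
    # `tokenizer` is loaded in the Download and Run section; this assumes it
    # defines a chat template.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

sft_dataset = dataset.map(to_text, remove_columns=dataset.column_names)
```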

What This Checkpoint Actually Does

| Capability | Status | Quality | Notes |
|---|---|---|---|
| Text → Text | ✅ Functional | Coherent | SVD-initialized + KL distilled from SmolVLM |
| Image → Text (VQA) | ❌ Not trained | N/A | Pipeline runs, answers are random |
| Text → Image | ⚠️ Pipeline runs | Near-random (CLIP 0.11) | Architecture valid, needs 10×–50× more data/steps |
| Audio → Text (native) | ❌ Deleted | N/A | Stage 3 failed; use Moonshine instead |
| Text → Speech | ❌ Not trained | N/A | Not present in this checkpoint |

Architecture Verified From Checkpoint Weights

  • Base: SmolVLM-500M-Instruct
  • 32 layers (verified from 520 weight keys in the state dict; see the inspection sketch after this list)
  • GQA layers 0–9, 30–31 (verified: q_proj.weight keys present)
  • MLA layers 10–29 (verified: q_a_proj.weight + kv_a_proj.weight keys present)
  • NoPE layers: 0, 4, 8, 12, 16, 20, 24, 28
  • SVD init: 312 of 520 weights copied from pretrained GQA via X-EcoMLA
  • Flow head: 8-layer DiT, adaLN-Zero, 165M params
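
The GQA/MLA split can be re-derived from the checkpoint itself. A minimal sketch, assuming the weight keys follow a `layers.<idx>...q_proj` / `q_a_proj` naming scheme like the one implied above (the exact key prefixes and any nesting of the saved dict are assumptions):

```python
import re
import torch

# Path from the Files section; if the checkpoint nests its weights
# (e.g. under a "model" key), unwrap that dict first.
state_dict = torch.load("stage2_final/model.pt", map_location="cpu")

gqa_layers, mla_layers = set(), set()
for key in state_dict:
    # GQA layers expose q_proj.weight; MLA layers expose q_a_proj.weight.
    match = re.search(r"layers\.(\d+)\..*?(q_a_proj|q_proj)\.weight", key)
    if match:
        idx, kind = int(match.group(1)), match.group(2)
        (mla_layers if kind == "q_a_proj" else gqa_layers).add(idx)

print("GQA layers:", sorted(gqa_layers))  # expected: 0-9 and 30-31
print("MLA layers:", sorted(mla_layers))  # expected: 10-29
```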

📊 Benchmarks (Verified on NVIDIA L4)

| Metric | Baseline | Tinman-MLA-500M | Change |
|---|---|---|---|
| KV cache (floats/token) | 20,480 | 12,160 | -40.6% |
| AR throughput | 2,100 tok/s | 15,901 tok/s | +657% |
| Peak VRAM | 5,800 MB | 1,239 MB | -79% |
| Image gen (50 flow steps) | – | 0.40 s | New |
| Parameters | 507.5M | 585.8M | +15% (flow head) |
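
To put the per-token cache figures in context, the savings at a longer context are easy to work out; the 4,096-token length and bf16 cache dtype below are assumptions, while the per-token float counts come from the table above:

```python
BYTES_PER_VALUE = 2   # assuming a bf16/fp16 KV cache
CONTEXT_LEN = 4096    # hypothetical context length

for name, floats_per_token in [("baseline", 20_480), ("Tinman-MLA-500M", 12_160)]:
    mib = floats_per_token * BYTES_PER_VALUE * CONTEXT_LEN / 2**20
    print(f"{name}: {mib:.0f} MiB of KV cache at {CONTEXT_LEN} tokens")

# baseline: 160 MiB, Tinman-MLA-500M: 95 MiB -- the same -40.6% reduction as above
```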

📥 Download and Run

```bash
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```

```python
import torch
from smolomni import SmolOmni
from transformers import AutoTokenizer

# 1.1 GB, auto-downloads from Hub
model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-500M",
    checkpoint="stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.bfloat16,
    strict=False,
)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
```
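
A text-generation call would look roughly like the following; the HuggingFace-style `generate()` interface is an assumption about SmolOmni, so check the Toolkit's own inference helpers if it differs:

```python
prompt = "Question: What is Multi-Head Latent Attention? Answer:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Assumes a HF-style generate(); swap in the Toolkit's generation helper
# if SmolOmni does not expose this interface.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```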

πŸ“ Files

| File | Size | Purpose |
|---|---|---|
| `stage2_final/model.pt` | 1.1 GB | Primary checkpoint: text + image generation |
| `config.json` | 1.3 KB | Architecture metadata |
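
To fetch the files without going through `SmolOmni.from_hub()`, `huggingface_hub` can download them directly:

```python
from huggingface_hub import hf_hub_download

# Both calls cache locally and return the resolved file path.
checkpoint_path = hf_hub_download(
    repo_id="TinmanLabSL/SmolOmni-MLA-500M",
    filename="stage2_final/model.pt",  # 1.1 GB
)
config_path = hf_hub_download(
    repo_id="TinmanLabSL/SmolOmni-MLA-500M",
    filename="config.json",
)
```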

πŸ“ Citation

```bibtex
@software{smolomni500m2025,
  title = {Tinman-SmolOmni-MLA-500M: Research POC -- MLA Multimodal VLM},
  author = {TinmanLabSL},
  year = {2025},
  url = {https://huggingface.co/TinmanLabSL/SmolOmni-MLA-500M},
  note = {Research proof of concept. Not production ready. Image generation quality is poor.}
}
```

License

Apache 2.0
