# ⚠️ Research Proof of Concept – NOT Production Ready

This is a research artifact demonstrating Multi-Head Latent Attention (MLA) in a multimodal VLM under 1B parameters. The architecture is novel and validated; the output quality is not production-grade.

## What Actually Works

| Component | Status | Evidence |
|---|---|---|
| Architecture | ✅ Validated | 245.7M params; MLA reduces KV cache by 38.9% |
| Text understanding | ✅ Functional | Forward pass produces coherent logits |
| Image generation pipeline | ⚠️ Runs but quality poor | CLIP score 0.11 (random ~0.15, good ~0.30; scoring sketched below) |
| VQA / vision understanding | ❌ Not trained | No VQA instruction tuning performed |
| TTS mel generation | ⚠️ Architecture sound | MSE ~2.0, trained on only 8 samples |
| ASR (audio → text) | ❌ Broken | 100% CER; the model never learned ASR |
| Speed | ✅ Fast | 17,140 tok/s AR; 0.30 s / 50 flow steps |
| Memory | ✅ Efficient | 510 MB peak VRAM; 109 MB NF4-quantized |
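
For context on the CLIP score row above, text–image similarity can be checked with a standard CLIP model. Below is an illustrative sketch using `openai/clip-vit-base-patch32` from transformers; the scorer model and the image file name are assumptions, since the exact evaluation script is not documented on this card.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative CLIP-score check: cosine similarity between a generated image
# and its prompt. The scorer model is an assumption, not this repo's eval script.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated_sample.png")            # hypothetical generated image
prompt = "a red bicycle leaning against a brick wall"

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = clip(**inputs)

img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
print(f"CLIP score: {(img @ txt.T).item():.2f}")      # ~0.15 random, ~0.30 good
```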

## Production Readiness: What's Missing & Cost

### To make image generation good

| Gap | What's needed | Estimated cost |
|---|---|---|
| Training data | LAION-5B or COYO-700M (50M+ image–text pairs) | $0 (download) |
| Training steps | 20,000–100,000 (vs. our 2,000) | $500–$5,000 on A100 |
| Latent normalization | Fix VAE scaling in generation code (sketched below) | $0 (code fix) |
| CFG + null conditioning | Add classifier-free guidance training (sketched below) | $100–$500 |
| **Total for good images** | | **$600–$5,500** |
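
The two "$0 code fix" items above are mostly sampling-time plumbing. Here is a minimal sketch of what CFG plus latent un-scaling looks like in a generic latent flow sampler; `flow_model`, `vae`, the latent shape, and the 0.18215 scale factor are illustrative assumptions, not this repo's actual API. (CFG training additionally requires dropping the conditioning to a learned null embedding for a fraction of training steps.)

```python
import torch

GUIDANCE_SCALE = 4.0     # typical CFG strength; tune per model
VAE_SCALE = 0.18215      # SD-style latent scaling; the correct constant depends on the VAE

@torch.no_grad()
def sample_image(flow_model, vae, cond_emb, null_emb, steps=50):
    """Illustrative Euler sampler with classifier-free guidance (not this repo's API)."""
    z = torch.randn(1, 4, 32, 32, device=cond_emb.device)  # assumed latent shape
    for i in range(steps):
        t = torch.full((z.shape[0],), i / steps, device=z.device)
        v_cond = flow_model(z, t, cond_emb)   # velocity with text conditioning
        v_null = flow_model(z, t, null_emb)   # velocity with the learned null embedding
        v = v_null + GUIDANCE_SCALE * (v_cond - v_null)    # CFG blend
        z = z + v / steps                     # Euler step along the flow
    # Undo latent normalization before decoding: the "VAE scaling" fix above.
    return vae.decode(z / VAE_SCALE)
```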

### To make VQA work

| Gap | What's needed | Estimated cost |
|---|---|---|
| Instruction data | ChartQA train (28K) + VQAv2 (200K) + DocVQA | $0 (download) |
| Training | 2–5 epochs of VQA instruction tuning (data prep sketched below) | $5–$20 on L4 |
| Evaluation | VQAv2, TextVQA, ChartQA benchmarks | $0 (CPU after training) |
| **Total for working VQA** | | **$5–$20** |
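
As a concrete picture of the instruction-tuning step, here is a sketch of mapping VQA records into chat-style prompts; the dataset ID, field names, and chat markers are assumptions, not this repo's training format.

```python
from datasets import load_dataset

# Illustrative data prep for VQA instruction tuning. The dataset ID and field
# names ("question", "answer", "image") are assumptions; adapt to the real schema.
ds = load_dataset("HuggingFaceM4/VQAv2", split="train")

def format_vqa(example):
    prompt = (
        "<|user|>\n<image>\n"
        f"Question: {example['question']}\n"
        "Answer briefly.\n<|assistant|>\n"
    )
    return {"prompt": prompt, "target": example["answer"], "image": example["image"]}

train_ds = ds.map(format_vqa)
```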

### To make ASR work (native)

| Gap | What's needed | Estimated cost |
|---|---|---|
| Speech encoder | Whisper-small encoder (frozen, 87M) | $0 (download) |
| Projector | Linear/MLP adapter (~20M trainable; sketched below) | $0 (code) |
| Data | LibriSpeech-960 (960 h) + optionally GigaSpeech (10K h) | $0 (download) |
| Training | 5–10 epochs, projector only, LLM frozen | $10–$50 on A100 |
| **Total for native ASR** | | **$10–$50** |
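
A sketch of the frozen-encoder-plus-projector wiring described in this table, using the Whisper-small encoder from transformers. The 576-dim LLM hidden size is an assumption about this backbone, and this minimal MLP is smaller than the ~20M adapter budgeted above.

```python
import torch
import torch.nn as nn
from transformers import WhisperModel

LLM_HIDDEN = 576  # assumed hidden size of the text backbone

class AudioProjector(nn.Module):
    """Frozen Whisper-small encoder + small trainable MLP adapter (illustrative)."""
    def __init__(self):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder
        self.encoder.requires_grad_(False)      # keep the speech encoder frozen
        d_audio = self.encoder.config.d_model   # 768 for whisper-small
        self.proj = nn.Sequential(
            nn.Linear(d_audio, 2 * LLM_HIDDEN),
            nn.GELU(),
            nn.Linear(2 * LLM_HIDDEN, LLM_HIDDEN),
        )

    def forward(self, input_features):          # (B, 80, 3000) log-mel features
        with torch.no_grad():
            h = self.encoder(input_features).last_hidden_state  # (B, 1500, 768)
        return self.proj(h)                     # (B, 1500, LLM_HIDDEN) audio embeddings
```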

### Easier alternative (already available)

Use the Moonshine ASR integration in the Toolkit: 27M params, 3.5% WER, 0.1 s latency. No training needed.

## What This Checkpoint Actually Does

| Capability | Status | Quality | Notes |
|---|---|---|---|
| Text → Text | ✅ Functional | Coherent | SVD-initialized from SmolVLM |
| Image → Text understanding | ⚠️ Pipeline runs | Untested | Needs VQA instruction tuning |
| Text → Image | ⚠️ Pipeline runs | Poor (CLIP 0.11) | Architecture valid, needs scale |
| Text → Mel | ⚠️ Architecture sound | Undertrained | Only 8 samples in training |
| Audio → Text | ❌ Broken | N/A | Deleted; use Moonshine instead |

## 📁 Files

| File | Size | Purpose |
|---|---|---|
| `stage4_v2/final/model.pt` | 938 MB | Full checkpoint (text + image gen + TTS mel) |
| `model_nf4.safetensors` | 109 MB | 4-bit NF4 quantized (recommended download) |
| `*.onnx` | ~1.3 GB | ONNX exports for understanding + generation |

## 📥 Download

```bash
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```

```python
import torch
from smolomni import SmolOmni
from transformers import AutoTokenizer

model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-256M",
    checkpoint="stage4_v2/final/model.pt",
    config="mla-hybrid-ar-flow-256M",
    device="cuda",
    dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
```
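
A hypothetical smoke test of the text path, assuming the object returned by `from_hub` behaves like a standard `torch.nn.Module` that accepts `input_ids` and returns logits; that call signature is an assumption, not documented on this card.

```python
import torch

# Hypothetical smoke test (the exact forward signature is an assumption).
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    out = model(input_ids=inputs["input_ids"])

logits = out.logits if hasattr(out, "logits") else out
next_id = logits[0, -1].argmax().item()   # greedy next-token pick
print(tokenizer.decode([next_id]))
```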

## 🏗️ Architecture

- Base: SmolVLM-256M-Instruct
- GQA layers: 0–9 (preserve vision knowledge)
- MLA layers: 10–29 (KV cache compression)
- NoPE: every 4th layer
- SVD init: 294 of 464 weight tensors initialized from the pretrained GQA weights via X-EcoMLA
- Parameters: 245.7M total

## 📊 Benchmarks (Verified on L4)

| Metric | Value |
|---|---|
| KV cache / token | 7,040 floats (-38.9%; see sketch below) |
| AR throughput | 17,140 tok/s |
| Peak VRAM | 510 MB |
| Image gen (50 steps) | 0.30 s |
| NF4 size | 109 MB |
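
A back-of-the-envelope check of the KV-cache figure above. The backbone numbers (30 layers, 3 KV heads of head-dim 64 in the GQA layers) and the 160-float per-layer MLA latent are assumptions chosen to reproduce the published values, not specs stated on this card.

```python
# Back-of-the-envelope KV-cache accounting, in floats per token.
# ASSUMPTIONS: SmolLM2-style backbone with 30 layers, 3 KV heads of head_dim 64
# in the GQA layers, and a 160-float compressed latent per MLA layer.
N_LAYERS = 30
GQA_LAYERS = 10                      # layers 0-9 keep GQA
MLA_LAYERS = N_LAYERS - GQA_LAYERS   # layers 10-29 use MLA

KV_HEADS, HEAD_DIM = 3, 64
gqa_per_layer = 2 * KV_HEADS * HEAD_DIM   # K + V = 384 floats/token
mla_per_layer = 160                       # compressed KV latent (assumed)

baseline = N_LAYERS * gqa_per_layer                                # 11,520 floats/token
hybrid = GQA_LAYERS * gqa_per_layer + MLA_LAYERS * mla_per_layer   # 7,040 floats/token
print(hybrid, f"{1 - hybrid / baseline:.1%} reduction")            # 7040, 38.9% reduction
```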

## 🔗 Links

## 📝 Citation

```bibtex
@software{smolomni256m2025,
  title = {Tinman-SmolOmni-MLA-256M: Research POC – MLA in Multimodal VLM},
  author = {TinmanLabSL},
  year = {2025},
  url = {https://huggingface.co/TinmanLabSL/SmolOmni-MLA-256M},
  note = {Research proof of concept. Not production ready.}
}
```

## License

Apache 2.0
