> ⚠️ **Research Proof of Concept: NOT Production Ready**

Drop-in toolkit for building unified any-to-any multimodal models at smol scale (245M–586M parameters).

Tinman-SmolOmni-MLA demonstrates Multi-Head Latent Attention (MLA) from DeepSeek-V2, adapted for multimodal VLMs. The architecture is novel, validated, and functional. The quality is not production-grade: image generation produces near-random outputs, VQA was never trained, and ASR was attempted and failed.
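For context, MLA caches one low-rank KV latent per token instead of full per-head keys and values, and up-projects at attention time. A minimal sketch of the idea (dimensions are hypothetical, not this checkpoint's; DeepSeek-V2's decoupled RoPE path is omitted):

```python
import torch
import torch.nn as nn

# Hypothetical dims for illustration only; not read from this checkpoint.
d_model, kv_latent, n_heads, d_head = 1024, 256, 16, 64

kv_a_proj = nn.Linear(d_model, kv_latent, bias=False)                # down-project to a shared latent
kv_b_proj = nn.Linear(kv_latent, 2 * n_heads * d_head, bias=False)  # up-project to per-head K and V

x = torch.randn(2, 128, d_model)          # (batch, seq, d_model)
c_kv = kv_a_proj(x)                       # cache THIS latent instead of full K/V
k, v = kv_b_proj(c_kv).chunk(2, dim=-1)   # reconstruct K and V at attention time
```

Caching `c_kv` (256 dims here) rather than full K and V (2 × 1024 dims) is where the KV-cache savings in the table below come from.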

See `evaluation_results.json` for full validation data.

## What Actually Works

| Component | Status | Evidence |
| --- | --- | --- |
| Architecture | ✅ Validated | GQA/MLA layer assignments verified from checkpoint weight keys |
| Checkpoint loading | ✅ Verified | `from_hub()` and `load_checkpoint()` both tested |
| Text understanding | ✅ Functional | Coherent next-token predictions |
| Image generation pipeline | ⚠️ Runs, but poor quality | CLIP score 0.11 (random ≈ 0.15, good ≈ 0.30) |
| VQA | ❌ Not trained | 0% on ChartQA; no instruction tuning performed |
| ASR (native) | ❌ Failed | 144% WER; use Moonshine instead |
| Speed / memory | ✅ Excellent | 15.9K tok/s, 1.2 GB VRAM, 40% KV-cache reduction |
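
For reference, a CLIP score like the one in the table is typically the cosine similarity between CLIP image and text embeddings. A hedged sketch using a generic `openai/clip-vit-base-patch32` checkpoint (the exact evaluation setup lives in `evaluation_results.json` and may differ):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image, prompt):
    # image: PIL.Image, prompt: str
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()  # ~0.15 for random pairs, ~0.30 for good ones
```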

## Production Readiness: What's Missing & Cost

### Image Generation ($500–$5,000)

- Data: LAION-5B or COYO-700M (50M+ image-text pairs)
- Training: 20,000–100,000 steps (vs. our 2,000)
- Fix: VAE latent normalization (std mismatch: 38× too large); see the sketch below
- Add: classifier-free guidance training
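
The latent-normalization fix is the standard latent-diffusion trick of scaling VAE latents to roughly unit standard deviation before flow training. A minimal sketch, assuming a diffusers-style `AutoencoderKL` (the 38× figure is from this repo's evaluation; the code below is illustrative, not the repo's actual training script):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")  # assumed VAE; swap in the real one

@torch.no_grad()
def encode_normalized(images, scale):
    # images: (B, 3, H, W) in [-1, 1]; pick `scale` so latents have ~unit std
    latents = vae.encode(images).latent_dist.sample()
    return latents * scale

# Calibrate once on a reference batch:
#   scale = 1.0 / vae.encode(batch).latent_dist.sample().std()
# A std 38x too large implies a scale on the order of 1/38 for this setup.
```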

### VQA ($5–$20 on an L4 GPU)

- Data: ChartQA train (28K) + VQAv2 (200K) + DocVQA
- Training: 2–5 epochs of LoRA instruction tuning (rank 8–16); see the sketch below
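
A hedged sketch of what that tuning setup could look like with the `peft` library, assuming the model exposes transformers-style attention modules (target modules and hyperparameters are illustrative, not this repo's recipe):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                      # rank 8-16 per the note above
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # GQA layers; MLA layers use other proj names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)  # `model` as loaded in the Quick Start below
model.print_trainable_parameters()
```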

### Audio ASR (recommended: Moonshine, $0)

- Moonshine-tiny: 27M params, 3.5% WER, 0.1 s latency
- No training needed; already working in `moonshine_integration.py`

## 🚀 Quick Start

```bash
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```

```python
import torch
from smolomni import SmolOmni

# Load the 500M model (auto-downloads 1.1GB)
model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-500M",
    checkpoint="stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.float16,
    strict=False,
)

# Audio -> Moonshine ASR -> VLM
from moonshine_integration import SmolOmniAudio

audio = SmolOmniAudio()
text = audio.transcribe("meeting.wav")
```

## 📦 Structure

| File | What it does |
| --- | --- |
| `smolomni/model.py` | `SmolOmni.from_hub()`, `load_checkpoint()`, `from_pretrained()` |
| `smolomni/config.py` | 256M/500M presets with verified layer assignments |
| `moonshine_integration.py` | Production-ready audio pipeline |
| `test_production_readiness.py` | 6-test validation suite (all passing) |

πŸ—οΈ Architecture

Verified from the 500M checkpoint's weight keys:

- GQA layers: 0–9, 30–31 (`q_proj`, `k_proj`, `v_proj`, `o_proj`)
- MLA layers: 10–29 (`q_a_proj`, `q_b_proj`, `kv_a_proj_with_mqa`, `kv_b_proj`)
- SVD init: 312 of 520 weights copied from pretrained GQA
- Flow head: 8-layer DiT with adaLN-Zero, 165M params
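
A hedged sketch of how such an assignment can be read off the checkpoint (the key pattern below is an assumption about the state-dict layout, not a documented API):

```python
import re
import torch

state = torch.load("model.pt", map_location="cpu")
# If the file wraps the state dict (e.g. under a "model" key), unwrap it first.

# Group attention projection names by layer index (key pattern is assumed).
layers = {}
for key in state:
    m = re.search(r"layers\.(\d+)\.self_attn\.([a-z_]+)\.weight", key)
    if m:
        layers.setdefault(int(m.group(1)), set()).add(m.group(2))

for idx in sorted(layers):
    kind = "MLA" if "kv_a_proj_with_mqa" in layers[idx] else "GQA"
    print(f"layer {idx:2d}: {kind}")
```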

## 🔗 Related Models

πŸ“ Citation

```bibtex
@software{smolomni2025,
  title = {Tinman-SmolOmni-MLA Toolkit: Research POC for MLA in Multimodal VLM},
  author = {TinmanLabSL},
  year = {2025},
  url = {https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit},
  note = {Research proof of concept. Not production ready.}
}
```

## License

Apache 2.0
