---
tags:
- multimodal
- vision-language-model
- image-generation
- mla-attention
- multi-head-latent-attention
- flow-matching
- diffusion
- rectified-flow
- on-device
- efficient-attention
- smol-scale
- next-gpt
- show-o
- janusflow
- svd-initialization
- mha2mla-vlm
- x-ecomla
- research
- proof-of-concept
license: apache-2.0
language:
- en
---

# ⚠️ Research Proof of Concept — NOT Production Ready

**Drop-in toolkit for building unified any-to-any multimodal models at smol scale (245M–586M parameters).**

Tinman-SmolOmni-MLA demonstrates **Multi-Head Latent Attention (MLA)** from DeepSeek-V2 adapted for multimodal VLMs. The architecture is novel, validated, and functional. The **quality is not production-grade** — image generation produces near-random outputs, VQA was never trained, and ASR was attempted and failed. See [evaluation_results.json](evaluation_results.json) for full validation data.

## What Actually Works

| Component | Status | Evidence |
|-----------|--------|----------|
| **Architecture** | ✅ Validated | GQA/MLA layer assignments verified from checkpoint weight keys |
| **Checkpoint loading** | ✅ Verified | `from_hub()` and `load_checkpoint()` both tested |
| **Text understanding** | ✅ Functional | Coherent next-token predictions |
| **Image generation pipeline** | ⚠️ Runs but poor quality | CLIP score 0.11 (random ~0.15, good ~0.30) |
| **VQA** | ❌ Not trained | 0% on ChartQA — no instruction tuning performed |
| **ASR (native)** | ❌ Failed | 144% WER. Use Moonshine instead. |
| **Speed / Memory** | ✅ Excellent | 15.9K tok/s, 1.2GB VRAM, 40% KV reduction |

## Production Readiness: What's Missing & Cost

### Image Generation ($500-$5,000)
- **Data**: LAION-5B or COYO-700M (50M+ image-text pairs)
- **Training**: 20,000-100,000 steps (vs. our 2,000)
- **Fix**: VAE latent normalization (std mismatch: 38× too large), sketched after the Architecture section
- **Add**: Classifier-free guidance training

### VQA ($5-$20 on L4)
- **Data**: ChartQA train (28K) + VQAv2 (200K) + DocVQA
- **Training**: 2-5 epochs of LoRA instruction tuning (rank 8-16); a config sketch follows the Architecture section

### Audio ASR (recommended: Moonshine — $0)
- **Moonshine-tiny**: 27M params, 3.5% WER, 0.1s latency
- **No training needed** — already working in `moonshine_integration.py`

## 🚀 Quick Start

```bash
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```

```python
import torch
from smolomni import SmolOmni

# Load 500M (auto-downloads 1.1GB)
model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-500M",
    checkpoint="stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.float16,
    strict=False,
)

# Audio → Moonshine ASR → VLM
from moonshine_integration import SmolOmniAudio
audio = SmolOmniAudio()
text = audio.transcribe("meeting.wav")
```

## 📦 Structure

| File | What it does |
|------|--------------|
| `smolomni/model.py` | `SmolOmni.from_hub()`, `load_checkpoint()`, `from_pretrained()` |
| `smolomni/config.py` | 256M/500M presets with **verified** layer assignments |
| `moonshine_integration.py` | Production-ready audio pipeline |
| `test_production_readiness.py` | 6-test validation suite (all passing) |

## 🏗️ Architecture

Verified from 500M checkpoint weight keys:
- **GQA layers**: 0-9, 30-31 (`q_proj`, `k_proj`, `v_proj`, `o_proj`)
- **MLA layers**: 10-29 (`q_a_proj`, `q_b_proj`, `kv_a_proj_with_mqa`, `kv_b_proj`)
- **SVD init**: 312 of 520 weights copied from pretrained GQA (see the sketch below)
- **Flow head**: 8-layer DiT, adaLN-Zero, 165M params
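The SVD initialization above admits a compact sketch: factor each pretrained GQA key/value projection into the low-rank down/up pair that MLA layers expect (`kv_a_proj_with_mqa` / `kv_b_proj`, following DeepSeek-V2's naming). A minimal illustration assuming a truncated-SVD split; the function name and shapes are hypothetical, not the toolkit's actual API:

```python
# Hypothetical sketch: seed MLA low-rank KV projections from pretrained GQA
# weights via truncated SVD. Not the toolkit's real API.
import torch

def svd_init_kv(k_proj: torch.Tensor, v_proj: torch.Tensor, rank: int):
    """Factor the stacked [K; V] projection into a rank-`rank` down/up pair."""
    kv = torch.cat([k_proj, v_proj], dim=0)        # (2 * kv_dim, hidden)
    U, S, Vh = torch.linalg.svd(kv, full_matrices=False)
    s_sqrt = S[:rank].sqrt()
    kv_a = s_sqrt.unsqueeze(1) * Vh[:rank]         # down-proj: (rank, hidden)
    kv_b = U[:, :rank] * s_sqrt                    # up-proj:   (2 * kv_dim, rank)
    return kv_a, kv_b                              # kv_b @ kv_a ≈ kv
```

Splitting `sqrt(S)` across both factors keeps the two matrices at comparable scales, which tends to make the copied weights easier to fine-tune.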
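The flow head above is trained with a flow-matching (rectified-flow) objective: regress the constant velocity `x1 - x0` along a straight-line interpolant between noise and clean VAE latents. A minimal training-step sketch; `flow_head` stands in for the 8-layer DiT, and its conditioning interface is an assumption:

```python
# Minimal rectified-flow loss sketch. The real flow head's signature,
# timestep embedding, and any loss weighting are not shown.
import torch
import torch.nn.functional as F

def rectified_flow_loss(flow_head, x1: torch.Tensor, cond: torch.Tensor):
    """x1: clean (normalized) VAE latents; cond: AR hidden states / text."""
    x0 = torch.randn_like(x1)                     # noise endpoint
    t = torch.rand(x1.size(0), device=x1.device)  # uniform timesteps in [0, 1)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))      # broadcast over latent dims
    xt = (1 - t_) * x0 + t_ * x1                  # straight-line interpolant
    v_target = x1 - x0                            # constant velocity target
    return F.mse_loss(flow_head(xt, t, cond), v_target)
```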
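The 40% KV-cache reduction reported under "What Actually Works" follows from the hybrid layout: GQA layers cache full per-head K/V, while MLA layers cache only the compressed latent plus decoupled RoPE dimensions. A back-of-envelope helper; every dimension below is a placeholder, not this checkpoint's verified values:

```python
def kv_cache_per_token(n_gqa: int, n_mla: int, n_kv_heads: int, head_dim: int,
                       kv_lora_rank: int, rope_dim: int) -> tuple[int, int]:
    """Cached elements per token, split into GQA vs. MLA contributions."""
    gqa = n_gqa * 2 * n_kv_heads * head_dim        # K and V per KV head
    mla = n_mla * (kv_lora_rank + rope_dim)        # one latent + RoPE key
    return gqa, mla
```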
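The VAE latent-std fix listed under Image Generation is a rescale: measure latent statistics once offline and normalize toward unit variance before flow training, since a 38× scale mismatch means the `N(0, I)` noise endpoint never matches the data. A sketch assuming `vae` follows the diffusers `AutoencoderKL` interface:

```python
# Sketch of the VAE latent normalization fix; `mean`/`std` are measured
# offline over training latents, not hard-coded constants.
import torch

@torch.no_grad()
def measure_latent_stats(vae, images: torch.Tensor):
    z = vae.encode(images).latent_dist.sample()
    return z.mean(), z.std()

def normalize(z, mean, std):    # apply before flow-matching training
    return (z - mean) / std

def denormalize(z, mean, std):  # apply before vae.decode() at sampling time
    return z * std + mean
```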
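For the VQA pass costed above, the recipe is plain LoRA instruction tuning. A sketch using Hugging Face PEFT; the target-module list reuses the projection names from the Architecture section, and whether they match the toolkit's module paths is an assumption:

```python
# Hypothetical LoRA config for VQA instruction tuning (rank in the 8-16 range).
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",    # GQA layers 0-9, 30-31
        "q_a_proj", "q_b_proj",                    # MLA query path
        "kv_a_proj_with_mqa", "kv_b_proj",         # MLA KV path
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)  # then 2-5 epochs on ChartQA + VQAv2
```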
## 🔗 Related Models

- **256M (Research POC)**: [TinmanLabSL/SmolOmni-MLA-256M](https://huggingface.co/TinmanLabSL/SmolOmni-MLA-256M)
- **500M (Research POC)**: [TinmanLabSL/SmolOmni-MLA-500M](https://huggingface.co/TinmanLabSL/SmolOmni-MLA-500M)
- **Evaluation Results**: [evaluation_results.json](evaluation_results.json)

## 📝 Citation

```bibtex
@software{smolomni2025,
  title  = {Tinman-SmolOmni-MLA Toolkit: Research POC for MLA in Multimodal VLM},
  author = {TinmanLabSL},
  year   = {2025},
  url    = {https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit},
  note   = {Research proof of concept. Not production ready.}
}
```

## License

Apache 2.0