---
tags:
- multimodal
- vision-language-model
- image-generation
- mla-attention
- multi-head-latent-attention
- flow-matching
- diffusion
- rectified-flow
- on-device
- efficient-attention
- smol-scale
- next-gpt
- show-o
- janusflow
- svd-initialization
- mha2mla-vlm
- x-ecomla
- research
- proof-of-concept
license: apache-2.0
language:
- en
---
# ⚠️ Research Proof of Concept – NOT Production Ready

**Drop-in toolkit for building unified any-to-any multimodal models at smol scale (245M–586M parameters).**

Tinman-SmolOmni-MLA demonstrates **Multi-Head Latent Attention (MLA)** from DeepSeek-V2 adapted for multimodal VLMs. The architecture is novel, validated, and functional. The **quality is not production-grade**: image generation produces near-random outputs, VQA was never trained, and ASR was attempted and failed.

See [evaluation_results.json](evaluation_results.json) for full validation data.
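For orientation, here is a minimal sketch of the MLA idea in PyTorch: keys and values pass through a small shared latent, so the KV cache only needs to hold that latent instead of full per-head K/V. All dimensions are illustrative assumptions, not the toolkit's real config, and the real DeepSeek-V2 layer additionally carries a decoupled RoPE path (hence `kv_a_proj_with_mqa` in the checkpoints), omitted here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLASketch(nn.Module):
    """Illustrative MLA block: K/V are reconstructed from a compressed latent.

    Dimensions are assumptions for the sketch; the toolkit's real configs differ.
    """

    def __init__(self, d_model=960, n_heads=15, kv_latent=256):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_a_proj = nn.Linear(d_model, kv_latent)      # down-projection: cache this
        self.kv_b_proj = nn.Linear(kv_latent, 2 * d_model)  # up-projection to K and V
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, D = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        latent = self.kv_a_proj(x)                          # (B, T, kv_latent) is all the cache holds
        k, v = self.kv_b_proj(latent).chunk(2, dim=-1)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, D))
```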
## What Actually Works

| Component | Status | Evidence |
|-----------|--------|----------|
| **Architecture** | ✅ Validated | GQA/MLA layer assignments verified from checkpoint weight keys |
| **Checkpoint loading** | ✅ Verified | `from_hub()` and `load_checkpoint()` both tested |
| **Text understanding** | ✅ Functional | Coherent next-token predictions |
| **Image generation pipeline** | ⚠️ Runs but poor quality | CLIP score 0.11 (random ~0.15, good ~0.30) |
| **VQA** | ❌ Not trained | 0% on ChartQA; no instruction tuning performed |
| **ASR (native)** | ❌ Failed | 144% WER. Use Moonshine instead. |
| **Speed / Memory** | ✅ Excellent | 15.9K tok/s, 1.2 GB VRAM, 40% KV-cache reduction |
## Production Readiness: What's Missing & Cost

### Image Generation ($500-$5,000)

- **Data**: LAION-5B or COYO-700M (50M+ image-text pairs)
- **Training**: 20,000-100,000 steps (vs. our 2,000)
- **Fix**: VAE latent normalization (latent std is 38× too large); see the sketch below
- **Add**: classifier-free guidance training
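A minimal sketch of the normalization fix named above, assuming a diffusers-style `AutoencoderKL`; the VAE checkpoint and calibration helper here are illustrative assumptions, not the toolkit's actual pipeline. The idea is to rescale latents to roughly unit variance before flow-matching training, rather than feeding the flow head latents whose std is ~38× too large.

```python
import torch
from diffusers import AutoencoderKL  # assumed VAE class; the toolkit's VAE may differ

# Hypothetical VAE checkpoint, for illustration only.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

@torch.no_grad()
def calibrate_scale(images: torch.Tensor) -> float:
    """Estimate the scaling factor as 1/std over a representative image batch."""
    latents = vae.encode(images).latent_dist.sample()
    return 1.0 / latents.std().item()

@torch.no_grad()
def encode_normalized(images: torch.Tensor, scale: float) -> torch.Tensor:
    """Encode to VAE latents rescaled to ~unit variance for flow training."""
    latents = vae.encode(images).latent_dist.sample()
    return latents * scale  # SD uses a fixed 0.18215; calibrating is safer here
```

Generated latents must then be divided by the same scale before decoding.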
### VQA ($5-$20 on an L4)

- **Data**: ChartQA train (28K) + VQAv2 (200K) + DocVQA
- **Training**: 2-5 epochs of LoRA instruction tuning (rank 8-16); a config sketch follows
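As a concrete sketch of that recipe with the `peft` library: the rank and alpha follow the estimate above, while `target_modules` is an assumption, since the toolkit mixes GQA projections (`q_proj`, `v_proj`) with MLA projections (`q_b_proj`, `kv_b_proj`).

```python
from peft import LoraConfig, get_peft_model

# Rank-8 LoRA per the estimate above; target_modules is an assumption and must
# match the toolkit's actual projection names on both GQA and MLA layers.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "v_proj", "q_b_proj", "kv_b_proj"],
)

# model = SmolOmni.from_hub(...)        # as in the Quick Start below
# model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()    # only the adapters train; base stays frozen
```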
### Audio ASR (recommended: Moonshine, $0)

- **Moonshine-tiny**: 27M params, 3.5% WER, 0.1s latency
- **No training needed**: already working in `moonshine_integration.py`
## 🚀 Quick Start

```bash
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```
```python
import torch
from smolomni import SmolOmni

# Load the 500M model (auto-downloads 1.1 GB)
model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-500M",
    checkpoint="stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.float16,
    strict=False,
)

# Audio → Moonshine ASR → VLM
from moonshine_integration import SmolOmniAudio

audio = SmolOmniAudio()
text = audio.transcribe("meeting.wav")
```
## 📦 Structure

| File | What it does |
|------|--------------|
| `smolomni/model.py` | `SmolOmni.from_hub()`, `load_checkpoint()`, `from_pretrained()` |
| `smolomni/config.py` | 256M/500M presets with **verified** layer assignments |
| `moonshine_integration.py` | Production-ready audio pipeline |
| `test_production_readiness.py` | 6-test validation suite (all passing) |
## 🏗️ Architecture

Verified from the 500M checkpoint's weight keys (a re-check sketch follows the list):

- **GQA layers**: 0-9, 30-31 (`q_proj`, `k_proj`, `v_proj`, `o_proj`)
- **MLA layers**: 10-29 (`q_a_proj`, `q_b_proj`, `kv_a_proj_with_mqa`, `kv_b_proj`)
- **SVD init**: 312 of 520 weights copied from the pretrained GQA model
- **Flow head**: 8-layer DiT with adaLN-Zero, 165M params
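The assignment above can be re-derived from the checkpoint itself. A sketch, assuming the usual `layers.{i}.` key pattern (adjust the regex and any nesting to the real state-dict layout):

```python
import re
from collections import defaultdict

import torch

# Group attention projection keys by layer index to recover the GQA/MLA split.
state_dict = torch.load("stage2_final/model.pt", map_location="cpu")
kinds = defaultdict(set)
for key in state_dict:
    m = re.search(r"layers\.(\d+)\.", key)
    if m is None:
        continue
    layer = int(m.group(1))
    if "kv_a_proj_with_mqa" in key or "q_a_proj" in key:
        kinds[layer].add("MLA")   # latent down-projections only exist on MLA layers
    elif "k_proj" in key or "v_proj" in key:
        kinds[layer].add("GQA")   # separate K/V projections only exist on GQA layers

for layer in sorted(kinds):
    print(f"layer {layer:2d}: {', '.join(sorted(kinds[layer]))}")
# Expected per this card: GQA on 0-9 and 30-31, MLA on 10-29.
```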
## 🔗 Related Models

- **256M (Research POC)**: [TinmanLabSL/SmolOmni-MLA-256M](https://huggingface.co/TinmanLabSL/SmolOmni-MLA-256M)
- **500M (Research POC)**: [TinmanLabSL/SmolOmni-MLA-500M](https://huggingface.co/TinmanLabSL/SmolOmni-MLA-500M)
- **Evaluation results**: [evaluation_results.json](evaluation_results.json)
## 📄 Citation

```bibtex
@software{smolomni2025,
  title  = {Tinman-SmolOmni-MLA Toolkit: Research POC for MLA in Multimodal VLM},
  author = {TinmanLabSL},
  year   = {2025},
  url    = {https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit},
  note   = {Research proof of concept. Not production ready.}
}
```
## License

Apache 2.0