---
tags:
- multimodal
- vision-language-model
- image-generation
- mla-attention
- multi-head-latent-attention
- flow-matching
- diffusion
- rectified-flow
- on-device
- efficient-attention
- smol-scale
- next-gpt
- show-o
- janusflow
- svd-initialization
- mha2mla-vlm
- x-ecomla
- research
- proof-of-concept
license: apache-2.0
language:
- en
---

# ⚠️ Research Proof of Concept — NOT Production Ready

**Drop-in toolkit for building unified any-to-any multimodal models at smol scale (245M–586M parameters).**

Tinman-SmolOmni-MLA demonstrates **Multi-Head Latent Attention (MLA)** from DeepSeek-V2 adapted for multimodal VLMs. The architecture is novel, validated, and functional. The **quality is not production-grade** — image generation produces near-random outputs, VQA was never trained, and ASR was attempted and failed. See [evaluation_results.json](evaluation_results.json) for full validation data.

## What Actually Works

| Component | Status | Evidence |
|-----------|--------|----------|
| **Architecture** | ✅ Validated | GQA/MLA layer assignments verified from checkpoint weight keys |
| **Checkpoint loading** | ✅ Verified | `from_hub()` and `load_checkpoint()` both tested |
| **Text understanding** | ✅ Functional | Coherent next-token predictions |
| **Image generation pipeline** | ⚠️ Runs but poor quality | CLIP score 0.11 (random ~0.15, good ~0.30) |
| **VQA** | ❌ Not trained | 0% on ChartQA — no instruction tuning performed |
| **ASR (native)** | ❌ Failed | 144% WER. Use Moonshine instead. |
| **Speed / Memory** | ✅ Excellent | 15.9K tok/s, 1.2GB VRAM, 40% KV reduction |

## Production Readiness: What's Missing & Cost

### Image Generation ($500-$5,000)
- **Data**: LAION-5B or COYO-700M (50M+ image-text pairs)
- **Training**: 20,000-100,000 steps (vs. our 2,000)
- **Fix**: VAE latent normalization (std mismatch: 38× too large), sketched after the Architecture section
- **Add**: Classifier-free guidance training

### VQA ($5-$20 on L4)
- **Data**: ChartQA train (28K) + VQAv2 (200K) + DocVQA
- **Training**: 2-5 epochs of LoRA instruction tuning (rank 8-16); a config sketch follows the Architecture section

### Audio ASR (recommended: Moonshine — $0)
- **Moonshine-tiny**: 27M params, 3.5% WER, 0.1s latency
- **No training needed** — already working in `moonshine_integration.py`

## 🚀 Quick Start

```bash
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```

```python
import torch
from smolomni import SmolOmni

# Load 500M (auto-downloads 1.1GB)
model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-500M",
    checkpoint="stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.float16,
    strict=False,
)

# Audio → Moonshine ASR → VLM
from moonshine_integration import SmolOmniAudio
audio = SmolOmniAudio()
text = audio.transcribe("meeting.wav")
```

## 📦 Structure

| File | What it does |
|------|--------------|
| `smolomni/model.py` | `SmolOmni.from_hub()`, `load_checkpoint()`, `from_pretrained()` |
| `smolomni/config.py` | 256M/500M presets with **verified** layer assignments |
| `moonshine_integration.py` | Production-ready audio pipeline |
| `test_production_readiness.py` | 6-test validation suite (all passing) |

## 🏗️ Architecture

Verified from 500M checkpoint weight keys:
- **GQA layers**: 0-9, 30-31 (`q_proj`, `k_proj`, `v_proj`, `o_proj`)
- **MLA layers**: 10-29 (`q_a_proj`, `q_b_proj`, `kv_a_proj_with_mqa`, `kv_b_proj`)
- **SVD init**: 312 of 520 weights copied from pretrained GQA (see the sketch below)
- **Flow head**: 8-layer DiT, adaLN-Zero, 165M params
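The SVD initialization above admits a compact sketch: factor each pretrained GQA key/value projection into the low-rank down/up pair that MLA layers expect (`kv_a_proj_with_mqa` / `kv_b_proj`, following DeepSeek-V2's naming). A minimal illustration assuming a truncated-SVD split; the function name and shapes are hypothetical, not the toolkit's actual API:

```python
# Hypothetical sketch: seed MLA low-rank KV projections from pretrained GQA
# weights via truncated SVD. Not the toolkit's real API.
import torch

def svd_init_kv(k_proj: torch.Tensor, v_proj: torch.Tensor, rank: int):
    """Factor the stacked [K; V] projection into a rank-`rank` down/up pair."""
    kv = torch.cat([k_proj, v_proj], dim=0)        # (2 * kv_dim, hidden)
    U, S, Vh = torch.linalg.svd(kv, full_matrices=False)
    s_sqrt = S[:rank].sqrt()
    kv_a = s_sqrt.unsqueeze(1) * Vh[:rank]         # down-proj: (rank, hidden)
    kv_b = U[:, :rank] * s_sqrt                    # up-proj:   (2 * kv_dim, rank)
    return kv_a, kv_b                              # kv_b @ kv_a ≈ kv
```

Splitting `sqrt(S)` across both factors keeps the two matrices at comparable scales, which tends to make the copied weights easier to fine-tune.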
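The flow head above is trained with a flow-matching (rectified-flow) objective: regress the constant velocity `x1 - x0` along a straight-line interpolant between noise and clean VAE latents. A minimal training-step sketch; `flow_head` stands in for the 8-layer DiT, and its conditioning interface is an assumption:

```python
# Minimal rectified-flow loss sketch. The real flow head's signature,
# timestep embedding, and any loss weighting are not shown.
import torch
import torch.nn.functional as F

def rectified_flow_loss(flow_head, x1: torch.Tensor, cond: torch.Tensor):
    """x1: clean (normalized) VAE latents; cond: AR hidden states / text."""
    x0 = torch.randn_like(x1)                     # noise endpoint
    t = torch.rand(x1.size(0), device=x1.device)  # uniform timesteps in [0, 1)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))      # broadcast over latent dims
    xt = (1 - t_) * x0 + t_ * x1                  # straight-line interpolant
    v_target = x1 - x0                            # constant velocity target
    return F.mse_loss(flow_head(xt, t, cond), v_target)
```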
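The 40% KV-cache reduction reported under "What Actually Works" follows from the hybrid layout: GQA layers cache full per-head K/V, while MLA layers cache only the compressed latent plus decoupled RoPE dimensions. A back-of-envelope helper; every dimension below is a placeholder, not this checkpoint's verified values:

```python
def kv_cache_per_token(n_gqa: int, n_mla: int, n_kv_heads: int, head_dim: int,
                       kv_lora_rank: int, rope_dim: int) -> tuple[int, int]:
    """Cached elements per token, split into GQA vs. MLA contributions."""
    gqa = n_gqa * 2 * n_kv_heads * head_dim        # K and V per KV head
    mla = n_mla * (kv_lora_rank + rope_dim)        # one latent + RoPE key
    return gqa, mla
```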
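The VAE latent-std fix listed under Image Generation is a rescale: measure latent statistics once offline and normalize toward unit variance before flow training, since a 38× scale mismatch means the `N(0, I)` noise endpoint never matches the data. A sketch assuming `vae` follows the diffusers `AutoencoderKL` interface:

```python
# Sketch of the VAE latent normalization fix; `mean`/`std` are measured
# offline over training latents, not hard-coded constants.
import torch

@torch.no_grad()
def measure_latent_stats(vae, images: torch.Tensor):
    z = vae.encode(images).latent_dist.sample()
    return z.mean(), z.std()

def normalize(z, mean, std):    # apply before flow-matching training
    return (z - mean) / std

def denormalize(z, mean, std):  # apply before vae.decode() at sampling time
    return z * std + mean
```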
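For the VQA pass costed above, the recipe is plain LoRA instruction tuning. A sketch using Hugging Face PEFT; the target-module list reuses the projection names from the Architecture section, and whether they match the toolkit's module paths is an assumption:

```python
# Hypothetical LoRA config for VQA instruction tuning (rank in the 8-16 range).
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",    # GQA layers 0-9, 30-31
        "q_a_proj", "q_b_proj",                    # MLA query path
        "kv_a_proj_with_mqa", "kv_b_proj",         # MLA KV path
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)  # then 2-5 epochs on ChartQA + VQAv2
```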
## 🔗 Related Models

- **256M (Research POC)**: [TinmanLabSL/SmolOmni-MLA-256M](https://huggingface.co/TinmanLabSL/SmolOmni-MLA-256M)
- **500M (Research POC)**: [TinmanLabSL/SmolOmni-MLA-500M](https://huggingface.co/TinmanLabSL/SmolOmni-MLA-500M)
- **Evaluation Results**: [evaluation_results.json](evaluation_results.json)

## 📝 Citation

```bibtex
@software{smolomni2025,
  title  = {Tinman-SmolOmni-MLA Toolkit: Research POC for MLA in Multimodal VLM},
  author = {TinmanLabSL},
  year   = {2025},
  url    = {https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit},
  note   = {Research proof of concept. Not production ready.}
}
```

## License

Apache 2.0