---
tags:
- multimodal
- vision-language-model
- image-generation
- mla-attention
- multi-head-latent-attention
- flow-matching
- diffusion
- rectified-flow
- on-device
- efficient-attention
- smol-scale
- next-gpt
- show-o
- janusflow
- svd-initialization
- mha2mla-vlm
- x-ecomla
- research
- proof-of-concept
license: apache-2.0
language:
- en
---
# ⚠️ Research Proof of Concept – NOT Production Ready

**Drop-in toolkit for building unified any-to-any multimodal models at smol scale (245M–586M parameters).**

Tinman-SmolOmni-MLA demonstrates **Multi-Head Latent Attention (MLA)** from DeepSeek-V2 adapted for multimodal VLMs. The architecture is novel, validated, and functional. The **quality is not production-grade**: image generation produces near-random outputs, VQA was never trained, and ASR was attempted and failed.

See [evaluation_results.json](evaluation_results.json) for full validation data.
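For orientation, here is a minimal sketch of the MLA idea in PyTorch: keys and values pass through a small shared latent, so the KV cache only needs to hold that latent instead of full per-head K/V. All dimensions are illustrative assumptions, not the toolkit's real config, and the real DeepSeek-V2 layer additionally carries a decoupled RoPE path (hence `kv_a_proj_with_mqa` in the checkpoints), omitted here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLASketch(nn.Module):
    """Illustrative MLA block: K/V are reconstructed from a compressed latent.

    Dimensions are assumptions for the sketch; the toolkit's real configs differ.
    """

    def __init__(self, d_model=960, n_heads=15, kv_latent=256):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_a_proj = nn.Linear(d_model, kv_latent)      # down-projection: cache this
        self.kv_b_proj = nn.Linear(kv_latent, 2 * d_model)  # up-projection to K and V
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, D = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        latent = self.kv_a_proj(x)                          # (B, T, kv_latent) is all the cache holds
        k, v = self.kv_b_proj(latent).chunk(2, dim=-1)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, D))
```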
## What Actually Works

| Component | Status | Evidence |
|-----------|--------|----------|
| **Architecture** | ✅ Validated | GQA/MLA layer assignments verified from checkpoint weight keys |
| **Checkpoint loading** | ✅ Verified | `from_hub()` and `load_checkpoint()` both tested |
| **Text understanding** | ✅ Functional | Coherent next-token predictions |
| **Image generation pipeline** | ⚠️ Runs but poor quality | CLIP score 0.11 (random ~0.15, good ~0.30) |
| **VQA** | ❌ Not trained | 0% on ChartQA; no instruction tuning performed |
| **ASR (native)** | ❌ Failed | 144% WER. Use Moonshine instead. |
| **Speed / Memory** | ✅ Excellent | 15.9K tok/s, 1.2 GB VRAM, 40% KV-cache reduction |
## Production Readiness: What's Missing & Cost

### Image Generation ($500-$5,000)

- **Data**: LAION-5B or COYO-700M (50M+ image-text pairs)
- **Training**: 20,000-100,000 steps (vs. our 2,000)
- **Fix**: VAE latent normalization (latent std is 38× too large); see the sketch below
- **Add**: classifier-free guidance training
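A minimal sketch of the normalization fix named above, assuming a diffusers-style `AutoencoderKL`; the VAE checkpoint and calibration helper here are illustrative assumptions, not the toolkit's actual pipeline. The idea is to rescale latents to roughly unit variance before flow-matching training, rather than feeding the flow head latents whose std is ~38× too large.

```python
import torch
from diffusers import AutoencoderKL  # assumed VAE class; the toolkit's VAE may differ

# Hypothetical VAE checkpoint, for illustration only.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

@torch.no_grad()
def calibrate_scale(images: torch.Tensor) -> float:
    """Estimate the scaling factor as 1/std over a representative image batch."""
    latents = vae.encode(images).latent_dist.sample()
    return 1.0 / latents.std().item()

@torch.no_grad()
def encode_normalized(images: torch.Tensor, scale: float) -> torch.Tensor:
    """Encode to VAE latents rescaled to ~unit variance for flow training."""
    latents = vae.encode(images).latent_dist.sample()
    return latents * scale  # SD uses a fixed 0.18215; calibrating is safer here
```

Generated latents must then be divided by the same scale before decoding.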
### VQA ($5-$20 on an L4)

- **Data**: ChartQA train (28K) + VQAv2 (200K) + DocVQA
- **Training**: 2-5 epochs of LoRA instruction tuning (rank 8-16); a config sketch follows
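As a concrete sketch of that recipe with the `peft` library: the rank and alpha follow the estimate above, while `target_modules` is an assumption, since the toolkit mixes GQA projections (`q_proj`, `v_proj`) with MLA projections (`q_b_proj`, `kv_b_proj`).

```python
from peft import LoraConfig, get_peft_model

# Rank-8 LoRA per the estimate above; target_modules is an assumption and must
# match the toolkit's actual projection names on both GQA and MLA layers.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "v_proj", "q_b_proj", "kv_b_proj"],
)

# model = SmolOmni.from_hub(...)        # as in the Quick Start below
# model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()    # only the adapters train; base stays frozen
```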
### Audio ASR (recommended: Moonshine, $0)

- **Moonshine-tiny**: 27M params, 3.5% WER, 0.1s latency
- **No training needed**: already working in `moonshine_integration.py`
## 🚀 Quick Start

```bash
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```
```python
import torch
from smolomni import SmolOmni

# Load the 500M model (auto-downloads 1.1 GB)
model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-500M",
    checkpoint="stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.float16,
    strict=False,
)

# Audio → Moonshine ASR → VLM
from moonshine_integration import SmolOmniAudio

audio = SmolOmniAudio()
text = audio.transcribe("meeting.wav")
```
## 📦 Structure

| File | What it does |
|------|--------------|
| `smolomni/model.py` | `SmolOmni.from_hub()`, `load_checkpoint()`, `from_pretrained()` |
| `smolomni/config.py` | 256M/500M presets with **verified** layer assignments |
| `moonshine_integration.py` | Production-ready audio pipeline |
| `test_production_readiness.py` | 6-test validation suite (all passing) |
## 🏗️ Architecture

Verified from the 500M checkpoint's weight keys (a re-check sketch follows the list):

- **GQA layers**: 0-9, 30-31 (`q_proj`, `k_proj`, `v_proj`, `o_proj`)
- **MLA layers**: 10-29 (`q_a_proj`, `q_b_proj`, `kv_a_proj_with_mqa`, `kv_b_proj`)
- **SVD init**: 312 of 520 weights copied from the pretrained GQA model
- **Flow head**: 8-layer DiT with adaLN-Zero, 165M params
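The assignment above can be re-derived from the checkpoint itself. A sketch, assuming the usual `layers.{i}.` key pattern (adjust the regex and any nesting to the real state-dict layout):

```python
import re
from collections import defaultdict

import torch

# Group attention projection keys by layer index to recover the GQA/MLA split.
state_dict = torch.load("stage2_final/model.pt", map_location="cpu")
kinds = defaultdict(set)
for key in state_dict:
    m = re.search(r"layers\.(\d+)\.", key)
    if m is None:
        continue
    layer = int(m.group(1))
    if "kv_a_proj_with_mqa" in key or "q_a_proj" in key:
        kinds[layer].add("MLA")   # latent down-projections only exist on MLA layers
    elif "k_proj" in key or "v_proj" in key:
        kinds[layer].add("GQA")   # separate K/V projections only exist on GQA layers

for layer in sorted(kinds):
    print(f"layer {layer:2d}: {', '.join(sorted(kinds[layer]))}")
# Expected per this card: GQA on 0-9 and 30-31, MLA on 10-29.
```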
## 🔗 Related Models

- **256M (Research POC)**: [TinmanLabSL/SmolOmni-MLA-256M](https://huggingface.co/TinmanLabSL/SmolOmni-MLA-256M)
- **500M (Research POC)**: [TinmanLabSL/SmolOmni-MLA-500M](https://huggingface.co/TinmanLabSL/SmolOmni-MLA-500M)
- **Evaluation results**: [evaluation_results.json](evaluation_results.json)
## 📄 Citation

```bibtex
@software{smolomni2025,
  title  = {Tinman-SmolOmni-MLA Toolkit: Research POC for MLA in Multimodal VLM},
  author = {TinmanLabSL},
  year   = {2025},
  url    = {https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit},
  note   = {Research proof of concept. Not production ready.}
}
```
## License

Apache 2.0