---
tags:
- multimodal
- vision-language-model
- image-generation
- mla-attention
- multi-head-latent-attention
- flow-matching
- diffusion
- rectified-flow
- on-device
- efficient-attention
- smol-scale
- next-gpt
- show-o
- janusflow
- svd-initialization
- mha2mla-vlm
- x-ecomla
- research
- proof-of-concept
license: apache-2.0
language:
- en
---
# ⚠️ Research Proof of Concept – NOT Production Ready
**Drop-in toolkit for building unified any-to-any multimodal models at smol scale (245M–586M parameters).**
Tinman-SmolOmni-MLA demonstrates **Multi-Head Latent Attention (MLA)** from DeepSeek-V2 adapted for multimodal VLMs. The architecture is novel, validated, and functional. The **quality is not production-grade**: image generation produces near-random outputs, VQA was never trained, and ASR was attempted and failed.
See [evaluation_results.json](evaluation_results.json) for full validation data.
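For orientation, here is a minimal sketch of the MLA idea this toolkit builds on: K and V are reconstructed from a shared low-rank latent, so a KV cache only needs to store that latent. The dimensions and module names below are illustrative (and decoupled RoPE is omitted); this is not the toolkit's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMLA(nn.Module):
    """Minimal MLA-style attention: queries and KV pass through low-rank
    bottlenecks; a KV cache would only need to store `c_kv` per token."""

    def __init__(self, d_model=768, n_heads=12, q_latent=384, kv_latent=256):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.q_a_proj = nn.Linear(d_model, q_latent, bias=False)        # query down-projection
        self.q_b_proj = nn.Linear(q_latent, d_model, bias=False)        # query up-projection
        self.kv_a_proj = nn.Linear(d_model, kv_latent, bias=False)      # shared KV down-projection
        self.kv_b_proj = nn.Linear(kv_latent, 2 * d_model, bias=False)  # up-projection to K and V
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, D = x.shape
        q = self.q_b_proj(self.q_a_proj(x))
        c_kv = self.kv_a_proj(x)                      # compressed KV latent (what gets cached)
        k, v = self.kv_b_proj(c_kv).chunk(2, dim=-1)  # rebuild full K and V on the fly
        heads = lambda t: t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v), is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, D))

x = torch.randn(2, 16, 768)
print(SimplifiedMLA()(x).shape)  # torch.Size([2, 16, 768])
```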
## What Actually Works
| Component | Status | Evidence |
|-----------|--------|----------|
| **Architecture** | ✅ Validated | GQA/MLA layer assignments verified from checkpoint weight keys |
| **Checkpoint loading** | ✅ Verified | `from_hub()` and `load_checkpoint()` both tested |
| **Text understanding** | ✅ Functional | Coherent next-token predictions |
| **Image generation pipeline** | ⚠️ Runs but poor quality | CLIP score 0.11 (random ~0.15, good ~0.30) |
| **VQA** | ❌ Not trained | 0% on ChartQA; no instruction tuning performed |
| **ASR (native)** | ❌ Failed | 144% WER. Use Moonshine instead. |
| **Speed / Memory** | ✅ Excellent | 15.9K tok/s, 1.2GB VRAM, 40% KV reduction |
## Production Readiness: What's Missing & Cost
### Image Generation ($500-$5,000)
- **Data**: LAION-5B or COYO-700M (50M+ image-text pairs)
- **Training**: 20,000-100,000 steps (vs our 2,000)
- **Fix**: VAE latent normalization (std mismatch: 38× too large); see the sketch after this list
- **Add**: Classifier-free guidance training
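A minimal sketch of the latent-normalization fix, assuming a diffusers-style `AutoencoderKL` (the VAE checkpoint and the scale-estimation recipe are illustrative, not the toolkit's actual code):

```python
import torch
from diffusers import AutoencoderKL

# Illustrative VAE; the toolkit's actual VAE may differ.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def estimate_latent_scale(reference_images: torch.Tensor) -> float:
    """Estimate a scale so encoded latents have ~unit std, which is what
    flow-matching / diffusion targets typically assume."""
    latents = vae.encode(reference_images).latent_dist.sample()
    return float(1.0 / latents.std())

# During training:  z = scale * vae.encode(x).latent_dist.sample()
# Before decoding:  image = vae.decode(z / scale).sample
```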
### VQA ($5-$20 on an L4 GPU)
- **Data**: ChartQA train (28K) + VQAv2 (200K) + DocVQA
- **Training**: 2-5 epochs LoRA instruction tuning (rank 8-16)
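A minimal sketch of that LoRA setup using the `peft` library; the target module names are assumptions taken from the projection names in the Architecture section and have not been verified against this model:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,               # rank 8-16 as suggested above
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # GQA layers
        "q_a_proj", "q_b_proj",                   # MLA query projections
        "kv_a_proj_with_mqa", "kv_b_proj",        # MLA KV projections
    ],
)

# `model` is the SmolOmni instance loaded as in the Quick Start section below.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```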
### Audio ASR (recommended: Moonshine – $0)
- **Moonshine-tiny**: 27M params, 3.5% WER, 0.1s latency
- **No training needed** – already working in `moonshine_integration.py`
## 🚀 Quick Start
```bash
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```
```python
import torch
from smolomni import SmolOmni
# Load 500M (auto-downloads 1.1GB)
model = SmolOmni.from_hub(
"TinmanLabSL/SmolOmni-MLA-500M",
checkpoint="stage2_final/model.pt",
config="mla-hybrid-ar-flow-500M",
device="cuda",
dtype=torch.float16,
strict=False,
)
# Audio β†’ Moonshine ASR β†’ VLM
from moonshine_integration import SmolOmniAudio
audio = SmolOmniAudio()
text = audio.transcribe("meeting.wav")
```
## 📦 Structure
| File | What it does |
|------|-------------|
| `smolomni/model.py` | `SmolOmni.from_hub()`, `load_checkpoint()`, `from_pretrained()` |
| `smolomni/config.py` | 256M/500M presets with **verified** layer assignments |
| `moonshine_integration.py` | Production-ready audio pipeline |
| `test_production_readiness.py` | 6-test validation suite (all passing) |
## 🏗️ Architecture
Verified from 500M checkpoint weight keys:
- **GQA layers**: 0-9, 30-31 (`q_proj`, `k_proj`, `v_proj`, `o_proj`)
- **MLA layers**: 10-29 (`q_a_proj`, `q_b_proj`, `kv_a_proj_with_mqa`, `kv_b_proj`)
- **SVD init**: 312 of 520 weights copied from pretrained GQA (see the sketch after this list)
- **Flow head**: 8-layer DiT, adaLN-Zero, 165M params
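A minimal sketch of the SVD-initialization idea (in the spirit of MHA2MLA): factor pretrained K/V projection weights into the low-rank down/up pair that MLA expects. Names, shapes, and the rank below are illustrative, not the toolkit's exact procedure.

```python
import torch

def svd_init_kv(w_k: torch.Tensor, w_v: torch.Tensor, rank: int):
    """Factor stacked K/V projection weights (each (d_out, d_model)) into
    kv_a (rank, d_model) and kv_b (2*d_out, rank) so that
    kv_b @ kv_a approximates the original weights."""
    w = torch.cat([w_k, w_v], dim=0)                    # (2*d_out, d_model)
    U, S, Vh = torch.linalg.svd(w, full_matrices=False)
    sqrt_s = S[:rank].sqrt()
    kv_a = sqrt_s[:, None] * Vh[:rank]                  # down-projection weight
    kv_b = U[:, :rank] * sqrt_s[None, :]                # up-projection weight
    return kv_a, kv_b

# Example with made-up GQA dimensions:
w_k = torch.randn(256, 768)
w_v = torch.randn(256, 768)
kv_a, kv_b = svd_init_kv(w_k, w_v, rank=128)
print((kv_b @ kv_a).shape)  # torch.Size([512, 768])
```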
## 🔗 Related Models
- **256M (Research POC)**: [TinmanLabSL/SmolOmni-MLA-256M](https://huggingface.co/TinmanLabSL/SmolOmni-MLA-256M)
- **500M (Research POC)**: [TinmanLabSL/SmolOmni-MLA-500M](https://huggingface.co/TinmanLabSL/SmolOmni-MLA-500M)
- **Evaluation Results**: [evaluation_results.json](evaluation_results.json)
## 📝 Citation
```bibtex
@software{smolomni2025,
  title = {Tinman-SmolOmni-MLA Toolkit: Research POC for MLA in Multimodal VLM},
  author = {TinmanLabSL},
  year = {2025},
  url = {https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit},
  note = {Research proof of concept. Not production ready.}
}
```
## License
Apache 2.0