Update with correct usage instructions, verified load methods, and accurate architecture

# 🛠️ Tinman-SmolOmni-MLA Toolkit

**Drop-in toolkit for building unified any-to-any multimodal models at smol scale (245M–586M parameters).**

Tinman-SmolOmni-MLA is the first open-source unified multimodal model below 1B parameters that can both *understand* images and text (VQA, captioning, OCR) and *generate* images from text, all in a single model that fits on your phone (109MB quantized).

### 256M Variant

| Metric | SmolVLM-256M (Baseline) | **Tinman-SmolOmni-MLA-256M** | Improvement |
|--------|-------------------------|------------------------------|-------------|
| **KV Cache / token** | 11,520 floats | **7,040 floats** | **-38.9%** |
| **AR Throughput** | 2,100 tok/s | **17,140 tok/s** | **+716%** |

### 500M Variant

| Metric | SmolVLM-500M (Baseline) | **Tinman-SmolOmni-MLA-500M** | Improvement |
|--------|-------------------------|------------------------------|-------------|
| **KV Cache / token** | 20,480 floats | **12,160 floats** | **-40.6%** |
| **AR Throughput** | ~2,100 tok/s | **15,901 tok/s** | **+657%** |
| **Peak VRAM** | ~5,800 MB | **1,239 MB** | **-79%** |
| **Parameters** | 507.5M | **585.8M** | +15% (includes flow head) |

```
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```

### Load a Pretrained Checkpoint

```python
import torch
from smolomni import SmolOmni
from transformers import AutoTokenizer

# Load 500M checkpoint from HuggingFace Hub (auto-downloads 1.1GB)
model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-500M",
    checkpoint="stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.bfloat16,
)

# Text understanding
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
with torch.no_grad():
    result = model.forward_understanding(input_ids=inputs["input_ids"])
next_token = result["logits"][0, -1, :].argmax().item()
print(tokenizer.decode([next_token]))  # "Paris"

# Image generation (returns VAE latents)
latents = model.generate_image(inputs["input_ids"], num_steps=50)
# Decode the latents with an SDXL VAE to get an actual image (sketched below)
```
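
`generate_image` stops at latents rather than pixels. As a minimal sketch of the final decoding step, assuming the flow head targets the standard SDXL VAE latent space (the `stabilityai/sdxl-vae` checkpoint and its `scaling_factor` are assumptions here, not part of the toolkit):

```python
# Illustrative decode step -- assumes SDXL-VAE latents with the usual scaling factor
from diffusers import AutoencoderKL
from diffusers.image_processor import VaeImageProcessor

vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae").to("cuda")
with torch.no_grad():
    # Undo latent scaling, then decode back to pixel space
    image = vae.decode(latents.float() / vae.config.scaling_factor).sample
# Convert the (B, 3, H, W) tensor in [-1, 1] to a PIL image and save it
VaeImageProcessor().postprocess(image.cpu())[0].save("generated.png")
```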

### Load from Local Checkpoint

```python
# Download the checkpoint manually first:
# huggingface-cli download TinmanLabSL/SmolOmni-MLA-500M stage2_final/model.pt

model = SmolOmni.load_checkpoint(
    "path/to/stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.bfloat16,
)
```

### Build Architecture from Scratch (Random Weights)

```python
from smolomni import SmolOmni

# Instantiate the 500M architecture with randomly initialized weights
model = SmolOmni.from_pretrained("mla-hybrid-ar-flow-500M", device="cpu")
print(f"Parameters: {sum(p.numel() for p in model.parameters())/1e6:.1f}M")
```

### Audio Understanding (via Moonshine)

```python
from moonshine_integration import SmolOmniAudio

audio_model = SmolOmniAudio()
text = audio_model.transcribe("podcast.mp3")

# Then feed the transcript to the 500M VLM loaded above
result = model.forward_understanding(
    input_ids=tokenizer(text, return_tensors="pt")["input_ids"].to("cuda")
)
```

```
smolomni/
├── __init__.py     # Package exports (SmolOmni, SmolOmniModel, configs, get_model_config)
├── config.py       # Model configs: 256M / 500M presets with correct GQA/MLA layer assignments
├── attention.py    # MLA + GQA attention modules
├── model.py        # SmolOmniModel + SmolOmni factory (load_checkpoint, from_hub, from_pretrained)
├── model_500m.py   # Legacy 500M discrete-audio experiment (not used in production)
├── flow_head.py    # DiT flow-matching generation head with adaLN-Zero
├── svd_init.py     # X-EcoMLA Algorithm 1: MHA → MLA SVD conversion
└── audio.py        # Audio encoder (DistilHuBERT, frozen)

train.py       # Two-stage training (SVD init → joint AR + flow)
benchmark.py   # Automated benchmark suite (VRAM, throughput, KV cache)
```
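
Of these, `svd_init.py` deserves a sketch: X-EcoMLA-style initialization factorizes the pretrained key/value projections with a truncated SVD so the new MLA layers start close to the GQA teacher rather than from scratch. A schematic of the idea (the weight names, shapes, and rank value are illustrative, not the toolkit's exact Algorithm 1):

```python
import torch

def mla_init_from_kv(w_k: torch.Tensor, w_v: torch.Tensor, r_kv: int = 192):
    """Truncated-SVD init: stack [W_k; W_v] and factor it as W_up @ W_down."""
    w_kv = torch.cat([w_k, w_v], dim=0)               # (2*d_kv, d_model)
    u, s, vh = torch.linalg.svd(w_kv, full_matrices=False)
    w_down = vh[:r_kv]                                # (r_kv, d_model): x -> latent c_kv
    w_up = u[:, :r_kv] * s[:r_kv]                     # (2*d_kv, r_kv): c_kv -> K, V
    return w_down, w_up                               # best rank-r_kv approximation
```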

- **Dataset**: The Cauldron (chartqa subset)
- **Method**: Joint loss `L = L_AR + L_flow` (see the sketch below)
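
A minimal sketch of how such a joint objective is typically assembled; the tensor names and the rectified-flow velocity target are illustrative assumptions, not the exact code in `train.py`:

```python
import torch.nn.functional as F

def joint_loss(ar_logits, target_ids, pred_velocity, noise, clean_latents):
    # L_AR: next-token cross-entropy on the language-modeling head
    loss_ar = F.cross_entropy(
        ar_logits[:, :-1].reshape(-1, ar_logits.size(-1)),
        target_ids[:, 1:].reshape(-1),
    )
    # L_flow: flow matching; with x_t = (1 - t) * noise + t * clean,
    # the regression target is the constant velocity (clean - noise)
    loss_flow = F.mse_loss(pred_velocity, clean_latents - noise)
    return loss_ar + loss_flow
```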

## 🏗️ Architecture Details

### Hybrid Multi-Head Latent Attention (MLA)

**MLA** from DeepSeek-V2, adapted for VLMs via MHA2MLA-VLM:

- **GQA layers (early + late)**: Preserve pretrained vision knowledge from SmolVLM
- **MLA layers (middle)**: Compress the KV cache into a latent of rank `r_kv` plus a decoupled RoPE component of width `d_rope` (totals verified numerically after the tables below)
- **NoPE every 4th layer**: Remove position encoding for diversity

### 500M Verified Layer Assignment (from checkpoint weights)

| Layer Range | Attention Type | KV Cache / token / layer |
|-------------|----------------|--------------------------|
| 0–9 | GQA (pretrained vision) | 2 × 5 × 64 = 640 floats |
| 10–29 | MLA (KV compression) | 192 + 32 = 224 floats |
| 30–31 | GQA (late) | 2 × 5 × 64 = 640 floats |
| **Total** | **Hybrid** | **12,160 floats / token** (vs 20,480 baseline = **-40.6%**) |

### 256M Layer Assignment

| Layer Range | Attention Type | KV Cache / token / layer |
|-------------|----------------|--------------------------|
| 0–9 | GQA (pretrained vision) | 2 × 3 × 64 = 384 floats |
| 10–29 | MLA (KV compression) | 128 + 32 = 160 floats |
| **Total** | **Hybrid** | **7,040 floats / token** (vs 11,520 baseline = **-38.9%**) |
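
As a sanity check, both per-token totals follow directly from the layer assignments above (layer counts and per-layer sizes copied straight from the tables):

```python
def kv_per_token(layers):
    """layers: list of (num_layers, floats_per_layer) pairs."""
    return sum(n * floats for n, floats in layers)

print(kv_per_token([(10, 640), (20, 224), (2, 640)]))  # 12160 (500M; baseline 32*640 = 20480)
print(kv_per_token([(10, 384), (20, 160)]))            # 7040  (256M; baseline 30*384 = 11520)
```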

## 🔗 Related Models

- 📦 **256M Model**: [TinmanLabSL/SmolOmni-MLA-256M](https://huggingface.co/TinmanLabSL/SmolOmni-MLA-256M)