TinmanLabSL committed
Commit d83d26e · verified · Parent: e90fa6e

Update with correct usage instructions, verified load methods, and accurate architecture

Files changed (1): README.md (+83 -22)

README.md (updated):

# 🛠️ Tinman-SmolOmni-MLA Toolkit

**Drop-in toolkit for building unified any-to-any multimodal models at smol scale (245M–586M parameters).**

Tinman-SmolOmni-MLA is the first open-source unified multimodal model below 1B parameters that can both *understand* images and text (VQA, captioning, OCR) and *generate* images from text, all in a single model that fits on your phone (109MB quantized).

### 256M Variant

| Metric | SmolVLM-256M (Baseline) | **Tinman-SmolOmni-MLA-256M** | Improvement |
|--------|-------------------------|------------------------------|-------------|
| **KV Cache / token** | 11,520 floats | **7,040 floats** | **-38.9%** |
| **AR Throughput** | 2,100 tok/s | **17,140 tok/s** | **+716%** |

### 500M Variant

| Metric | SmolVLM-500M (Baseline) | **Tinman-SmolOmni-MLA-500M** | Improvement |
|--------|-------------------------|------------------------------|-------------|
| **KV Cache / token** | 20,480 floats | **12,160 floats** | **-40.6%** |
| **AR Throughput** | ~2,100 tok/s | **15,901 tok/s** | **+657%** |
| **Peak VRAM** | ~5,800 MB | **1,239 MB** | **-79%** |
| **Parameters** | 507.5M | **585.8M** | +15% (includes flow head) |
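
To put the KV-cache figures in perspective, a back-of-the-envelope memory estimate; the floats/token values come from the tables above, while the bf16 cache and 4,096-token context length are illustrative assumptions:

```python
# Rough per-sequence KV-cache memory from the floats/token figures above.
BYTES_PER_FLOAT = 2  # assuming a bf16/fp16 cache

def kv_cache_mb(floats_per_token: int, context_len: int = 4096) -> float:
    return floats_per_token * context_len * BYTES_PER_FLOAT / 2**20

print(kv_cache_mb(20_480))  # SmolVLM-500M baseline: 160.0 MB
print(kv_cache_mb(12_160))  # Tinman-SmolOmni-MLA-500M: 95.0 MB
```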
 
Install the toolkit:

```
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```

### Load a Pretrained Checkpoint

```python
import torch
from smolomni import SmolOmni
from transformers import AutoTokenizer

# Load the 500M checkpoint from the HuggingFace Hub (auto-downloads 1.1GB)
model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-500M",
    checkpoint="stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.bfloat16,
)

# Text understanding
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
with torch.no_grad():
    result = model.forward_understanding(input_ids=inputs["input_ids"])
next_token = result["logits"][0, -1, :].argmax().item()
print(tokenizer.decode([next_token]))  # "Paris"

# Image generation (returns VAE latents)
latents = model.generate_image(inputs["input_ids"], num_steps=50)
# Decode the latents with the SDXL VAE to get an actual image (see below)
```
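
`generate_image` stops at VAE latents. A minimal decoding sketch, assuming the latents use SDXL's `(B, 4, H/8, W/8)` layout and the standard scaling factor; the VAE repo and the scaling step are assumptions here, not part of the toolkit's documented API:

```python
import torch
from diffusers import AutoencoderKL
from torchvision.utils import save_image

# Assumption: standard SDXL VAE with its default latent scaling factor.
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae", torch_dtype=torch.float32).to("cuda")
with torch.no_grad():
    img = vae.decode(latents.float() / vae.config.scaling_factor).sample  # (B, 3, H, W) in [-1, 1]
save_image((img / 2 + 0.5).clamp(0, 1), "generated.png")
```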

### Load from Local Checkpoint

```python
# Download the checkpoint manually first:
#   huggingface-cli download TinmanLabSL/SmolOmni-MLA-500M stage2_final/model.pt

model = SmolOmni.load_checkpoint(
    "path/to/stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.bfloat16,
)
```

### Build Architecture from Scratch (Random Weights)

```python
from smolomni import SmolOmni

model = SmolOmni.from_pretrained("mla-hybrid-ar-flow-500M", device="cpu")
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
```

### Audio Understanding (via Moonshine)

```python
from moonshine_integration import SmolOmniAudio

audio_model = SmolOmniAudio()
text = audio_model.transcribe("podcast.mp3")

# Then feed the transcript to the 500M VLM
result = model.forward_understanding(
    input_ids=tokenizer(text, return_tensors="pt")["input_ids"].to("cuda")
)
```

Repository layout:

```
smolomni/
├── __init__.py    # Package exports (SmolOmni, SmolOmniModel, configs, get_model_config)
├── config.py      # Model configs: 256M / 500M presets with correct GQA/MLA layer assignments
├── attention.py   # MLA + GQA attention modules
├── model.py       # SmolOmniModel + SmolOmni factory (load_checkpoint, from_hub, from_pretrained)
├── model_500m.py  # Legacy 500M discrete-audio experiment (not used in production)
├── flow_head.py   # DiT flow-matching generation head with adaLN-Zero
├── svd_init.py    # X-EcoMLA Algorithm 1: MHA → MLA SVD conversion (see the sketch below)
└── audio.py       # Audio encoder (DistilHuBERT, frozen)

train.py           # Two-stage training (SVD init → joint AR + flow)
benchmark.py       # Automated benchmark suite (VRAM, throughput, KV cache)
```
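
`svd_init.py` converts pretrained attention into MLA form. A sketch of the core idea only, not the file's exact code: factor a pretrained KV projection with a truncated SVD and reuse the two factors as MLA's down- and up-projections. The function name, the sqrt-split of the singular values, and the 256M-like shapes are illustrative assumptions:

```python
import torch

def svd_init_mla(w_kv: torch.Tensor, r_kv: int):
    """Split a pretrained KV projection (d_model x d_kv) into rank-r_kv
    down/up factors: w_kv ~= w_down @ w_up (best rank-r_kv approximation)."""
    u, s, vh = torch.linalg.svd(w_kv, full_matrices=False)
    w_down = u[:, :r_kv] * s[:r_kv].sqrt()            # d_model -> r_kv latent
    w_up = s[:r_kv].sqrt().unsqueeze(1) * vh[:r_kv]   # r_kv -> d_kv reconstruction
    return w_down, w_up

# Example with 256M-like shapes: d_model=576, stacked K+V = 2 x 3 x 64 = 384 dims
w_down, w_up = svd_init_mla(torch.randn(576, 384), r_kv=128)
print(w_down.shape, w_up.shape)  # torch.Size([576, 128]) torch.Size([128, 384])
```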
 
Stage 2 training (`python train.py --stage 2 --model_variant 256M ...`):

- **Dataset**: The Cauldron (chartqa subset)
- **Method**: Joint loss `L = L_AR + L_flow` (sketched below)
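
A minimal sketch of that joint objective; `flow_head`, the conditioning argument, and the tensor shapes are illustrative stand-ins, not the toolkit's exact training code:

```python
import torch
import torch.nn.functional as F

def joint_loss(model, input_ids, labels, image_latents, cond):
    # L_AR: next-token cross-entropy on the text stream
    logits = model.forward_understanding(input_ids=input_ids)["logits"]
    l_ar = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           labels[:, 1:].reshape(-1))

    # L_flow: rectified flow matching -- regress the straight-line velocity
    x0 = torch.randn_like(image_latents)                     # noise endpoint
    t = torch.rand(image_latents.size(0), device=x0.device)  # t ~ U[0, 1]
    t4 = t.view(-1, 1, 1, 1)
    xt = (1 - t4) * x0 + t4 * image_latents                  # linear interpolant
    v_pred = model.flow_head(xt, t, cond)                    # DiT head prediction
    l_flow = F.mse_loss(v_pred, image_latents - x0)          # velocity target

    return l_ar + l_flow
```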

## 🏗️ Architecture Details

### Hybrid Multi-Head Latent Attention (MLA)

**MLA** from DeepSeek-V2, adapted for VLMs via MHA2MLA-VLM:

- **GQA layers (early + late)**: preserve pretrained vision knowledge from SmolVLM
- **MLA layers (middle)**: compress the KV cache to a latent of rank `r_kv` plus `d_rope` decoupled-RoPE dims (the per-layer accounting is verified in the sketch after the tables below)
- **NoPE every 4th layer**: drops positional encoding for representational diversity

### 500M Verified Layer Assignment (from checkpoint weights)

| Layer Range | Attention Type | KV Cache / layer |
|-------------|----------------|------------------|
| 0–9 | GQA (pretrained vision) | 2 × 5 × 64 = 640 floats |
| 10–29 | MLA (KV compression) | 192 + 32 = 224 floats |
| 30–31 | GQA (late) | 2 × 5 × 64 = 640 floats |
| **Total** | **Hybrid** | **12,160 floats** (vs 20,480 baseline = **-40.6%**) |

### 256M Layer Assignment

| Layer Range | Attention Type | KV Cache / layer |
|-------------|----------------|------------------|
| 0–9 | GQA (pretrained vision) | 2 × 3 × 64 = 384 floats |
| 10–29 | MLA (KV compression) | 128 + 32 = 160 floats |
| **Total** | **Hybrid** | **7,040 floats** (vs 11,520 baseline = **-38.9%**) |
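
The totals follow directly from the per-layer figures: GQA caches K and V vectors for each KV head, while MLA caches only the `r_kv`-dim latent plus `d_rope` RoPE dims. A quick check:

```python
# Recompute the per-token KV-cache totals from the layer assignments above.
def gqa_floats(n_kv_heads: int, head_dim: int = 64) -> int:
    return 2 * n_kv_heads * head_dim        # K and V vectors per KV head

def mla_floats(r_kv: int, d_rope: int = 32) -> int:
    return r_kv + d_rope                    # compressed latent + decoupled RoPE

total_500m = 12 * gqa_floats(5) + 20 * mla_floats(192)  # layers 0-9 & 30-31 GQA, 10-29 MLA
total_256m = 10 * gqa_floats(3) + 20 * mla_floats(128)  # layers 0-9 GQA, 10-29 MLA
print(total_500m, total_256m)  # 12160 7040
```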

## 🔗 Related Models

- 🚀 **256M Model**: [TinmanLabSL/SmolOmni-MLA-256M](https://huggingface.co/TinmanLabSL/SmolOmni-MLA-256M)