Update with correct usage instructions, verified load methods, and accurate architecture

# 🛠️ Tinman-SmolOmni-MLA Toolkit

**Drop-in toolkit for building unified any-to-any multimodal models at smol scale (245M–586M parameters).**

Tinman-SmolOmni-MLA is the first open-source unified multimodal model below 1B parameters that can both *understand* images and text (VQA, captioning, OCR) and *generate* images from text, all in a single model that fits on your phone (109MB quantized).

### 256M Variant

| Metric | SmolVLM-256M (Baseline) | **Tinman-SmolOmni-MLA-256M** | Improvement |
|--------|-------------------------|------------------------------|-------------|
| **KV Cache / token** | 11,520 floats | **7,040 floats** | **-38.9%** |
| **AR Throughput** | 2,100 tok/s | **17,140 tok/s** | **+716%** |

### 500M Variant

| Metric | SmolVLM-500M (Baseline) | **Tinman-SmolOmni-MLA-500M** | Improvement |
|--------|-------------------------|------------------------------|-------------|
| **KV Cache / token** | 20,480 floats | **12,160 floats** | **-40.6%** |
| **AR Throughput** | ~2,100 tok/s | **15,901 tok/s** | **+657%** |
| **Peak VRAM** | ~5,800 MB | **1,239 MB** | **-79%** |
| **Parameters** | 507.5M | **585.8M** | +15% (includes flow head) |

```
pip install git+https://huggingface.co/TinmanLabSL/SmolOmni-MLA-Toolkit
```

### Load a Pretrained Checkpoint

```python
import torch
from smolomni import SmolOmni
from transformers import AutoTokenizer

# Load 500M checkpoint from HuggingFace Hub (auto-downloads 1.1GB)
model = SmolOmni.from_hub(
    "TinmanLabSL/SmolOmni-MLA-500M",
    checkpoint="stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.bfloat16,
)

# Text understanding
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
with torch.no_grad():
    result = model.forward_understanding(input_ids=inputs["input_ids"])
next_token = result["logits"][0, -1, :].argmax().item()
print(tokenizer.decode([next_token]))  # "Paris"

# Image generation (returns VAE latents)
latents = model.generate_image(inputs["input_ids"], num_steps=50)
# Decode the latents with an SDXL VAE to get an actual image (sketched below)
```
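
`generate_image` stops at latents rather than pixels. As a minimal sketch of the final decoding step, assuming the flow head targets the standard SDXL VAE latent space (the `stabilityai/sdxl-vae` checkpoint and its `scaling_factor` are assumptions here, not part of the toolkit):

```python
# Illustrative decode step -- assumes SDXL-VAE latents with the usual scaling factor
from diffusers import AutoencoderKL
from diffusers.image_processor import VaeImageProcessor

vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae").to("cuda")
with torch.no_grad():
    # Undo latent scaling, then decode back to pixel space
    image = vae.decode(latents.float() / vae.config.scaling_factor).sample
# Convert the (B, 3, H, W) tensor in [-1, 1] to a PIL image and save it
VaeImageProcessor().postprocess(image.cpu())[0].save("generated.png")
```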

### Load from Local Checkpoint

```python
# Download the checkpoint manually first:
# huggingface-cli download TinmanLabSL/SmolOmni-MLA-500M stage2_final/model.pt

model = SmolOmni.load_checkpoint(
    "path/to/stage2_final/model.pt",
    config="mla-hybrid-ar-flow-500M",
    device="cuda",
    dtype=torch.bfloat16,
)
```

### Build Architecture from Scratch (Random Weights)

```python
from smolomni import SmolOmni

# Instantiate the 500M architecture with randomly initialized weights
model = SmolOmni.from_pretrained("mla-hybrid-ar-flow-500M", device="cpu")
print(f"Parameters: {sum(p.numel() for p in model.parameters())/1e6:.1f}M")
```

### Audio Understanding (via Moonshine)

```python
from moonshine_integration import SmolOmniAudio

audio_model = SmolOmniAudio()
text = audio_model.transcribe("podcast.mp3")

# Then feed the transcript to the 500M VLM loaded above
result = model.forward_understanding(
    input_ids=tokenizer(text, return_tensors="pt")["input_ids"].to("cuda")
)
```

```
smolomni/
├── __init__.py     # Package exports (SmolOmni, SmolOmniModel, configs, get_model_config)
├── config.py       # Model configs: 256M / 500M presets with correct GQA/MLA layer assignments
├── attention.py    # MLA + GQA attention modules
├── model.py        # SmolOmniModel + SmolOmni factory (load_checkpoint, from_hub, from_pretrained)
├── model_500m.py   # Legacy 500M discrete-audio experiment (not used in production)
├── flow_head.py    # DiT flow-matching generation head with adaLN-Zero
├── svd_init.py     # X-EcoMLA Algorithm 1: MHA → MLA SVD conversion
└── audio.py        # Audio encoder (DistilHuBERT, frozen)

train.py       # Two-stage training (SVD init → joint AR + flow)
benchmark.py   # Automated benchmark suite (VRAM, throughput, KV cache)
```
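
Of these, `svd_init.py` deserves a sketch: X-EcoMLA-style initialization factorizes the pretrained key/value projections with a truncated SVD so the new MLA layers start close to the GQA teacher rather than from scratch. A schematic of the idea (the weight names, shapes, and rank value are illustrative, not the toolkit's exact Algorithm 1):

```python
import torch

def mla_init_from_kv(w_k: torch.Tensor, w_v: torch.Tensor, r_kv: int = 192):
    """Truncated-SVD init: stack [W_k; W_v] and factor it as W_up @ W_down."""
    w_kv = torch.cat([w_k, w_v], dim=0)               # (2*d_kv, d_model)
    u, s, vh = torch.linalg.svd(w_kv, full_matrices=False)
    w_down = vh[:r_kv]                                # (r_kv, d_model): x -> latent c_kv
    w_up = u[:, :r_kv] * s[:r_kv]                     # (2*d_kv, r_kv): c_kv -> K, V
    return w_down, w_up                               # best rank-r_kv approximation
```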

- **Dataset**: The Cauldron (chartqa subset)
- **Method**: Joint loss `L = L_AR + L_flow` (see the sketch below)
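
A minimal sketch of how such a joint objective is typically assembled; the tensor names and the rectified-flow velocity target are illustrative assumptions, not the exact code in `train.py`:

```python
import torch.nn.functional as F

def joint_loss(ar_logits, target_ids, pred_velocity, noise, clean_latents):
    # L_AR: next-token cross-entropy on the language-modeling head
    loss_ar = F.cross_entropy(
        ar_logits[:, :-1].reshape(-1, ar_logits.size(-1)),
        target_ids[:, 1:].reshape(-1),
    )
    # L_flow: flow matching; with x_t = (1 - t) * noise + t * clean,
    # the regression target is the constant velocity (clean - noise)
    loss_flow = F.mse_loss(pred_velocity, clean_latents - noise)
    return loss_ar + loss_flow
```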

## 🏗️ Architecture Details

### Hybrid Multi-Head Latent Attention (MLA)

**MLA** from DeepSeek-V2, adapted for VLMs via MHA2MLA-VLM:

- **GQA layers (early + late)**: Preserve pretrained vision knowledge from SmolVLM
- **MLA layers (middle)**: Compress the KV cache into a latent of rank `r_kv` plus a decoupled RoPE component of width `d_rope` (totals verified numerically after the tables below)
- **NoPE every 4th layer**: Remove position encoding for diversity

### 500M Verified Layer Assignment (from checkpoint weights)

| Layer Range | Attention Type | KV Cache / token / layer |
|-------------|----------------|--------------------------|
| 0–9 | GQA (pretrained vision) | 2 × 5 × 64 = 640 floats |
| 10–29 | MLA (KV compression) | 192 + 32 = 224 floats |
| 30–31 | GQA (late) | 2 × 5 × 64 = 640 floats |
| **Total** | **Hybrid** | **12,160 floats / token** (vs 20,480 baseline = **-40.6%**) |

### 256M Layer Assignment

| Layer Range | Attention Type | KV Cache / token / layer |
|-------------|----------------|--------------------------|
| 0–9 | GQA (pretrained vision) | 2 × 3 × 64 = 384 floats |
| 10–29 | MLA (KV compression) | 128 + 32 = 160 floats |
| **Total** | **Hybrid** | **7,040 floats / token** (vs 11,520 baseline = **-38.9%**) |
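
As a sanity check, both per-token totals follow directly from the layer assignments above (layer counts and per-layer sizes copied straight from the tables):

```python
def kv_per_token(layers):
    """layers: list of (num_layers, floats_per_layer) pairs."""
    return sum(n * floats for n, floats in layers)

print(kv_per_token([(10, 640), (20, 224), (2, 640)]))  # 12160 (500M; baseline 32*640 = 20480)
print(kv_per_token([(10, 384), (20, 160)]))            # 7040  (256M; baseline 30*384 = 11520)
```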

## 🔗 Related Models

- 📦 **256M Model**: [TinmanLabSL/SmolOmni-MLA-256M](https://huggingface.co/TinmanLabSL/SmolOmni-MLA-256M)