---
license: apache-2.0
tags:
- vision-language
- diffusion
- xlstm
- vision-lstm
- masked-diffusion
- mdlm
- multimodal
language: en
pipeline_tag: image-text-to-text
---

# ViL-DLM: Vision xLSTM Diffusion Language Model (~660M)

**The first vision-language model combining Vision xLSTM (ViL) with a discrete masked diffusion language model backbone.**

## Architecture

```
[Image] → ViL-S Encoder (57M) → MLP Projector (7M) → [196 Visual Tokens]
[Visual Tokens] + [Masked Text Tokens] → Bidirectional Diffusion LM (596M) → Denoised Text
```

| Component | Model | Params | Key Innovation |
|-----------|-------|--------|----------------|
| Vision Encoder | **Vision-xLSTM-S (ViL-S)** | ~57M | O(N) linear complexity, alternating bidirectional mLSTM with Conv2D |
| Projector | 2-layer MLP (GELU) | ~7M | Maps ViL features (384d) → LM space (1024d) |
| Language Backbone | **dLLM Qwen3-0.6B (MDLM)** | ~596M | Bidirectional masked diffusion, non-autoregressive |
| **Total** | | **~660M** | |
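
Putting the three components in the table above together, the forward pass encodes the image with ViL-S, projects the patch features into the LM embedding space, prepends them to the (partially masked) text embeddings, and runs the bidirectional diffusion LM over the combined sequence. The following is a minimal sketch of that wiring, assuming hypothetical module and attribute names (`vil_encoder`, `diffusion_lm`), not the actual classes in `code/train_production.py`:

```python
import torch
import torch.nn as nn

class ViLDLMSketch(nn.Module):
    """Illustrative wiring: ViL-S encoder -> MLP projector -> diffusion LM."""

    def __init__(self, vil_encoder, diffusion_lm, vil_dim=384, lm_dim=1024):
        super().__init__()
        self.vil_encoder = vil_encoder            # ViL-S: (B, 3, 224, 224) -> (B, 196, 384)
        self.projector = nn.Sequential(           # 2-layer MLP with GELU (~7M params)
            nn.Linear(vil_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )
        self.diffusion_lm = diffusion_lm          # bidirectional MDLM backbone (~596M params)

    def forward(self, pixel_values, masked_input_ids, attention_mask):
        vis = self.projector(self.vil_encoder(pixel_values))            # (B, 196, 1024)
        txt = self.diffusion_lm.get_input_embeddings()(masked_input_ids)
        inputs_embeds = torch.cat([vis, txt], dim=1)                     # visual prefix + masked text
        vis_mask = torch.ones(vis.shape[:2], dtype=attention_mask.dtype,
                              device=attention_mask.device)
        full_mask = torch.cat([vis_mask, attention_mask], dim=1)         # padding-only, bidirectional
        out = self.diffusion_lm(inputs_embeds=inputs_embeds, attention_mask=full_mask)
        return out.logits[:, vis.shape[1]:]                              # logits over text positions only
```

The visual tokens are never masked; they act as a fixed conditioning prefix while the diffusion process operates on the text positions.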

## How It Works

### Vision Encoder: Vision xLSTM (ViL-S)
- Based on [Vision-LSTM](https://arxiv.org/abs/2406.04303) (Alkin et al., 2024)
- Processes 224×224 images into 196 patch tokens (16×16 patches)
- Uses **mLSTM blocks** with matrix memory cells and exponential gating
- **Alternating bidirectional scanning**: odd blocks scan top-left→bottom-right, even blocks reverse (see the sketch after this list)
- **Conv2D for spatial QK context**: depthwise 3×3 convolution adds local spatial awareness
- **SwiGLU FFN** after each mLSTM block
- Linear O(N) complexity vs ViT's quadratic O(N²) — critical for scaling to many visual tokens
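
The alternating scan can be pictured as flipping the patch sequence for every second block, so half the blocks see the image top-left→bottom-right and the other half see it in reverse. A minimal sketch, assuming a plain list of mLSTM block modules (the name `mlstm_blocks` is hypothetical):

```python
import torch

def alternating_bidirectional_scan(tokens, mlstm_blocks):
    """tokens: (B, 196, D) patch tokens in row-major (top-left -> bottom-right) order."""
    for i, block in enumerate(mlstm_blocks):
        if i % 2 == 0:
            tokens = block(tokens)                 # odd-numbered blocks: forward scan
        else:
            tokens = torch.flip(tokens, dims=[1])  # even-numbered blocks: reversed scan
            tokens = block(tokens)
            tokens = torch.flip(tokens, dims=[1])  # restore row-major order for the next block
    return tokens
```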

### Diffusion Language Model: MDLM via dLLM
- Based on [dLLM](https://arxiv.org/abs/2602.22661) (Berkeley, 2025), which converts Qwen3-0.6B into a diffusion LM
- **Training**: forward diffusion progressively masks tokens under a cosine schedule → the model predicts the masked tokens (sketched below)
- **Inference**: start from an all-masked output → iteratively unmask the most-confident tokens
- **Key change from AR**: the causal attention mask is replaced with a bidirectional, padding-only mask
- Weighted cross-entropy loss on masked positions only (the MDLM objective)
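
A single training step therefore looks roughly like the sketch below. The cosine schedule and the masked-positions-only weighted loss follow the MDLM objective, but the exact weighting and the `mask_token_id` handling here are assumptions, not the repo's implementation:

```python
import torch
import torch.nn.functional as F

def mdlm_training_step(model, input_ids, mask_token_id):
    """One masked-diffusion step: sample a timestep, mask tokens, score only masked positions."""
    B, L = input_ids.shape
    t = torch.rand(B, device=input_ids.device)                        # timestep ~ U(0, 1)
    mask_prob = 1.0 - torch.cos(t * torch.pi / 2)                     # cosine masking schedule
    is_masked = torch.rand(B, L, device=input_ids.device) < mask_prob[:, None]
    noisy_ids = torch.where(is_masked, torch.full_like(input_ids, mask_token_id), input_ids)

    logits = model(input_ids=noisy_ids).logits                        # bidirectional denoiser
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)),
                         input_ids.view(-1), reduction="none").view(B, L)
    weight = 1.0 / mask_prob.clamp(min=1e-4)                          # up-weight lightly masked samples
    loss = (ce * is_masked * weight[:, None]).sum() / is_masked.sum().clamp(min=1)
    return loss
```

Inference runs the reverse process: start from an all-mask response, predict every position, keep the most-confident predictions, and repeat for a fixed number of denoising steps.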

### Knowledge Distillation (Stage 3)
- Teacher: [Gemma 4 E2B](https://huggingface.co/google/gemma-4-E2B-it) (5.1B params, ~2B effective)
- **Sparse cross-tokenizer distillation**: prepare a teacher-scored candidate bank in the student token space, then blend a sparse KL term with the diffusion loss (see the sketch after this list)
- Temperature τ=2.0, α_KD=0.5 (50% diffusion loss + 50% KD loss)
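
Because the Gemma teacher and the Qwen3-based student use different tokenizers, the teacher distribution is cached as a sparse set of candidate tokens already mapped into the student vocabulary, and the KD term is computed only on that support. A rough sketch of the loss blending follows; the bank layout, argument names, and exact normalization are assumptions, and the cache written by Stage 3a may differ:

```python
import torch
import torch.nn.functional as F

def sparse_kd_loss(student_logits, cand_ids, cand_probs, temperature=2.0):
    """student_logits: (P, V) at the distilled positions; cand_ids/cand_probs: (P, K)
    teacher top-K candidates already mapped into the student token space."""
    log_q = F.log_softmax(student_logits / temperature, dim=-1)
    log_q_cand = torch.gather(log_q, dim=-1, index=cand_ids)      # student log-probs on the sparse support
    p = cand_probs / cand_probs.sum(dim=-1, keepdim=True)         # renormalize teacher mass over top-K
    # sparse cross-entropy (equals KL up to the teacher-entropy constant), with the usual T^2 scaling
    return -(p * log_q_cand).sum(dim=-1).mean() * temperature ** 2

def blended_loss(diffusion_loss, kd_loss, alpha_kd=0.5):
    """alpha_kd = 0.5 gives the 50/50 split between diffusion and KD losses."""
    return (1.0 - alpha_kd) * diffusion_loss + alpha_kd * kd_loss
```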

## Training Recipe

Multi-stage training inspired by LLaDA-V, LaViDa, LFM2, and Mistral/Pixtral:

| Stage | Components Trained | Dataset | Learning Rate / Objective | Epochs |
|-------|-------------------|---------|---------------------------|--------|
| 1 | Projector only (ViL & LM frozen) | [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) (558K) | 1e-3 | 1-2 |
| 2 | Full model (all components) | [The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) | ViL: 2e-6, Proj: 1e-5, LM: 1e-5 | 3 |
| 3 | + KD from Gemma 4 E2B | Stage 2 data mix + cached teacher bank | Sparse cross-tokenizer KD (α=0.5) | 2 |

### Efficiency Tricks Applied
- **Per-component learning rates** (LLaDA-V recipe): the vision encoder gets a 5× lower LR (sketched after this list)
- **Gradient checkpointing** on the LM backbone to reduce VRAM
- **Cosine LR schedule** with warmup
- **Gradient clipping** at 1.0
- AdamW optimizer with β=(0.9, 0.999), weight_decay=0.05
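
Concretely, the Stage 2 setup above could be wired roughly as follows. The attribute names (`vision_encoder`, `projector`, `lm`) and the use of `transformers.get_cosine_schedule_with_warmup` are assumptions for illustration; only the hyperparameters come from the recipe above:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

def build_stage2_optimizer(model, total_steps, warmup_steps):
    """Per-component LRs (ViL 5x lower), AdamW, cosine schedule with warmup."""
    param_groups = [
        {"params": model.vision_encoder.parameters(), "lr": 2e-6},   # ViL encoder
        {"params": model.projector.parameters(),      "lr": 1e-5},   # MLP projector
        {"params": model.lm.parameters(),             "lr": 1e-5},   # diffusion LM backbone
    ]
    optimizer = torch.optim.AdamW(param_groups, betas=(0.9, 0.999), weight_decay=0.05)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
    return optimizer, scheduler

def training_step(model, batch, optimizer, scheduler):
    """One optimization step with gradient clipping at 1.0."""
    loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.item()
```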

## Why This Combination Matters

This is a genuinely **unexplored frontier** in the literature:

1. **No published work** combines Vision xLSTM with a diffusion language model
2. ViL's **linear complexity** could be transformative for processing large numbers of visual tokens in multimodal diffusion models, where current Transformer-based approaches incur quadratic attention costs
3. The **bidirectional nature** of both ViL (alternating scan) and the diffusion LM (full attention) creates natural synergy — both architectures process information non-autoregressively
4. **Distillation from Gemma 4 E2B** bridges the gap between the small student and a state-of-the-art multimodal teacher

## Running Training

```bash
# CPU smoke: Stage 1 projector path
python code/train_production.py --stage 1 --epochs 1 --batch_size 1 --grad_accum 1 --num_workers 0 --max_samples 1 --dry_run_batches 1

# CPU smoke: Stage 2 subset path
python code/train_production.py --stage 2 --resume_from ./vil-dlm-output/stage1_best --dataset_configs ai2d,aokvqa --epochs 1 --batch_size 1 --grad_accum 1 --num_workers 0 --max_samples 8 --dry_run_batches 1

# Stage 1: projector-only alignment
python code/train_production.py --stage 1 --require_cuda --epochs 1 --batch_size 8 --grad_accum 4

# Stage 2: full-model finetune on the balanced Cauldron mix
python code/train_production.py --stage 2 --require_cuda --epochs 3 --batch_size 2 --grad_accum 16

# Stage 3a: build the Gemma teacher candidate bank from a Stage 2 checkpoint (GPU only)
python code/train_production.py --stage 3a --require_cuda --resume_from ./vil-dlm-output/stage2_best --prepare_teacher_bank --teacher_batch_size 2 --kd_top_k 16 --kd_positions_per_sample 16 --kd_temperature 1.0

# Stage 3b: timestep-aware sparse KD training from the cached teacher bank (GPU only)
python code/train_production.py --stage 3b --require_cuda --resume_from ./vil-dlm-output/stage2_best --epochs 2 --batch_size 2 --grad_accum 16 --alpha_kd 0.5 --kd_temperature 1.0

# Cheap validation gate for any stage
python code/train_production.py --stage 1 --require_cuda --dry_run_batches 1 --max_samples 8

# Cheap Stage 3 KD-consumption gate after Stage 3a wrote a teacher bank
python code/train_production.py --stage 3b --require_cuda --resume_from ./vil-dlm-output/stage2_best --teacher_cache_dir ./vil-dlm-output/teacher-cache --epochs 1 --batch_size 1 --grad_accum 1 --dry_run_batches 1 --alpha_kd 0.5 --kd_temperature 1.0
```

Training saves checkpoints locally by default. Add `--push_to_hub` only when you want to publish artifacts.
Stage 3b uses timestep-aware sparse KD by default; pass `--no-kd_timestep_weighting` only to reproduce the pilot's fixed-alpha KD behavior.
CPU sessions should stop after the Stage 2 subset smoke test. Stage 3 requires a CUDA GPU because Gemma 4 teacher-bank preparation uses quantized multimodal teacher inference.

### Hardware Requirements
- **Stage 1**: A10G (24GB) or T4 (16GB) — only projector gradients (~7M params)
- **Stage 2**: A10G (24GB) recommended — full model gradients (~660M params)
- **Stage 3**: H100 / A100 (80GB) recommended — Gemma 4 teacher bank prep + student distillation

### Dependencies
```
torch>=2.0
transformers>=5.5.0
datasets
trackio
accelerate
einops
pillow
huggingface_hub
```

## Pretrained Components Used

| Component | Source | License |
|-----------|--------|---------|
| Diffusion LM backbone | [dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1](https://huggingface.co/dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1) | Apache 2.0 |
| Teacher (Stage 3) | [google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it) | Apache 2.0 |
| Pretraining data | [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) | CC BY 4.0 |
| Instruction data | [The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) | Mixed |

## Full Literature Context

This model was designed based on a comprehensive literature review covering:
- **Language Diffusion Models**: MDLM, LLaDA, Plaid, BD3-LM, EDLM, ADLM
- **Vision-Language Diffusion Models**: LaViDa, LLaDA-V, MMaDA, Muddit, LMFusion, DEEM
- **Vision xLSTM**: ViL (arXiv:2406.04303)
- **Knowledge Distillation for DLMs**: TCS distillation, DiffuGPT/DiffuLLaMA, dLLM
- **JEPA family**: I-JEPA, D-JEPA, VL-JEPA, LLM-JEPA
- **Efficiency recipes**: LFM2 (Liquid AI), Mistral/Pixtral, Minitron

## References

| Paper | arXiv | Role |
|-------|-------|------|
| Vision-LSTM (ViL) | [2406.04303](https://arxiv.org/abs/2406.04303) | Vision encoder architecture |
| dLLM | [2602.22661](https://arxiv.org/abs/2602.22661) | Diffusion LM backbone |
| MDLM | [2406.07524](https://arxiv.org/abs/2406.07524) | Core diffusion objective |
| LLaDA-V | [2505.16933](https://arxiv.org/abs/2505.16933) | VLM training recipe |
| LaViDa | — | Complementary masking, prefix KV cache |
| LFM2 | [2511.23404](https://arxiv.org/abs/2511.23404) | Top-K distillation recipe |
| Gemma 4 | [blog](https://huggingface.co/blog/gemma4) | Teacher model |
| Pixtral | [2410.07073](https://arxiv.org/abs/2410.07073) | RoPE-2D, variable-res ViT |
| Minitron | [2407.14679](https://arxiv.org/abs/2407.14679) | Prune+distill best practices |