
Migration Guide: StreamingVLM → Qwen3-VL on ROCm

Overview of Changes

This document details every change made to port StreamingVLM from its original stack (Qwen2.5-VL-7B + flash-attn + CUDA) to the new one (Qwen3-VL-4B + SDPA + ROCm).


1. Dependency Changes

Original (infer_requirements.txt)

transformers==4.52.4
flash_attn==2.8.0.post2
torch==2.7.1
liger_kernel==0.6.1
qwen-vl-utils==0.0.11

Updated (requirements.txt)

transformers>=4.57.0  (install from source: pip install git+https://github.com/huggingface/transformers)
torch>=2.4.0          (ROCm or CUDA build)
qwen-vl-utils>=0.0.14
# NO flash-attn required!

Why: Qwen3-VL requires transformers 4.57.0+ (unreleased at the time of writing, so it must be installed from source). flash-attn is eliminated entirely and replaced with torch.nn.functional.scaled_dot_product_attention.
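
For reference, an install sequence on ROCm might look like this (the rocm6.2 wheel index is illustrative; match it to your ROCm version):

pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
pip install git+https://github.com/huggingface/transformers
pip install "qwen-vl-utils>=0.0.14"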


2. Model Class Changes

Original                                                             Updated
Qwen2_5_VLForConditionalGeneration                                   Qwen3VLForConditionalGeneration
from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import ...   No private imports needed
attn_implementation="flash_attention_2"                              attn_implementation="sdpa"

Loading Code

# BEFORE (original)
from transformers import Qwen2_5_VLForConditionalGeneration
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="cuda",
    attn_implementation="flash_attention_2"
)

# AFTER (this port)
import torch
from transformers import Qwen3VLForConditionalGeneration

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-4B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa"  # Works on ROCm + CUDA
)
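
A quick sanity check that the requested backend took effect (recent transformers versions record it on the config as _attn_implementation, a private but commonly inspected attribute):

print(model.config._attn_implementation)  # expect: "sdpa"
print(next(model.parameters()).device)    # ROCm devices also report as "cuda"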

3. Attention Mechanism Replacement

Language Model Attention

Original: Used _flash_attention_forward() (private transformers API, CUDA-only)

from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import _flash_attention_forward
attn_output = _flash_attention_forward(query, key, value, attention_mask, ...)

Updated: Uses F.scaled_dot_product_attention() (PyTorch native, ROCm+CUDA)

import torch.nn.functional as F
attn_output = F.scaled_dot_product_attention(
    query_states, key_states, value_states,
    attn_mask=attention_mask,
    dropout_p=0.0,
    is_causal=(attention_mask is None and q_len > 1),
)
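
SDPA requires the key/value head count to match the query head count (absent the newer enable_gqa option in recent PyTorch), so with grouped-query attention the KV heads must be expanded first. A minimal repeat_kv in the spirit of the helper this port reimplements locally (see section 9):

import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # [batch, num_kv_heads, seq, head_dim] -> [batch, num_kv_heads * n_rep, seq, head_dim]
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    expanded = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, slen, head_dim
    )
    return expanded.reshape(batch, num_kv_heads * n_rep, slen, head_dim)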

Vision Encoder Attention

Original: Used flash_attn_varlen_func for packed variable-length sequences

from flash_attn import flash_attn_varlen_func
attn_output = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens, max_seqlen, max_seqlen)

Updated: Chunked SDPA, which splits at sequence boundaries and applies SDPA per chunk

import torch
import torch.nn.functional as F

def _sdpa_varlen(query, key, value, cu_seqlens, max_seqlen):
    # query/key/value: [total_seq, num_heads, head_dim] packed tensors;
    # cu_seqlens marks the sequence boundaries. max_seqlen is unused but
    # kept for signature parity with flash_attn_varlen_func.
    outputs = []
    for i in range(cu_seqlens.shape[0] - 1):
        start, end = cu_seqlens[i].item(), cu_seqlens[i + 1].item()
        # [seq_i, heads, dim] -> [1, heads, seq_i, dim] as SDPA expects
        q_i = query[start:end].unsqueeze(0).transpose(1, 2)
        k_i = key[start:end].unsqueeze(0).transpose(1, 2)
        v_i = value[start:end].unsqueeze(0).transpose(1, 2)
        out = F.scaled_dot_product_attention(q_i, k_i, v_i, is_causal=False)
        outputs.append(out.transpose(1, 2).squeeze(0))
    return torch.cat(outputs, dim=0)
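
For reference, cu_seqlens follows the flash-attn convention of cumulative boundaries. A made-up example with three packed images:

import torch

seqlens = torch.tensor([64, 100, 81], dtype=torch.int32)  # patches per image
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.int32),
                        seqlens.cumsum(0).to(torch.int32)])
# cu_seqlens -> tensor([0, 64, 164, 245]); chunk i spans [cu_seqlens[i], cu_seqlens[i+1])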

4. Architecture Differences (Qwen2.5-VL → Qwen3-VL)

Feature            Qwen2.5-VL-7B   Qwen3-VL-4B
ViT patch size     14×14           16×16
ViT depth          32 layers       24 layers
ViT hidden         1280            1024
LM hidden          3584            2560
LM layers          28              36
KV heads           4               8
Max context        128K            256K
RoPE theta         1M              5M
MRoPE              standard        interleaved
QK-Norm            ❌              ✅
DeepStack          ❌              ✅ (layers 5, 11, 17)
ViT window attn    Yes             No (full attn)

Impact on StreamingVLM:

  • QK-Norm: self.q_norm(query) and self.k_norm(key) are applied before RoPE in attention (see the sketch after this list)
  • DeepStack: the vision encoder extracts features from intermediate layers for richer representations
  • Interleaved MRoPE: mrope_section=[24, 20, 20] (temporal/height/width) with interleaved application
  • No ViT window attention: a simpler vision encoder, with no fullatt_block_indexes logic needed
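
A sketch of where QK-Norm lands in the patched attention forward (excerpt only; q_norm/k_norm are the per-head RMSNorm modules on the attention layer, and apply_rotary_pos_emb is the usual RoPE helper):

# after projecting and reshaping q/k into [batch, heads, seq, head_dim]:
query_states = self.q_norm(query_states)  # RMSNorm over head_dim
key_states = self.k_norm(key_states)
# ...then apply rotary embeddings as before:
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)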

5. Monkey-Patch Target Changes

Original (10 patches on Qwen2.5-VL):

model.generate = streaming_generate
model.prepare_inputs_for_generation = ...
model._sample = ...
model.forward = qwen2_5_vl_forward
model.model.forward = model_forward
model.model.language_model.forward = streaming_language_model_forward  
model.model.language_model._update_causal_mask = ...
for layer in model.model.language_model.layers:  # "language_model" submodule
    layer.forward = streaming_text_decoder_layer_forward
    layer.self_attn.forward = streaming_text_flash_attn_forward
model.model.visual.forward = streaming_visual_encoder_forward

Updated (simplified patches on Qwen3-VL):

model.forward = streaming_qwen3_vl_forward
model.model.forward = streaming_model_forward
for layer in model.model.layers:  # directly on model.model (no "language_model"!)
    layer.forward = streaming_text_decoder_layer_forward
    layer.self_attn.forward = streaming_text_sdpa_forward
model.model.visual.forward = streaming_visual_encoder_forward
for blk in model.model.visual.blocks:
    blk.forward = streaming_visual_block_forward
    blk.attn.forward = streaming_visual_attention_forward
model.model.get_rope_index = get_rope_index_streaming

Key structural difference: In Qwen2.5-VL, text layers are at model.model.language_model.layers. In Qwen3-VL, they are at model.model.layers (no intermediate language_model wrapper).
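
Note that assigning a bare function to an instance attribute does not bind self. One common way to apply these patches (a sketch; the repo may bind differently) is via types.MethodType:

import types

def patch_method(obj, name, fn):
    # Bind fn so that obj is passed as `self` on every call
    setattr(obj, name, types.MethodType(fn, obj))

patch_method(model.model, "forward", streaming_model_forward)
for layer in model.model.layers:
    patch_method(layer, "forward", streaming_text_decoder_layer_forward)
    patch_method(layer.self_attn, "forward", streaming_text_sdpa_forward)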


6. KV Cache API Changes

transformers 4.50-4.52 (original):

cache.key_cache[layer_idx]   # List[Tensor]
cache.value_cache[layer_idx]

transformers 4.57.0+ (updated):

cache.layers[layer_idx].keys   # DynamicLayer objects
cache.layers[layer_idx].values

Our StreamingCache handles both via _get_layer_keys()/_set_layer_keys() compat methods.
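
A minimal sketch of that shim (method names from the sentence above; the real StreamingCache may differ in detail):

def _get_layer_keys(cache, layer_idx):
    if hasattr(cache, "layers"):          # newer layered cache (DynamicLayer)
        return cache.layers[layer_idx].keys
    return cache.key_cache[layer_idx]     # older List[Tensor] layout

def _set_layer_keys(cache, layer_idx, keys):
    if hasattr(cache, "layers"):
        cache.layers[layer_idx].keys = keys
    else:
        cache.key_cache[layer_idx] = keys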


7. Token IDs (Verified for Qwen3-VL)

Token              ID       Same as 2.5?
<|im_start|>       151644   ✅
<|im_end|>         151645   ✅
<|vision_start|>   151652   ✅
<|vision_end|>     151653   ✅
<|image_pad|>      151655   ✅
<|video_pad|>      151656   ✅

Token IDs are identical, so no changes are needed for token-level logic.
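
These IDs are easy to re-verify against the released checkpoint:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Instruct")
tokenizer = processor.tokenizer
for token in ["<|im_start|>", "<|im_end|>", "<|vision_start|>",
              "<|vision_end|>", "<|image_pad|>", "<|video_pad|>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))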


8. ROCm-Specific Configuration

Performance tiers (recommended):

Priority   Implementation                   Install                        Performance   Platform
1          flash_attention_2 + CK backend   Build flash-attn from source   Best          ROCm ≥5.7
2          sdpa (default)                   None (PyTorch built-in)        Good          Any
3          eager                            None                           Slowest       Any

For ROCm flash-attn (optional, for maximum throughput):

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
pip install .
# Then use: attn_implementation="flash_attention_2"

Environment variables:

export PYTORCH_ROCM_ARCH="gfx942"    # MI300X
# export PYTORCH_ROCM_ARCH="gfx90a"  # MI250X
# export PYTORCH_ROCM_ARCH="gfx1100" # RX 7900 XTX

# Flash-attn backend selection:
export FLASH_ATTENTION_TRITON_AMD_ENABLE="FALSE"  # CK (default, recommended)
# export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" # Triton (alternative)
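
To sanity-check the GPU stack and optionally pin SDPA to a specific backend for benchmarking (torch.nn.attention ships with the torch>=2.4 builds this port targets; the backend list below is illustrative):

import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

print("HIP runtime:", torch.version.hip)            # non-None on ROCm builds
print("GPU available:", torch.cuda.is_available())  # True on ROCm as well

with sdpa_kernel([SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH]):
    q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)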

9. Eliminated Private API Dependencies

The original code imported many private symbols from transformers internals. These are all eliminated in this port:

# REMOVED: these broke between transformers versions
from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import (
    _flash_attention_forward,      # Private, CUDA-only
    flash_attn_varlen_func,        # Private re-export of flash_attn
    rotate_half,                   # Reimplemented locally
    repeat_kv,                     # Reimplemented locally
    apply_rotary_pos_emb_flashatt, # Not needed with SDPA
    StaticCache,                   # Using DynamicCache directly
    SlidingWindowCache,            # Not used
    AttentionMaskConverter,        # Not used
    make_flex_block_causal_mask,   # Not used
    BlockMask,                     # Not used
)

# KEPT: only stable public APIs
from transformers import (
    Qwen3VLForConditionalGeneration,  # Public model class
    AutoProcessor,                     # Public processor
    DynamicCache,                      # Public cache class
)
from transformers.modeling_outputs import (
    BaseModelOutputWithPast,           # Public output type
    CausalLMOutputWithPast,            # Public output type
)
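
As an example of how small these vendored helpers are, rotate_half matches the standard RoPE helper:

import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Split the last dim in half and rotate: (x1, x2) -> (-x2, x1)
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)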

10. Running the Updated Code

Inference (streaming commentary):

python -m streaming_vlm.inference.inference \
    --model_path Qwen/Qwen3-VL-4B-Instruct \
    --video_path match.mp4 \
    --output_path commentary.vtt \
    --fps 2 \
    --attn_implementation sdpa

Training (Stage 1):

# Download data
# huggingface-cli download mit-han-lab/Inf-Stream-Train --local-dir ./data

bash scripts/sft_stage_1.sh

Tests:

python test_imports.py
# Expected: 6/6 tests pass with no CUDA/flash-attn dependency