StreamingVLM: Real-Time Understanding for Infinite Video Streams
PitchAI: Your AI Broadcast Companion. Real-time AI-powered football commentary with live tactical vision.
This is an updated port of StreamingVLM (arXiv:2510.09608) adapted for:
- flash_attention_2 (maximum throughput)
- sdpa (no flash-attn build needed)

| Model | GPU | Attention | VRAM | Speed | Use Case |
|---|---|---|---|---|---|
| Qwen3-VL-2B | RTX 3090/4090 | flash_attention_2 | ~6 GB | ~12 FPS | Real-time on consumer HW |
| Qwen3-VL-2B | MI250X/MI300X | sdpa | ~6 GB | ~8 FPS | ROCm deployment |
| Qwen3-VL-4B | A100/H100 | flash_attention_2 | ~12 GB | ~8 FPS | Production quality |
| Qwen3-VL-4B | MI300X | sdpa | ~12 GB | ~6 FPS | ROCm high quality |
| Component | Original | This Port |
|---|---|---|
| Model | Qwen2.5-VL-7B-Instruct | Qwen3-VL-2B/4B-Instruct |
| Attention | flash_attention_2 (CUDA only) | SDPA + flash_attention_2 (both!) |
| Transformers | 4.50.0–4.52.4 | 4.57.0+ (from source) |
| Vision Attn | flash_attn_varlen_func | chunked SDPA |
| ViT Patch Size | 14×14 | 16×16 |
| RoPE | mrope (standard) | mrope_interleaved |
| DeepStack | ❌ | ✅ (layers 5, 11, 17) |
| Max Context | 128K | 256K |
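
As the table notes, the vision tower replaces `flash_attn_varlen_func` with chunked SDPA. A minimal sketch of the idea, assuming `cu_seqlens`-style chunk boundaries as the varlen kernel uses (names are illustrative, not the repo's actual API):

```python
import torch
import torch.nn.functional as F

def chunked_sdpa(q, k, v, cu_seqlens):
    """q, k, v: [total_tokens, num_heads, head_dim]. cu_seqlens holds
    cumulative chunk boundaries, e.g. [0, n1, n1 + n2, ...]. Each chunk is
    attended independently with PyTorch SDPA instead of one fused
    varlen flash-attn call."""
    outs = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        # transpose to [heads, chunk_len, head_dim]; heads act as batch
        qi, ki, vi = (t[start:end].transpose(0, 1) for t in (q, k, v))
        oi = F.scaled_dot_product_attention(qi, ki, vi)
        outs.append(oi.transpose(0, 1))  # back to [chunk_len, heads, head_dim]
    return torch.cat(outs, dim=0)
```

SDPA ships with PyTorch itself, so this path runs on ROCm with no extra builds; the trade-off versus the fused varlen kernel is a Python loop over chunks.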
CUDA (NVIDIA):

```bash
conda create -n streaming_qwen3 python=3.11 -y
conda activate streaming_qwen3

# PyTorch with CUDA
pip install torch torchvision

# Transformers from source (Qwen3-VL requires 4.57.0+)
pip install git+https://github.com/huggingface/transformers

# Flash Attention 2 (recommended for CUDA)
pip install flash-attn --no-build-isolation

# Other deps
pip install -r requirements.txt
```
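
A quick sanity check of the CUDA environment before running inference (a minimal sketch; `flash_attn` is only needed for `flash_attention_2`):

```python
# Verify the stack installed above.
import torch
import transformers

print(torch.cuda.is_available())   # expect True on an NVIDIA machine
print(transformers.__version__)    # Qwen3-VL needs >= 4.57.0
try:
    import flash_attn
    print(flash_attn.__version__)
except ImportError:
    print("flash-attn missing; use --attn_implementation sdpa instead")
```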
ROCm (AMD):

```bash
conda create -n streaming_qwen3 python=3.11 -y
conda activate streaming_qwen3

# PyTorch with ROCm
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2

# Transformers from source
pip install git+https://github.com/huggingface/transformers

# No flash-attn needed! SDPA works out of the box.
pip install -r requirements.txt
```
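
To confirm the ROCm build of PyTorch is active (a minimal sketch; on ROCm wheels `torch.cuda` reports HIP devices):

```python
import torch

print(torch.cuda.is_available())  # True once the HIP device is visible
print(torch.version.hip)          # ROCm/HIP version string; None on CUDA builds
```

SDPA ships with PyTorch 2.0+, so no additional attention package is required.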
Real-time commentary with the 2B model (CUDA):

```bash
python -m streaming_vlm.inference.inference \
    --model_path Qwen/Qwen3-VL-2B-Instruct \
    --video_path /path/to/football_match.mp4 \
    --output_path ./commentary.vtt \
    --fps 4 \
    --attn_implementation flash_attention_2 \
    --window_size 24 \
    --text_sliding_window 768
```
Higher quality with the 4B model:

```bash
python -m streaming_vlm.inference.inference \
    --model_path Qwen/Qwen3-VL-4B-Instruct \
    --video_path /path/to/football_match.mp4 \
    --output_path ./commentary.vtt \
    --fps 2 \
    --attn_implementation flash_attention_2
```
On ROCm (or any setup without flash-attn), switch to SDPA:

```bash
python -m streaming_vlm.inference.inference \
    --model_path Qwen/Qwen3-VL-2B-Instruct \
    --video_path /path/to/match.mp4 \
    --output_path ./commentary.vtt \
    --fps 2 \
    --attn_implementation sdpa
```
Interactive mode:

```bash
python -m streaming_vlm.inference.inference \
    --model_path Qwen/Qwen3-VL-2B-Instruct \
    --video_path /path/to/match.mp4 \
    --interactive \
    --attn_implementation flash_attention_2
```
From Python:

```python
from streaming_vlm.cuda_config import get_cuda_config, PRESET_2B_CUDA
from streaming_vlm.inference.inference import streaming_inference

# Quick preset
config = get_cuda_config(model_size="2b", video_path="match.mp4")
results = streaming_inference(config)

# Or use presets directly
print(PRESET_2B_CUDA)
# {'model_path': 'Qwen/Qwen3-VL-2B-Instruct', 'attn_implementation': 'flash_attention_2', ...}
```
StreamingVLM maintains a compact KV cache for infinite video streams: it reuses the states of a few attention sinks and keeps only a short sliding window of recent vision tokens plus a longer sliding window of recent text tokens (controlled by `--window_size` and `--text_sliding_window`), so memory stays bounded no matter how long the stream runs.
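
A hedged sketch of the sink-plus-sliding-window eviction idea (illustrative only; the port's real implementation lives in `streaming_cache.py`, and the sink count here is an assumed default):

```python
import torch

def prune_kv(k: torch.Tensor, v: torch.Tensor,
             num_sink: int = 4, window: int = 768):
    """k, v: [batch, kv_heads, seq_len, head_dim]. Keep the first num_sink
    positions (attention sinks) and the last window positions; evict the
    middle. Surviving positions must then be re-indexed contiguously for
    RoPE, which pos_emb.py ("Contiguous RoPE") handles in the repo."""
    seq_len = k.shape[2]
    if seq_len <= num_sink + window:
        return k, v  # under budget, nothing to evict
    k = torch.cat([k[:, :, :num_sink], k[:, :, -window:]], dim=2)
    v = torch.cat([v[:, :, :num_sink], v[:, :, -window:]], dim=2)
    return k, v
```

The two model sizes share the same vision stack and differ only in the language model: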
| Component | 2B | 4B |
|---|---|---|
| Vision Encoder | Same (24L, 1024h, patch 16) | Same |
| DeepStack | Same [5, 11, 17] | Same |
| LM Hidden | 2048 | 2560 |
| LM Layers | 28 | 36 |
| Attention Heads | 16 | 32 |
| KV Heads | 8 | 8 |
| Parameters | 2.1B | 4.4B |
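
Because the cache is pruned to sinks plus sliding windows, its footprint stays small. A back-of-envelope estimate for the 2B model, assuming head_dim = hidden / heads = 2048 / 16 = 128 and bf16 storage (vision-window tokens add on top of this):

```python
# Rough KV cache size per cached token for the 2B model (estimates only).
layers, kv_heads, head_dim, bf16_bytes = 28, 8, 128, 2

per_token = 2 * kv_heads * head_dim * bf16_bytes * layers   # K and V planes
print(per_token)                      # 114688 bytes, about 112 KiB per token
print(per_token * (4 + 768) / 2**20)  # ~84 MiB for 4 sinks + a 768-token window
```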
```bash
# Stage 1: Streaming pattern training (update MODEL_PATH in script for 2B)
bash scripts/sft_stage_1.sh

# Stage 2: Fine-grained annealing
bash scripts/sft_stage_2.sh
```
| Implementation | Platform | Install | When to use |
|---|---|---|---|
| flash_attention_2 | CUDA | pip install flash-attn | Default for NVIDIA (fastest) |
| sdpa | Any | None | Default for ROCm / fallback for CUDA |
| eager | Any | None | Debugging only (slowest) |
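
All three values map to the standard `attn_implementation` loading argument in transformers. A minimal loading sketch (the auto class chosen here is an assumption; check the model's config for the right one):

```python
import torch
from transformers import AutoModelForImageTextToText

# Swap "flash_attention_2" for "sdpa" on ROCm, or "eager" when debugging.
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```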
Tips:

- Use `--fps 4` with the 2B model for higher temporal resolution (it is fast enough).

To build Flash Attention from source instead of using the wheel:

```bash
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention && pip install .
```

On ROCm, set `PYTORCH_ROCM_ARCH` for your GPU before building: `gfx942` (MI300X), `gfx90a` (MI250X).

```
streaming-vlm-qwen3-rocm/
├── README.md                      # This file
├── MIGRATION_GUIDE.md             # Detailed diff from original
├── requirements.txt
├── setup.py
├── train.py                       # SFT training entry point
├── test_imports.py                # Unit tests (no GPU needed)
├── scripts/
│   ├── sft_stage_1.sh
│   ├── sft_stage_2.sh
│   └── zero3.json
└── streaming_vlm/
    ├── cuda_config.py             # CUDA/ROCm presets for 2B & 4B
    ├── inference/
    │   ├── inference.py           # Main streaming loop
    │   ├── streaming_args.py      # Streaming config dataclass
    │   ├── generate/
    │   │   └── streaming_cache.py # KV cache with pruning
    │   └── qwen3/                 # Qwen3-VL specific patches
    │       ├── patch_model.py     # Apply all monkey-patches
    │       ├── language_forward.py  # SDPA text attention
    │       ├── vision_forward.py    # Chunked SDPA vision
    │       ├── model_forward.py     # Top-level forwards
    │       └── pos_emb.py           # Contiguous RoPE
    └── utils/
        └── get_qwen_range.py      # Token ID utilities
```
```bibtex
@article{xu2025streamingvlm,
  title={StreamingVLM: Real-Time Understanding for Infinite Video Streams},
  author={Xu, Jiayi and Xiao, Guangxuan and others},
  journal={arXiv preprint arXiv:2510.09608},
  year={2025}
}
```
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "s23deepak/streaming-vlm-qwen3-rocm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
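
Since Qwen3-VL is a vision-language model, the image-text-to-text auto classes are likely the better fit here (an assumption; verify against the repository's `config.json`):

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "s23deepak/streaming-vlm-qwen3-rocm"
processor = AutoProcessor.from_pretrained(model_id)  # handles frames + text
model = AutoModelForImageTextToText.from_pretrained(model_id)
```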