StreamingVLM – Qwen3-VL (2B & 4B) for CUDA + ROCm

PitchAI: Your AI Broadcast Companion – real-time AI-powered football commentary with live tactical vision.

This port of StreamingVLM (arXiv:2510.09608) is adapted for:

  • Qwen3-VL-2B-Instruct ⚡ (lightweight, runs on consumer GPUs)
  • Qwen3-VL-4B-Instruct 🏋️ (higher-quality commentary)
  • NVIDIA CUDA with flash_attention_2 (maximum throughput)
  • AMD ROCm with sdpa (no flash-attn build needed)
  • Latest Transformers (4.57.0+ from source)

Supported Configurations

Model        GPU            Attention          VRAM    Speed    Use Case
Qwen3-VL-2B  RTX 3090/4090  flash_attention_2  ~6 GB   ~12 FPS  Real-time on consumer HW
Qwen3-VL-2B  MI250X/MI300X  sdpa               ~6 GB   ~8 FPS   ROCm deployment
Qwen3-VL-4B  A100/H100      flash_attention_2  ~12 GB  ~8 FPS   Production quality
Qwen3-VL-4B  MI300X         sdpa               ~12 GB  ~6 FPS   ROCm high quality

Key Changes from Original StreamingVLM

Component       Original                       This Port
Model           Qwen2.5-VL-7B-Instruct         Qwen3-VL-2B/4B-Instruct
Attention       flash_attention_2 (CUDA only)  SDPA + flash_attention_2 (both)
Transformers    4.50.0–4.52.4                  4.57.0+ (from source)
Vision Attn     flash_attn_varlen_func         chunked SDPA
ViT Patch Size  14×14                          16×16
RoPE            mrope (standard)               mrope_interleaved
DeepStack       ❌                             ✅ (layers 5, 11, 17)
Max Context     128K                           256K
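
The vision-attention swap is the most invasive change: flash_attn_varlen_func consumes packed variable-length sequences delimited by cu_seqlens offsets, and this port reproduces that with plain PyTorch SDPA run once per packed chunk. A minimal sketch of the idea, with illustrative names rather than the port's actual signatures (the real code lives in streaming_vlm/inference/qwen3/vision_forward.py):

import torch
import torch.nn.functional as F

def chunked_sdpa(q, k, v, cu_seqlens):
    # q, k, v: (total_tokens, num_heads, head_dim), packed the way
    # flash_attn_varlen_func expects; cu_seqlens holds cumulative
    # sequence boundaries, e.g. [0, len_0, len_0 + len_1, ...].
    outputs = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        # Slice one sequence and move heads first: (1, heads, len, dim)
        qi = q[start:end].transpose(0, 1).unsqueeze(0)
        ki = k[start:end].transpose(0, 1).unsqueeze(0)
        vi = v[start:end].transpose(0, 1).unsqueeze(0)
        oi = F.scaled_dot_product_attention(qi, ki, vi)
        # Back to (len, heads, dim) and collect
        outputs.append(oi.squeeze(0).transpose(0, 1))
    return torch.cat(outputs, dim=0)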

Installation

For CUDA (NVIDIA GPUs)

conda create -n streaming_qwen3 python=3.11 -y
conda activate streaming_qwen3

# PyTorch with CUDA
pip install torch torchvision

# Transformers from source (Qwen3-VL requires 4.57.0+)
pip install git+https://github.com/huggingface/transformers

# Flash Attention 2 (recommended for CUDA)
pip install flash-attn --no-build-isolation

# Other deps
pip install -r requirements.txt
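
A quick sanity check after installation (optional, not part of the repo):

import torch
import transformers

print(torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)  # Qwen3-VL needs 4.57.0+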

For ROCm (AMD GPUs)

conda create -n streaming_qwen3 python=3.11 -y
conda activate streaming_qwen3

# PyTorch with ROCm
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2

# Transformers from source
pip install git+https://github.com/huggingface/transformers

# No flash-attn needed! SDPA works out of the box.
pip install -r requirements.txt
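
The same style of sanity check works on ROCm; note that ROCm devices are exposed through the torch.cuda namespace:

import torch

print(torch.__version__)          # ROCm wheels report e.g. "2.x.x+rocm6.2"
print(torch.cuda.is_available())  # True on a working ROCm install
print(torch.version.hip)          # HIP version string on ROCm, None on CUDA builds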

Quick Start

🚀 2B Model on CUDA (fastest, consumer-GPU friendly)

python -m streaming_vlm.inference.inference \
    --model_path Qwen/Qwen3-VL-2B-Instruct \
    --video_path /path/to/football_match.mp4 \
    --output_path ./commentary.vtt \
    --fps 4 \
    --attn_implementation flash_attention_2 \
    --window_size 24 \
    --text_sliding_window 768

πŸ‹οΈ 4B Model on CUDA (best quality)

python -m streaming_vlm.inference.inference \
    --model_path Qwen/Qwen3-VL-4B-Instruct \
    --video_path /path/to/football_match.mp4 \
    --output_path ./commentary.vtt \
    --fps 2 \
    --attn_implementation flash_attention_2

🔴 Any model on ROCm (no flash-attn needed)

python -m streaming_vlm.inference.inference \
    --model_path Qwen/Qwen3-VL-2B-Instruct \
    --video_path /path/to/match.mp4 \
    --output_path ./commentary.vtt \
    --fps 2 \
    --attn_implementation sdpa

💬 Interactive Q&A Mode

python -m streaming_vlm.inference.inference \
    --model_path Qwen/Qwen3-VL-2B-Instruct \
    --video_path /path/to/match.mp4 \
    --interactive \
    --attn_implementation flash_attention_2

Python API

from streaming_vlm.cuda_config import get_cuda_config, PRESET_2B_CUDA
from streaming_vlm.inference.inference import streaming_inference

# Quick preset
config = get_cuda_config(model_size="2b", video_path="match.mp4")
results = streaming_inference(config)

# Or use presets directly
print(PRESET_2B_CUDA)
# {'model_path': 'Qwen/Qwen3-VL-2B-Instruct', 'attn_implementation': 'flash_attention_2', ...}

Architecture Overview

StreamingVLM maintains a compact KV cache for infinite video streams, built from four pieces (a sketch of the eviction policy follows this list):

  • Attention Sinks (512 tokens): Stabilize attention over long sequences
  • Visual Window (16–24s): Recent video frames kept in KV cache
  • Text Window (512–768 tokens): Recent text context maintained
  • Contiguous RoPE: Keeps position indices bounded for infinite streams
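
A minimal sketch of the eviction policy these pieces imply: keep the sinks, keep the recent window, drop the middle. The function name and shapes here are illustrative; the port's actual logic lives in streaming_vlm/inference/generate/streaming_cache.py:

import torch

def prune_kv(keys, values, sink_len=512, window_len=768):
    # keys/values: (batch, heads, seq_len, head_dim)
    seq_len = keys.shape[2]
    if seq_len <= sink_len + window_len:
        return keys, values  # nothing to evict yet
    # Always retain the first sink_len tokens (attention sinks) ...
    sink_k, sink_v = keys[:, :, :sink_len], values[:, :, :sink_len]
    # ... plus the trailing window_len tokens (recent context); contiguous
    # RoPE then re-indexes positions so they stay bounded after eviction.
    recent_k, recent_v = keys[:, :, -window_len:], values[:, :, -window_len:]
    return (torch.cat([sink_k, recent_k], dim=2),
            torch.cat([sink_v, recent_v], dim=2))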

2B vs 4B Architecture (same vision encoder!)

Component        2B                                4B
Vision Encoder   24 layers, hidden 1024, patch 16  Same as 2B
DeepStack        Layers [5, 11, 17]                Same as 2B
LM Hidden        2048                              2560
LM Layers        28                                36
Attention Heads  16                                32
KV Heads         8                                 8
Parameters       2.1B                              4.4B
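
One practical consequence of the shared 8 KV heads: per-token KV-cache cost is easy to estimate. The helper below assumes bf16 (2 bytes per element) and a head_dim of 128, which is common in the Qwen3 family but should be confirmed against each checkpoint's config.json:

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values; bf16 = 2 bytes per element
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

print(kv_bytes_per_token(28, 8, 128) / 1024, "KiB/token")  # 2B: 112.0
print(kv_bytes_per_token(36, 8, 128) / 1024, "KiB/token")  # 4B: 144.0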

Training (SFT)

# Stage 1: Streaming pattern training (update MODEL_PATH in script for 2B)
bash scripts/sft_stage_1.sh

# Stage 2: Fine-grained annealing
bash scripts/sft_stage_2.sh

Attention Implementation Guide

Implementation     Platform  Install                 When to use
flash_attention_2  CUDA      pip install flash-attn  Default for NVIDIA (fastest)
sdpa               Any       None                    Default for ROCm / fallback for CUDA
eager              Any       None                    Debugging only (slowest)
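
All three backends are selected the same way, through the attn_implementation argument of from_pretrained. A hedged example; AutoModelForImageTextToText is assumed to be the auto class your transformers build maps Qwen3-VL to:

import torch
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # or "flash_attention_2" / "eager"
)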

CUDA Tips

  • Flash Attention 2 gives a 2–3× speedup over SDPA on long sequences (a runtime fallback helper is sketched after this list)
  • The 2B model fits in ~6 GB VRAM (bf16) and runs on an RTX 3060
  • Use --fps 4 with the 2B model for higher temporal resolution; it is fast enough to keep up
  • Multi-GPU: each model fits on a single GPU, so use data parallelism for batch processing
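
When one script has to run on both NVIDIA and AMD boxes, a small helper can pick the backend at runtime (illustrative; not part of this repo):

def pick_attn_implementation():
    # flash-attn imports only where the wheel was actually built/installed
    try:
        import flash_attn  # noqa: F401
        return "flash_attention_2"
    except ImportError:
        return "sdpa"  # safe default everywhere, including ROCm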

ROCm Tips

  • SDPA on ROCm auto-selects the best kernel (Flash or Memory-Efficient)
  • Optional: build flash-attn with the CK backend for maximum performance. Set PYTORCH_ROCM_ARCH for your GPU first (gfx942 for MI300X, gfx90a for MI250X):
    export PYTORCH_ROCM_ARCH=gfx942  # or gfx90a for MI250X
    git clone https://github.com/Dao-AILab/flash-attention.git
    cd flash-attention && pip install .

Project Structure

streaming-vlm-qwen3-rocm/
├── README.md                       # This file
├── MIGRATION_GUIDE.md              # Detailed diff from original
├── requirements.txt
├── setup.py
├── train.py                        # SFT training entry point
├── test_imports.py                 # Unit tests (no GPU needed)
├── scripts/
│   ├── sft_stage_1.sh
│   ├── sft_stage_2.sh
│   └── zero3.json
└── streaming_vlm/
    ├── cuda_config.py              # CUDA/ROCm presets for 2B & 4B
    ├── inference/
    │   ├── inference.py            # Main streaming loop
    │   ├── streaming_args.py       # Streaming config dataclass
    │   ├── generate/
    │   │   └── streaming_cache.py  # KV cache with pruning
    │   └── qwen3/                  # Qwen3-VL specific patches
    │       ├── patch_model.py      # Apply all monkey-patches
    │       ├── language_forward.py # SDPA text attention
    │       ├── vision_forward.py   # Chunked SDPA vision
    │       ├── model_forward.py    # Top-level forwards
    │       └── pos_emb.py          # Contiguous RoPE
    └── utils/
        └── get_qwen_range.py       # Token ID utilities

Citation

@article{xu2025streamingvlm,
  title={StreamingVLM: Real-Time Understanding for Infinite Video Streams},
  author={Xu, Jiayi and Xiao, Guangxuan and others},
  journal={arXiv preprint arXiv:2510.09608},
  year={2025}
}

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "s23deepak/streaming-vlm-qwen3-rocm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

Qwen3-VL is a vision-language model, so for actual inference replace AutoModelForCausalLM with the appropriate auto class (e.g. AutoModelForImageTextToText), or use the streaming entry points above.
