StreamingVLM – Qwen3-VL (2B & 4B) for CUDA + ROCm

PitchAI: Your AI Broadcast Companion – real-time AI-powered football commentary with live tactical vision.

This port of StreamingVLM (arXiv:2510.09608) is adapted for:

  • Qwen3-VL-2B-Instruct ⚡ (lightweight, runs on consumer GPUs)
  • Qwen3-VL-4B-Instruct 🏋️ (higher-quality commentary)
  • NVIDIA CUDA with flash_attention_2 (maximum throughput)
  • AMD ROCm with sdpa (no flash-attn build needed)
  • Latest Transformers (4.57.0+ from source)

Supported Configurations

Model        GPU            Attention          VRAM    Speed    Use Case
Qwen3-VL-2B  RTX 3090/4090  flash_attention_2  ~6 GB   ~12 FPS  Real-time on consumer HW
Qwen3-VL-2B  MI250X/MI300X  sdpa               ~6 GB   ~8 FPS   ROCm deployment
Qwen3-VL-4B  A100/H100      flash_attention_2  ~12 GB  ~8 FPS   Production quality
Qwen3-VL-4B  MI300X         sdpa               ~12 GB  ~6 FPS   ROCm high quality

Key Changes from Original StreamingVLM

Component       Original                       This Port
Model           Qwen2.5-VL-7B-Instruct         Qwen3-VL-2B/4B-Instruct
Attention       flash_attention_2 (CUDA only)  SDPA + flash_attention_2 (both)
Transformers    4.50.0–4.52.4                  4.57.0+ (from source)
Vision Attn     flash_attn_varlen_func         chunked SDPA
ViT Patch Size  14×14                          16×16
RoPE            mrope (standard)               mrope_interleaved
DeepStack       ❌                             ✅ (layers 5, 11, 17)
Max Context     128K                           256K
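
The vision-attention swap is the most invasive change: flash_attn_varlen_func consumes packed variable-length sequences delimited by cu_seqlens offsets, and this port reproduces that with plain PyTorch SDPA run once per packed chunk. A minimal sketch of the idea, with illustrative names rather than the port's actual signatures (the real code lives in streaming_vlm/inference/qwen3/vision_forward.py):

import torch
import torch.nn.functional as F

def chunked_sdpa(q, k, v, cu_seqlens):
    # q, k, v: (total_tokens, num_heads, head_dim), packed the way
    # flash_attn_varlen_func expects; cu_seqlens holds cumulative
    # sequence boundaries, e.g. [0, len_0, len_0 + len_1, ...].
    outputs = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        # Slice one sequence and move heads first: (1, heads, len, dim)
        qi = q[start:end].transpose(0, 1).unsqueeze(0)
        ki = k[start:end].transpose(0, 1).unsqueeze(0)
        vi = v[start:end].transpose(0, 1).unsqueeze(0)
        oi = F.scaled_dot_product_attention(qi, ki, vi)
        # Back to (len, heads, dim) and collect
        outputs.append(oi.squeeze(0).transpose(0, 1))
    return torch.cat(outputs, dim=0)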

Installation

For CUDA (NVIDIA GPUs)

conda create -n streaming_qwen3 python=3.11 -y
conda activate streaming_qwen3

# PyTorch with CUDA
pip install torch torchvision

# Transformers from source (Qwen3-VL requires 4.57.0+)
pip install git+https://github.com/huggingface/transformers

# Flash Attention 2 (recommended for CUDA)
pip install flash-attn --no-build-isolation

# Other deps
pip install -r requirements.txt
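
A quick sanity check after installation (optional, not part of the repo):

import torch
import transformers

print(torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)  # Qwen3-VL needs 4.57.0+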

For ROCm (AMD GPUs)

conda create -n streaming_qwen3 python=3.11 -y
conda activate streaming_qwen3

# PyTorch with ROCm
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2

# Transformers from source
pip install git+https://github.com/huggingface/transformers

# No flash-attn needed! SDPA works out of the box.
pip install -r requirements.txt
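
The same style of sanity check works on ROCm; note that ROCm devices are exposed through the torch.cuda namespace:

import torch

print(torch.__version__)          # ROCm wheels report e.g. "2.x.x+rocm6.2"
print(torch.cuda.is_available())  # True on a working ROCm install
print(torch.version.hip)          # HIP version string on ROCm, None on CUDA builds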

Quick Start

🚀 2B Model on CUDA (fastest, consumer-GPU friendly)

python -m streaming_vlm.inference.inference \
    --model_path Qwen/Qwen3-VL-2B-Instruct \
    --video_path /path/to/football_match.mp4 \
    --output_path ./commentary.vtt \
    --fps 4 \
    --attn_implementation flash_attention_2 \
    --window_size 24 \
    --text_sliding_window 768

πŸ‹οΈ 4B Model on CUDA (best quality)

python -m streaming_vlm.inference.inference \
    --model_path Qwen/Qwen3-VL-4B-Instruct \
    --video_path /path/to/football_match.mp4 \
    --output_path ./commentary.vtt \
    --fps 2 \
    --attn_implementation flash_attention_2

🔴 Any model on ROCm (no flash-attn needed)

python -m streaming_vlm.inference.inference \
    --model_path Qwen/Qwen3-VL-2B-Instruct \
    --video_path /path/to/match.mp4 \
    --output_path ./commentary.vtt \
    --fps 2 \
    --attn_implementation sdpa

💬 Interactive Q&A Mode

python -m streaming_vlm.inference.inference \
    --model_path Qwen/Qwen3-VL-2B-Instruct \
    --video_path /path/to/match.mp4 \
    --interactive \
    --attn_implementation flash_attention_2

Python API

from streaming_vlm.cuda_config import get_cuda_config, PRESET_2B_CUDA
from streaming_vlm.inference.inference import streaming_inference

# Quick preset
config = get_cuda_config(model_size="2b", video_path="match.mp4")
results = streaming_inference(config)

# Or use presets directly
print(PRESET_2B_CUDA)
# {'model_path': 'Qwen/Qwen3-VL-2B-Instruct', 'attn_implementation': 'flash_attention_2', ...}

Architecture Overview

StreamingVLM maintains a compact KV cache for infinite video streams, built from four pieces (a sketch of the eviction policy follows this list):

  • Attention Sinks (512 tokens): Stabilize attention over long sequences
  • Visual Window (16–24s): Recent video frames kept in KV cache
  • Text Window (512–768 tokens): Recent text context maintained
  • Contiguous RoPE: Keeps position indices bounded for infinite streams
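
A minimal sketch of the eviction policy these pieces imply: keep the sinks, keep the recent window, drop the middle. The function name and shapes here are illustrative; the port's actual logic lives in streaming_vlm/inference/generate/streaming_cache.py:

import torch

def prune_kv(keys, values, sink_len=512, window_len=768):
    # keys/values: (batch, heads, seq_len, head_dim)
    seq_len = keys.shape[2]
    if seq_len <= sink_len + window_len:
        return keys, values  # nothing to evict yet
    # Always retain the first sink_len tokens (attention sinks) ...
    sink_k, sink_v = keys[:, :, :sink_len], values[:, :, :sink_len]
    # ... plus the trailing window_len tokens (recent context); contiguous
    # RoPE then re-indexes positions so they stay bounded after eviction.
    recent_k, recent_v = keys[:, :, -window_len:], values[:, :, -window_len:]
    return (torch.cat([sink_k, recent_k], dim=2),
            torch.cat([sink_v, recent_v], dim=2))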

2B vs 4B Architecture (same vision encoder!)

Component        2B                                4B
Vision Encoder   24 layers, hidden 1024, patch 16  Same as 2B
DeepStack        Layers [5, 11, 17]                Same as 2B
LM Hidden        2048                              2560
LM Layers        28                                36
Attention Heads  16                                32
KV Heads         8                                 8
Parameters       2.1B                              4.4B
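
One practical consequence of the shared 8 KV heads: per-token KV-cache cost is easy to estimate. The helper below assumes bf16 (2 bytes per element) and a head_dim of 128, which is common in the Qwen3 family but should be confirmed against each checkpoint's config.json:

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values; bf16 = 2 bytes per element
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

print(kv_bytes_per_token(28, 8, 128) / 1024, "KiB/token")  # 2B: 112.0
print(kv_bytes_per_token(36, 8, 128) / 1024, "KiB/token")  # 4B: 144.0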

Training (SFT)

# Stage 1: Streaming pattern training (update MODEL_PATH in script for 2B)
bash scripts/sft_stage_1.sh

# Stage 2: Fine-grained annealing
bash scripts/sft_stage_2.sh

Attention Implementation Guide

Implementation     Platform  Install                 When to use
flash_attention_2  CUDA      pip install flash-attn  Default for NVIDIA (fastest)
sdpa               Any       None                    Default for ROCm / fallback for CUDA
eager              Any       None                    Debugging only (slowest)
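
All three backends are selected the same way, through the attn_implementation argument of from_pretrained. A hedged example; AutoModelForImageTextToText is assumed to be the auto class your transformers build maps Qwen3-VL to:

import torch
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # or "flash_attention_2" / "eager"
)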

CUDA Tips

  • Flash Attention 2 gives a 2–3× speedup over SDPA on long sequences (a runtime fallback helper is sketched after this list)
  • The 2B model fits in ~6 GB VRAM (bf16) and runs on an RTX 3060
  • Use --fps 4 with the 2B model for higher temporal resolution; it is fast enough to keep up
  • Multi-GPU: each model fits on a single GPU, so use data parallelism for batch processing
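
When one script has to run on both NVIDIA and AMD boxes, a small helper can pick the backend at runtime (illustrative; not part of this repo):

def pick_attn_implementation():
    # flash-attn imports only where the wheel was actually built/installed
    try:
        import flash_attn  # noqa: F401
        return "flash_attention_2"
    except ImportError:
        return "sdpa"  # safe default everywhere, including ROCm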

ROCm Tips

  • SDPA on ROCm auto-selects the best kernel (Flash or Memory-Efficient)
  • Optional: build flash-attn with the CK backend for maximum performance. Set PYTORCH_ROCM_ARCH for your GPU first (gfx942 for MI300X, gfx90a for MI250X):
    export PYTORCH_ROCM_ARCH=gfx942  # or gfx90a for MI250X
    git clone https://github.com/Dao-AILab/flash-attention.git
    cd flash-attention && pip install .

Project Structure

streaming-vlm-qwen3-rocm/
├── README.md                       # This file
├── MIGRATION_GUIDE.md              # Detailed diff from original
├── requirements.txt
├── setup.py
├── train.py                        # SFT training entry point
├── test_imports.py                 # Unit tests (no GPU needed)
├── scripts/
│   ├── sft_stage_1.sh
│   ├── sft_stage_2.sh
│   └── zero3.json
└── streaming_vlm/
    ├── cuda_config.py              # CUDA/ROCm presets for 2B & 4B
    ├── inference/
    │   ├── inference.py            # Main streaming loop
    │   ├── streaming_args.py       # Streaming config dataclass
    │   ├── generate/
    │   │   └── streaming_cache.py  # KV cache with pruning
    │   └── qwen3/                  # Qwen3-VL specific patches
    │       ├── patch_model.py      # Apply all monkey-patches
    │       ├── language_forward.py # SDPA text attention
    │       ├── vision_forward.py   # Chunked SDPA vision
    │       ├── model_forward.py    # Top-level forwards
    │       └── pos_emb.py          # Contiguous RoPE
    └── utils/
        └── get_qwen_range.py       # Token ID utilities

Citation

@article{xu2025streamingvlm,
  title={StreamingVLM: Real-Time Understanding for Infinite Video Streams},
  author={Xu, Jiayi and Xiao, Guangxuan and others},
  journal={arXiv preprint arXiv:2510.09608},
  year={2025}
}

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "s23deepak/streaming-vlm-qwen3-rocm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

Qwen3-VL is a vision-language model, so for actual inference replace AutoModelForCausalLM with the appropriate auto class (e.g. AutoModelForImageTextToText), or use the streaming entry points above.
