uAI-NEXUS-MedVLM-1.0a-7B-RL
Accepted at CVPR 2026
Base Model: Qwen2.5-VL-7B-Instruct
uAI-NEXUS-MedVLM-1.0a-7B-RL is a medical-video understanding model fine-tuned from Qwen2.5-VL-7B-Instruct. It is the 7B-RL member of the uAI-NEXUS-MedVLM 1.0 family (variant a = Qwen2.5-VL base; variants b / c use Qwen3-VL-4B and Qwen3.5-4B respectively). Training uses a two-stage pipeline:
- Supervised Fine-Tuning (SFT) on medical video QA data.
- Group Relative Policy Optimization (GRPO) with task-specific rewards for temporal precision and clinical semantics.
It achieves state-of-the-art performance on medical video understanding across temporal action localization, spatiotemporal grounding, video summarization, region captioning, and surgical skill/CVS assessment.
- Paper: arXiv:2512.06581
- Project Page: uii-ai.github.io/MedGRPO
- Code: github.com/UII-AI/MedGRPO-Code
- Dataset: UII-AI/MedVidBench
- Leaderboard: UII-AI/MedVidBench-Leaderboard
Model Details
- Architecture: Qwen2.5-VL (7B parameters), video + text input
- Base Model: Qwen/Qwen2.5-VL-7B-Instruct
- Training: SFT followed by GRPO
- Domain: Medical and surgical video understanding
- License: Apache 2.0
Supported Tasks
The model handles 8 medical video understanding tasks (11 variants):
| Task Category | Tasks |
|---|---|
| Temporal Understanding | Temporal Action Localization (TAL), Spatiotemporal Grounding (STG), Next Action Prediction |
| Captioning | Dense Captioning (GPT / Gemini), Video Summary (GPT / Gemini), Region Caption (GPT / Gemini) |
| Assessment | Skill Assessment, CVS (Critical View of Safety) |
Training Data
Trained on 51,505 balanced video-instruction pairs (the MedVidBench Standard split), spanning 8 source datasets: AVOS, CholecT50, CholecTrack20, Cholec80-CVS, CoPESD, EgoSurgery, JIGSAWS, NurViD.
Stage 2 (GRPO) uses task-balanced subsets of the Standard split (detailed in the paper).
Training Details
Stage 1: Supervised Fine-Tuning
- Objective: Learn medical video understanding from human-annotated QA pairs.
- Optimizer: AdamW with linear learning-rate schedule.
Stage 2: Group Relative Policy Optimization (GRPO)
- Objective: Improve temporal precision and clinical semantic quality with RL.
- Reward functions:
- TAL / STG: Logistic-normalized IoU (dataset-fair, IQR-based).
- Video Summary / Region Caption: Semantic similarity (SentenceBERT).
- Next Action: Exact-match reward.
- Skill / CVS Assessment: Score-based reward.
- Framework: EasyR1 (built on verl by ByteDance).
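The exact reward formulations are given in the paper. As a rough illustration of what a logistic-normalized temporal-IoU reward looks like (the constants `mu` and `s` below are hypothetical placeholders standing in for the dataset-fair, IQR-based normalization mentioned above, not the paper's actual values):

```python
import math

def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def logistic_iou_reward(pred, gt, mu=0.5, s=0.1):
    """Map raw IoU through a logistic curve centered at mu with scale s.
    mu and s are illustrative; the paper derives its normalization from
    per-dataset IQR statistics."""
    iou = temporal_iou(pred, gt)
    return 1.0 / (1.0 + math.exp(-(iou - mu) / s))
```

Squashing raw IoU through a logistic keeps the reward in (0, 1) and concentrates gradient signal around the normalization center, so easy and hard datasets contribute comparably during GRPO.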
Usage
Install
```bash
pip install transformers accelerate torch pillow qwen-vl-utils
```
Inference with Transformers
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model in bfloat16 and shard it across available devices
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "UII-AI/uAI-NEXUS-MedVLM-1.0a-7B-RL",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("UII-AI/uAI-NEXUS-MedVLM-1.0a-7B-RL")

video_frames = ["frame_0001.jpg", "frame_0002.jpg", "frame_0003.jpg"]  # list of frame paths
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": video_frames},
        {"type": "text", "text": "When does the surgeon grasp the gallbladder? Provide start and end times in seconds."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, stripping the prompt
generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
# Example: "The surgeon grasps the gallbladder from 45.2 to 58.7 seconds."
```
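For temporal tasks such as TAL, the model answers in free text. A minimal regex-based parser for pulling a (start, end) span out of such an answer, assuming phrasing like the example above (the output format is not guaranteed; treat this as a convenience sketch):

```python
import re

def parse_time_span(answer: str):
    """Extract (start, end) seconds from an answer like
    'The surgeon grasps the gallbladder from 45.2 to 58.7 seconds.'
    Returns None when no span is found. The phrasing assumed here is
    illustrative, not a guaranteed output format of the model."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*(?:to|-|–)\s*(\d+(?:\.\d+)?)", answer)
    if not m:
        return None
    start, end = float(m.group(1)), float(m.group(2))
    return (start, end) if start <= end else (end, start)
```

For benchmark submissions, prefer the reference parsing in the MedGRPO-Code repository so your scores match the leaderboard's evaluation.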
Batch Inference with VLLM
For full batch inference with correct video frame handling, use the reference pipeline at UII-AI/MedGRPO-Code:
```bash
git clone https://github.com/UII-AI/MedGRPO-Code
cd MedGRPO-Code
pip install -r requirements.txt
bash run_inference.sh
```
Performance
Evaluated on MedVidBench (6,245 test samples across 8 tasks). GRPO consistently improves the SFT baseline on:
- Temporal precision for TAL / STG (higher IoU).
- Semantic quality for video summaries and region captions.
- Alignment with expert annotations for skill / CVS assessment.
Submit predictions to the MedVidBench Leaderboard to benchmark your own models.
Limitations
- Domain: Optimized for medical / surgical videos; may not generalize to other domains.
- Temporal Resolution: Best on videos sampled at 0.1–1.0 FPS.
- Language: Trained primarily on English medical terminology.
- Video Length: Optimal for videos of a few minutes; longer videos rely on frame sub-sampling.
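As a sketch of the kind of frame sub-sampling implied above, here is a helper that downsamples a clip to a target FPS and caps the total frame count (the function, its defaults, and the cap are illustrative assumptions, not the model's actual preprocessing):

```python
def sample_frame_indices(num_frames: int, native_fps: float,
                         target_fps: float = 1.0, max_frames: int = 512):
    """Pick frame indices so the effective sampling rate is roughly target_fps.
    If that still yields too many frames, stride uniformly over the clip.
    All defaults are illustrative, not the model's actual configuration."""
    step = max(1, round(native_fps / target_fps))
    indices = list(range(0, num_frames, step))
    if len(indices) > max_frames:
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices
```

For example, a 10-second clip at 30 FPS sampled to 1 FPS keeps every 30th frame; an hour-long surgery at the same setting would exceed the cap and fall back to uniform striding, which is where the long-video limitation above comes from.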
License
Released under the Apache 2.0 License.
Citation
If you use this model or the MedVidBench benchmark, please cite:
```bibtex
@inproceedings{su2026medgrpo,
  title     = {{MedGRPO}: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
  author    = {Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and
               Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and
               Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```
Acknowledgments
- Base model: Qwen2.5-VL-7B-Instruct by Alibaba Cloud.
- Training framework: EasyR1 (built on verl).
Contact
Open an issue on the GitHub repository.