uAI-NEXUS-MedVLM-1.0a-7B-RL

Accepted at CVPR 2026 πŸŽ‰

Base Model: Qwen2.5-VL-7B-Instruct

uAI-NEXUS-MedVLM-1.0a-7B-RL is a medical-video understanding model fine-tuned from Qwen2.5-VL-7B-Instruct. It is the 7B-RL member of the uAI-NEXUS-MedVLM 1.0 family (variant a = Qwen2.5-VL base; variants b / c use Qwen3-VL-4B and Qwen3.5-4B respectively). Training uses a two-stage pipeline:

  1. Supervised Fine-Tuning (SFT) on medical video QA data.
  2. Group Relative Policy Optimization (GRPO) with task-specific rewards for temporal precision and clinical semantics.

It achieves state-of-the-art performance on medical video understanding across temporal action localization, spatiotemporal grounding, video summarization, region captioning, and surgical skill/CVS assessment.

Model Details

  • Architecture: Qwen2.5-VL (7B parameters) β€” video + text
  • Base Model: Qwen/Qwen2.5-VL-7B-Instruct
  • Training: SFT β†’ GRPO
  • Domain: Medical and surgical video understanding
  • License: Apache 2.0

Supported Tasks

The model handles 8 medical video understanding tasks (11 variants):

  • Temporal Understanding: Temporal Action Localization (TAL), Spatiotemporal Grounding (STG), Next Action Prediction
  • Captioning: Dense Captioning (GPT / Gemini), Video Summary (GPT / Gemini), Region Caption (GPT / Gemini)
  • Assessment: Skill Assessment, CVS (Critical View of Safety)

Training Data

Trained on 51,505 balanced video-instruction pairs (the MedVidBench Standard split), spanning 8 source datasets: AVOS, CholecT50, CholecTrack20, Cholec80-CVS, CoPESD, EgoSurgery, JIGSAWS, NurViD.

Stage 2 (GRPO) uses task-balanced subsets of the Standard split (detailed in the paper).

Training Details

Stage 1 β€” Supervised Fine-Tuning

  • Objective: Learn medical video understanding from human-annotated QA pairs.
  • Optimizer: AdamW with linear learning-rate schedule.

Stage 2 β€” Group Relative Policy Optimization (GRPO)

  • Objective: Improve temporal precision and clinical semantic quality with RL.
  • Reward functions:
    • TAL / STG: Logistic-normalized IoU (dataset-fair, IQR-based).
    • Video Summary / Region Caption: Semantic similarity (SentenceBERT).
    • Next Action: Exact-match reward.
    • Skill / CVS Assessment: Score-based reward.
  • Framework: EasyR1 (built on verl by ByteDance).
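The temporal rewards above can be sketched in a few lines. The following is an illustrative reimplementation, not the authors' code: `midpoint` and `scale` are hypothetical placeholders standing in for the per-dataset, IQR-derived normalization constants described in the paper.

```python
import math

def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def logistic_normalized_reward(pred, gt, midpoint=0.5, scale=0.1):
    """Map raw IoU through a logistic curve so rewards are comparable
    across datasets of different difficulty; in the paper, midpoint and
    scale are set from IQR statistics per source dataset (placeholder
    values here)."""
    iou = temporal_iou(pred, gt)
    return 1.0 / (1.0 + math.exp(-(iou - midpoint) / scale))
```

The logistic squashing keeps gradients informative in the mid-IoU range while saturating rewards for near-perfect or near-zero overlap.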

Usage

Install

pip install transformers accelerate torch pillow qwen-vl-utils

Inference with Transformers

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "UII-AI/uAI-NEXUS-MedVLM-1.0a-7B-RL",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("UII-AI/uAI-NEXUS-MedVLM-1.0a-7B-RL")

video_frames = ["frame_0001.jpg", "frame_0002.jpg", "frame_0003.jpg"]  # list of frame paths

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": video_frames},
        {"type": "text", "text": "When does the surgeon grasp the gallbladder? Provide start and end times in seconds."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
    generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
    response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
# Example: "The surgeon grasps the gallbladder from 45.2 to 58.7 seconds."
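For temporal tasks, the free-text answer usually needs to be parsed back into numeric timestamps before scoring. A small regex-based helper (illustrative only; not part of the model's API) might look like:

```python
import re

def parse_time_span(text):
    """Extract the first 'X to Y seconds'-style span from a model
    response as (start, end) floats; returns None if no span is found."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*(?:to|-)\s*(\d+(?:\.\d+)?)\s*seconds?", text)
    if m is None:
        return None
    start, end = float(m.group(1)), float(m.group(2))
    return (start, end) if start <= end else (end, start)
```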

Batch Inference with VLLM

For full batch inference with correct video frame handling, use the reference pipeline at UII-AI/MedGRPO-Code:

git clone https://github.com/UII-AI/MedGRPO-Code
cd MedGRPO-Code
pip install -r requirements.txt
bash run_inference.sh

Performance

Evaluated on MedVidBench (6,245 test samples across 8 tasks). GRPO consistently improves over the SFT baseline in:

  • Temporal precision for TAL / STG (higher IoU).
  • Semantic quality for video summaries and region captions.
  • Alignment with expert annotations for skill / CVS assessment.

Submit predictions to the MedVidBench Leaderboard to benchmark your own models.

Limitations

  • Domain: Optimized for medical / surgical videos; may not generalize to other domains.
  • Temporal Resolution: Best on videos sampled at 0.1–1.0 FPS.
  • Language: Trained primarily on English medical terminology.
  • Video Length: Optimal for videos of a few minutes; longer videos rely on frame sub-sampling.
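Sub-sampling a long video into the 0.1–1.0 FPS range the model was trained on amounts to picking frame indices at a fixed stride. A minimal sketch, assuming you already know the source FPS and frame count (function name and `max_frames` cap are illustrative, not from the reference pipeline):

```python
def subsample_indices(n_frames, src_fps, target_fps=1.0, max_frames=64):
    """Pick frame indices so the sampled clip plays at ~target_fps,
    capping at max_frames for long videos (evenly re-spaced if needed)."""
    stride = max(1, round(src_fps / target_fps))
    indices = list(range(0, n_frames, stride))
    if len(indices) > max_frames:
        # re-space evenly across the whole video instead of truncating the tail
        step = (len(indices) - 1) / (max_frames - 1)
        indices = [indices[round(i * step)] for i in range(max_frames)]
    return indices
```

The resulting index list can be used to select the frame paths passed as the `"video"` entry in the chat messages above.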

License

Released under the Apache 2.0 License.

Citation

If you use this model or the MedVidBench benchmark, please cite:

@inproceedings{su2026medgrpo,
  title     = {{MedGRPO}: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
  author    = {Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and
               Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and
               Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

Contact

Open an issue on the GitHub repository.
