uAI-NEXUS-MedVLM-1.0a-7B-RL

Accepted at CVPR 2026 πŸŽ‰

Base Model: Qwen2.5-VL-7B-Instruct

uAI-NEXUS-MedVLM-1.0a-7B-RL is a medical-video understanding model fine-tuned from Qwen2.5-VL-7B-Instruct. It is the 7B-RL member of the uAI-NEXUS-MedVLM 1.0 family (variant a = Qwen2.5-VL base; variants b / c use Qwen3-VL-4B and Qwen3.5-4B respectively). Training uses a two-stage pipeline:

  1. Supervised Fine-Tuning (SFT) on medical video QA data.
  2. Group Relative Policy Optimization (GRPO) with task-specific rewards for temporal precision and clinical semantics.

It achieves state-of-the-art performance on medical video understanding across temporal action localization, spatiotemporal grounding, video summarization, region captioning, and surgical skill/CVS assessment.

Model Details

  • Architecture: Qwen2.5-VL (7B parameters) β€” video + text
  • Base Model: Qwen/Qwen2.5-VL-7B-Instruct
  • Training: SFT β†’ GRPO
  • Domain: Medical and surgical video understanding
  • License: Apache 2.0

Supported Tasks

The model handles 8 medical video understanding tasks (11 variants):

  • Temporal Understanding: Temporal Action Localization (TAL), Spatiotemporal Grounding (STG), Next Action Prediction
  • Captioning: Dense Captioning (GPT / Gemini), Video Summary (GPT / Gemini), Region Caption (GPT / Gemini)
  • Assessment: Skill Assessment, CVS (Critical View of Safety)

Training Data

Trained on 51,505 balanced video-instruction pairs (the MedVidBench Standard split), spanning 8 source datasets: AVOS, CholecT50, CholecTrack20, Cholec80-CVS, CoPESD, EgoSurgery, JIGSAWS, NurViD.

Stage 2 (GRPO) uses task-balanced subsets of the Standard split (detailed in the paper).

Training Details

Stage 1 β€” Supervised Fine-Tuning

  • Objective: Learn medical video understanding from human-annotated QA pairs.
  • Optimizer: AdamW with linear learning-rate schedule.

Stage 2 β€” Group Relative Policy Optimization (GRPO)

  • Objective: Improve temporal precision and clinical semantic quality with RL.
  • Reward functions:
    • TAL / STG: Logistic-normalized IoU (dataset-fair, IQR-based).
    • Video Summary / Region Caption: Semantic similarity (SentenceBERT).
    • Next Action: Exact-match reward.
    • Skill / CVS Assessment: Score-based reward.
  • Framework: EasyR1 (built on verl by ByteDance).
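The temporal rewards above can be sketched in a few lines. The following is an illustrative reimplementation, not the authors' code: `midpoint` and `scale` are hypothetical placeholders standing in for the per-dataset, IQR-derived normalization constants described in the paper.

```python
import math

def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def logistic_normalized_reward(pred, gt, midpoint=0.5, scale=0.1):
    """Map raw IoU through a logistic curve so rewards are comparable
    across datasets of different difficulty; in the paper, midpoint and
    scale are set from IQR statistics per source dataset (placeholder
    values here)."""
    iou = temporal_iou(pred, gt)
    return 1.0 / (1.0 + math.exp(-(iou - midpoint) / scale))
```

The logistic squashing keeps gradients informative in the mid-IoU range while saturating rewards for near-perfect or near-zero overlap.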

Usage

Install

pip install transformers accelerate torch pillow qwen-vl-utils

Inference with Transformers

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "UII-AI/uAI-NEXUS-MedVLM-1.0a-7B-RL",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("UII-AI/uAI-NEXUS-MedVLM-1.0a-7B-RL")

video_frames = ["frame_0001.jpg", "frame_0002.jpg", "frame_0003.jpg"]  # list of frame paths

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": video_frames},
        {"type": "text", "text": "When does the surgeon grasp the gallbladder? Provide start and end times in seconds."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
    generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
    response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
# Example: "The surgeon grasps the gallbladder from 45.2 to 58.7 seconds."
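For temporal tasks, the free-text answer usually needs to be parsed back into numeric timestamps before scoring. A small regex-based helper (illustrative only; not part of the model's API) might look like:

```python
import re

def parse_time_span(text):
    """Extract the first 'X to Y seconds'-style span from a model
    response as (start, end) floats; returns None if no span is found."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*(?:to|-)\s*(\d+(?:\.\d+)?)\s*seconds?", text)
    if m is None:
        return None
    start, end = float(m.group(1)), float(m.group(2))
    return (start, end) if start <= end else (end, start)
```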

Batch Inference with VLLM

For full batch inference with correct video frame handling, use the reference pipeline at UII-AI/MedGRPO-Code:

git clone https://github.com/UII-AI/MedGRPO-Code
cd MedGRPO-Code
pip install -r requirements.txt
bash run_inference.sh

Performance

Evaluated on MedVidBench (6,245 test samples across 8 tasks). GRPO consistently improves over the SFT baseline in:

  • Temporal precision for TAL / STG (higher IoU).
  • Semantic quality for video summaries and region captions.
  • Alignment with expert annotations for skill / CVS assessment.

Submit predictions to the MedVidBench Leaderboard to benchmark your own models.

Limitations

  • Domain: Optimized for medical / surgical videos; may not generalize to other domains.
  • Temporal Resolution: Best on videos sampled at 0.1–1.0 FPS.
  • Language: Trained primarily on English medical terminology.
  • Video Length: Optimal for videos of a few minutes; longer videos rely on frame sub-sampling.
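Sub-sampling a long video into the 0.1–1.0 FPS range the model was trained on amounts to picking frame indices at a fixed stride. A minimal sketch, assuming you already know the source FPS and frame count (function name and `max_frames` cap are illustrative, not from the reference pipeline):

```python
def subsample_indices(n_frames, src_fps, target_fps=1.0, max_frames=64):
    """Pick frame indices so the sampled clip plays at ~target_fps,
    capping at max_frames for long videos (evenly re-spaced if needed)."""
    stride = max(1, round(src_fps / target_fps))
    indices = list(range(0, n_frames, stride))
    if len(indices) > max_frames:
        # re-space evenly across the whole video instead of truncating the tail
        step = (len(indices) - 1) / (max_frames - 1)
        indices = [indices[round(i * step)] for i in range(max_frames)]
    return indices
```

The resulting index list can be used to select the frame paths passed as the `"video"` entry in the chat messages above.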

License

Released under the Apache 2.0 License.

Citation

If you use this model or the MedVidBench benchmark, please cite:

@inproceedings{su2026medgrpo,
  title     = {{MedGRPO}: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
  author    = {Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and
               Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and
               Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

Contact

Open an issue on the GitHub repository.
