# PitVQA Merged Model

A ready-to-use version of PitVQA with the Stage 4 (Unified) adapter merged into the base weights. No adapter loading is required: just load and run.
## Why Use This Model?
| Feature | This (Merged) | Adapter Version |
|---|---|---|
| Setup complexity | Simple | Requires PEFT |
| Load time | Faster | Slower |
| Flexibility | Single task mode | Switch adapters |
| Best for | Production deployment | Research/experimentation |
## Quick Start
```python
import torch
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the merged model directly - no adapter loading needed
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "mmrech/pitvqa-qwen2vl-merged",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("mmrech/pitvqa-qwen2vl-merged")

# Run inference on a surgical frame
image = Image.open("surgical_frame.jpg")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Point to the suction device in this surgical image."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
# Output: <point x='75.8' y='75.1'>suction device</point>
```
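The `<point>` output can be extracted with a small regex helper. This is a minimal sketch assuming the exact tag format shown above (`<point x='…' y='…'>label</point>`); adjust the pattern if your generations differ.

```python
import re

# Matches the model's point tag, e.g. "<point x='75.8' y='75.1'>suction device</point>"
POINT_RE = re.compile(r"<point x='([\d.]+)' y='([\d.]+)'>([^<]+)</point>")

def parse_point(text):
    """Return (x, y, label) parsed from a point tag, or None if absent."""
    m = POINT_RE.search(text)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2)), m.group(3)

print(parse_point("<point x='75.8' y='75.1'>suction device</point>"))
# (75.8, 75.1, 'suction device')
```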
### With 4-bit Quantization (Lower Memory)
```python
import torch
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "mmrech/pitvqa-qwen2vl-merged",
    quantization_config=bnb_config,
    device_map="auto",
)
```
## Supported Tasks

This merged model supports all tasks from Stage 4 training:
### Point Localization

- **Prompt:** "Point to the suction device in this surgical image."
- **Output:** `<point x='75.8' y='75.1'>suction device</point>`
### Bounding Box Detection

- **Prompt:** "Draw a bounding box around the tumor region."
- **Output:** `<box x1='30' y1='30' x2='70' y2='70'>tumor region</box>`
### Phase Classification

- **Prompt:** "What surgical phase is shown?"
- **Output:** `sellar phase`

Supported phases: `nasal`, `sellar`, `tumor_removal`, `closure`
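Phase outputs can be validated against the four supported labels with a small normalizer. The whitespace/underscore handling below is an assumption; the card does not specify the model's exact output casing.

```python
SUPPORTED_PHASES = {"nasal", "sellar", "tumor_removal", "closure"}

def normalize_phase(text):
    """Map a generated phase string to a canonical label, or None if unrecognized."""
    candidate = text.strip().lower().replace(" ", "_")
    # Drop a trailing "phase" token, e.g. "sellar phase" -> "sellar"
    candidate = candidate.removesuffix("_phase")
    return candidate if candidate in SUPPORTED_PHASES else None

print(normalize_phase("sellar phase"))
# sellar
```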
### Free-form Queries

- **Prompt:** "What instruments are visible in this surgical scene?"
- **Output:** "The image shows a suction device positioned in the surgical field..."
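The `<box>` tag can be parsed similarly to the point tag. The pixel conversion below assumes the coordinates are percentages of image size (0-100), which the examples suggest but the card does not state explicitly.

```python
import re

# Matches e.g. "<box x1='30' y1='30' x2='70' y2='70'>tumor region</box>"
BOX_RE = re.compile(
    r"<box x1='([\d.]+)' y1='([\d.]+)' x2='([\d.]+)' y2='([\d.]+)'>([^<]+)</box>"
)

def parse_box(text, width, height):
    """Parse a box tag and scale assumed percentage coords to pixels."""
    m = BOX_RE.search(text)
    if m is None:
        return None
    x1, y1, x2, y2 = (float(m.group(i)) for i in range(1, 5))
    return (x1 / 100 * width, y1 / 100 * height,
            x2 / 100 * width, y2 / 100 * height, m.group(5))

print(parse_box("<box x1='30' y1='30' x2='70' y2='70'>tumor region</box>", 640, 480))
# (192.0, 144.0, 448.0, 336.0, 'tumor region')
```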
## Model Details

- **Source:** merged from the mmrech/pitvqa-qwen2vl-unified-v2 Stage 4 adapter
- **Base model:** Qwen/Qwen2-VL-2B-Instruct
- **Parameters:** ~2B (full model, adapter merged)
- **Training:** SFT with LoRA, then merged with `merge_and_unload()`
## Related Resources

- **Adapter version:** pitvqa-qwen2vl-unified-v2 (for multi-stage adapter switching)
- **Dataset:** pitvqa-comprehensive-spatial
- **Demo:** HuggingFace Space
## Citation

```bibtex
@misc{pitvqa2026,
  title={PitVQA: Multi-Task Vision-Language Model for Pituitary Surgery},
  author={Matheus Rech},
  year={2026},
  url={https://huggingface.co/mmrech/pitvqa-qwen2vl-merged}
}
```
## License

Apache 2.0