[LoRA Adapter] MedGemma 1.5 4B IT – Cataract Surgical Analysis
This is a LoRA adapter, not a standalone model. It requires the base model
google/medgemma-1.5-4b-it to be loaded separately (see the Usage section below). Accepting Google's Health AI Developer Foundations license is required to access the base model.
This adapter fine-tunes Google's MedGemma 1.5 4B IT for cataract surgery analysis. It was trained on the Cataract-1K dataset (a component of the LMOD benchmark) using a Chain-of-Thought (CoT) approach.
Source code: github.com/b5y/medgemma-impact-challenge
Model Description
This LoRA adapter enhances MedGemma's ability to interpret surgical video frames. Unlike general-purpose models, it is trained to "think before it speaks," providing:
- Thinking Process: An expert-level reasoning trace that analyzes the surgical phase, identifies instrument-anatomy relationships, and assesses safety margins.
- Final Answer: A clear, actionable instruction suitable for a surgical resident.
Output Format
The model generates a structured response:
Thinking Process:
The frame shows the phacoemulsification phase with the ultrasonic tip positioned within the lens nucleus. The corneal incision margins appear well-maintained. Safety margins around the posterior capsule are adequate but require continuous monitoring.
Final Answer:
Maintain steady foot pedal pressure and keep the phaco tip centered within the nuclear material to avoid inadvertent contact with the posterior capsule.
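Downstream code can split a response in this format into its two parts with a small helper. The function below is an illustrative sketch (the name `parse_response` is not part of the adapter); it assumes the "Thinking Process:" / "Final Answer:" layout shown above.

```python
import re

def parse_response(text: str) -> dict:
    """Split a model response into its reasoning trace and final instruction.

    Assumes the 'Thinking Process:' / 'Final Answer:' layout produced by the adapter.
    """
    match = re.search(
        r"Thinking Process:\s*(?P<thinking>.*?)\s*Final Answer:\s*(?P<answer>.*)",
        text,
        flags=re.DOTALL,
    )
    if match is None:
        # Fall back to treating the whole response as the answer.
        return {"thinking": "", "answer": text.strip()}
    return {"thinking": match.group("thinking"), "answer": match.group("answer").strip()}

example = (
    "Thinking Process:\nThe phaco tip is centered in the nucleus.\n\n"
    "Final Answer:\nMaintain current tip position."
)
parsed = parse_response(example)
```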
Training Data
The training data consists of surgical frames from the Cataract-1K dataset. To enable high-quality instruction tuning, reasoning traces were distilled from Qwen3-VL-30B-A3B-Thinking.
- Raw Source Dataset: mehti/LMOD-Cataract-1K – original Cataract-1K frames with bounding box and segmentation annotations from the LMOD benchmark.
- Generated CoT Dataset: mehti/LMOD-Cataract-1K-surgical-analysis-cot – synthetic chain-of-thought reasoning traces and surgical instructions distilled from Qwen3-VL-30B-A3B-Thinking, used directly to fine-tune this adapter.
Training Procedure
The model was fine-tuned using QLoRA (4-bit quantization with LoRA) to ensure efficiency while maintaining performance.
| Parameter | Value |
|---|---|
| Base Model | google/medgemma-1.5-4b-it |
| Fine-tuning Method | QLoRA (4-bit NF4, double quantization) |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 16 |
| LoRA Dropout | 0.05 |
| Target Modules | all-linear |
| Quantization | 4bit-nf4-double_quant |
| Image Resolution | 896 x 896 |
| Optimizer | AdamW (fused) |
| Learning Rate | 2e-4 |
| Scheduler | Linear |
| Warmup Ratio | 0.03 |
| Max Gradient Norm | 0.3 |
| Epochs | 3 (with early stopping) |
| Effective Batch Size | 24 |
| Precision | bfloat16 |
| Cross-Validation | 5-fold GroupKFold (grouped by surgery case) |
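The quantization and LoRA rows of the table correspond to a PEFT/bitsandbytes configuration along these lines. This is a sketch under the assumption that training used the standard `BitsAndBytesConfig` and `LoraConfig` APIs; variable names are illustrative.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization, as listed in the table.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA settings: rank 16, alpha 16, dropout 0.05, applied to all linear layers.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```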
Training Results
Evaluation Loss – All Folds
All five folds show consistent, monotonically decreasing evaluation loss throughout training. By step 200, every fold converges to a final eval loss in the range of 0.19–0.24, demonstrating stable learning without signs of overfitting across data splits. Fold 1 achieves the lowest final eval loss (0.196), while Folds 2 and 4 converge slightly higher (0.241). The tight grouping of curves confirms that the 5-fold GroupKFold split (grouped by surgery case) produces representative and comparable validation sets.
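Grouping the folds by surgery case means all frames from one case land in the same fold, so no case leaks between train and validation sets. A minimal scikit-learn sketch of this split (the frame and case IDs here are synthetic placeholders):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# One entry per frame; the group is the surgery case the frame came from.
frames = np.arange(100)
case_ids = np.repeat(np.arange(20), 5)  # 20 cases, 5 frames each

gkf = GroupKFold(n_splits=5)
splits = list(gkf.split(frames, groups=case_ids))

# No surgery case appears on both sides of any fold.
for train_idx, val_idx in splits:
    assert set(case_ids[train_idx]).isdisjoint(set(case_ids[val_idx]))
```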
Loss Curves – Fold 1
For Fold 1 (the best-performing fold), training loss drops steeply from ~2.4 at initialization and converges near the evaluation loss by step 10. Both train and eval loss then decrease together steadily, with no divergence, indicating no overfitting. The rapid early drop reflects effective adaptation of the base model's visual-language representations to the surgical domain.
Token Accuracy – Fold 1
Token-level accuracy for Fold 1 climbs from ~0.51 at the start to 0.94 by the final step. Train and eval accuracy track each other closely throughout, with eval accuracy slightly above train accuracy in the later steps, a positive indicator of good generalization. This high token accuracy indicates that the model reliably reproduces the structured output format (Thinking Process / Final Answer) as well as medically appropriate content.
Usage
With 🤗 Transformers
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import PeftModel
from PIL import Image
import torch
import requests
# Load the base model and processor
base_model_id = "google/medgemma-1.5-4b-it"
adapter_id = "mehti/medgemma-cataract-surgical-analysis-peft-lora"
processor = AutoProcessor.from_pretrained(base_model_id)
base_model = AutoModelForImageTextToText.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_id)
model.eval()
# Load a surgical frame (replace with your image)
image = Image.open("surgical_frame.jpg").convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {
                "type": "text",
                "text": (
                    "You are an expert ophthalmic surgeon reviewing a frame from a cataract surgery video. "
                    "Analyze the surgical scene and provide a chain-of-thought reasoning followed by a "
                    "clear instruction for a surgical resident.\n\n"
                    "Format your response as:\nThinking Process:\n<your reasoning>\n\nFinal Answer:\n<your instruction>"
                ),
            },
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
decoded = processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(decoded)
With vLLM (OpenAI-compatible server)
First, merge the LoRA adapter into the base weights (recommended for vLLM deployment):
python -c "
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel
import torch
# Load the base model explicitly: PEFT's auto classes do not cover image-text-to-text models.
base = AutoModelForImageTextToText.from_pretrained(
    'google/medgemma-1.5-4b-it',
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, 'mehti/medgemma-cataract-surgical-analysis-peft-lora')
merged = model.merge_and_unload()
merged.save_pretrained('./medgemma-cataract-merged')
AutoProcessor.from_pretrained('google/medgemma-1.5-4b-it').save_pretrained('./medgemma-cataract-merged')
print('Merge complete.')
"
Then serve with vLLM:
vllm serve ./medgemma-cataract-merged \
    --served-model-name medgemma-cataract-merged \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype bfloat16 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90
Query the server using the OpenAI client (with base64-encoded image):
import base64
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# Load and encode your surgical frame
with open("surgical_frame.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
response = client.chat.completions.create(
    model="medgemma-cataract-merged",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
                {
                    "type": "text",
                    "text": (
                        "You are an expert ophthalmic surgeon reviewing a frame from a cataract surgery video. "
                        "Analyze the surgical scene and provide a chain-of-thought reasoning followed by a "
                        "clear instruction for a surgical resident.\n\n"
                        "Format your response as:\nThinking Process:\n<your reasoning>\n\nFinal Answer:\n<your instruction>"
                    ),
                },
            ],
        }
    ],
    max_tokens=512,
    temperature=0.0,
)
print(response.choices[0].message.content)
Intended Use
- Research: For analyzing multimodal medical AI capabilities in surgical domains.
- Education: As a prototype for AI-assisted surgical training systems.
- Limitations: This model is for research purposes only and is not intended for clinical decision-making or direct patient care.
Citations
If you use this model in your research, please cite the following:
@article{medgemma2025,
title={MedGemma Technical Report},
author={Sellergren, Andrew and Kazemzadeh, Sahar and Jaroensri, Tiam and Kiraly, Atilla and Traverse, Madeleine and Kohlberger, Timo and Xu, Shawn and Jamil, Fayaz and Hughes, C{\'\i}an and Lau, Charles and Chen, Justin and Mahvar, Fereshteh and Yatziv, Liron and Chen, Tiffany and Sterling, Bram and others},
journal={arXiv preprint arXiv:2507.05201},
year={2025}
}
@misc{qin2025lmodlargemultimodalophthalmology,
title={LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models},
author={Zhenyue Qin and Yu Yin and Dylan Campbell and Xuansheng Wu and Ke Zou and Yih-Chung Tham and Ninghao Liu and Xiuzhen Zhang and Qingyu Chen},
year={2025},
eprint={2410.01620},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.01620},
}
NOTE: This documentation was generated with Gemini 3 Pro and verified by a human.