# Medical Prescription OCR — Qwen2.5-VL-3B Fine-tuned

## Model Details

### Model Description
A vision-language model fine-tuned to extract structured information from handwritten and printed Indian medical prescriptions. Given a prescription image, it returns a JSON object containing doctor details, patient details, medications (with dosage, frequency, duration), diagnosis, and notes.
- Developed by: Kushagra Wadhwa
- Model type: Vision-Language Model (VLM) — Image-Text to JSON
- Language(s): English, Hindi (mixed-language Indian prescriptions)
- License: Apache 2.0
- Finetuned from model: Qwen/Qwen2.5-VL-3B-Instruct
## Uses

### Direct Use
Extract structured data from medical prescription images in healthcare applications, EMR (Electronic Medical Record) systems, or patient monitoring dashboards. Input a prescription image, receive a structured JSON output.
### Downstream Use
- Automated pharmacy data entry
- AI-powered patient monitoring systems
- Hospital EMR digitisation pipelines
- Medical audit and compliance tooling
### Out-of-Scope Use
- Diagnosing patients or making clinical decisions — this model reads and structures text, it does not provide medical advice
- Prescriptions from non-Indian contexts (the training data is India-specific; handwriting styles and drug naming conventions may differ)
- Real-time critical care decisions
## Bias, Risks, and Limitations
- Trained primarily on Indian prescription formats; performance may degrade on prescriptions from other regions
- Handwriting quality significantly affects accuracy — illegible prescriptions will produce empty or incomplete fields
- Drug names may be partially or incorrectly transcribed for unusual handwriting; always verify medication details before clinical use
- Hindi/mixed-language prescriptions may have lower accuracy than purely English ones
- Not a substitute for a qualified pharmacist or clinician review
### Recommendations
Always have a qualified healthcare professional verify the model's output before acting on it. Do not use this model as the sole source of truth for medication dispensing or clinical workflows.
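As a guard rail, downstream code can flag extractions with missing critical fields for pharmacist review before anything is dispensed. This is a hypothetical helper, not part of the released model — the field names follow the schema used in the quickstart prompt:

```python
# Fields every medication entry should carry before dispensing (assumed policy)
CRITICAL_MED_FIELDS = ("drug_name", "dosage", "frequency")

def needs_review(extraction: dict) -> bool:
    """Flag an extraction for manual review if no medications were found,
    or if any medication entry is missing a critical field."""
    meds = extraction.get("medications", [])
    if not meds:
        return True
    return any(not med.get(field) for med in meds for field in CRITICAL_MED_FIELDS)
```

Which fields count as "critical" is a policy decision for the deploying team; the tuple above is only an illustration.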
## How to Get Started with the Model
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_id = "KushagraWadhwa/medical-prescription-ocr-india"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(model_id)

SYSTEM_PROMPT = """You are a medical prescription OCR assistant.
Extract all information from the prescription image and return it as JSON.
Return ONLY valid JSON, no explanation text.
Use this exact schema:
{"doctor_name": "", "clinic_name": "", "patient_name": "", "patient_age": "",
"date": "", "medications": [{"drug_name": "", "dosage": "", "frequency": "",
"duration": "", "instructions": ""}], "diagnosis": "", "notes": ""}
If a field is not visible, use empty string."""

image = Image.open("prescription.jpg").convert("RGB")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Extract all prescription information from this image and return as JSON."},
    ]},
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Use model.device rather than a hard-coded "cuda" so this also works
# when device_map="auto" places the model elsewhere
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt
generated = outputs[0][inputs["input_ids"].shape[1]:]
print(processor.decode(generated, skip_special_tokens=True))
```
Example output:

```json
{
  "doctor_name": "Dr. R. K. Sharma",
  "clinic_name": "Sharma Clinic",
  "patient_name": "Suresh Patel",
  "patient_age": "52",
  "date": "12/03/24",
  "medications": [
    {
      "drug_name": "Tab. Metformin 500mg",
      "dosage": "500mg",
      "frequency": "1-0-1",
      "duration": "30 days",
      "instructions": "After food"
    },
    {
      "drug_name": "Tab. Amlodipine 5mg",
      "dosage": "5mg",
      "frequency": "OD",
      "duration": "30 days",
      "instructions": "Morning"
    }
  ],
  "diagnosis": "T2DM, HTN",
  "notes": "Review after 1 month. RBS fasting."
}
```
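In practice the model may wrap its JSON in markdown fences or add stray text despite the prompt. A small helper (hypothetical, not part of the released code) can extract and parse the payload robustly:

```python
import json
import re

def parse_prescription_json(raw: str) -> dict:
    """Extract the JSON object from model output, tolerating markdown fences."""
    # Strip optional ```json ... ``` fences around the payload
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span, if any
        match = re.search(r"\{.*\}", cleaned, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```

If parsing still fails, treating the prescription as unreadable and routing it to manual entry is safer than guessing.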
## Training Details

### Training Data

- Dataset: `naazimsnh02/medocr-vision-dataset`
- Split: 1,969 training samples / 246 validation samples
- Content: Scanned and photographed Indian medical prescriptions (handwritten and printed), labelled with structured JSON ground truth
### Preprocessing

- Images converted to RGB
- Labels parsed from `<s_ocr>` tag format into structured JSON with fields: `doctor_name`, `clinic_name`, `patient_name`, `patient_age`, `date`, `medications`, `diagnosis`, `notes`
- Medications parsed into per-drug entries: `drug_name`, `dosage`, `frequency`, `duration`, `instructions`
- Sequences truncated to a maximum length of 512 tokens
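The exact `<s_ocr>` label format isn't shown in this card; a minimal sketch of the parsing step, assuming the ground truth is a JSON string wrapped in `<s_ocr>...</s_ocr>` tags, might look like:

```python
import json
import re

EXPECTED_FIELDS = ["doctor_name", "clinic_name", "patient_name", "patient_age",
                   "date", "medications", "diagnosis", "notes"]

def parse_label(raw: str) -> dict:
    """Hypothetical label parser: pulls the JSON payload out of <s_ocr> tags
    and fills in any missing fields with empty defaults."""
    match = re.search(r"<s_ocr>(.*?)</s_ocr>", raw, re.DOTALL)
    payload = match.group(1) if match else raw
    record = json.loads(payload)
    # Medications default to an empty list; everything else to an empty string
    return {f: record.get(f, [] if f == "medications" else "")
            for f in EXPECTED_FIELDS}
```

This mirrors the prompt's convention that fields absent from the image are empty strings.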
### Training Hyperparameters
| Parameter | Value |
|---|---|
| Training regime | bf16 mixed precision |
| Epochs | 3 |
| Per-device train batch size | 1 |
| Gradient accumulation steps | 8 (effective batch size: 8) |
| Learning rate | 2e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.05 |
| Max sequence length | 512 |
| Optimizer | AdamW |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj |
| Quantization | 4-bit NF4 (bitsandbytes) with double quantization |
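The quantization and LoRA rows of the table map directly onto `bitsandbytes` and `peft` configuration objects. A sketch, with values taken from the table and everything else assumed:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization, per the table above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter configuration matching the table
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```

These objects would be passed to `from_pretrained(..., quantization_config=bnb_config)` and to the trainer, respectively; the actual training script may differ.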
### Speeds, Sizes, Times
- Hardware: NVIDIA Tesla T4 × 2
- Cloud Provider: Kaggle (Google Cloud)
- Training time: ~3 hours (3 epochs, 1,969 samples)
## Evaluation

### Testing Data

246 samples from the validation split of `naazimsnh02/medocr-vision-dataset`.

### Factors
Performance varies by:
- Handwriting legibility (printed > clear handwritten > cursive/illegible)
- Language mix (English-only > mixed Hindi-English)
- Number of medications on the prescription
### Metrics
- Primary: Evaluation loss (used for best checkpoint selection)
- Qualitative: JSON field accuracy on held-out prescription samples
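The card doesn't define "JSON field accuracy" formally; one simple interpretation — exact match on top-level scalar fields, via a hypothetical helper — is:

```python
SCALAR_FIELDS = ["doctor_name", "clinic_name", "patient_name",
                 "patient_age", "date", "diagnosis", "notes"]

def field_accuracy(pred: dict, gold: dict, fields=None) -> float:
    """Fraction of top-level scalar fields where the prediction exactly
    matches the ground truth (medications are excluded here, since list
    comparison needs its own alignment logic)."""
    fields = fields or SCALAR_FIELDS
    correct = sum(pred.get(f, "") == gold.get(f, "") for f in fields)
    return correct / len(fields)
```

Fuzzier variants (normalized strings, per-medication edit distance) would give a less brittle picture, especially for handwritten drug names.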
### Results
The model accurately extracts structured fields from both handwritten and printed Indian prescriptions. Fields like `doctor_name`, `clinic_name`, and `date` are extracted reliably. Medication list accuracy depends on handwriting legibility. Hindi/mixed-language content is supported but may have lower accuracy than English-only prescriptions.
## Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: NVIDIA Tesla T4 × 2
- Hours used: ~3
- Cloud Provider: Google (Kaggle)
- Compute Region: US (Kaggle default)
## Technical Specifications

### Model Architecture and Objective
- Base: Qwen2.5-VL-3B-Instruct — 3B parameter multimodal vision-language transformer
- Fine-tuning method: QLoRA (Quantized Low-Rank Adaptation) via PEFT
- Quantization: 4-bit NF4 base weights; LoRA adapters trained in bfloat16
- Objective: Causal language modelling on image+text → structured JSON sequences
### Software

- `transformers`
- `peft`
- `trl` (SFTTrainer / SFTConfig)
- `bitsandbytes`
- `accelerate`
- `datasets`
- `torch` (cu124)
## Model Card Authors
Kushagra Wadhwa
## Model Card Contact