Medical Prescription OCR — Qwen2.5-VL-3B Fine-tuned

Model Details

Model Description

A vision-language model fine-tuned to extract structured information from handwritten and printed Indian medical prescriptions. Given a prescription image, it returns a JSON object containing doctor details, patient details, medications (with dosage, frequency, duration), diagnosis, and notes.

  • Developed by: Kushagra Wadhwa
  • Model type: Vision-Language Model (VLM) — Image-Text to JSON
  • Language(s): English, Hindi (mixed-language Indian prescriptions)
  • License: Apache 2.0
  • Finetuned from model: Qwen/Qwen2.5-VL-3B-Instruct

Uses

Direct Use

Extract structured data from medical prescription images in healthcare applications, EMR (Electronic Medical Record) systems, or patient monitoring dashboards. Input a prescription image, receive a structured JSON output.

Downstream Use

  • Automated pharmacy data entry
  • AI-powered patient monitoring systems
  • Hospital EMR digitisation pipelines
  • Medical audit and compliance tooling

Out-of-Scope Use

  • Diagnosing patients or making clinical decisions: the model reads and structures text; it does not provide medical advice
  • Prescriptions from non-Indian contexts (the training data is India-specific; handwriting styles and drug naming conventions may differ)
  • Real-time critical care decisions

Bias, Risks, and Limitations

  • Trained primarily on Indian prescription formats; performance may degrade on prescriptions from other regions
  • Handwriting quality significantly affects accuracy — illegible prescriptions will produce empty or incomplete fields
  • Drug names may be transcribed partially or incorrectly when handwriting is unusual; always verify medication details before clinical use
  • Hindi/mixed-language prescriptions may have lower accuracy than purely English ones
  • Not a substitute for a qualified pharmacist or clinician review

Recommendations

Always have a qualified healthcare professional verify the model's output before acting on it. Do not use this model as the sole source of truth for medication dispensing or clinical workflows.
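
One lightweight safeguard is to validate the model's output against the expected schema before it enters any downstream workflow, and to route incomplete records to a human. The sketch below is illustrative and not part of the model; the field names follow the JSON schema used in the prompt later in this card.

```python
# Illustrative structural check before accepting model output (not part of the model).
REQUIRED_FIELDS = {"doctor_name", "clinic_name", "patient_name", "patient_age",
                   "date", "medications", "diagnosis", "notes"}
MED_FIELDS = {"drug_name", "dosage", "frequency", "duration", "instructions"}

def needs_human_review(record: dict) -> bool:
    """Flag records that are structurally incomplete for pharmacist review."""
    if not REQUIRED_FIELDS.issubset(record):
        return True
    meds = record.get("medications", [])
    if not isinstance(meds, list) or not meds:
        return True  # no medications extracted: likely an illegible prescription
    for med in meds:
        if not MED_FIELDS.issubset(med) or not med.get("drug_name"):
            return True  # a missing or empty drug name must be checked by a human
    return False
```

Stricter checks (date formats, known drug-name lists) can be layered on top, but even this minimal gate prevents silently acting on empty extractions.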


How to Get Started with the Model

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_id = "KushagraWadhwa/medical-prescription-ocr-india"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(model_id)

SYSTEM_PROMPT = """You are a medical prescription OCR assistant.
Extract all information from the prescription image and return it as JSON.
Return ONLY valid JSON, no explanation text.
Use this exact schema:
{"doctor_name": "", "clinic_name": "", "patient_name": "", "patient_age": "",
 "date": "", "medications": [{"drug_name": "", "dosage": "", "frequency": "",
 "duration": "", "instructions": ""}], "diagnosis": "", "notes": ""}
If a field is not visible, use empty string."""

image = Image.open("prescription.jpg").convert("RGB")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text",  "text": "Extract all prescription information from this image and return as JSON."}
    ]}
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

generated = outputs[0][inputs["input_ids"].shape[1]:]
print(processor.decode(generated, skip_special_tokens=True))
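
The system prompt asks for JSON only, but generation can occasionally wrap the object in stray text or produce malformed output. A defensive parse (a sketch of my own, not part of the original example) keeps a pipeline from crashing on such cases:

```python
import json
import re
from typing import Optional

def parse_prescription_json(raw: str) -> Optional[dict]:
    """Extract the first JSON object from model output; return None if unparseable."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # tolerate stray text around the JSON
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```

A `None` result is a natural trigger for retrying generation or escalating the image to manual entry.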

Example output:

{
  "doctor_name": "Dr. R. K. Sharma",
  "clinic_name": "Sharma Clinic",
  "patient_name": "Suresh Patel",
  "patient_age": "52",
  "date": "12/03/24",
  "medications": [
    {
      "drug_name": "Tab. Metformin 500mg",
      "dosage": "500mg",
      "frequency": "1-0-1",
      "duration": "30 days",
      "instructions": "After food"
    },
    {
      "drug_name": "Tab. Amlodipine 5mg",
      "dosage": "5mg",
      "frequency": "OD",
      "duration": "30 days",
      "instructions": "Morning"
    }
  ],
  "diagnosis": "T2DM, HTN",
  "notes": "Review after 1 month. RBS fasting."
}
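
The `frequency` field uses common Indian prescription shorthand: `1-0-1` gives morning-afternoon-night dose counts, and abbreviations such as `OD` (once daily) or `BD` (twice daily) also appear. A small normaliser (a hypothetical helper, not shipped with the model) can map these onto a daily dose count for downstream use:

```python
from typing import Optional

def daily_dose_count(frequency: str) -> Optional[int]:
    """Map common Indian frequency shorthand to doses per day; None if unrecognised."""
    abbreviations = {"OD": 1, "BD": 2, "TDS": 3, "QID": 4}  # once/twice/thrice/four times daily
    freq = frequency.strip().upper()
    if freq in abbreviations:
        return abbreviations[freq]
    parts = freq.split("-")  # e.g. "1-0-1" -> morning, afternoon, night
    if len(parts) >= 2 and all(p.isdigit() for p in parts):
        return sum(int(p) for p in parts)
    return None  # unrecognised shorthand should go to human review, not be guessed
```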

Training Details

Training Data

  • Dataset: naazimsnh02/medocr-vision-dataset
  • Split: 1,969 training samples / 246 validation samples
  • Content: Scanned and photographed Indian medical prescriptions (handwritten and printed), labelled with structured JSON ground truth

Preprocessing

  • Images converted to RGB
  • Labels parsed from <s_ocr> tag format into structured JSON with fields: doctor_name, clinic_name, patient_name, patient_age, date, medications, diagnosis, notes
  • Medications parsed into per-drug entries: drug_name, dosage, frequency, duration, instructions
  • Sequences truncated to max length 512 tokens
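
Assuming the raw labels wrap a JSON payload in `<s_ocr>…</s_ocr>` tags (the dataset's exact inner format is not documented in this card, so treat this as a guess at the parsing step), the label-cleaning stage might look like:

```python
import json
import re
from typing import Optional

def parse_label(raw_label: str) -> Optional[dict]:
    """Hypothetical sketch: pull the payload out of <s_ocr>...</s_ocr> and parse it.
    The dataset's actual inner label format is not documented here."""
    match = re.search(r"<s_ocr>(.*?)</s_ocr>", raw_label, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```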

Training Hyperparameters

  • Training regime: bf16 mixed precision
  • Epochs: 3
  • Per-device train batch size: 1
  • Gradient accumulation steps: 8 (effective batch size: 8)
  • Learning rate: 2e-4
  • LR scheduler: cosine
  • Warmup ratio: 0.05
  • Max sequence length: 512
  • Optimizer: AdamW
  • LoRA rank (r): 16
  • LoRA alpha: 32
  • LoRA dropout: 0.05
  • LoRA target modules: q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj
  • Quantization: 4-bit NF4 (bitsandbytes) with double quantization
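
The values above map onto a PEFT/bitsandbytes configuration along these lines (a reconstruction from the hyperparameter list, not the author's actual training script):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 base weights with double quantization, as listed above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter settings from the hyperparameter list
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```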

Speeds, Sizes, Times

  • Hardware: NVIDIA Tesla T4 × 2
  • Cloud Provider: Kaggle (Google Cloud)
  • Training time: ~3 hours (3 epochs, 1,969 samples)

Evaluation

Testing Data

246 samples from the validation split of naazimsnh02/medocr-vision-dataset.

Factors

Performance varies by:

  • Handwriting legibility (printed > clear handwritten > cursive/illegible)
  • Language mix (English-only > mixed Hindi-English)
  • Number of medications on the prescription

Metrics

  • Primary: Evaluation loss (used for best checkpoint selection)
  • Qualitative: JSON field accuracy on held-out prescription samples
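
JSON field accuracy can be computed by exact-matching each top-level field against the ground truth. A minimal sketch (the metric definition here is my own assumption, not necessarily the one used by the author):

```python
def field_accuracy(pred: dict, gold: dict) -> float:
    """Fraction of gold top-level fields that the prediction matches exactly."""
    if not gold:
        return 0.0
    hits = sum(1 for key, value in gold.items() if pred.get(key) == value)
    return hits / len(gold)
```

Per-field variants (e.g. scoring `medications` entry by entry) give a finer-grained picture of where handwriting legibility hurts.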

Results

The model accurately extracts structured fields from both handwritten and printed Indian prescriptions. Fields like doctor_name, clinic_name, and date are extracted reliably. Medication list accuracy depends on handwriting legibility. Hindi/mixed-language content is supported but may have lower accuracy than English-only prescriptions.


Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA Tesla T4 × 2
  • Hours used: ~3
  • Cloud Provider: Google (Kaggle)
  • Compute Region: US (Kaggle default)

Technical Specifications

Model Architecture and Objective

  • Base: Qwen2.5-VL-3B-Instruct — 3B parameter multimodal vision-language transformer
  • Fine-tuning method: QLoRA (Quantized Low-Rank Adaptation) via PEFT
  • Quantization: 4-bit NF4 base weights; LoRA adapters trained in bfloat16
  • Objective: Causal language modelling on image+text → structured JSON sequences

Software

  • transformers
  • peft
  • trl (SFTTrainer / SFTConfig)
  • bitsandbytes
  • accelerate
  • datasets
  • torch (cu124)

Model Card Authors

Kushagra Wadhwa

Model Card Contact

https://github.com/KushagraaWadhwa
