# Odia OCR — Qwen2.5-VL-3B Fine-tuned (v2)
Fine-tuned version of Qwen/Qwen2.5-VL-3B-Instruct for Optical Character Recognition (OCR) of Odia script using LoRA adapters.
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-VL-3B-Instruct |
| Fine-tuning Method | LoRA (PEFT) |
| LoRA Rank | 64 |
| LoRA Alpha | 128 |
| LoRA Target Modules | q_proj, v_proj |
| Training Dataset | shantipriya/odia-ocr-merged |
| Training Samples | 145,000 word-level Odia OCR crops |
| Final Checkpoint | checkpoint-6400 (early stopped) |
| Final Epoch | 1.50 |
| Final Train Loss | ~4.83 |
| Best Eval Loss | 5.454 |
| Training Hardware | NVIDIA H100 80GB |
| Training Duration | ~12.7 hours |
| Learning Rate | 3e-4 (cosine decay to 2.7e-5) |
| Batch Size | 8 (per device 2 × grad accum 4) |
## Training Notes

Training was stopped early at step 6,400 (of 12,387 planned) due to a confirmed loss plateau:
- Train loss converged to ~4.83–5.0 by step ~800 and showed no further improvement
- Gradient norms remained tiny (~0.014–0.024) indicating saturated word-level learning
- Eval loss plateau: 5.512 → 5.454 (only 1% delta across 6,000 steps)
For further gains, Phase 3 with mixed paragraph + word samples is recommended.
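A plateau-based stop like the one described above can be implemented with the built-in `EarlyStoppingCallback` in `transformers`. This is a sketch only; the patience and threshold values below are illustrative assumptions, not the settings used for this run:

```python
from transformers import EarlyStoppingCallback

early_stop = EarlyStoppingCallback(
    early_stopping_patience=3,      # assumed: stop after 3 evals without improvement
    early_stopping_threshold=0.01,  # assumed: minimum eval-loss delta that counts
)
# Attach via Trainer(..., callbacks=[early_stop]);
# requires load_best_model_at_end=True and a metric_for_best_model.
```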
## Sample Predictions
Each row shows the original crop image from shantipriya/odia-ocr-merged, the ground truth label,
the model-extracted text, and a quality remark.
### ✅ Good — clean, high-contrast printed crops
| Image | Ground Truth | Extracted Text | Remark |
|---|---|---|---|
| ![]() | ଫୁଲି | ଫୁଲି | ✅ Exact match |
| ![]() | ସିମିତ | ସିମିତ | ✅ Exact match |
| ![]() | କାବୁ | କାବୁ | ✅ Exact match |
| ![]() | ସେରେସ | ସେରେସ | ✅ Exact match |
| ![]() | କଳାଭାଲୁ | କଳାଭାଲୁ | ✅ Exact match |
This is the majority case (~65–70%), covering well-segmented printed word crops.
### ⚠️ Mixed — partial errors, diacritic / conjunct substitutions
Mixed cases (~20–25%) mostly involve complex conjuncts and long-vowel matras.
### ❌ Bad — degraded, truncated, or low-resolution inputs

Bad cases (~10–15%) involve very low resolution (<20 px height), heavy degradation, or long compound words.
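One mitigation for the low-resolution failure mode is to upscale very small crops before OCR. A minimal sketch with PIL; the 48 px target height is an assumption, not a tuned value:

```python
from PIL import Image

def upscale_if_small(image: Image.Image, min_height: int = 48) -> Image.Image:
    """Upscale a crop to min_height if it is shorter, preserving aspect ratio."""
    if image.height >= min_height:
        return image
    scale = min_height / image.height
    new_width = max(1, round(image.width * scale))
    # LANCZOS resampling tends to preserve thin strokes better than bilinear
    return image.resize((new_width, min_height), Image.LANCZOS)
```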
## Summary
| Category | Approx. Share | Typical Cause |
|---|---|---|
| ✅ Good (exact match) | ~65–70% | Clean, well-segmented printed crops |
| ⚠️ Mixed (1–2 char errors) | ~20–25% | Complex conjuncts, long-vowel matras |
| ❌ Bad (heavily wrong) | ~10–15% | Degraded scans, compound words, low-res |
Note: CER/WER metrics on a curated test split are pending. Percentages are estimated from qualitative review of ~200 samples.
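For the pending CER evaluation, a dependency-free sketch of character error rate via Levenshtein edit distance (the WER variant would operate on whitespace-split tokens instead of characters):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        return float(len(hypothesis) > 0)
    return levenshtein(reference, hypothesis) / len(reference)

print(cer("ଫୁଲି", "ଫୁଲି"))  # exact match → 0.0
```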
## Usage
```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel
import torch
from PIL import Image

base_model = "Qwen/Qwen2.5-VL-3B-Instruct"
adapter_model = "shantipriya/odia-ocr-qwen-finetuned_v2"

processor = AutoProcessor.from_pretrained(base_model, trust_remote_code=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter_model)
model.eval()

def ocr_image(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Extract the Odia text from this image. Return only the text."},
        ],
    }]
    text_prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[text_prompt], images=[image], return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Greedy decoding; sampling parameters are not needed with do_sample=False
        output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Strip the prompt tokens and keep only the newly generated text
    generated = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(generated, skip_special_tokens=True)[0].strip()

print(ocr_image("odia_word.png"))
```
## Training Data
The model was trained on shantipriya/odia-ocr-merged:
- 145,000 word-level Odia script image crops
- Diverse fonts, sizes, and print quality
- Sourced from multiple Odia OCR corpora and merged/deduplicated
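For reference, a minimal sketch of turning one (image, label) pair from the dataset into the chat format Qwen2.5-VL expects for supervised fine-tuning. The prompt string matches the usage example in this card, but the overall structure is an assumption about the training setup, not its confirmed implementation:

```python
def to_chat_sample(image, label: str) -> dict:
    """Wrap one OCR crop and its ground-truth label as a user/assistant turn."""
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": image},
                {"type": "text",
                 "text": "Extract the Odia text from this image. Return only the text."},
            ]},
            # The assistant turn carries the ground-truth transcription
            {"role": "assistant", "content": [{"type": "text", "text": label}]},
        ]
    }
```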
## Available Checkpoints
| Checkpoint | Step | Epoch | Train Loss |
|---|---|---|---|
| checkpoint-3200 | 3,200 | 0.77 | ~5.2 |
| checkpoint-6000 | 6,000 | 1.45 | ~4.85 |
| checkpoint-6200 | 6,200 | 1.50 | ~4.92 |
| checkpoint-6400 ← Final | 6,400 | 1.51 | ~4.83 |
## Limitations
- Optimized for printed Odia word-level crops; handwritten or degraded images may need further fine-tuning
- Complex conjunct characters and long compound words are main error sources
- Not tested on mixed-language (Odia + English) documents
## Citation

```bibtex
@misc{parida2026odiaocr,
  author       = {Shantipriya Parida and OdiaGenAI Team},
  title        = {Odia OCR: Fine-tuned Qwen2.5-VL for Odia Script Recognition},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/shantipriya/odia-ocr-qwen-finetuned_v2}},
  note         = {LoRA fine-tune of Qwen2.5-VL-3B-Instruct on 145K Odia OCR word crops}
}
```
If using the training dataset, also cite:
```bibtex
@misc{parida2026odiadataset,
  author       = {Shantipriya Parida},
  title        = {Odia OCR Merged Dataset},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/datasets/shantipriya/odia-ocr-merged}}
}
```
## License
Apache 2.0
## Contact
- Author: Shantipriya Parida
- Organization: OdiaGenAI
- Mirror: OdiaGenAIOCR/odia-ocr-qwen-finetuned