# Odia OCR — Qwen2.5-VL-3B Fine-tuned (v2)
Fine-tuned version of Qwen/Qwen2.5-VL-3B-Instruct for Optical Character Recognition (OCR) of Odia script using LoRA adapters.
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen2.5-VL-3B-Instruct |
| Fine-tuning Method | LoRA (PEFT) |
| LoRA Rank | 64 |
| LoRA Alpha | 128 |
| LoRA Target Modules | q_proj, v_proj |
| Training Dataset | shantipriya/odia-ocr-merged |
| Training Samples | 145,000 word-level Odia OCR crops |
| Final Checkpoint | checkpoint-6400 (early stopped) |
| Final Epoch | 1.50 |
| Final Train Loss | ~4.83 |
| Best Eval Loss | 5.454 |
| Training Hardware | NVIDIA H100 80GB |
| Training Duration | ~12.7 hours |
| Learning Rate | 3e-4 (cosine decay to 2.7e-5) |
| Batch Size | 8 (per device 2 × grad accum 4) |
## Training Notes

Training was stopped early at step 6,400 (of 12,387 planned) due to a confirmed loss plateau:
- Train loss converged to ~4.83–5.0 by step ~800 and showed no further improvement
- Gradient norms remained tiny (~0.014–0.024) indicating saturated word-level learning
- Eval loss plateau: 5.512 → 5.454 (only 1% delta across 6,000 steps)
For further gains, Phase 3 with mixed paragraph + word samples is recommended.
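A plateau-based stop like the one described above can be implemented with the built-in `EarlyStoppingCallback` in `transformers`. This is a sketch only; the patience and threshold values below are illustrative assumptions, not the settings used for this run:

```python
from transformers import EarlyStoppingCallback

early_stop = EarlyStoppingCallback(
    early_stopping_patience=3,      # assumed: stop after 3 evals without improvement
    early_stopping_threshold=0.01,  # assumed: minimum eval-loss delta that counts
)
# Attach via Trainer(..., callbacks=[early_stop]);
# requires load_best_model_at_end=True and a metric_for_best_model.
```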
## Sample Predictions
Each row shows the original crop image from shantipriya/odia-ocr-merged, the ground truth label,
the model-extracted text, and a quality remark.
### ✅ Good — clean, high-contrast printed crops
| Image | Ground Truth | Extracted Text | Remark |
|---|---|---|---|
| ![]() | ଫୁଲି | ଫୁଲି | ✅ Exact match |
| ![]() | ସିମିତ | ସିମିତ | ✅ Exact match |
| ![]() | କାବୁ | କାବୁ | ✅ Exact match |
| ![]() | ସେରେସ | ସେରେସ | ✅ Exact match |
| ![]() | କଳାଭାଲୁ | କଳାଭାଲୁ | ✅ Exact match |
This is the majority case (~65–70%), covering well-segmented printed word crops.
### ⚠️ Mixed — partial errors, diacritic / conjunct substitutions
Mixed cases (~20–25%) mostly involve complex conjuncts and long-vowel matras.
### ❌ Bad — degraded, truncated, or low-resolution inputs

Bad cases (~10–15%) involve very low resolution (<20 px height), heavy degradation, or long compound words.
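One mitigation for the low-resolution failure mode is to upscale very small crops before OCR. A minimal sketch with PIL; the 48 px target height is an assumption, not a tuned value:

```python
from PIL import Image

def upscale_if_small(image: Image.Image, min_height: int = 48) -> Image.Image:
    """Upscale a crop to min_height if it is shorter, preserving aspect ratio."""
    if image.height >= min_height:
        return image
    scale = min_height / image.height
    new_width = max(1, round(image.width * scale))
    # LANCZOS resampling tends to preserve thin strokes better than bilinear
    return image.resize((new_width, min_height), Image.LANCZOS)
```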
## Summary
| Category | Approx. Share | Typical Cause |
|---|---|---|
| ✅ Good (exact match) | ~65–70% | Clean, well-segmented printed crops |
| ⚠️ Mixed (1–2 char errors) | ~20–25% | Complex conjuncts, long-vowel matras |
| ❌ Bad (heavily wrong) | ~10–15% | Degraded scans, compound words, low-res |
Note: CER/WER metrics on a curated test split are pending. Percentages are estimated from qualitative review of ~200 samples.
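For the pending CER evaluation, a dependency-free sketch of character error rate via Levenshtein edit distance (the WER variant would operate on whitespace-split tokens instead of characters):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        return float(len(hypothesis) > 0)
    return levenshtein(reference, hypothesis) / len(reference)

print(cer("ଫୁଲି", "ଫୁଲି"))  # exact match → 0.0
```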
## Usage
```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel
import torch
from PIL import Image

base_model = "Qwen/Qwen2.5-VL-3B-Instruct"
adapter_model = "shantipriya/odia-ocr-qwen-finetuned_v2"

processor = AutoProcessor.from_pretrained(base_model, trust_remote_code=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, adapter_model)
model.eval()

def ocr_image(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Extract the Odia text from this image. Return only the text."},
        ],
    }]
    text_prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[text_prompt], images=[image], return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Greedy decoding; sampling parameters are not needed with do_sample=False
        output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Strip the prompt tokens and keep only the newly generated text
    generated = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(generated, skip_special_tokens=True)[0].strip()

print(ocr_image("odia_word.png"))
```
## Training Data
The model was trained on shantipriya/odia-ocr-merged:
- 145,000 word-level Odia script image crops
- Diverse fonts, sizes, and print quality
- Sourced from multiple Odia OCR corpora and merged/deduplicated
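For reference, a minimal sketch of turning one (image, label) pair from the dataset into the chat format Qwen2.5-VL expects for supervised fine-tuning. The prompt string matches the usage example in this card, but the overall structure is an assumption about the training setup, not its confirmed implementation:

```python
def to_chat_sample(image, label: str) -> dict:
    """Wrap one OCR crop and its ground-truth label as a user/assistant turn."""
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": image},
                {"type": "text",
                 "text": "Extract the Odia text from this image. Return only the text."},
            ]},
            # The assistant turn carries the ground-truth transcription
            {"role": "assistant", "content": [{"type": "text", "text": label}]},
        ]
    }
```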
## Available Checkpoints
| Checkpoint | Step | Epoch | Train Loss |
|---|---|---|---|
| checkpoint-3200 | 3,200 | 0.77 | ~5.2 |
| checkpoint-6000 | 6,000 | 1.45 | ~4.85 |
| checkpoint-6200 | 6,200 | 1.50 | ~4.92 |
| checkpoint-6400 ← Final | 6,400 | 1.51 | ~4.83 |
## Limitations
- Optimized for printed Odia word-level crops; handwritten or degraded images may need further fine-tuning
- Complex conjunct characters and long compound words are main error sources
- Not tested on mixed-language (Odia + English) documents
## Citation

```bibtex
@misc{parida2026odiaocr,
  author       = {Shantipriya Parida and OdiaGenAI Team},
  title        = {Odia OCR: Fine-tuned Qwen2.5-VL for Odia Script Recognition},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/shantipriya/odia-ocr-qwen-finetuned_v2}},
  note         = {LoRA fine-tune of Qwen2.5-VL-3B-Instruct on 145K Odia OCR word crops}
}
```
If using the training dataset, also cite:
```bibtex
@misc{parida2026odiadataset,
  author       = {Shantipriya Parida},
  title        = {Odia OCR Merged Dataset},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/datasets/shantipriya/odia-ocr-merged}}
}
```
## License
Apache 2.0
## Contact
- Author: Shantipriya Parida
- Organization: OdiaGenAI
- Mirror: OdiaGenAIOCR/odia-ocr-qwen-finetuned