HunyuanOCR Fine-tuned for Odia OCR

This model is tencent/HunyuanOCR fine-tuned on the OdiaGenAIOCR/odia-ocr-merged dataset with LoRA (r=64, alpha=128).

GitHub: shantipriyap/hunyuan_odia_ocr


Evaluation Results

| Checkpoint | Steps | CER↓ | WER↓ | Notes |
|---|---|---|---|---|
| Baseline (zero-shot) | 0 | 0.9111 | 0.9467 | HunyuanOCR, no fine-tuning |
| v5 (r=32) | 1000 | 0.7577 | 0.846 | Best word-level CER so far |
| v7 (r=32) | 3200 | 0.7909 | 0.941 | r=32 capacity ceiling |
| v8 baseline | 0 | 1.1188 | 1.4385 | Before v8 training |
| v8 ckpt-3250 (latest) | 3250 | in training | in training | Loss ~0.93 best; 67% done |

Note on evaluation: Training uses word-level crops (OdiaGenAIOCR/odia-ocr-merged). The Iftesha/odia-ocr-benchmark dataset contains paragraph-level images, a different domain on which this model scores CER ~0.99. That result is expected, because the model was not trained on paragraphs.


Inference Samples (checkpoint-4000, 80% of training)

Evaluated at checkpoint-4000 on 60 word-crop samples from the OdiaGenAIOCR/odia-ocr-merged test split. Average CER: 1.16. Best CER: 0.64.

Note: Training is 80% complete (4000/5000 steps). Mode collapse persists: the model outputs a small set of common Odia words. This is expected to improve in the final steps.
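One way to put a number on the mode collapse described above is to measure how concentrated the predictions are. The following is a minimal sketch, not part of the original evaluation: `collapse_stats` is a hypothetical helper, and the input list is made-up placeholder data rather than output from this model.

```python
from collections import Counter

def collapse_stats(predictions):
    """Summarise how concentrated a list of predicted strings is."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    counts = Counter(predictions)
    top_word, top_count = counts.most_common(1)[0]
    return {
        # 1.0 means every prediction is different; values near 0 suggest collapse
        "distinct_ratio": len(counts) / len(predictions),
        "top_word": top_word,
        # share of all predictions taken by the single most frequent output
        "top_share": top_count / len(predictions),
    }

# Placeholder predictions, NOT taken from the evaluation run
stats = collapse_stats(["a", "b", "a", "a"])
print(stats)
```

A low distinct ratio together with a high top share over the 60 test samples would indicate that the model is still falling back on a few frequent words.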

🟡 Best Available (CER 0.64–0.70)

| Ground Truth | Prediction | CER |
|---|---|---|
| ବାକିମାନଙ୍କୁ | ବାଲିକା | 0.64 |
| ନିର୍ଦ୍ଧାରଣ | ବିଶ୍ଵାସ | 0.70 |

🟠 Partial (CER 1.0)

| Ground Truth | Prediction | CER |
|---|---|---|
| ଲବଙ୍ଗକୁ | ମୁଖ୍ୟସ୍ଥ | 1.00 |
| ଗ୍ରାଫ୍ | ବିଶ୍ୱର | 1.00 |

🔴 Poor (CER ≥ 3.0)

| Ground Truth | Prediction | CER |
|---|---|---|
| ୫୦ | ବିଶ୍ୱର | 3.00 |
| ୫୨ | ବିଶ୍ୱାସ | 3.50 |
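CER values above 1.0 can look surprising. The card does not state how CER was computed, so the sketch below assumes the common definition: Levenshtein edit distance over Unicode code points, divided by the length of the reference. Under that assumption, a prediction longer than its reference can cost more edits than the reference has characters. The function names `levenshtein` and `cer` are illustrative only.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h from the prediction
                prev[j - 1] + (r != h),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]

def cer(reference, prediction):
    """Character error rate: edit distance normalised by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, prediction) / len(reference)

# A 2-character reference against a 6-character prediction with nothing in common:
# 2 substitutions + 4 insertions = 6 edits, so CER = 6 / 2
print(cer("ab", "uvwxyz"))  # 3.0
```

If the strings are counted in code points, this is the same situation as the first Poor row: a two-digit reference and a six-code-point prediction with no characters in common give 6/2 = 3.00.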

Training Loss Curve (v8, r=64)

| Step | Loss |
|---|---|
| 10 | 2.3695 |
| 500 | ~1.18 |
| 910 | 1.0948 |
| 1500 | ~1.11 |
| 2100 | 0.9964 ← first sub-1.0 |
| 2580 | 0.9339 ← best so far |
| 2750 | 1.0291 |
| 3000 | ~0.979 |
| 3250 | ~0.979 (67% done, in training) |

Training Configuration

| Parameter | Value |
|---|---|
| Base model | tencent/HunyuanOCR |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| Learning rate | 2e-4 |
| Warmup steps | 100 |
| Max steps | 5000 |
| Batch size | 1 (grad_accum=4) |
| Max seq len | 2048 |
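A few derived quantities follow from these settings. The sketch below is illustrative only: `TrainConfig` is a hypothetical container, not the training script's actual configuration object. It assumes standard LoRA scaling (alpha / r), a single device, and that "Max steps" counts optimizer updates, as in the Hugging Face Trainer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container mirroring the table above."""
    lora_rank: int = 64
    lora_alpha: int = 128
    learning_rate: float = 2e-4
    warmup_steps: int = 100
    max_steps: int = 5000
    batch_size: int = 1
    grad_accum: int = 4

    @property
    def lora_scale(self):
        # Standard LoRA multiplies the low-rank update by alpha / r
        return self.lora_alpha / self.lora_rank

    @property
    def effective_batch(self):
        # Assumes one device; multiply by the device count otherwise
        return self.batch_size * self.grad_accum

    @property
    def samples_seen(self):
        # Assumes max_steps counts optimizer updates, not micro-batches
        return self.effective_batch * self.max_steps

cfg = TrainConfig()
print(cfg.lora_scale, cfg.effective_batch, cfg.samples_seen)  # 2.0 4 20000
```

Under these assumptions the run processes 4 samples per optimizer step and 20,000 samples over the full 5000 steps. In PEFT's `LoraConfig`, the rank and alpha correspond to the `r` and `lora_alpha` arguments.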

Quick Start

import torch
from PIL import Image
from transformers import HunYuanVLForConditionalGeneration, AutoProcessor
from peft import PeftModel

BASE  = "tencent/HunyuanOCR"            # base model
CKPT  = "shantipriya/hunyuan-ocr-odia"  # LoRA adapter

# Load the base model in bfloat16, then attach the LoRA adapter.
base  = HunYuanVLForConditionalGeneration.from_pretrained(
    BASE, torch_dtype=torch.bfloat16,
    attn_implementation="eager", device_map="auto")
model = PeftModel.from_pretrained(base, CKPT)
model.eval()
proc  = AutoProcessor.from_pretrained(BASE, use_fast=False)

img   = Image.open("odia_image.jpg").convert("RGB")
msgs  = [
    {"role": "system", "content": ""},        # required
    {"role": "user", "content": [
        {"type": "image", "image": img},
        {"type": "text",  "text": "Extract all Odia text from this image. Return only the Odia text."},
    ]},
]
text   = proc.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
# Move inputs to the device chosen by device_map="auto" instead of hard-coding "cuda".
inputs = proc(text=[text], images=[img], return_tensors="pt").to(base.device)
with torch.no_grad():
    gen = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
result = proc.batch_decode(
    [gen[0][inputs["input_ids"].shape[1]:]], skip_special_tokens=True
)[0].strip()
print(result)

Note: The empty system message is required — omitting it causes a position_ids dimension error.
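To make the empty system message hard to forget when running many images, the message list can be built in one place. This is a minimal sketch: `build_messages` and `DEFAULT_PROMPT` are hypothetical names introduced here, and the prompt string is copied from the Quick Start.

```python
DEFAULT_PROMPT = "Extract all Odia text from this image. Return only the Odia text."

def build_messages(image, prompt=DEFAULT_PROMPT):
    """Return a chat message list that always starts with the empty system turn."""
    return [
        {"role": "system", "content": ""},  # kept empty on purpose, per the note above
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ]},
    ]

# In real use `image` would be a PIL image; a plain object stands in for it here.
placeholder_image = object()
msgs = build_messages(placeholder_image)
print(msgs[0])
```

The returned list can be passed to `proc.apply_chat_template` exactly as `msgs` is in the Quick Start.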


License

Apache 2.0
