rubai-corrector-ocr-books-uz

Uzbek OCR correction model for old books. It is built to take noisy scanned book text, usually in Cyrillic, and turn it into cleaner Uzbek Latin text.

Authors

Sardor Islomov — lead author
Davron Ibrokhimov

This model targets a specific problem: many older Uzbek books survive as scans of printed Cyrillic editions, and raw OCR from those scans is often hard to read, search, and reuse. Because Latin is the official script in Uzbekistan today, the model is intended to turn damaged OCR output into usable Latin Uzbek text and to increase the amount of book content available in Latin script.

This is the OCR-specialized variant of the rubai-corrector family.

rubai-corrector-base: general fine-tuning base
rubai-corrector-transcript-uz: ASR transcript display cleanup
rubai-corrector-ocr-books-uz (this model): old-book OCR correction

What It Is Good At

Use this model when your input looks like OCR from scanned Uzbek books, especially older Cyrillic books.

It is good at:

Cyrillic Uzbek -> Latin Uzbek conversion
OCR letter damage repair
apostrophe and quote normalization
fixing split or merged words caused by OCR
restoring cleaner punctuation and hyphenation
preserving multiline text reasonably well for poetry and book passages

Typical OCR issues it can repair:

савод чнқаришга -> savod chiqarishga
кйлдан -> qo'ldan
козон-позон -> qozon-pozon
суриштнриб -> surishtirib

It is not the right model for:

ASR transcript display polishing
Russian recovery from Latin ASR text
generic denormalization tasks outside OCR books

Base And Data

Base model: rubai-corrector-base
Fine-tuned on a curated OCR correction dataset built from scanned Uzbek books
Fine-tuning train split size: 10,730 pairs
Full prepared dataset size: 12,642 pairs from 54 books
Each training pair maps noisy OCR text to corrected Uzbek Latin text
The dataset includes multiple OCR views of sampled book content and a normalized Latin target style
Targets were standardized to keep apostrophes, quotes, and dashes consistent
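
The exact standardization rules are not documented here. As a rough sketch of what that kind of target normalization could look like, the function below maps common apostrophe, quote, and dash variants to the ASCII forms seen in this card's example outputs. The mapping table and the name `normalize_punct` are assumptions for illustration, not the pipeline used to build the dataset.

```python
# Hypothetical mapping: punctuation variants -> ASCII forms used in the
# example outputs of this card. The real dataset rules may differ.
_PUNCT_MAP = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u02bb": "'",   # modifier letter turned comma
    "\u02bc": "'",   # modifier letter apostrophe
    "`": "'",        # backtick used as an apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u00ab": '"',   # left guillemet
    "\u00bb": '"',   # right guillemet
    "\u2013": "-",   # en dash (assumption: collapse to hyphen)
    "\u2014": "-",   # em dash (assumption: collapse to hyphen)
}
_TABLE = str.maketrans(_PUNCT_MAP)


def normalize_punct(text: str) -> str:
    """Replace each mapped punctuation variant with its ASCII form."""
    return text.translate(_TABLE)
```

Applying the same normalization to reference text before comparing it with model output avoids counting apostrophe or quote style differences as errors.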

Quick Start

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "islomov/rubai-corrector-ocr-books-uz"

# Load the tokenizer and the seq2seq checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Noisy Cyrillic OCR input; "чнқаришга" is an OCR-damaged word.
text = "Бошим айланиб, кўз олдим айланади. Ана шаҳар, ҳамма қўлида алифбе китоби, савод чнқаришга туртинмоқдалар."
# Prepend the "correct: " task prefix before tokenizing.
inputs = tokenizer(f"correct: {text}", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=512)
# Decode the first (and only) generated sequence.
prediction = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(prediction)

Expected output:

Boshim aylanib, ko'z oldim aylanadi. Ana shahar, hamma qo'lida alifbe kitobi, savod chiqarishga turtinmoqdalar.

The model expects every input to start with the prefix "correct: " (including the trailing space) at inference time.
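
For more than one passage, the prefix has to be added to each input. The sketch below shows one way to do that in batches. It assumes `tokenizer` and `model` were loaded as in Quick Start; `build_prompts`, `correct_batch`, and the default `batch_size` are illustrative names and values, not part of this repository.

```python
PREFIX = "correct: "  # task prefix stated in this card


def build_prompts(texts):
    """Prepend the task prefix to every input string."""
    return [f"{PREFIX}{t}" for t in texts]


def correct_batch(texts, tokenizer, model, batch_size=8, max_new_tokens=512):
    """Correct several passages, batch_size at a time.

    tokenizer and model are expected to be the objects loaded in
    Quick Start; batch_size=8 is an arbitrary illustrative default.
    """
    prompts = build_prompts(texts)
    outputs = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        # Pad so that inputs of different lengths fit in one tensor.
        inputs = tokenizer(batch, return_tensors="pt", padding=True)
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        outputs.extend(
            tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        )
    return outputs
```

The returned list has one corrected string per input, in the same order.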

Real Example Outputs

These examples are taken from the held-out OCR evaluation for this checkpoint.

OCR damage: broken letters

Input:     Бошим айланиб, кўз олдим айланади. Ана шаҳар, ҳамма қўлида алифбе китоби, савод чнқаришга туртинмоқдалар.
Output:    Boshim aylanib, ko'z oldim aylanadi. Ana shahar, hamma qo'lida alifbe kitobi, savod chiqarishga turtinmoqdalar.

Corrects:

чнқаришга -> chiqarishga
full Cyrillic -> Latin conversion
apostrophe restoration in ko'z, qo'l

OCR damage: broken words

Input:     уриб турмаса кўнгли жойига тушмас, каерга борса шундан сўз очишни, кимни кўрмасин шу хакда бирон янги гап эшитишни, кандай мажлнс булмасин шу тўгрида масала кўйилишини хохлар ва ўнг келганда фурсатни кйлдан бермас эди.
Output:    urib turmasa ko'ngli joyiga tushmas, qayerga borsa shundan so'z ochishni, kimni ko'rmasin shu haqda biron yangi gap eshitishni, qanday majlis bo'lmasin shu to'g'rida masala qo'yilishini xohlar va o'ng kelganda fursatni qo'ldan bermas edi.

Corrects:

каерга -> qayerga
мажлнс -> majlis
кйлдан -> qo'ldan
apostrophes across the line
multiple OCR-damaged words in one sentence

OCR damage: mixed lexical corruption

Input:     "Сотасанми?", дедим, "Сотаман", деди. Пул билан савдо келишмагандан кейин, ош билан айирбош килмокчи бўлди. Кўндим. Ошни козон-позон, ўчок дамгирлари билан шу унга алишдим.
Output:    "Sotasanmi?", dedim, "Sotaman", dedi. Pul bilan savdo kelishmagandan keyin, osh bilan ayirbosh qilmoqchi bo'ldi. Ko'ndim. Oshni qozon-pozon, o'choq damgirlari bilan shu unga alishdim.

Corrects:

килмокчи -> qilmoqchi
козон-позон -> qozon-pozon
ўчок -> o'choq
quote and apostrophe normalization

OCR damage: names and rare words

Input:     Малика овозаларнинг қанчалик ростлигини Бобир саройида хизматда юрган невараси Баҳодирдан суриштнриб билмоҳчи бўлди.
Output:    Malika ovozalarning qanchalik rostligini Bobir saroyida xizmatda yurgan nevarasi Bahodirdan surishtirib bilmoqchi bo'ldi.

Corrects:

суриштнриб -> surishtirib
named entities and historical names
full Cyrillic -> Latin conversion without losing meaning

OCR damage: poetry and layout

Input:     Айт-чн, йигит, Кимни ўйладинг, Куйлаганинг ким эди, Айт, ким? Ким ўртади ҳажрида сени, Кимга орзунг бўлмоқ муяссар?
Output:    Ayt-chi, yigit,
           Kimni o'ylading,
           Kuylaganing kim edi,
           Ayt, kim?
           Kim o'rtadi hajrida seni,
           Kimga orzung bo'lmoq muyassar?

Corrects:

Айт-чн -> Ayt-chi
line segmentation for verse-like text
apostrophes in o'ylading, o'rtadi

Cyrillic book text -> clean Latin

Input:     Фронтдан эсиб турган ел охирги мартаба ўлароқ ўзи билан бомба, шрапнель жаҳаннам машиналари ва оғир тўплардан портлаган порохларнинг заҳар тутунларини улаштириб, Оқ денгизнинг зилол қўйнига тўкди.
Output:    Frontdan esib turgan yel oxirgi martaba o'laroq o'zi bilan bomba, shrapnel jahannam mashinalari va og'ir to'plardan portlagan poroxlarning zahar tutunlarini ulashtirib, Oq dengizning zilol qo'yniga to'kdi.

Corrects:

long continuous Cyrillic prose
apostrophes in o'laroq, og'ir, qo'yniga, to'kdi
punctuation and readability preserved across a long sentence

Challenging example: many OCR fixes in one long passage

Input:     Уни хар кандай от хам бир козон кайновича муддатдан ортик кўтариб юролмас, унга савашда тенг келгувчи топилмас эди, ана шу ботирлар ботирини охири Жалолиддинга рўбарў килди. "Дев" ўз одатича чинкириб ракибига ташланди, аммо дам ўтмай овози ўчди.
Output:    Uni har qanday ot ham bir qozon qaynovicha muddatdan ortiq ko'tarib yurolmas, unga savashda teng kelguvchi topilmas edi, ana shu botirlar botirini oxiri Jaloliddinga ro'baro' qildi. "Dev" o'z odaticha chinqirib raqibiga tashlandi, ammo dam o'tmay ovozi o'chdi.

Corrects:

хар -> har
кандай -> qanday
козон -> qozon
ортик -> ortiq
чинкириб -> chinqirib
ракибига -> raqibiga
long multi-clause OCR text with quotes and apostrophes preserved
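
To score outputs like the ones above against a reference text, character error rate (CER) is one common choice. This card does not say which metric was used for the held-out evaluation, so the sketch below is a generic, dependency-free illustration rather than the authors' evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def cer(prediction: str, reference: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not prediction else 1.0
    return levenshtein(prediction, reference) / len(reference)
```

For example, `cer("majlns", "majlis")` is one substitution over six reference characters, roughly 0.167.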

Why This Model Matters

For old Uzbek books, the hard part is often not just script conversion. It is OCR damage correction:

dropped or corrupted letters
broken apostrophes
bad word boundaries
damaged names
noisy punctuation
messy multiline book text

This model is designed for that exact problem.

Practical Limits

The model still struggles with:

title pages
bibliographic blocks
repeated headers and page furniture
glossary and footnote-heavy lines
extremely damaged OCR with severe spacing noise

For those cases, manual review is still recommended.
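
One way to decide which outputs to send to manual review is a couple of cheap heuristics: Cyrillic letters left in the output, or an output whose length differs sharply from the input. The sketch below is an assumption-laden illustration; the function name, flag names, and ratio thresholds are hypothetical and would need tuning on your own data.

```python
import re

# Basic Cyrillic Unicode block; covers the Uzbek Cyrillic letters.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def review_flags(source: str, output: str, min_ratio=0.6, max_ratio=1.6):
    """Return a list of reasons to review this output by hand.

    The thresholds are illustrative guesses, not values from this card.
    """
    flags = []
    if CYRILLIC.search(output):
        flags.append("cyrillic_left_in_output")
    if not output.strip():
        flags.append("empty_output")
    elif source.strip():
        # Latin output is usually somewhat longer than Cyrillic input
        # (for example, one Cyrillic letter can become "sh" or "ch").
        ratio = len(output) / len(source)
        if ratio < min_ratio or ratio > max_ratio:
            flags.append("length_ratio_out_of_range")
    return flags
```

An empty list does not mean the output is correct; it only means none of these simple checks fired.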

Files

test_model.py: runnable local example script

Acknowledgements

Special thanks to Davron Ibrokhimov for sponsoring this work and making it possible to keep these models open.

Thank you to the community that supports Uzbek language technology. In particular:

MetaSell for support and resources
Kotib for their support and collaboration on Uzbek STT
Global Move for backing open Uzbek NLP work

Thanks to Arofat, Gulimshaxnoz, and many others who contributed in ways big and small. The list is too long to fit here, but every contribution matters and is appreciated.

Support my work and the open-source movement: https://tirikchilik.uz/islomovs

Model size: 0.3B params (Safetensors, tensor type F32)
