rubai-corrector-ocr-books-uz
Uzbek OCR correction model for old books. It is built to take noisy scanned book text, usually in Cyrillic, and turn it into cleaner Uzbek Latin text.
Authors
- Sardor Islomov — lead author
- Davron Ibrokhimov
This model is focused on a specific problem: many older Uzbek books survive only as scans of printed Cyrillic editions, and raw OCR from those scans is often hard to read, hard to search, and hard to reuse. Because Latin is the official script in Uzbekistan today, the model is intended to turn damaged OCR output into usable Latin Uzbek text and to increase the amount of book content available in Latin script.
This is the OCR-specialized variant of the rubai-corrector family.
- rubai-corrector-base: general fine-tuning base
- rubai-corrector-transcript-uz: ASR transcript display cleanup
- rubai-corrector-ocr-books-uz (this model): old-book OCR correction
What It Is Good At
Use this model when your input looks like OCR from scanned Uzbek books, especially older Cyrillic books.
It is good at:
- Cyrillic Uzbek -> Latin Uzbek conversion
- OCR letter damage repair
- apostrophe and quote normalization
- fixing split or merged words caused by OCR
- restoring cleaner punctuation and hyphenation
- preserving multiline text reasonably well for poetry and book passages
Typical OCR issues it can repair:
- савод чнқаришга -> savod chiqarishga
- кйлдан -> qo'ldan
- козон-позон -> qozon-pozon
- суриштнриб -> surishtirib
It is not the right model for:
- ASR transcript display polishing
- Russian recovery from Latin ASR text
- generic denormalization tasks outside OCR books
Base And Data
- Base model: rubai-corrector-base
- Fine-tuned on a curated OCR correction dataset built from scanned Uzbek books
- Fine-tuning train split size: 10,730 pairs
- Full prepared dataset size: 12,642 pairs from 54 books
- Each training pair maps noisy OCR text to corrected Uzbek Latin text
- The dataset includes multiple OCR views of sampled book content and a normalized Latin target style
- Targets were standardized to keep apostrophes, quotes, and dashes consistent
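The card does not list the exact normalization rules, so the following is only a minimal sketch of what standardizing apostrophes, quotes, and dashes in target text could look like. The character table and the name `normalize_target` are assumptions made for illustration, not the authors' pipeline.

```python
# Hypothetical normalization table; NOT the authors' actual rules.
# Unicode escapes are used so the variants are unambiguous.
APOSTROPHES = "\u02bb\u02bc\u2018\u2019\u0060"  # modifier letters, curly single quotes, backtick
QUOTES = "\u201c\u201d\u00ab\u00bb"             # curly double quotes and guillemets
DASHES = "\u2013\u2014"                         # en dash and em dash

_TABLE = {ord(c): "'" for c in APOSTROPHES}
_TABLE.update({ord(c): '"' for c in QUOTES})
_TABLE.update({ord(c): "-" for c in DASHES})


def normalize_target(text: str) -> str:
    """Map apostrophe, quote, and dash variants to one ASCII form each."""
    return text.translate(_TABLE)


print(normalize_target("ko\u2018z"))  # apostrophe variant becomes a plain '
```

A table like this keeps the target side of every training pair consistent, so the model is not asked to learn several spellings of the same apostrophe.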
Quick Start
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# Load the tokenizer and the seq2seq model from the Hugging Face Hub
model_path = "islomov/rubai-corrector-ocr-books-uz"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
# Noisy Cyrillic OCR input ("чнқаришга" is a damaged form of "чиқаришга")
text = "Бошим айланиб, кўз олдим айланади. Ана шаҳар, ҳамма қўлида алифбе китоби, савод чнқаришга туртинмоқдалар."
# The "correct: " task prefix is required at inference time
inputs = tokenizer(f"correct: {text}", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=512)
# Decode the generated token ids back into text
prediction = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(prediction)
Expected output:
Boshim aylanib, ko'z oldim aylanadi. Ana shahar, hamma qo'lida alifbe kitobi, savod chiqarishga turtinmoqdalar.
The model expects the "correct: " prefix at inference time, as in the Quick Start example above.
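When correcting whole pages rather than single sentences, it can help to add the prefix in one place. The sketch below is an assumption-laden helper, not part of the released code: the names `add_prefix` and `page_to_inputs` are hypothetical, and splitting a page on blank lines is only one plausible way to keep each input short.

```python
PREFIX = "correct: "  # task prefix stated in this model card


def add_prefix(text: str) -> str:
    """Prepend the task prefix unless the text already starts with it."""
    stripped = text.strip()
    return stripped if stripped.startswith(PREFIX) else PREFIX + stripped


def page_to_inputs(page: str) -> list[str]:
    """Split a page on blank lines and prefix each non-empty paragraph.

    Paragraph-level splitting is an assumption; the card does not state
    a maximum input length.
    """
    paragraphs = [p.strip() for p in page.split("\n\n")]
    return [add_prefix(p) for p in paragraphs if p]


page = "кйлдан бермас эди.\n\nсуриштнриб билмоҳчи бўлди."
for item in page_to_inputs(page):
    print(item)
```

The resulting list can be passed to the tokenizer as a batch, for example with `padding=True` and `return_tensors="pt"`, and decoded with `batch_decode` as in the Quick Start.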
Real Example Outputs
These examples are taken from the actual held-out OCR evaluation for this checkpoint.
OCR damage: broken letters
Input: Бошим айланиб, кўз олдим айланади. Ана шаҳар, ҳамма қўлида алифбе китоби, савод чнқаришга туртинмоқдалар.
Output: Boshim aylanib, ko'z oldim aylanadi. Ana shahar, hamma qo'lida alifbe kitobi, savod chiqarishga turtinmoqdalar.
Corrects:
- чнқаришга -> chiqarishga
- full Cyrillic -> Latin conversion
- apostrophe restoration in ko'z, qo'l
OCR damage: broken words
Input: уриб турмаса кўнгли жойига тушмас, каерга борса шундан сўз очишни, кимни кўрмасин шу хакда бирон янги гап эшитишни, кандай мажлнс булмасин шу тўгрида масала кўйилишини хохлар ва ўнг келганда фурсатни кйлдан бермас эди.
Output: urib turmasa ko'ngli joyiga tushmas, qayerga borsa shundan so'z ochishni, kimni ko'rmasin shu haqda biron yangi gap eshitishni, qanday majlis bo'lmasin shu to'g'rida masala qo'yilishini xohlar va o'ng kelganda fursatni qo'ldan bermas edi.
Corrects:
- каерга -> qayerga
- мажлнс -> majlis
- кйлдан -> qo'ldan
- apostrophes across the line
- multiple OCR-damaged words in one sentence
OCR damage: mixed lexical corruption
Input: "Сотасанми?", дедим, "Сотаман", деди. Пул билан савдо келишмагандан кейин, ош билан айирбош килмокчи бўлди. Кўндим. Ошни козон-позон, ўчок дамгирлари билан шу унга алишдим.
Output: "Sotasanmi?", dedim, "Sotaman", dedi. Pul bilan savdo kelishmagandan keyin, osh bilan ayirbosh qilmoqchi bo'ldi. Ko'ndim. Oshni qozon-pozon, o'choq damgirlari bilan shu unga alishdim.
Corrects:
- килмокчи -> qilmoqchi
- козон-позон -> qozon-pozon
- ўчок -> o'choq
- quote and apostrophe normalization
OCR damage: names and rare words
Input: Малика овозаларнинг қанчалик ростлигини Бобир саройида хизматда юрган невараси Баҳодирдан суриштнриб билмоҳчи бўлди.
Output: Malika ovozalarning qanchalik rostligini Bobir saroyida xizmatda yurgan nevarasi Bahodirdan surishtirib bilmoqchi bo'ldi.
Corrects:
- суриштнриб -> surishtirib
- named entities and historical names
- full Cyrillic -> Latin conversion without losing meaning
OCR damage: poetry and layout
Input: Айт-чн, йигит, Кимни ўйладинг, Куйлаганинг ким эди, Айт, ким? Ким ўртади ҳажрида сени, Кимга орзунг бўлмоқ муяссар?
Output: Ayt-chi, yigit,
Kimni o'ylading,
Kuylaganing kim edi,
Ayt, kim?
Kim o'rtadi hajrida seni,
Kimga orzung bo'lmoq muyassar?
Corrects:
- Айт-чн -> Ayt-chi
- line segmentation for verse-like text
- apostrophes in o'ylading, o'rtadi
Cyrillic book text -> clean Latin
Input: Фронтдан эсиб турган ел охирги мартаба ўлароқ ўзи билан бомба, шрапнель жаҳаннам машиналари ва оғир тўплардан портлаган порохларнинг заҳар тутунларини улаштириб, Оқ денгизнинг зилол қўйнига тўкди.
Output: Frontdan esib turgan yel oxirgi martaba o'laroq o'zi bilan bomba, shrapnel jahannam mashinalari va og'ir to'plardan portlagan poroxlarning zahar tutunlarini ulashtirib, Oq dengizning zilol qo'yniga to'kdi.
Corrects:
- long continuous Cyrillic prose
- apostrophes in o'laroq, og'ir, qo'yniga, to'kdi
- punctuation and readability preserved across a long sentence
Challenging example: many OCR fixes in one long passage
Input: Уни хар кандай от хам бир козон кайновича муддатдан ортик кўтариб юролмас, унга савашда тенг келгувчи топилмас эди, ана шу ботирлар ботирини охири Жалолиддинга рўбарў килди. "Дев" ўз одатича чинкириб ракибига ташланди, аммо дам ўтмай овози ўчди.
Output: Uni har qanday ot ham bir qozon qaynovicha muddatdan ortiq ko'tarib yurolmas, unga savashda teng kelguvchi topilmas edi, ana shu botirlar botirini oxiri Jaloliddinga ro'baro' qildi. "Dev" o'z odaticha chinqirib raqibiga tashlandi, ammo dam o'tmay ovozi o'chdi.
Corrects:
- хар -> har
- кандай -> qanday
- козон -> qozon
- ортик -> ortiq
- чинкириб -> chinqirib
- ракибига -> raqibiga
- long multi-clause OCR text with quotes and apostrophes preserved
Why This Model Matters
For old Uzbek books, the hard part is often not just script conversion. It is OCR damage correction:
- dropped or corrupted letters
- broken apostrophes
- bad word boundaries
- damaged names
- noisy punctuation
- messy multiline book text
This model is designed for that exact problem.
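To illustrate why script conversion alone is not enough, here is a deliberately naive character-by-character transliterator. It is a simplified sketch written for this explanation (the table ignores context rules such as word-initial "е"), and it is not how the model works. Applied to the damaged examples from this card, it converts the script but carries the OCR errors straight through.

```python
# Simplified Uzbek Cyrillic -> Latin table, for illustration only.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "j", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "x", "ц": "ts", "ч": "ch", "ш": "sh", "ъ": "'", "ь": "",
    "э": "e", "ю": "yu", "я": "ya", "ў": "o'", "қ": "q", "ғ": "g'", "ҳ": "h",
}


def naive_transliterate(text: str) -> str:
    """Convert each Cyrillic letter independently, with no error repair."""
    out = []
    for ch in text:
        lower = ch.lower()
        latin = CYR_TO_LAT.get(lower)
        if latin is None:
            out.append(ch)                  # leave punctuation, digits, spaces
        elif ch != lower:
            out.append(latin.capitalize())  # keep capitalization of the source
        else:
            out.append(latin)
    return "".join(out)


for damaged in ["чнқаришга", "кйлдан", "козон-позон"]:
    print(damaged, "->", naive_transliterate(damaged))
```

The naive results ("chnqarishga", "kyldan", "kozon-pozon") differ from the corrected forms listed earlier in this card ("chiqarishga", "qo'ldan", "qozon-pozon"), which is the gap the model is meant to close.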
Practical Limits
It is still less reliable on:
- title pages
- bibliographic blocks
- repeated headers and page furniture
- glossary and footnote-heavy lines
- extremely damaged OCR with severe spacing noise
For those cases, manual review is still recommended.
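One way to decide which outputs to send to manual review is a cheap heuristic check. The sketch below is only an assumption about what such a check might look like: the function name `needs_review` and the length-ratio thresholds are hypothetical and would need tuning on real data.

```python
import re

# Any character from the basic Cyrillic Unicode block.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def needs_review(source: str, output: str,
                 min_ratio: float = 0.5, max_ratio: float = 2.0) -> bool:
    """Flag an output for manual review using simple heuristics.

    Flags empty outputs, outputs that still contain Cyrillic letters,
    and outputs whose length differs sharply from the source length.
    The thresholds are illustrative defaults, not tuned values.
    """
    if not output.strip():
        return True
    if CYRILLIC.search(output):
        return True
    ratio = len(output) / max(len(source), 1)
    return not (min_ratio <= ratio <= max_ratio)


print(needs_review("кйлдан", "qo'ldan"))  # clean Latin output of similar length
print(needs_review("кйлдан", "qo'лдан"))  # leftover Cyrillic letters
```

A check like this does not measure correctness. It only surfaces outputs that look suspicious, such as truncated title pages or partially converted lines.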
Files
test_model.py: runnable local example script
Acknowledgements
Special thanks to Davron Ibrokhimov for sponsoring this work and making it possible to keep these models open.
Thank you to the community that supports Uzbek language technology. In particular:
- MetaSell for support and resources
- Kotib for their support and collaboration on Uzbek STT
- Global Move for backing open Uzbek NLP work
Thanks to Arofat, Gulimshaxnoz, and many others who contributed in ways big and small. The list is too long to fit here, but every contribution matters and is appreciated.
Support my work and the open-source movement: https://tirikchilik.uz/islomovs