rubai-corrector-ocr-books-uz

Uzbek OCR correction model for old books. It is built to take noisy scanned book text, usually in Cyrillic, and turn it into cleaner Uzbek Latin text.

About

This model addresses a specific problem: many older Uzbek books survive only as scans of printed Cyrillic editions, and raw OCR from those scans is often hard to read, hard to search, and hard to reuse. Since Latin is the official script of Uzbekistan today, this model is intended to turn damaged OCR output into usable Latin Uzbek text and expand the body of accessible Latin-script books.

This is the OCR-specialized variant of the rubai-corrector family.

  • rubai-corrector-base: general fine-tuning base
  • rubai-corrector-transcript-uz: ASR transcript display cleanup
  • rubai-corrector-ocr-books-uz (this model): old-book OCR correction

What It Is Good At

Use this model when your input looks like OCR from scanned Uzbek books, especially older Cyrillic books.

It is good at:

  • Cyrillic Uzbek -> Latin Uzbek conversion
  • OCR letter damage repair
  • apostrophe and quote normalization
  • fixing split or merged words caused by OCR
  • restoring cleaner punctuation and hyphenation
  • preserving multiline text reasonably well for poetry and book passages

Typical OCR issues it can repair:

  • савод чнқаришга -> savod chiqarishga
  • кйлдан -> qo'ldan
  • козон-позон -> qozon-pozon
  • суриштнриб -> surishtirib
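The substitutions above can double as a tiny smoke test for the model. A minimal sketch (the pairs come from the examples in this card; the `EXPECTED_FIXES` name and `check_output` helper are my own, for illustration only):

```python
# Known OCR damage patterns and their corrected Latin forms,
# taken from the examples in this model card.
EXPECTED_FIXES = {
    "савод чнқаришга": "savod chiqarishga",
    "кйлдан": "qo'ldan",
    "козон-позон": "qozon-pozon",
    "суриштнриб": "surishtirib",
}


def check_output(corrected_text: str, damaged_fragment: str) -> bool:
    """Return True if the expected fix for a damaged OCR fragment
    appears in the model's corrected output."""
    return EXPECTED_FIXES[damaged_fragment] in corrected_text


# Example: verify that a model output contains the repaired fragment.
sample_output = "fursatni qo'ldan bermas edi"
print(check_output(sample_output, "кйлдан"))  # True
```

A loop over such pairs against real model outputs gives a quick regression check when re-running the model on new scans.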

It is not the right model for:

  • ASR transcript display polishing
  • Russian recovery from Latin ASR text
  • generic text denormalization outside the book-OCR domain

Base And Data

  • Base model: rubai-corrector-base
  • Fine-tuned on a curated OCR correction dataset built from scanned Uzbek books
  • Fine-tuning train split size: 10,730 pairs
  • Full prepared dataset size: 12,642 pairs from 54 books
  • Each training pair maps noisy OCR text to corrected Uzbek Latin text
  • The dataset includes multiple OCR views of sampled book content and a normalized Latin target style
  • Targets were standardized to keep apostrophes, quotes, and dashes consistent
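A training pair in the source-to-target shape described above might look like the following. This is a hypothetical sketch: the field names (`source`, `target`) and the placement of the `correct: ` prefix are assumptions; the text itself comes from this card's first example.

```python
# Hypothetical training pair: field names and prefix placement are
# assumptions for illustration, not the actual dataset schema.
pair = {
    "source": "correct: Бошим айланиб, кўз олдим айланади.",
    "target": "Boshim aylanib, ko'z oldim aylanadi.",
}

assert pair["source"].startswith("correct: ")
print(pair["target"])  # Boshim aylanib, ko'z oldim aylanadi.
```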

Quick Start

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "islomov/rubai-corrector-ocr-books-uz"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Noisy Cyrillic OCR text; the "correct: " task prefix is required.
text = "Бошим айланиб, кўз олдим айланади. Ана шаҳар, ҳамма қўлида алифбе китоби, савод чнқаришга туртинмоқдалар."
inputs = tokenizer(f"correct: {text}", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=512)
prediction = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(prediction)

Expected output:

Boshim aylanib, ko'z oldim aylanadi. Ana shahar, hamma qo'lida alifbe kitobi, savod chiqarishga turtinmoqdalar.

The model expects the correct: prefix at inference time.
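Because the quick-start snippet generates at most 512 new tokens, long book pages may need to be split before correction. A minimal paragraph-level chunking sketch (the `build_prompts` helper and the character budget are my own assumptions, not part of the model's API):

```python
def build_prompts(page_text: str, max_chars: int = 1500) -> list[str]:
    """Split a long OCR page into paragraph-sized chunks and add the
    required "correct: " prefix to each. The chunk size is a rough
    character budget, not an exact token count."""
    chunks, current = [], ""
    for para in page_text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return [f"correct: {c}" for c in chunks]


# Two paragraphs with a tiny budget -> one prompt per paragraph.
prompts = build_prompts("Биринчи абзац.\n\nИккинчи абзац.", max_chars=10)
print(prompts)
```

Each prompt can then be fed through `model.generate` exactly as in the quick start, and the outputs rejoined with blank lines.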

Real Example Outputs

These examples are taken from the actual held-out OCR evaluation for this checkpoint.

OCR damage: broken letters

Input:     Бошим айланиб, кўз олдим айланади. Ана шаҳар, ҳамма қўлида алифбе китоби, савод чнқаришга туртинмоқдалар.
Output:    Boshim aylanib, ko'z oldim aylanadi. Ana shahar, hamma qo'lida alifbe kitobi, savod chiqarishga turtinmoqdalar.

Corrects:

  • чнқаришга -> chiqarishga
  • full Cyrillic -> Latin conversion
  • apostrophe restoration in ko'z, qo'l

OCR damage: broken words

Input:     уриб турмаса кўнгли жойига тушмас, каерга борса шундан сўз очишни, кимни кўрмасин шу хакда бирон янги гап эшитишни, кандай мажлнс булмасин шу тўгрида масала кўйилишини хохлар ва ўнг келганда фурсатни кйлдан бермас эди.
Output:    urib turmasa ko'ngli joyiga tushmas, qayerga borsa shundan so'z ochishni, kimni ko'rmasin shu haqda biron yangi gap eshitishni, qanday majlis bo'lmasin shu to'g'rida masala qo'yilishini xohlar va o'ng kelganda fursatni qo'ldan bermas edi.

Corrects:

  • каерга -> qayerga
  • мажлнс -> majlis
  • кйлдан -> qo'ldan
  • apostrophes across the line
  • multiple OCR-damaged words in one sentence

OCR damage: mixed lexical corruption

Input:     "Сотасанми?", дедим, "Сотаман", деди. Пул билан савдо келишмагандан кейин, ош билан айирбош килмокчи бўлди. Кўндим. Ошни козон-позон, ўчок дамгирлари билан шу унга алишдим.
Output:    "Sotasanmi?", dedim, "Sotaman", dedi. Pul bilan savdo kelishmagandan keyin, osh bilan ayirbosh qilmoqchi bo'ldi. Ko'ndim. Oshni qozon-pozon, o'choq damgirlari bilan shu unga alishdim.

Corrects:

  • килмокчи -> qilmoqchi
  • козон-позон -> qozon-pozon
  • ўчок -> o'choq
  • quote and apostrophe normalization

OCR damage: names and rare words

Input:     Малика овозаларнинг қанчалик ростлигини Бобир саройида хизматда юрган невараси Баҳодирдан суриштнриб билмоҳчи бўлди.
Output:    Malika ovozalarning qanchalik rostligini Bobir saroyida xizmatda yurgan nevarasi Bahodirdan surishtirib bilmoqchi bo'ldi.

Corrects:

  • суриштнриб -> surishtirib
  • named entities and historical names
  • full Cyrillic -> Latin conversion without losing meaning

OCR damage: poetry and layout

Input:     Айт-чн, йигит, Кимни ўйладинг, Куйлаганинг ким эди, Айт, ким? Ким ўртади ҳажрида сени, Кимга орзунг бўлмоқ муяссар?
Output:    Ayt-chi, yigit,
           Kimni o'ylading,
           Kuylaganing kim edi,
           Ayt, kim?
           Kim o'rtadi hajrida seni,
           Kimga orzung bo'lmoq muyassar?

Corrects:

  • Айт-чн -> Ayt-chi
  • line segmentation for verse-like text
  • apostrophes in o'ylading, o'rtadi

Cyrillic book text -> clean Latin

Input:     Фронтдан эсиб турган ел охирги мартаба ўлароқ ўзи билан бомба, шрапнель жаҳаннам машиналари ва оғир тўплардан портлаган порохларнинг заҳар тутунларини улаштириб, Оқ денгизнинг зилол қўйнига тўкди.
Output:    Frontdan esib turgan yel oxirgi martaba o'laroq o'zi bilan bomba, shrapnel jahannam mashinalari va og'ir to'plardan portlagan poroxlarning zahar tutunlarini ulashtirib, Oq dengizning zilol qo'yniga to'kdi.

Corrects:

  • long continuous Cyrillic prose
  • apostrophes in o'laroq, og'ir, qo'yniga, to'kdi
  • punctuation and readability preserved across a long sentence

Challenging example: many OCR fixes in one long passage

Input:     Уни хар кандай от хам бир козон кайновича муддатдан ортик кўтариб юролмас, унга савашда тенг келгувчи топилмас эди, ана шу ботирлар ботирини охири Жалолиддинга рўбарў килди. "Дев" ўз одатича чинкириб ракибига ташланди, аммо дам ўтмай овози ўчди.
Output:    Uni har qanday ot ham bir qozon qaynovicha muddatdan ortiq ko'tarib yurolmas, unga savashda teng kelguvchi topilmas edi, ana shu botirlar botirini oxiri Jaloliddinga ro'baro' qildi. "Dev" o'z odaticha chinqirib raqibiga tashlandi, ammo dam o'tmay ovozi o'chdi.

Corrects:

  • хар -> har
  • кандай -> qanday
  • козон -> qozon
  • ортик -> ortiq
  • чинкириб -> chinqirib
  • ракибига -> raqibiga
  • long multi-clause OCR text with quotes and apostrophes preserved

Why This Model Matters

For old Uzbek books, the hard part is often not just script conversion. It is OCR damage correction:

  • dropped or corrupted letters
  • broken apostrophes
  • bad word boundaries
  • damaged names
  • noisy punctuation
  • messy multiline book text

This model is designed for that exact problem.

Practical Limits

It still struggles more on:

  • title pages
  • bibliographic blocks
  • repeated headers and page furniture
  • glossary and footnote-heavy lines
  • extremely damaged OCR with severe spacing noise

For those cases, manual review is still recommended.

Files

  • test_model.py: runnable local example script

Acknowledgements

Special thanks to Davron Ibrokhimov for sponsoring this work and making it possible to keep these models open.

Thank you to the community that supports Uzbek language technology. In particular:

  • MetaSell for support and resources
  • Kotib for their support and collaboration on Uzbek STT
  • Global Move for backing open Uzbek NLP work

Thanks to Arofat, Gulimshaxnoz, and many others who contributed in ways big and small. The list is too long to fit here, but every contribution matters and is appreciated.

Support my works and open-source movement: https://tirikchilik.uz/islomovs

Model size: 0.3B parameters (F32, Safetensors)