# Jawi OCR — Kraken Recognition Model
A Kraken recognition model for historical Jawi (Arabic-script Malay) text, fine-tuned for newspaper material from 1930s–1960s Singapore and Malaya. Developed by culturalheritagenus.
## Model Description
This model recognises printed Jawi script — the Arabic-alphabet writing system historically used for Malay — as found in newspapers such as Utusan Melayu (1960s) and Warta Malaya (1930s) from the National Library of Singapore. It handles the extended Jawi character set (including ڬ, چ, ڠ, ڽ, ݢ and other characters not present in standard Arabic), as well as embedded Latin text, Eastern Arabic numerals, and common punctuation.
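For reference, the Jawi-specific letters are ordinary codepoints in Unicode's Arabic blocks, which is what allows them to be added to the model's output alphabet during fine-tuning. A quick way to inspect them:

```python
# Print the codepoint of each Jawi-specific letter; all of them live in
# the Arabic / Arabic Supplement Unicode blocks.
for ch in "ڬچڠڽݢ":
    print(f"{ch}  U+{ord(ch):04X}")
```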
The model operates on individual line images: it expects a pre-cropped image containing a single line of text and returns the recognised characters with per-character coordinates and confidence scores.
## Performance
| Split | CER |
|---|---|
| Validation | 2.7% |
| Hold-out test | 3.7% |
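CER is the character error rate: the Levenshtein (edit) distance between the model's output and the reference transcription, divided by the reference length. A minimal illustration (this helper is a sketch, not the evaluation code behind the numbers above):

```python
def cer(reference: str, prediction: str) -> float:
    """Character error rate: edit distance / reference length."""
    m, n = len(reference), len(prediction)
    dp = list(range(n + 1))  # single-row Levenshtein DP table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,      # deletion
                dp[j - 1] + 1,  # insertion
                prev + (reference[i - 1] != prediction[j - 1]),  # substitution
            )
            prev = cur
    return dp[n] / m

print(cer("ملايو", "ملابو"))  # one substitution over five characters → 0.2
```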
## Training
The model was fine-tuned from the Kraken printed Arabic base model (DOI: 10.5281/zenodo.7050270), which achieves roughly 20% CER on this material out of the box. Fine-tuning used `resize='both'` to extend the output layer with Jawi-specific characters absent from the base model's alphabet.
Training data consisted of approximately 1,800 lines total: 1,600 gold-standard lines from Utusan Melayu and Warta Malaya (each line independently transcribed by two annotators, with disagreements resolved through normalisation or manual adjudication), plus 200 synthetic lines generated from digital Jawi text and typefaces.
Hyperparameters: 100 epochs, learning rate 0.0001, data augmentation enabled. Trained on a single NVIDIA H100 GPU for approximately 10 hours.
## Usage

This is a Kraken model, not a Hugging Face `transformers` model. Use it via the Kraken library, with `huggingface_hub` to download the weights.
### Installation

```shell
pip install "kraken>=6.0.0,<7.0.0" huggingface_hub Pillow
```
### Recognising a Single Line Image

```python
from huggingface_hub import hf_hub_download
from PIL import Image

from kraken import rpred
from kraken.containers import BaselineLine, Segmentation
from kraken.lib import models

# Download the model from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="culturalheritagenus/jawi-ocr",  # replace with actual repo ID
    filename="jawi_rec.mlmodel",             # replace with actual filename
)

# Load the Kraken recognition model
rec_model = models.load_any(model_path)

# Open a pre-cropped single-line image
im = Image.open("line_image.png")
if im.mode not in ("RGB", "L", "1"):
    im = im.convert("RGB")
w, h = im.size

# Construct a synthetic single-line segmentation container.
# The baseline sits at ~75% of the image height (appropriate for
# Arabic-script text, where characters hang from above).
# The boundary covers the full image, inset by 1 px to avoid edge rejection.
baseline_y = int(h * 0.75)
line = BaselineLine(
    id="line_0",
    baseline=[(1, baseline_y), (w - 1, baseline_y)],
    boundary=[(1, 1), (w - 1, 1), (w - 1, h - 1), (1, h - 1), (1, 1)],
    text=None,
    base_dir="R",
)
seg = Segmentation(
    type="baselines",
    imagename="",
    text_direction="horizontal-rl",
    script_detection=False,
    lines=[line],
    regions={},
    line_orders=[],
)

# Run recognition with LTR BiDi base direction
pred_it = rpred.rpred(rec_model, im, seg, bidi_reordering="L")
for record in pred_it:
    print(record.prediction)
    for char, polygon, conf in zip(record.prediction, record.cuts, record.confidences):
        print(f"  {char} conf={conf:.3f} polygon={polygon}")
```
### Output

The `record` object contains three aligned lists: `record.prediction` (the full recognised string), `record.cuts` (per-character bounding polygons in pixel coordinates relative to the input image), and `record.confidences` (per-character confidence values between 0 and 1).
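One practical use of the aligned lists is flagging low-confidence characters for manual review. A sketch with stand-in values (a real `record` would supply these):

```python
# Stand-in values mimicking a record's aligned prediction/confidence lists
prediction = "ملايو"
confidences = [0.99, 0.42, 0.97, 0.95, 0.88]

# Collect (index, character, confidence) for characters below a review threshold
THRESHOLD = 0.80
suspect = [
    (i, ch, conf)
    for i, (ch, conf) in enumerate(zip(prediction, confidences))
    if conf < THRESHOLD
]
print(suspect)  # → [(1, 'ل', 0.42)]
```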
The text-direction settings deserve a brief explanation. The segmentation container uses `text_direction="horizontal-rl"` and `base_dir="R"` because these control how Kraken feeds the line image to the neural network: the LSTM must process the pixels right-to-left for Arabic-script text. The `bidi_reordering="L"` parameter is a separate post-processing step that reorders the output string using the Unicode BiDi algorithm with LTR as the paragraph base direction, so that Arabic/Jawi runs appear as embedded RTL spans. This makes the output suitable for direct display in any Unicode-aware environment.
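The embedded-RTL behaviour relies on the characters the model emits already carrying right-to-left Unicode BiDi categories; any BiDi-aware renderer does the rest. This can be checked with the standard library:

```python
import unicodedata

# BiDi categories: 'AL' = Arabic letter (right-to-left),
# 'AN' = Arabic number, 'L' = left-to-right letter
for ch in "چڠ١A":
    print(ch, unicodedata.bidirectional(ch))
```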
## Segmentation
This model performs recognition only — it does not segment pages into lines. You must provide pre-cropped single-line images. For full-page OCR, pair this model with a segmentation step (e.g. Kraken's baseline segmenter, YOLO, or Surya) to extract line images first.
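Whichever segmenter is used, its boundary polygons can be reduced to crop boxes for this model in a few lines. The `polygon_bbox` helper below is a hypothetical sketch; its result can be passed straight to Pillow's `Image.crop`:

```python
def polygon_bbox(boundary):
    """Axis-aligned bounding box (left, top, right, bottom) of a line polygon."""
    xs = [x for x, _ in boundary]
    ys = [y for _, y in boundary]
    return (min(xs), min(ys), max(xs), max(ys))

# A boundary polygon roughly as a segmenter might emit it
box = polygon_bbox([(12, 40), (880, 38), (884, 92), (10, 95)])
print(box)  # → (10, 38, 884, 95); pass this to Image.crop()
```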
## Fine-tuning

To fine-tune this model on your own Jawi data, download the `.mlmodel` file and pass it as the base model to Kraken's training API:
```python
from kraken.lib.train import KrakenTrainer, RecognitionModel

model = RecognitionModel(
    training_data=train_files,   # list of PAGE XML paths
    evaluation_data=eval_files,
    format_type="page",
    resize="both",               # extend the alphabet for new characters
    model=model_path,            # path to the downloaded .mlmodel
    hyper_params={"augment": True},
)
trainer = KrakenTrainer(accelerator="gpu", devices=1)
trainer.fit(model)
```
See the Kraken training documentation for details on preparing PAGE XML ground truth.
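For orientation, a PAGE XML ground-truth file pairs each line's coordinates and baseline with its transcription. A minimal illustrative fragment (element values are placeholders, not real data):

```xml
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_001.png" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r0">
      <Coords points="100,180 1900,180 1900,280 100,280"/>
      <TextLine id="r0_l0">
        <Coords points="100,180 1900,180 1900,280 100,280"/>
        <Baseline points="100,255 1900,255"/>
        <TextEquiv><Unicode>ملايو</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
```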
## Repository Contents

| File | Description |
|---|---|
| `*.mlmodel` | Kraken model file (neural network weights, codec, and metadata) |
## Limitations
The model was trained on printed Jawi newspaper text from a specific era and set of publications. Performance may degrade on handwritten text, significantly different typefaces, or heavily degraded/blurry images. The character set covers standard Jawi plus common Arabic and Latin characters found in Malayan newspapers, but rare characters or unusual orthographic conventions may not be recognised.
Since the model operates on individual lines, any errors in the upstream segmentation step (line detection, baseline placement, boundary polygon extraction) will directly affect recognition quality.
## Acknowledgements
Developed by culturalheritagenus. Source material from the National Library of Singapore. Built with Kraken (v6.x), developed at Inria and the École Pratique des Hautes Études, Université PSL.