Model description
Model Name: cyrillic-large-handwritten
Model Version: cyrillic-large-fromsynthetic-freezedec-v5-dinov2
Model Type: Transformer-based encoder-decoder for OCR
Base Model: Kansallisarkisto/cyrillic-large-stage1
Purpose: Handwritten text recognition
Languages: Russian
License: Apache 2.0
This is a fine-tuned model for Cyrillic handwriting recognition, trained with computing resources generously provided by CSC – IT Center for Science on the LUMI supercomputer.
This model was developed in the ArchXAI project funded by the Central Baltic Programme.
Model Architecture
The model is based on a Transformer architecture with an encoder-decoder setup, similar to TrOCR from Li et al. (2023):
- The encoder processes an image of a single line of text into a sequence of hidden states.
- The decoder attends to the hidden states (including the CLS token) from the encoder using cross-attention, to generate the corresponding text output.
This model is a fine-tuned version of our base model, intended for handwritten text recognition in primarily Russian-language historical documents. The decoder layers (including the cross-attention layers) were frozen for the first 11 epochs of fine-tuning, so that the base model's encoder could adapt to handwriting while the decoder's language-modeling capabilities were preserved.
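The freezing schedule described above can be sketched as follows. This is a minimal illustration, not the project's training code: `TinyEncoderDecoder` is a toy stand-in for the actual VisionEncoderDecoderModel, and `set_decoder_frozen` is a hypothetical helper; only the attribute names `encoder`/`decoder` mirror the real model.

```python
import torch
from torch import nn

# Toy encoder-decoder standing in for the real VisionEncoderDecoderModel.
class TinyEncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.decoder = nn.Linear(8, 8)

def set_decoder_frozen(model: nn.Module, frozen: bool) -> None:
    # Freezing a module means excluding its parameters from gradient updates.
    for p in model.decoder.parameters():
        p.requires_grad = not frozen

model = TinyEncoderDecoder()
n_frozen_epochs = 11
for epoch in range(80):
    # Decoder frozen for epochs 0..10, unfrozen from epoch 11 onward.
    set_decoder_frozen(model, frozen=epoch < n_frozen_epochs)
    # ... one training epoch would run here ...
```

In a real run, the optimizer should either be rebuilt or filter on `requires_grad` when the decoder is unfrozen.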
Intended Use
- Document digitization (e.g., archival work, historical manuscripts)
- Handwritten notes transcription
Training data
The training data consists of human-annotated text lines, mainly handwritten, from historical documents in the collections of the national archives of Estonia, Finland, and Latvia, with some printed and typed text lines included. Certain character normalizations were applied to the labels only, to enforce our guidelines for human annotations (the same normalization was also applied in pre-training for the base model):
# Map the pre-reform letters yat (ѣ) and decimal i (і) to their modern equivalents
trans = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И"
})
text = text.translate(trans)
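For example, the normalization maps a pre-reform spelling to its modern form (the sample word below is purely illustrative):

```python
# Same mapping as used for label normalization above
trans = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И"
})

# Pre-reform spelling of "снег" ("snow") using yat
print("снѣгъ".translate(trans))  # -> снегъ
```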
Training set: 56615 text lines
Validation set: 7052 text lines
Test set: 7320 text lines
Evaluation
The following metrics were calculated on the test set (in-domain evaluation) using the evaluate library with default settings:
CER (character error rate): 0.0406
WER (word error rate): 0.1732
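For reference, character error rate is the total number of character-level edits (insertions, deletions, substitutions) divided by the total number of reference characters. The sketch below is a self-contained illustration of the metric's definition, not the `evaluate` library implementation used for the numbers above:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(predictions, references):
    # CER = total character edits / total reference characters.
    edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return edits / sum(len(r) for r in references)

print(cer(["снегъ"], ["снигъ"]))  # one substitution over five characters -> 0.2
```

WER is defined analogously over whitespace-separated words instead of characters.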
Used Hyperparameters
Train batch size per device: 8
Number of devices: 256
Learning rate: 5e-5
Warmup ratio: 0.05
Scheduler: linear
Optimizer: AdamW
Number of epochs: 80
First n epochs where decoder was frozen: 11
FP16 mixed precision training: False
Input image size: 182 x 1022
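From these settings the effective batch size and approximate warmup length can be derived. This is a back-of-the-envelope sketch; the actual step counts depend on dataloader details (e.g. whether the last partial batch is dropped) not stated here:

```python
import math

train_lines = 56615       # training set size from above
per_device_batch = 8
devices = 256
epochs = 80
warmup_ratio = 0.05

effective_batch = per_device_batch * devices                 # lines per optimizer step
steps_per_epoch = math.ceil(train_lines / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
# -> 2048 28 2240 112
```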
How to Use the Model
You can use the model for inference by loading the processor and model.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

class TrOCRProcessorCustom(TrOCRProcessor):
    # Minimal constructor that skips the base-class __init__ checks
    def __init__(self, image_processor, tokenizer):
        self.image_processor = image_processor
        self.tokenizer = tokenizer
        self.current_processor = self.image_processor
        self.chat_template = None

processor = TrOCRProcessorCustom.from_pretrained("Kansallisarkisto/cyrillic-large-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("Kansallisarkisto/cyrillic-large-handwritten")

# Open an image of a single line of handwritten text
image = Image.open("path_to_image.jpg").convert("RGB")

# Preprocess and predict
pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
Limitations and Biases
The model was trained on human-annotated text lines written in the Cyrillic script. It has not been trained on significant amounts of text in other scripts, such as Latin or Chinese, or other writing systems such as Arabic or Hebrew, and it may not generalize well to languages other than Russian.
This model is a fine-tuned version of cyrillic-large-stage1 and thus may inherit some of its limitations and biases. Out-of-domain generalization is heavily affected by the training data composition and configuration of the pre-training and fine-tuning.
Future Work
Potential improvements for this model include:
- Expanding training data: incorporating more ground truth data.
- Optimizing for specific domains: fine-tuning the model on domain-specific handwriting.
- Out-of-domain generalization: studying how pre-training and fine-tuning could be optimized to maximize out-of-domain generalization of the fine-tuned model.
Citation
If you use this model in your work, please cite it as:
@misc{cyrillic-large-handwritten,
author = {Kansallisarkisto},
title = {NAF Cyrillic OCR Model: fine-tuned handwriting checkpoint},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Kansallisarkisto/cyrillic-large-handwritten/}},
}
References
Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z. and Wei, F. 2023. TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models. Proceedings of the AAAI Conference on Artificial Intelligence. 37, 11 (Jun. 2023), 13094-13102. DOI:https://doi.org/10.1609/aaai.v37i11.26538.
Model Card Authors
Author: Kansallisarkisto
Contact Information: john.makela@kansallisarkisto.fi, ilkka.jokipii@kansallisarkisto.fi
