Model description
Model Name: cyrillic-large-handwritten
Model Version: cyrillic-large-fromsynthetic-freezedec-v5-dinov2
Model Type: Transformer-based encoder-decoder for OCR
Base Model: Kansallisarkisto/cyrillic-large-stage1
Purpose: Handwritten text recognition
Languages: Russian
License: Apache 2.0
This is a fine-tuned model for Cyrillic handwriting recognition, trained with computing resources generously provided by CSC – IT Center for Science on the LUMI supercomputer.
This model was developed in the ArchXAI project funded by the Central Baltic Programme.
Model Architecture
The model is based on a Transformer architecture with an encoder-decoder setup, similar to TrOCR from Li et al. (2023):
- The encoder processes an image of a single line of text into a sequence of hidden states.
- The decoder attends to the hidden states (including the CLS token) from the encoder using cross-attention, to generate the corresponding text output.
This model is a fine-tuned version of our base model, intended for handwritten text recognition in primarily Russian-language historical documents. The decoder layers (including the cross-attention layers) were frozen for the first 11 epochs of fine-tuning, so that the base model's encoder could adapt to handwriting while the decoder's language-modeling capabilities were preserved.
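The freezing schedule described above can be sketched as follows. This is a minimal illustration, not the project's training code: `TinyEncoderDecoder` is a toy stand-in for the actual VisionEncoderDecoderModel, and `set_decoder_frozen` is a hypothetical helper; only the attribute names `encoder`/`decoder` mirror the real model.

```python
import torch
from torch import nn

# Toy encoder-decoder standing in for the real VisionEncoderDecoderModel.
class TinyEncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.decoder = nn.Linear(8, 8)

def set_decoder_frozen(model: nn.Module, frozen: bool) -> None:
    # Freezing a module means excluding its parameters from gradient updates.
    for p in model.decoder.parameters():
        p.requires_grad = not frozen

model = TinyEncoderDecoder()
n_frozen_epochs = 11
for epoch in range(80):
    # Decoder frozen for epochs 0..10, unfrozen from epoch 11 onward.
    set_decoder_frozen(model, frozen=epoch < n_frozen_epochs)
    # ... one training epoch would run here ...
```

In a real run, the optimizer should either be rebuilt or filter on `requires_grad` when the decoder is unfrozen.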
Intended Use
- Document digitization (e.g., archival work, historical manuscripts)
- Handwritten notes transcription
Training data
The training data consists of human-annotated text lines, mainly handwritten, from historical documents in the collections of the national archives of Estonia, Finland, and Latvia, with some printed and typed text lines included. Certain character normalizations were applied to the labels only, to enforce our guidelines for human annotations (the same normalization was also applied in pre-training for the base model):
# Map the pre-reform letters yat (ѣ) and decimal i (і) to their modern equivalents
trans = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И"
})
text = text.translate(trans)
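For example, the normalization maps a pre-reform spelling to its modern form (the sample word below is purely illustrative):

```python
# Same mapping as used for label normalization above
trans = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И"
})

# Pre-reform spelling of "снег" ("snow") using yat
print("снѣгъ".translate(trans))  # -> снегъ
```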
Training set: 56615 text lines
Validation set: 7052 text lines
Test set: 7320 text lines
Evaluation
The following metrics were calculated on the test set (in-domain evaluation) using the evaluate library with default settings:
CER (character error rate): 0.0406
WER (word error rate): 0.1732
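For reference, character error rate is the total number of character-level edits (insertions, deletions, substitutions) divided by the total number of reference characters. The sketch below is a self-contained illustration of the metric's definition, not the `evaluate` library implementation used for the numbers above:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(predictions, references):
    # CER = total character edits / total reference characters.
    edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return edits / sum(len(r) for r in references)

print(cer(["снегъ"], ["снигъ"]))  # one substitution over five characters -> 0.2
```

WER is defined analogously over whitespace-separated words instead of characters.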
Used Hyperparameters
Train batch size per device: 8
Number of devices: 256
Learning rate: 5e-5
Warmup ratio: 0.05
Scheduler: linear
Optimizer: AdamW
Number of epochs: 80
First n epochs where decoder was frozen: 11
FP16 mixed precision training: False
Input image size: 182 x 1022
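From these settings the effective batch size and approximate warmup length can be derived. This is a back-of-the-envelope sketch; the actual step counts depend on dataloader details (e.g. whether the last partial batch is dropped) not stated here:

```python
import math

train_lines = 56615       # training set size from above
per_device_batch = 8
devices = 256
epochs = 80
warmup_ratio = 0.05

effective_batch = per_device_batch * devices                 # lines per optimizer step
steps_per_epoch = math.ceil(train_lines / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
# -> 2048 28 2240 112
```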
How to Use the Model
You can use the model for inference by loading the processor and model.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

class TrOCRProcessorCustom(TrOCRProcessor):
    # Minimal constructor that skips the base-class __init__ checks
    def __init__(self, image_processor, tokenizer):
        self.image_processor = image_processor
        self.tokenizer = tokenizer
        self.current_processor = self.image_processor
        self.chat_template = None

processor = TrOCRProcessorCustom.from_pretrained("Kansallisarkisto/cyrillic-large-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("Kansallisarkisto/cyrillic-large-handwritten")

# Open an image of a single line of handwritten text
image = Image.open("path_to_image.jpg").convert("RGB")

# Preprocess and predict
pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
Limitations and Biases
The model was trained on human-annotated text lines written in the Cyrillic script. It has not been trained on significant amounts of text in other scripts, such as Latin or Chinese, or other writing systems such as Arabic or Hebrew, and it may not generalize well to languages other than Russian.
This model is a fine-tuned version of cyrillic-large-stage1 and thus may inherit some of its limitations and biases. Out-of-domain generalization is heavily affected by the training data composition and configuration of the pre-training and fine-tuning.
Future Work
Potential improvements for this model include:
- Expanding training data: incorporating more ground truth data.
- Optimizing for specific domains: fine-tuning the model on domain-specific handwriting.
- Out-of-domain generalization: studying how pre-training and fine-tuning could be optimized to maximize out-of-domain generalization of the fine-tuned model.
Citation
If you use this model in your work, please cite it as:
@misc{cyrillic-large-handwritten,
author = {Kansallisarkisto},
title = {NAF Cyrillic OCR Model: fine-tuned handwriting checkpoint},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Kansallisarkisto/cyrillic-large-handwritten/}},
}
References
Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z. and Wei, F. 2023. TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models. Proceedings of the AAAI Conference on Artificial Intelligence. 37, 11 (Jun. 2023), 13094-13102. DOI:https://doi.org/10.1609/aaai.v37i11.26538.
Model Card Authors
Author: Kansallisarkisto
Contact Information: john.makela@kansallisarkisto.fi, ilkka.jokipii@kansallisarkisto.fi
