Model description
Model Name: cyrillic-large-stage1
Model Version: proper-cyrillic-synthetic-rurobertalarge-138mil-32node-5e05-dinov2large
Model Type: Transformer-based encoder-decoder for OCR
Purpose: Printed, typed and handwritten text recognition
Languages: Russian
License: Apache 2.0
This is a pre-trained base model meant for fine-tuning application-specific text recognition models for languages written in the Cyrillic alphabet, especially Russian. It has been trained on over 138 million synthetically generated text line images based on historical documents and contemporary web corpora, with computing resources generously provided by CSC – IT Center for Science on the LUMI supercomputer.
This model was developed in the ArchXAI project funded by the Central Baltic Programme.
Model Architecture
The model is based on a Transformer architecture with an encoder-decoder setup, similar to TrOCR from Li et al. (2023):
- The encoder processes an image of a single line of text into a sequence of hidden states.
- The decoder attends to the hidden states (including the CLS token) from the encoder using cross-attention, to generate the corresponding text output.
The encoder was initialized from DINOv2 (Oquab et al., 2023) and the decoder was initialized from the last 12 layers of ruRoBERTa (Zmitrovich et al., 2024). The cross-attention layers between the encoder and decoder were initialized randomly.
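As a back-of-the-envelope check of the encoder output the decoder attends to: DINOv2 uses a ViT patch size of 14, and the 182 × 1022 training input divides evenly into patches (the patch size here is the standard DINOv2 value, not something stated in this card):

```python
# Sketch: hidden-state sequence length the DINOv2 encoder produces
# for one text line image, assuming the standard patch size of 14.
patch = 14
height, width = 182, 1022          # training input size (H x W)

patches_h = height // patch        # 13 patch rows
patches_w = width // patch         # 73 patch columns
num_patches = patches_h * patches_w

# +1 for the CLS token, which the decoder also attends to
seq_len = num_patches + 1

print(patches_h, patches_w, seq_len)  # 13 73 950
```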
Intended Use
This model is suitable as a base model for producing fine-tuned models for applications such as:
- Document digitization (e.g., archival work, historical manuscripts)
- Handwritten notes transcription
- Optical character recognition
Similarly to how the pre-trained English-language TrOCR checkpoints can be fine-tuned for other languages, this Russian-language checkpoint could be fine-tuned for other languages written in Cyrillic script.
Training data
The training images were synthesized on-the-fly during training. The images were created by a custom text line renderer (will be released later), using a curated collection of 898 Cyrillic font families under open-source or other suitable licenses and a large Russian text-only dataset consisting of historical documents and publications (primarily 18th to 20th century) and contemporary web corpora. The font collection consists of mostly print fonts, with a small number of typewriter and handwriting fonts included.
The training data includes a mix of both pre-1917 Imperial Russian and modern Russian orthography, so the model should be suitable for downstream applications in both. The texts in pre-1917 orthography and the Soviet-era documents are mostly OCR results rather than human transcriptions. A large proportion of the pre-1917 OCR texts have been spell-checked using the rule-based ru-petr1708-hunspell-3.1 package.
Certain character normalizations were made to the labels only (but not the training images) due to specific requirements of downstream HTR fine-tuning:
# Map pre-reform characters to their modern equivalents in the labels only
trans = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И"
})
text_for_label = text_for_image.translate(trans)
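For example, the mapping converts pre-reform spellings in the label while the rendered image keeps the original characters (the words below are illustrative pre-reform spellings):

```python
# Same mapping as above, applied to two pre-reform spellings
trans = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И"
})

print("лѣсъ".translate(trans))  # лесъ
print("міръ".translate(trans))  # миръ
```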
Total training set: 138 589 047 synthetic text line images
Evaluation
Because this is a pre-trained model meant to be fine-tuned for downstream tasks, test metrics on the synthetic data were not calculated. The character error rate (CER) on a small 10 000-line validation set, used to track model convergence during training, was approximately 0.8% at the end of training. More meaningful test metrics can be calculated on models that have been fine-tuned for downstream tasks.
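For reference, CER is the character-level edit distance divided by the reference length. A minimal pure-Python version is sketched below; in practice a library such as jiwer or evaluate would be used instead:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m

print(cer("привет", "привт"))  # one deletion over six characters ≈ 0.167
```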
Used Hyperparameters
Train batch size per device: 8
Number of devices: 256
Learning rate: 5e-5
Warmup ratio: 0.005
Scheduler: linear
Optimizer: AdamW
Number of epochs: 1
FP16 mixed precision training: False
Input image size: 182 x 1022
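Under these settings, and assuming no gradient accumulation, the effective batch size and step counts work out as follows (Transformers computes warmup steps as the ceiling of ratio × total steps):

```python
import math

per_device_batch = 8
num_devices = 256
effective_batch = per_device_batch * num_devices      # 2048 lines per optimizer step

train_lines = 138_589_047
steps_per_epoch = math.ceil(train_lines / effective_batch)

warmup_ratio = 0.005
warmup_steps = math.ceil(steps_per_epoch * warmup_ratio)

print(effective_batch, steps_per_epoch, warmup_steps)
```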
How to Use the Model
You can fine-tune the model on your own dataset. Note that some details differ from TrOCR due to using a different encoder (DINOv2). Consider freezing the decoder for the first few epochs of fine-tuning.
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments, TrOCRProcessor,
                          Trainer, TrainerCallback, TrainerControl, TrainerState,
                          TrainingArguments, VisionEncoderDecoderModel,
                          default_data_collator)
import argparse
import torch

class TrOCRProcessorCustom(TrOCRProcessor):
    def __init__(self, image_processor, tokenizer):
        # Skip TrOCRProcessor.__init__ so the DINOv2 image processor
        # can be paired with the ruRoBERTa tokenizer.
        self.image_processor = image_processor
        self.tokenizer = tokenizer
        self.current_processor = self.image_processor
        self.chat_template = None
# various arguments used below
args = ...
# possible to modify image size here
# we disable center crop for DINOv2 image processor
processor = TrOCRProcessorCustom.from_pretrained(
    args.model_path,
    use_fast=True,
    do_resize=True,
    do_center_crop=False,
    size={'height': args.img_height, 'width': args.img_width},
)
# construct training and validation sets here
# using pixel values from processor and labels from tokenizer
train_dataset = ...
test_dataset = ...
model = VisionEncoderDecoderModel.from_pretrained(args.model_path)
device = torch.device(args.device if torch.cuda.is_available() else "cpu")
model.to(device)
training_args = ...
# Metrics e.g. CER (for model selection) and WER
def compute_metrics(pred):
    ...
class FreezingCallback(TrainerCallback):
    """Freeze the model for the first freeze_until epochs, except for
    parameters whose names contain one of the unfrozen_layers substrings."""

    def __init__(self, unfrozen_layers, freeze_until, trainer: Trainer):
        self.trainer = trainer
        self.unfrozen_layers = unfrozen_layers
        self.current_step_idx = 0
        self.freeze_until = freeze_until

    def on_epoch_begin(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        if state.epoch < self.freeze_until:  # freeze
            self.freeze_model_except_for(self.unfrozen_layers, int(state.epoch))
        else:  # unfreeze
            for name, param in self.trainer.model.named_parameters():
                param.requires_grad = True
        self.current_step_idx += 1

    def on_save(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        # Unfreeze everything before saving so checkpoints are not affected
        for name, param in self.trainer.model.named_parameters():
            param.requires_grad = True

    def freeze_model_except_for(self, unfrozen_layers: list, epoch: int):
        # A parameter stays trainable iff its name contains any unfrozen substring
        for name, param in self.trainer.model.named_parameters():
            param.requires_grad = any(
                unfrozen_name in name for unfrozen_name in unfrozen_layers
            )
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=default_data_collator,
)
# optionally freeze decoder (leaving encoder unfrozen) for e.g. first 10 epochs
if args.freeze_until >= 0:
    freezing_callback = FreezingCallback(["encoder.encoder"], args.freeze_until, trainer)
    trainer.add_callback(freezing_callback)
# Train the model
if len(args.resume) > 0:
    print("Resuming from " + args.resume)
    trainer.train(resume_from_checkpoint=args.resume)
else:
    trainer.train()
# Guard saving in DDP scenarios
if trainer.is_world_process_zero():
    # Save model and processor
    model.save_pretrained(args.output_path)
    processor.save_pretrained(args.output_path)
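The substring matching used by FreezingCallback can be illustrated in isolation. The parameter names below are made-up examples in VisionEncoderDecoderModel style, and `is_frozen` is a hypothetical helper mirroring the callback's rule:

```python
def is_frozen(param_name: str, unfrozen_layers: list) -> bool:
    """A parameter stays trainable iff its name contains
    any of the unfrozen substrings (as in FreezingCallback)."""
    return not any(sub in param_name for sub in unfrozen_layers)

# Keep the image encoder trainable, freeze everything else
unfrozen = ["encoder.encoder"]

print(is_frozen("encoder.encoder.layer.0.attention.query.weight", unfrozen))   # False
print(is_frozen("decoder.roberta.encoder.layer.5.output.dense.weight", unfrozen))  # True
```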
Limitations and Biases
The model was trained on synthetically generated text lines in the Cyrillic script. It has not been trained on significant amounts of text in other writing systems, such as Latin, Chinese, Arabic, or Hebrew. The model requires fine-tuning for downstream applications. The training texts include large proportions of Imperial Russian and Soviet documents as well as unfiltered web corpora, which may introduce biases.
Future Work
Potential improvements for this model include:
- Expanding training data: incorporating more synthetic data
- Increasing font diversity in text line synthesis: incorporating more fonts, especially handwritten ones
- Training for downstream tasks: fine-tuning the model with real (non-synthetic) images or with other languages
- Out-of-domain generalization: studying how pre-training and fine-tuning could be optimized to maximize out-of-domain generalization of the fine-tuned model
Citation
If you use this model in your work, please cite it as:
@misc{cyrillic-large-stage1,
author = {Kansallisarkisto},
title = {NAF Cyrillic OCR Model: pre-trained base model},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Kansallisarkisto/cyrillic-large-stage1/}},
}
References
Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. 2023. TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models. Proceedings of the AAAI Conference on Artificial Intelligence 37, 11 (2023), 13094–13102. DOI:https://doi.org/10.1609/aaai.v37i11.26538
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, and others. 2023. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023).
Dmitry Zmitrovich, Aleksandr Abramov, Andrey Kalmykov, Vitaly Kadulin, Maria Tikhonova, Ekaterina Taktasheva, Danil Astafurov, Mark Baushenko, Artem Snegirev, Tatiana Shavrina, and others. 2024. A family of pretrained transformer language models for Russian. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 507–524.
Model Card Authors
Author: Kansallisarkisto
Contact Information: john.makela@kansallisarkisto.fi, ilkka.jokipii@kansallisarkisto.fi
