Model description
Model Name: cyrillic-large-stage1
Model Version: proper-cyrillic-synthetic-rurobertalarge-138mil-32node-5e05-dinov2large
Model Type: Transformer-based encoder-decoder for OCR
Purpose: Printed, typed and handwritten text recognition
Languages: Russian
License: Apache 2.0
This is a pre-trained base model meant for fine-tuning application-specific text recognition models for languages written in the Cyrillic alphabet, especially Russian. It has been trained on over 138 million synthetically generated text line images based on historical documents and contemporary web corpora, with computing resources generously provided by CSC – IT Center for Science on the LUMI supercomputer.
This model was developed in the ArchXAI project funded by the Central Baltic Programme.
Model Architecture
The model is based on a Transformer architecture with an encoder-decoder setup, similar to TrOCR from Li et al. (2023):
- The encoder processes an image of a single line of text into a sequence of hidden states.
- The decoder attends to the hidden states (including the CLS token) from the encoder using cross-attention, to generate the corresponding text output.
The encoder was initialized from DINOv2 (Oquab et al., 2023) and the decoder was initialized from the last 12 layers of ruRoBERTa (Zmitrovich et al., 2024). The cross-attention layers between the encoder and decoder were initialized randomly.
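As a back-of-the-envelope check of the encoder output the decoder attends to: DINOv2 uses a ViT patch size of 14, and the 182 × 1022 training input divides evenly into patches (the patch size here is the standard DINOv2 value, not something stated in this card):

```python
# Sketch: hidden-state sequence length the DINOv2 encoder produces
# for one text line image, assuming the standard patch size of 14.
patch = 14
height, width = 182, 1022          # training input size (H x W)

patches_h = height // patch        # 13 patch rows
patches_w = width // patch         # 73 patch columns
num_patches = patches_h * patches_w

# +1 for the CLS token, which the decoder also attends to
seq_len = num_patches + 1

print(patches_h, patches_w, seq_len)  # 13 73 950
```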
Intended Use
This model is suitable as a base model for producing fine-tuned models for applications such as:
- Document digitization (e.g., archival work, historical manuscripts)
- Handwritten notes transcription
- Optical character recognition
Similarly to how the pre-trained English-language TrOCR checkpoints can be fine-tuned for other languages, this Russian-language checkpoint could be fine-tuned for other languages written in Cyrillic script.
Training data
The training images were synthesized on-the-fly during training. The images were created by a custom text line renderer (will be released later), using a curated collection of 898 Cyrillic font families under open-source or other suitable licenses and a large Russian text-only dataset consisting of historical documents and publications (primarily 18th to 20th century) and contemporary web corpora. The font collection consists of mostly print fonts, with a small number of typewriter and handwriting fonts included.
The training data includes a mix of both pre-1917 Imperial Russian and modern Russian orthography, so the model should be suitable for downstream applications in both. The texts in pre-1917 orthography and the Soviet-era documents are mostly OCR results rather than human transcriptions. A large proportion of the pre-1917 OCR texts have been spell-checked using the rule-based ru-petr1708-hunspell-3.1 package.
Certain character normalizations were made to the labels only (but not the training images) due to specific requirements of downstream HTR fine-tuning:
# Map pre-reform characters to their modern equivalents in the labels only
trans = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И"
})
text_for_label = text_for_image.translate(trans)
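For example, the mapping converts pre-reform spellings in the label while the rendered image keeps the original characters (the words below are illustrative pre-reform spellings):

```python
# Same mapping as above, applied to two pre-reform spellings
trans = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И"
})

print("лѣсъ".translate(trans))  # лесъ
print("міръ".translate(trans))  # миръ
```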
Total training set: 138 589 047 synthetic text line images
Evaluation
Because this is a pre-trained model meant to be fine-tuned for downstream tasks, test metrics on the synthetic data were not calculated. The character error rate (CER) on a small 10 000-line validation set, used to track model convergence during training, was approximately 0.8% at the end of training. More meaningful test metrics can be calculated on models that have been fine-tuned for downstream tasks.
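For reference, CER is the character-level edit distance divided by the reference length. A minimal pure-Python version is sketched below; in practice a library such as jiwer or evaluate would be used instead:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m

print(cer("привет", "привт"))  # one deletion over six characters ≈ 0.167
```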
Used Hyperparameters
Train batch size per device: 8
Number of devices: 256
Learning rate: 5e-5
Warmup ratio: 0.005
Scheduler: linear
Optimizer: AdamW
Number of epochs: 1
FP16 mixed precision training: False
Input image size: 182 x 1022
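Under these settings, and assuming no gradient accumulation, the effective batch size and step counts work out as follows (Transformers computes warmup steps as the ceiling of ratio × total steps):

```python
import math

per_device_batch = 8
num_devices = 256
effective_batch = per_device_batch * num_devices      # 2048 lines per optimizer step

train_lines = 138_589_047
steps_per_epoch = math.ceil(train_lines / effective_batch)

warmup_ratio = 0.005
warmup_steps = math.ceil(steps_per_epoch * warmup_ratio)

print(effective_batch, steps_per_epoch, warmup_steps)
```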
How to Use the Model
You can fine-tune the model on your own dataset. Note that some details differ from TrOCR due to using a different encoder (DINOv2). Consider freezing the decoder for the first few epochs of fine-tuning.
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments, TrOCRProcessor,
                          Trainer, TrainerCallback, TrainerControl, TrainerState,
                          TrainingArguments, VisionEncoderDecoderModel,
                          default_data_collator)
import argparse
import torch

class TrOCRProcessorCustom(TrOCRProcessor):
    def __init__(self, image_processor, tokenizer):
        # Skip TrOCRProcessor.__init__ so the DINOv2 image processor
        # can be paired with the ruRoBERTa tokenizer.
        self.image_processor = image_processor
        self.tokenizer = tokenizer
        self.current_processor = self.image_processor
        self.chat_template = None
# various arguments used below
args = ...
# possible to modify image size here
# we disable center crop for DINOv2 image processor
processor = TrOCRProcessorCustom.from_pretrained(
    args.model_path,
    use_fast=True,
    do_resize=True,
    do_center_crop=False,
    size={'height': args.img_height, 'width': args.img_width},
)
# construct training and validation sets here
# using pixel values from processor and labels from tokenizer
train_dataset = ...
test_dataset = ...
model = VisionEncoderDecoderModel.from_pretrained(args.model_path)
device = torch.device(args.device if torch.cuda.is_available() else "cpu")
model.to(device)
training_args = ...
# Metrics e.g. CER (for model selection) and WER
def compute_metrics(pred):
    ...
class FreezingCallback(TrainerCallback):
    """Freeze the model for the first freeze_until epochs, except for
    parameters whose names contain one of the unfrozen_layers substrings."""

    def __init__(self, unfrozen_layers, freeze_until, trainer: Trainer):
        self.trainer = trainer
        self.unfrozen_layers = unfrozen_layers
        self.current_step_idx = 0
        self.freeze_until = freeze_until

    def on_epoch_begin(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        if state.epoch < self.freeze_until:  # freeze
            self.freeze_model_except_for(self.unfrozen_layers, int(state.epoch))
        else:  # unfreeze
            for name, param in self.trainer.model.named_parameters():
                param.requires_grad = True
        self.current_step_idx += 1

    def on_save(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        # Unfreeze everything before saving so checkpoints are not affected
        for name, param in self.trainer.model.named_parameters():
            param.requires_grad = True

    def freeze_model_except_for(self, unfrozen_layers: list, epoch: int):
        # A parameter stays trainable iff its name contains any unfrozen substring
        for name, param in self.trainer.model.named_parameters():
            param.requires_grad = any(
                unfrozen_name in name for unfrozen_name in unfrozen_layers
            )
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=default_data_collator,
)
# optionally freeze decoder (leaving encoder unfrozen) for e.g. first 10 epochs
if args.freeze_until >= 0:
    freezing_callback = FreezingCallback(["encoder.encoder"], args.freeze_until, trainer)
    trainer.add_callback(freezing_callback)
# Train the model
if len(args.resume) > 0:
    print("Resuming from " + args.resume)
    trainer.train(resume_from_checkpoint=args.resume)
else:
    trainer.train()
# Guard saving in DDP scenarios
if trainer.is_world_process_zero():
    # Save model and processor
    model.save_pretrained(args.output_path)
    processor.save_pretrained(args.output_path)
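The substring matching used by FreezingCallback can be illustrated in isolation. The parameter names below are made-up examples in VisionEncoderDecoderModel style, and `is_frozen` is a hypothetical helper mirroring the callback's rule:

```python
def is_frozen(param_name: str, unfrozen_layers: list) -> bool:
    """A parameter stays trainable iff its name contains
    any of the unfrozen substrings (as in FreezingCallback)."""
    return not any(sub in param_name for sub in unfrozen_layers)

# Keep the image encoder trainable, freeze everything else
unfrozen = ["encoder.encoder"]

print(is_frozen("encoder.encoder.layer.0.attention.query.weight", unfrozen))   # False
print(is_frozen("decoder.roberta.encoder.layer.5.output.dense.weight", unfrozen))  # True
```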
Limitations and Biases
The model was trained on synthetically generated text lines in the Cyrillic script. It has not been trained on significant amounts of text in other writing systems, such as Latin, Chinese, Arabic, or Hebrew. The model requires fine-tuning for downstream applications. The training texts include large proportions of Imperial Russian and Soviet documents as well as unfiltered web corpora, which may introduce biases.
Future Work
Potential improvements for this model include:
- Expanding training data: incorporating more synthetic data
- Increasing font diversity in text line synthesis: incorporating more fonts, especially handwritten ones
- Training for downstream tasks: fine-tuning the model with real (non-synthetic) images or with other languages
- Out-of-domain generalization: studying how pre-training and fine-tuning could be optimized to maximize out-of-domain generalization of the fine-tuned model
Citation
If you use this model in your work, please cite it as:
@misc{cyrillic-large-stage1,
author = {Kansallisarkisto},
title = {NAF Cyrillic OCR Model: pre-trained base model},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Kansallisarkisto/cyrillic-large-stage1/}},
}
References
Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. 2023. TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models. Proceedings of the AAAI Conference on Artificial Intelligence 37, 11 (2023), 13094–13102. DOI:https://doi.org/10.1609/aaai.v37i11.26538
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, and others. 2023. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023).
Dmitry Zmitrovich, Aleksandr Abramov, Andrey Kalmykov, Vitaly Kadulin, Maria Tikhonova, Ekaterina Taktasheva, Danil Astafurov, Mark Baushenko, Artem Snegirev, Tatiana Shavrina, and others. 2024. A family of pretrained transformer language models for Russian. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 507–524.
Model Card Authors
Author: Kansallisarkisto
Contact Information: john.makela@kansallisarkisto.fi, ilkka.jokipii@kansallisarkisto.fi
