Model description
Model Name: estonian-large-handwritten
Model Version: estonian-v0.1b
Model Type: Transformer-based encoder-decoder for OCR
Base Model: microsoft/trocr-large-handwritten
Purpose: Handwritten text recognition
Languages: Estonian
License: Apache 2.0
This is a fine-tuned model for Estonian handwriting recognition, trained with computing resources generously provided by CSC – IT Center for Science on the Puhti and LUMI supercomputers.
This model was developed in the ArchXAI project funded by the Central Baltic Programme.
Model Architecture
The model is based on a Transformer encoder-decoder architecture, similar to TrOCR from Li et al. (2023):
- The encoder processes an image of a single line of text into a sequence of hidden states.
- The decoder attends to the hidden states from the encoder using cross-attention, to generate the corresponding text output.
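The two steps above amount to a greedy autoregressive decoding loop. The sketch below illustrates the control flow only: `encode`, `decode_step`, and the token IDs are toy stand-ins, not the real ViT encoder or TrOCR decoder API.

```python
def recognize(image, encode, decode_step, bos=1, eos=2, max_len=16):
    """Greedy autoregressive decoding: the decoder consumes its own
    previous outputs while cross-attending to the encoder's hidden states."""
    states = encode(image)                      # image -> sequence of hidden states
    tokens = [bos]
    for _ in range(max_len):
        next_tok = decode_step(tokens, states)  # cross-attention happens here
        tokens.append(next_tok)
        if next_tok == eos:
            break
    return tokens

# Toy demo: the "decoder" simply emits a fixed token sequence ending in EOS (2).
demo_tokens = iter([5, 7, 2])
result = recognize([[0.1, 0.2]], encode=lambda img: img,
                   decode_step=lambda toks, states: next(demo_tokens))
print(result)  # [1, 5, 7, 2]
```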
This model is a fine-tuned version of the original trocr-large-handwritten from Li et al. (2023), adapted for handwritten text recognition in primarily Estonian-language historical documents.
Intended Use
- Document digitization (e.g., archival work, historical manuscripts)
- Handwritten notes transcription
Training data
The training data consists of human-annotated samples of mainly handwritten text lines from historical documents in the collections of the National Archives of Estonia, with some printed and typed text lines included.
Training set: 49379 text lines
Validation set: 5897 text lines
Test set: 5706 text lines
Evaluation
The following metrics were calculated on the test set (in-domain evaluation) using the evaluate library with default settings:
CER (character error rate): 0.0290
WER (word error rate): 0.1378
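At corpus level, CER and WER reduce to total edit distance divided by total reference length, in characters or words respectively. A minimal pure-Python sketch of that computation (the evaluate library's jiwer backend applies its own normalization, so results can differ slightly):

```python
def levenshtein(a, b):
    """Edit distance between two sequences via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(preds, refs):
    """Corpus-level character error rate."""
    return (sum(levenshtein(p, r) for p, r in zip(preds, refs))
            / sum(len(r) for r in refs))

def wer(preds, refs):
    """Corpus-level word error rate."""
    return (sum(levenshtein(p.split(), r.split()) for p, r in zip(preds, refs))
            / sum(len(r.split()) for r in refs))

# Two edits (substitute j->i, delete a) over a 4-character reference:
print(cer(["kirja"], ["kiri"]))  # 0.5
```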
Used Hyperparameters
Train batch size per device: 8
Number of devices: 32
Learning rate: 1e-5
Scheduler: linear
Optimizer: AdamW
Number of epochs: 188
FP16 mixed precision training: False
Input image size: 192 x 1024
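The listed hyperparameters map onto Hugging Face Seq2SeqTrainingArguments roughly as follows. This is a hedged sketch for orientation only: `output_dir`, the exact optimizer string, and the multi-device launch configuration are assumptions, not taken from the training setup.

```python
from transformers import Seq2SeqTrainingArguments

# Hedged reconstruction of the listed hyperparameters; output_dir is a placeholder.
args = Seq2SeqTrainingArguments(
    output_dir="estonian-large-handwritten",  # placeholder
    per_device_train_batch_size=8,   # x 32 devices = effective batch size of 256
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    optim="adamw_torch",             # AdamW
    num_train_epochs=188,
    fp16=False,
)
```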
How to Use the Model
You can run inference by loading the processor and model. Because the model was saved with a non-default input size (192 × 1024), the ViT embedding layers are patched to interpolate the position encodings:
```python
from transformers.models.vit.modeling_vit import ViTPatchEmbeddings, ViTEmbeddings
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

def load_custom_trocr_model():
    """Load a TrOCR model with custom image size support."""
    original_embeddings_forward = ViTEmbeddings.forward

    # Always apply patches for models saved with custom image sizes
    def universal_patch_forward(self, *args, **kwargs):
        # Skip the default image-size check and project the pixels directly
        pixel_values = args[0] if args else kwargs["pixel_values"]
        embeddings = self.projection(pixel_values).flatten(2).transpose(1, 2)
        return embeddings

    def universal_embeddings_forward(self, *args, **kwargs):
        # Force interpolation of the position encodings to the custom size
        kwargs["interpolate_pos_encoding"] = True
        return original_embeddings_forward(self, *args, **kwargs)

    # Apply patches
    ViTPatchEmbeddings.forward = universal_patch_forward
    ViTEmbeddings.forward = universal_embeddings_forward

    # Load model and processor
    processor = TrOCRProcessor.from_pretrained(
        "Kansallisarkisto/estonian-large-handwritten",
        use_fast=True,
        do_resize=True,
        size={"height": 192, "width": 1024},
    )
    model = VisionEncoderDecoderModel.from_pretrained(
        "Kansallisarkisto/estonian-large-handwritten"
    )
    return processor, model

# Load model and processor
processor, model = load_custom_trocr_model()

# Open an image of handwritten text
image = Image.open("path_to_image.jpg")

# Preprocess and predict
pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```
Limitations and Biases
The model was trained primarily on handwritten text that uses basic Latin characters and Estonian special characters. It has not been trained on non-Latin alphabets, such as Chinese characters or other writing systems like Arabic or Hebrew. The model may not generalize well to languages other than Estonian.
Future Work
Potential improvements for this model include:
- Expanding training data: incorporating more ground truth data
- Optimizing for specific domains: fine-tuning the model on domain-specific handwriting
- Pretraining: pre-training a fully Estonian-specific model instead of starting the fine-tuning from a model trained on English
- Out-of-domain generalization: studying how pre-training and fine-tuning could be optimized to maximize out-of-domain generalization of the fine-tuned model
Citation
If you use this model in your work, please cite it as:
@misc{estonian-large-handwritten,
  author       = {Kansallisarkisto},
  title        = {NAF Estonian HTR model},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Kansallisarkisto/estonian-large-handwritten/}},
}
References
Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z. and Wei, F. 2023. TrOCR: Transformer-Based Optical Character Recognition with Pre-trained Models. Proceedings of the AAAI Conference on Artificial Intelligence. 37, 11 (Jun. 2023), 13094-13102. DOI: https://doi.org/10.1609/aaai.v37i11.26538.
Model Card Authors
Author: Kansallisarkisto
Contact Information: john.makela@kansallisarkisto.fi, ilkka.jokipii@kansallisarkisto.fi