---
license: apache-2.0
datasets:
- iapp/thai_handwriting_dataset
language:
- th
metrics:
- cer
base_model:
- microsoft/trocr-base-handwritten
tags:
- ocr
- handwriting
- thai
- trocr
- vision
- transformer
---

# Thai Handwritten OCR (TrOCR)

A Thai Handwritten OCR model fine-tuned from Microsoft TrOCR for recognizing Thai handwritten text.

## Model Details

### Model Description

This model is developed to convert Thai handwritten images into text using the TrOCR architecture, which combines Vision Transformer (ViT) for image processing and Transformer Decoder for text generation.

- **Developed by:** Warit Sirikosityanggoon
- **Model type:** Vision Encoder-Decoder (TrOCR)
- **Language(s):** Thai (th)
- **License:** Apache 2.0
- **Finetuned from:** [microsoft/trocr-base-handwritten](https://huggingface.co/microsoft/trocr-base-handwritten)

### Model Sources

- **Repository:** [waritkan/Thai-Hand-Written-TrOCR-Webapp](https://github.com/waritkan/Thai-Hand-Written-TrOCR-Webapp)

## Uses

### Direct Use

This model can be used directly for converting Thai handwritten images into text. Suitable for:

- Converting Thai handwritten documents
- Real-time handwriting recognition systems
- Digitizing handwritten notes

### Out-of-Scope Use

- Not suitable for languages other than Thai
- May not perform well on extremely difficult handwriting or low-quality images

## Training Details

### Training Data

Trained on [iapp/thai_handwriting_dataset](https://huggingface.co/datasets/iapp/thai_handwriting_dataset), which contains Thai handwritten images paired with their corresponding text labels.
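To make the image and label pairing concrete, here is a minimal sketch of loading the dataset and iterating over (image, label) pairs. The column names `image` and `text`, the `train` split, and the helper names `iter_pairs` and `load_pairs` are assumptions for illustration; check the dataset card for the actual schema. Skipping empty labels is an illustrative choice, not a documented step of this model's training pipeline.

```python
def iter_pairs(rows, image_key="image", text_key="text"):
    """Yield (image, label) pairs, skipping rows whose label is empty.

    `image_key` and `text_key` are assumed column names, not confirmed
    by the model card.
    """
    for row in rows:
        label = (row.get(text_key) or "").strip()
        if label:
            yield row[image_key], label


def load_pairs(split="train"):
    """Load the dataset from the Hugging Face Hub and yield its pairs.

    Needs `pip install datasets` and network access. The import is kept
    local so that `iter_pairs` can be used without the dependency.
    """
    from datasets import load_dataset

    dataset = load_dataset("iapp/thai_handwriting_dataset", split=split)
    return iter_pairs(dataset)
```

`iter_pairs` works on any iterable of dictionaries, so it can be tried on a few hand-written rows before downloading the full dataset.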

### Tokenizer

The model uses **SentencePiece with the Unigram algorithm** instead of dictionary-based word segmentation because it:

- Handles out-of-vocabulary words effectively
- Supports misspelled or incomplete words from handwriting
- Requires no pre-tokenization, which suits Thai because it is written without spaces between words

**Tokenizer Configuration:**

- Vocab Size: 30,000
- Character Coverage: 0.9995
- Algorithm: Unigram

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 250 |
| Batch Size | 16 |
| Learning Rate | 1e-5 |
| Optimizer | AdamW |
| Training Regime | fp16 mixed precision |

### Training Infrastructure

- **Hardware:** NVIDIA GPU (HPC Cluster)
- **Framework:** PyTorch + Hugging Face Transformers

## Evaluation

### Metrics

| Metric | Value |
|--------|-------|
| **CER (Character Error Rate)** | **0.488%** |

### How to Evaluate

`calculate_cer` returns a fraction; multiply by 100 for a percentage.

```python
import editdistance  # pip install editdistance


def calculate_cer(pred, label):
    """Character Error Rate (lower is better)"""
    if len(label) == 0:
        # Empty reference: any predicted text counts as a full error
        return 1.0 if len(pred) > 0 else 0.0
    distance = editdistance.eval(pred, label)
    return distance / len(label)
```

## How to Get Started with the Model

### Installation

```bash
pip install transformers torch sentencepiece pillow
```

### Usage

The snippet below expects `thai_sp_30000.model` and `best_model.pt` in the working directory.

```python
import torch
from PIL import Image
import sentencepiece as spm
from transformers import VisionEncoderDecoderModel, ViTImageProcessor

# Load the base architecture; the fine-tuned weights are loaded below
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-handwritten')
image_processor = ViTImageProcessor.from_pretrained('microsoft/trocr-base-handwritten')

# Load Thai tokenizer
sp = spm.SentencePieceProcessor()
sp.Load('thai_sp_30000.model')

# Load trained weights
checkpoint = torch.load('best_model.pt', map_location='cpu')
# Resize the decoder embeddings to the Thai vocabulary before loading the weights
model.decoder.resize_token_embeddings(sp.GetPieceSize())
# Note: strict=False silently skips missing or unexpected keys
model.load_state_dict(checkpoint['model_state_dict'], strict=False)
model.eval()

# Inference
image = Image.open('handwriting.jpg').convert('RGB')
pixel_values = image_processor(image, return_tensors='pt').pixel_values

with torch.no_grad():
    generated_ids = model.generate(
        pixel_values,
        max_length=128,
        num_beams=4,
    )

# Decode
ids = generated_ids[0].tolist()
text = sp.DecodeIds(ids)
print(text)
```

## Model Architecture

```
Input Image
    |
    v
Vision Transformer (ViT) Encoder
    |
    v
Cross-Attention
    |
    v
Transformer Decoder
    |
    v
SentencePiece Tokenizer (Unigram)
    |
    v
Thai Text Output
```

## Limitations

- Performance depends on image quality and handwriting clarity
- May not perform well on handwriting styles significantly different from training data
- Supports Thai language only

## Citation

```bibtex
@misc{thai-handwritten-trocr,
  author = {Warit Sirikosityanggoon},
  title = {Thai Handwritten OCR using TrOCR},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://github.com/waritkan/Thai-Hand-Written-TrOCR-Webapp}}
}
```

## Acknowledgements

- [Microsoft TrOCR](https://huggingface.co/microsoft/trocr-base-handwritten) for the pretrained model
- [iApp Technology](https://huggingface.co/datasets/iapp/thai_handwriting_dataset) for the Thai handwriting dataset
- [SentencePiece](https://github.com/google/sentencepiece) for the tokenizer

## Model Card Contact

- **Author:** Warit Sirikosityanggoon
- **GitHub:** [waritkan/Thai-Hand-Written-TrOCR-Webapp](https://github.com/waritkan/Thai-Hand-Written-TrOCR-Webapp)
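
As a supplement to the Evaluation section, the sketch below shows one way to aggregate CER over a whole test set: total edit distance divided by total reference length. It uses a pure-Python Levenshtein distance so it runs without the `editdistance` package. The names `levenshtein` and `corpus_cer` are hypothetical, and this is not necessarily how the reported 0.488% was computed; averaging per-sample CER gives a different number.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def corpus_cer(preds, labels):
    """Total edit distance divided by total reference length."""
    total_distance = sum(levenshtein(p, l) for p, l in zip(preds, labels))
    total_length = sum(len(l) for l in labels)
    if total_length == 0:
        # No reference characters: any predicted text counts as a full error
        return 0.0 if total_distance == 0 else 1.0
    return total_distance / total_length


if __name__ == "__main__":
    preds = ["สวัสดิ", "ขอบคุณ"]
    labels = ["สวัสดี", "ขอบคุณ"]
    print(f"CER: {corpus_cer(preds, labels) * 100:.3f}%")
```

Python strings are indexed by Unicode code point, so Thai vowel signs and tone marks each count as one character here, the same as base consonants.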