Vietnamese Hierarchical CTC OCR Model

This repository contains a hierarchical CTC-based OCR model designed for recognizing Vietnamese text, including its diacritical marks. The model dynamically fuses features from several encoder layers so that it captures both local detail and global context. This is particularly useful for Vietnamese diacritics, which are small and are often stacked on a single letter.

Model Architecture

The model consists of several key components:

Vision Encoder: Based on Microsoft's TrOCR architecture for image feature extraction
Dynamic Multi-Scale Fusion: Combines features from different encoder layers to capture both local and global patterns
Transformer Encoder: Processes the fused features to model sequence dependencies
Hierarchical Classification: Specialized heads for base characters and diacritical marks
Diacritic Enhancement Modules:
- Visual Diacritic Attention: Focuses on specific regions where diacritics appear
- Character-Diacritic Compatibility: Enforces linguistic constraints
- Few-Shot Diacritic Adapter: Handles rare diacritic combinations
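
To illustrate what "base characters and diacritical marks" means for the hierarchical heads, the sketch below splits a Vietnamese character into a base letter and its combining marks using Unicode NFD decomposition. This is only an illustration built on the standard library; the function name `split_base_and_diacritics` is hypothetical, and the label scheme this repository actually uses may differ (for example, it may group tone marks and vowel modifiers differently).

```python
import unicodedata
from typing import Tuple


def split_base_and_diacritics(char: str) -> Tuple[str, Tuple[str, ...]]:
    """Split one character into its base letter and the names of its combining marks.

    Hypothetical helper for illustration; not part of this repository.
    """
    # NFD expands a precomposed character into base letter + combining marks
    decomposed = unicodedata.normalize("NFD", char)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = tuple(unicodedata.name(c) for c in decomposed if unicodedata.combining(c))
    return base, marks


for ch in ("\u1ebf", "\u1edd", "\u0111"):  # ế, ờ, đ
    print(ch, split_base_and_diacritics(ch))
```

Note that a letter such as "đ" has no canonical decomposition, so a scheme based purely on NFD would treat it as a base character of its own.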

Performance

The model is evaluated on Vietnamese handwritten text. The metrics below have not been reported yet:

Character Error Rate (CER): TBD
Word Error Rate (WER): TBD
Diacritic Accuracy: TBD
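
For reference, CER and WER are conventionally computed as the edit distance between prediction and ground truth, divided by the length of the ground truth (in characters or words). The sketch below follows that common definition; the function names are hypothetical, and the repository's evaluation script may compute the metrics differently. Both strings are normalized to NFC first, as an assumption, so that precomposed and decomposed forms of the same Vietnamese character are not counted as errors. Diacritic Accuracy is not defined in this document, so it is not sketched here.

```python
import unicodedata


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # delete r from the reference
                curr[j - 1] + 1,          # insert h into the reference
                prev[j - 1] + (r != h),   # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference length."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits / number of reference words."""
    ref = unicodedata.normalize("NFC", reference).split()
    hyp = unicodedata.normalize("NFC", hypothesis).split()
    return edit_distance(ref, hyp) / max(len(ref), 1)


# One missing diacritic in a two-word, ten-character reference
print(cer("ti\u1ebfng vi\u1ec7t", "tieng vi\u1ec7t"))
print(wer("ti\u1ebfng vi\u1ec7t", "tieng vi\u1ec7t"))
```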

Usage

Installation

pip install transformers
pip install torch
# Install additional dependencies
pip install -r requirements.txt

Basic Inference

import torch
from PIL import Image

from hierarchical_ctc_model import HierarchicalCtcMultiScaleOcrModel
from utils.ctc_utils import CTCDecoder

# Load the model; the processor is read from the loaded model
model = HierarchicalCtcMultiScaleOcrModel.from_pretrained("username/vietnamese-hierarchical-ctc-ocr")
processor = model.processor

# Prepare image
image = Image.open("your_image.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Get predictions (gradients are not needed for inference)
with torch.no_grad():
    outputs = model(pixel_values=pixel_values)
logits = outputs["logits"]

# Decode predictions
idx_to_char = {i: c for i, c in enumerate(model.combined_char_vocab)}
decoder = CTCDecoder(idx_to_char_map=idx_to_char, blank_idx=0)
predicted_text = decoder(logits)[0]
print(f"Predicted text: {predicted_text}")
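
The decoding step above turns per-frame predictions into text. A minimal sketch of greedy CTC decoding, assuming the per-frame argmax indices have already been taken from the logits, is shown below: repeated indices are collapsed and blanks are dropped. The function `ctc_greedy_collapse` and the toy vocabulary are hypothetical; the repository's `CTCDecoder` may use a different strategy (for example, beam search). The blank index of 0 matches the `blank_idx=0` argument used above.

```python
def ctc_greedy_collapse(frame_ids, idx_to_char, blank_idx=0):
    """Collapse repeated frame predictions, then drop CTC blanks."""
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit a character only when the index changes and is not the blank
        if idx != prev and idx != blank_idx:
            chars.append(idx_to_char[idx])
        prev = idx
    return "".join(chars)


# Toy vocabulary for illustration only
idx_to_char = {0: "<blank>", 1: "v", 2: "i", 3: "\u1ec7", 4: "t"}
frames = [1, 1, 0, 2, 3, 3, 0, 0, 4, 4]
print(ctc_greedy_collapse(frames, idx_to_char))
```

A blank between two identical indices keeps both characters, which is how CTC represents doubled letters.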

Evaluation

You can evaluate the model on your own dataset using the provided evaluation script:

python evaluate_ocr.py \
    --model_path "username/vietnamese-hierarchical-ctc-ocr" \
    --dataset_name "your_dataset" \
    --dataset_split "test" \
    --output_dir "evaluation_results" \
    --batch_size 16

Dataset Format

The model expects datasets in the Hugging Face format with the following columns:

image: The input image (either a PIL Image, path to image, or numpy array)
label or text: The ground truth text
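
As a sketch of how a row in this format could be read, the hypothetical helper below returns the image and the ground truth text from one example, accepting either column name. Checking `label` before `text` is an assumption; the document does not say which takes precedence when both are present. The file path in the sample row is a placeholder.

```python
def extract_example(example: dict):
    """Return (image, text) from one dataset row, or raise if a column is missing."""
    if "image" not in example:
        raise KeyError("expected an 'image' column")
    for key in ("label", "text"):  # assumption: 'label' wins if both exist
        if key in example:
            return example["image"], example[key]
    raise KeyError("expected a 'label' or 'text' column")


row = {"image": "samples/0001.png", "text": "xin ch\u00e0o"}  # placeholder path
image, text = extract_example(row)
print(text)
```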

Training

To train your own model, see the train_hierarchical_ctc.py script in the repository. Training uses the hierarchical setup described under Model Architecture: separate heads for base characters and diacritical marks, on top of dynamically fused multi-scale encoder features.
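
One common way to fuse features from several encoder layers is a learned, softmax-weighted sum. The PyTorch sketch below shows that idea only; the class name `DynamicLayerFusion` and the tensor shapes are assumptions, and the fusion module in this repository may work differently (for example, with weights that depend on the input).

```python
import torch
import torch.nn as nn


class DynamicLayerFusion(nn.Module):
    """Softmax-weighted sum of hidden states from several encoder layers.

    Hypothetical module for illustration; not the repository's implementation.
    """

    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable score per layer; zeros give equal weights at the start
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: sequence of tensors, each (batch, seq_len, hidden_dim)
        stacked = torch.stack(list(hidden_states), dim=0)   # (layers, B, T, D)
        weights = torch.softmax(self.layer_logits, dim=0)   # (layers,)
        return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)


fusion = DynamicLayerFusion(num_layers=3)
layers = [torch.randn(2, 5, 8) for _ in range(3)]
fused = fusion(layers)
print(fused.shape)
```

Because the layer scores are parameters, training can shift weight toward earlier layers (finer local detail) or later layers (more global context).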

Citation

If you use this model in your research, please cite:

@misc{vietnamese-hierarchical-ctc-ocr,
  author = {Your Name},
  title = {Vietnamese Hierarchical CTC OCR with Dynamic Fusion},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Repository},
  howpublished = {\url{https://huggingface.co/username/vietnamese-hierarchical-ctc-ocr}}
}

License

This model is released under [LICENSE TYPE]

Acknowledgements

This model builds upon the architecture from Microsoft's TrOCR with specialized enhancements for Vietnamese.
Thanks to the developers of transformers, PyTorch, and the Vietnamese OCR community.
