Vietnamese Hierarchical CTC OCR Model

This repository contains a hierarchical CTC-based OCR model for recognizing Vietnamese text with diacritical marks. The model fuses features from several encoder layers (dynamic multi-scale fusion) to capture both local and global patterns, which is particularly useful for recognizing complex Vietnamese diacritical marks.

Model Architecture

The model consists of several key components:

  • Vision Encoder: Based on Microsoft's TrOCR architecture for image feature extraction
  • Dynamic Multi-Scale Fusion: Combines features from different encoder layers to capture both local and global patterns
  • Transformer Encoder: Processes the fused features to model sequence dependencies
  • Hierarchical Classification: Specialized heads for base characters and diacritical marks
  • Diacritic Enhancement Modules:
    • Visual Diacritic Attention: Focuses on specific regions where diacritics appear
    • Character-Diacritic Compatibility: Enforces linguistic constraints
    • Few-Shot Diacritic Adapter: Handles rare diacritic combinations
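
To make the base-character/diacritic split behind the hierarchical heads more concrete, here is a minimal sketch that decomposes a Vietnamese character with Unicode NFD normalization. It is only an assumption about how such a split could be defined, not the repository's actual label scheme, and the helper name `split_char` is hypothetical.

```python
import unicodedata


def split_char(ch: str) -> tuple[str, str]:
    """Split one character into (base letter, combining marks) using NFD."""
    decomposed = unicodedata.normalize("NFD", ch)
    # Combining marks have a non-zero canonical combining class.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = "".join(c for c in decomposed if unicodedata.combining(c))
    return base, marks


for ch in ("ế", "ệ", "đ", "a"):
    base, marks = split_char(ch)
    mark_names = [unicodedata.name(m) for m in marks]
    print(repr(ch), "->", repr(base), mark_names)
```

Note that `đ` has no canonical decomposition, so it stays a single base character. This sketch also treats the circumflex, breve, and horn as marks, whereas Vietnamese orthography counts â, ê, ô, ă, ơ, and ư as separate letters; a real label scheme may draw the line differently.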

Performance

Results on Vietnamese handwritten text have not been reported yet; the metrics below are placeholders:

  • Character Error Rate (CER): TBD
  • Word Error Rate (WER): TBD
  • Diacritic Accuracy: TBD
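
For reference, CER and WER are conventionally defined as the edit distance between prediction and ground truth, divided by the ground-truth length in characters or words. The sketch below follows that convention; it is not taken from the repository's evaluation script, and the function names are hypothetical. Both strings are normalized to NFC first, on the assumption that precomposed and decomposed forms of the same Vietnamese character should count as equal.

```python
import unicodedata


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; the reference must be non-empty."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref = unicodedata.normalize("NFC", reference).split()
    hyp = unicodedata.normalize("NFC", hypothesis).split()
    return edit_distance(ref, hyp) / len(ref)


# One missing diacritic: 1 of 10 characters and 1 of 2 words are wrong.
print(cer("tiếng việt", "tieng việt"))
print(wer("tiếng việt", "tieng việt"))
```

A single dropped diacritic costs one character but a whole word, which is why WER tends to be much higher than CER on Vietnamese text.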

Usage

Installation

pip install transformers
pip install torch
# Install additional dependencies
pip install -r requirements.txt

Basic Inference

import torch
from PIL import Image

from hierarchical_ctc_model import HierarchicalCtcMultiScaleOcrModel
from utils.ctc_utils import CTCDecoder

# Load model and processor
model = HierarchicalCtcMultiScaleOcrModel.from_pretrained("username/vietnamese-hierarchical-ctc-ocr")
model.eval()  # disable dropout for inference
processor = model.processor

# Prepare image
image = Image.open("your_image.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Get predictions (no gradients are needed at inference time)
with torch.no_grad():
    outputs = model(pixel_values=pixel_values)
logits = outputs["logits"]

# Decode predictions
idx_to_char = {i: c for i, c in enumerate(model.combined_char_vocab)}
decoder = CTCDecoder(idx_to_char_map=idx_to_char, blank_idx=0)
predicted_text = decoder(logits)[0]
print(f"Predicted text: {predicted_text}")
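
The decoding step above turns per-frame scores into text. As a rough illustration of what a greedy (best-path) CTC decoder does, the sketch below works on plain Python lists; the repository's `CTCDecoder` may behave differently (for example, it may use beam search), and `greedy_ctc_decode` is a hypothetical name.

```python
from itertools import groupby


def greedy_ctc_decode(frame_scores, idx_to_char, blank_idx=0):
    """Best-path CTC decoding for one sequence.

    frame_scores: list of per-frame score lists, shape [time][vocab].
    """
    # 1. Pick the highest-scoring class in every frame.
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    # 2. Merge consecutive repeats of the same index.
    collapsed = [idx for idx, _ in groupby(best_path)]
    # 3. Drop blanks and map the remaining indices to characters.
    return "".join(idx_to_char[i] for i in collapsed if i != blank_idx)


# Toy example: index 0 is the blank symbol.
vocab = ["<blank>", "a", "b"]
idx_to_char = dict(enumerate(vocab))
path = [1, 1, 0, 1, 2, 2, 0]  # a a _ a b b _
frame_scores = [[1.0 if i == p else 0.0 for i in range(len(vocab))] for p in path]
print(greedy_ctc_decode(frame_scores, idx_to_char))  # prints "aab"
```

The blank between the two runs of `a` is what lets CTC emit a doubled letter; without it the repeats would merge into a single `a`.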

Evaluation

You can evaluate the model on your own dataset using the provided evaluation script:

python evaluate_ocr.py \
    --model_path "username/vietnamese-hierarchical-ctc-ocr" \
    --dataset_name "your_dataset" \
    --dataset_split "test" \
    --output_dir "evaluation_results" \
    --batch_size 16

Dataset Format

The model expects datasets in the Hugging Face format with the following columns:

  • image: The input image (a PIL Image, a path to an image file, or a NumPy array)
  • label or text: The ground-truth text
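
As a small illustration of this layout, the sketch below reads the ground-truth text from rows that use either column name. The helper name, the sample paths, and the choice to prefer `label` over `text` when both are present are assumptions, not behavior confirmed by the repository.

```python
def get_reference_text(example: dict) -> str:
    """Return the ground-truth text from a row with a 'label' or 'text' column."""
    if "image" not in example:
        raise KeyError("row is missing the 'image' column")
    for column in ("label", "text"):  # assumed precedence: 'label' first
        if column in example:
            return example[column]
    raise KeyError("row needs a 'label' or 'text' column")


# Hypothetical rows; the image paths are placeholders.
rows = [
    {"image": "samples/0001.jpg", "label": "xin chào"},
    {"image": "samples/0002.jpg", "text": "cảm ơn"},
]
print([get_reference_text(row) for row in rows])
```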

Training

To train your own model, see the train_hierarchical_ctc.py script in the repository. Training uses the same hierarchical heads (base characters and diacritical marks) and dynamic multi-scale fusion described under Model Architecture.

Citation

If you use this model in your research, please cite:

@misc{vietnamese-hierarchical-ctc-ocr,
  author = {Your Name},
  title = {Vietnamese Hierarchical CTC OCR with Dynamic Fusion},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Repository},
  howpublished = {\url{https://huggingface.co/username/vietnamese-hierarchical-ctc-ocr}}
}

License

This model is released under [LICENSE TYPE]

Acknowledgements

  • This model builds upon the architecture from Microsoft's TrOCR with specialized enhancements for Vietnamese.
  • Thanks to the developers of transformers, PyTorch, and the Vietnamese OCR community.