Vietnamese Hierarchical CTC OCR Model
This repository contains a hierarchical CTC-based OCR model designed for recognizing Vietnamese text, including its diacritical marks. The model fuses features from multiple scales dynamically to capture both local and global features, which is particularly useful for recognizing complex Vietnamese diacritical marks.
Model Architecture
The model consists of several key components:
- Vision Encoder: Based on Microsoft's TrOCR architecture for image feature extraction
- Dynamic Multi-Scale Fusion: Combines features from different encoder layers to capture both local and global patterns
- Transformer Encoder: Processes the fused features to model sequence dependencies
- Hierarchical Classification: Specialized heads for base characters and diacritical marks
- Diacritic Enhancement Modules:
  - Visual Diacritic Attention: Focuses on specific regions where diacritics appear
  - Character-Diacritic Compatibility: Enforces linguistic constraints
  - Few-Shot Diacritic Adapter: Handles rare diacritic combinations
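To make the split between base characters and diacritical marks concrete, here is a minimal sketch that separates a Vietnamese character into a base letter and its combining marks using Unicode NFD decomposition. This is only an illustration: the helper name `split_base_and_diacritics` is hypothetical, and the label scheme the model's classification heads actually use is not described in this card and may differ (for example, it may treat ê or ơ as base characters).

```python
import unicodedata
from typing import List, Tuple


def split_base_and_diacritics(ch: str) -> Tuple[str, List[str]]:
    """Split one character into its base letter and combining mark names.

    Hypothetical helper for illustration; not part of this repository.
    """
    # NFD separates precomposed characters into base + combining marks.
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]
    return base, marks


# U+1EBF is "e" with a circumflex and an acute accent.
print(split_base_and_diacritics("\u1ebf"))
# ('e', ['COMBINING CIRCUMFLEX ACCENT', 'COMBINING ACUTE ACCENT'])
```

Note that some Vietnamese letters, such as đ, have no canonical decomposition, so a purely Unicode-based split returns them unchanged with no marks.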
Performance
Metrics on Vietnamese handwritten text have not been reported yet:
- Character Error Rate (CER): TBD
- Word Error Rate (WER): TBD
- Diacritic Accuracy: TBD
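For reference, CER and WER are conventionally computed as the Levenshtein edit distance between prediction and ground truth, divided by the reference length in characters or words. The sketch below follows that standard definition; the function names are illustrative, and the repository's evaluation script may compute these metrics differently. Diacritic accuracy is not defined in this card, so it is left out.

```python
import unicodedata
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, relative to the reference length."""
    # Normalize to NFC so precomposed and decomposed forms of the same
    # Vietnamese character are not counted as errors.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate, using whitespace tokenization."""
    ref_words = unicodedata.normalize("NFC", reference).split()
    hyp_words = unicodedata.normalize("NFC", hypothesis).split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Because the denominator is the reference length, both rates can exceed 1.0 when the prediction contains many insertions.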
Usage
Installation
pip install transformers
pip install torch
# Install additional dependencies
pip install -r requirements.txt
Basic Inference
import torch
from PIL import Image
from hierarchical_ctc_model import HierarchicalCtcMultiScaleOcrModel
from utils.ctc_utils import CTCDecoder
# Load the model and the processor bundled with it
model = HierarchicalCtcMultiScaleOcrModel.from_pretrained("username/vietnamese-hierarchical-ctc-ocr")
processor = model.processor
# Prepare image
image = Image.open("your_image.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values
# Get predictions (gradients are not needed for inference)
with torch.no_grad():
    outputs = model(pixel_values=pixel_values)
logits = outputs["logits"]
# Decode predictions: map vocabulary indices back to characters
idx_to_char = {i: c for i, c in enumerate(model.combined_char_vocab)}
decoder = CTCDecoder(idx_to_char_map=idx_to_char, blank_idx=0)
# The decoder returns one string per image in the batch
predicted_text = decoder(logits)[0]
print(f"Predicted text: {predicted_text}")
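The decoding step above relies on the repository's `CTCDecoder`, whose internals are not shown here. As a rough guide to what a CTC decoder does, the following is a minimal sketch of standard greedy (best-path) CTC decoding in pure Python: take the most likely index per frame, collapse consecutive repeats, then drop blanks. The function name and the toy vocabulary are assumptions for illustration; the actual `CTCDecoder` may use a different strategy such as beam search.

```python
from itertools import groupby
from typing import Dict, Sequence


def greedy_ctc_decode(
    frame_scores: Sequence[Sequence[float]],
    idx_to_char: Dict[int, str],
    blank_idx: int = 0,
) -> str:
    """Greedy CTC decoding for a single sequence of per-frame scores."""
    # 1. Pick the highest-scoring vocabulary index for every frame.
    best_path = [
        max(range(len(frame)), key=lambda i: frame[i]) for frame in frame_scores
    ]
    # 2. Collapse runs of identical consecutive indices into one.
    collapsed = [idx for idx, _ in groupby(best_path)]
    # 3. Remove blanks; a blank between two equal indices keeps both.
    return "".join(idx_to_char[idx] for idx in collapsed if idx != blank_idx)


# Toy vocabulary (hypothetical): index 0 is the CTC blank.
vocab = {0: "<blank>", 1: "a", 2: "b"}
frames = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # a (kept, separated by a blank)
    [0.1, 0.1, 0.8],    # b
    [0.0, 0.2, 0.8],    # b (repeat, collapsed)
    [0.9, 0.1, 0.0],    # blank
]
print(greedy_ctc_decode(frames, vocab))  # aab
```

The blank symbol is what lets CTC emit doubled characters: without the blank frame between them, the two runs of "a" would collapse into one.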
Evaluation
You can evaluate the model on your own dataset using the provided evaluation script:
python evaluate_ocr.py \
--model_path "username/vietnamese-hierarchical-ctc-ocr" \
--dataset_name "your_dataset" \
--dataset_split "test" \
--output_dir "evaluation_results" \
--batch_size 16
Dataset Format
The model expects datasets in the Hugging Face format with the following columns:
- image: The input image (a PIL Image, a path to an image, or a NumPy array)
- label or text: The ground truth text
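As an illustration of the column convention above, the sketch below reads one dataset row and returns its image and ground-truth text, accepting either `label` or `text`. The helper names are hypothetical, and checking `label` before `text` is an assumption; the evaluation script may resolve the columns in a different order.

```python
from typing import Any, Mapping, Tuple

# Column names taken from the dataset description above.
# Trying "label" before "text" is an assumption made for this sketch.
TEXT_COLUMNS = ("label", "text")


def get_ground_truth(row: Mapping[str, Any]) -> str:
    """Return the ground-truth text of a row (hypothetical helper)."""
    for column in TEXT_COLUMNS:
        if column in row:
            return row[column]
    raise KeyError(f"row has none of the text columns {TEXT_COLUMNS}")


def unpack_row(row: Mapping[str, Any]) -> Tuple[Any, str]:
    """Return (image, ground_truth) after checking the expected columns exist."""
    if "image" not in row:
        raise KeyError("row is missing the 'image' column")
    return row["image"], get_ground_truth(row)


# Example with a file path standing in for the image.
image, text = unpack_row({"image": "sample.jpg", "text": "xin chào"})
```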
Training
For training your own model, refer to the train_hierarchical_ctc.py script in the repository. The training process utilizes a hierarchical approach with specialized dynamic fusion mechanisms.
Citation
If you use this model in your research, please cite:
@misc{vietnamese-hierarchical-ctc-ocr,
author = {Your Name},
title = {Vietnamese Hierarchical CTC OCR with Dynamic Fusion},
year = {2025},
publisher = {Hugging Face},
journal = {Hugging Face Model Repository},
howpublished = {\url{https://huggingface.co/username/vietnamese-hierarchical-ctc-ocr}}
}
License
This model is released under [LICENSE TYPE]
Acknowledgements
- This model builds upon the architecture from Microsoft's TrOCR with specialized enhancements for Vietnamese.
- Thanks to the developers of transformers, PyTorch, and the Vietnamese OCR community.