distilled-protonx-legal-tc: CTranslate2 build (Vietnamese OCR text correction)

CTranslate2-converted version of protonx-models/distilled-protonx-legal-tc, optimised for fast CPU OCR text correction on Vietnamese administrative documents.

The upstream distilled-protonx-legal-tc is a smaller student model distilled from protonx-legal-tc. This repo only performs the CT2 conversion and INT8 quantization, with no further training. The ScanIndex pipeline uses it as the correction stage between OCR and PDF/DOCX export.

Performance (ScanIndex internal benchmark, 13-page Vietnamese admin doc)

Variant                                                     Time     Accuracy
protonx-legal-tc (CT2 OPTIMIZE, beam=1)                     14.5 s   99.561%
This repo (distilled-protonx-legal-tc, CT2 INT8, beam=1)    8.3 s    99.550%

About 43% less wall-clock time (14.5 s down to 8.3 s, roughly a 1.75x speed-up) for a 0.011 pp accuracy drop. This is the recommended trade-off for CPU.
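The summary figures can be re-derived from the two benchmark rows. The short check below uses only the times and accuracies reported in the table; the variable names are illustrative.

```python
# Figures copied from the benchmark table above.
baseline_s, distilled_s = 14.5, 8.3
baseline_acc, distilled_acc = 99.561, 99.550

time_saved = (baseline_s - distilled_s) / baseline_s  # fraction of wall-clock time removed
speedup = baseline_s / distilled_s                    # ratio of baseline time to distilled time
acc_drop_pp = baseline_acc - distilled_acc            # difference in percentage points

print(f"time saved: {time_saved:.1%}")
print(f"speed-up: {speedup:.2f}x")
print(f"accuracy drop: {acc_drop_pp:.3f} pp")
```

Note that "time saved" and "speed-up" are different measures of the same result: removing a bit under half of the runtime corresponds to a speed-up of a bit under 2x.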

Files

  • distilled_ct2/model.bin: CTranslate2 model
  • distilled_ct2/tokenizer.json, tokenizer_config.json, special_tokens_map.json, shared_vocabulary.json, config.json
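Since both the model and the tokenizer are loaded from the same folder, an incomplete download tends to surface as a confusing load error. A minimal sketch of a pre-flight check follows; the file names come from the list above, while the helper `missing_files` and the constant `EXPECTED_FILES` are hypothetical and not part of this repo.

```python
from pathlib import Path

# File names taken from the list above; the helper itself is hypothetical.
EXPECTED_FILES = (
    "model.bin",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "shared_vocabulary.json",
    "config.json",
)


def missing_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_files("models/distilled_ct2")` after a download should return an empty list when the folder is complete.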

Loading

import ctranslate2
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# Download the repo snapshot into ./models and point at the CT2 subfolder.
local = snapshot_download("welcomyou/distilled-protonx-vn-correction-ct2", local_dir="models")
sub = f"{local}/distilled_ct2"

# Load the INT8 model on CPU and the matching tokenizer from the same folder.
translator = ctranslate2.Translator(sub, device="cpu")
tok = AutoTokenizer.from_pretrained(sub)
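The snippet above only loads the model. A minimal sketch of the correction call itself follows, using the usual CTranslate2 pattern for Transformers encoder-decoder models (token strings in, token strings out). It assumes the model needs no task prefix or special prompt, which this card does not state, and `correct_lines` is a hypothetical helper name. `beam_size=1` matches the benchmark setting.

```python
def correct_lines(translator, tok, lines, beam_size=1):
    """Run OCR lines through the CT2 model and return corrected strings.

    `translator` is a ctranslate2.Translator and `tok` the Hugging Face
    tokenizer, both loaded from the distilled_ct2 folder.
    Assumption: the model takes raw text with no task prefix.
    """
    # CTranslate2 consumes token strings rather than token ids.
    batch = [tok.convert_ids_to_tokens(tok.encode(line)) for line in lines]
    results = translator.translate_batch(batch, beam_size=beam_size)

    corrected = []
    for result in results:
        best = result.hypotheses[0]  # top hypothesis, a list of token strings
        ids = tok.convert_tokens_to_ids(best)
        corrected.append(tok.decode(ids, skip_special_tokens=True))
    return corrected
```

With the objects loaded as shown earlier, `correct_lines(translator, tok, ocr_lines)` returns one corrected string per input line, where `ocr_lines` is a list of strings from the OCR stage.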

License

Apache-2.0, inheriting from the protonx-legal-tc base.
