distilled-protonx-legal-tc — CTranslate2 build (Vietnamese OCR text correction)

CTranslate2-converted version of protonx-models/distilled-protonx-legal-tc, optimised for fast CPU OCR text correction on Vietnamese administrative documents.

The upstream distilled-protonx-legal-tc is a smaller student model distilled from protonx-legal-tc. This repo only applies the CTranslate2 (CT2) conversion and INT8 quantization; no further training was done. The ScanIndex pipeline uses it as the correction stage between OCR and PDF/DOCX export.
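For readers who want to reproduce a build like this one, the conversion step can be sketched with CTranslate2's Python converter. This is a minimal sketch, not the exact command used for this repo: the helper name `convert_to_ct2_int8` is hypothetical, the list of tokenizer files to copy is an assumption based on the Files section, and the converter needs `transformers` and `torch` installed in addition to `ctranslate2`.

```python
def convert_to_ct2_int8(model_id, output_dir, copy_files=None, force=False):
    """Convert a Hugging Face seq2seq checkpoint to CTranslate2 with INT8 weights.

    Hypothetical helper for illustration; the repo author's actual
    conversion command is not documented here.
    """
    # Imported lazily so this module can be loaded without ctranslate2 installed.
    from ctranslate2.converters import TransformersConverter

    # copy_files names extra files to copy from the source checkpoint
    # into the output directory (for example the tokenizer files).
    converter = TransformersConverter(model_id, copy_files=copy_files)
    # quantization="int8" stores the weights as 8-bit integers.
    return converter.convert(output_dir, quantization="int8", force=force)


# Example call (assumed file list, downloads the upstream checkpoint):
# convert_to_ct2_int8(
#     "protonx-models/distilled-protonx-legal-tc",
#     "distilled_ct2",
#     copy_files=["tokenizer.json", "tokenizer_config.json", "special_tokens_map.json"],
# )
```

The converter writes `model.bin`, `config.json` and the vocabulary file itself, which is why only tokenizer files appear in the assumed `copy_files` list.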

Performance (ScanIndex internal benchmark, 13-page Vietnamese admin doc)

Variant	Time	Accuracy
`protonx-legal-tc` — CT2 OPTIMIZE, beam=1	14.5s	99.561%
This repo (`distilled-protonx-legal-tc` CT2 INT8, beam=1)	8.3s	99.550%

About 43% less wall-clock time (14.5s down to 8.3s, roughly 1.75x faster) for a 0.011 pp accuracy drop. This is the recommended trade-off for CPU.
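The summary figures can be re-derived from the two table rows. The short calculation below uses only the numbers reported above; the variable names are illustrative.

```python
# Figures copied from the benchmark table above.
teacher_s, student_s = 14.5, 8.3            # seconds for the 13-page document
teacher_acc, student_acc = 99.561, 99.550   # accuracy in percent

time_saved = (teacher_s - student_s) / teacher_s  # fraction of wall-clock time removed
speedup = teacher_s / student_s                   # how many times faster the student runs
acc_drop_pp = teacher_acc - student_acc           # difference in percentage points

print(f"time saved:    {time_saved:.1%}")      # 42.8%
print(f"speedup:       {speedup:.2f}x")        # 1.75x
print(f"accuracy drop: {acc_drop_pp:.3f} pp")  # 0.011 pp
```

Note that "time saved" and "speedup" are different ratios: removing about 43% of the runtime corresponds to running about 1.75 times as fast.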

Files

distilled_ct2/model.bin — CTranslate2 model
distilled_ct2/tokenizer.json, tokenizer_config.json, special_tokens_map.json, shared_vocabulary.json, config.json

Loading

import ctranslate2
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download

# Download the repo into ./models and point at the CTranslate2 subfolder.
local = snapshot_download("welcomyou/distilled-protonx-vn-correction-ct2", local_dir="models")
sub = f"{local}/distilled_ct2"
# Load the model on CPU, plus the tokenizer files stored next to it.
translator = ctranslate2.Translator(sub, device="cpu")
tok = AutoTokenizer.from_pretrained(sub)
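Once `translator` and `tok` are loaded, a correction pass might look like the sketch below. It follows the usual CTranslate2 pattern for Hugging Face encoder-decoder models (token strings in, token strings out) and uses `beam_size=1` to match the benchmark setting. The helper name `correct_lines` is hypothetical, and the sketch assumes the tokenizer's `encode` adds whatever special tokens the model expects; the card does not document any required input prefix.

```python
def correct_lines(translator, tok, lines, beam_size=1):
    """Run OCR lines through the CTranslate2 model and return corrected strings.

    Hypothetical helper: `translator` is a ctranslate2.Translator and
    `tok` is the matching Hugging Face tokenizer.
    """
    # CTranslate2 consumes token strings rather than ids,
    # so convert the encoded ids back to tokens first.
    batch = [tok.convert_ids_to_tokens(tok.encode(line)) for line in lines]
    results = translator.translate_batch(batch, beam_size=beam_size)

    corrected = []
    for result in results:
        best = result.hypotheses[0]  # top hypothesis, a list of token strings
        ids = tok.convert_tokens_to_ids(best)
        corrected.append(tok.decode(ids, skip_special_tokens=True))
    return corrected


# Example call, given the objects from the loading snippet:
# fixed = correct_lines(translator, tok, ["<one line of OCR text>"])
```

Batching several lines into one `translate_batch` call generally makes better use of the CPU than calling it once per line. Very long inputs may need splitting, since CTranslate2 caps the decoded length (the `max_decoding_length` argument of `translate_batch`).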

License

Apache-2.0, inheriting from the protonx-legal-tc base.
