CTranslate2-converted version of protonx-models/distilled-protonx-legal-tc, optimised for fast CPU OCR text correction on Vietnamese administrative documents.
The upstream distilled-protonx-legal-tc is a smaller student model distilled from protonx-legal-tc. This repo only performs the CT2 conversion and INT8 quantization, with no further training. It is used by the ScanIndex pipeline (https://github.com/welcomyou/scanindex) as the correction stage between OCR and PDF/DOCX export.
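For readers who want to reproduce a conversion like the one described above, here is a minimal sketch using CTranslate2's `TransformersConverter`. It is not the exact command used for this repo: the helper name `convert_to_ct2` and the output directory are assumptions, the upstream architecture is assumed to be one the converter supports, and `transformers` and `torch` need to be installed alongside `ctranslate2`.

```python
def convert_to_ct2(
    model_id: str = "protonx-models/distilled-protonx-legal-tc",
    output_dir: str = "distilled_ct2",  # assumed output location
) -> str:
    """Convert a Hugging Face checkpoint to CTranslate2 with INT8 weights.

    Hypothetical helper for illustration; returns the output directory path.
    """
    # Imported lazily so that defining the helper does not require ctranslate2.
    from ctranslate2.converters import TransformersConverter

    converter = TransformersConverter(model_id)
    # quantization="int8" stores the weights as 8-bit integers;
    # force=True overwrites an existing output directory.
    return converter.convert(output_dir, quantization="int8", force=True)
```

The tokenizer files are not produced by the weight conversion itself and would have to be saved into the same directory separately.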
| Variant | Time | Accuracy |
|---|---|---|
| protonx-legal-tc, CT2 OPTIMIZE, beam=1 | 14.5s | 99.561% |
| This repo (distilled-protonx-legal-tc CT2 INT8, beam=1) | 8.3s | 99.550% |
The distilled INT8 model is 42% faster with a 0.011 pp (percentage point) accuracy drop, which makes it the recommended trade-off for CPU.
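The comparison can be checked directly from the table. The sketch below uses only the rounded values shown there, so the time reduction comes out near 42.8%, in line with the stated 42%; the published figure presumably comes from unrounded timings.

```python
# Values copied from the benchmark table (rounded as published).
baseline_s, distilled_s = 14.5, 8.3
baseline_acc, distilled_acc = 99.561, 99.550

# Relative reduction in wall-clock time, in percent.
time_reduction_pct = (baseline_s - distilled_s) / baseline_s * 100
# Equivalent speedup factor.
speedup = baseline_s / distilled_s
# Accuracy difference in percentage points (pp), not relative percent.
accuracy_drop_pp = baseline_acc - distilled_acc

print(f"time reduction: {time_reduction_pct:.1f}%")   # time reduction: 42.8%
print(f"speedup: {speedup:.2f}x")                     # speedup: 1.75x
print(f"accuracy drop: {accuracy_drop_pp:.3f} pp")    # accuracy drop: 0.011 pp
```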
Files in this repo:

- distilled_ct2/model.bin: CTranslate2 model
- distilled_ct2/tokenizer.json, tokenizer_config.json, special_tokens_map.json, shared_vocabulary.json, config.json

Loading the model and tokenizer:

```python
import ctranslate2
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download

# Download the repo into ./models and point at the CT2 subfolder.
local = snapshot_download("welcomyou/distilled-protonx-vn-correction-ct2", local_dir="models")
sub = f"{local}/distilled_ct2"

# CPU translator plus the tokenizer shipped alongside the model.
translator = ctranslate2.Translator(sub, device="cpu")
tok = AutoTokenizer.from_pretrained(sub)
```
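Once `translator` and `tok` are loaded, correction can be run roughly as sketched below. This follows the usual CTranslate2 pattern for encoder-decoder Transformers models and assumes the model needs no task or language prefix; `correct_texts` is a hypothetical helper name, not something shipped in this repo.

```python
from typing import List


def correct_texts(translator, tok, texts: List[str], beam_size: int = 1) -> List[str]:
    """Run OCR text correction on a batch of strings.

    Hypothetical helper: assumes a plain encoder-decoder model with
    no task or language prefix on the source or target side.
    """
    # CTranslate2 consumes token strings rather than token ids.
    batch = [tok.convert_ids_to_tokens(tok.encode(t)) for t in texts]
    results = translator.translate_batch(batch, beam_size=beam_size)

    outputs = []
    for res in results:
        best = res.hypotheses[0]  # top hypothesis, as a list of token strings
        ids = tok.convert_tokens_to_ids(best)
        outputs.append(tok.decode(ids, skip_special_tokens=True))
    return outputs
```

Calling `correct_texts(translator, tok, lines)` with a list of OCR output lines returns the corrected lines in the same order. `beam_size=1` matches the beam setting used in the benchmark table.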
Apache-2.0, inheriting from the protonx-legal-tc base.