| --- |
| license: cc-by-sa-4.0 |
| language: |
| - vi |
| base_model: dragonSwing/vibert-capu |
| tags: |
| - punctuation |
| - capitalization |
| - vietnamese |
| - onnx |
| - bert |
| - vibert |
| library_name: onnxruntime |
| --- |
| |
| # ViBERT-capu ONNX (FP32 + INT8) |
|
|
| Vietnamese **punctuation restoration + capitalization** model — ONNX Runtime version of [dragonSwing/vibert-capu](https://huggingface.co/dragonSwing/vibert-capu). |
| Inference no longer needs PyTorch: the runtime dependency drops from torch (~2 GB) to onnxruntime (~50 MB). |
|
|
| | Variant | File | Size | Use case | |
| |---|---|---|---| |
| | FP32 | `vibert-capu.onnx` | 438 MB | Best accuracy, server / web service | |
| | INT8 | `vibert-capu.int8.onnx` | 110 MB | Desktop, embedded — dynamic-quantized weights, ~99% of FP32 accuracy | |
|
|
| Architecture: BERT (FPTAI/vibert-base-cased) fine-tuned by [dragonSwing](https://huggingface.co/dragonSwing) on 5.6M OSCAR-2109 samples for the Seq2Labels punctuation+capitalization task (15 GECToR-style edit actions). |
|
|
| ## Why ONNX? |
|
|
| | | PyTorch (original) | ONNX Runtime (this repo) | |
| |---|---|---| |
| | Cold start | ~6 s | ~0.8 s | |
| | Runtime deps | torch (~2 GB) | onnxruntime (~50 MB) | |
| | Portable build | very heavy | lightweight | |
|
|
| ## Quick start |
|
|
| ```python |
| import numpy as np |
| import onnxruntime as ort |
| from huggingface_hub import snapshot_download |
| from transformers import AutoTokenizer |
| |
| local = snapshot_download("welcomyou/vibert-capu-onnx") |
| tok = AutoTokenizer.from_pretrained(local) |
| sess = ort.InferenceSession(f"{local}/vibert-capu.int8.onnx", |
|                             providers=["CPUExecutionProvider"]) |
| |
| text = "hà nội là thủ đô việt nam tôi yêu nó" |
| enc = tok(text.split(), is_split_into_words=True, return_tensors="np") |
| # input_offsets: index of the first subword of each word |
| word_ids = enc.word_ids() |
| offsets = [] |
| prev = None |
| for i, w in enumerate(word_ids): |
|     # Special tokens have w == None; a new word starts when w changes. |
|     if w is not None and w != prev: |
|         offsets.append(i) |
|         prev = w |
| input_offsets = np.array([offsets], dtype=np.int64) |
| |
| logits, detect_logits = sess.run(None, { |
| "input_ids": enc["input_ids"].astype(np.int64), |
| "attention_mask": enc["attention_mask"].astype(np.int64), |
| "token_type_ids": enc["token_type_ids"].astype(np.int64), |
| "input_offsets": input_offsets, |
| }) |
| # logits: (1, num_words, 15) — 15 GECToR actions |
| # detect_logits: (1, num_words, 4) — error detection |
| ``` |
|
|
| ## Model I/O |
|
|
| **Inputs** (all `int64`): |
|
|
| | Name | Shape | Description | |
| |---|---|---| |
| | `input_ids` | `(batch, seq_len)` | WordPiece token IDs from BertTokenizer | |
| | `attention_mask` | `(batch, seq_len)` | 1 = real token, 0 = padding | |
| | `token_type_ids` | `(batch, seq_len)` | Segment IDs (always 0) | |
| | `input_offsets` | `(batch, num_words)` | Index of first subword for each whitespace-separated word | |
|
|
| **Outputs** (`float32`): |
|
|
| | Name | Shape | Description | |
| |---|---|---| |
| | `logits` | `(batch, num_words, 15)` | Per-word action scores over the 15 GECToR-style edits (logits, going by the output name; apply softmax for probabilities) | |
| | `detect_logits` | `(batch, num_words, 4)` | Per-word error-detection scores (same caveat) | |
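To make the output shapes concrete, here is a minimal sketch of turning `logits` into one action index and one confidence per word. `decode_actions` is a hypothetical helper, not part of this repository, and it assumes the outputs are unnormalized scores, as their names suggest. The argmax is unaffected if they are already probabilities, but the confidences would then be off. Mapping an index to an action name is assumed to follow the order of `vocabulary/labels.txt`, which should be checked against the files.

```python
import numpy as np


def decode_actions(logits: np.ndarray):
    """Return (action_ids, confidences), each of shape (batch, num_words)."""
    # Numerically stable softmax over the last (action) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    action_ids = probs.argmax(axis=-1)   # index of the best-scoring action
    confidences = probs.max(axis=-1)     # softmax probability of that action
    return action_ids, confidences


# Synthetic scores standing in for the `logits` returned by `sess.run`:
# 1 sentence, 2 words, 15 actions.
demo = np.zeros((1, 2, 15), dtype=np.float32)
demo[0, 0, 1] = 5.0
demo[0, 1, 3] = 5.0
ids, conf = decode_actions(demo)
print(ids.tolist())  # [[1, 3]]
```

The same function works for `detect_logits`, since it only reduces over the last axis.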
|
|
| **15 actions:** |
|
|
| ``` |
| $KEEP                       Keep the word unchanged |
| $TRANSFORM_CASE_CAPITAL     Capitalize the first letter (hà nội → Hà Nội) |
| $APPEND_,                   Append a comma |
| $APPEND_.                   Append a period |
| $TRANSFORM_VERB_VB_VBN      (not used for Vietnamese) |
| $TRANSFORM_CASE_UPPER       Uppercase the whole word (who → WHO) |
| $APPEND_:                   Append a colon |
| $APPEND_?                   Append a question mark |
| $TRANSFORM_VERB_VB_VBC      (not used for Vietnamese) |
| $TRANSFORM_CASE_LOWER       Lowercase the word |
| $TRANSFORM_CASE_CAPITAL_1   Capitalize the second character |
| $TRANSFORM_CASE_UPPER_-1    Uppercase all but the last character |
| $MERGE_SPACE                Merge words |
| @@UNKNOWN@@ |
| @@PADDING@@ |
| ``` |
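As a rough illustration of how the actions listed above could be applied to words, here is a minimal single-pass sketch. `apply_actions` is a hypothetical helper, not part of this repository. The case rules follow the descriptions in the list, `$MERGE_SPACE` is assumed to join a word with the next one, and the verb transforms and special tokens are treated as no-ops. The repository's `gec_model.py` may behave differently, for example by running several passes so that one word can receive both a case change and a punctuation mark.

```python
def apply_actions(words, labels):
    """Apply one predicted action label per word and return the joined text."""
    out = []
    merge_next = False  # True when the previous word carried $MERGE_SPACE
    for word, label in zip(words, labels):
        suffix = ""
        if label == "$TRANSFORM_CASE_CAPITAL":
            word = word.capitalize()
        elif label == "$TRANSFORM_CASE_UPPER":
            word = word.upper()
        elif label == "$TRANSFORM_CASE_LOWER":
            word = word.lower()
        elif label == "$TRANSFORM_CASE_CAPITAL_1":
            word = word[0] + word[1:].capitalize()
        elif label == "$TRANSFORM_CASE_UPPER_-1":
            word = word[:-1].upper() + word[-1]
        elif label.startswith("$APPEND_"):
            suffix = label[len("$APPEND_"):]
        # $KEEP, verb transforms, and special tokens leave the word as-is.
        if merge_next and out:
            out[-1] += word + suffix  # assumption: merge joins with the next word
        else:
            out.append(word + suffix)
        merge_next = label == "$MERGE_SPACE"
    return " ".join(out)


words = ["hà", "nội", "là", "thủ", "đô"]
labels = ["$TRANSFORM_CASE_CAPITAL", "$TRANSFORM_CASE_CAPITAL",
          "$KEEP", "$KEEP", "$APPEND_."]
print(apply_actions(words, labels))  # Hà Nội là thủ đô.
```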
|
|
| ## Reproducing the export |
|
|
| ```bash |
| git clone https://huggingface.co/dragonSwing/vibert-capu |
| pip install torch transformers onnxruntime numpy |
| |
| # Export FP32 + dynamic-quantize INT8 in one step: |
| python convert_onnx/export_vibert_onnx.py \ |
| --model_dir vibert-capu \ |
| --output vibert-capu.onnx \ |
| --opset 14 \ |
| --verify |
| ``` |
|
|
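| The export script is [`convert_onnx/export_vibert_onnx.py`](https://github.com/welcomyou/sherpa-vietnamese-asr/blob/main/convert_onnx/export_vibert_onnx.py) in the sherpa-vietnamese-asr repository; the command above assumes it is run from a checkout of that repository. |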
|
|
| ## Files |
|
|
| ``` |
| config.json                   BERT config (from dragonSwing) |
| vocab.txt                     BERT vocabulary (from dragonSwing) |
| vocabulary/                   GECToR action labels |
|   d_tags.txt |
|   labels.txt |
|   non_padded_namespaces.txt |
| verb-form-vocab.txt           Verb form vocabulary |
| vibert-capu.onnx              FP32 ONNX (438 MB) |
| vibert-capu.int8.onnx         INT8 ONNX (110 MB) |
| configuration_seq2labels.py   Seq2Labels HF config class |
| modeling_seq2labels.py        Seq2Labels HF model class (PyTorch reference, not used at runtime) |
| gec_model.py                  GECToR inference helpers |
| utils.py                      Tokenization helpers |
| vocabulary.py                 GECToR Vocabulary class |
| ``` |
|
|
| ## Credits & License |
|
|
| - **Original model**: [dragonSwing/vibert-capu](https://huggingface.co/dragonSwing/vibert-capu) |
| - **Base BERT**: [FPTAI/vibert-base-cased](https://huggingface.co/FPTAI/vibert-base-cased) |
| - **Training data**: [OSCAR-2109](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109) Vietnamese subset (5.6M samples) |
|
|
| License: **CC-BY-SA-4.0** (inherited from dragonSwing/vibert-capu — derivative works must use the same license). |
|
|
| ## Used by |
|
|
| - [Sherpa Vietnamese ASR](https://github.com/welcomyou/sherpa-vietnamese-asr) — offline Vietnamese ASR for desktop and web (CPU-only). |
|
|