---
license: cc-by-sa-4.0
language:
- vi
base_model: dragonSwing/vibert-capu
tags:
- punctuation
- capitalization
- vietnamese
- onnx
- bert
- vibert
library_name: onnxruntime
---

# ViBERT-capu ONNX (FP32 + INT8)

Vietnamese **punctuation restoration + capitalization** model — an ONNX Runtime port of [dragonSwing/vibert-capu](https://huggingface.co/dragonSwing/vibert-capu). The PyTorch dependency is removed (~2 GB torch → ~50 MB onnxruntime).

| Variant | File | Size | Use case |
|---|---|---|---|
| FP32 | `vibert-capu.onnx` | 438 MB | Best accuracy; server / web service |
| INT8 | `vibert-capu.int8.onnx` | 110 MB | Desktop, embedded — dynamically quantized weights, ~99% of FP32 accuracy |

**Architecture:** BERT (FPTAI/vibert-base-cased) fine-tuned by [dragonSwing](https://huggingface.co/dragonSwing) on 5.6M OSCAR-2109 samples for the Seq2Labels punctuation + capitalization task (15 GECToR-style edit actions).

## Why ONNX?

| | PyTorch (original) | ONNX Runtime (this repo) |
|---|---|---|
| Cold start | ~6 s | ~0.8 s |
| Runtime deps | torch (~2 GB) | onnxruntime (~50 MB) |
| Portable build | very heavy | lightweight |

## Quick start

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

local = snapshot_download("welcomyou/vibert-capu-onnx")
tok = AutoTokenizer.from_pretrained(local)
sess = ort.InferenceSession(f"{local}/vibert-capu.int8.onnx",
                            providers=["CPUExecutionProvider"])

text = "hà nội là thủ đô việt nam tôi yêu nó"
enc = tok(text.split(), is_split_into_words=True, return_tensors="np")

# input_offsets: index of the first subword of each whitespace-separated word
word_ids = enc.word_ids()
offsets = []
prev = None
for i, w in enumerate(word_ids):
    if w is not None and w != prev:
        offsets.append(i)
        prev = w
input_offsets = np.array([offsets], dtype=np.int64)

logits, detect_logits = sess.run(None, {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
    "token_type_ids": enc["token_type_ids"].astype(np.int64),
    "input_offsets": input_offsets,
})
# logits:        (1, num_words, 15) — 15 GECToR actions
# detect_logits: (1, num_words, 4)  — error detection
```

## Model I/O

**Inputs** (all `int64`):

| Name | Shape | Description |
|---|---|---|
| `input_ids` | `(batch, seq_len)` | BPE token IDs from BertTokenizer |
| `attention_mask` | `(batch, seq_len)` | 1 = real token, 0 = padding |
| `token_type_ids` | `(batch, seq_len)` | Segment IDs (always 0) |
| `input_offsets` | `(batch, num_words)` | Index of the first subword of each whitespace-separated word |

**Outputs** (`float32`):

| Name | Shape | Description |
|---|---|---|
| `logits` | `(batch, num_words, 15)` | Action probabilities (15 GECToR-style edits) |
| `detect_logits` | `(batch, num_words, 4)` | Error-detection probabilities |

**15 actions:**

```
$KEEP                        Keep the word unchanged
$TRANSFORM_CASE_CAPITAL      Capitalize the first letter (hà nội → Hà Nội)
$APPEND_,                    Append a comma
$APPEND_.                    Append a period
$TRANSFORM_VERB_VB_VBN       (unused for Vietnamese)
$TRANSFORM_CASE_UPPER        Uppercase the whole word (who → WHO)
$APPEND_:                    Append a colon
$APPEND_?                    Append a question mark
$TRANSFORM_VERB_VB_VBC       (unused for Vietnamese)
$TRANSFORM_CASE_LOWER        Lowercase the word
$TRANSFORM_CASE_CAPITAL_1    Capitalize the second character
$TRANSFORM_CASE_UPPER_-1     Uppercase all but the last character
$MERGE_SPACE                 Join words (remove the space)
@@UNKNOWN@@
@@PADDING@@
```

## Reproducing the export

```bash
git clone https://huggingface.co/dragonSwing/vibert-capu
pip install torch transformers onnxruntime numpy

# Export FP32 and dynamically quantize to INT8 in one step:
python convert_onnx/export_vibert_onnx.py \
  --model_dir vibert-capu \
  --output vibert-capu.onnx \
  --opset 14 \
  --verify
```

Script: [`convert_onnx/export_vibert_onnx.py`](https://github.com/welcomyou/sherpa-vietnamese-asr/blob/main/convert_onnx/export_vibert_onnx.py).
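The model stops at raw per-word logits; turning them into restored text is done by the repo's `gec_model.py` helpers. The sketch below is a simplified stand-in, not that implementation: it takes the argmax action for each word and applies the case/append edits, falling back to `$KEEP` for actions it does not handle (`$MERGE_SPACE`, verb transforms, placeholders). The `ACTIONS` order is assumed to match the list above (i.e. `vocabulary/labels.txt`).

```python
import numpy as np

# The 15 actions in the order listed above (assumed to match labels.txt).
ACTIONS = [
    "$KEEP", "$TRANSFORM_CASE_CAPITAL", "$APPEND_,", "$APPEND_.",
    "$TRANSFORM_VERB_VB_VBN", "$TRANSFORM_CASE_UPPER", "$APPEND_:",
    "$APPEND_?", "$TRANSFORM_VERB_VB_VBC", "$TRANSFORM_CASE_LOWER",
    "$TRANSFORM_CASE_CAPITAL_1", "$TRANSFORM_CASE_UPPER_-1",
    "$MERGE_SPACE", "@@UNKNOWN@@", "@@PADDING@@",
]

def apply_actions(words: list[str], logits: np.ndarray) -> str:
    """Apply the argmax action per word; logits has shape (1, num_words, 15).

    Unsupported actions fall back to $KEEP in this sketch.
    """
    out = []
    for word, row in zip(words, logits[0]):
        action = ACTIONS[int(np.argmax(row))]
        if action == "$TRANSFORM_CASE_CAPITAL":
            word = word.capitalize()
        elif action == "$TRANSFORM_CASE_UPPER":
            word = word.upper()
        elif action == "$TRANSFORM_CASE_LOWER":
            word = word.lower()
        elif action.startswith("$APPEND_"):
            # The punctuation character is the tag suffix, e.g. "$APPEND_," → ","
            word = word + action[len("$APPEND_"):]
        out.append(word)
    return " ".join(out)
```

For example, feeding the `logits` from the quick start as `apply_actions(text.split(), logits)` yields the capitalized, punctuated sentence; each word receives exactly one action, so capitalization and punctuation on adjacent words come from separate predictions.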
## Files

```
config.json                    BERT config (from dragonSwing)
vocab.txt                      BERT vocabulary (from dragonSwing)
vocabulary/                    GECToR action labels
  d_tags.txt
  labels.txt
  non_padded_namespaces.txt
verb-form-vocab.txt            Verb-form vocabulary
vibert-capu.onnx               FP32 ONNX (438 MB)
vibert-capu.int8.onnx          INT8 ONNX (110 MB)
configuration_seq2labels.py    Seq2Labels HF config class
modeling_seq2labels.py         Seq2Labels HF model class (PyTorch reference, not used at runtime)
gec_model.py                   GECToR inference helpers
utils.py                       Tokenization helpers
vocabulary.py                  GECToR Vocabulary class
```

## Credits & License

- **Original model**: [dragonSwing/vibert-capu](https://huggingface.co/dragonSwing/vibert-capu)
- **Base BERT**: [FPTAI/vibert-base-cased](https://huggingface.co/FPTAI/vibert-base-cased)
- **Training data**: [OSCAR-2109](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109) Vietnamese subset (5.6M samples)

License: **CC-BY-SA-4.0** (inherited from dragonSwing/vibert-capu — derivative works must use the same license).

## Used by

- [Sherpa Vietnamese ASR](https://github.com/welcomyou/sherpa-vietnamese-asr) — offline Vietnamese ASR for desktop and web (CPU-only).