---
license: cc-by-sa-4.0
language:
- vi
base_model: dragonSwing/vibert-capu
tags:
- punctuation
- capitalization
- vietnamese
- onnx
- bert
- vibert
library_name: onnxruntime
---
# ViBERT-capu ONNX (FP32 + INT8)
Vietnamese **punctuation restoration + capitalization** model — ONNX Runtime version of [dragonSwing/vibert-capu](https://huggingface.co/dragonSwing/vibert-capu).
No PyTorch dependency: the runtime footprint drops from ~2 GB (torch) to ~50 MB (onnxruntime).
| Variant | File | Size | Use case |
|---|---|---|---|
| FP32 | `vibert-capu.onnx` | 438 MB | Best accuracy, server / web service |
| INT8 | `vibert-capu.int8.onnx` | 110 MB | Desktop, embedded — dynamic-quantized weights, ~99% of FP32 accuracy |
Architecture: BERT (FPTAI/vibert-base-cased) fine-tuned by [dragonSwing](https://huggingface.co/dragonSwing) on 5.6M OSCAR-2109 samples for the Seq2Labels punctuation+capitalization task (15 GECToR-style edit actions).
## Why ONNX?
| | PyTorch (original) | ONNX Runtime (this repo) |
|---|---|---|
| Cold start | ~6 s | ~0.8 s |
| Runtime deps | torch (~2 GB) | onnxruntime (~50 MB) |
| Portable build | very heavy | lightweight |
## Quick start
```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer
local = snapshot_download("welcomyou/vibert-capu-onnx")
tok = AutoTokenizer.from_pretrained(local)
sess = ort.InferenceSession(f"{local}/vibert-capu.int8.onnx",
                            providers=["CPUExecutionProvider"])
text = "hà nội là thủ đô việt nam tôi yêu nó"
enc = tok(text.split(), is_split_into_words=True, return_tensors="np")
# input_offsets: index of first subword for each word
word_ids = enc.word_ids()
offsets = []
prev = None
for i, w in enumerate(word_ids):
    if w is not None and w != prev:
        offsets.append(i)
        prev = w
input_offsets = np.array([offsets], dtype=np.int64)
logits, detect_logits = sess.run(None, {
"input_ids": enc["input_ids"].astype(np.int64),
"attention_mask": enc["attention_mask"].astype(np.int64),
"token_type_ids": enc["token_type_ids"].astype(np.int64),
"input_offsets": input_offsets,
})
# logits: (1, num_words, 15) — 15 GECToR actions
# detect_logits: (1, num_words, 4) — error detection
```
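The returned `logits` are raw scores; a softmax over the action axis followed by an argmax gives the predicted action index per word. A minimal sketch, using dummy arrays in place of real model output (shapes match the Model I/O section):

```python
import numpy as np

# Dummy logits standing in for the model output: batch=1, 5 words, 15 actions.
rng = np.random.default_rng(0)
logits = rng.standard_normal((1, 5, 15)).astype(np.float32)

# Softmax over the action axis (shifted for numerical stability),
# then argmax to get one action id per word.
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
action_ids = probs.argmax(axis=-1)  # shape (1, 5), indexes into the 15 actions

print(action_ids.shape)
```

The same argmax applies to `detect_logits` if you only need per-word error flags.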
## Model I/O
**Inputs** (all `int64`):
| Name | Shape | Description |
|---|---|---|
| `input_ids` | `(batch, seq_len)` | BPE token IDs from BertTokenizer |
| `attention_mask` | `(batch, seq_len)` | 1 = real token, 0 = padding |
| `token_type_ids` | `(batch, seq_len)` | Segment IDs (always 0) |
| `input_offsets` | `(batch, num_words)` | Index of first subword for each whitespace-separated word |
**Outputs** (`float32`):
| Name | Shape | Description |
|---|---|---|
| `logits` | `(batch, num_words, 15)` | Raw action logits (15 GECToR-style edits); apply softmax/argmax per word |
| `detect_logits` | `(batch, num_words, 4)` | Raw error-detection logits |
**15 actions:**
```
$KEEP                       Keep the word unchanged
$TRANSFORM_CASE_CAPITAL     Capitalize the first letter (hà nội → Hà Nội)
$APPEND_,                   Append a comma
$APPEND_.                   Append a period
$TRANSFORM_VERB_VB_VBN      (unused for Vietnamese)
$TRANSFORM_CASE_UPPER       Uppercase the whole word (who → WHO)
$APPEND_:                   Append a colon
$APPEND_?                   Append a question mark
$TRANSFORM_VERB_VB_VBC      (unused for Vietnamese)
$TRANSFORM_CASE_LOWER       Lowercase the word
$TRANSFORM_CASE_CAPITAL_1   Capitalize the second character
$TRANSFORM_CASE_UPPER_-1    Uppercase all but the last character
$MERGE_SPACE                Merge with the next word
@@UNKNOWN@@
@@PADDING@@
```
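The full decoding loop lives in `gec_model.py`; the index-to-action mapping is loaded from `vocabulary/labels.txt` at runtime. As an illustrative sketch only (the hard-coded action strings below are assumptions, not real model output), applying a few of the actions above to a word list:

```python
# Hypothetical per-word actions; real code maps argmax indices to action
# strings via vocabulary/labels.txt and handles the remaining actions too.
words = ["tôi", "yêu", "nó"]
actions = ["$TRANSFORM_CASE_CAPITAL", "$KEEP", "$APPEND_."]

restored = []
for word, action in zip(words, actions):
    if action == "$TRANSFORM_CASE_CAPITAL":
        word = word.capitalize()          # hà -> Hà
    elif action == "$TRANSFORM_CASE_UPPER":
        word = word.upper()               # who -> WHO
    elif action == "$TRANSFORM_CASE_LOWER":
        word = word.lower()
    elif action.startswith("$APPEND_"):
        word += action[len("$APPEND_"):]  # append the punctuation mark
    # $KEEP and unhandled actions leave the word unchanged in this sketch.
    restored.append(word)

print(" ".join(restored))  # Tôi yêu nó.
```

This one-action-per-word scheme is what makes the model fast: restoration is a single classification pass, not generation.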
## Reproducing the export
```bash
git clone https://huggingface.co/dragonSwing/vibert-capu
pip install torch transformers onnxruntime numpy
# Export FP32 + dynamic-quantize INT8 in one step:
python convert_onnx/export_vibert_onnx.py \
--model_dir vibert-capu \
--output vibert-capu.onnx \
--opset 14 \
--verify
```
Script: [`convert_onnx/export_vibert_onnx.py`](https://github.com/welcomyou/sherpa-vietnamese-asr/blob/main/convert_onnx/export_vibert_onnx.py).
## Files
```
config.json BERT config (from dragonSwing)
vocab.txt BERT vocabulary (from dragonSwing)
vocabulary/ GECToR action labels
d_tags.txt
labels.txt
non_padded_namespaces.txt
verb-form-vocab.txt Verb form vocabulary
vibert-capu.onnx FP32 ONNX (438 MB)
vibert-capu.int8.onnx INT8 ONNX (110 MB)
configuration_seq2labels.py Seq2Labels HF config class
modeling_seq2labels.py Seq2Labels HF model class (PyTorch reference, not used at runtime)
gec_model.py GECToR inference helpers
utils.py Tokenization helpers
vocabulary.py GECToR Vocabulary class
```
## Credits & License
- **Original model**: [dragonSwing/vibert-capu](https://huggingface.co/dragonSwing/vibert-capu)
- **Base BERT**: [FPTAI/vibert-base-cased](https://huggingface.co/FPTAI/vibert-base-cased)
- **Training data**: [OSCAR-2109](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109) Vietnamese subset (5.6M samples)
License: **CC-BY-SA-4.0** (inherited from dragonSwing/vibert-capu — derivative works must use the same license).
## Used by
- [Sherpa Vietnamese ASR](https://github.com/welcomyou/sherpa-vietnamese-asr) — offline Vietnamese ASR for desktop and web (CPU-only).