---
license: cc-by-sa-4.0
language:
- vi
base_model: dragonSwing/vibert-capu
tags:
- punctuation
- capitalization
- vietnamese
- onnx
- bert
- vibert
library_name: onnxruntime
---
# ViBERT-capu ONNX (FP32 + INT8)
Vietnamese **punctuation restoration + capitalization** model — ONNX Runtime version of [dragonSwing/vibert-capu](https://huggingface.co/dragonSwing/vibert-capu).
No PyTorch dependency at inference time: the ~2 GB torch runtime is replaced by onnxruntime (~50 MB).
| Variant | File | Size | Use case |
|---|---|---|---|
| FP32 | `vibert-capu.onnx` | 438 MB | Best accuracy, server / web service |
| INT8 | `vibert-capu.int8.onnx` | 110 MB | Desktop, embedded — dynamic-quantized weights, ~99% of FP32 accuracy |
Architecture: BERT (FPTAI/vibert-base-cased) fine-tuned by [dragonSwing](https://huggingface.co/dragonSwing) on 5.6M OSCAR-2109 samples for the Seq2Labels punctuation+capitalization task (15 GECToR-style edit actions).
## Why ONNX?
| | PyTorch (original) | ONNX Runtime (this repo) |
|---|---|---|
| Cold start | ~6 s | ~0.8 s |
| Runtime deps | torch (~2 GB) | onnxruntime (~50 MB) |
| Portable build | very heavy | lightweight |
## Quick start
```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer
local = snapshot_download("welcomyou/vibert-capu-onnx")
tok = AutoTokenizer.from_pretrained(local)
sess = ort.InferenceSession(f"{local}/vibert-capu.int8.onnx",
                            providers=["CPUExecutionProvider"])
text = "hà nội là thủ đô việt nam tôi yêu nó"
enc = tok(text.split(), is_split_into_words=True, return_tensors="np")
# input_offsets: index of first subword for each word
word_ids = enc.word_ids()
offsets = []
prev = None
for i, w in enumerate(word_ids):
    if w is not None and w != prev:
        offsets.append(i)
        prev = w
input_offsets = np.array([offsets], dtype=np.int64)
logits, detect_logits = sess.run(None, {
"input_ids": enc["input_ids"].astype(np.int64),
"attention_mask": enc["attention_mask"].astype(np.int64),
"token_type_ids": enc["token_type_ids"].astype(np.int64),
"input_offsets": input_offsets,
})
# logits: (1, num_words, 15) — 15 GECToR actions
# detect_logits: (1, num_words, 4) — error detection
```
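Decoding the raw outputs starts with an argmax over the last axis, then a lookup into the action-label vocabulary. A minimal sketch with dummy data; it assumes `vocabulary/labels.txt` lists one action per line in the same order as the logits dimension (the label list below is abbreviated with placeholder names, not the file's real contents):

```python
import numpy as np

# Hypothetical logits for 3 words over 15 actions; in practice use the
# `logits` array returned by sess.run above.
logits = np.zeros((1, 3, 15), dtype=np.float32)
logits[0, 0, 1] = 5.0   # word 0 → action index 1
logits[0, 1, 0] = 5.0   # word 1 → action index 0
logits[0, 2, 3] = 5.0   # word 2 → action index 3

# Assumption: one label per line in vocabulary/labels.txt, in logit order.
labels = ["$KEEP", "$TRANSFORM_CASE_CAPITAL", "$APPEND_,", "$APPEND_."]
labels += [f"action_{i}" for i in range(4, 15)]  # placeholder tail

action_ids = logits.argmax(axis=-1)[0]           # shape (num_words,)
actions = [labels[i] for i in action_ids]
print(actions)  # ['$TRANSFORM_CASE_CAPITAL', '$KEEP', '$APPEND_.']
```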
## Model I/O
**Inputs** (all `int64`):
| Name | Shape | Description |
|---|---|---|
| `input_ids` | `(batch, seq_len)` | BPE token IDs from BertTokenizer |
| `attention_mask` | `(batch, seq_len)` | 1 = real token, 0 = padding |
| `token_type_ids` | `(batch, seq_len)` | Segment IDs (always 0) |
| `input_offsets` | `(batch, num_words)` | Index of first subword for each whitespace-separated word |
**Outputs** (`float32`):
| Name | Shape | Description |
|---|---|---|
| `logits` | `(batch, num_words, 15)` | Unnormalized scores over the 15 GECToR-style edit actions (take argmax or softmax to decode) |
| `detect_logits` | `(batch, num_words, 4)` | Unnormalized scores over the 4 error-detection tags |
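Since both outputs are logits, a softmax over the last axis is needed when actual probabilities are wanted. A small numpy sketch with dummy data (the 4 detection tags come from `vocabulary/d_tags.txt`; their exact names and order are not reproduced here):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical detect_logits for 2 words over 4 detection tags.
detect_logits = np.array([[[4.0, 0.0, 0.0, 0.0],
                           [0.0, 3.0, 0.0, 0.0]]], dtype=np.float32)

probs = softmax(detect_logits)      # (1, num_words, 4), each row sums to 1
top_tag = probs.argmax(axis=-1)[0]  # most likely detection tag per word
```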
**15 actions:**
```
$KEEP                       Keep the word unchanged
$TRANSFORM_CASE_CAPITAL     Capitalize the first letter (hà nội → Hà Nội)
$APPEND_,                   Append a comma
$APPEND_.                   Append a period
$TRANSFORM_VERB_VB_VBN      (not used for Vietnamese)
$TRANSFORM_CASE_UPPER       Uppercase the whole word (who → WHO)
$APPEND_:                   Append a colon
$APPEND_?                   Append a question mark
$TRANSFORM_VERB_VB_VBC      (not used for Vietnamese)
$TRANSFORM_CASE_LOWER       Lowercase the word
$TRANSFORM_CASE_CAPITAL_1   Capitalize the second character
$TRANSFORM_CASE_UPPER_-1    Uppercase all but the last character
$MERGE_SPACE                Merge with the following word
@@UNKNOWN@@                 Unknown action
@@PADDING@@                 Padding
```
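Applying the predicted actions to the word list is plain string manipulation. An illustrative sketch covering only the case and append actions; it is not the project's `gec_model.py` logic, which also handles `$MERGE_SPACE` and confidence thresholds:

```python
def apply_actions(words, actions):
    """Rebuild punctuated, capitalized text from per-word action labels.

    Illustrative only: handles $KEEP, the case transforms, and $APPEND_X.
    """
    out = []
    for word, action in zip(words, actions):
        if action == "$TRANSFORM_CASE_CAPITAL":
            word = word[:1].upper() + word[1:]
        elif action == "$TRANSFORM_CASE_UPPER":
            word = word.upper()
        elif action == "$TRANSFORM_CASE_LOWER":
            word = word.lower()
        elif action.startswith("$APPEND_"):
            # $APPEND_. appends ".", $APPEND_, appends ",", etc.
            word = word + action[len("$APPEND_"):]
        out.append(word)
    return " ".join(out)

words = "hà nội là thủ đô việt nam".split()
actions = ["$TRANSFORM_CASE_CAPITAL", "$TRANSFORM_CASE_CAPITAL", "$KEEP",
           "$KEEP", "$KEEP", "$TRANSFORM_CASE_CAPITAL",
           "$TRANSFORM_CASE_CAPITAL"]
print(apply_actions(words, actions))  # Hà Nội là thủ đô Việt Nam
```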
## Reproducing the export
```bash
git clone https://huggingface.co/dragonSwing/vibert-capu
pip install torch transformers onnxruntime numpy
# Export FP32 + dynamic-quantize INT8 in one step:
python convert_onnx/export_vibert_onnx.py \
--model_dir vibert-capu \
--output vibert-capu.onnx \
--opset 14 \
--verify
```
Script: [`convert_onnx/export_vibert_onnx.py`](https://github.com/welcomyou/sherpa-vietnamese-asr/blob/main/convert_onnx/export_vibert_onnx.py).
## Files
```
config.json BERT config (from dragonSwing)
vocab.txt BERT vocabulary (from dragonSwing)
vocabulary/ GECToR action labels
d_tags.txt
labels.txt
non_padded_namespaces.txt
verb-form-vocab.txt Verb form vocabulary
vibert-capu.onnx FP32 ONNX (438 MB)
vibert-capu.int8.onnx INT8 ONNX (110 MB)
configuration_seq2labels.py Seq2Labels HF config class
modeling_seq2labels.py Seq2Labels HF model class (PyTorch reference, not used at runtime)
gec_model.py GECToR inference helpers
utils.py Tokenization helpers
vocabulary.py GECToR Vocabulary class
```
## Credits & License
- **Original model**: [dragonSwing/vibert-capu](https://huggingface.co/dragonSwing/vibert-capu)
- **Base BERT**: [FPTAI/vibert-base-cased](https://huggingface.co/FPTAI/vibert-base-cased)
- **Training data**: [OSCAR-2109](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109) Vietnamese subset (5.6M samples)
License: **CC-BY-SA-4.0** (inherited from dragonSwing/vibert-capu — derivative works must use the same license).
## Used by
- [Sherpa Vietnamese ASR](https://github.com/welcomyou/sherpa-vietnamese-asr) — offline Vietnamese ASR for desktop and web (CPU-only).