---
license: cc-by-sa-4.0
language:
- vi
base_model: dragonSwing/vibert-capu
tags:
- punctuation
- capitalization
- vietnamese
- onnx
- bert
- vibert
library_name: onnxruntime
---

# ViBERT-capu ONNX (FP32 + INT8)

Vietnamese **punctuation restoration + capitalization** model: an ONNX Runtime port of [dragonSwing/vibert-capu](https://huggingface.co/dragonSwing/vibert-capu).
No PyTorch dependency at inference time: the runtime footprint drops from ~2 GB (torch) to ~50 MB (onnxruntime).

| Variant | File | Size | Use case |
|---|---|---|---|
| FP32 | `vibert-capu.onnx` | 438 MB | Best accuracy, server / web service |
| INT8 | `vibert-capu.int8.onnx` | 110 MB | Desktop, embedded — dynamic-quantized weights, ~99% of FP32 accuracy |

Architecture: BERT (FPTAI/vibert-base-cased) fine-tuned by [dragonSwing](https://huggingface.co/dragonSwing) on 5.6M OSCAR-2109 samples for the Seq2Labels punctuation+capitalization task (15 GECToR-style edit actions).

## Why ONNX?

| | PyTorch (original) | ONNX Runtime (this repo) |
|---|---|---|
| Cold start | ~6 s | ~0.8 s |
| Runtime deps | torch (~2 GB) | onnxruntime (~50 MB) |
| Portable build | very heavy | lightweight |

## Quick start

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

local = snapshot_download("welcomyou/vibert-capu-onnx")
tok = AutoTokenizer.from_pretrained(local)
sess = ort.InferenceSession(f"{local}/vibert-capu.int8.onnx",
                            providers=["CPUExecutionProvider"])

text = "hà nội là thủ đô việt nam tôi yêu nó"
enc = tok(text.split(), is_split_into_words=True, return_tensors="np")
# input_offsets: index of first subword for each word
word_ids = enc.word_ids()
offsets = []
prev = None
for i, w in enumerate(word_ids):
    if w is not None and w != prev:
        offsets.append(i)
        prev = w
input_offsets = np.array([offsets], dtype=np.int64)

logits, detect_logits = sess.run(None, {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
    "token_type_ids": enc["token_type_ids"].astype(np.int64),
    "input_offsets": input_offsets,
})
# logits: (1, num_words, 15)  — 15 GECToR actions
# detect_logits: (1, num_words, 4) — error detection
```

## Model I/O

**Inputs** (all `int64`):

| Name | Shape | Description |
|---|---|---|
| `input_ids` | `(batch, seq_len)` | BPE token IDs from BertTokenizer |
| `attention_mask` | `(batch, seq_len)` | 1 = real token, 0 = padding |
| `token_type_ids` | `(batch, seq_len)` | Segment IDs (always 0) |
| `input_offsets` | `(batch, num_words)` | Index of first subword for each whitespace-separated word |
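
The `input_offsets` input is the only non-standard one. It can be derived from a fast tokenizer's `word_ids()` output; a minimal helper sketch (the function name is my own, not part of this repo):

```python
def first_subword_offsets(word_ids):
    """Return the index of the first subword token for each word.

    `word_ids` is the list returned by a HuggingFace fast tokenizer's
    `word_ids()` method: one entry per token, `None` for special tokens
    such as [CLS] and [SEP].
    """
    offsets = []
    prev = None
    for i, w in enumerate(word_ids):
        if w is not None and w != prev:
            offsets.append(i)
        prev = w
    return offsets

# e.g. a hypothetical tokenization "[CLS] hà n ##ội [SEP]" covers two words,
# whose first subwords sit at token indices 1 and 2
print(first_subword_offsets([None, 0, 1, 1, None]))  # [1, 2]
```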

**Outputs** (`float32`):

| Name | Shape | Description |
|---|---|---|
| `logits` | `(batch, num_words, 15)` | Per-word logits over the 15 GECToR-style edit actions (take argmax, or softmax for scores) |
| `detect_logits` | `(batch, num_words, 4)` | Per-word error-detection logits (4 detection tags from `vocabulary/d_tags.txt`) |

**15 actions:**

```
$KEEP                      Keep the word unchanged
$TRANSFORM_CASE_CAPITAL    Capitalize the first letter (hà nội → Hà Nội)
$APPEND_,                  Append a comma
$APPEND_.                  Append a period
$TRANSFORM_VERB_VB_VBN     (unused for Vietnamese)
$TRANSFORM_CASE_UPPER      Uppercase the whole word (who → WHO)
$APPEND_:                  Append a colon
$APPEND_?                  Append a question mark
$TRANSFORM_VERB_VB_VBC     (unused for Vietnamese)
$TRANSFORM_CASE_LOWER      Lowercase the word
$TRANSFORM_CASE_CAPITAL_1  Capitalize the second character
$TRANSFORM_CASE_UPPER_-1   Uppercase all but the last character
$MERGE_SPACE               Merge with the adjacent word
@@UNKNOWN@@                Unknown action
@@PADDING@@                Padding label
```
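
Applied greedily per word, these actions reconstruct the punctuated, cased text. A minimal decoding sketch; the label order is assumed to match `vocabulary/labels.txt` as listed above, and the full pipeline (including merges and the detection head) lives in `gec_model.py`:

```python
import numpy as np

# Assumed label order, copied from the action list above
LABELS = [
    "$KEEP", "$TRANSFORM_CASE_CAPITAL", "$APPEND_,", "$APPEND_.",
    "$TRANSFORM_VERB_VB_VBN", "$TRANSFORM_CASE_UPPER", "$APPEND_:",
    "$APPEND_?", "$TRANSFORM_VERB_VB_VBC", "$TRANSFORM_CASE_LOWER",
    "$TRANSFORM_CASE_CAPITAL_1", "$TRANSFORM_CASE_UPPER_-1",
    "$MERGE_SPACE", "@@UNKNOWN@@", "@@PADDING@@",
]

def apply_actions(words, word_logits):
    """Apply the argmax edit action to each word.

    `word_logits` has shape (num_words, 15). Merge and verb actions
    are ignored in this sketch.
    """
    out = []
    for word, row in zip(words, word_logits):
        action = LABELS[int(np.argmax(row))]
        if action == "$TRANSFORM_CASE_CAPITAL":
            word = word.capitalize()
        elif action == "$TRANSFORM_CASE_UPPER":
            word = word.upper()
        elif action == "$TRANSFORM_CASE_LOWER":
            word = word.lower()
        elif action.startswith("$APPEND_"):
            word += action[len("$APPEND_"):]
        out.append(word)
    return " ".join(out)
```

For the quick-start sentence, `apply_actions(text.split(), logits[0])` would yield something like `"Hà Nội là thủ đô Việt Nam. Tôi yêu nó."`, assuming the model predicts the expected actions.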

## Reproducing the export

```bash
git clone https://huggingface.co/dragonSwing/vibert-capu
pip install torch transformers onnxruntime numpy

# Export FP32 + dynamic-quantize INT8 in one step:
python convert_onnx/export_vibert_onnx.py \
    --model_dir vibert-capu \
    --output    vibert-capu.onnx \
    --opset     14 \
    --verify
```

Script: [`convert_onnx/export_vibert_onnx.py`](https://github.com/welcomyou/sherpa-vietnamese-asr/blob/main/convert_onnx/export_vibert_onnx.py).

## Files

```
config.json                    BERT config (from dragonSwing)
vocab.txt                      BERT vocabulary (from dragonSwing)
vocabulary/                    GECToR action labels
  d_tags.txt
  labels.txt
  non_padded_namespaces.txt
verb-form-vocab.txt            Verb form vocabulary
vibert-capu.onnx               FP32 ONNX (438 MB)
vibert-capu.int8.onnx          INT8 ONNX (110 MB)
configuration_seq2labels.py    Seq2Labels HF config class
modeling_seq2labels.py         Seq2Labels HF model class (PyTorch reference, not used at runtime)
gec_model.py                   GECToR inference helpers
utils.py                       Tokenization helpers
vocabulary.py                  GECToR Vocabulary class
```

## Credits & License

- **Original model**: [dragonSwing/vibert-capu](https://huggingface.co/dragonSwing/vibert-capu)
- **Base BERT**: [FPTAI/vibert-base-cased](https://huggingface.co/FPTAI/vibert-base-cased)
- **Training data**: [OSCAR-2109](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109) Vietnamese subset (5.6M samples)

License: **CC-BY-SA-4.0** (inherited from dragonSwing/vibert-capu — derivative works must use the same license).

## Used by

- [Sherpa Vietnamese ASR](https://github.com/welcomyou/sherpa-vietnamese-asr) — offline Vietnamese ASR for desktop and web (CPU-only).