---
license: cc-by-sa-4.0
language:
- vi
base_model: dragonSwing/vibert-capu
tags:
- punctuation
- capitalization
- vietnamese
- onnx
- bert
- vibert
library_name: onnxruntime
---

# ViBERT-capu ONNX (FP32 + INT8)

Vietnamese **punctuation restoration + capitalization** model: an ONNX Runtime port of [dragonSwing/vibert-capu](https://huggingface.co/dragonSwing/vibert-capu).
No PyTorch dependency at inference time: the runtime footprint drops from ~2 GB (torch) to ~50 MB (onnxruntime).

| Variant | File | Size | Use case |
|---|---|---|---|
| FP32 | `vibert-capu.onnx` | 438 MB | Best accuracy, server / web service |
| INT8 | `vibert-capu.int8.onnx` | 110 MB | Desktop, embedded — dynamic-quantized weights, ~99% of FP32 accuracy |

Architecture: BERT (FPTAI/vibert-base-cased) fine-tuned by [dragonSwing](https://huggingface.co/dragonSwing) on 5.6M OSCAR-2109 samples for the Seq2Labels punctuation+capitalization task (15 GECToR-style edit actions).

## Why ONNX?

| | PyTorch (original) | ONNX Runtime (this repo) |
|---|---|---|
| Cold start | ~6 s | ~0.8 s |
| Runtime deps | torch (~2 GB) | onnxruntime (~50 MB) |
| Portable build | very heavy | lightweight |

## Quick start

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

local = snapshot_download("welcomyou/vibert-capu-onnx")
tok = AutoTokenizer.from_pretrained(local)
sess = ort.InferenceSession(f"{local}/vibert-capu.int8.onnx",
                            providers=["CPUExecutionProvider"])

text = "hà nội là thủ đô việt nam tôi yêu nó"
enc = tok(text.split(), is_split_into_words=True, return_tensors="np")
# input_offsets: index of first subword for each word
word_ids = enc.word_ids()
offsets = []
prev = None
for i, w in enumerate(word_ids):
    if w is not None and w != prev:
        offsets.append(i)
        prev = w
input_offsets = np.array([offsets], dtype=np.int64)

logits, detect_logits = sess.run(None, {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
    "token_type_ids": enc["token_type_ids"].astype(np.int64),
    "input_offsets": input_offsets,
})
# logits: (1, num_words, 15)  — 15 GECToR actions
# detect_logits: (1, num_words, 4) — error detection
```

## Model I/O

**Inputs** (all `int64`):

| Name | Shape | Description |
|---|---|---|
| `input_ids` | `(batch, seq_len)` | BPE token IDs from BertTokenizer |
| `attention_mask` | `(batch, seq_len)` | 1 = real token, 0 = padding |
| `token_type_ids` | `(batch, seq_len)` | Segment IDs (always 0) |
| `input_offsets` | `(batch, num_words)` | Index of first subword for each whitespace-separated word |
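
The `input_offsets` input is the only non-standard one. It can be derived from a fast tokenizer's `word_ids()` output; a minimal helper sketch (the function name is my own, not part of this repo):

```python
def first_subword_offsets(word_ids):
    """Return the index of the first subword token for each word.

    `word_ids` is the list returned by a HuggingFace fast tokenizer's
    `word_ids()` method: one entry per token, `None` for special tokens
    such as [CLS] and [SEP].
    """
    offsets = []
    prev = None
    for i, w in enumerate(word_ids):
        if w is not None and w != prev:
            offsets.append(i)
        prev = w
    return offsets

# e.g. a hypothetical tokenization "[CLS] hà n ##ội [SEP]" covers two words,
# whose first subwords sit at token indices 1 and 2
print(first_subword_offsets([None, 0, 1, 1, None]))  # [1, 2]
```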

**Outputs** (`float32`):

| Name | Shape | Description |
|---|---|---|
| `logits` | `(batch, num_words, 15)` | Per-word logits over the 15 GECToR-style edit actions (take argmax, or softmax for scores) |
| `detect_logits` | `(batch, num_words, 4)` | Per-word error-detection logits (4 detection tags from `vocabulary/d_tags.txt`) |

**15 actions:**

```
$KEEP                      Keep the word unchanged
$TRANSFORM_CASE_CAPITAL    Capitalize the first letter (hà nội → Hà Nội)
$APPEND_,                  Append a comma
$APPEND_.                  Append a period
$TRANSFORM_VERB_VB_VBN     (unused for Vietnamese)
$TRANSFORM_CASE_UPPER      Uppercase the whole word (who → WHO)
$APPEND_:                  Append a colon
$APPEND_?                  Append a question mark
$TRANSFORM_VERB_VB_VBC     (unused for Vietnamese)
$TRANSFORM_CASE_LOWER      Lowercase the word
$TRANSFORM_CASE_CAPITAL_1  Capitalize the second character
$TRANSFORM_CASE_UPPER_-1   Uppercase all but the last character
$MERGE_SPACE               Merge with the adjacent word
@@UNKNOWN@@                Unknown action
@@PADDING@@                Padding label
```
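
Applied greedily per word, these actions reconstruct the punctuated, cased text. A minimal decoding sketch; the label order is assumed to match `vocabulary/labels.txt` as listed above, and the full pipeline (including merges and the detection head) lives in `gec_model.py`:

```python
import numpy as np

# Assumed label order, copied from the action list above
LABELS = [
    "$KEEP", "$TRANSFORM_CASE_CAPITAL", "$APPEND_,", "$APPEND_.",
    "$TRANSFORM_VERB_VB_VBN", "$TRANSFORM_CASE_UPPER", "$APPEND_:",
    "$APPEND_?", "$TRANSFORM_VERB_VB_VBC", "$TRANSFORM_CASE_LOWER",
    "$TRANSFORM_CASE_CAPITAL_1", "$TRANSFORM_CASE_UPPER_-1",
    "$MERGE_SPACE", "@@UNKNOWN@@", "@@PADDING@@",
]

def apply_actions(words, word_logits):
    """Apply the argmax edit action to each word.

    `word_logits` has shape (num_words, 15). Merge and verb actions
    are ignored in this sketch.
    """
    out = []
    for word, row in zip(words, word_logits):
        action = LABELS[int(np.argmax(row))]
        if action == "$TRANSFORM_CASE_CAPITAL":
            word = word.capitalize()
        elif action == "$TRANSFORM_CASE_UPPER":
            word = word.upper()
        elif action == "$TRANSFORM_CASE_LOWER":
            word = word.lower()
        elif action.startswith("$APPEND_"):
            word += action[len("$APPEND_"):]
        out.append(word)
    return " ".join(out)
```

For the quick-start sentence, `apply_actions(text.split(), logits[0])` would yield something like `"Hà Nội là thủ đô Việt Nam. Tôi yêu nó."`, assuming the model predicts the expected actions.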

## Reproducing the export

```bash
git clone https://huggingface.co/dragonSwing/vibert-capu
pip install torch transformers onnxruntime numpy

# Export FP32 + dynamic-quantize INT8 in one step:
python convert_onnx/export_vibert_onnx.py \
    --model_dir vibert-capu \
    --output    vibert-capu.onnx \
    --opset     14 \
    --verify
```

Script: [`convert_onnx/export_vibert_onnx.py`](https://github.com/welcomyou/sherpa-vietnamese-asr/blob/main/convert_onnx/export_vibert_onnx.py).

## Files

```
config.json                    BERT config (from dragonSwing)
vocab.txt                      BERT vocabulary (from dragonSwing)
vocabulary/                    GECToR action labels
  d_tags.txt
  labels.txt
  non_padded_namespaces.txt
verb-form-vocab.txt            Verb form vocabulary
vibert-capu.onnx               FP32 ONNX (438 MB)
vibert-capu.int8.onnx          INT8 ONNX (110 MB)
configuration_seq2labels.py    Seq2Labels HF config class
modeling_seq2labels.py         Seq2Labels HF model class (PyTorch reference, not used at runtime)
gec_model.py                   GECToR inference helpers
utils.py                       Tokenization helpers
vocabulary.py                  GECToR Vocabulary class
```

## Credits & License

- **Original model**: [dragonSwing/vibert-capu](https://huggingface.co/dragonSwing/vibert-capu)
- **Base BERT**: [FPTAI/vibert-base-cased](https://huggingface.co/FPTAI/vibert-base-cased)
- **Training data**: [OSCAR-2109](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109) Vietnamese subset (5.6M samples)

License: **CC-BY-SA-4.0** (inherited from dragonSwing/vibert-capu — derivative works must use the same license).

## Used by

- [Sherpa Vietnamese ASR](https://github.com/welcomyou/sherpa-vietnamese-asr) — offline Vietnamese ASR for desktop and web (CPU-only).