---
license: cc-by-sa-4.0
language:
- vi
base_model: dragonSwing/vibert-capu
tags:
- punctuation
- capitalization
- vietnamese
- onnx
- bert
- vibert
library_name: onnxruntime
---
# ViBERT-capu ONNX (FP32 + INT8)
Vietnamese **punctuation restoration + capitalization** model — ONNX Runtime version of [dragonSwing/vibert-capu](https://huggingface.co/dragonSwing/vibert-capu).
No PyTorch dependency at inference time: the ~2 GB torch runtime is replaced by onnxruntime (~50 MB).
| Variant | File | Size | Use case |
|---|---|---|---|
| FP32 | `vibert-capu.onnx` | 438 MB | Best accuracy, server / web service |
| INT8 | `vibert-capu.int8.onnx` | 110 MB | Desktop, embedded — dynamic-quantized weights, ~99% of FP32 accuracy |
Architecture: BERT (FPTAI/vibert-base-cased) fine-tuned by [dragonSwing](https://huggingface.co/dragonSwing) on 5.6M OSCAR-2109 samples for the Seq2Labels punctuation+capitalization task (15 GECToR-style edit actions).
## Why ONNX?
| | PyTorch (original) | ONNX Runtime (this repo) |
|---|---|---|
| Cold start | ~6 s | ~0.8 s |
| Runtime deps | torch (~2 GB) | onnxruntime (~50 MB) |
| Portable build | very heavy | lightweight |
## Quick start
```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer
local = snapshot_download("welcomyou/vibert-capu-onnx")
tok = AutoTokenizer.from_pretrained(local)
sess = ort.InferenceSession(f"{local}/vibert-capu.int8.onnx",
                            providers=["CPUExecutionProvider"])
text = "hà nội là thủ đô việt nam tôi yêu nó"
enc = tok(text.split(), is_split_into_words=True, return_tensors="np")
# input_offsets: index of first subword for each word
word_ids = enc.word_ids()
offsets = []
prev = None
for i, w in enumerate(word_ids):
    if w is not None and w != prev:
        offsets.append(i)
        prev = w
input_offsets = np.array([offsets], dtype=np.int64)
logits, detect_logits = sess.run(None, {
"input_ids": enc["input_ids"].astype(np.int64),
"attention_mask": enc["attention_mask"].astype(np.int64),
"token_type_ids": enc["token_type_ids"].astype(np.int64),
"input_offsets": input_offsets,
})
# logits: (1, num_words, 15) — 15 GECToR actions
# detect_logits: (1, num_words, 4) — error detection
```
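Decoding the raw outputs starts with an argmax over the last axis, then a lookup into the action-label vocabulary. A minimal sketch with dummy data; it assumes `vocabulary/labels.txt` lists one action per line in the same order as the logits dimension (the label list below is abbreviated with placeholder names, not the file's real contents):

```python
import numpy as np

# Hypothetical logits for 3 words over 15 actions; in practice use the
# `logits` array returned by sess.run above.
logits = np.zeros((1, 3, 15), dtype=np.float32)
logits[0, 0, 1] = 5.0   # word 0 → action index 1
logits[0, 1, 0] = 5.0   # word 1 → action index 0
logits[0, 2, 3] = 5.0   # word 2 → action index 3

# Assumption: one label per line in vocabulary/labels.txt, in logit order.
labels = ["$KEEP", "$TRANSFORM_CASE_CAPITAL", "$APPEND_,", "$APPEND_."]
labels += [f"action_{i}" for i in range(4, 15)]  # placeholder tail

action_ids = logits.argmax(axis=-1)[0]           # shape (num_words,)
actions = [labels[i] for i in action_ids]
print(actions)  # ['$TRANSFORM_CASE_CAPITAL', '$KEEP', '$APPEND_.']
```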
## Model I/O
**Inputs** (all `int64`):
| Name | Shape | Description |
|---|---|---|
| `input_ids` | `(batch, seq_len)` | BPE token IDs from BertTokenizer |
| `attention_mask` | `(batch, seq_len)` | 1 = real token, 0 = padding |
| `token_type_ids` | `(batch, seq_len)` | Segment IDs (always 0) |
| `input_offsets` | `(batch, num_words)` | Index of first subword for each whitespace-separated word |
**Outputs** (`float32`):
| Name | Shape | Description |
|---|---|---|
| `logits` | `(batch, num_words, 15)` | Unnormalized scores over the 15 GECToR-style edit actions (take argmax or softmax to decode) |
| `detect_logits` | `(batch, num_words, 4)` | Unnormalized scores over the 4 error-detection tags |
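Since both outputs are logits, a softmax over the last axis is needed when actual probabilities are wanted. A small numpy sketch with dummy data (the 4 detection tags come from `vocabulary/d_tags.txt`; their exact names and order are not reproduced here):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical detect_logits for 2 words over 4 detection tags.
detect_logits = np.array([[[4.0, 0.0, 0.0, 0.0],
                           [0.0, 3.0, 0.0, 0.0]]], dtype=np.float32)

probs = softmax(detect_logits)      # (1, num_words, 4), each row sums to 1
top_tag = probs.argmax(axis=-1)[0]  # most likely detection tag per word
```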
**15 actions:**
```
$KEEP                       Keep the word unchanged
$TRANSFORM_CASE_CAPITAL     Capitalize the first letter (hà nội → Hà Nội)
$APPEND_,                   Append a comma
$APPEND_.                   Append a period
$TRANSFORM_VERB_VB_VBN      (not used for Vietnamese)
$TRANSFORM_CASE_UPPER       Uppercase the whole word (who → WHO)
$APPEND_:                   Append a colon
$APPEND_?                   Append a question mark
$TRANSFORM_VERB_VB_VBC      (not used for Vietnamese)
$TRANSFORM_CASE_LOWER       Lowercase the word
$TRANSFORM_CASE_CAPITAL_1   Capitalize the second character
$TRANSFORM_CASE_UPPER_-1    Uppercase all but the last character
$MERGE_SPACE                Merge with the following word
@@UNKNOWN@@                 Unknown action
@@PADDING@@                 Padding
```
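Applying the predicted actions to the word list is plain string manipulation. An illustrative sketch covering only the case and append actions; it is not the project's `gec_model.py` logic, which also handles `$MERGE_SPACE` and confidence thresholds:

```python
def apply_actions(words, actions):
    """Rebuild punctuated, capitalized text from per-word action labels.

    Illustrative only: handles $KEEP, the case transforms, and $APPEND_X.
    """
    out = []
    for word, action in zip(words, actions):
        if action == "$TRANSFORM_CASE_CAPITAL":
            word = word[:1].upper() + word[1:]
        elif action == "$TRANSFORM_CASE_UPPER":
            word = word.upper()
        elif action == "$TRANSFORM_CASE_LOWER":
            word = word.lower()
        elif action.startswith("$APPEND_"):
            # $APPEND_. appends ".", $APPEND_, appends ",", etc.
            word = word + action[len("$APPEND_"):]
        out.append(word)
    return " ".join(out)

words = "hà nội là thủ đô việt nam".split()
actions = ["$TRANSFORM_CASE_CAPITAL", "$TRANSFORM_CASE_CAPITAL", "$KEEP",
           "$KEEP", "$KEEP", "$TRANSFORM_CASE_CAPITAL",
           "$TRANSFORM_CASE_CAPITAL"]
print(apply_actions(words, actions))  # Hà Nội là thủ đô Việt Nam
```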
## Reproducing the export
```bash
git clone https://huggingface.co/dragonSwing/vibert-capu
pip install torch transformers onnxruntime numpy
# Export FP32 + dynamic-quantize INT8 in one step:
python convert_onnx/export_vibert_onnx.py \
--model_dir vibert-capu \
--output vibert-capu.onnx \
--opset 14 \
--verify
```
Script: [`convert_onnx/export_vibert_onnx.py`](https://github.com/welcomyou/sherpa-vietnamese-asr/blob/main/convert_onnx/export_vibert_onnx.py).
## Files
```
config.json BERT config (from dragonSwing)
vocab.txt BERT vocabulary (from dragonSwing)
vocabulary/ GECToR action labels
d_tags.txt
labels.txt
non_padded_namespaces.txt
verb-form-vocab.txt Verb form vocabulary
vibert-capu.onnx FP32 ONNX (438 MB)
vibert-capu.int8.onnx INT8 ONNX (110 MB)
configuration_seq2labels.py Seq2Labels HF config class
modeling_seq2labels.py Seq2Labels HF model class (PyTorch reference, not used at runtime)
gec_model.py GECToR inference helpers
utils.py Tokenization helpers
vocabulary.py GECToR Vocabulary class
```
## Credits & License
- **Original model**: [dragonSwing/vibert-capu](https://huggingface.co/dragonSwing/vibert-capu)
- **Base BERT**: [FPTAI/vibert-base-cased](https://huggingface.co/FPTAI/vibert-base-cased)
- **Training data**: [OSCAR-2109](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109) Vietnamese subset (5.6M samples)
License: **CC-BY-SA-4.0** (inherited from dragonSwing/vibert-capu — derivative works must use the same license).
## Used by
- [Sherpa Vietnamese ASR](https://github.com/welcomyou/sherpa-vietnamese-asr) — offline Vietnamese ASR for desktop and web (CPU-only).