🗣️ Vietnamese Pronunciation Classifier: Binary (Correct / Incorrect)

Classifies a Vietnamese pronunciation as correct or incorrect (ngọng, i.e. mispronounced). Data from all three regions (North, Central, South) is pooled.

📊 Training Results

Metric	Value
Best Val Accuracy	86.0%
F1 (correct pronunciation)	0.92
F1 (incorrect pronunciation)	0.45
ONNX Latency	48.8 ms (CPU)

📋 Classes (2)

ID	Label	Description
0	`correct`	✓ Correct (standard) pronunciation
1	`error`	✗ Incorrect pronunciation (ngọng)

Error types detected:

North: L → N
Central: S ↔ X, TR ↔ CH
South: D ↔ R, GI → D, V → D

🏗️ Architecture

Follows the wav2vec2 fine-tuning guide:

Backbone: nguyenvulebinh/wav2vec2-large-vi-vlsp2020
Freeze: CNN layers + first 8 encoder layers
Pooling: Mean pooling
Head: Dropout(0.3) → Linear(1024, 128) → ReLU → Dropout(0.1) → Linear(128, 2)
Audio: 16kHz mono, 500ms duration
Export: ONNX
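
The pooling and head listed above can be sketched in PyTorch as follows. This is a minimal sketch, not the author's training code: the class name `PronunciationHead` is hypothetical, and it assumes the backbone returns hidden states of shape (batch, frames, 1024) that are averaged over the time axis. Loading the backbone and freezing its layers are omitted.

```python
import torch
import torch.nn as nn


class PronunciationHead(nn.Module):  # hypothetical name, for illustration only
    """Mean pooling followed by the two-layer classifier listed above."""

    def __init__(self, hidden_size: int = 1024, num_classes: int = 2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(hidden_size, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, num_classes),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, frames, hidden_size) from the wav2vec2 encoder
        pooled = hidden_states.mean(dim=1)  # mean pooling over the time axis
        return self.classifier(pooled)      # (batch, num_classes) logits


head = PronunciationHead().eval()  # eval() disables dropout
dummy = torch.randn(2, 24, 1024)   # stand-in for encoder output; frame count is arbitrary
with torch.no_grad():
    logits = head(dummy)
print(tuple(logits.shape))
```

Index 0 of the output corresponds to `correct` and index 1 to `error`, matching the class table.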

📈 Training Config

Data: 2,070 samples (1,737 correct + 333 incorrect), all 3 regions pooled
Split: 80/20 stratified
Loss: CrossEntropyLoss
Optimizer: AdamW (LR=1e-4, wd=0.01)
Warmup: 10% linear
Epochs: 8/20 (early stopping, patience=5)
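
Because the two classes are imbalanced, it can help to read the reported accuracy next to a majority-class baseline. The arithmetic below uses only the sample counts listed above; the validation-set sizes are approximations, since the exact stratified split is not given here.

```python
n_correct, n_error = 1737, 333
total = n_correct + n_error

# Accuracy of a trivial model that always predicts `correct`
majority_baseline = n_correct / total

# Approximate validation-set composition under an 80/20 stratified split
val_fraction = 0.2
approx_val_correct = n_correct * val_fraction
approx_val_error = n_error * val_fraction

print(f"total samples:     {total}")
print(f"error share:       {n_error / total:.1%}")
print(f"majority baseline: {majority_baseline:.1%}")
print(f"approx. val split: {approx_val_correct:.0f} correct / {approx_val_error:.0f} error")
```

With these counts the always-`correct` baseline is about 83.9%, so the 86.0% validation accuracy is best interpreted together with the per-class F1 scores rather than on its own.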

🚀 Usage (ONNX Runtime)

import onnxruntime as ort
import librosa
import numpy as np

# Load model
sess = ort.InferenceSession("phoneme_classifier_all.onnx")

# Load audio (16kHz, 500ms)
audio, _ = librosa.load("your_audio.wav", sr=16000, mono=True)
audio = audio[:8000]  # 500ms at 16kHz
if len(audio) < 8000:
    audio = np.pad(audio, (0, 8000 - len(audio)))

# Normalize
std = audio.std()
if std > 0:
    audio = (audio - audio.mean()) / std

# Inference
logits = sess.run(None, {"input_values": audio.reshape(1, -1).astype(np.float32)})[0][0]
exp = np.exp(logits - logits.max())  # subtract the max for a numerically stable softmax
probs = exp / exp.sum()
pred = int(np.argmax(probs))
is_correct = pred == 0  # class 0 = correct, class 1 = error

print(f"Correct: {is_correct}")
print(f"Confidence: {probs[pred]*100:.1f}%")
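
The preprocessing steps above can be wrapped in a small helper so the same logic is reused for every clip. This is a sketch: `prepare_clip` is a hypothetical name, and it mirrors the order used above (pad first, then normalize, so any zero padding is included in the mean and standard deviation). A synthetic tone stands in for a real recording.

```python
import numpy as np

TARGET_LEN = 8000  # 500 ms at 16 kHz


def prepare_clip(audio: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Truncate or zero-pad to target_len, normalize, and add a batch axis."""
    audio = np.asarray(audio, dtype=np.float32)[:target_len]
    if len(audio) < target_len:
        audio = np.pad(audio, (0, target_len - len(audio)))
    std = audio.std()
    if std > 0:  # leave silent clips untouched to avoid dividing by zero
        audio = (audio - audio.mean()) / std
    return audio.reshape(1, -1).astype(np.float32)


# Synthetic 300 ms tone standing in for a real 16 kHz recording
t = np.arange(4800) / 16000
batch = prepare_clip(0.1 * np.sin(2 * np.pi * 220 * t))
print(batch.shape, batch.dtype)
```

The returned array can be passed to `sess.run` in place of the inline preprocessing. If you prefer not to hard-code `"input_values"`, `sess.get_inputs()[0].name` returns the model's input name.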

📁 Files

phoneme_classifier_all.onnx — ONNX model graph
phoneme_classifier_all.onnx.data — ONNX model weights
best_model.pt — PyTorch checkpoint
config.json — Label map + metadata
preprocessor_config.json — Wav2Vec2 feature extractor config

Bao2311/speak-journey-binary-onnx