🏆 BrightoSV Speaker Verification V1.2-LMF
COMMERCIAL SOTA • LARGE MARGIN FINE-TUNED
Bank-Grade Voice Identity Verification
BrightoSV Speaker Verification V1.2-LMF is an upgraded release of V1.2, fine-tuned with Large Margin Fine-tuning (LMF) to improve performance at bank-grade operating points. All 8 evaluation configurations surpass the V1.2-E7 baseline, making this the new production SOTA for offline voiceprint verification.
📈 V1.2-LMF vs V1.2 — What Improved
| Config | Metric | V1.2 (E7) | V1.2-LMF | Δ |
|---|---|---|---|---|
| 🏦 4s / 5-Enroll | EER | 1.184% | 1.174% | −0.010% ✅ |
| 🏦 4s / 5-Enroll | FRR @ FAR 0.1% | 3.18% | 3.11% | −0.07% ✅ |
| 🏦 4s / 5-Enroll | FRR @ FAR 0.01% | 6.65% | 6.44% | −0.21% ✅ |
| 🏦 4s / 5-Enroll | Tail Gap 5% | +5.3463 | +5.3870 | +0.041 ✅ |
| 🏦 4s / 3-Enroll | FRR @ FAR 0.01% | 8.60% | 8.51% | −0.09% ✅ |
| 📱 2s / 5-Enroll | FRR @ FAR 0.01% | 12.58% | 12.13% | −0.45% ✅ |
| 📱 2s / 3-Enroll | FRR @ FAR 0.01% | 15.91% | 15.44% | −0.47% ✅ |
All 8 configurations improved. At production scale, each 0.1% reduction in FRR means thousands fewer genuine users rejected per day.
🏆 Key Performance Indicators
🏦 Bank-Grade Benchmarks (4-Second QA Gate, 4s/2s Windows)
Evaluated on 10,000,000 positive + 10,000,000 negative pairs with strict bank-grade QA (Audio ≥ 4s, SNR ≥ 10dB, Speech Ratio ≥ 15%):
| Metric | 3-Enrollment | 5-Enrollment | Significance |
|---|---|---|---|
| EER | 1.454% | 1.174% 👑 | 🏆 Commercial SOTA. Sub-1.2% on a diverse multilingual test set. |
| FRR @ FAR=0.1% | 4.26% | 3.11% 👑 | ✅ Bank-Grade Achieved. Under 5% at 1-in-1,000 impostor protection. |
| FRR @ FAR=0.01% | 8.51% | 6.44% 👑 | 🔒 Maximum Security. 1-in-10,000 impostor protection for high-value transactions. |
| Threshold @ FAR=0.1% | 2.5374 | 2.6330 | 🎯 Production threshold at the balanced operating point. |
| Threshold @ FAR=0.01% | 4.5337 | 4.6411 | 🎯 Production threshold at maximum security. |
| Tail Gap 5% | +4.4937 | +5.3870 👑 | 🛡️ Full Separation. The genuine score tail sits fully above the worst impostor region. |
| Latency ⚡ | — | < 60ms | Real-time processing on a consumer GPU. |
📱 Extended Coverage Benchmarks (2-Second QA Gate, 2s/1s Windows)
For scenarios requiring shorter audio — mobile apps, call centers, IoT:
| Metric | 3-Enrollment | 5-Enrollment | Significance |
|---|---|---|---|
| EER | 2.294% | 1.687% 👑 | 🏆 Best-in-Class. Strong accuracy even with 2-second audio. |
| FRR @ FAR=0.1% | 7.81% | 5.90% | ✅ Practical Consumer UX. Manageable retry rate for mobile. |
| FRR @ FAR=0.01% | 15.44% | 12.13% | 🔒 High Security on Short Audio. Viable for multi-factor deployments. |
| Threshold @ FAR=0.1% | 2.1329 | 2.2478 | 🎯 Production threshold at the balanced operating point. |
| Threshold @ FAR=0.01% | 4.1170 | 4.1908 | 🎯 Production threshold at maximum security. |
| Tail Gap 5% | +2.4170 | +3.2310 | 🛡️ Positive Separation. Genuine and impostor tails clearly separated. |
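The idea behind the tail-separation numbers can be sketched in a few lines. The exact internal definition of "Tail Gap 5%" is not published in this card; the sketch below assumes one plausible reading (5th-percentile genuine score minus 95th-percentile impostor score) and runs on synthetic scores, not the real trial scores:

```python
import numpy as np

def tail_gap(genuine, impostor, tail=0.05):
    """Illustrative tail-separation metric. Assumed definition (not
    confirmed by the card): 5th-percentile genuine score minus
    95th-percentile impostor score. A positive value means the two
    score tails do not overlap at the 5% level."""
    lo_genuine = np.percentile(genuine, 100 * tail)
    hi_impostor = np.percentile(impostor, 100 * (1 - tail))
    return lo_genuine - hi_impostor

# Synthetic, well-separated score distributions as a stand-in
rng = np.random.default_rng(0)
gap = tail_gap(rng.normal(6.0, 1.0, 100_000), rng.normal(0.0, 1.0, 100_000))
```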
📊 Evaluation Methodology — Statistical Rigor
Unlike many speaker verification benchmarks that report on small test sets, BrightoSV V1.2-LMF is evaluated at industrial scale:
| Aspect | Detail |
|---|---|
| Positive pairs | 10,000,000 (same-speaker, cross-utterance) |
| Negative pairs | 10,000,000 (different-speaker, balanced 1:1) |
| Total scored pairs | 20,000,000 |
| Unique speakers | 3,900+ in evaluation set |
| Multi-enrollment | 3 and 5 enrollment utterances, mean-aggregated |
| Score normalization | AS-NORM with 600,000+ speaker cohort |
| Quality fusion | QMF (Quality Metric Fusion) — compensates for speaker-specific and duration offsets |
| QA gates | 5-gate bank-grade: Duration, Clipping, RMS Energy, Speech Ratio, SNR |
This scale ensures that operating points at FAR=0.01% (1 in 10,000) are backed by actual counts of 1,000 impostor threshold crossings, not statistical extrapolation.
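The operating points above can be reproduced mechanically from raw trial scores: sort the impostor scores, pick the threshold that admits exactly the target fraction, and count genuine trials below it. A minimal sketch on synthetic Gaussian scores (the production evaluation uses 20M real pairs):

```python
import numpy as np

def frr_at_far(genuine, impostor, target_far):
    """Pick the threshold at which exactly target_far of impostor trials
    are accepted, then measure how many genuine trials fall below it.
    With 10M negatives, FAR = 0.01% rests on 1,000 real crossings."""
    impostor = np.sort(impostor)
    k = max(int(round(len(impostor) * target_far)), 1)
    threshold = impostor[-k]            # k-th highest impostor score
    frr = float(np.mean(genuine < threshold))
    return threshold, frr

# Synthetic stand-in scores, not the real evaluation data
rng = np.random.default_rng(42)
genuine = rng.normal(6.0, 1.5, 1_000_000)
impostor = rng.normal(0.0, 1.0, 1_000_000)
thr, frr = frr_at_far(genuine, impostor, 0.0001)   # FAR = 0.01%
```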
🔬 What is Large Margin Fine-Tuning (LMF)?
LMF is a post-training optimization stage applied after the base model converges. Starting from the V1.2-E7 checkpoint:
| Aspect | Base Training (V1.2) | LMF (V1.2-LMF) |
|---|---|---|
| Angular margin | 0.35 | 0.40 (+14.3%) |
| Data selection | Mixed duration (≥2s) | ≥4s only (high-information) |
| Window strategy | 3 windows | 2 windows (first and last; cleaner segments) |
| Learning rate | Full training LR | Fine-tuning LR (backbone LR 1000× lower) |
| Optimizer | Initialized fresh | State preserved (7 epochs of gradient history) |
| Scheduler | Continuous | Fresh cosine cycle |
The increased margin forces the model to push genuine and impostor embeddings further apart on the hypersphere. Combined with high-quality 4s-only data and preserved optimizer state, this delivers targeted improvements at the most critical operating points.
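The margin mechanism can be illustrated with an additive-angular-margin (AAM-softmax family) logit computation. This is a generic sketch, not the model's actual training code, and the scale value is an assumption:

```python
import numpy as np

def aam_logits(emb, weights, labels, margin, scale=30.0):
    """Additive-angular-margin logits sketch. For the true speaker the
    cosine cos(theta) is replaced by cos(theta + margin), shrinking the
    target logit so training must pull same-speaker embeddings into a
    tighter cone on the hypersphere. scale=30 is illustrative only."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    theta = np.arccos(np.clip(e @ w.T, -1.0, 1.0))
    theta[np.arange(len(labels)), labels] += margin  # penalize target class only
    return scale * np.cos(theta)

emb = np.array([[1.0, 0.2, 0.0, 0.0]])
weights = np.eye(4)                       # toy speaker prototypes
labels = np.array([0])
l_base = aam_logits(emb, weights, labels, margin=0.35)  # V1.2 margin
l_lmf = aam_logits(emb, weights, labels, margin=0.40)   # V1.2-LMF margin
```

The larger margin lowers the target-class logit while leaving non-target logits untouched, which is exactly the "push embeddings further apart" pressure described above.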
🌍 Multilingual Training — Global Voice Coverage
BrightoSV V1.2-LMF inherits the full multilingual foundation of V1.2:
| Language | Coverage | Notes |
|---|---|---|
| 🇻🇳 Vietnamese | ★★★★★ | Primary language. Extensive dialect coverage (Northern, Central, Southern) |
| 🇬🇧 English | ★★★★★ | Primary language. Multiple accents (US, UK, AU, Indian, Singapore) |
| 🇨🇳 Chinese | ★★★★★ | Primary language. Mandarin and regional variants |
| 🇰🇷 Korean | ★★★★☆ | Native speaker corpus |
| 🇩🇪 German | ★★★★☆ | European language coverage |
| 🇫🇷 French | ★★★★☆ | Including African French variants |
| 🇳🇱 Dutch | ★★★★☆ | European language coverage |
| 🇯🇵 Japanese | ★★★★☆ | Native speaker corpus |
| 🇸🇦 Arabic | ★★★★☆ | Multiple dialect coverage |
| 🇮🇩 Indonesian | ★★★☆☆ | Southeast Asian coverage |
Key principle: Speaker identity is carried by vocal tract shape, pitch dynamics, and articulatory patterns — these are language-independent. A speaker can enroll in Vietnamese and verify in English.
🛡️ Robustness — Augmentation & Real-World Resilience
🔊 Noise Resilience
| Category | Examples | Goal |
|---|---|---|
| 🏙️ Urban | Street noise, sirens, traffic, construction | On-the-go verification |
| 🏠 Domestic | TV/radio background, appliances, children | Work-from-home reliability |
| 🗣️ Babble | Crowd noise, overlapping speakers, cafeteria | Robustness in the hardest noise scenario |
| ⛈️ Natural | Wind, rain, thunder | Outdoor stability |
| 🐾 Biological | Coughing, sneezing, baby crying | Disentangle speaker from artifacts |
📡 Channel & Codec Resilience
| Codec / Channel | Simulation |
|---|---|
| GSM / AMR | Mobile telephony compression |
| VoIP (Zalo, WhatsApp) | Internet calling artifacts |
| MP3 / AAC / OGG | Lossy compression at various bitrates |
| Microphone variance | Laptop, phone, headset, far-field |
| Room acoustics | Reverb, echo, room impulse response |
🎛️ SpecAugment
Time masking applied on WavLM hidden states during training to prevent overfitting, forcing the model to learn robust speaker representations from partial temporal information.
Result: The model maintains accuracy down to 10 dB SNR — equivalent to speaking in a moderately noisy café. Below 10 dB, the QA gate rejects the audio, protecting against unreliable decisions.
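Time masking on hidden states can be sketched as follows. The mask count and maximum width here are illustrative placeholders, not the training configuration:

```python
import numpy as np

def time_mask(hidden, num_masks=2, max_width=20, rng=None):
    """SpecAugment-style time masking on a (frames, dim) hidden-state
    matrix: zero out a few random time spans so the model cannot rely
    on any single temporal region. num_masks/max_width are illustrative."""
    rng = rng or np.random.default_rng()
    out = hidden.copy()
    n_frames = out.shape[0]
    for _ in range(num_masks):
        width = int(rng.integers(1, max_width + 1))
        start = int(rng.integers(0, n_frames - width + 1))
        out[start:start + width, :] = 0.0  # silence this time span
    return out

# Toy stand-in for a WavLM hidden-state sequence
masked = time_mask(np.ones((100, 768)), rng=np.random.default_rng(0))
```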
🎯 Production Deployment
Three Security Levels
| Level | Min Audio | Enrollment | FAR Options | Use Case |
|---|---|---|---|---|
| 🏦 bank_strict | ≥ 4.0s | 5 samples | 0.1% / 0.01% | High-value banking, government |
| 🏛️ bank_flex | ≥ 4.0s | 3 samples | 0.1% / 0.01% | Standard banking, telecom |
| 📱 consumer | ≥ 2.0s | 3 samples | 0.1% / 0.01% | Mobile apps, call centers, IoT |
Scoring Pipeline
Audio → QA Gate (5 checks) → Windowed Embedding Extraction → AS-NORM (600K cohort) → QMF → Decision
| Component | Detail |
|---|---|
| Embedding | 512-dimensional voiceprint |
| Score normalization | AS-NORM with top-300 cohort matching |
| Quality fusion | Cohort Mean Fusion + Duration compensation |
| Multi-enrollment | Mean-aggregated across sessions |
| Window strategy | All valid windows from audio, mean-aggregated (matches eval pipeline) |
| Storage | ~2 KB per enrolled speaker |
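The embedding-to-decision stages can be sketched end to end. This is a simplified stand-in (cosine scoring, a tiny random cohort instead of the 600K-speaker cohort, and no QMF stage), intended only to show how mean-aggregated enrollment and AS-NORM fit together:

```python
import numpy as np

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def as_norm_score(probe, enroll_sessions, cohort, top_k=300):
    """Pipeline sketch: mean-aggregate enrollment embeddings, take the
    cosine score, then adaptively z-normalize each side against its
    top-k closest cohort speakers (AS-NORM). The cohort here is a tiny
    random stand-in for the production 600K-speaker cohort."""
    enroll = l2norm(np.mean(enroll_sessions, axis=0))
    raw = float(probe @ enroll)              # cosine: inputs are unit-norm
    def z(emb):
        top = np.sort(cohort @ emb)[-top_k:]  # closest cohort scores
        return (raw - top.mean()) / (top.std() + 1e-8)
    return 0.5 * (z(enroll) + z(probe))

rng = np.random.default_rng(0)
cohort = l2norm(rng.standard_normal((5_000, 512)))     # stand-in cohort
speaker = l2norm(rng.standard_normal(512))
sessions = l2norm(speaker + 0.02 * rng.standard_normal((5, 512)))
probe = l2norm(speaker + 0.02 * rng.standard_normal(512))
score = as_norm_score(probe, sessions, cohort)
```

Normalizing against the most similar cohort speakers (rather than the whole cohort) is what makes the normalization "adaptive": each trial is calibrated against its own hardest impostors.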
QA Gate — Mandatory Pre-Check
| Check | Threshold | Purpose |
|---|---|---|
| Duration | ≥ 2s (consumer) or ≥ 4s (bank-grade) | Sufficient speech content |
| Clipping | < 0.1% | No distorted audio |
| RMS Energy | −45 to −5 dBFS | Proper recording level |
| Speech Ratio | ≥ 15% | Actual speech, not silence |
| SNR | ≥ 10 dB | Acceptable noise level |
Without the QA gate, tail performance degrades significantly; the QA gate is mandatory in production.
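A minimal five-check gate using the thresholds from the table can be sketched as below. The SNR and speech-ratio estimators here are crude energy-based stand-ins for whatever the production pipeline actually uses:

```python
import numpy as np

def qa_gate(wav, sr=16000, min_dur=4.0, min_snr_db=10.0,
            min_speech_ratio=0.15, max_clip_ratio=0.001,
            rms_range_dbfs=(-45.0, -5.0)):
    """Bank-grade QA gate sketch over a float waveform in [-1, 1].
    Returns (passed, per-check results). Estimators are illustrative."""
    checks = {}
    checks["duration"] = len(wav) / sr >= min_dur
    checks["clipping"] = np.mean(np.abs(wav) >= 0.999) < max_clip_ratio
    rms_dbfs = 20 * np.log10(np.sqrt(np.mean(wav ** 2) + 1e-12))
    checks["rms_energy"] = rms_range_dbfs[0] <= rms_dbfs <= rms_range_dbfs[1]
    # Crude energy VAD: 25 ms frames above a fraction of peak energy count as speech.
    frames = wav[: len(wav) // 400 * 400].reshape(-1, 400)
    energy = (frames ** 2).mean(axis=1)
    speech = energy > 0.1 * energy.max()
    checks["speech_ratio"] = speech.mean() >= min_speech_ratio
    noise = energy[~speech].mean() if (~speech).any() else 1e-12
    snr_db = 10 * np.log10(energy[speech].mean() / (noise + 1e-12))
    checks["snr"] = snr_db >= min_snr_db
    return all(checks.values()), checks

# Synthetic 5 s clip: 2.5 s of tone ("speech") followed by near-silence
rng = np.random.default_rng(0)
tone = 0.1 * np.sin(2 * np.pi * 220 * np.arange(40_000) / 16_000)
silence = 0.001 * rng.standard_normal(40_000)
ok, checks = qa_gate(np.concatenate([tone, silence]))
```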
⚙️ Technical Specifications
| Specification | Value |
|---|---|
| Model Version | V1.2-LMF (Large Margin Fine-tuned SOTA) |
| Base Checkpoint | V1.2-E7 |
| LMF Method | Angular margin 0.35 → 0.40, 4s-only data, finetune LR |
| Parameters | 316M (High-Capacity Self-Supervised Backbone) |
| Embedding Dimension | 512 |
| Input Sample Rate | 16kHz (Auto-resampling supported) |
| Input Formats | WAV, FLAC, MP3, OGG, M4A |
| Output | 512D L2-normalized embedding |
| Backends | PyTorch, ONNX, HuggingFace |
🚀 Hardware & Performance
| Specification | Value |
|---|---|
| GPU Support | NVIDIA T4, A10, A100, H100, L4 |
| CPU Support | Intel Xeon, AMD EPYC (via ONNX) |
| Inference Latency | < 60ms (GPU) / < 400ms (CPU ONNX) |
| Model Size | ~1.2 GB |
| Batch Processing | Supported |
| Deployment | Fully offline after initial download |
🔒 Privacy & Security
| Aspect | Implementation |
|---|---|
| Audio Retention | Zero. Audio is processed in RAM and discarded immediately. |
| Voiceprint | A 512-dimensional vector. Non-reversible: the original voice cannot be reconstructed from it. |
| Deployment | On-premise or private cloud. No external calls. |
| Compliance | GDPR, PDPA, PCI-DSS ready |
| Data Sovereignty | 100% local processing. Your data never leaves your infrastructure. |
🤝 Combined with Anti-Spoofing
For maximum security, deploy BrightoSV Speaker Verification alongside BrightoSV Anti-Spoofing V1.5:
Audio → Anti-Spoof Check (Is this a real voice?) → Speaker Verify (Is this the right person?) → Decision
| Layer | Model | Purpose |
|---|---|---|
| Layer 1 | Anti-Spoof V1.5 | Reject deepfakes, replay attacks, TTS |
| Layer 2 | Speaker Verify V1.2-LMF | Confirm speaker identity |
This dual-layer architecture provides defense-in-depth: a sophisticated deepfake that slips past liveness detection must still match the enrolled voiceprint, and a voice that matches the voiceprint must still pass the liveness check.
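The AND-gate logic of the two layers can be sketched with placeholder model objects. The class names, method signatures, and the spoof threshold below are illustrative stand-ins, not the actual BrightoSV API; the verification threshold reuses the published bank-strict value at FAR=0.1%:

```python
class StubAntiSpoof:
    """Stand-in for Anti-Spoofing V1.5; the real API is not shown in the card."""
    def spoof_score(self, audio):
        return 0.1  # low score = likely bona fide

class StubVerifier:
    """Stand-in for Speaker Verification V1.2-LMF."""
    def score(self, audio, voiceprint):
        return 3.0  # AS-NORM score against the enrolled voiceprint

def dual_layer_verify(audio, antispoof, verifier, voiceprint,
                      spoof_thr=0.5, sv_thr=2.6330):
    """Defense-in-depth AND-gate: a trial is accepted only if it both
    looks live AND matches the enrolled voiceprint."""
    if antispoof.spoof_score(audio) >= spoof_thr:
        return False  # suspected deepfake / replay / TTS
    return verifier.score(audio, voiceprint) >= sv_thr

accepted = dual_layer_verify(b"", StubAntiSpoof(), StubVerifier(), None)
```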
📈 Roadmap
| Version | Status | Highlight |
|---|---|---|
| V1.2 | ✅ Released | Commercial SOTA baseline — EER 1.184% |
| V1.2-LMF | 🟢 Current | New SOTA — EER 1.174%, FRR@0.01% 6.44% |
| V1.5-SE | 🔵 In progress | Full retrain — clean data, new schedule, targeting FRR@0.01% ≤ 5% |
| V2.0 | 🟡 Planned | Next-generation architecture |
📞 Access & Licensing
This model is Private and available exclusively to enterprise partners under NDA.
Commercial & Deployment
- Turnkey licensing or access via API
- Integration support on request (deployment, performance tuning, quality monitoring)
- SphinX Joint Stock Company (sphinxjsc.com) is authorized to package the model, provide its API, and handle distribution
Copyright & License
Commercial / Proprietary. Use, redistribution, or creation of derivative works requires written approval from BrighTO Technology.
Contact
| Purpose | Contact |
|---|---|
| Commercial Licensing | nguyen@brighto.ai, nghia@brighto.ai |
| API & Distribution | duc@sphinxjsc.com (SphinX JSC) |
| Technical Inquiries | nguyen@hatto.com |
🏆 BrightoSV Speaker Verification V1.2-LMF
New SOTA • Large Margin Fine-Tuned • Bank-Grade • Offline-Ready
EER 1.174% · FRR 6.44% @ FAR 0.01% · 20M Eval Pairs · 600K Cohort · 9+ Languages
Built in Vietnam 🇻🇳 • Engineered for the World 🌏
This model card refers to BrightoSV Speaker Verification V1.2-LMF (Large Margin Fine-tuned Release). This is a direct upgrade from V1.2-E7, with all 8 evaluation configurations improved. All benchmark results are verified on internal evaluation sets comprising 20,000,000 scored pairs across 3,900+ speakers with strict bank-grade QA methodology and AS-NORM score normalization using a 600,000+ speaker cohort.