🎙️ BrighTO-SSAP V1.5-S-SE - World's First Audio-Native Social Intelligence System

Mô hình AI dự đoán Cảm xúc & Lý lịch Xã hội từ Âm thanh Đầu tiên trên Thế giới

🎯 Gender	🌍 Language	😊 Emotion	💼 Social Class	🎓 Education
92.14%	94.30%	70.95%	87.82%	79.34%

🌟 What is BrighTO SSAP? | BrighTO SSAP là gì?

BrighTO-SSAP extracts rich voice intelligence directly from raw audio without requiring transcription:

BrighTO-SSAP trích xuất thông tin giọng nói phong phú trực tiếp từ âm thanh thô mà không cần phiên âm:

Social Class | Tầng lớp xã hội (working, middle, upper)
Education Level | Trình độ học vấn (elementary → postgraduate)
Regional Accent | Giọng vùng miền (20+ regions including vn_north, vn_south, vn_central)
Emotion & Attitude | Cảm xúc & Thái độ (10+ emotions, 15+ attitudes)
Voice Quality | Chất lượng giọng (pitch, energy, tension, noise)
Risk Assessment | Đánh giá rủi ro (customer_risk, teller_score)
Explainable AI (required by law in some jurisdictions) | Dự báo của AI có thể giải thích được (luật ở nhiều nơi sẽ yêu cầu chức năng này)

Why BrighTO SSAP? | Tại sao chọn BrighTO SSAP?

Traditional Pipeline	BrighTO SSAP
ASR → Text → LLM → Analysis	Audio → Direct Analysis
~5-10 seconds latency	~70ms (Quick)
ASR errors propagate	No ASR errors to propagate
Requires transcription	Audio-native

📦 Installation | Cài đặt

# Core dependencies | Phụ thuộc cốt lõi
pip install torch torchaudio transformers accelerate peft safetensors

# Recommended | Khuyến nghị
pip install flash-attn --no-build-isolation  # 2x faster attention
pip install json-repair                       # Robust JSON parsing

# For vLLM acceleration (optional) | Tăng tốc vLLM (tùy chọn)
pip install "vllm>=0.4.0"  # quoted so the shell does not treat >= as a redirect

One-liner | Một dòng lệnh

pip install torch torchaudio transformers accelerate peft safetensors json-repair

🚀 Quick Start | Bắt đầu Nhanh

1. Load from HuggingFace | Tải từ HuggingFace

from hattovoice_v3_prod import HattoVoice

# Load model from HuggingFace Hub
# Tải model từ HuggingFace Hub
model = HattoVoice.load("thusinh1969/BrighTO-Semantic-Social-Audio-Profiler-V1.5.SE")

2. Save for Offline Use | Lưu để Dùng Offline

# Download once, use forever without internet
# Tải một lần, dùng mãi mãi không cần internet
model.save("./ssap_local")

# Later: Load completely offline (on-premise)
# Sau này: Tải hoàn toàn offline (tại chỗ)
model = HattoVoice.load("./ssap_local")

3. Analyze Audio | Phân tích Âm thanh

# Single file analysis | Phân tích file đơn
result = model.analyze("customer_call.wav")

# Access results | Truy cập kết quả
print(f"Gender: {result.gender}")           # "female"
print(f"Language: {result.language}")       # "vi"
print(f"Region: {result.region}")           # "vn_north"
print(f"Emotion: {result.emotion}")         # "neutral"
print(f"Social Class: {result.social_class}") # "middle"
print(f"Education: {result.education}")     # "tertiary"

# Full JSON with reasoning | JSON đầy đủ với lý luận
print(result.json)

⚡ Inference Modes | Các Chế độ Suy luận

Mode 1: Quick Classification (~70ms) | Phân loại Nhanh

Best for: Real-time screening, call routing, live alerts

Tốt nhất cho: Sàng lọc thời gian thực, định tuyến cuộc gọi, cảnh báo trực tiếp

# Ultra-fast: Auxiliary heads only, NO LLM
# Siêu nhanh: Chỉ Auxiliary heads, KHÔNG LLM
probs = model.classify("audio.wav")

# Returns probability distributions
# Trả về phân phối xác suất
print(probs)
# {
#   'gender': {'female': 0.96, 'male': 0.04},
#   'emotion': {'neutral': 0.85, 'calm': 0.10, 'happy': 0.05},
#   'language': {'vi': 0.94, 'en': 0.05, 'zh': 0.01},
#   'region': {'vn_north': 0.88, 'vn_south': 0.10, 'vn_central': 0.02},
#   'social_class': {'middle': 0.90, 'upper': 0.07, 'working': 0.03},
#   'education': {'tertiary': 0.82, 'secondary': 0.15, 'postgraduate': 0.03}
# }

# Top prediction only | Chỉ dự đoán cao nhất
probs = model.classify("audio.wav", top_k=1)
# {'gender': {'female': 0.96}, 'emotion': {'neutral': 0.85}, ...}
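The nested dictionary returned by `classify()` can be reduced to one label per field. The following is a minimal sketch, assuming the output has exactly the shape shown above. `top_labels` and its `min_confidence` threshold are hypothetical helpers for illustration and are not part of the HattoVoice API.

```python
from typing import Dict, Optional, Tuple

Probs = Dict[str, Dict[str, float]]


def top_labels(
    probs: Probs, min_confidence: float = 0.5
) -> Dict[str, Optional[Tuple[str, float]]]:
    """Return the most likely (label, probability) pair per field.

    Fields whose best probability is below min_confidence (or that are
    empty) map to None, so callers can treat them as undecided.
    """
    result: Dict[str, Optional[Tuple[str, float]]] = {}
    for field, dist in probs.items():
        if not dist:
            result[field] = None
            continue
        label, p = max(dist.items(), key=lambda item: item[1])
        result[field] = (label, p) if p >= min_confidence else None
    return result


# Sample values copied from the classify() output shown above.
sample = {
    "gender": {"female": 0.96, "male": 0.04},
    "emotion": {"neutral": 0.85, "calm": 0.10, "happy": 0.05},
}
print(top_labels(sample))
# {'gender': ('female', 0.96), 'emotion': ('neutral', 0.85)}
```

Returning `None` for low-confidence fields keeps uncertain predictions from being treated as facts further down the pipeline.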

Mode 2: Full Analysis with HuggingFace (~15s) | Phân tích Đầy đủ với HF

Best for: Detailed profiling, explainable AI, compliance

Tốt nhất cho: Lập hồ sơ chi tiết, AI giải thích được, tuân thủ

# Full semantic analysis with reasoning
# Phân tích ngữ nghĩa đầy đủ với lý luận
result = model.analyze("audio.wav")

# Structured output with explanations
# Đầu ra có cấu trúc với giải thích
print(result.json["demographics"]["region_reason"])
# "Người nói sử dụng phụ âm đầu 'r' phát âm thành 'z' và thanh điệu 
#  sắc nét đặc trưng của giọng Hà Nội, xác nhận nguồn gốc miền Bắc."

print(result.json["demographics"]["education_reason"])
# "Cấu trúc câu hoàn chỉnh, từ vựng chính xác, phát âm rõ ràng, 
#  không có lỗi ngữ pháp - cho thấy trình độ đại học."

print(result.json["notes"])
# "Người nói thể hiện trạng thái bình tĩnh, không có dấu hiệu 
#  căng thẳng hay lo âu. Giọng nói tự nhiên, nhất quán..."

Mode 3: vLLM Accelerated (~6s) | Tăng tốc vLLM

Best for: Production deployment, high throughput

Tốt nhất cho: Triển khai sản xuất, thông lượng cao

# Load with vLLM backend
# Tải với backend vLLM
model = HattoVoice.load(
    "thusinh1969/BrighTO-Semantic-Social-Audio-Profiler-V1.5.SE",
    use_vllm=True,
    vllm_gpu_memory=0.5,  # 50% GPU for vLLM | 50% GPU cho vLLM
)

# Same API, much faster | Cùng API, nhanh hơn nhiều
result = model.analyze("audio.wav")

Mode 4: Batch Processing | Xử lý Hàng loạt

Best for: Offline analytics, historical data processing

Tốt nhất cho: Phân tích offline, xử lý dữ liệu lịch sử

import glob

# Get all audio files | Lấy tất cả file âm thanh
audio_files = glob.glob("/calls/2024-01/*.wav")

# Batch full analysis | Phân tích đầy đủ hàng loạt
results = model.analyze_batch(audio_files)

for file, result in zip(audio_files, results):
    print(f"{file}: {result.emotion}, {result.social_class}")

# Batch quick classify (much faster) | Phân loại nhanh hàng loạt (nhanh hơn nhiều)
probs_list = model.classify_batch(audio_files)  # A batch of 32 fits on an A6000 (48 GB)
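The card does not say whether `classify_batch()` splits long file lists internally. The sketch below assumes it does not and that the caller feeds it fixed-size chunks. `chunked` is a hypothetical helper and the file names are made up for the demo.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Hypothetical file names; in practice this would be the glob result.
audio_files = [f"call_{i:03d}.wav" for i in range(70)]
batches = list(chunked(audio_files, 32))
print([len(b) for b in batches])  # [32, 32, 6]

# With a loaded model, each chunk would then be passed on, for example:
# for batch in chunked(audio_files, 32):
#     probs_list = model.classify_batch(batch)
```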

Mode 5: Streaming Output | Đầu ra Streaming

Best for: Interactive UIs, real-time display

Tốt nhất cho: UI tương tác, hiển thị thời gian thực

# Stream tokens as generated | Stream token khi được sinh ra
for token in model.stream("audio.wav"):
    print(token, end="", flush=True)
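A streaming UI usually also needs the structured result once generation ends. The sketch below assumes that the streamed chunks are strings which concatenate into the same JSON document that `result.json` exposes. The card does not confirm this, so `fake_stream` stands in for `model.stream(...)` and `collect_stream` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def collect_stream(tokens: Iterable[str]) -> dict:
    """Join streamed text chunks and parse the complete JSON document."""
    text = "".join(tokens)
    return json.loads(text)


def fake_stream() -> Iterator[str]:
    """Stand-in for model.stream(...): yields a JSON document in pieces."""
    yield '{"speaker": {"gender": "fem'
    yield 'ale", "age": "young"}, '
    yield '"qc": {"customer_risk": 0.0}}'


profile = collect_stream(fake_stream())
print(profile["speaker"]["gender"])  # female
```

If the generated text can be truncated or slightly malformed, the `json-repair` package listed under Installation is one option to try before falling back to an error.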

Mode 6: Analyze Audio using online vLLM | Phân tích Âm thanh với vLLM chạy độc lập online

Best for: Keeping the GPU reserved for quick inference most of the time, with occasional calls to a standalone vLLM server for explainable SSAP output

Tốt nhất cho: Thường để dành GPU cho tác vụ dự báo nhanh, thi thoảng dùng SSAP để giải thích tại sao dự báo vậy

🔒 On-Premise Deployment | Triển khai Tại chỗ

Benefits | Lợi ích

Benefit	Description
🔐 Data Privacy	Audio never leaves your servers
Bảo mật Dữ liệu	Âm thanh không bao giờ rời máy chủ của bạn
🌐 Air-Gapped	Works without any internet connection
Cách ly Mạng	Hoạt động không cần kết nối internet
⚡ Low Latency	No network round-trip overhead
Độ trễ Thấp	Không có chi phí round-trip mạng
💰 No API Fees	Unlimited inference after deployment
Không phí API	Suy luận không giới hạn sau triển khai

Setup | Thiết lập

# === STEP 1: Download (requires internet once) ===
# === BƯỚC 1: Tải xuống (cần internet một lần) ===
from hattovoice_v3_prod import HattoVoice

model = HattoVoice.load("thusinh1969/BrighTO-Semantic-Social-Audio-Profiler-V1.5.SE")
model.save("/opt/models/ssap")
print("✅ Model saved for offline use")

# === STEP 2: Deploy (no internet needed) ===
# === BƯỚC 2: Triển khai (không cần internet) ===
model = HattoVoice.load("/opt/models/ssap")

# Works completely offline! | Hoạt động hoàn toàn offline!
result = model.analyze("call.wav")

Docker | Docker

FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime

RUN pip install transformers accelerate peft safetensors torchaudio json-repair

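# Copy pre-downloaded model | Sao chép model đã tải trước
COPY ./ssap_local /opt/models/ssap
# api_server.py is your own serving script (not included in this card); CMD below needs it in /app
COPY hattovoice_v3_prod.py api_server.py /app/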

WORKDIR /app
EXPOSE 8000
CMD ["python", "api_server.py"]

📊 Performance | Hiệu suất

Speed Comparison | So sánh Tốc độ

Mode	Latency	Throughput	GPU VRAM
⚡ Quick Classify	~70ms	20/sec	~4GB
🧠 HF Full	~15s	-	~12GB
🚀 vLLM Full	~6s	-	~16GB
📦 vLLM Batch	~1s/sample	128+/min	~16GB

Accuracy by Field | Độ chính xác theo Trường

🎯 Gender | Giới tính (92.14% F1)

Class	Accuracy
Female Nữ	96.0%
Male Nam	92.9%

🌍 Language | Ngôn ngữ (94.30% F1)

Language	Accuracy	Language	Accuracy
🇺🇸🇬🇧 English	98.5%	🇯🇵 Japanese	100%
🇻🇳 Vietnamese	94.5%	🇰🇷 Korean	100%
🇨🇳 Chinese	99.1%	🇩🇪 German	100%
🇫🇷 French	100%	🇮🇹 Italian	100%

🗺️ Regional Accent | Giọng Vùng miền

Region	Accuracy	Region	Accuracy
🇺🇸 en_us	96.7%	🇻🇳 vn_south	94.6%
🇬🇧 en_uk	87.5%	🇻🇳 vn_north	83.7%
🇫🇷 fr_france	96.8%	🇻🇳 vn_central	57.7%
🇨🇳 zh_mandarin	100%	🇭🇰 zh_cantonese	100%

😊 Emotion | Cảm xúc (70.95% F1)

Emotion	Accuracy	Emotion	Accuracy
Neutral Trung tính	90.4%	Happy Vui vẻ	79.5%
Anger Tức giận	86.8%	Concerned Lo lắng	67.6%
Sadness Buồn	82.9%	Calm Bình tĩnh	57.6%

💼 Social Class | Tầng lớp XH (87.82% F1)

Class	Accuracy
Middle Trung lưu	95.3%
Upper Thượng lưu	79.5%
Working Lao động	68.1%

🎓 Education | Học vấn (79.34% F1)

Level	Accuracy
Tertiary Đại học	89.5%
Secondary THPT	81.1%
Postgraduate Sau ĐH	53.3%

🏭 Use Cases | Trường hợp Sử dụng

🏦 Banking | Ngân hàng

result = model.analyze("transaction_call.wav")

# Fraud detection | Phát hiện gian lận
if result.json["qc"]["customer_risk"] > 0.7:
    flag_for_review()

# Voice stress analysis | Phân tích căng thẳng giọng
if result.json["voice"]["tension"] > 0.8:
    alert_supervisor()

📞 Call Center | Tổng đài

# Real-time routing | Định tuyến thời gian thực
# live_stream: the current audio chunk (file path, np.ndarray, or torch.Tensor)
probs = model.classify(live_stream)

if probs["emotion"]["anger"] > 0.6:
    route_to_supervisor()
elif probs["language"]["vi"] > 0.9:
    route_to_vietnamese_team()
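Indexing the probabilities directly raises `KeyError` when a label is absent, for example after a `top_k=1` call that dropped it. The sketch below shows a more defensive variant of the routing rule above. `choose_route` and the queue names are hypothetical, and the thresholds are the ones from the snippet above.

```python
from typing import Dict

Probs = Dict[str, Dict[str, float]]


def choose_route(
    probs: Probs, anger_threshold: float = 0.6, vi_threshold: float = 0.9
) -> str:
    """Map classify()-style probabilities to a queue name.

    Missing fields or labels count as probability 0.0 instead of
    raising KeyError.
    """
    anger = probs.get("emotion", {}).get("anger", 0.0)
    vietnamese = probs.get("language", {}).get("vi", 0.0)
    if anger > anger_threshold:
        return "supervisor"
    if vietnamese > vi_threshold:
        return "vietnamese_team"
    return "default"


print(choose_route({"emotion": {"anger": 0.7}, "language": {"vi": 0.95}}))    # supervisor
print(choose_route({"emotion": {"neutral": 0.85}, "language": {"vi": 0.94}}))  # vietnamese_team
print(choose_route({"gender": {"female": 0.96}}))                              # default
```

Keeping the rule in a pure function also makes the thresholds easy to unit-test without loading the model.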

🏥 Healthcare | Y tế

# Mental health screening | Sàng lọc sức khỏe tâm thần
result = model.analyze("patient_call.wav")

depression_indicators = (
    result.json["voice"]["energy"] < 0.3 and
    result.json["emotion"]["top3"][0]["e"] == "sadness"
)

🔍 Best use:

Recommended Pipeline for Long Audio

Segment: Extract multiple 4–25 s clips from the conversation (you should use the BrighTO Speaker Verification Model to split precisely between speakers).
Analyze: Run SSAP independently on each segment.
Aggregate: Use a lightweight LLM to synthesize the final speaker profile.

This approach reduces single-sample bias and captures emotional and behavioral changes throughout the interaction.

Quy trình Khuyến nghị cho Audio Dài

Phân đoạn: Trích xuất nhiều clip 4–25 giây từ cuộc hội thoại (bạn nên dùng BrighTO Speaker Verification Model để tách biệt chính xác Người Nói).
Phân tích: Chạy SSAP độc lập trên từng đoạn.
Tổng hợp: Dùng LLM nhẹ để tổng hợp hồ sơ người nói cuối cùng.

Cách tiếp cận này giảm bias từ mẫu đơn và nắm bắt thay đổi cảm xúc/hành vi trong suốt cuộc tương tác.
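The segment and aggregate steps can be sketched without any model call. The example below makes two simplifying assumptions that differ from the recommendation above: it cuts fixed 20-second windows instead of splitting by speaker, and it averages `classify()`-style probabilities instead of asking an LLM to synthesize the profile. `segment_windows` and `average_probs` are hypothetical helpers.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Probs = Dict[str, Dict[str, float]]


def segment_windows(
    duration_s: float, window_s: float = 20.0, min_s: float = 4.0
) -> List[Tuple[float, float]]:
    """Cut [0, duration_s] into consecutive windows of at most window_s seconds.

    A trailing remainder shorter than min_s is dropped.
    """
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        if end - start >= min_s:
            windows.append((start, end))
        start = end
    return windows


def average_probs(per_segment: List[Probs]) -> Probs:
    """Average each label's probability over all segments (missing label = 0)."""
    totals = defaultdict(lambda: defaultdict(float))
    for probs in per_segment:
        for field, dist in probs.items():
            for label, p in dist.items():
                totals[field][label] += p
    n = len(per_segment)
    return {f: {lab: s / n for lab, s in d.items()} for f, d in totals.items()}


print(segment_windows(63.0))
# [(0.0, 20.0), (20.0, 40.0), (40.0, 60.0)]

segments = [
    {"emotion": {"neutral": 0.75, "anger": 0.25}},
    {"emotion": {"neutral": 0.25, "anger": 0.75}},
]
print(average_probs(segments))
# {'emotion': {'neutral': 0.5, 'anger': 0.5}}
```

Averaging hides the order of events, so per-segment results are still worth keeping when the goal is to track how emotion changes during the call.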

🔧 API Reference | Tham chiếu API

HattoVoice.load()

model = HattoVoice.load(
    path: str,               # HF repo ID or local path | ID repo HF hoặc đường dẫn cục bộ
    device: str = "cuda",    # "cuda" or "cpu"
    use_vllm: bool = False,  # Enable vLLM acceleration | Bật tăng tốc vLLM
    vllm_gpu_memory: float = 0.5,  # GPU fraction for vLLM | Phần GPU cho vLLM
)

model.analyze()

result = model.analyze(
    audio: str | np.ndarray | torch.Tensor,  # Audio input | Đầu vào âm thanh
    temperature: float = 0.0,  # 0 = deterministic | 0 = xác định
    max_tokens: int = 1024,    # Max output length | Độ dài đầu ra tối đa
) -> AnalysisResult

model.classify()

probs = model.classify(
    audio: str | np.ndarray | torch.Tensor,
    top_k: int = 0,  # 0 = all, N = top N | 0 = tất cả, N = top N
) -> Dict[str, Dict[str, float]]

model.save()

model.save(path: str)  # Save for offline use | Lưu để dùng offline

Examples

Audio - Female, Việt Nam Central, Cooperative, Call Center 🙂 🤝 👍 ✅ 🙂‍↔️

Output: { "speaker": { "gender": "female", "age": "young" }, "demographics": { "language": "vi", "language_prob": 1.0, "region": "vn_central", "region_reason": "Người nói sử dụng chất giọng nặng đặc trưng, đặc biệt là cách phát âm các phụ âm đầu hơi cứng và âm điệu đi xuống ở cuối câu hỏi, gợi ý nguồn gốc từ Bắc Trung Bộ (như Nghệ An hoặc Hà Tĩnh).", "region_prob": 0.85, "education": "tertiary", "education_reason": "Sử dụng từ vựng chuyên nghiệp ('xác nhận lại', 'trầy xước') và cấu trúc câu rành mạch, chuẩn xác trong bối cảnh dịch vụ.", "education_prob": 0.9, "social_class": "middle", "social_class_reason": "Môi trường âm thanh nền yên tĩnh, chất giọng lịch sự, kiểm soát âm lượng tốt, phù hợp với nhân viên văn phòng hoặc tổng đài viên chuyên nghiệp.", "social_class_prob": 0.85 }, "qc": { "teller_score": 0.9, "customer_risk": 0.0 }, "emotion": { "valence": 0.3, "arousal": 0.4, "top3": [ { "e": "neutral", "s": 0.7 }, { "e": "calm", "s": 0.2 }, { "e": "concerned", "s": 0.1 } ] }, "attitude": { "top3": [ { "a": "cooperative", "s": 0.6 }, { "a": "friendly", "s": 0.3 }, { "a": "neutral", "s": 0.1 } ] }, "voice": { "pitch": "mid", "energy": 0.5, "speed": 0.55, "tension": 0.2, "harsh": 0.1, "breathy": 0.1 }, "quality": { "noise": 0.1, "distortion": 0.0, "overlap": 0.0, "stutter": 0.0 }, "notes": "Chất giọng nặng và âm sắc đặc trưng ở các từ cuối câu xác nhận nguồn gốc Bắc Trung Bộ dù đang giao tiếp trong bối cảnh chuyên nghiệp. Người nói thể hiện thái độ ân cần, tập trung cao độ vào việc xác minh thông tin khách hàng, không có dấu hiệu xao nhãng hay thiếu kiên nhẫn. Nhịp điệu nói vừa phải, rõ ràng giúp tạo cảm giác tin cậy và chuyên nghiệp trong quy trình xử lý sự cố bảo hiểm hoặc sửa chữa. Không có dấu hiệu rủi ro hay căng thẳng trong giọng nói." }

Audio - Female, Việt Nam Southern 😌 🧘 😊 🙂

Output: { "speaker": { "gender": "female", "age": "young" }, "demographics": { "language": "vi", "language_prob": 1.0, "region": "vn_south", "region_reason": "The speaker demonstrates clear Southern Vietnamese phonetic markers, specifically the merging of 'd' and 'gi' sounds, the relaxed articulation of final consonants, and the characteristic open vowel sounds typical of the Southern dialect.", "region_prob": 0.98, "education": "tertiary", "education_reason": "The vocabulary used is literary and sophisticated (e.g., 'thập ác liên hoa cảnh', 'giam cầm'), suggesting a reading from a novel or script rather than casual speech, indicating a high level of literacy.", "education_prob": 0.9, "social_class": "middle", "social_class_reason": "The recording environment is quiet with no background noise, and the speaker maintains a controlled, polished narrative voice consistent with professional or semi-professional content creation.", "social_class_prob": 0.85 }, "qc": { "teller_score": 0.85, "customer_risk": 0.0 }, "emotion": { "valence": 0.1, "arousal": 0.4, "top3": [ { "e": "neutral", "s": 0.6 }, { "e": "calm", "s": 0.3 }, { "e": "concerned", "s": 0.1 } ] }, "attitude": { "top3": [ { "a": "neutral", "s": 0.7 }, { "a": "authoritative", "s": 0.2 }, { "a": "empathetic", "s": 0.1 } ] }, "voice": { "pitch": "mid", "energy": 0.5, "speed": 0.5, "tension": 0.3, "harsh": 0.1, "breathy": 0.2 }, "quality": { "noise": 0.0, "distortion": 0.0, "overlap": 0.0, "stutter": 0.0 }, "notes": "The speaker's distinct Southern Vietnamese accent is confirmed by the soft pronunciation of initial consonants and the specific intonation used for narrative storytelling. This audio appears to be a narration or audiobook reading rather than a customer service interaction, characterized by a steady, dramatic pacing and literary vocabulary. There are no signs of risk or distress; the voice is composed and professional. The delivery is clear, well-paced, and devoid of hesitation or stuttering." }

Audio - British English, Male 😌 🧘 😊 🙂

Output: { "speaker": { "gender": "male", "age": "adult" }, "demographics": { "language": "en", "language_prob": 1.0, "region": "en_uk", "region_reason": "Người nói sử dụng giọng Anh-Anh (British English) chuẩn mực (RP), đặc trưng bởi cách phát âm non-rhotic (không uốn lưỡi chữ 'r' cuối) và nguyên âm tròn trịa, rõ ràng.", "region_prob": 0.95, "education": "tertiary", "education_reason": "Cách diễn đạt gãy gọn, ngữ pháp chuẩn xác và ngữ điệu tự nhiên cho thấy khả năng ngôn ngữ tốt và trình độ học vấn cao.", "education_prob": 0.9, "social_class": "middle", "social_class_reason": "Giọng nói lịch sự, điềm tĩnh, môi trường thu âm yên tĩnh không có tạp âm nền, gợi ý một bối cảnh chuyên nghiệp hoặc được chuẩn bị kỹ.", "social_class_prob": 0.85 }, "qc": { "teller_score": 0.8, "customer_risk": 0.1 }, "emotion": { "valence": 0.2, "arousal": 0.3, "top3": [ { "e": "neutral", "s": 0.8 }, { "e": "calm", "s": 0.15 }, { "e": "other", "s": 0.05 } ] }, "attitude": { "top3": [ { "a": "neutral", "s": 0.7 }, { "a": "cooperative", "s": 0.2 }, { "a": "friendly", "s": 0.1 } ] }, "voice": { "pitch": "mid", "energy": 0.4, "speed": 0.5, "tension": 0.1, "harsh": 0.0, "breathy": 0.1 }, "quality": { "noise": 0.1, "distortion": 0.0, "overlap": 0.0, "stutter": 0.0 }, "notes": "Người nói sử dụng giọng Anh-Anh chuẩn mực (RP) với ngữ điệu đặc trưng, xác nhận vùng miền en_uk. Giọng điệu hoàn toàn trung tính, mang tính chất trần thuật và cung cấp thông tin, không biểu lộ cảm xúc cá nhân mạnh mẽ. Nhịp điệu nói ổn định, có sự tạm dừng ngắn tự nhiên để suy nghĩ ('largely in about...'), cho thấy sự chân thực và không đọc từ kịch bản cứng nhắc. Không có dấu hiệu căng thẳng hay rủi ro hành vi nào được phát hiện." }

Audio - Indian English, Friendly & Happy 😊 😄 🙂 😃 🥳

Output: { "speaker": { "gender": "male", "age": "young" }, "demographics": { "language": "en", "language_prob": 1.0, "region": "en_in", "region_reason": "Cách phát âm âm retroflex nhẹ ở các phụ âm t và d, đặc biệt trong từ 'understood', cùng với ngữ điệu đặc trưng của vùng Nam Á (syllable-timed rhythm) xác nhận giọng Anh-Ấn.", "region_prob": 0.95, "education": "tertiary", "education_reason": "Người nói sử dụng từ vựng rõ ràng và cấu trúc câu mạch lạc ('passion', 'understood'), thể hiện khả năng diễn đạt trôi chảy và tự tin.", "education_prob": 0.9, "social_class": "middle", "social_class_reason": "Môi trường âm thanh khá sạch nhưng vẫn có tiếng ồn nền nhẹ, phong thái nói chuyện thoải mái nhưng lịch sự, phù hợp với tầng lớp trung lưu.", "social_class_prob": 0.85 }, "qc": { "teller_score": 0.0, "customer_risk": 0.0 }, "emotion": { "valence": 0.6, "arousal": 0.5, "top3": [ { "e": "happy", "s": 0.7 }, { "e": "joy", "s": 0.6 }, { "e": "calm", "s": 0.4 } ] }, "attitude": { "top3": [ { "a": "friendly", "s": 0.8 }, { "a": "cooperative", "s": 0.6 }, { "a": "neutral", "s": 0.3 } ] }, "voice": { "pitch": "mid", "energy": 0.6, "speed": 0.55, "tension": 0.2, "harsh": 0.0, "breathy": 0.1 }, "quality": { "noise": 0.3, "distortion": 0.0, "overlap": 0.0, "stutter": 0.1 }, "notes": "Phát âm retroflex nhẹ ở các phụ âm và ngữ điệu nhịp nhàng xác nhận nguồn gốc Anh-Ấn, bất chấp sự lưu loát trong ngôn ngữ. Người nói thể hiện sự nhiệt huyết chân thành khi nói về sở thích và đam mê, với năng lượng tích cực và giọng điệu cởi mở. Không có dấu hiệu căng thẳng hay ngập ngừng đáng lo ngại, cho thấy sự tự tin và thoải mái khi chia sẻ câu chuyện cá nhân." }

Audio - Japanese Male, Sad 😢 😔 🥺 😞 😿

Output: { "speaker": { "gender": "male", "age": "young" }, "demographics": { "language": "ja", "language_prob": 0.99, "region": "other", "region_reason": "The speaker uses standard Japanese pronunciation and intonation without distinct regional dialect markers.", "region_prob": 0.95, "education": "secondary", "education_reason": "The sentence structure is standard and grammatically correct, though the vocabulary is common and everyday.", "education_prob": 0.7, "social_class": "middle", "social_class_reason": "The recording environment is quiet with no background noise, and the speaker maintains a controlled, polite volume.", "social_class_prob": 0.75 }, "qc": { "teller_score": 0.0, "customer_risk": 0.2 }, "emotion": { "valence": -0.6, "arousal": 0.4, "top3": [ { "e": "sadness", "s": 0.85 }, { "e": "fear", "s": 0.4 }, { "e": "neutral", "s": 0.1 } ] }, "attitude": { "top3": [ { "a": "anxious", "s": 0.7 }, { "a": "concerned", "s": 0.6 }, { "a": "submissive", "s": 0.4 } ] }, "voice": { "pitch": "mid", "energy": 0.4, "speed": 0.5, "tension": 0.6, "harsh": 0.1, "breathy": 0.2 }, "quality": { "noise": 0.1, "distortion": 0.0, "overlap": 0.0, "stutter": 0.0 }, "notes": "The speaker's standard Japanese pronunciation lacks specific regional markers, indicating a general dialect. The vocal tone carries heavy regret and somberness, with a slight drop in energy at the end of the phrase suggesting resignation or guilt. There are no signs of deception in the pitch stability, but the emotional weight suggests genuine distress over the stated action. The delivery is direct and lacks hesitation, implying a sincere admission rather than a constructed lie." }

Audio - Korean Female, Happy 😊 😄 🙂 😃 🥳

Output: { "speaker": { "gender": "female", "age": "young" }, "demographics": { "language": "ko", "language_prob": 1.0, "region": "other", "region_reason": "Người nói sử dụng ngữ điệu chuẩn mực của tiếng Hàn (Seoul dialect), nhưng hệ thống chỉ hỗ trợ phân loại vùng miền chi tiết cho tiếng Việt, Anh, Trung và Pháp.", "region_prob": 0.95, "education": "tertiary", "education_reason": "Người nói sử dụng cấu trúc ngữ pháp chuẩn mực (đuôi câu kính ngữ '이에요'), phát âm rõ ràng từng âm tiết, ngữ điệu tự nhiên và trôi chảy, phản ánh trình độ học vấn tốt.", "education_prob": 0.85, "social_class": "middle", "social_class_reason": "Chất lượng âm thanh sạch, không có tiếng ồn nền, giọng nói nhẹ nhàng, lịch sự và có kiểm soát âm lượng tốt, phù hợp với tầng lớp trung lưu thành thị.", "social_class_prob": 0.8 }, "qc": { "teller_score": 0.8, "customer_risk": 0.05 }, "emotion": { "valence": 0.6, "arousal": 0.5, "top3": [ { "e": "happy", "s": 0.7 }, { "e": "calm", "s": 0.2 }, { "e": "joy", "s": 0.1 } ] }, "attitude": { "top3": [ { "a": "friendly", "s": 0.8 }, { "a": "cooperative", "s": 0.15 }, { "a": "neutral", "s": 0.05 } ] }, "voice": { "pitch": "high", "energy": 0.6, "speed": 0.5, "tension": 0.1, "harsh": 0.0, "breathy": 0.2 }, "quality": { "noise": 0.0, "distortion": 0.0, "overlap": 0.0, "stutter": 0.0 }, "notes": "Người nói sử dụng tiếng Hàn chuẩn giọng Seoul, xác nhận đây không phải là một trong các vùng miền ưu tiên phân tích (Việt/Anh/Trung/Pháp). Giọng nữ trẻ trung thể hiện sự hào hứng nhẹ nhàng và mong đợi, phù hợp với nội dung nói về lần đầu đi du lịch nước ngoài. Không có dấu hiệu do dự hay lo lắng về giá cả; năng lượng tích cực và thái độ cởi mở cho thấy đây là khách hàng tiềm năng cao, đang trong tâm thế sẵn sàng đón nhận thông tin tư vấn." }

Audio - Japanese Female, Angry & Hostile 👿 🔥 ⚔️ 💀

Output: { "speaker": { "gender": "male", "age": "young" }, "demographics": { "language": "ja", "language_prob": 0.99, "region": "other", "region_reason": "Người nói sử dụng tiếng Nhật chuẩn (Hyojungo), phát âm rõ ràng, sắc nét theo phong cách lồng tiếng anime kịch tính.", "region_prob": 0.95, "education": "secondary", "education_reason": "Cấu trúc câu hoàn chỉnh và ngữ pháp chuẩn mực, tuy nhiên từ vựng mang tính đe dọa trực tiếp và bạo lực, không thể hiện sự tinh tế học thuật cao.", "education_prob": 0.85, "social_class": "middle", "social_class_reason": "Chất lượng âm thanh rất rõ ràng, không có tạp âm môi trường, giọng nói có lực và uy quyền, gợi ý môi trường được kiểm soát tốt hoặc diễn xuất chuyên nghiệp.", "social_class_prob": 0.8 }, "qc": { "teller_score": 0.0, "customer_risk": 0.95 }, "emotion": { "valence": -0.8, "arousal": 0.95, "top3": [ { "e": "anger", "s": 0.95 }, { "e": "other", "s": 0.7 }, { "e": "disgust", "s": 0.3 } ] }, "attitude": { "top3": [ { "a": "hostile", "s": 0.98 }, { "a": "confrontational", "s": 0.95 }, { "a": "authoritative", "s": 0.6 } ] }, "voice": { "pitch": "mid", "energy": 0.98, "speed": 0.75, "tension": 0.95, "harsh": 0.85, "breathy": 0.1 }, "quality": { "noise": 0.05, "distortion": 0.1, "overlap": 0.0, "stutter": 0.0 }, "notes": "Người nói thể hiện ngữ điệu tiếng Nhật chuẩn với cường độ cảm xúc cực cao, đặc trưng của phong cách diễn xuất kịch tính. Giọng nói chứa đầy sự phẫn nộ bùng nổ (anger) và thái độ thù địch (hostile) rõ rệt qua việc nhấn mạnh từng từ ngữ. Năng lượng giọng nói rất cao, kèm theo sự căng thẳng (tension) lớn trong dây thanh quản, cho thấy một trạng thái kích động mạnh và quyết tâm thực hiện hành vi bạo lực. Đây là dấu hiệu rủi ro hành vi rất cao, cho thấy sự mất kiểm soát cảm xúc hoặc một ý định tấn công rõ ràng." }

Audio - British English Male, Secondary Education, Working Class 😌 🧘 😊 🙂

Output: { "speaker": { "gender": "male", "age": "middle" }, "demographics": { "language": "en", "language_prob": 0.99, "region": "en_uk", "region_reason": "Người nói sử dụng âm 'glottal stop' thay cho âm /t/ trong từ 'dunnit' và 'standard', ngữ điệu không đều đặc trưng của tiếng Anh-Anh, không có âm R cuối từ (non-rhotic).", "region_prob": 0.95, "education": "secondary", "education_reason": "Từ vựng ở mức giao tiếp thông thường ('dunnit', 'bit more'), cấu trúc câu lặp lại ('know the standard'), thiếu tính học thuật hay trang trọng.", "education_prob": 0.85, "social_class": "working", "social_class_reason": "Giọng nói tự nhiên, trực tiếp, hơi xuề xòa, không trau chuốt theo kiểu 'upper class', môi trường âm thanh có tiếng vang nhẹ nhưng không ồn ào.", "social_class_prob": 0.8 }, "qc": { "teller_score": 0.0, "customer_risk": 0.1 }, "emotion": { "valence": 0.1, "arousal": 0.3, "top3": [ { "e": "neutral", "s": 0.6 }, { "e": "calm", "s": 0.3 }, { "e": "amusement", "s": 0.1 } ] }, "attitude": { "top3": [ { "a": "neutral", "s": 0.5 }, { "a": "friendly", "s": 0.3 }, { "a": "cooperative", "s": 0.2 } ] }, "voice": { "pitch": "mid", "energy": 0.4, "speed": 0.5, "tension": 0.2, "harsh": 0.3, "breathy": 0.1 }, "quality": { "noise": 0.2, "distortion": 0.0, "overlap": 0.0, "stutter": 0.2 }, "notes": "Người nói có ngữ âm Anh-Anh rõ rệt, đặc biệt là cách nuốt âm và ngữ điệu tự nhiên, xác nhận nguồn gốc Vương quốc Anh. Giọng nói mang tính chất suy ngẫm, hồi tưởng ('Three years ago I think...'), có sự ngập ngừng nhẹ ('no, just a bit more') nhưng là do đang nhớ lại chứ không phải dấu hiệu lo âu hay che giấu rủi ro. Tốc độ nói vừa phải, thái độ thoải mái, không có dấu hiệu căng thẳng hay phòng thủ. Đây là một đoạn hội thoại mang tính kể chuyện bình thường, mức độ rủi ro khách hàng rất thấp." }
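The example outputs above share one JSON layout, so they can be flattened into table rows for dashboards or CSV export. The sketch below assumes that layout stays stable across calls. `flatten_profile` is a hypothetical helper, and the input is an abridged copy of the first example output.

```python
import json
from typing import Any, Dict


def flatten_profile(profile: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the headline fields out of one analysis JSON document.

    Missing sections yield None rather than raising KeyError.
    """
    demo = profile.get("demographics", {})
    emotions = profile.get("emotion", {}).get("top3", [])
    return {
        "gender": profile.get("speaker", {}).get("gender"),
        "language": demo.get("language"),
        "region": demo.get("region"),
        "education": demo.get("education"),
        "social_class": demo.get("social_class"),
        "top_emotion": emotions[0]["e"] if emotions else None,
        "customer_risk": profile.get("qc", {}).get("customer_risk"),
    }


# Abridged copy of the first example output above.
raw = """{"speaker": {"gender": "female", "age": "young"},
 "demographics": {"language": "vi", "region": "vn_central",
                  "education": "tertiary", "social_class": "middle"},
 "qc": {"teller_score": 0.9, "customer_risk": 0.0},
 "emotion": {"top3": [{"e": "neutral", "s": 0.7}, {"e": "calm", "s": 0.2}]}}"""

row = flatten_profile(json.loads(raw))
print(row["region"], row["top_emotion"], row["customer_risk"])  # vn_central neutral 0.0
```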

⚖️ ETHICAL CHARTER & DISCLAIMER

HIẾN CHƯƠNG ĐẠO ĐỨC & MIỄN TRỪ TRÁCH NHIỆM

⚠️ CRITICAL NOTICE: BrighTO-SSAP is a sensitive AI system capable of inferring complex personal attributes from voice. Access to and use of this model is strictly conditional upon adherence to the guidelines below.

⚠️ THÔNG BÁO QUAN TRỌNG: BrighTO-SSAP là hệ thống AI nhạy cảm có khả năng suy luận các thuộc tính cá nhân phức tạp từ giọng nói. Việc truy cập và sử dụng model này phụ thuộc nghiêm ngặt vào việc tuân thủ các hướng dẫn dưới đây.

🚫 PROHIBITED USE CASES | CÁC TRƯỜNG HỢP SỬ DỤNG BỊ CẤM

This model MUST NOT be used for the following purposes. Violation may result in license revocation.

Model này TUYỆT ĐỐI KHÔNG được sử dụng cho các mục đích sau. Vi phạm có thể dẫn đến thu hồi giấy phép.

❌ Discrimination | Phân biệt Đối xử

Making automated decisions regarding hiring, firing, lending, housing, insurance, or essential services based solely on predicted social class, education, accent, or demographics.

Ra quyết định tự động về tuyển dụng, sa thải, cho vay, nhà ở, bảo hiểm hoặc dịch vụ thiết yếu chỉ dựa trên dự đoán về tầng lớp xã hội, học vấn, giọng nói hoặc nhân khẩu học.

❌ Surveillance | Giám sát

Mass surveillance, unauthorized wiretapping, or analyzing individuals in public/private spaces without explicit, informed consent.

Giám sát diện rộng, nghe lén trái phép, hoặc phân tích cá nhân tại không gian công cộng/riêng tư mà không có sự đồng ý rõ ràng.

❌ Law Enforcement | Thực thi Pháp luật

Criminal profiling, predictive policing, voice lie detection (polygraphy), or use as forensic evidence in legal proceedings.

Lập hồ sơ tội phạm, dự báo tội phạm, phát hiện nói dối qua giọng nói, hoặc sử dụng làm bằng chứng pháp y trong các thủ tục tố tụng.

❌ Deception | Lừa dối

Manipulating individuals based on inferred emotional states (e.g., predatory marketing to vulnerable/anxious individuals).

Thao túng cá nhân dựa trên trạng thái cảm xúc được suy luận (ví dụ: tiếp thị săn mồi đối với người dễ bị tổn thương/lo âu).

❌ Protected Groups | Nhóm được Bảo vệ

Profiling minors (children) or individuals with speech impediments/pathologies without parental consent or valid legal basis.

Lập hồ sơ trẻ vị thành niên hoặc người có khiếm khuyết/bệnh lý về giọng nói khi chưa có sự đồng ý của phụ huynh hoặc cơ sở pháp lý hợp lệ.

❌ Medical Diagnosis | Chẩn đoán Y tế

Providing medical or psychiatric diagnoses based on voice biomarkers without oversight from a licensed healthcare professional.

Đưa ra chẩn đoán y tế hoặc tâm thần dựa trên dấu hiệu sinh trắc học giọng nói mà không có sự giám sát của chuyên gia y tế được cấp phép.

⚠️ SYSTEM LIMITATIONS | GIỚI HẠN HỆ THỐNG

Users must acknowledge the following technical limitations before deployment:

Người dùng phải thừa nhận các giới hạn kỹ thuật sau trước khi triển khai:

1. Probabilistic Nature | Bản chất Xác suất

Outputs are statistical predictions (estimates), NOT absolute facts. A high confidence score does not guarantee truth.

Đầu ra là các dự đoán thống kê (ước tính), KHÔNG PHẢI sự thật tuyệt đối. Điểm tin cậy cao không đảm bảo tính chính xác.

2. Inherited Bias | Thiên kiến Kế thừa

The model may reflect socio-economic biases present in the training data. Predictions regarding social class or education are based on acoustic correlations, not actual verification.

Model có thể phản ánh các thiên kiến kinh tế-xã hội có trong dữ liệu huấn luyện. Dự đoán về tầng lớp xã hội hoặc học vấn dựa trên tương quan âm học, không phải xác minh thực tế.

3. Contextual Dependency | Phụ thuộc Ngữ cảnh

Short audio samples (<5s) or poor recording conditions (noise, distortion) significantly reduce accuracy. Emotional states are transient and situational.

Mẫu âm thanh ngắn (<5s) hoặc điều kiện ghi âm kém (ồn, méo tiếng) làm giảm đáng kể độ chính xác. Trạng thái cảm xúc là nhất thời và phụ thuộc tình huống.
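Clips below the 5-second mark can be filtered out before they reach the model. The sketch below uses only the standard-library `wave` module, so it covers uncompressed PCM WAV files only. `wav_duration_seconds`, `long_enough` and `write_silence` are hypothetical helpers, and the 16 kHz rate in the demo is an assumption, not a documented model requirement.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def long_enough(path: str, min_seconds: float = 5.0) -> bool:
    """True if the clip meets the minimum length."""
    return wav_duration_seconds(path) >= min_seconds


def write_silence(path: str, seconds: float, rate: int = 16000) -> None:
    """Demo helper: write a mono 16-bit WAV containing only silence."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


with tempfile.TemporaryDirectory() as tmp:
    short_clip = os.path.join(tmp, "short.wav")
    long_clip = os.path.join(tmp, "long.wav")
    write_silence(short_clip, 2.0)
    write_silence(long_clip, 6.0)
    print(long_enough(short_clip), long_enough(long_clip))  # False True
```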

4. Human Oversight Required | Yêu cầu Giám sát Con người

MANDATORY: All high-stakes decisions (affecting rights, finance, safety) must involve human review ("Human-in-the-Loop").

BẮT BUỘC: Tất cả quyết định quan trọng (ảnh hưởng đến quyền lợi, tài chính, an toàn) phải có sự xem xét của con người.

📜 COMPLIANCE & SAFEGUARDS | TUÂN THỦ & BIỆN PHÁP BẢO VỆ

Users bear sole responsibility for compliance with applicable laws, including but not limited to GDPR (EU), CCPA (California, USA), and Vietnam AI Law 134/2025/QH15 regarding biometric data and privacy.

Người dùng chịu hoàn toàn trách nhiệm tuân thủ các luật hiện hành, bao gồm nhưng không giới hạn ở GDPR (EU), CCPA (California, USA), và Luật Trí Tuệ Nhân tạo Việt Nam 134/2025/QH15 liên quan đến dữ liệu sinh trắc học và quyền riêng tư.

Required Operational Safeguards | Các Biện pháp Vận hành Bắt buộc:

✅ Transparency: Explicitly disclose to end-users that their voice is being analyzed by AI. — Minh bạch: Thông báo rõ ràng cho người dùng cuối rằng giọng nói của họ đang được AI phân tích.
✅ Consent: Obtain valid, informed consent prior to analysis. — Đồng thuận: Phải có sự đồng ý hợp lệ trước khi phân tích.
✅ Right to Explanation: Provide mechanisms for individuals to contest or query AI-generated decisions. — Quyền được giải thích: Cung cấp cơ chế để cá nhân khiếu nại hoặc thắc mắc về các quyết định do AI tạo ra.
✅ Data Minimization: Do not store raw audio or sensitive profiles longer than necessary. — Tối thiểu hóa dữ liệu: Không lưu trữ âm thanh thô hoặc hồ sơ nhạy cảm lâu hơn mức cần thiết.

⚖️ LIMITATION OF LIABILITY | GIỚI HẠN TRÁCH NHIỆM PHÁP LÝ

BRIGHTO TECHNOLOGY HEREBY DISCLAIMS ALL LIABILITY FOR:

ANY MISUSE OF THE MODEL FOR PROHIBITED PURPOSES.
ANY DISCRIMINATORY OUTCOMES, PRIVACY VIOLATIONS, OR REPUTATIONAL DAMAGE ARISING FROM THE DEPLOYMENT OF THIS SYSTEM.
ANY RELIANCE ON THE MODEL'S PREDICTIONS FOR MEDICAL, LEGAL, OR FINANCIAL DECISIONS.

USERS ASSUME FULL LEGAL AND ETHICAL RESPONSIBILITY FOR THEIR USE OF BRIGHTO-SSAP.

BRIGHTO TECHNOLOGY TUYÊN BỐ MIỄN TRỪ MỌI TRÁCH NHIỆM ĐỐI VỚI:

BẤT KỲ VIỆC SỬ DỤNG SAI MỤC ĐÍCH NÀO VÀO CÁC TRƯỜNG HỢP BỊ CẤM.
BẤT KỲ KẾT QUẢ PHÂN BIỆT ĐỐI XỬ, VI PHẠM QUYỀN RIÊNG TƯ, HOẶC THIỆT HẠI DANH TIẾNG NÀO PHÁT SINH TỪ VIỆC TRIỂN KHAI HỆ THỐNG NÀY.
BẤT KỲ SỰ TIN TƯỞNG NÀO VÀO DỰ ĐOÁN CỦA MODEL ĐỂ RA QUYẾT ĐỊNH Y TẾ, PHÁP LÝ, HOẶC TÀI CHÍNH.

NGƯỜI DÙNG CHỊU TRÁCH NHIỆM PHÁP LÝ VÀ ĐẠO ĐỨC HOÀN TOÀN CHO VIỆC SỬ DỤNG BRIGHTO-SSAP.

By using this model, you acknowledge that you have read, understood, and agree to comply with this Ethical Charter.

Bằng việc sử dụng model này, bạn xác nhận rằng bạn đã đọc, hiểu và đồng ý tuân thủ Hiến chương Đạo đức này.

📜 License | Giấy phép

Commercial / Proprietary

All usage requires written approval from BrighTO Technology.

Mọi việc sử dụng cần có chấp thuận bằng văn bản từ BrighTO Technology.

📞 Contact | Liên hệ


Commercial	`nguyen@brighto.ai`, `nghia@brighto.ai`
API/Distribution	`duc@sphinxjsc.com` (SphinX JSC)
Technical	`nguyen@hatto.ai`

📚 Citation | Trích dẫn

@misc{brighto-ssap-2026,
  title={BrighTO-SSAP: Audio-Native Semantics \& Social Profiler},
  author={BrighTO Technology},
  year={2026},
  url={https://huggingface.co/thusinh1969/BrighTO-Semantic-Social-Audio-Profiler-V1.5.SE}
}

🏆 World's First Audio-Native Social Profiling

~70ms Quick • vLLM Accelerated • On-Premise Ready • Explainable AI


Evaluation results (self-reported, BrighTO Internal, 11 languages)

Metric	F1 (%)
Gender	92.14
Language	94.30
Emotion	70.95
Social Class	87.82
Education	79.34