---
language: vi
library_name: transformers
license: other
pipeline_tag: text-classification
tags:
- speech-emotion-recognition
- audio-classification
- text-classification
- multimodal
---

# Vietnamese Emotion Models (Text, Voice, Multimodal)

Three Vietnamese emotion recognition models (text, voice, and multimodal) packaged for Hugging Face with configs, labels, metrics, and inference snippets. Only the best checkpoint is kept for each branch.

## Structure

- `text-phobert-focalloss/`: PhoBERT + focal loss for text emotion classification.
- `voice-wav2vec2-vi-emotion/`: Wav2Vec2-base-vi-250h fine-tuned for Vietnamese SER.
- `multimodal/`: fusion weights for audio + text (`best.pt`) with `labels.json`.

## Setup

```bash
pip install transformers torch torchaudio soundfile
```

The voice and multimodal models require 16 kHz audio; resample your files if they differ.

## Text model (PhoBERT focal loss)

```python
from pathlib import Path

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = Path(__file__).resolve().parent  # .../hf-release
repo = base / "text-phobert-focalloss"

tok = AutoTokenizer.from_pretrained(repo, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(repo)

# "Tôi đang rất vui và hào hứng" = "I am feeling very happy and excited"
inputs = tok("Tôi đang rất vui và hào hứng", return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(-1)[0]

# config.id2label keys are converted to ints when the config is loaded,
# so index with an int, not a string
pred = model.config.id2label[int(probs.argmax())]
print(pred, float(probs.max()))
```

## Voice model (Wav2Vec2 SER)

```python
from pathlib import Path

import torch
import torchaudio
from transformers import AutoProcessor, Wav2Vec2ForSequenceClassification

base = Path(__file__).resolve().parent
repo = base / "voice-wav2vec2-vi-emotion"

processor = AutoProcessor.from_pretrained(repo)
model = Wav2Vec2ForSequenceClassification.from_pretrained(repo)

wav, sr = torchaudio.load("audio.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# squeeze() assumes mono input; average the channels first for stereo files
inputs = processor(wav.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(-1)[0]

pred = model.config.id2label[int(probs.argmax())]
print(pred, float(probs.max()))
```
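For quick checks, the same checkpoint should also work through the high-level `pipeline` API, which handles decoding and resampling for you. A minimal sketch, assuming `ffmpeg` is available for audio decoding:

```python
from transformers import pipeline

# Build an audio-classification pipeline from the local checkpoint directory.
clf = pipeline("audio-classification", model="voice-wav2vec2-vi-emotion")

# The pipeline decodes the file and resamples it to the model's sampling rate.
print(clf("audio.wav", top_k=3))
```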
## Multimodal model (audio + transcript)

```python
import sys
from pathlib import Path

import torch
import torchaudio
from transformers import AutoTokenizer, Wav2Vec2FeatureExtractor

base = Path(__file__).resolve().parent  # .../hf-release
sys.path.append(str(base.parent))  # add the repo root before importing the fusion module

from multimodal.multimodal_train_eval import FusionXMerlin

text_repo = base / "text-phobert-focalloss"
audio_repo = base / "voice-wav2vec2-vi-emotion"
ckpt_path = base / "multimodal" / "best.pt"

# The checkpoint bundles the fusion weights and the label mapping.
ckpt = torch.load(ckpt_path, map_location="cpu")
label2id = ckpt["label2id"]
id2label = {v: k for k, v in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained(text_repo, use_fast=False)
processor = Wav2Vec2FeatureExtractor.from_pretrained(audio_repo)

model = FusionXMerlin(
    text_model_path=text_repo,
    audio_model_path=audio_repo,
    num_classes=len(label2id),
    freeze_encoders=True,
).eval()
model.load_state_dict(ckpt["model_state"])

# "Tôi rất thất vọng về dịch vụ." = "I am very disappointed with the service."
transcript = "Tôi rất thất vọng về dịch vụ."
wav, sr = torchaudio.load("audio.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

t_inputs = tokenizer(transcript, return_tensors="pt", padding=True, truncation=True, max_length=256)
a_inputs = processor(wav.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits, _ = model(
        t_inputs["input_ids"],
        t_inputs["attention_mask"],
        a_inputs["input_values"],
        a_inputs["attention_mask"],
    )

probs = torch.softmax(logits, dim=-1)[0]
pred = id2label[int(probs.argmax())]
print(pred, float(probs.max()))
```

## Extra info

- Label set: Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise (mappings live in each model's config and in `labels.json`).
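If you only need the label mapping, for example to interpret raw logits, `multimodal/labels.json` can be read directly. A minimal sketch, assuming the file holds either a name-to-id mapping or a plain ordered list of label names (the exact schema is not documented here):

```python
import json
from pathlib import Path

raw = json.loads(Path("multimodal/labels.json").read_text(encoding="utf-8"))

# Normalize to id -> label, whether the file stores a {label: id}
# mapping or a plain ordered list of label names.
if isinstance(raw, dict):
    id2label = {int(i): name for name, i in raw.items()}
else:
    id2label = dict(enumerate(raw))
print(id2label)
```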