---
license: mit
---

# LF-PhoBERT: Leakage-Free Robust Vietnamese Emotion Classification

LF-PhoBERT is a robust Vietnamese emotion classification model fine-tuned from **PhoBERT-base** using a leakage-free and reproducible training recipe. The model is designed for noisy social media text and imbalanced emotion distributions, with a focus on stability and deployment-oriented evaluation.

## 📄 Paper

**A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification**

**Authors:** Trung Quang Nguyen⁴, Duc Dat Pham¹˒³, Ngoc Tram Huynh Thi²˒³, Nguyen Thi Bich Ngoc²˒³, and Tan Duy Le*²˒³

¹ University of Science, Ho Chi Minh City, Vietnam
² International University, VNU-HCM, Vietnam
³ Vietnam National University, Ho Chi Minh City, Vietnam
⁴ Ho Chi Minh City University of Economics and Finance, Vietnam

## 🧠 Model Description

- **Backbone:** PhoBERT-base (`vinai/phobert-base`)
- **Task:** Single-label, multi-class emotion classification
- **Language:** Vietnamese
- **Domain:** Social media text
- **Number of classes:** 7 *(Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise)*

LF-PhoBERT is trained with a unified objective that combines:

- Class-Balanced Focal Loss
- R-Drop consistency regularization
- Supervised Contrastive Learning
- FGM-based adversarial training

All class statistics are computed **only on the training split** to prevent information leakage.

## 📊 Performance (SentiV)

Evaluated on the SentiV dataset using a stratified 80/20 split, averaged over 3 random seeds:

- **Macro-F1:** 0.8040 ± 0.0003
- **Accuracy:** 0.8144 ± 0.0004

The model outperforms standard PhoBERT fine-tuning with cross-entropy loss and remains stable across random seeds.
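To make the first ingredient of the objective concrete, here is a minimal, self-contained sketch of class-balanced weighting (Cui et al.'s effective-number scheme) combined with the focal modulation factor. Only this one loss term is shown; the R-Drop, contrastive, and FGM terms are omitted, and the `beta`, `gamma`, and class counts below are illustrative assumptions, not the paper's settings.

```python
import math

def class_balanced_weights(counts, beta=0.999):
    # Effective number of samples per class: (1 - beta^n) / (1 - beta).
    # The class weight is its inverse, normalized to sum to the class count C,
    # so rare classes receive larger weights than frequent ones.
    eff = [(1.0 - beta ** n) / (1.0 - beta) for n in counts]
    w = [1.0 / e for e in eff]
    total = sum(w)
    return [wi * len(counts) / total for wi in w]

def focal_loss(p_true, weight, gamma=2.0):
    # Weighted focal loss for a single example, given the probability the
    # model assigns to the true class: (1 - p)^gamma down-weights easy cases.
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)

# Hypothetical per-class counts, computed on the *training split only*
# (the leakage-free constraint described above):
counts = [1200, 300, 5000, 150, 4000, 900, 450]
weights = class_balanced_weights(counts)

# A confident correct prediction incurs far less loss than an uncertain one.
easy = focal_loss(0.95, weights[0])
hard = focal_loss(0.40, weights[0])
```

With these illustrative counts, the rarest class (150 examples) gets the largest weight and the most frequent class (5000 examples) the smallest, which is what counteracts the imbalanced emotion distribution the model targets.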
## 📦 Files in This Repository

- `model.safetensors` – fine-tuned model weights
- `config.json` – model configuration
- `tokenizer_config.json`, `vocab.txt`, `bpe.codes` – tokenizer files
- `id2label.json` – label mapping
- `special_tokens_map.json`, `added_tokens.json` – tokenizer metadata

## 🚀 Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "ducdatit2002/LF-PhoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Chiến dịch này làm tôi rất thất vọng 😡"  # "This campaign really disappointed me 😡"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

# config.id2label keys are integers after loading, so index with the raw id
predicted_label_id = outputs.logits.argmax(dim=-1).item()
label = model.config.id2label[predicted_label_id]
print(label)
```

## 🔁 Reproducibility

* Training performed on a single NVIDIA A100 (80 GB)
* PyTorch 2.9.1, CUDA 12.8
* Results reported as mean ± std over 3 random seeds
* Identical preprocessing and optimization settings across runs

This checkpoint is released to support reproducibility and practical deployment.

## ⚠️ Limitations

* Single-label classification cannot fully capture mixed or ambiguous emotions
* Sarcasm and context-dependent expressions remain challenging
* Performance is evaluated on SentiV only; cross-domain generalization is not guaranteed

## 📚 Citation

If you use this model, please cite:

```
Nguyen, T.Q., Pham, D.D., Huynh Thi, N.T., Nguyen, N.T.B., & Le, T.D. (2026). A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification.
```

## 📜 License

This model is released for research and educational purposes. Please refer to the PhoBERT license and the SentiV dataset terms for downstream usage.