---
license: mit
---

# LF-PhoBERT: Leakage-Free Robust Vietnamese Emotion Classification

LF-PhoBERT is a robust Vietnamese emotion classification model fine-tuned from **PhoBERT-base** using a leakage-free and reproducible training recipe. The model is designed for noisy social media text and imbalanced emotion distributions, with a focus on stability and deployment-oriented evaluation.

## 📄 Paper

**A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification**

**Authors:** Trung Quang Nguyen⁴, Duc Dat Pham¹˒³, Ngoc Tram Huynh Thi²˒³, Nguyen Thi Bich Ngoc²˒³, and Tan Duy Le*²˒³

¹ University of Science, Ho Chi Minh City, Vietnam
² International University, VNU-HCM, Vietnam
³ Vietnam National University, Ho Chi Minh City, Vietnam
⁴ Ho Chi Minh City University of Economics and Finance, Vietnam

## 🧠 Model Description

- **Backbone:** PhoBERT-base (`vinai/phobert-base`)
- **Task:** Single-label, multi-class emotion classification
- **Language:** Vietnamese
- **Domain:** Social media text
- **Number of classes:** 7 *(Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise)*

LF-PhoBERT is trained with a unified objective that combines:

- Class-Balanced Focal Loss
- R-Drop consistency regularization
- Supervised Contrastive Learning
- FGM-based adversarial training

All class statistics are computed **only on the training split** to prevent information leakage.

## 📊 Performance (SentiV)

Evaluated on the SentiV dataset using a stratified 80/20 split, averaged over 3 random seeds:

- **Macro-F1:** 0.8040 ± 0.0003
- **Accuracy:** 0.8144 ± 0.0004

The model outperforms standard PhoBERT fine-tuning with cross-entropy loss and remains stable across random seeds.
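To make the first ingredient of the objective concrete, here is a minimal, self-contained sketch of class-balanced weighting (Cui et al.'s effective-number scheme) combined with the focal modulation factor. Only this one loss term is shown; the R-Drop, contrastive, and FGM terms are omitted, and the `beta`, `gamma`, and class counts below are illustrative assumptions, not the paper's settings.

```python
import math

def class_balanced_weights(counts, beta=0.999):
    # Effective number of samples per class: (1 - beta^n) / (1 - beta).
    # The class weight is its inverse, normalized to sum to the class count C,
    # so rare classes receive larger weights than frequent ones.
    eff = [(1.0 - beta ** n) / (1.0 - beta) for n in counts]
    w = [1.0 / e for e in eff]
    total = sum(w)
    return [wi * len(counts) / total for wi in w]

def focal_loss(p_true, weight, gamma=2.0):
    # Weighted focal loss for a single example, given the probability the
    # model assigns to the true class: (1 - p)^gamma down-weights easy cases.
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)

# Hypothetical per-class counts, computed on the *training split only*
# (the leakage-free constraint described above):
counts = [1200, 300, 5000, 150, 4000, 900, 450]
weights = class_balanced_weights(counts)

# A confident correct prediction incurs far less loss than an uncertain one.
easy = focal_loss(0.95, weights[0])
hard = focal_loss(0.40, weights[0])
```

With these illustrative counts, the rarest class (150 examples) gets the largest weight and the most frequent class (5000 examples) the smallest, which is what counteracts the imbalanced emotion distribution the model targets.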
## 📦 Files in This Repository

- `model.safetensors` – fine-tuned model weights
- `config.json` – model configuration
- `tokenizer_config.json`, `vocab.txt`, `bpe.codes` – tokenizer files
- `id2label.json` – label mapping
- `special_tokens_map.json`, `added_tokens.json` – tokenizer metadata

## 🚀 Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "ducdatit2002/LF-PhoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Chiến dịch này làm tôi rất thất vọng 😡"  # "This campaign really disappointed me 😡"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

# config.id2label keys are integers after loading, so index with the raw id
predicted_label_id = outputs.logits.argmax(dim=-1).item()
label = model.config.id2label[predicted_label_id]
print(label)
```

## 🔁 Reproducibility

* Training performed on a single NVIDIA A100 (80 GB)
* PyTorch 2.9.1, CUDA 12.8
* Results reported as mean ± std over 3 random seeds
* Identical preprocessing and optimization settings across runs

This checkpoint is released to support reproducibility and practical deployment.

## ⚠️ Limitations

* Single-label classification cannot fully capture mixed or ambiguous emotions
* Sarcasm and context-dependent expressions remain challenging
* Performance is evaluated on SentiV only; cross-domain generalization is not guaranteed

## 📚 Citation

If you use this model, please cite:

```
Nguyen, T.Q., Pham, D.D., Huynh Thi, N.T., Nguyen, N.T.B., & Le, T.D. (2026). A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification.
```

## 📜 License

This model is released for research and educational purposes. Please refer to the PhoBERT license and the SentiV dataset terms for downstream usage.