AliSalman29 committed on
Commit db6aa40 · 1 Parent(s): d802797

feat: update model
.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  *.png filter=lfs diff=lfs merge=lfs -text
+ *.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -14,11 +14,23 @@ This model classifies questions across **49 languages** into **8 categories** of
  - **Categories**: 8 NFQA question types
  - **Parameters**: ~278M parameters
  - **Training Date**: January 2026
- - **License**: [Specify your license]

  ### Developers

- Developed by [Your Name/Organization] for research in multilingual question understanding and classification.

  ## Intended Use

@@ -38,52 +50,27 @@ Developed by [Your Name/Organization] for research in multilingual question unde

  ## Training Data

- ### Dataset Composition
-
- The model was trained on a carefully curated and balanced multilingual dataset:
-
- - **Total Examples**: 62,932 question-label pairs
- - **Source Data**:
-   - ~49,000 examples from the WebFAQ dataset (LLM-annotated with ensemble voting)
-   - ~14,000 examples generated and validated using LLMs to balance categories and languages
- - **Data Split**:
-   - Training: 44,051 examples (70%)
-   - Validation: 6,294 examples (10%)
-   - Test: 12,587 examples (20%)
-
- ### Data Annotation & Balancing Process
-
- The dataset was created through a rigorous multi-step process combining LLM annotation and validation:
-
- **Phase 1: LLM Ensemble Annotation**
- - The original ~49,000 WebFAQ question-answer pairs were annotated using an ensemble of three language models:
-   - **LLaMA 3.1**
-   - **Gemma 2**
-   - **Qwen 2.5**
-
- **Phase 2: Quality Filtering**
- - Only high-quality annotations were retained using ensemble voting with a minimum confidence threshold of **0.6**:
-   - **Confidence 1.0**: All 3 models agree on the same label (unanimous)
-   - **Confidence 0.67**: At least 2 out of 3 models agree (majority vote)
- - Annotations below 0.6 confidence were excluded to ensure label reliability
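The voting and confidence rule described above can be sketched in a few lines. This is a hypothetical helper, not the project's actual annotation code; it assumes each of the three models emits exactly one label per example:

```python
from collections import Counter

def ensemble_label(labels, threshold=0.6):
    """Majority vote over model labels with an agreement-based confidence.

    3/3 agreement -> confidence 1.0, 2/3 -> 0.67, 1/3 -> 0.33.
    Returns (label, confidence) if confidence >= threshold, else None.
    """
    label, votes = Counter(labels).most_common(1)[0]
    confidence = round(votes / len(labels), 2)
    if confidence >= threshold:
        return label, confidence
    return None  # discarded: all three models disagree
```

With the stated 0.6 threshold, unanimous and 2-of-3 annotations are kept, while a three-way disagreement falls below the cutoff and is dropped.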

- **Phase 3: Gap Analysis**
- - After filtering, gaps were identified across language-category combinations
- - Target: 125 questions per category per language (1,000 per language total)

- **Phase 4: Synthetic Data Generation**
- - Missing question-answer pairs were generated using **LLaMA 3.1** to fill identified gaps
- - Generation followed category-specific templates and linguistic patterns

- **Phase 5: Validation**
- - All generated pairs underwent the same ensemble validation process (LLaMA 3.1, Gemma 2, Qwen 2.5)
- - Applied the same 0.6 confidence threshold to ensure quality consistency

- **Phase 6: Final Dataset**
- - Combined high-quality annotated and validated data achieving balanced representation:
-   - Each language: ~1,000 questions
-   - Each category per language: ~125 questions
-   - Diverse coverage across all 49 languages and 8 categories

  ### Languages Supported

@@ -113,7 +100,23 @@ Questions seeking factual, objective answers (who, what, when, where).
  - "When was the Eiffel Tower built?"
  - "Who invented the telephone?"

- ### 3. INSTRUCTION (Label 2)
  How-to questions requiring step-by-step procedural answers.

  **Examples:**
@@ -121,7 +124,7 @@ How-to questions requiring step-by-step procedural answers.
  - "How to bake chocolate chip cookies?"
  - "How can I install Python on Windows?"

- ### 4. REASON (Label 3)
  Why/how questions seeking explanations or reasoning.

  **Examples:**
@@ -129,22 +132,6 @@ Why/how questions seeking explanations or reasoning.
  - "How does photosynthesis work?"
  - "Why do birds migrate?"

- ### 5. EVIDENCE-BASED (Label 4)
- Questions about definitions, features, or characteristics.
-
- **Examples:**
- - "What are the symptoms of flu?"
- - "What features does this phone have?"
- - "What is machine learning?"
-
- ### 6. COMPARISON (Label 5)
- Questions comparing two or more options.
-
- **Examples:**
- - "iPhone vs Android: which is better?"
- - "What's the difference between RNA and DNA?"
- - "Compare electric and gas cars"
-
  ### 7. EXPERIENCE (Label 6)
  Questions seeking personal experiences, recommendations, or advice.

@@ -153,48 +140,54 @@ Questions seeking personal experiences, recommendations, or advice.
  - "Has anyone tried this restaurant?"
  - "Which hotel would you recommend?"

- ### 8. DEBATE (Label 7)
- Hypothetical, opinion-based, or debatable questions.

  **Examples:**
- - "Is artificial intelligence dangerous?"
- - "Should we colonize Mars?"
- - "Is remote work better than office work?"

  ## Model Performance

- ### Test Set Results (12,587 examples)

- - **Overall Accuracy**: 88.6%
- - **Macro-Average F1**: 86.7%
- - **Best Validation F1**: 86.8% (achieved at epoch 27)

  ### Per-Category Performance

  | Category | Precision | Recall | F1-Score | Support |
  |----------|-----------|--------|----------|---------|
- | NOT-A-QUESTION | 0.92 | 0.91 | 0.92 | 957 |
- | FACTOID | 0.92 | 0.92 | 0.92 | 5,679 |
- | INSTRUCTION | 0.89 | 0.87 | 0.88 | 295 |
- | REASON | 0.77 | 0.82 | 0.80 | 664 |
- | EVIDENCE-BASED | 0.85 | 0.92 | 0.88 | 1,466 |
- | COMPARISON | 0.84 | 0.83 | 0.83 | 885 |
- | EXPERIENCE | 0.82 | 0.76 | 0.79 | 1,556 |
- | DEBATE | 0.92 | 0.92 | 0.92 | 1,085 |

  ### Key Observations

- - **Strongest Performance**: FACTOID, DEBATE, and NOT-A-QUESTION categories (F1 ≥ 0.92)
- - **Good Performance**: INSTRUCTION, EVIDENCE-BASED, and COMPARISON categories (F1 ≥ 0.83)
- - **Moderate Performance**: REASON and EXPERIENCE categories (F1 ~ 0.79-0.80)
- - The model generalizes well across all 49 languages despite language imbalance in real-world data

  ## Training Procedure

  ### Hardware

  - Training Device: CUDA-enabled GPU (NVIDIA)
- - Training Time: ~27 epochs to reach best performance

  ### Hyperparameters

@@ -204,26 +197,47 @@ Hypothetical, opinion-based, or debatable questions.
  "max_length": 128, # Maximum sequence length
  "batch_size": 16, # Training batch size
  "learning_rate": 2e-5, # AdamW learning rate
- "num_epochs": 30, # Total epochs trained
  "warmup_steps": 500, # Linear warmup steps
  "weight_decay": 0.01, # L2 regularization
  "optimizer": "AdamW", # Optimizer
  "scheduler": "linear_warmup", # Learning rate scheduler
  "gradient_clipping": 1.0, # Max gradient norm
- "test_size": 0.2, # 20% test split
- "val_size": 0.1, # 10% validation split
  "random_seed": 42 # Reproducibility
  }
  ```

  ### Training Process

- 1. **Data Preparation**: Balanced dataset across 49 languages and 8 categories
  2. **Preprocessing**: Tokenization using XLM-RoBERTa tokenizer (max length: 128 tokens)
  3. **Training Strategy**: Supervised fine-tuning with stratified train/val/test splits
  4. **Optimization**: AdamW optimizer with linear warmup and gradient clipping
- 5. **Best Model Selection**: Model checkpoint with highest validation F1 score (epoch 27)
- 6. **Evaluation**: Comprehensive testing on held-out test set

  ## Usage

@@ -373,11 +387,11 @@ for r in results:

  ### Potential Biases

- - **Annotation Bias**: Labels are based on LLM ensemble predictions (LLaMA 3.1, Gemma 2, Qwen 2.5) rather than human annotations, which may introduce systematic biases from these underlying models
- - **Training Data Bias**: The model inherits biases from the LLM-annotated WebFAQ dataset and LLM-generated examples
- - **Language Representation**: European languages are better represented than other language families in the original WebFAQ data
- - **Category Distribution**: FACTOID questions are more prevalent in training data, which may affect classification thresholds
- - **LLM Consensus Bias**: The 0.6 confidence threshold favors categories where LLMs show higher agreement, potentially underrepresenting ambiguous or nuanced question types

  ### Recommendations for Use

@@ -400,7 +414,7 @@ If you use this model in your research, please cite:

  ```bibtex
  @misc{nfqa-multilingual-2026,
-   author = {[Your Name]},
    title = {NFQA Multilingual Question Classifier},
    year = {2026},
    publisher = {HuggingFace},
@@ -409,12 +423,23 @@ If you use this model in your research, please cite:
  }
  ```

  ## Related Resources

- - **WebFAQ Dataset**:
- - **XLM-RoBERTa**: https://huggingface.co/xlm-roberta-base
- - **Paper**: [Link to your paper if published]
- - **GitHub Repository**: [Link to your code repository]

  ## Model Card Contact

@@ -425,14 +450,13 @@ For questions, feedback, or issues:

  ## Acknowledgments

- - Training data sourced from the WebFAQ dataset
- - LLM annotation ensemble: LLaMA 3.1, Gemma 2, and Qwen 2.5
- - Balanced data generation using LLaMA 3.1
- - Built on the XLM-RoBERTa foundation model by Facebook AI (now Meta AI)
- - Training infrastructure provided by University of Passau LLM inference server

  ---

  **Model Version**: 1.0
- **Last Updated**: January 2026
  **Status**: Production Ready
 
  - **Categories**: 8 NFQA question types
  - **Parameters**: ~278M parameters
  - **Training Date**: January 2026
+ - **License**: apache-2.0

  ### Developers

+ Developed by Ali Salman for research in multilingual question understanding and classification.
+
+ ### Architecture
+
+ The model is based on XLM-RoBERTa (Cross-lingual Language Model - Robustly Optimized BERT Approach), a transformer-based multilingual encoder:
+
+ - **Base Architecture**: 12-layer transformer encoder
+ - **Hidden Size**: 768
+ - **Attention Heads**: 12
+ - **Parameters**: ~278M
+ - **Vocabulary Size**: 250,000 tokens (SentencePiece)
+ - **Pre-training**: Trained on 2.5TB of CommonCrawl data in 100 languages
+ - **Fine-tuning**: Classification head with dropout (0.2) for 8-class NFQA classification

  ## Intended Use

 

  ## Training Data

+ ### Dataset
+
+ The model was trained on the **[NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)**, a large-scale multilingual dataset for non-factoid question classification.
+
+ **Dataset Composition**:
+ - **Training**: 33,602 examples (70%)
+ - **Validation**: 6,979 examples (15%)
+ - **Test**: 7,696 examples (15%)
+ - **Total**: 48,277 balanced examples
+
+ **Source Distribution**:
+ - 54% from WebFAQ dataset (annotated with LLM ensemble)
+ - 46% AI-generated to balance language-category combinations
+
+ **Key Features**:
+ - 392 unique (language, category) combinations
+ - Target of ~125 examples per combination
+ - Stratified sampling to ensure balanced representation
+ - Ensemble annotation using Llama 3.1, Gemma 2, and Qwen 2.5
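As a quick arithmetic sanity check of the figures above (plain arithmetic, not project code): 49 languages times 8 categories gives the 392 combinations, and the ~125-example target per combination implies roughly 49,000 examples at full coverage:

```python
languages, categories = 49, 8
combinations = languages * categories          # unique (language, category) pairs
target_per_combination = 125                   # stated balancing target

print(combinations)                            # 392
print(combinations * target_per_combination)   # 49000 examples at the full target
```

The released 48,277-example total sits just under that theoretical maximum, consistent with some combinations falling short of the target.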

+ For detailed information about dataset generation, annotation methodology, and data composition, please visit the [dataset page](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset).

  ### Languages Supported

 
  - "When was the Eiffel Tower built?"
  - "Who invented the telephone?"

+ ### 3. DEBATE (Label 2)
+ Hypothetical, opinion-based, or debatable questions.
+
+ **Examples:**
+ - "Is artificial intelligence dangerous?"
+ - "Should we colonize Mars?"
+ - "Is remote work better than office work?"
+
+ ### 4. EVIDENCE-BASED (Label 3)
+ Questions about definitions, features, or characteristics.
+
+ **Examples:**
+ - "What are the symptoms of flu?"
+ - "What features does this phone have?"
+ - "What is machine learning?"
+
+ ### 5. INSTRUCTION (Label 4)
  How-to questions requiring step-by-step procedural answers.

  **Examples:**
  - "How to bake chocolate chip cookies?"
  - "How can I install Python on Windows?"

+ ### 6. REASON (Label 5)
  Why/how questions seeking explanations or reasoning.

  **Examples:**
  - "How does photosynthesis work?"
  - "Why do birds migrate?"

  ### 7. EXPERIENCE (Label 6)
  Questions seeking personal experiences, recommendations, or advice.

  - "Has anyone tried this restaurant?"
  - "Which hotel would you recommend?"

+ ### 8. COMPARISON (Label 7)
+ Questions comparing two or more options.

  **Examples:**
+ - "iPhone vs Android: which is better?"
+ - "What's the difference between RNA and DNA?"
+ - "Compare electric and gas cars"
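The eight labels above can be collected into a single mapping. This is only an illustration of the ordering listed in this card; the `id2label`/`label2id` entries in the repository's `config.json` remain the authoritative source:

```python
# Label ids exactly as listed in the category sections above
id2label = {
    0: "NOT-A-QUESTION",
    1: "FACTOID",
    2: "DEBATE",
    3: "EVIDENCE-BASED",
    4: "INSTRUCTION",
    5: "REASON",
    6: "EXPERIENCE",
    7: "COMPARISON",
}

# Inverse mapping, e.g. for interpreting raw logits or building a head config
label2id = {name: i for i, name in id2label.items()}

print(label2id["DEBATE"])  # 2
```

Note that this ordering differs from the previous model version, which placed INSTRUCTION at id 2; downstream code that hard-codes label ids should be updated accordingly.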

  ## Model Performance

+ ### Test Set Results (7,696 examples)

+ - **Overall Accuracy**: 88.1%
+ - **Macro-Average F1**: 88.1%
+ - **Best Validation F1**: 88.1% (achieved at epoch 6)

  ### Per-Category Performance

  | Category | Precision | Recall | F1-Score | Support |
  |----------|-----------|--------|----------|---------|
+ | NOT-A-QUESTION | 0.96 | 0.92 | 0.94 | 950 |
+ | FACTOID | 0.84 | 0.79 | 0.81 | 980 |
+ | DEBATE | 0.90 | 0.95 | 0.92 | 916 |
+ | EVIDENCE-BASED | 0.86 | 0.92 | 0.89 | 950 |
+ | INSTRUCTION | 0.85 | 0.92 | 0.88 | 980 |
+ | REASON | 0.88 | 0.86 | 0.87 | 960 |
+ | EXPERIENCE | 0.82 | 0.76 | 0.79 | 980 |
+ | COMPARISON | 0.93 | 0.93 | 0.93 | 980 |

  ### Key Observations

+ - **Strongest Performance**: NOT-A-QUESTION, COMPARISON, and DEBATE categories (F1 ≥ 0.92)
+ - **Good Performance**: EVIDENCE-BASED, INSTRUCTION, and REASON categories (F1 ≥ 0.87)
+ - **Moderate Performance**: FACTOID and EXPERIENCE categories (F1 ~ 0.79-0.81)
+ - The model generalizes well across all 49 languages with balanced test set distribution
+
+ ### Confusion Matrix
+
+ ![Confusion Matrix](confusion_matrix.png)
+
+ The confusion matrix shows the model's prediction patterns across all 8 categories. The diagonal elements represent correct classifications, while off-diagonal elements show misclassifications between categories.

  ## Training Procedure

  ### Hardware

  - Training Device: CUDA-enabled GPU (NVIDIA)
+ - Training Time: 6 epochs to reach best performance

  ### Hyperparameters

  "max_length": 128, # Maximum sequence length
  "batch_size": 16, # Training batch size
  "learning_rate": 2e-5, # AdamW learning rate
+ "num_epochs": 6, # Total epochs trained
  "warmup_steps": 500, # Linear warmup steps
  "weight_decay": 0.01, # L2 regularization
+ "dropout": 0.2, # Dropout probability
  "optimizer": "AdamW", # Optimizer
  "scheduler": "linear_warmup", # Learning rate scheduler
  "gradient_clipping": 1.0, # Max gradient norm
  "random_seed": 42 # Reproducibility
  }
  ```

  ### Training Process

+ 1. **Data Preparation**: Pre-split balanced dataset from [NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)
+    - Training: 33,602 examples (70%)
+    - Validation: 6,979 examples (15%)
+    - Test: 7,696 examples (15%)
  2. **Preprocessing**: Tokenization using XLM-RoBERTa tokenizer (max length: 128 tokens)
  3. **Training Strategy**: Supervised fine-tuning with stratified train/val/test splits
+    - Stratified by (language, category) combinations to maintain balance
  4. **Optimization**: AdamW optimizer with linear warmup and gradient clipping
+    - Total training steps: 12,606 (⌈33,602 examples ÷ 16 batch size⌉ = 2,101 steps per epoch × 6 epochs)
+    - Warmup steps: 500
+ 5. **Best Model Selection**: Model checkpoint with highest validation F1 score (epoch 6)
+ 6. **Evaluation**: Comprehensive testing on held-out test set with per-category and per-language analysis
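The optimizer step count quoted above can be reproduced directly. The schedule function is only a sketch of a typical linear-warmup-then-linear-decay policy under the stated hyperparameters; the exact scheduler implementation is not shown in this card:

```python
import math

train_examples, batch_size, epochs = 33_602, 16, 6
warmup_steps, base_lr = 500, 2e-5

steps_per_epoch = math.ceil(train_examples / batch_size)  # 2101 (last batch is partial)
total_steps = steps_per_epoch * epochs                    # 12606

def linear_warmup_lr(step: int) -> float:
    """Sketch: ramp linearly to base_lr over warmup, then decay linearly to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(total_steps)  # 12606
```

The ceiling in `steps_per_epoch` is why the total is 12,606 rather than 33,602 × 6 ÷ 16 = 12,600.75.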

+ ### Training Curves
+
+ ![Training Curves](training_curves.png)
+
+ The training curves show the model's learning progress across 6 epochs:
+ - **Left panel**: Training and validation loss over time
+ - **Middle panel**: Training and validation accuracy progression
+ - **Right panel**: Validation F1 score (macro average) with best checkpoint marked
+
+ The model converged quickly, reaching optimal performance at epoch 6 with minimal overfitting.

  ## Usage

 

  ### Potential Biases

+ - **Annotation Bias**: Labels are based on LLM ensemble predictions (Llama 3.1, Gemma 2, Qwen 2.5) rather than human annotations, which may introduce systematic biases from these underlying models
+ - **Training Data Bias**: The model inherits biases from the WebFAQ dataset and AI-generated examples
+ - **Language Representation**: While the dataset includes 49 languages, some language families may have different performance characteristics
+ - **Category Distribution**: The balanced dataset has similar representation across categories (~980 examples each in test set), which may differ from real-world distributions
+ - **Domain Specificity**: Trained primarily on FAQ-style and general questions; performance may vary on domain-specific questions

  ### Recommendations for Use

 
  ```bibtex
  @misc{nfqa-multilingual-2026,
+   author = {Ali Salman},
    title = {NFQA Multilingual Question Classifier},
    year = {2026},
    publisher = {HuggingFace},
  }
  ```

+ Please also cite the training dataset:
+
+ ```bibtex
+ @dataset{nfqa_multilingual_dataset_2026,
+   author = {Ali Salman},
+   title = {NFQA Multilingual Dataset: A Large-Scale Dataset for Non-Factoid Question Classification},
+   year = {2026},
+   publisher = {Hugging Face},
+   howpublished = {\url{https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset}}
+ }
+ ```
+
  ## Related Resources

+ - **Training Dataset**: [NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)
+ - **WebFAQ Dataset**: [PaDaS-Lab/webfaq](https://huggingface.co/datasets/PaDaS-Lab/webfaq)
+ - **XLM-RoBERTa**: [xlm-roberta-base](https://huggingface.co/xlm-roberta-base)

  ## Model Card Contact

  ## Acknowledgments

+ - Training dataset: [NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)
+ - Source data: [WebFAQ Dataset](https://huggingface.co/datasets/PaDaS-Lab/webfaq)
+ - Built on the [XLM-RoBERTa](https://huggingface.co/xlm-roberta-base) foundation model by Meta AI
+ - Annotation and generation using Llama 3.1, Gemma 2, and Qwen 2.5

  ---

  **Model Version**: 1.0
+ **Last Updated**: February 2026
  **Status**: Production Ready
UPLOAD_INSTRUCTIONS.md ADDED
@@ -0,0 +1,195 @@
+ # Instructions to Upload to Hugging Face
+
+ This repository is ready to be pushed to Hugging Face Model Hub!
+
+ ## Quick Setup (5 minutes)
+
+ ### Step 1: Create Hugging Face Repository
+
+ 1. Go to https://huggingface.co/new
+ 2. Fill in:
+    - **Model name**: `nfqa-multilingual-classifier`
+    - **License**: Apache 2.0 (recommended) or your choice
+    - **Visibility**: Public (or Private if you prefer)
+ 3. Click **"Create model"**
+ 4. **Important**: Copy your repository URL from the page
+
+ ### Step 2: Get Your Access Token
+
+ 1. Go to https://huggingface.co/settings/tokens
+ 2. Click **"New token"**
+ 3. Name: `model-upload`
+ 4. Type: **Write** (important!)
+ 5. Click **"Generate token"**
+ 6. **Copy the token** (you won't see it again)
+
+ ### Step 3: Connect This Repository
+
+ Replace `YOUR_USERNAME` with your actual Hugging Face username:
+
+ ```bash
+ cd /Users/alisalman/thesis/nfqa-multilingual-classifier
+
+ # Add Hugging Face as remote
+ git remote add origin https://huggingface.co/YOUR_USERNAME/nfqa-multilingual-classifier
+
+ # Configure git to use your HF credentials
+ git config credential.helper store
+
+ # Push to Hugging Face (you'll be prompted for username and token)
+ git push -u origin master
+ ```
+
+ When prompted:
+ - **Username**: Your Hugging Face username
+ - **Password**: Paste your access token (not your password!)
+
+ ### Step 4: Verify Upload
+
+ 1. Go to `https://huggingface.co/YOUR_USERNAME/nfqa-multilingual-classifier`
+ 2. You should see:
+    - ✅ All model files (11 files)
+    - ✅ README with full documentation
+    - ✅ Training visualizations (confusion matrix, training curves)
+    - ✅ Model card with usage examples
+ 3. Test the **Inference API** widget with a question
+
+ ---
+
+ ## Alternative: Use Hugging Face CLI
+
+ If you prefer using the CLI:
+
+ ```bash
+ # Install if not already installed
+ pip install --upgrade huggingface_hub
+
+ # Login
+ huggingface-cli login
+ # Paste your token when prompted
+
+ # Create repository
+ huggingface-cli repo create nfqa-multilingual-classifier --type model
+
+ # Upload
+ cd /Users/alisalman/thesis/nfqa-multilingual-classifier
+ huggingface-cli upload nfqa-multilingual-classifier . --repo-type model
+ ```
+
+ ---
+
+ ## What's Included
+
+ This repository contains:
+
+ ✅ **Model Files** (1.1 GB total):
+ - `model.safetensors` - Model weights
+ - `config.json` - Model configuration
+ - `tokenizer.json` - Tokenizer
+ - `tokenizer_config.json` - Tokenizer settings
+ - `sentencepiece.bpe.model` - Vocabulary
+ - `special_tokens_map.json` - Special tokens
+
+ ✅ **Documentation**:
+ - `README.md` - Comprehensive model card
+ - `classification_report.txt` - Per-category performance
+ - `test_results.json` - Detailed evaluation metrics
+
+ ✅ **Visualizations**:
+ - `confusion_matrix.png` - Test set confusion matrix
+ - `training_curves.png` - Training/validation curves
+
+ ✅ **Git Configuration**:
+ - `.gitattributes` - LFS tracking for large files
+ - `.gitignore` - Ignore patterns
+
+ ---
+
+ ## Before You Push
+
+ ### Update README Placeholders
+
+ Edit [README.md](README.md) and replace:
+ - `[Your Name/Organization]` → Your actual name
+ - `[Specify your license]` → Your license choice
+ - `your-username/nfqa-multilingual-classifier` → Your actual repo URL
+ - `[Your email]` → Your contact email
+ - `[Your repository]` → Your GitHub repo (if any)
+
+ You can edit directly on Hugging Face after uploading, or do it now:
+
+ ```bash
+ nano README.md
+ # or use your preferred editor
+ ```
+
+ ---
+
+ ## Troubleshooting
+
+ ### Error: "Repository not found"
+ - Make sure you created the repository on huggingface.co first
+ - Check that the username in the URL matches your HF username
+
+ ### Error: "Authentication failed"
+ - Make sure you're using your **token** as password, not your account password
+ - Verify the token has **Write** permissions
+ - Try `git credential reject` to clear cached credentials
+
+ ### Error: "Large file not properly tracked"
+ - LFS is already configured in this repo
+ - Just push normally, git-lfs will handle large files automatically
+
+ ### Upload is very slow
+ - The model is ~1.1 GB, this is normal
+ - It may take 5-15 minutes depending on your internet speed
+ - Git LFS uploads large files efficiently
+
+ ---
+
+ ## After Upload
+
+ 1. **Test the model**:
+    ```python
+    from transformers import pipeline
+
+    classifier = pipeline("text-classification",
+                          model="YOUR_USERNAME/nfqa-multilingual-classifier")
+    result = classifier("What is the capital of France?")
+    print(result)
+    ```
+
+ 2. **Add widget examples** in the README YAML front matter (optional)
+
+ 3. **Share your model** on social media, papers, etc.
+
+ 4. **Monitor usage** at `https://huggingface.co/YOUR_USERNAME/nfqa-multilingual-classifier/tree/main`
+
+ ---
+
+ ## Quick Reference
+
+ ```bash
+ # View repository status
+ cd /Users/alisalman/thesis/nfqa-multilingual-classifier
+ git status
+
+ # View commit history
+ git log --oneline
+
+ # Check remote URL
+ git remote -v
+
+ # Push updates (after making changes)
+ git add .
+ git commit -m "Update model card"
+ git push
+ ```
+
+ ---
+
+ **Need help?**
+ - Hugging Face Docs: https://huggingface.co/docs/hub
+ - Git LFS Guide: https://git-lfs.github.com/
+
+ **Ready to push?** Follow Step 3 above!
classification_report.txt CHANGED
@@ -1,14 +1,14 @@
                  precision    recall  f1-score   support

- NOT-A-QUESTION       0.92      0.91      0.92       957
-        FACTOID       0.92      0.92      0.92      5679
-    INSTRUCTION       0.89      0.87      0.88       295
-         REASON       0.77      0.82      0.80       664
- EVIDENCE-BASED       0.85      0.92      0.88      1466
-     COMPARISON       0.84      0.83      0.83       885
-     EXPERIENCE       0.82      0.76      0.79      1556
-         DEBATE       0.92      0.92      0.92      1085

-       accuracy                           0.89     12587
-      macro avg       0.87      0.87      0.87     12587
-   weighted avg       0.89      0.89      0.89     12587

+ NOT-A-QUESTION       0.96      0.92      0.94       950
+        FACTOID       0.84      0.79      0.81       980
+         DEBATE       0.90      0.95      0.92       916
+ EVIDENCE-BASED       0.86      0.92      0.89       950
+    INSTRUCTION       0.85      0.92      0.88       980
+         REASON       0.88      0.86      0.87       960
+     EXPERIENCE       0.82      0.76      0.79       980
+     COMPARISON       0.93      0.93      0.93       980

+       accuracy                           0.88      7696
+      macro avg       0.88      0.88      0.88      7696
+   weighted avg       0.88      0.88      0.88      7696
config.json CHANGED
@@ -1,48 +1,3 @@
- {
-   "architectures": [
-     "XLMRobertaForSequenceClassification"
-   ],
-   "attention_probs_dropout_prob": 0.1,
-   "bos_token_id": 0,
-   "classifier_dropout": null,
-   "eos_token_id": 2,
-   "hidden_act": "gelu",
-   "hidden_dropout_prob": 0.1,
-   "hidden_size": 768,
-   "id2label": {
-     "0": "NOT-A-QUESTION",
-     "1": "FACTOID",
-     "2": "INSTRUCTION",
-     "3": "REASON",
-     "4": "EVIDENCE-BASED",
-     "5": "COMPARISON",
-     "6": "EXPERIENCE",
-     "7": "DEBATE"
-   },
-   "initializer_range": 0.02,
-   "intermediate_size": 3072,
-   "label2id": {
-     "COMPARISON": 5,
-     "DEBATE": 7,
-     "EVIDENCE-BASED": 4,
-     "EXPERIENCE": 6,
-     "FACTOID": 1,
-     "INSTRUCTION": 2,
-     "NOT-A-QUESTION": 0,
-     "REASON": 3
-   },
-   "layer_norm_eps": 1e-05,
-   "max_position_embeddings": 514,
-   "model_type": "xlm-roberta",
-   "num_attention_heads": 12,
-   "num_hidden_layers": 12,
-   "output_past": true,
-   "pad_token_id": 1,
-   "position_embedding_type": "absolute",
-   "problem_type": "single_label_classification",
-   "torch_dtype": "float32",
-   "transformers_version": "4.50.3",
-   "type_vocab_size": 1,
-   "use_cache": true,
-   "vocab_size": 250002
- }
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d64b32cf7198deee34a207a62d0681ea08b0b2ae51b5d011324791e5b24c6a9a
+ size 1118
confusion_matrix.png CHANGED

Git LFS Details

  • SHA256: 3d3533b2b7d69ceaa2f48a962c23b309a14e5753a95beaf680f9ad17da404609
  • Pointer size: 131 Bytes
  • Size of remote file: 326 kB

Git LFS Details

  • SHA256: 3547aca051e570d4d55a93059f29f8d3fe322ea24df58cb660dc043e58206be9
  • Pointer size: 131 Bytes
  • Size of remote file: 324 kB
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:fcf678e92084d404bd4a8055450df84c09e126dfa9d6b2869745895fecd450ff
3
  size 1112223464
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:33a25f9cc0e6e82ac88d37fb2ec3bfb4e61c9751e5db98d13f04692a5ab2f734
3
  size 1112223464
test_results.json CHANGED
@@ -1,22 +1,3 @@
- {
-   "test_loss": 1.1942661555863936,
-   "test_accuracy": 0.8855168030507666,
-   "test_f1_macro": 0.8672321792444992,
-   "best_epoch": 27,
-   "best_val_f1": 0.8676620754981998,
-   "num_train_examples": 44051,
-   "num_val_examples": 6294,
-   "num_test_examples": 12587,
-   "config": {
-     "model_name": "xlm-roberta-base",
-     "max_length": 128,
-     "batch_size": 16,
-     "learning_rate": 2e-05,
-     "num_epochs": 30,
-     "warmup_steps": 500,
-     "weight_decay": 0.01,
-     "test_size": 0.2,
-     "val_size": 0.1
-   },
-   "timestamp": "2026-01-16T19:09:44.473503"
- }
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0b90b48f6007b8ab4fdc46e83e9dcf2561802f5946e48bd665fde14a9ba3fa7d
+ size 778
training_curves.png CHANGED

Git LFS Details

  • SHA256: 8a7b8b14f9f6d510d1bde14302568ed59f6adc35559b1b4c6722cc7d98caf5ba
  • Pointer size: 131 Bytes
  • Size of remote file: 336 kB

Git LFS Details

  • SHA256: aa8e4ff07702869f2f2b8adff6218020c9d50c15a88e6c645adbf6a48a79e03b
  • Pointer size: 131 Bytes
  • Size of remote file: 301 kB
training_scripts/run_training_auto.sh ADDED
@@ -0,0 +1,115 @@
+ #!/bin/bash
+ #
+ # Train NFQA Model with Automatic Data Splitting
+ #
+ # This script trains the NFQA classification model using a single combined
+ # dataset that will be automatically split into train/val/test sets.
+ #
+ # Usage:
+ #   bash run_training_auto.sh
+ #
+ # Or with custom parameters:
+ #   bash run_training_auto.sh --epochs 15 --batch-size 32
+ #
+
+ set -e  # Exit on error
+
+ # Default paths
+ INPUT_FILE="../output/webfaq_nfqa_combined_highquality.jsonl"
+ OUTPUT_DIR="../output/training/nfqa_model_auto"
+
+ # Default training parameters
+ MODEL_NAME="xlm-roberta-base"
+ EPOCHS=6
+ BATCH_SIZE=16
+ LEARNING_RATE=2e-5
+ MAX_LENGTH=128
+ WARMUP_STEPS=500
+ WEIGHT_DECAY=0.1
+ DROPOUT=0.2
+ TEST_SIZE=0.2
+ VAL_SIZE=0.1
+
+ echo "================================================================================"
+ echo "NFQA Model Training - Automatic Split Mode"
+ echo "================================================================================"
+ echo ""
+ echo "Training Configuration:"
+ echo "  Input file: $INPUT_FILE"
+ echo "  Output directory: $OUTPUT_DIR"
+ echo "  Model: $MODEL_NAME"
+ echo "  Epochs: $EPOCHS"
+ echo "  Batch size: $BATCH_SIZE"
+ echo "  Learning rate: $LEARNING_RATE"
+ echo "  Max length: $MAX_LENGTH"
+ echo "  Weight decay: $WEIGHT_DECAY"
+ echo "  Dropout: $DROPOUT"
+ echo "  Test split: $TEST_SIZE (20%)"
+ echo "  Val split: $VAL_SIZE (10%)"
+ echo ""
+ echo "================================================================================"
+ echo ""
+
+ # Check if input file exists
+ if [ ! -f "$INPUT_FILE" ]; then
+     echo "❌ Error: Input file not found: $INPUT_FILE"
+     echo ""
+     echo "Please ensure the combined dataset exists."
+     echo "You can create it by running:"
+     echo "  cd ../annotator"
+     echo "  python combine_datasets.py"
+     exit 1
+ fi
+
+ # Create output directory
+ mkdir -p "$OUTPUT_DIR"
+
+ # Run training
+ python train_nfqa_model.py \
+     --input "$INPUT_FILE" \
+     --output-dir "$OUTPUT_DIR" \
+     --model-name "$MODEL_NAME" \
+     --epochs "$EPOCHS" \
+     --batch-size "$BATCH_SIZE" \
+     --learning-rate "$LEARNING_RATE" \
+     --max-length "$MAX_LENGTH" \
+     --warmup-steps "$WARMUP_STEPS" \
+     --weight-decay "$WEIGHT_DECAY" \
+     --dropout "$DROPOUT" \
+     --test-size "$TEST_SIZE" \
+     --val-size "$VAL_SIZE" \
+     "$@"  # Pass any additional arguments from the command line
+
+ # Check if training was successful
+ if [ $? -eq 0 ]; then
+     echo ""
+     echo "================================================================================"
+     echo "✅ Training completed successfully!"
+     echo "================================================================================"
+     echo ""
+     echo "Model saved to: $OUTPUT_DIR"
+     echo ""
+     echo "Generated files:"
+     echo "  - best_model/ (best checkpoint based on validation F1)"
+     echo "  - final_model/ (final epoch checkpoint)"
+     echo "  - training_history.json (training metrics)"
+ echo " - training_history.json (training metrics)"
96
+ echo " - training_curves.png (loss/accuracy/F1 plots)"
97
+ echo " - test_results.json (final test metrics)"
98
+ echo " - classification_report.txt (per-category performance)"
99
+ echo " - confusion_matrix.png (confusion matrix visualization)"
100
+ echo ""
101
+ echo "Next steps:"
102
+ echo " 1. Review training curves: $OUTPUT_DIR/training_curves.png"
103
+ echo " 2. Check test results: $OUTPUT_DIR/test_results.json"
104
+ echo " 3. Analyze confusion matrix: $OUTPUT_DIR/confusion_matrix.png"
105
+ echo " 4. Deploy model from: $OUTPUT_DIR/best_model/"
106
+ echo ""
107
+ else
108
+ echo ""
109
+ echo "================================================================================"
110
+ echo "❌ Training failed!"
111
+ echo "================================================================================"
112
+ echo ""
113
+ echo "Please check the error messages above and try again."
114
+ exit 1
115
+ fi
training_scripts/run_training_manual.sh ADDED
@@ -0,0 +1,124 @@
+ #!/bin/bash
+ #
+ # Train NFQA Model with Pre-Split Datasets
+ #
+ # This script trains the NFQA classification model using manually split
+ # train/validation/test datasets for balanced training.
+ #
+ # Usage:
+ #   bash run_training_manual.sh
+ #
+ # Or with custom parameters:
+ #   bash run_training_manual.sh --epochs 15 --batch-size 32
+ #
+
+ set -e  # Exit on error
+
+ # Default paths
+ TRAIN_FILE="../output/train_balanced.jsonl"
+ VAL_FILE="../output/val_balanced.jsonl"
+ TEST_FILE="../output/test_balanced.jsonl"
+ OUTPUT_DIR="../output/training/nfqa_model_balanced"
+
+ # Default training parameters
+ MODEL_NAME="xlm-roberta-base"
+ EPOCHS=6
+ BATCH_SIZE=16
+ LEARNING_RATE=2e-5
+ MAX_LENGTH=128
+ WARMUP_STEPS=500
+ WEIGHT_DECAY=0.1
+ DROPOUT=0.2
+
+ echo "================================================================================"
+ echo "NFQA Model Training - Manual Split Mode"
+ echo "================================================================================"
+ echo ""
+ echo "Training Configuration:"
+ echo "  Train file:       $TRAIN_FILE"
+ echo "  Validation file:  $VAL_FILE"
+ echo "  Test file:        $TEST_FILE"
+ echo "  Output directory: $OUTPUT_DIR"
+ echo "  Model:            $MODEL_NAME"
+ echo "  Epochs:           $EPOCHS"
+ echo "  Batch size:       $BATCH_SIZE"
+ echo "  Learning rate:    $LEARNING_RATE"
+ echo "  Max length:       $MAX_LENGTH"
+ echo "  Weight decay:     $WEIGHT_DECAY"
+ echo "  Dropout:          $DROPOUT"
+ echo ""
+ echo "================================================================================"
+ echo ""
+
+ # Check if required files exist
+ if [ ! -f "$TRAIN_FILE" ]; then
+     echo "❌ Error: Training file not found: $TRAIN_FILE"
+     echo ""
+     echo "Please run the data splitting script first:"
+     echo "  cd ../cleaning"
+     echo "  python split_train_test_val.py --input ../output/webfaq_nfqa_combined_highquality.jsonl"
+     exit 1
+ fi
+
+ if [ ! -f "$VAL_FILE" ]; then
+     echo "❌ Error: Validation file not found: $VAL_FILE"
+     exit 1
+ fi
+
+ if [ ! -f "$TEST_FILE" ]; then
+     echo "❌ Error: Test file not found: $TEST_FILE"
+     exit 1
+ fi
+
+ # Create output directory
+ mkdir -p "$OUTPUT_DIR"
+
+ # Run training and branch directly on its exit status. (With `set -e`,
+ # checking `$?` afterwards would never reach the failure branch.)
+ if python train_nfqa_model.py \
+     --train "$TRAIN_FILE" \
+     --val "$VAL_FILE" \
+     --test "$TEST_FILE" \
+     --output-dir "$OUTPUT_DIR" \
+     --model-name "$MODEL_NAME" \
+     --epochs "$EPOCHS" \
+     --batch-size "$BATCH_SIZE" \
+     --learning-rate "$LEARNING_RATE" \
+     --max-length "$MAX_LENGTH" \
+     --warmup-steps "$WARMUP_STEPS" \
+     --weight-decay "$WEIGHT_DECAY" \
+     --dropout "$DROPOUT" \
+     "$@"; then  # Pass any additional arguments from the command line
+
+     echo ""
+     echo "================================================================================"
+     echo "✅ Training completed successfully!"
+     echo "================================================================================"
+     echo ""
+     echo "Model saved to: $OUTPUT_DIR"
+     echo ""
+     echo "Generated files:"
+     echo "  - best_model/               (best checkpoint based on validation F1)"
+     echo "  - final_model/              (final epoch checkpoint)"
+     echo "  - training_history.json     (training metrics)"
+     echo "  - training_curves.png       (loss/accuracy/F1 plots)"
+     echo "  - test_results.json         (final test metrics)"
+     echo "  - classification_report.txt (per-category performance)"
+     echo "  - confusion_matrix.png      (confusion matrix visualization)"
+     echo ""
+     echo "Next steps:"
+     echo "  1. Review training curves:   $OUTPUT_DIR/training_curves.png"
+     echo "  2. Check test results:       $OUTPUT_DIR/test_results.json"
+     echo "  3. Analyze confusion matrix: $OUTPUT_DIR/confusion_matrix.png"
+     echo "  4. Deploy model from:        $OUTPUT_DIR/best_model/"
+     echo ""
+ else
+     echo ""
+     echo "================================================================================"
+     echo "❌ Training failed!"
+     echo "================================================================================"
+     echo ""
+     echo "Please check the error messages above and try again."
+     exit 1
+ fi
training_scripts/train_nfqa_model.py ADDED
@@ -0,0 +1,870 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Train NFQA Classification Model from Scratch
4
+
5
+ Trains a multilingual NFQA classifier using XLM-RoBERTa on LLM-annotated WebFAQ data.
6
+
7
+ Usage (single file with automatic splitting):
8
+ python train_nfqa_model.py --input data.jsonl --output-dir ./model --epochs 10
9
+
10
+ Usage (pre-split files):
11
+ python train_nfqa_model.py --train train.jsonl --val val.jsonl --test test.jsonl --output-dir ./model --epochs 10
12
+
13
+ Author: Ali
14
+ Date: December 2024
15
+ """
16
+
17
+ import pandas as pd
18
+ import numpy as np
19
+ import torch
20
+ import json
21
+ import argparse
22
+ import os
23
+ from collections import Counter
24
+ from datetime import datetime
25
+ from torch.utils.data import Dataset, DataLoader
26
+ from torch.optim import AdamW
27
+ from transformers import (
28
+ AutoTokenizer,
29
+ AutoModelForSequenceClassification,
30
+ get_linear_schedule_with_warmup
31
+ )
32
+ from sklearn.model_selection import train_test_split
33
+ from sklearn.metrics import (
34
+ classification_report,
35
+ confusion_matrix,
36
+ accuracy_score,
37
+ f1_score
38
+ )
39
+ import matplotlib
40
+ matplotlib.use('Agg') # Non-interactive backend for server
41
+ import matplotlib.pyplot as plt
42
+ import seaborn as sns
43
+ from tqdm import tqdm
44
+
45
+ # Set random seed
46
+ RANDOM_SEED = 42
47
+ np.random.seed(RANDOM_SEED)
48
+ torch.manual_seed(RANDOM_SEED)
49
+
50
+ NFQA_CATEGORIES = [
51
+ 'NOT-A-QUESTION',
52
+ 'FACTOID',
53
+ 'DEBATE',
54
+ 'EVIDENCE-BASED',
55
+ 'INSTRUCTION',
56
+ 'REASON',
57
+ 'EXPERIENCE',
58
+ 'COMPARISON'
59
+ ]
60
+
61
+ # Label mappings
62
+ LABEL2ID = {label: idx for idx, label in enumerate(NFQA_CATEGORIES)}
63
+ ID2LABEL = {idx: label for label, idx in LABEL2ID.items()}
64
+
65
+
66
+ class NFQADataset(Dataset):
67
+ """Custom dataset for NFQA classification"""
68
+
69
+ def __init__(self, questions, labels, tokenizer, max_length=128):
70
+ self.questions = questions
71
+ self.labels = labels
72
+ self.tokenizer = tokenizer
73
+ self.max_length = max_length
74
+
75
+ def __len__(self):
76
+ return len(self.questions)
77
+
78
+ def __getitem__(self, idx):
79
+ question = str(self.questions[idx])
80
+ label = int(self.labels[idx])
81
+
82
+ # Tokenize
83
+ encoding = self.tokenizer(
84
+ question,
85
+ add_special_tokens=True,
86
+ max_length=self.max_length,
87
+ padding='max_length',
88
+ truncation=True,
89
+ return_attention_mask=True,
90
+ return_tensors='pt'
91
+ )
92
+
93
+ return {
94
+ 'input_ids': encoding['input_ids'].flatten(),
95
+ 'attention_mask': encoding['attention_mask'].flatten(),
96
+ 'labels': torch.tensor(label, dtype=torch.long)
97
+ }
98
+
99
+
100
+ def train_epoch(model, train_loader, optimizer, scheduler, device):
101
+ """Train for one epoch"""
102
+ model.train()
103
+ total_loss = 0
104
+ predictions = []
105
+ true_labels = []
106
+
107
+ progress_bar = tqdm(train_loader, desc="Training")
108
+
109
+ for batch in progress_bar:
110
+ # Move batch to device
111
+ input_ids = batch['input_ids'].to(device)
112
+ attention_mask = batch['attention_mask'].to(device)
113
+ labels = batch['labels'].to(device)
114
+
115
+ # Forward pass
116
+ outputs = model(
117
+ input_ids=input_ids,
118
+ attention_mask=attention_mask,
119
+ labels=labels
120
+ )
121
+
122
+ loss = outputs.loss
123
+ total_loss += loss.item()
124
+
125
+ # Backward pass
126
+ optimizer.zero_grad()
127
+ loss.backward()
128
+ torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
129
+ optimizer.step()
130
+ scheduler.step()
131
+
132
+ # Track predictions
133
+ preds = torch.argmax(outputs.logits, dim=1)
134
+ predictions.extend(preds.cpu().numpy())
135
+ true_labels.extend(labels.cpu().numpy())
136
+
137
+ # Update progress bar
138
+ progress_bar.set_postfix({'loss': f'{loss.item():.4f}'})
139
+
140
+ avg_loss = total_loss / len(train_loader)
141
+ accuracy = accuracy_score(true_labels, predictions)
142
+
143
+ return avg_loss, accuracy
144
+
145
+
146
+ def evaluate(model, data_loader, device, languages=None, desc="Evaluating", show_analysis=False):
147
+ """Evaluate model on validation/test set with optional detailed analysis"""
148
+ model.eval()
149
+ total_loss = 0
150
+ predictions = []
151
+ true_labels = []
152
+
153
+ with torch.no_grad():
154
+ for batch in tqdm(data_loader, desc=desc):
155
+ input_ids = batch['input_ids'].to(device)
156
+ attention_mask = batch['attention_mask'].to(device)
157
+ labels = batch['labels'].to(device)
158
+
159
+ outputs = model(
160
+ input_ids=input_ids,
161
+ attention_mask=attention_mask,
162
+ labels=labels
163
+ )
164
+
165
+ total_loss += outputs.loss.item()
166
+
167
+ preds = torch.argmax(outputs.logits, dim=1)
168
+ predictions.extend(preds.cpu().numpy())
169
+ true_labels.extend(labels.cpu().numpy())
170
+
171
+ avg_loss = total_loss / len(data_loader)
172
+ accuracy = accuracy_score(true_labels, predictions)
173
+ f1 = f1_score(true_labels, predictions, average='macro')
174
+
175
+ # Run detailed analysis if requested
176
+ if show_analysis and languages is not None:
177
+ print("\n" + "-"*70)
178
+ print("VALIDATION ANALYSIS")
179
+ print("-"*70)
180
+
181
+ # Analyze by category
182
+ analyze_performance_by_category(predictions, true_labels)
183
+
184
+ # Analyze by language (top 5)
185
+ analyze_performance_by_language(predictions, true_labels, languages, top_n=5)
186
+
187
+ # Analyze combinations (top 10)
188
+ analyze_language_category_combinations(predictions, true_labels, languages, top_n=10)
189
+
190
+ print("-"*70)
191
+
192
+ return avg_loss, accuracy, f1, predictions, true_labels
193
+
194
+
195
+ def load_data(file_path):
196
+ """Load annotated data from JSONL file"""
197
+ print(f"Loading data from: {file_path}\n")
198
+
199
+ try:
200
+ df = pd.read_json(file_path, lines=True)
201
+ print(f"✓ Loaded {len(df)} annotated examples")
202
+
203
+ # Check required columns
204
+ if 'question' not in df.columns:
205
+ raise ValueError("Missing 'question' column")
206
+
207
+ # Determine label column
208
+ if 'label_id' in df.columns:
209
+ label_col = 'label_id'
210
+ elif 'ensemble_prediction' in df.columns:
211
+ # Convert category names to IDs
212
+ df['label_id'] = df['ensemble_prediction'].map(LABEL2ID)
213
+ label_col = 'label_id'
214
+ elif 'label' in df.columns:
215
+ label_col = 'label'
216
+ else:
217
+ raise ValueError("No label column found (expected: 'label', 'label_id', or 'ensemble_prediction')")
218
+
219
+ # Remove any rows with missing labels
220
+ df = df.dropna(subset=['question', label_col])
221
+
222
+ print(f"✓ Data cleaned: {len(df)} examples with valid labels")
223
+
224
+ # Show statistics
225
+ print("\nLabel distribution:")
226
+ label_counts = df[label_col].value_counts().sort_index()
227
+ for label_id, count in label_counts.items():
228
+ cat_name = ID2LABEL.get(int(label_id), f"UNKNOWN_{label_id}")
229
+ print(f" {cat_name:20s}: {count:4d} ({count/len(df)*100:5.1f}%)")
230
+
231
+ # Prepare final dataset with language info
232
+ questions = df['question'].tolist()
233
+ labels = df[label_col].astype(int).tolist()
234
+ languages = df['language'].tolist() if 'language' in df.columns else ['unknown'] * len(df)
235
+
236
+ print(f"\n✓ Prepared {len(questions)} question-label pairs")
237
+
238
+ return questions, labels, languages
239
+
240
+ except FileNotFoundError:
241
+ print(f"❌ Error: File not found: {file_path}")
242
+ raise
243
+ except Exception as e:
244
+ print(f"❌ Error loading data: {e}")
245
+ raise
246
+
247
+
248
+ def create_data_splits(questions, labels, test_size=0.2, val_size=0.1):
249
+ """Create train/val/test splits"""
250
+ print("\nCreating data splits...")
251
+
252
+ # First split: separate test set
253
+ train_val_questions, test_questions, train_val_labels, test_labels = train_test_split(
254
+ questions,
255
+ labels,
256
+ test_size=test_size,
257
+ random_state=RANDOM_SEED,
258
+ stratify=labels
259
+ )
260
+
261
+ # Second split: separate validation from training
262
+ train_questions, val_questions, train_labels, val_labels = train_test_split(
263
+ train_val_questions,
264
+ train_val_labels,
265
+ test_size=val_size / (1 - test_size),
266
+ random_state=RANDOM_SEED,
267
+ stratify=train_val_labels
268
+ )
269
+
270
+ print(f"\nData splits:")
271
+ print(f" Training: {len(train_questions):4d} examples ({len(train_questions)/len(questions)*100:5.1f}%)")
272
+ print(f" Validation: {len(val_questions):4d} examples ({len(val_questions)/len(questions)*100:5.1f}%)")
273
+ print(f" Test: {len(test_questions):4d} examples ({len(test_questions)/len(questions)*100:5.1f}%)")
274
+ print(f" Total: {len(questions):4d} examples")
275
+
276
+ # Verify class distribution
277
+ print("\nClass distribution per split:")
278
+ for split_name, split_labels in [('Train', train_labels), ('Val', val_labels), ('Test', test_labels)]:
279
+ counts = Counter(split_labels)
280
+ print(f"\n{split_name}:")
281
+ for label_id in sorted(counts.keys()):
282
+ cat_name = ID2LABEL[label_id]
283
+ print(f" {cat_name:20s}: {counts[label_id]:3d}")
284
+
285
+ return train_questions, val_questions, test_questions, train_labels, val_labels, test_labels
286
+
287
+
288
+ def plot_training_curves(history, best_val_f1, output_dir):
289
+ """Plot and save training curves"""
290
+ fig, axes = plt.subplots(1, 3, figsize=(18, 5))
291
+
292
+ epochs = range(1, len(history['train_loss']) + 1)
293
+
294
+ # Plot 1: Loss
295
+ axes[0].plot(epochs, history['train_loss'], 'b-', label='Train Loss', linewidth=2)
296
+ axes[0].plot(epochs, history['val_loss'], 'r-', label='Val Loss', linewidth=2)
297
+ axes[0].set_xlabel('Epoch')
298
+ axes[0].set_ylabel('Loss')
299
+ axes[0].set_title('Training and Validation Loss')
300
+ axes[0].legend()
301
+ axes[0].grid(True, alpha=0.3)
302
+
303
+ # Plot 2: Accuracy
304
+ axes[1].plot(epochs, history['train_accuracy'], 'b-', label='Train Accuracy', linewidth=2)
305
+ axes[1].plot(epochs, history['val_accuracy'], 'r-', label='Val Accuracy', linewidth=2)
306
+ axes[1].set_xlabel('Epoch')
307
+ axes[1].set_ylabel('Accuracy')
308
+ axes[1].set_title('Training and Validation Accuracy')
309
+ axes[1].legend()
310
+ axes[1].grid(True, alpha=0.3)
311
+
312
+ # Plot 3: F1 Score
313
+ axes[2].plot(epochs, history['val_f1'], 'g-', label='Val F1 (Macro)', linewidth=2)
314
+ axes[2].axhline(y=best_val_f1, color='r', linestyle='--', label=f'Best F1: {best_val_f1:.4f}')
315
+ axes[2].set_xlabel('Epoch')
316
+ axes[2].set_ylabel('F1 Score')
317
+ axes[2].set_title('Validation F1 Score')
318
+ axes[2].legend()
319
+ axes[2].grid(True, alpha=0.3)
320
+
321
+ plt.tight_layout()
322
+ plot_file = os.path.join(output_dir, 'training_curves.png')
323
+ plt.savefig(plot_file, dpi=300, bbox_inches='tight')
324
+ plt.close()
325
+
326
+ print(f"✓ Training curves saved to: {plot_file}")
327
+
328
+
329
+ def analyze_performance_by_language(predictions, true_labels, languages, top_n=10):
330
+ """Analyze and print performance by language"""
331
+ from collections import defaultdict
332
+
333
+ lang_stats = defaultdict(lambda: {'correct': 0, 'total': 0})
334
+
335
+ for pred, true, lang in zip(predictions, true_labels, languages):
336
+ lang_stats[lang]['total'] += 1
337
+ if pred == true:
338
+ lang_stats[lang]['correct'] += 1
339
+
340
+ # Calculate accuracy per language
341
+ lang_accuracies = []
342
+ for lang, stats in lang_stats.items():
343
+ if stats['total'] >= 5: # Only show languages with at least 5 examples
344
+ acc = stats['correct'] / stats['total']
345
+ lang_accuracies.append({
346
+ 'language': lang,
347
+ 'accuracy': acc,
348
+ 'correct': stats['correct'],
349
+ 'total': stats['total'],
350
+ 'errors': stats['total'] - stats['correct']
351
+ })
352
+
353
+ lang_accuracies.sort(key=lambda x: x['accuracy'])
354
+
355
+ print(f"\n{'='*70}")
356
+ print(f"WORST {top_n} LANGUAGES (with >= 5 examples)")
357
+ print(f"{'='*70}")
358
+ print(f"{'Language':<12} {'Accuracy':<12} {'Errors':<10} {'Total':<10}")
359
+ print(f"{'-'*70}")
360
+
361
+ for item in lang_accuracies[:top_n]:
362
+ print(f"{item['language']:<12} {item['accuracy']:>10.2%} {item['errors']:>8} {item['total']:>8}")
363
+
364
+ return lang_stats, lang_accuracies
365
+
366
+
367
+ def analyze_performance_by_category(predictions, true_labels):
368
+ """Analyze and print performance by category"""
369
+ from collections import defaultdict
370
+
371
+ cat_stats = defaultdict(lambda: {'correct': 0, 'total': 0})
372
+
373
+ for pred, true in zip(predictions, true_labels):
374
+ cat_stats[true]['total'] += 1
375
+ if pred == true:
376
+ cat_stats[true]['correct'] += 1
377
+
378
+ cat_accuracies = []
379
+ for cat_id, stats in cat_stats.items():
380
+ acc = stats['correct'] / stats['total']
381
+ cat_accuracies.append({
382
+ 'category': ID2LABEL[cat_id],
383
+ 'accuracy': acc,
384
+ 'correct': stats['correct'],
385
+ 'total': stats['total'],
386
+ 'errors': stats['total'] - stats['correct']
387
+ })
388
+
389
+ cat_accuracies.sort(key=lambda x: x['accuracy'])
390
+
391
+ print(f"\n{'='*70}")
392
+ print(f"PERFORMANCE BY CATEGORY")
393
+ print(f"{'='*70}")
394
+ print(f"{'Category':<20} {'Accuracy':<12} {'Errors':<10} {'Total':<10}")
395
+ print(f"{'-'*70}")
396
+
397
+ for item in cat_accuracies:
398
+ print(f"{item['category']:<20} {item['accuracy']:>10.2%} {item['errors']:>8} {item['total']:>8}")
399
+
400
+ return cat_stats, cat_accuracies
401
+
402
+
403
+ def analyze_language_category_combinations(predictions, true_labels, languages, top_n=15):
404
+ """Analyze performance by (language, category) combinations"""
405
+ from collections import defaultdict
406
+
407
+ combo_stats = defaultdict(lambda: {'correct': 0, 'total': 0})
408
+
409
+ for pred, true, lang in zip(predictions, true_labels, languages):
410
+ key = (lang, ID2LABEL[true])
411
+ combo_stats[key]['total'] += 1
412
+ if pred == true:
413
+ combo_stats[key]['correct'] += 1
414
+
415
+ combo_accuracies = []
416
+ for (lang, cat), stats in combo_stats.items():
417
+ if stats['total'] >= 3: # Only show combinations with at least 3 examples
418
+ acc = stats['correct'] / stats['total']
419
+ combo_accuracies.append({
420
+ 'language': lang,
421
+ 'category': cat,
422
+ 'accuracy': acc,
423
+ 'correct': stats['correct'],
424
+ 'total': stats['total'],
425
+ 'errors': stats['total'] - stats['correct']
426
+ })
427
+
428
+ combo_accuracies.sort(key=lambda x: x['accuracy'])
429
+
430
+ print(f"\n{'='*80}")
431
+ print(f"WORST {top_n} LANGUAGE-CATEGORY COMBINATIONS (with >= 3 examples)")
432
+ print(f"{'='*80}")
433
+ print(f"{'Language':<12} {'Category':<20} {'Accuracy':<12} {'Errors':<8} {'Total':<8}")
434
+ print(f"{'-'*80}")
435
+
436
+ for item in combo_accuracies[:top_n]:
437
+ print(f"{item['language']:<12} {item['category']:<20} {item['accuracy']:>10.2%} {item['errors']:>6} {item['total']:>6}")
438
+
439
+ return combo_stats, combo_accuracies
440
+
441
+
442
+ def plot_confusion_matrix(test_true, test_preds, output_dir):
443
+ """Plot and save confusion matrix"""
444
+ cm = confusion_matrix(test_true, test_preds, labels=list(range(len(NFQA_CATEGORIES))))
445
+
446
+ plt.figure(figsize=(12, 10))
447
+ sns.heatmap(
448
+ cm,
449
+ annot=True,
450
+ fmt='d',
451
+ cmap='Blues',
452
+ xticklabels=NFQA_CATEGORIES,
453
+ yticklabels=NFQA_CATEGORIES,
454
+ cbar_kws={'label': 'Count'}
455
+ )
456
+ plt.xlabel('Predicted Category')
457
+ plt.ylabel('True Category')
458
+ plt.title('Confusion Matrix - Test Set')
459
+ plt.xticks(rotation=45, ha='right')
460
+ plt.yticks(rotation=0)
461
+ plt.tight_layout()
462
+
463
+ cm_file = os.path.join(output_dir, 'confusion_matrix.png')
464
+ plt.savefig(cm_file, dpi=300, bbox_inches='tight')
465
+ plt.close()
466
+
467
+ print(f"✓ Confusion matrix saved to: {cm_file}")
468
+
469
+
470
+ def main():
471
+ parser = argparse.ArgumentParser(description='Train NFQA Classification Model')
472
+
473
+ # Data arguments - either single input file OR separate train/val/test files
474
+ parser.add_argument('--input', type=str,
475
+ help='Input JSONL file with annotated data (will be split automatically)')
476
+ parser.add_argument('--train', type=str,
477
+ help='Training set JSONL file (use with --val and --test)')
478
+ parser.add_argument('--val', type=str,
479
+ help='Validation set JSONL file (use with --train and --test)')
480
+ parser.add_argument('--test', type=str,
481
+ help='Test set JSONL file (use with --train and --val)')
482
+ parser.add_argument('--output-dir', type=str, default='./nfqa_model_trained',
483
+ help='Output directory for model and results')
484
+
485
+ # Model arguments
486
+ parser.add_argument('--model-name', type=str, default='xlm-roberta-base',
487
+ help='Pretrained model name (default: xlm-roberta-base)')
488
+ parser.add_argument('--max-length', type=int, default=128,
489
+ help='Maximum sequence length (default: 128)')
490
+
491
+ # Training arguments
492
+ parser.add_argument('--batch-size', type=int, default=16,
493
+ help='Batch size (default: 16)')
494
+ parser.add_argument('--epochs', type=int, default=10,
495
+ help='Number of epochs (default: 10)')
496
+ parser.add_argument('--learning-rate', type=float, default=2e-5,
497
+ help='Learning rate (default: 2e-5)')
498
+ parser.add_argument('--warmup-steps', type=int, default=500,
499
+ help='Warmup steps (default: 500)')
500
+ parser.add_argument('--weight-decay', type=float, default=0.01,
501
+ help='Weight decay (default: 0.01)')
502
+ parser.add_argument('--dropout', type=float, default=0.1,
503
+ help='Dropout probability (default: 0.1)')
504
+
505
+ # Split arguments
506
+ parser.add_argument('--test-size', type=float, default=0.2,
507
+ help='Test set size (default: 0.2)')
508
+ parser.add_argument('--val-size', type=float, default=0.1,
509
+ help='Validation set size (default: 0.1)')
510
+
511
+ # Device argument
512
+ parser.add_argument('--device', type=str, default='auto',
513
+ help='Device to use: cuda, cpu, or auto (default: auto)')
514
+
515
+ args = parser.parse_args()
516
+
517
+ # Validate arguments
518
+ has_single_input = args.input is not None
519
+ has_split_inputs = all([args.train, args.val, args.test])
520
+
521
+ if not has_single_input and not has_split_inputs:
522
+ parser.error("Either --input OR (--train, --val, --test) must be provided")
523
+
524
+ if has_single_input and has_split_inputs:
525
+ parser.error("Cannot use --input together with --train/--val/--test. Choose one approach.")
526
+
527
+ # Print configuration
528
+ print("="*80)
529
+ print("NFQA MODEL TRAINING")
530
+ print("="*80)
531
+ if has_single_input:
532
+ print(f"Input file: {args.input}")
533
+ print(f"Data splitting: automatic (test={args.test_size}, val={args.val_size})")
534
+ else:
535
+ print(f"Train file: {args.train}")
536
+ print(f"Val file: {args.val}")
537
+ print(f"Test file: {args.test}")
538
+ print(f"Data splitting: manual (pre-split)")
539
+ print(f"Output directory: {args.output_dir}")
540
+ print(f"Model: {args.model_name}")
541
+ print(f"Epochs: {args.epochs}")
542
+ print(f"Batch size: {args.batch_size}")
543
+ print(f"Learning rate: {args.learning_rate}")
544
+ print(f"Max length: {args.max_length}")
545
+ print(f"Weight decay: {args.weight_decay}")
546
+ print(f"Dropout: {args.dropout}")
547
+ print("="*80 + "\n")
548
+
549
+ # Set device
550
+ if args.device == 'auto':
551
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
552
+ else:
553
+ device = torch.device(args.device)
554
+
555
+ if torch.cuda.is_available():
556
+ torch.cuda.manual_seed_all(RANDOM_SEED)
557
+
558
+ print(f"Device: {device}")
559
+ print(f"PyTorch version: {torch.__version__}")
560
+ if torch.cuda.is_available():
561
+ print(f"CUDA device: {torch.cuda.get_device_name(0)}\n")
562
+
563
+ # Create output directory
564
+ os.makedirs(args.output_dir, exist_ok=True)
565
+
566
+ # Load data - either from single file or pre-split files
567
+ if has_single_input:
568
+ # Load single file and create splits
569
+ questions, labels, languages = load_data(args.input)
570
+
571
+ # Create splits (stratify by labels, keep languages aligned)
572
+ from sklearn.model_selection import train_test_split
573
+ # First split: separate test set
574
+ train_val_questions, test_questions, train_val_labels, test_labels, train_val_langs, test_langs = train_test_split(
575
+ questions, labels, languages,
576
+ test_size=args.test_size,
577
+ random_state=RANDOM_SEED,
578
+ stratify=labels
579
+ )
580
+
581
+ # Second split: separate validation from training
582
+ train_questions, val_questions, train_labels, val_labels, train_langs, val_langs = train_test_split(
583
+ train_val_questions, train_val_labels, train_val_langs,
584
+ test_size=args.val_size / (1 - args.test_size),
585
+ random_state=RANDOM_SEED,
586
+ stratify=train_val_labels
587
+ )
588
+
589
+ print(f"\nData splits:")
590
+ print(f" Training: {len(train_questions):4d} examples ({len(train_questions)/len(questions)*100:5.1f}%)")
591
+ print(f" Validation: {len(val_questions):4d} examples ({len(val_questions)/len(questions)*100:5.1f}%)")
592
+ print(f" Test: {len(test_questions):4d} examples ({len(test_questions)/len(questions)*100:5.1f}%)")
593
+ print(f" Total: {len(questions):4d} examples")
594
+ else:
595
+ # Load pre-split files
596
+ print("Loading pre-split datasets...\n")
597
+ train_questions, train_labels, train_langs = load_data(args.train)
598
+ val_questions, val_labels, val_langs = load_data(args.val)
599
+ test_questions, test_labels, test_langs = load_data(args.test)
600
+
601
+ # Print split summary
602
+ total_examples = len(train_questions) + len(val_questions) + len(test_questions)
603
+ print(f"\nData splits:")
604
+ print(f" Training: {len(train_questions):4d} examples ({len(train_questions)/total_examples*100:5.1f}%)")
605
+ print(f" Validation: {len(val_questions):4d} examples ({len(val_questions)/total_examples*100:5.1f}%)")
606
+ print(f" Test: {len(test_questions):4d} examples ({len(test_questions)/total_examples*100:5.1f}%)")
607
+ print(f" Total: {total_examples:4d} examples")
608
+
609
+ # Show class distribution per split
610
+ print("\nClass distribution per split:")
611
+ for split_name, split_labels in [('Train', train_labels), ('Val', val_labels), ('Test', test_labels)]:
612
+ counts = Counter(split_labels)
613
+ print(f"\n{split_name}:")
614
+ for label_id in sorted(counts.keys()):
615
+ cat_name = ID2LABEL[label_id]
616
+ print(f" {cat_name:20s}: {counts[label_id]:3d}")
617
+
+     # Load tokenizer and model
+     print(f"\nLoading tokenizer: {args.model_name}")
+     tokenizer = AutoTokenizer.from_pretrained(args.model_name)
+     print("✓ Tokenizer loaded")
+
+     print(f"\nLoading model: {args.model_name}")
+     model = AutoModelForSequenceClassification.from_pretrained(
+         args.model_name,
+         num_labels=len(NFQA_CATEGORIES),
+         id2label=ID2LABEL,
+         label2id=LABEL2ID,
+         hidden_dropout_prob=args.dropout,
+         attention_probs_dropout_prob=args.dropout,
+         classifier_dropout=args.dropout
+     )
+     model.to(device)
+
+     print(f"✓ Model loaded")
+     print(f" Number of parameters: {sum(p.numel() for p in model.parameters()):,}")
+     print(f" Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
+
+     # Create datasets
+     print("\nCreating datasets...")
+     train_dataset = NFQADataset(train_questions, train_labels, tokenizer, args.max_length)
+     val_dataset = NFQADataset(val_questions, val_labels, tokenizer, args.max_length)
+     test_dataset = NFQADataset(test_questions, test_labels, tokenizer, args.max_length)
+
+     train_loader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True)
+     val_loader = DataLoader(val_dataset, batch_size=args.batch_size)
+     test_loader = DataLoader(test_dataset, batch_size=args.batch_size)
+
+     print(f"✓ Datasets created")
+     print(f" Train: {len(train_dataset)} examples ({len(train_loader)} batches)")
+     print(f" Val: {len(val_dataset)} examples ({len(val_loader)} batches)")
+     print(f" Test: {len(test_dataset)} examples ({len(test_loader)} batches)")
653
+
+     # Setup optimizer and scheduler
+     optimizer = AdamW(
+         model.parameters(),
+         lr=args.learning_rate,
+         weight_decay=args.weight_decay
+     )
+
+     total_steps = len(train_loader) * args.epochs
+     scheduler = get_linear_schedule_with_warmup(
+         optimizer,
+         num_warmup_steps=args.warmup_steps,
+         num_training_steps=total_steps
+     )
+
+     print(f"\n✓ Optimizer and scheduler configured")
+     print(f" Total training steps: {total_steps}")
+     print(f" Warmup steps: {args.warmup_steps}")
+
+     # Training loop
+     history = {
+         'train_loss': [],
+         'train_accuracy': [],
+         'val_loss': [],
+         'val_accuracy': [],
+         'val_f1': []
+     }
+
+     best_val_f1 = 0
+     best_epoch = 0
+
+     print("\n" + "="*80)
+     print("STARTING TRAINING")
+     print("="*80 + "\n")
687
+
+     for epoch in range(args.epochs):
+         print(f"\nEpoch {epoch + 1}/{args.epochs}")
+         print("-" * 80)
+
+         # Train
+         train_loss, train_acc = train_epoch(model, train_loader, optimizer, scheduler, device)
+
+         # Validate with detailed analysis
+         val_loss, val_acc, val_f1, val_preds, val_true = evaluate(
+             model, val_loader, device,
+             languages=val_langs,
+             desc="Validating",
+             show_analysis=True
+         )
+
+         # Update history
+         history['train_loss'].append(train_loss)
+         history['train_accuracy'].append(train_acc)
+         history['val_loss'].append(val_loss)
+         history['val_accuracy'].append(val_acc)
+         history['val_f1'].append(val_f1)
+
+         # Print metrics
+         print(f"\nEpoch {epoch + 1} Summary:")
+         print(f" Train Loss: {train_loss:.4f}")
+         print(f" Train Accuracy: {train_acc:.4f}")
+         print(f" Val Loss: {val_loss:.4f}")
+         print(f" Val Accuracy: {val_acc:.4f}")
+         print(f" Val F1 (Macro): {val_f1:.4f}")
+
+         # Save best model
+         if val_f1 > best_val_f1:
+             best_val_f1 = val_f1
+             best_epoch = epoch + 1
+
+             # Save model
+             model_path = os.path.join(args.output_dir, 'best_model')
+             model.save_pretrained(model_path)
+             tokenizer.save_pretrained(model_path)
+
+             print(f" ✓ New best model saved! (F1: {val_f1:.4f})")
729
+
+     print("\n" + "="*80)
+     print("TRAINING COMPLETE")
+     print("="*80)
+     print(f"Best epoch: {best_epoch}")
+     print(f"Best validation F1: {best_val_f1:.4f}")
+     print("="*80)
+
+     # Save training history
+     history_file = os.path.join(args.output_dir, 'training_history.json')
+     with open(history_file, 'w') as f:
+         json.dump(history, f, indent=2)
+     print(f"\n✓ Training history saved to: {history_file}")
+
+     # Save final model
+     final_model_path = os.path.join(args.output_dir, 'final_model')
+     model.save_pretrained(final_model_path)
+     tokenizer.save_pretrained(final_model_path)
+     print(f"✓ Final model saved to: {final_model_path}")
+
+     # Plot training curves
+     plot_training_curves(history, best_val_f1, args.output_dir)
+
+     # Load best model and evaluate on test set
+     print("\nLoading best model for final evaluation...")
+     best_model_path = os.path.join(args.output_dir, 'best_model')
+     model = AutoModelForSequenceClassification.from_pretrained(best_model_path)
+     model.to(device)
+
+     test_loss, test_acc, test_f1, test_preds, test_true = evaluate(model, test_loader, device, desc="Testing")
+
+     print("\n" + "="*80)
+     print("FINAL TEST SET RESULTS")
+     print("="*80)
+     print(f"Test Loss: {test_loss:.4f}")
+     print(f"Test Accuracy: {test_acc:.4f}")
+     print(f"Test F1 (Macro): {test_f1:.4f}")
+     print("="*80)
768
+     # Classification report
+     print("\n" + "="*80)
+     print("PER-CATEGORY PERFORMANCE")
+     print("="*80 + "\n")
+
+     report = classification_report(
+         test_true,
+         test_preds,
+         labels=list(range(len(NFQA_CATEGORIES))),
+         target_names=NFQA_CATEGORIES,
+         zero_division=0
+     )
+     print(report)
+
+     # Save report
+     report_file = os.path.join(args.output_dir, 'classification_report.txt')
+     with open(report_file, 'w') as f:
+         f.write(report)
+     print(f"✓ Classification report saved to: {report_file}")
+
+     # Plot confusion matrix
+     plot_confusion_matrix(test_true, test_preds, args.output_dir)
+
+     # Detailed performance analysis
+     print("\n" + "="*80)
+     print("DETAILED PERFORMANCE ANALYSIS")
+     print("="*80)
+
+     # Analyze by category
+     analyze_performance_by_category(test_preds, test_true)
+
+     # Analyze by language
+     analyze_performance_by_language(test_preds, test_true, test_langs, top_n=10)
+
+     # Analyze language-category combinations
+     analyze_language_category_combinations(test_preds, test_true, test_langs, top_n=15)
+
+     print("\n" + "="*80)
806
+
+     # Save test results
+     test_results = {
+         'test_loss': float(test_loss),
+         'test_accuracy': float(test_acc),
+         'test_f1_macro': float(test_f1),
+         'best_epoch': int(best_epoch),
+         'best_val_f1': float(best_val_f1),
+         'num_train_examples': len(train_questions),
+         'num_val_examples': len(val_questions),
+         'num_test_examples': len(test_questions),
+         'config': {
+             'model_name': args.model_name,
+             'max_length': args.max_length,
+             'batch_size': args.batch_size,
+             'learning_rate': args.learning_rate,
+             'num_epochs': args.epochs,
+             'warmup_steps': args.warmup_steps,
+             'weight_decay': args.weight_decay,
+             'dropout': args.dropout,
+             'data_source': 'pre-split' if has_split_inputs else 'single_file',
+             'train_file': args.train if has_split_inputs else args.input,
+             'val_file': args.val if has_split_inputs else None,
+             'test_file': args.test if has_split_inputs else None,
+             'auto_split': not has_split_inputs,
+             'test_size': args.test_size if not has_split_inputs else None,
+             'val_size': args.val_size if not has_split_inputs else None
+         },
+         'timestamp': datetime.now().isoformat()
+     }
+
+     results_file = os.path.join(args.output_dir, 'test_results.json')
+     with open(results_file, 'w') as f:
+         json.dump(test_results, f, indent=2)
+     print(f"✓ Test results saved to: {results_file}")
841
+
+     # Summary
+     print("\n" + "="*80)
+     print("TRAINING SUMMARY")
+     print("="*80)
+     print(f"\nModel: {args.model_name}")
+     print(f"Training examples: {len(train_questions)}")
+     print(f"Validation examples: {len(val_questions)}")
+     print(f"Test examples: {len(test_questions)}")
+     print(f"\nBest epoch: {best_epoch}/{args.epochs}")
+     print(f"Best validation F1: {best_val_f1:.4f}")
+     print(f"\nFinal test results:")
+     print(f" Accuracy: {test_acc:.4f}")
+     print(f" F1 Score (Macro): {test_f1:.4f}")
+     print(f"\nModel saved to: {args.output_dir}")
+     print(f"\nGenerated files:")
+     print(f" - best_model/ (best checkpoint)")
+     print(f" - final_model/ (last epoch)")
+     print(f" - training_history.json")
+     print(f" - training_curves.png")
+     print(f" - test_results.json")
+     print(f" - classification_report.txt")
+     print(f" - confusion_matrix.png")
+     print("\n" + "="*80)
+     print("✅ Training complete! Model ready for deployment.")
+     print("="*80)
+
+
+ if __name__ == '__main__':
+     main()