Commit 79d10e6
Parent(s): a9f2764
feat: improved accuracy to 92%

Files changed:
- README.md (+47 -43)
- classification_report.txt (+11 -11)
- confusion_matrix.png (+2 -2)
- model.safetensors (+1 -1)
- test_results.json (+2 -2)
- training_curves.png (+2 -2)
README.md
CHANGED

@@ -55,10 +55,11 @@ The model is based on XLM-RoBERTa (Cross-lingual Language Model - Robustly Optim
 The model was trained on the **[NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)**, a large-scale multilingual dataset for non-factoid question classification.

 **Dataset Composition**:
-- **Training**:
-- **Validation**:
-- **Test**:
-- **Total**:

 **Source Distribution**:
 - 54% from WebFAQ dataset (annotated with LLM ensemble)
@@ -66,9 +67,9 @@ The model was trained on the **[NFQA Multilingual Dataset](https://huggingface.c

 **Key Features**:
 - 392 unique (language, category) combinations
-- Target of ~
 - Stratified sampling to ensure balanced representation
-- Ensemble annotation using Llama 3.1, Gemma 2,

 For detailed information about dataset generation, annotation methodology, and data composition, please visit the [dataset page](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)

@@ -150,31 +151,32 @@ Questions comparing two or more options.

 ## Model Performance

-### Test Set Results (

-- **Overall Accuracy**:
-- **Macro-Average F1**:
-- **Best Validation F1**:

 ### Per-Category Performance

 | Category | Precision | Recall | F1-Score | Support |
 |----------|-----------|--------|----------|---------|
-| NOT-A-QUESTION | 0.
-| FACTOID | 0.
-| DEBATE | 0.
-| EVIDENCE-BASED | 0.86 | 0.92 | 0.89 |
-| INSTRUCTION | 0.
-| REASON | 0.
-| EXPERIENCE | 0.
-| COMPARISON | 0.

 ### Key Observations

-- **Strongest Performance**: NOT-A-QUESTION, COMPARISON, and DEBATE categories (F1 ≥ 0.
-- **Good Performance**: EVIDENCE-BASED
-- **
 - The model generalizes well across all 49 languages with balanced test set distribution

 ### Confusion Matrix

@@ -187,20 +189,22 @@ The confusion matrix shows the model's prediction patterns across all 8 categori
 ### Hardware

 - Training Device: CUDA-enabled GPU (NVIDIA)
-- Training Time:

 ### Hyperparameters

 ```python
 {
     "model_name": "xlm-roberta-base",
-    "max_length":
-    "batch_size":
     "learning_rate": 2e-5,          # AdamW learning rate
-    "num_epochs":
-    "
-    "
-    "
     "optimizer": "AdamW",           # Optimizer
     "scheduler": "linear_warmup",   # Learning rate scheduler
     "gradient_clipping": 1.0,       # Max gradient norm
@@ -211,20 +215,20 @@ The confusion matrix shows the model's prediction patterns across all 8 categori
 ### Training Process

 1. **Data Preparation**: Pre-split balanced dataset from [NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)
-   - Training:
-   - Validation:
-   - Test:

-2. **Preprocessing**: Tokenization using XLM-RoBERTa tokenizer (max length:

 3. **Training Strategy**: Supervised fine-tuning with stratified train/val/test splits
    - Stratified by (language, category) combinations to maintain balance

 4. **Optimization**: AdamW optimizer with linear warmup and gradient clipping
-   - Total training steps:
-   - Warmup steps:

-5. **Best Model Selection**: Model checkpoint with highest validation F1 score (epoch

 6. **Evaluation**: Comprehensive testing on held-out test set with per-category and per-language analysis

@@ -232,12 +236,12 @@ The confusion matrix shows the model's prediction patterns across all 8 categori

 ![Training Curves](training_curves.png)

-The training curves show the model's learning progress across
 - **Left panel**: Training and validation loss over time
 - **Middle panel**: Training and validation accuracy progression
 - **Right panel**: Validation F1 score (macro average) with best checkpoint marked

-The model

 ## Usage

@@ -274,7 +278,7 @@ questions = [

 # Classify questions
 for question in questions:
-    inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=

     with torch.no_grad():
         outputs = model(**inputs)
@@ -326,7 +330,7 @@ def classify_questions_batch(questions, model, tokenizer, batch_size=32):
         batch,
         return_tensors="pt",
         truncation=True,
-        max_length=
         padding=True
     )

@@ -367,7 +371,7 @@ classifier = pipeline(
 )

 # Classify single question
-result = classifier("How do I learn Python?", truncation=True, max_length=
 print(result)
 # Output: [{'label': 'INSTRUCTION', 'score': 0.91}]

@@ -375,7 +379,7 @@ print(result)
 results = classifier(
     ["What is AI?", "Why do cats purr?", "Best pizza in town?"],
     truncation=True,
-    max_length=
 )
 for r in results:
     print(f"{r['label']}: {r['score']:.2%}")
@@ -463,6 +467,6 @@ For questions, feedback, or issues:

 ---

-**Model Version**:
-**Last Updated**:
 **Status**: Production Ready
@@ -55,10 +55,11 @@ The model is based on XLM-RoBERTa (Cross-lingual Language Model - Robustly Optim
 The model was trained on the **[NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)**, a large-scale multilingual dataset for non-factoid question classification.

 **Dataset Composition**:
+- **Training**: 28,653 examples (80%)
+- **Validation**: 3,539 examples (10%)
+- **Test**: 3,671 examples (10%)
+- **Total (Balanced)**: 35,863
+- **Total**: 63,647

 **Source Distribution**:
 - 54% from WebFAQ dataset (annotated with LLM ensemble)
@@ -66,9 +67,9 @@ The model was trained on the **[NFQA Multilingual Dataset](https://huggingface.c

 **Key Features**:
 - 392 unique (language, category) combinations
+- Target of ~100 examples per combination
 - Stratified sampling to ensure balanced representation
+- Ensemble annotation using Llama 3.1, Gemma 2, Qwen 2.5, GPT-4o Mini, DeepSeek V3

 For detailed information about dataset generation, annotation methodology, and data composition, please visit the [dataset page](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)

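The updated split sizes are internally consistent; a quick standalone check in plain Python (not part of the repo):

```python
# Split sizes from the updated "Dataset Composition" section.
train_n, val_n, test_n = 28_653, 3_539, 3_671

total = train_n + val_n + test_n
print(total)  # → 35863, the stated balanced total

# Each split approximates the stated 80/10/10 ratios.
ratios = [round(n / total, 2) for n in (train_n, val_n, test_n)]
print(ratios)  # → [0.8, 0.1, 0.1]
```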
@@ -150,31 +151,32 @@ Questions comparing two or more options.

 ## Model Performance

+### Test Set Results (3,671 examples)

+- **Overall Accuracy**: 92.1%
+- **Macro-Average F1**: 92.1%
+- **Best Validation F1**: 91.8% (achieved at epoch 9)

 ### Per-Category Performance

 | Category | Precision | Recall | F1-Score | Support |
 |----------|-----------|--------|----------|---------|
+| NOT-A-QUESTION | 0.94 | 0.98 | 0.96 | 463 |
+| FACTOID | 0.89 | 0.83 | 0.86 | 635 |
+| DEBATE | 0.94 | 0.94 | 0.94 | 288 |
+| EVIDENCE-BASED | 0.86 | 0.92 | 0.89 | 387 |
+| INSTRUCTION | 0.92 | 0.95 | 0.94 | 480 |
+| REASON | 0.95 | 0.96 | 0.95 | 427 |
+| EXPERIENCE | 0.91 | 0.85 | 0.88 | 530 |
+| COMPARISON | 0.94 | 0.96 | 0.95 | 461 |

 ### Key Observations

+- **Strongest Performance**: NOT-A-QUESTION, COMPARISON, REASON, INSTRUCTION, and DEBATE categories (F1 ≥ 0.94)
+- **Good Performance**: EVIDENCE-BASED and EXPERIENCE categories (F1 ≥ 0.88)
+- **Solid Performance**: FACTOID category (F1 = 0.86)
 - The model generalizes well across all 49 languages with balanced test set distribution
+- Overall improvement of 4 percentage points in accuracy over the previous version (92.1% vs 88.1%)

 ### Confusion Matrix

@@ -187,20 +189,22 @@ The confusion matrix shows the model's prediction patterns across all 8 categori
 ### Hardware

 - Training Device: CUDA-enabled GPU (NVIDIA)
+- Training Time: 9 epochs to reach best performance

 ### Hyperparameters

 ```python
 {
     "model_name": "xlm-roberta-base",
+    "max_length": 125,              # Maximum sequence length
+    "batch_size": 125,              # Training batch size
     "learning_rate": 2e-5,          # AdamW learning rate
+    "num_epochs": 10,               # Total epochs trained
+    "best_epoch": 9,                # Best performing epoch
+    "warmup_ratio": 0.1,            # Warmup ratio (10% of total steps)
+    "warmup_steps": 230,            # Linear warmup steps
+    "weight_decay": 0.1,            # L2 regularization
+    "dropout": 0.1,                 # Dropout probability
     "optimizer": "AdamW",           # Optimizer
     "scheduler": "linear_warmup",   # Learning rate scheduler
     "gradient_clipping": 1.0,       # Max gradient norm
@@ -211,20 +215,20 @@ The confusion matrix shows the model's prediction patterns across all 8 categori
 ### Training Process

 1. **Data Preparation**: Pre-split balanced dataset from [NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)
+   - Training: 28,653 examples (80%)
+   - Validation: 3,539 examples (10%)
+   - Test: 3,671 examples (10%)

+2. **Preprocessing**: Tokenization using XLM-RoBERTa tokenizer (max length: 125 tokens)

 3. **Training Strategy**: Supervised fine-tuning with stratified train/val/test splits
    - Stratified by (language, category) combinations to maintain balance

 4. **Optimization**: AdamW optimizer with linear warmup and gradient clipping
+   - Total training steps: 2,293 (28,653 examples × 10 epochs ÷ 125 batch size)
+   - Warmup steps: 230 (10% warmup ratio)

+5. **Best Model Selection**: Model checkpoint with highest validation F1 score (epoch 9)

 6. **Evaluation**: Comprehensive testing on held-out test set with per-category and per-language analysis

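The step counts in item 4 follow from the split and batch sizes; the sketch below re-derives them (assuming steps are rounded up) and shows one plausible shape for the `linear_warmup` schedule. The decay-to-zero form of `linear_warmup_lr` is an assumption, since the scheduler implementation itself is not in this diff:

```python
import math

# Figures from the hyperparameters and step 4 above.
examples, epochs, batch_size = 28_653, 10, 125

total_steps = math.ceil(examples * epochs / batch_size)  # 28,653 × 10 ÷ 125
warmup_steps = math.ceil(0.1 * total_steps)              # 10% warmup ratio

def linear_warmup_lr(step, base_lr=2e-5):
    """Assumed schedule shape: ramp linearly to base_lr over the warmup
    steps, then decay linearly to zero at the final step."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(total_steps, warmup_steps)  # → 2293 230
```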
@@ -232,12 +236,12 @@ The confusion matrix shows the model's prediction patterns across all 8 categori

 ![Training Curves](training_curves.png)

+The training curves show the model's learning progress across 10 epochs:
 - **Left panel**: Training and validation loss over time
 - **Middle panel**: Training and validation accuracy progression
 - **Right panel**: Validation F1 score (macro average) with best checkpoint marked

+The model reached optimal performance at epoch 9 (validation F1: 91.8%) with minimal overfitting.

 ## Usage

@@ -274,7 +278,7 @@ questions = [

 # Classify questions
 for question in questions:
+    inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=125)

     with torch.no_grad():
         outputs = model(**inputs)
@@ -326,7 +330,7 @@ def classify_questions_batch(questions, model, tokenizer, batch_size=32):
         batch,
         return_tensors="pt",
         truncation=True,
+        max_length=125,
         padding=True
     )

@@ -367,7 +371,7 @@ classifier = pipeline(
 )

 # Classify single question
+result = classifier("How do I learn Python?", truncation=True, max_length=125)
 print(result)
 # Output: [{'label': 'INSTRUCTION', 'score': 0.91}]

@@ -375,7 +379,7 @@ print(result)
 results = classifier(
     ["What is AI?", "Why do cats purr?", "Best pizza in town?"],
     truncation=True,
+    max_length=125
 )
 for r in results:
     print(f"{r['label']}: {r['score']:.2%}")
@@ -463,6 +467,6 @@ For questions, feedback, or issues:

 ---

+**Model Version**: 2.0
+**Last Updated**: March 2026
 **Status**: Production Ready
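The usage hunks in the README diff above stop at `outputs = model(**inputs)`; turning the raw logits into a label is a softmax followed by argmax. A minimal plain-Python sketch with hypothetical logit values (the index order of the eight categories is assumed, not taken from the model config):

```python
import math

# The eight categories, in the order used by the classification report
# (assumed to match the model's output index order).
CATEGORIES = ["NOT-A-QUESTION", "FACTOID", "DEBATE", "EVIDENCE-BASED",
              "INSTRUCTION", "REASON", "EXPERIENCE", "COMPARISON"]

def logits_to_prediction(logits):
    """Softmax over raw logits, then argmax → (label, confidence)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # shift by max for stability
    s = sum(exps)
    probs = [e / s for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best], probs[best]

# Hypothetical logits for "How do I learn Python?" — index 4 dominates.
label, score = logits_to_prediction([-1.2, 0.3, -0.8, 0.1, 4.5, 0.9, -0.5, 0.2])
print(label, round(score, 2))  # → INSTRUCTION 0.92
```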
classification_report.txt
CHANGED

@@ -1,14 +1,14 @@
                 precision    recall  f1-score   support

-NOT-A-QUESTION       0.
-       FACTOID       0.
-        DEBATE       0.
-EVIDENCE-BASED       0.
-   INSTRUCTION       0.
-        REASON       0.
-    EXPERIENCE       0.
-    COMPARISON       0.
+NOT-A-QUESTION       0.94      0.98      0.96       463
+       FACTOID       0.89      0.83      0.86       635
+        DEBATE       0.94      0.94      0.94       288
+EVIDENCE-BASED       0.86      0.92      0.89       387
+   INSTRUCTION       0.92      0.95      0.94       480
+        REASON       0.95      0.96      0.95       427
+    EXPERIENCE       0.91      0.85      0.88       530
+    COMPARISON       0.94      0.96      0.95       461

-      accuracy                           0.92
-     macro avg       0.
-  weighted avg       0.
+      accuracy                           0.92      3671
+     macro avg       0.92      0.92      0.92      3671
+  weighted avg       0.92      0.92      0.92      3671
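The aggregate rows of the new report can be reproduced from its per-category rows; a quick consistency check:

```python
# Per-category (precision, recall, f1, support) from the new report.
per_class = {
    "NOT-A-QUESTION": (0.94, 0.98, 0.96, 463),
    "FACTOID":        (0.89, 0.83, 0.86, 635),
    "DEBATE":         (0.94, 0.94, 0.94, 288),
    "EVIDENCE-BASED": (0.86, 0.92, 0.89, 387),
    "INSTRUCTION":    (0.92, 0.95, 0.94, 480),
    "REASON":         (0.95, 0.96, 0.95, 427),
    "EXPERIENCE":     (0.91, 0.85, 0.88, 530),
    "COMPARISON":     (0.94, 0.96, 0.95, 461),
}

# The support column should sum to the test-set size.
total_support = sum(s for *_, s in per_class.values())

# Unweighted column means should round to the macro-avg row.
n = len(per_class)
macro = [round(sum(row[i] for row in per_class.values()) / n, 2) for i in range(3)]

print(total_support, macro)  # → 3671 [0.92, 0.92, 0.92]
```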
confusion_matrix.png
CHANGED
(binary image, stored via Git LFS)
model.safetensors
CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:13000897f21406fd80c5e3a09db58f39faafea98a000f08160c77793917940c0
 size 1112223464
test_results.json
CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:bc6939538c759f3628746f4428d760b57381a25896bfb364c2188c119af8e838
+size 804
training_curves.png
CHANGED
(binary image, stored via Git LFS)
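model.safetensors, test_results.json, and both PNGs are stored as Git LFS pointer files of the kind diffed above. The three-line pointer format (version / oid / size) can be parsed generically; a sketch using the test_results.json pointer from this commit:

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:bc6939538c759f3628746f4428d760b57381a25896bfb364c2188c119af8e838
size 804"""

info = parse_lfs_pointer(pointer)
print(info["size"], info["oid"].split(":", 1)[0])  # → 804 sha256
```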