AliSalman29 committed
Commit 79d10e6 · Parent(s): a9f2764

feat: improved accuracy to 92%
README.md CHANGED
@@ -55,10 +55,11 @@ The model is based on XLM-RoBERTa (Cross-lingual Language Model - Robustly Optim
  The model was trained on the **[NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)**, a large-scale multilingual dataset for non-factoid question classification.
 
  **Dataset Composition**:
- - **Training**: 33,602 examples (70%)
- - **Validation**: 6,979 examples (15%)
- - **Test**: 7,696 examples (15%)
- - **Total**: 48,277 balanced examples
+ - **Training**: 28,653 examples (80%)
+ - **Validation**: 3,539 examples (10%)
+ - **Test**: 3,671 examples (10%)
+ - **Total (Balanced)**: 35,863
+ - **Total**: 63,647
 
  **Source Distribution**:
  - 54% from WebFAQ dataset (annotated with LLM ensemble)
@@ -66,9 +67,9 @@ The model was trained on the **[NFQA Multilingual Dataset](https://huggingface.c
 
  **Key Features**:
  - 392 unique (language, category) combinations
- - Target of ~125 examples per combination
+ - Target of ~100 examples per combination
  - Stratified sampling to ensure balanced representation
- - Ensemble annotation using Llama 3.1, Gemma 2, and Qwen 2.5
+ - Ensemble annotation using Llama 3.1, Gemma 2, Qwen 2.5, GPT-4o Mini, and DeepSeek V3
 
  For detailed information about dataset generation, annotation methodology, and data composition, please visit the [dataset page](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset).
 
@@ -150,31 +151,32 @@ Questions comparing two or more options.
 
  ## Model Performance
 
- ### Test Set Results (7,696 examples)
+ ### Test Set Results (3,671 examples)
 
- - **Overall Accuracy**: 88.1%
- - **Macro-Average F1**: 88.1%
- - **Best Validation F1**: 88.1% (achieved at epoch 6)
+ - **Overall Accuracy**: 92.1%
+ - **Macro-Average F1**: 92.1%
+ - **Best Validation F1**: 91.8% (achieved at epoch 9)
 
  ### Per-Category Performance
 
  | Category | Precision | Recall | F1-Score | Support |
  |----------|-----------|--------|----------|---------|
- | NOT-A-QUESTION | 0.96 | 0.92 | 0.94 | 950 |
- | FACTOID | 0.84 | 0.79 | 0.81 | 980 |
- | DEBATE | 0.90 | 0.95 | 0.92 | 916 |
- | EVIDENCE-BASED | 0.86 | 0.92 | 0.89 | 950 |
- | INSTRUCTION | 0.85 | 0.92 | 0.88 | 980 |
- | REASON | 0.88 | 0.86 | 0.87 | 960 |
- | EXPERIENCE | 0.82 | 0.76 | 0.79 | 980 |
- | COMPARISON | 0.93 | 0.93 | 0.93 | 980 |
+ | NOT-A-QUESTION | 0.94 | 0.98 | 0.96 | 463 |
+ | FACTOID | 0.89 | 0.83 | 0.86 | 635 |
+ | DEBATE | 0.94 | 0.94 | 0.94 | 288 |
+ | EVIDENCE-BASED | 0.86 | 0.92 | 0.89 | 387 |
+ | INSTRUCTION | 0.92 | 0.95 | 0.94 | 480 |
+ | REASON | 0.95 | 0.96 | 0.95 | 427 |
+ | EXPERIENCE | 0.91 | 0.85 | 0.88 | 530 |
+ | COMPARISON | 0.94 | 0.96 | 0.95 | 461 |
 
  ### Key Observations
 
- - **Strongest Performance**: NOT-A-QUESTION, COMPARISON, and DEBATE categories (F1 ≥ 0.92)
- - **Good Performance**: EVIDENCE-BASED, INSTRUCTION, and REASON categories (F1 ≥ 0.87)
- - **Moderate Performance**: FACTOID and EXPERIENCE categories (F1 ~ 0.79-0.81)
+ - **Strongest Performance**: NOT-A-QUESTION, COMPARISON, REASON, INSTRUCTION, and DEBATE categories (F1 ≥ 0.94)
+ - **Good Performance**: EVIDENCE-BASED and EXPERIENCE categories (F1 ≥ 0.88)
+ - **Solid Performance**: FACTOID category (F1 = 0.86)
  - The model generalizes well across all 49 languages with balanced test set distribution
+ - Accuracy improved by 4 percentage points over the previous version (92.1% vs. 88.1%)
 
  ### Confusion Matrix
 
@@ -187,20 +189,22 @@ The confusion matrix shows the model's prediction patterns across all 8 categori
  ### Hardware
 
  - Training Device: CUDA-enabled GPU (NVIDIA)
- - Training Time: 6 epochs to reach best performance
+ - Training Time: 9 epochs to reach best performance
 
  ### Hyperparameters
 
  ```python
  {
      "model_name": "xlm-roberta-base",
-     "max_length": 128,            # Maximum sequence length
-     "batch_size": 16,             # Training batch size
+     "max_length": 125,            # Maximum sequence length
+     "batch_size": 125,            # Training batch size
      "learning_rate": 2e-5,        # AdamW learning rate
-     "num_epochs": 6,              # Total epochs trained
-     "warmup_steps": 500,          # Linear warmup steps
-     "weight_decay": 0.01,         # L2 regularization
-     "dropout": 0.2,               # Dropout probability
+     "num_epochs": 10,             # Total epochs trained
+     "best_epoch": 9,              # Best performing epoch
+     "warmup_ratio": 0.1,          # Warmup ratio (10% of total steps)
+     "warmup_steps": 230,          # Linear warmup steps
+     "weight_decay": 0.1,          # L2 regularization
+     "dropout": 0.1,               # Dropout probability
      "optimizer": "AdamW",         # Optimizer
      "scheduler": "linear_warmup", # Learning rate scheduler
      "gradient_clipping": 1.0,     # Max gradient norm
@@ -211,20 +215,20 @@ The confusion matrix shows the model's prediction patterns across all 8 categori
  ### Training Process
 
  1. **Data Preparation**: Pre-split balanced dataset from [NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)
-    - Training: 33,602 examples (70%)
-    - Validation: 6,979 examples (15%)
-    - Test: 7,696 examples (15%)
+    - Training: 28,653 examples (80%)
+    - Validation: 3,539 examples (10%)
+    - Test: 3,671 examples (10%)
 
- 2. **Preprocessing**: Tokenization using XLM-RoBERTa tokenizer (max length: 128 tokens)
+ 2. **Preprocessing**: Tokenization using XLM-RoBERTa tokenizer (max length: 125 tokens)
 
  3. **Training Strategy**: Supervised fine-tuning with stratified train/val/test splits
     - Stratified by (language, category) combinations to maintain balance
 
  4. **Optimization**: AdamW optimizer with linear warmup and gradient clipping
-    - Total training steps: 12,606 (33,602 examples × 6 epochs ÷ 16 batch size)
-    - Warmup steps: 500
+    - Total training steps: 2,293 (28,653 examples × 10 epochs ÷ 125 batch size)
+    - Warmup steps: 230 (10% warmup ratio)
 
- 5. **Best Model Selection**: Model checkpoint with highest validation F1 score (epoch 6)
+ 5. **Best Model Selection**: Model checkpoint with highest validation F1 score (epoch 9)
 
  6. **Evaluation**: Comprehensive testing on held-out test set with per-category and per-language analysis
 
@@ -232,12 +236,12 @@ The confusion matrix shows the model's prediction patterns across all 8 categori
 
  ![Training Curves](training_curves.png)
 
- The training curves show the model's learning progress across 6 epochs:
+ The training curves show the model's learning progress across 10 epochs:
  - **Left panel**: Training and validation loss over time
  - **Middle panel**: Training and validation accuracy progression
  - **Right panel**: Validation F1 score (macro average) with best checkpoint marked
 
- The model converged quickly, reaching optimal performance at epoch 6 with minimal overfitting.
+ The model reached optimal performance at epoch 9 (validation F1: 91.8%) with minimal overfitting.
 
  ## Usage
 
@@ -274,7 +278,7 @@ questions = [
 
  # Classify questions
  for question in questions:
-     inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=128)
+     inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=125)
 
      with torch.no_grad():
          outputs = model(**inputs)
@@ -326,7 +330,7 @@ def classify_questions_batch(questions, model, tokenizer, batch_size=32):
          batch,
          return_tensors="pt",
          truncation=True,
-         max_length=128,
+         max_length=125,
          padding=True
      )
 
@@ -367,7 +371,7 @@ classifier = pipeline(
  )
 
  # Classify single question
- result = classifier("How do I learn Python?", truncation=True, max_length=128)
+ result = classifier("How do I learn Python?", truncation=True, max_length=125)
  print(result)
  # Output: [{'label': 'INSTRUCTION', 'score': 0.91}]
 
@@ -375,7 +379,7 @@ print(result)
  results = classifier(
      ["What is AI?", "Why do cats purr?", "Best pizza in town?"],
      truncation=True,
-     max_length=128
+     max_length=125
  )
  for r in results:
      print(f"{r['label']}: {r['score']:.2%}")
@@ -463,6 +467,6 @@ For questions, feedback, or issues:
 
  ---
 
- **Model Version**: 1.0
- **Last Updated**: February 2026
+ **Model Version**: 2.0
+ **Last Updated**: March 2026
  **Status**: Production Ready
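As a sanity check on the step arithmetic in the updated training-process section, the totals can be reproduced in a few lines of Python. This is a sketch assuming the usual ceil-rounded count of batches per epoch; the card's logged total of 2,293 steps may differ slightly depending on drop-last behavior in the data loader:

```python
import math

# Updated training configuration from the model card
train_examples = 28_653
batch_size = 125
num_epochs = 10
warmup_ratio = 0.1

# Batches per epoch, rounding up so the final partial batch is counted
steps_per_epoch = math.ceil(train_examples / batch_size)  # 230
total_steps = steps_per_epoch * num_epochs                # 2,300 (card logs 2,293)
warmup_steps = round(warmup_ratio * total_steps)          # 230, matching the card

print(steps_per_epoch, total_steps, warmup_steps)
```

Note that the warmup figure of 230 steps is exactly 10% of the ceil-rounded total, consistent with the `warmup_ratio: 0.1` entry in the hyperparameters.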
classification_report.txt CHANGED
@@ -1,14 +1,14 @@
                  precision    recall  f1-score   support
 
- NOT-A-QUESTION       0.99      0.99      0.99       557
- FACTOID              0.92      0.87      0.90       896
- DEBATE               0.92      0.96      0.94       472
- EVIDENCE-BASED       0.88      0.95      0.91       568
- INSTRUCTION          0.95      0.94      0.94       662
- REASON               0.94      0.94      0.94       493
- EXPERIENCE           0.86      0.85      0.85       686
- COMPARISON           0.96      0.96      0.96       679
+ NOT-A-QUESTION       0.94      0.98      0.96       463
+ FACTOID              0.89      0.83      0.86       635
+ DEBATE               0.94      0.94      0.94       288
+ EVIDENCE-BASED       0.86      0.92      0.89       387
+ INSTRUCTION          0.92      0.95      0.94       480
+ REASON               0.95      0.96      0.95       427
+ EXPERIENCE           0.91      0.85      0.88       530
+ COMPARISON           0.94      0.96      0.95       461
 
-       accuracy                           0.92      5013
-    macro avg        0.93      0.93      0.93      5013
- weighted avg        0.93      0.92      0.92      5013
+       accuracy                           0.92      3671
+    macro avg        0.92      0.92      0.92      3671
+ weighted avg        0.92      0.92      0.92      3671
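The macro- and weighted-average figures in the new report follow directly from the per-class rows. A quick recomputation, with the (f1, support) pairs copied from the updated report:

```python
# Per-class (f1, support) pairs from the updated classification report
per_class = {
    "NOT-A-QUESTION": (0.96, 463),
    "FACTOID": (0.86, 635),
    "DEBATE": (0.94, 288),
    "EVIDENCE-BASED": (0.89, 387),
    "INSTRUCTION": (0.94, 480),
    "REASON": (0.95, 427),
    "EXPERIENCE": (0.88, 530),
    "COMPARISON": (0.95, 461),
}

total = sum(s for _, s in per_class.values())                      # 3671 test examples
macro_f1 = sum(f for f, _ in per_class.values()) / len(per_class)  # unweighted mean
weighted_f1 = sum(f * s for f, s in per_class.values()) / total    # support-weighted mean

print(total, round(macro_f1, 2), round(weighted_f1, 2))  # 3671 0.92 0.92
```

Both averages round to 0.92, matching the report; they are close because the class supports are reasonably balanced.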
confusion_matrix.png CHANGED

Old file (Git LFS):
  • SHA256: 8f0f532f1ac2c188ff308cda4f70f375242a81448aaaa52a1cde4dd4e26ce61f
  • Pointer size: 131 Bytes
  • Size of remote file: 311 kB

New file (Git LFS):
  • SHA256: 6d93190fbcc02d9dc3070759105d5308da3f097f1d0665bc6ee31ddc4205710f
  • Pointer size: 131 Bytes
  • Size of remote file: 297 kB
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:061cc36ce4649a1ca2c988c042eece9a041a7bae6589619c177ef053fdbadeb5
+ oid sha256:13000897f21406fd80c5e3a09db58f39faafea98a000f08160c77793917940c0
  size 1112223464
test_results.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:4f85874cae37b57c474bb4450d166414c18f4e141ad6eefc3233030664397ecc
- size 805
+ oid sha256:bc6939538c759f3628746f4428d760b57381a25896bfb364c2188c119af8e838
+ size 804
training_curves.png CHANGED

Old file (Git LFS):
  • SHA256: 004ef9ab1a7bded3914e5b78f6b89621bfb202bd65d96d2a9ba639d28c6c9606
  • Pointer size: 131 Bytes
  • Size of remote file: 278 kB

New file (Git LFS):
  • SHA256: 55eb995a8c14353b1b9e8b52ec017b2d2a8d8a3145ec8958eee1145dfa7434cf
  • Pointer size: 131 Bytes
  • Size of remote file: 281 kB