AliSalman29 committed
Commit 79d10e6 · Parent(s): a9f2764

feat: improved accuracy to 92%
README.md CHANGED
@@ -55,10 +55,11 @@ The model is based on XLM-RoBERTa (Cross-lingual Language Model - Robustly Optim
  The model was trained on the **[NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)**, a large-scale multilingual dataset for non-factoid question classification.
 
  **Dataset Composition**:
- - **Training**: 33,602 examples (70%)
- - **Validation**: 6,979 examples (15%)
- - **Test**: 7,696 examples (15%)
- - **Total**: 48,277 balanced examples
+ - **Training**: 28,653 examples (80%)
+ - **Validation**: 3,539 examples (10%)
+ - **Test**: 3,671 examples (10%)
+ - **Total (Balanced)**: 35,863
+ - **Total**: 63,647
 
  **Source Distribution**:
  - 54% from WebFAQ dataset (annotated with LLM ensemble)
@@ -66,9 +67,9 @@ The model was trained on the **[NFQA Multilingual Dataset](https://huggingface.c
 
  **Key Features**:
  - 392 unique (language, category) combinations
- - Target of ~125 examples per combination
+ - Target of ~100 examples per combination
  - Stratified sampling to ensure balanced representation
- - Ensemble annotation using Llama 3.1, Gemma 2, and Qwen 2.5
+ - Ensemble annotation using Llama 3.1, Gemma 2, Qwen 2.5, GPT-4o Mini, and DeepSeek V3
 
  For detailed information about dataset generation, annotation methodology, and data composition, please visit the [dataset page](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset).
 
@@ -150,31 +151,32 @@ Questions comparing two or more options.
 
  ## Model Performance
 
- ### Test Set Results (7,696 examples)
+ ### Test Set Results (3,671 examples)
 
- - **Overall Accuracy**: 88.1%
- - **Macro-Average F1**: 88.1%
- - **Best Validation F1**: 88.1% (achieved at epoch 6)
+ - **Overall Accuracy**: 92.1%
+ - **Macro-Average F1**: 92.1%
+ - **Best Validation F1**: 91.8% (achieved at epoch 9)
 
  ### Per-Category Performance
 
  | Category | Precision | Recall | F1-Score | Support |
  |----------|-----------|--------|----------|---------|
- | NOT-A-QUESTION | 0.96 | 0.92 | 0.94 | 950 |
- | FACTOID | 0.84 | 0.79 | 0.81 | 980 |
- | DEBATE | 0.90 | 0.95 | 0.92 | 916 |
- | EVIDENCE-BASED | 0.86 | 0.92 | 0.89 | 950 |
- | INSTRUCTION | 0.85 | 0.92 | 0.88 | 980 |
- | REASON | 0.88 | 0.86 | 0.87 | 960 |
- | EXPERIENCE | 0.82 | 0.76 | 0.79 | 980 |
- | COMPARISON | 0.93 | 0.93 | 0.93 | 980 |
+ | NOT-A-QUESTION | 0.94 | 0.98 | 0.96 | 463 |
+ | FACTOID | 0.89 | 0.83 | 0.86 | 635 |
+ | DEBATE | 0.94 | 0.94 | 0.94 | 288 |
+ | EVIDENCE-BASED | 0.86 | 0.92 | 0.89 | 387 |
+ | INSTRUCTION | 0.92 | 0.95 | 0.94 | 480 |
+ | REASON | 0.95 | 0.96 | 0.95 | 427 |
+ | EXPERIENCE | 0.91 | 0.85 | 0.88 | 530 |
+ | COMPARISON | 0.94 | 0.96 | 0.95 | 461 |
 
  ### Key Observations
 
- - **Strongest Performance**: NOT-A-QUESTION, COMPARISON, and DEBATE categories (F1 ≥ 0.92)
- - **Good Performance**: EVIDENCE-BASED, INSTRUCTION, and REASON categories (F1 ≥ 0.87)
- - **Moderate Performance**: FACTOID and EXPERIENCE categories (F1 ~ 0.79-0.81)
+ - **Strongest Performance**: NOT-A-QUESTION, COMPARISON, REASON, INSTRUCTION, and DEBATE categories (F1 ≥ 0.94)
+ - **Good Performance**: EVIDENCE-BASED and EXPERIENCE categories (F1 ≥ 0.88)
+ - **Solid Performance**: FACTOID category (F1 = 0.86)
  - The model generalizes well across all 49 languages with balanced test set distribution
+ - Accuracy improved by 4 percentage points over the previous version (92.1% vs. 88.1%)
 
  ### Confusion Matrix
 
@@ -187,20 +189,22 @@ The confusion matrix shows the model's prediction patterns across all 8 categori
  ### Hardware
 
  - Training Device: CUDA-enabled GPU (NVIDIA)
- - Training Time: 6 epochs to reach best performance
+ - Training Time: 9 epochs to reach best performance
 
  ### Hyperparameters
 
  ```python
  {
      "model_name": "xlm-roberta-base",
-     "max_length": 128,            # Maximum sequence length
-     "batch_size": 16,             # Training batch size
+     "max_length": 125,            # Maximum sequence length
+     "batch_size": 125,            # Training batch size
      "learning_rate": 2e-5,        # AdamW learning rate
-     "num_epochs": 6,              # Total epochs trained
-     "warmup_steps": 500,          # Linear warmup steps
-     "weight_decay": 0.01,         # L2 regularization
-     "dropout": 0.2,               # Dropout probability
+     "num_epochs": 10,             # Total epochs trained
+     "best_epoch": 9,              # Best performing epoch
+     "warmup_ratio": 0.1,          # Warmup ratio (10% of total steps)
+     "warmup_steps": 230,          # Linear warmup steps
+     "weight_decay": 0.1,          # L2 regularization
+     "dropout": 0.1,               # Dropout probability
      "optimizer": "AdamW",         # Optimizer
      "scheduler": "linear_warmup", # Learning rate scheduler
      "gradient_clipping": 1.0,     # Max gradient norm
@@ -211,20 +215,20 @@ The confusion matrix shows the model's prediction patterns across all 8 categori
  ### Training Process
 
  1. **Data Preparation**: Pre-split balanced dataset from [NFQA Multilingual Dataset](https://huggingface.co/datasets/AliSalman29/nfqa-multilingual-dataset)
-    - Training: 33,602 examples (70%)
-    - Validation: 6,979 examples (15%)
-    - Test: 7,696 examples (15%)
+    - Training: 28,653 examples (80%)
+    - Validation: 3,539 examples (10%)
+    - Test: 3,671 examples (10%)
 
- 2. **Preprocessing**: Tokenization using XLM-RoBERTa tokenizer (max length: 128 tokens)
+ 2. **Preprocessing**: Tokenization using XLM-RoBERTa tokenizer (max length: 125 tokens)
 
  3. **Training Strategy**: Supervised fine-tuning with stratified train/val/test splits
     - Stratified by (language, category) combinations to maintain balance
 
  4. **Optimization**: AdamW optimizer with linear warmup and gradient clipping
-    - Total training steps: 12,606 (33,602 examples × 6 epochs ÷ 16 batch size)
-    - Warmup steps: 500
+    - Total training steps: 2,293 (28,653 examples × 10 epochs ÷ 125 batch size)
+    - Warmup steps: 230 (10% warmup ratio)
 
- 5. **Best Model Selection**: Model checkpoint with highest validation F1 score (epoch 6)
+ 5. **Best Model Selection**: Model checkpoint with highest validation F1 score (epoch 9)
 
  6. **Evaluation**: Comprehensive testing on held-out test set with per-category and per-language analysis
 
@@ -232,12 +236,12 @@ The confusion matrix shows the model's prediction patterns across all 8 categori
 
  ![Training Curves](training_curves.png)
 
- The training curves show the model's learning progress across 6 epochs:
+ The training curves show the model's learning progress across 10 epochs:
  - **Left panel**: Training and validation loss over time
  - **Middle panel**: Training and validation accuracy progression
  - **Right panel**: Validation F1 score (macro average) with best checkpoint marked
 
- The model converged quickly, reaching optimal performance at epoch 6 with minimal overfitting.
+ The model reached optimal performance at epoch 9 (validation F1: 91.8%) with minimal overfitting.
 
  ## Usage
 
@@ -274,7 +278,7 @@ questions = [
 
  # Classify questions
  for question in questions:
-     inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=128)
+     inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=125)
 
      with torch.no_grad():
          outputs = model(**inputs)
@@ -326,7 +330,7 @@ def classify_questions_batch(questions, model, tokenizer, batch_size=32):
          batch,
          return_tensors="pt",
          truncation=True,
-         max_length=128,
+         max_length=125,
          padding=True
      )
 
@@ -367,7 +371,7 @@ classifier = pipeline(
  )
 
  # Classify single question
- result = classifier("How do I learn Python?", truncation=True, max_length=128)
+ result = classifier("How do I learn Python?", truncation=True, max_length=125)
  print(result)
  # Output: [{'label': 'INSTRUCTION', 'score': 0.91}]
 
@@ -375,7 +379,7 @@ print(result)
  results = classifier(
      ["What is AI?", "Why do cats purr?", "Best pizza in town?"],
      truncation=True,
-     max_length=128
+     max_length=125
  )
  for r in results:
      print(f"{r['label']}: {r['score']:.2%}")
@@ -463,6 +467,6 @@ For questions, feedback, or issues:
 
  ---
 
- **Model Version**: 1.0
- **Last Updated**: February 2026
+ **Model Version**: 2.0
+ **Last Updated**: March 2026
  **Status**: Production Ready
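As a sanity check on the step arithmetic in the updated training-process section, the totals can be reproduced in a few lines of Python. This is a sketch assuming the usual ceil-rounded count of batches per epoch; the card's logged total of 2,293 steps may differ slightly depending on drop-last behavior in the data loader:

```python
import math

# Updated training configuration from the model card
train_examples = 28_653
batch_size = 125
num_epochs = 10
warmup_ratio = 0.1

# Batches per epoch, rounding up so the final partial batch is counted
steps_per_epoch = math.ceil(train_examples / batch_size)  # 230
total_steps = steps_per_epoch * num_epochs                # 2,300 (card logs 2,293)
warmup_steps = round(warmup_ratio * total_steps)          # 230, matching the card

print(steps_per_epoch, total_steps, warmup_steps)
```

Note that the warmup figure of 230 steps is exactly 10% of the ceil-rounded total, consistent with the `warmup_ratio: 0.1` entry in the hyperparameters.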
classification_report.txt CHANGED
@@ -1,14 +1,14 @@
                  precision    recall  f1-score   support
 
- NOT-A-QUESTION       0.99      0.99      0.99       557
- FACTOID              0.92      0.87      0.90       896
- DEBATE               0.92      0.96      0.94       472
- EVIDENCE-BASED       0.88      0.95      0.91       568
- INSTRUCTION          0.95      0.94      0.94       662
- REASON               0.94      0.94      0.94       493
- EXPERIENCE           0.86      0.85      0.85       686
- COMPARISON           0.96      0.96      0.96       679
+ NOT-A-QUESTION       0.94      0.98      0.96       463
+ FACTOID              0.89      0.83      0.86       635
+ DEBATE               0.94      0.94      0.94       288
+ EVIDENCE-BASED       0.86      0.92      0.89       387
+ INSTRUCTION          0.92      0.95      0.94       480
+ REASON               0.95      0.96      0.95       427
+ EXPERIENCE           0.91      0.85      0.88       530
+ COMPARISON           0.94      0.96      0.95       461
 
-       accuracy                           0.92      5013
-    macro avg        0.93      0.93      0.93      5013
- weighted avg        0.93      0.92      0.92      5013
+       accuracy                           0.92      3671
+    macro avg        0.92      0.92      0.92      3671
+ weighted avg        0.92      0.92      0.92      3671
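The macro- and weighted-average figures in the new report follow directly from the per-class rows. A quick recomputation, with the (f1, support) pairs copied from the updated report:

```python
# Per-class (f1, support) pairs from the updated classification report
per_class = {
    "NOT-A-QUESTION": (0.96, 463),
    "FACTOID": (0.86, 635),
    "DEBATE": (0.94, 288),
    "EVIDENCE-BASED": (0.89, 387),
    "INSTRUCTION": (0.94, 480),
    "REASON": (0.95, 427),
    "EXPERIENCE": (0.88, 530),
    "COMPARISON": (0.95, 461),
}

total = sum(s for _, s in per_class.values())                      # 3671 test examples
macro_f1 = sum(f for f, _ in per_class.values()) / len(per_class)  # unweighted mean
weighted_f1 = sum(f * s for f, s in per_class.values()) / total    # support-weighted mean

print(total, round(macro_f1, 2), round(weighted_f1, 2))  # 3671 0.92 0.92
```

Both averages round to 0.92, matching the report; they are close because the class supports are reasonably balanced.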
confusion_matrix.png CHANGED

Old file (Git LFS):
  • SHA256: 8f0f532f1ac2c188ff308cda4f70f375242a81448aaaa52a1cde4dd4e26ce61f
  • Pointer size: 131 Bytes
  • Size of remote file: 311 kB

New file (Git LFS):
  • SHA256: 6d93190fbcc02d9dc3070759105d5308da3f097f1d0665bc6ee31ddc4205710f
  • Pointer size: 131 Bytes
  • Size of remote file: 297 kB
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:061cc36ce4649a1ca2c988c042eece9a041a7bae6589619c177ef053fdbadeb5
+ oid sha256:13000897f21406fd80c5e3a09db58f39faafea98a000f08160c77793917940c0
  size 1112223464
test_results.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:4f85874cae37b57c474bb4450d166414c18f4e141ad6eefc3233030664397ecc
- size 805
+ oid sha256:bc6939538c759f3628746f4428d760b57381a25896bfb364c2188c119af8e838
+ size 804
training_curves.png CHANGED

Old file (Git LFS):
  • SHA256: 004ef9ab1a7bded3914e5b78f6b89621bfb202bd65d96d2a9ba639d28c6c9606
  • Pointer size: 131 Bytes
  • Size of remote file: 278 kB

New file (Git LFS):
  • SHA256: 55eb995a8c14353b1b9e8b52ec017b2d2a8d8a3145ec8958eee1145dfa7434cf
  • Pointer size: 131 Bytes
  • Size of remote file: 281 kB