# WikiCulturalClassifier
## Task Overview
The goal is to classify Wikipedia/Wikidata entities according to their cultural specificity. Each item belongs to one of three classes:
- CA – Cultural Agnostic: globally known, with no identifiable cultural origin
- CR – Cultural Representative: originates from a specific culture but is widely recognized
- CE – Cultural Exclusive: strongly tied to a single culture, with limited global diffusion
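The three-way label scheme can be expressed as a simple mapping (illustrative only; this dictionary is not part of the project code):

```python
# Illustrative mapping of the three culturality labels; the descriptions
# come from the task definition above.
CULTURALITY_LABELS = {
    "CA": "Cultural Agnostic",        # globally known, no identifiable origin
    "CR": "Cultural Representative",  # specific origin, but widely recognized
    "CE": "Cultural Exclusive",       # tied to one culture, limited diffusion
}

def describe(label: str) -> str:
    """Return the human-readable name for a culturality code."""
    return CULTURALITY_LABELS[label]
```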
Training and evaluation use the Cultural Dataset.
## Dataset Usage
Key dataset facts:
- Source: Wikidata entity metadata and descriptions
- ~6k silver-annotated training examples + ~300 gold-annotated validation examples
- Contains entity URL, name, textual description, type, and category
- Culturality labels: CA, CR, CE
Project-specific split strategy. The provided training set was split into:
- 80% for training
- 20% for internal validation
- stratified by label to preserve the class distribution

Code reference: `train_classifier.py`

The official validation set (gold labels) was held out as the final evaluation set and treated as test data. This ensures no leakage of gold labels into model development.
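The split above can be sketched with scikit-learn's stratified splitting (a sketch only; the project's actual logic lives in `train_classifier.py`, and the toy labels below are illustrative):

```python
# Sketch of the 80/20 stratified split described above.
from collections import Counter
from sklearn.model_selection import train_test_split

def split_dataset(examples, labels, seed=42):
    """Split into 80% train / 20% internal validation, stratified by label."""
    return train_test_split(
        examples, labels,
        test_size=0.20,      # 20% held out for internal validation
        stratify=labels,     # preserve the CA/CR/CE distribution
        random_state=seed,
    )

# Tiny balanced toy set: 10 examples per class.
y = ["CA"] * 10 + ["CR"] * 10 + ["CE"] * 10
X = list(range(len(y)))
X_tr, X_val, y_tr, y_val = split_dataset(X, y)
print(Counter(y_val))  # each class keeps its share in the validation split
```

Stratification guarantees that the 20% internal validation slice mirrors the label distribution of the full training set.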
## Method Overview
This classifier does not use any large language models. Culturality is inferred through structured statistics from Wikipedia and Wikidata.
Feature engineering pipeline:
Wikipedia statistics:
- number of language editions
- English page length, number of categories, external links
Wikidata claims analysis:
- cultural/geographic properties
- identifier count and diversity
Composite and interaction features:
- cultural specificity score
- global reach score
- polynomial transformations and binary thresholds
Feature extraction and preprocessing code: `extract_features.py` and `engineer_features()` in `train_classifier.py`
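As an illustration of how such composite scores might be derived from the raw statistics, here is a minimal sketch. The formulas and weights below are assumptions for illustration; the project's actual logic lives in `engineer_features()` in `train_classifier.py` and may differ:

```python
import math

def composite_features(n_langs, n_cultural_claims, n_identifiers, page_len):
    """Illustrative composite scores (formulas are assumptions, not the
    project's actual implementation)."""
    # Global reach: entities described in many language editions, with long
    # English pages, are more likely to be culturally agnostic.
    global_reach = math.log1p(n_langs) + 0.1 * math.log1p(page_len)
    # Cultural specificity: many cultural/geographic claims relative to
    # overall identifier diversity suggests a culture-bound entity.
    cultural_specificity = n_cultural_claims / (1 + n_identifiers)
    return {
        "global_reach_score": global_reach,
        "cultural_specificity_score": cultural_specificity,
        # Interaction term: high specificity despite high reach is the
        # typical signature of the CR (representative) class.
        "reach_x_specificity": global_reach * cultural_specificity,
    }
```

For example, a dish with 250 language editions scores high on global reach, while a local custom with few editions but many geographic claims scores high on cultural specificity.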
Classifier and training:
- Model: XGBoost with GPU inference
- Objective: Multiclass soft probabilities (3 classes)
- Standardization applied to all continuous features
- Hyperparameters optimized with Optuna
Code reference: `tune_hyperparameters.py`
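A sketch of what the training configuration and search space might look like. The parameter values and ranges below are illustrative assumptions; the real tuned values come from `tune_hyperparameters.py`:

```python
# Illustrative XGBoost configuration for the 3-class objective.
xgb_params = {
    "objective": "multi:softprob",  # per-class soft probabilities
    "num_class": 3,                 # CA / CR / CE
    "tree_method": "hist",
    "device": "cuda",               # GPU, as noted above
    "eval_metric": "mlogloss",
}

# Hypothetical Optuna search space (bounds are assumptions, not the
# project's actual ranges).
search_space = {
    "max_depth": (3, 10),
    "learning_rate": (1e-3, 0.3),   # typically sampled log-uniform
    "n_estimators": (200, 1500),
    "subsample": (0.6, 1.0),
    "colsample_bytree": (0.6, 1.0),
}
```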
The model artifact (`best_model.pkl`) includes:
- Trained XGBoost model
- Feature scaler
- Label encoder
- Feature list
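A minimal sketch of reading and writing an artifact with this layout. Stand-in values replace the fitted objects so the example is self-contained; the key names and feature names are assumptions, while the real `best_model.pkl` stores the fitted XGBoost model, scaler, and label encoder:

```python
import io
import pickle

# Stand-in bundle mirroring the artifact layout described above.
artifact = {
    "model": None,          # trained XGBoost classifier (stand-in)
    "scaler": None,         # fitted feature scaler (stand-in)
    "label_encoder": None,  # maps {CA, CR, CE} <-> integer ids (stand-in)
    "features": ["n_langs", "page_len", "global_reach_score"],  # illustrative
}

# Serialize and reload in memory (on disk this would be best_model.pkl).
buf = io.BytesIO()
pickle.dump(artifact, buf)
buf.seek(0)
bundle = pickle.load(buf)
# At inference time: scale features with bundle["scaler"], predict with
# bundle["model"], then decode ids with bundle["label_encoder"].
```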
## Performance Summary
All metrics were computed on the gold validation set:
| Model Variant | F1 |
|---|---|
| Text-only baseline | 0.50 |
| Default XGBoost | 0.68 |
| Tuned XGBoost (final model) | 0.71 |
Class-wise summary (high level):
- CA: high recall; globally known items are detected reliably
- CE: balanced performance
- CR: the most challenging class, owing to ambiguity between global reach and cultural origin