# WikiCulturalClassifier
## Task Overview
The goal is to classify Wikipedia/Wikidata entities according to their cultural specificity. Each item belongs to one of three classes:
- CA – Cultural Agnostic: globally known, with no identifiable cultural origin
- CR – Cultural Representative: originates from a specific culture but is widely recognized
- CE – Cultural Exclusive: strongly tied to a single culture, with limited global diffusion
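The three-way label scheme can be expressed as a simple mapping (illustrative only; this dictionary is not part of the project code):

```python
# Illustrative mapping of the three culturality labels; the descriptions
# come from the task definition above.
CULTURALITY_LABELS = {
    "CA": "Cultural Agnostic",        # globally known, no identifiable origin
    "CR": "Cultural Representative",  # specific origin, but widely recognized
    "CE": "Cultural Exclusive",       # tied to one culture, limited diffusion
}

def describe(label: str) -> str:
    """Return the human-readable name for a culturality code."""
    return CULTURALITY_LABELS[label]
```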
Training and evaluation use the Cultural Dataset.
## Dataset Usage
Key dataset facts:
- Source: Wikidata entity metadata and descriptions
- ~6k silver-annotated training examples + ~300 gold-annotated validation examples
- Contains entity URL, name, textual description, type, and category
- Culturality labels: CA, CR, CE
Project-specific split strategy. The provided training set was split into:
- 80% for training
- 20% for internal validation
- stratified by label to preserve the class distribution

Code reference: `train_classifier.py`

The official validation set (gold labels) was held out as the final evaluation set and treated as test data. This ensures no leakage of gold labels into model development.
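The split above can be sketched with scikit-learn's stratified splitting (a sketch only; the project's actual logic lives in `train_classifier.py`, and the toy labels below are illustrative):

```python
# Sketch of the 80/20 stratified split described above.
from collections import Counter
from sklearn.model_selection import train_test_split

def split_dataset(examples, labels, seed=42):
    """Split into 80% train / 20% internal validation, stratified by label."""
    return train_test_split(
        examples, labels,
        test_size=0.20,      # 20% held out for internal validation
        stratify=labels,     # preserve the CA/CR/CE distribution
        random_state=seed,
    )

# Tiny balanced toy set: 10 examples per class.
y = ["CA"] * 10 + ["CR"] * 10 + ["CE"] * 10
X = list(range(len(y)))
X_tr, X_val, y_tr, y_val = split_dataset(X, y)
print(Counter(y_val))  # each class keeps its share in the validation split
```

Stratification guarantees that the 20% internal validation slice mirrors the label distribution of the full training set.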
## Method Overview
This classifier does not use any large language models. Culturality is inferred through structured statistics from Wikipedia and Wikidata.
Feature engineering pipeline:
Wikipedia statistics:
- number of language editions
- English page length, number of categories, external links
Wikidata claims analysis:
- cultural/geographic properties
- identifier count and diversity
Composite and interaction features:
- cultural specificity score
- global reach score
- polynomial transformations and binary thresholds
Feature extraction and preprocessing code: `extract_features.py` and `engineer_features()` in `train_classifier.py`
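As an illustration of how such composite scores might be derived from the raw statistics, here is a minimal sketch. The formulas and weights below are assumptions for illustration; the project's actual logic lives in `engineer_features()` in `train_classifier.py` and may differ:

```python
import math

def composite_features(n_langs, n_cultural_claims, n_identifiers, page_len):
    """Illustrative composite scores (formulas are assumptions, not the
    project's actual implementation)."""
    # Global reach: entities described in many language editions, with long
    # English pages, are more likely to be culturally agnostic.
    global_reach = math.log1p(n_langs) + 0.1 * math.log1p(page_len)
    # Cultural specificity: many cultural/geographic claims relative to
    # overall identifier diversity suggests a culture-bound entity.
    cultural_specificity = n_cultural_claims / (1 + n_identifiers)
    return {
        "global_reach_score": global_reach,
        "cultural_specificity_score": cultural_specificity,
        # Interaction term: high specificity despite high reach is the
        # typical signature of the CR (representative) class.
        "reach_x_specificity": global_reach * cultural_specificity,
    }
```

For example, a dish with 250 language editions scores high on global reach, while a local custom with few editions but many geographic claims scores high on cultural specificity.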
Classifier and training:
- Model: XGBoost with GPU inference
- Objective: Multiclass soft probabilities (3 classes)
- Standardization applied to all continuous features
- Hyperparameters optimized with Optuna
Code reference: `tune_hyperparameters.py`
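A sketch of what the training configuration and search space might look like. The parameter values and ranges below are illustrative assumptions; the real tuned values come from `tune_hyperparameters.py`:

```python
# Illustrative XGBoost configuration for the 3-class objective.
xgb_params = {
    "objective": "multi:softprob",  # per-class soft probabilities
    "num_class": 3,                 # CA / CR / CE
    "tree_method": "hist",
    "device": "cuda",               # GPU, as noted above
    "eval_metric": "mlogloss",
}

# Hypothetical Optuna search space (bounds are assumptions, not the
# project's actual ranges).
search_space = {
    "max_depth": (3, 10),
    "learning_rate": (1e-3, 0.3),   # typically sampled log-uniform
    "n_estimators": (200, 1500),
    "subsample": (0.6, 1.0),
    "colsample_bytree": (0.6, 1.0),
}
```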
The model artifact (`best_model.pkl`) includes:
- Trained XGBoost model
- Feature scaler
- Label encoder
- Feature list
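A minimal sketch of reading and writing an artifact with this layout. Stand-in values replace the fitted objects so the example is self-contained; the key names and feature names are assumptions, while the real `best_model.pkl` stores the fitted XGBoost model, scaler, and label encoder:

```python
import io
import pickle

# Stand-in bundle mirroring the artifact layout described above.
artifact = {
    "model": None,          # trained XGBoost classifier (stand-in)
    "scaler": None,         # fitted feature scaler (stand-in)
    "label_encoder": None,  # maps {CA, CR, CE} <-> integer ids (stand-in)
    "features": ["n_langs", "page_len", "global_reach_score"],  # illustrative
}

# Serialize and reload in memory (on disk this would be best_model.pkl).
buf = io.BytesIO()
pickle.dump(artifact, buf)
buf.seek(0)
bundle = pickle.load(buf)
# At inference time: scale features with bundle["scaler"], predict with
# bundle["model"], then decode ids with bundle["label_encoder"].
```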
## Performance Summary
All metrics were computed on the gold validation set:
| Model Variant | F1 |
|---|---|
| Text-only baseline | 0.50 |
| Default XGBoost | 0.68 |
| Tuned XGBoost (final model) | 0.71 |
Class-wise summary (high level):
- CA: high recall; globally known items are detected reliably
- CE: balanced performance
- CR: the most challenging class, owing to ambiguity between global reach and cultural origin