WikiCulturalClassifier

Task Overview

The goal is to classify Wikipedia/Wikidata entities according to their cultural specificity. Each item belongs to one of three classes:

  • CA – Cultural Agnostic: globally known, with no identifiable cultural origin
  • CR – Cultural Representative: originates from a specific culture but is widely recognized
  • CE – Cultural Exclusive: strongly linked to a single culture, with limited global diffusion

The model is trained and evaluated on the Cultural Dataset.

Dataset Usage

Key dataset facts:

  • Source: Wikidata entity metadata and descriptions
  • ~6k silver-annotated training examples + ~300 gold-annotated validation examples
  • Contains entity URL, name, textual description, type, and category
  • Culturality labels: CA, CR, CE

Project-specific split strategy:

  • The provided training set was split 80/20 into a training set and an internal validation set, stratified by label to preserve the class distribution (code reference: train_classifier.py)
  • The official validation set (gold labels) was held out and used only as the final test set

This ensures no leakage from gold labels during model development.
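The stratified 80/20 split can be sketched with scikit-learn; the toy data and column names (`name`, `label`) here are illustrative assumptions, not the dataset's actual schema:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the silver-annotated training data
df = pd.DataFrame({
    "name": [f"entity_{i}" for i in range(100)],
    "label": ["CA", "CR", "CE", "CA", "CR"] * 20,
})

# 80/20 split, stratified by label to preserve class proportions
train_df, val_df = train_test_split(
    df,
    test_size=0.20,
    stratify=df["label"],   # keep the CA/CR/CE ratio in both splits
    random_state=42,
)
```

Stratification matters here because the three classes are imbalanced; without it, a random split could leave the internal validation set with too few examples of a minority class.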


Method Overview

This classifier does not use any large language models; instead, culturality is inferred from structured statistics drawn from Wikipedia and Wikidata.

Feature engineering pipeline:

  • Wikipedia statistics:

    • number of language editions
    • English page length, number of categories, external links
  • Wikidata claims analysis:

    • cultural/geographic properties
    • identifier count and diversity
  • Composite and interaction features:

    • cultural specificity score
    • global reach score
    • polynomial transformations and binary thresholds
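The composite features above can be sketched as follows; the column names and formulas are illustrative assumptions, not the project's actual definitions:

```python
import numpy as np
import pandas as pd

# Toy feature table for three hypothetical entities
feats = pd.DataFrame({
    "n_lang_editions": [150, 12, 3],
    "n_cultural_claims": [0, 4, 9],
    "n_identifiers": [40, 15, 5],
})

# Global reach: many language editions and identifiers suggest global diffusion
feats["global_reach"] = (
    np.log1p(feats["n_lang_editions"]) + np.log1p(feats["n_identifiers"])
)
# Cultural specificity: density of cultural/geographic claims
feats["cultural_specificity"] = (
    feats["n_cultural_claims"] / (1 + feats["n_identifiers"])
)
# Polynomial transformation and a binary threshold
feats["reach_sq"] = feats["global_reach"] ** 2
feats["widely_translated"] = (feats["n_lang_editions"] >= 50).astype(int)
```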

Feature extraction and preprocessing code: extract_features.py and engineer_features() in train_classifier.py

Classifier and training:

  • Model: XGBoost with GPU inference
  • Objective: Multiclass soft probabilities (3 classes)
  • Standardization applied to all continuous features
  • Hyperparameters optimized with Optuna (code reference: tune_hyperparameters.py)

The model artifact (best_model.pkl) includes:

  • Trained XGBoost model
  • Feature scaler
  • Label encoder
  • Feature list
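Bundling these components into one pickle might look like the sketch below; the dict layout mirrors the listed components, but the key names are assumptions:

```python
import pickle

from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical artifact layout; in the real pipeline "model" would hold
# the trained XGBoost classifier
artifact = {
    "model": None,
    "scaler": StandardScaler(),
    "label_encoder": LabelEncoder(),
    "features": ["n_lang_editions", "page_length", "n_identifiers"],
}

with open("best_model.pkl", "wb") as f:
    pickle.dump(artifact, f)

# At inference time, load everything back in one step
with open("best_model.pkl", "rb") as f:
    loaded = pickle.load(f)
```

Shipping the scaler, label encoder, and feature list alongside the model guarantees that inference applies exactly the same preprocessing and column order as training.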

Performance Summary

All metrics evaluated on the gold validation set:

  Model Variant                  F1
  Text-only baseline             0.50
  Default XGBoost                0.68
  Tuned XGBoost (final model)    0.71

Class-wise summary (high level):

  • CA: high recall; globally known items are detected reliably
  • CE: balanced performance
  • CR: the most challenging class, due to the ambiguity between global reach and cultural origin
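Per-class behavior like the summary above can be inspected with scikit-learn's F1 utilities; the labels here are toy values, not the model's real predictions:

```python
from sklearn.metrics import f1_score

# Toy gold labels and predictions for illustration only
y_true = ["CA", "CR", "CE", "CA", "CR", "CE"]
y_pred = ["CA", "CA", "CE", "CA", "CR", "CE"]

# Macro F1 averages the per-class scores, weighting each class equally
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Per-class F1, in a fixed label order
per_class = f1_score(y_true, y_pred, average=None, labels=["CA", "CR", "CE"])
```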