Domain Labeler

This is an Ettin 32M parameter model fine-tuned with the Wikipedia Domain Labels dataset for text classification.

This model classifies text into one of the following classes.

labels = [
    "aerospace", "agronomy", "artistic", "astronomy", "atmospheric_science", "automotive", "beauty",
    "biology", "celebrity", "chemistry", "civil_engineering", "communication_engineering",
    "computer_science_and_technology", "design", "drama_and_film", "economics",
    "electronic_science", "entertainment", "environmental_science", "fashion", "finance",
    "food", "gamble", "game", "geography", "health", "history", "hobby", "hydraulic_engineering", 
    "instrument_science", "journalism_and_media_communication", "landscape_architecture", "law",
    "library", "literature", "materials_science", "mathematics", "mechanical_engineering",
    "medical", "mining_engineering", "movie", "music_and_dance", "news", "nuclear_science",
    "ocean_science", "optical_engineering", "painting", "pet", 
    "petroleum_and_natural_gas_engineering", "philosophy", "photo", "physics", "politics",
    "psychology", "public_administration", "relationship", "religion", "sociology", "sports",
    "statistics", "systems_science", "textile_science", "topicality", "transportation_engineering",
    "travel", "urban_planning", "vulgar_language"
]
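The label list above can be turned into the id2label/label2id mappings that transformers-style classification configs use. A minimal sketch with a truncated subset of the labels (the full model covers every label above):

```python
# Build id2label / label2id mappings from the label list
# (subset shown for brevity; the real model uses the full list above).
labels = ["aerospace", "agronomy", "artistic", "astronomy"]

id2label = {i: name for i, name in enumerate(labels)}
label2id = {name: i for i, name in enumerate(labels)}

print(id2label[0])   # aerospace
print(label2id["astronomy"])  # 3
```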

Usage (txtai)

This model can be used to classify text into one of the domain labels above with txtai.

from txtai.pipeline import Labels

labels = Labels("NeuML/domain-labeler", dynamic=False)
labels("Text to classify")

# Get only the top label
labels("Text to classify", flatten=True)

Usage (Hugging Face Transformers)

The following code is used to run a transformers text-classification pipeline.

from transformers import pipeline

labels = pipeline("text-classification", model="NeuML/domain-labeler")
labels("Text to classify")
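The transformers text-classification pipeline returns a list of dicts with `label` and `score` keys. A minimal, model-free sketch of selecting the top prediction from that output shape (the scores below are illustrative placeholders, not real model output):

```python
# Sketch: pick the highest-scoring label from a transformers-style
# text-classification result. Scores here are made up for illustration.
predictions = [
    {"label": "computer_science_and_technology", "score": 0.91},
    {"label": "mathematics", "score": 0.05},
    {"label": "statistics", "score": 0.04},
]

top = max(predictions, key=lambda p: p["score"])
print(top["label"])  # computer_science_and_technology
```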

Evaluation

The following are the metrics for the test dataset. Note that these labels have significant overlap, and the overall accuracy is much higher when the categories are generalized. In other words, the "wrong" labels aren't always necessarily wrong (e.g. medical vs. health, entertainment vs. celebrity).

Accuracy  F1     Precision  Recall  PR-AUC
84.26     83.97  83.96      84.26   90.03
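One way to score the generalized categories mentioned above is to collapse overlapping labels into coarser groups before comparing predictions. A hypothetical sketch (the grouping below is an assumption for illustration, not the evaluation behind the table above):

```python
# Hypothetical coarse grouping of overlapping labels (an assumption
# for illustration; not the grouping used in the reported metrics).
COARSE = {
    "medical": "health_sciences",
    "health": "health_sciences",
    "entertainment": "entertainment_media",
    "celebrity": "entertainment_media",
    "movie": "entertainment_media",
}

def coarse_match(predicted, expected):
    """True if two labels match exactly or fall in the same coarse group."""
    if predicted == expected:
        return True
    p, e = COARSE.get(predicted), COARSE.get(expected)
    return p is not None and p == e

print(coarse_match("medical", "health"))  # True
print(coarse_match("medical", "sports"))  # False
```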

Training code

The training code used to build this model is here.
