---
language: en
license: mit
tags:
- text-classification
- multi-label-classification
- topic-classification
- political-text
- tweets
- distilbert
datasets:
- thomasrenault/us_tweet_speech_congress
metrics:
- f1
base_model: distilbert-base-uncased
pipeline_tag: text-classification
---

# thomasrenault/topic

A multi-label political topic classifier fine-tuned on US tweets, campaign speeches, and congressional speeches. It is built on `distilbert-base-uncased`, with training labels annotated by GPT-4o-mini through the OpenAI Batch API.

## Labels

The model predicts **7 independent topic indicators**: each label gets its own sigmoid probability and is assigned when that probability is at least 0.5.
A document can therefore belong to **zero, one, or several topics at once**.

| Label | Description |
|---|---|
| `abortion` | Abortion rights and reproductive policy |
| `democracy` | Elections, voting rights, democratic institutions |
| `gender equality` | Gender rights, feminism, LGBTQ+ issues |
| `gun control` | Firearms regulation, Second Amendment |
| `immigration` | Immigration policy, border control, citizenship |
| `tax and inequality` | Tax policy, economic inequality, redistribution |
| `trade` | Trade policy, tariffs, international commerce |

Documents that match none of the above are implicitly classified as `other topic`.

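The decision rule described in this section can be illustrated without loading the model. The snippet below is a minimal sketch, assuming the classifier emits one raw logit per label in the order of the table above; the `sigmoid` and `decode` helpers and the logit values are hypothetical and exist only for illustration.

```python
import math

# Label order assumed to match the table above.
TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]

def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def decode(logits, threshold=0.5):
    """Return every topic whose sigmoid probability reaches the threshold."""
    matched = [t for t, z in zip(TOPICS, logits) if sigmoid(z) >= threshold]
    # No label fired: fall back to the implicit catch-all class.
    return matched or ["other topic"]

# Hypothetical logits, not real model output.
print(decode([-3.0, 1.2, -2.5, -4.0, 2.1, -1.0, -3.3]))    # two topics fire
print(decode([-3.0, -1.2, -2.5, -4.0, -2.1, -1.0, -3.3]))  # no topic fires
```

Because each label has its own sigmoid, the probabilities do not need to sum to 1, which is what allows zero or several topics per document. A threshold of 0.5 on the probability is equivalent to a threshold of 0 on the raw logit.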
## Training

| Setting | Value |
|---|---|
| Base model | `distilbert-base-uncased` |
| Architecture | `DistilBertForSequenceClassification` (multi-label) |
| Problem type | `multi_label_classification` |
| Training data | ~200,000 labeled documents |
| Annotation | GPT-4o-mini (temperature=0) via OpenAI Batch API |
| Epochs | 4 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Max length | 512 tokens |
| Classification threshold | 0.5 |
| Domain | US tweets about policy, campaign speeches, and congressional floor speeches |

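With the problem type set to `multi_label_classification`, the Transformers sequence-classification heads train with a binary cross-entropy loss on logits (`BCEWithLogitsLoss`), so each document needs a multi-hot float target rather than a single class index. The sketch below shows that encoding and the loss in plain Python. The `encode` and `bce_with_logits` helpers and the example values are illustrative assumptions, not the author's training code.

```python
import math

# Label order assumed to match the Labels table.
TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]

def encode(labels):
    """Turn a list of topic names into a multi-hot float vector."""
    unknown = set(labels) - set(TOPICS)
    if unknown:
        raise ValueError(f"unknown topics: {sorted(unknown)}")
    return [1.0 if t in labels else 0.0 for t in TOPICS]

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed stably from logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # max(z, 0) - z*y + log(1 + exp(-|z|)) avoids overflow for large |z|.
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

target = encode(["immigration", "trade"])
print(target)  # [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0]

# Hypothetical logits: one set agrees with the target, one contradicts it.
good = bce_with_logits([-5, -5, -5, -5, 5, -5, 5], target)
bad = bce_with_logits([5, 5, 5, 5, -5, 5, -5], target)
print(good < bad)  # True
```

A document annotated with no topic encodes to an all-zero vector, which is how the implicit `other topic` case would appear under this encoding.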
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "thomasrenault/topic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode: disables dropout

# Topic names are zipped with the logits below, so the order must match the model's output.
TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]
THRESHOLD = 0.5

def predict(text):
    """Return the list of topics predicted for a single string."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        # One independent sigmoid probability per topic.
        probs = torch.sigmoid(model(**enc).logits).squeeze(0).tolist()
    matched = [t for t, p in zip(TOPICS, probs) if p >= THRESHOLD]
    # Fall back to the implicit catch-all when no topic reaches the threshold.
    return matched or ["other topic"]

print(predict("We need stronger border security and immigration reform."))
# ['immigration']

print(predict("Tax cuts for the wealthy only increase inequality in America."))
# ['tax and inequality']
```

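The `predict` function above handles one string at a time. For many documents, the tokenizer can be called on a list of strings, for example `tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")`, and `torch.sigmoid(model(**enc).logits).tolist()` then yields one row of probabilities per document. The sketch below covers only the decoding of such rows, in plain Python so it runs without downloading the model; `decode_batch` and the probability values are hypothetical.

```python
# Label order assumed to match the order used in the single-text example.
TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]

def decode_batch(prob_rows, threshold=0.5):
    """Map rows of per-topic probabilities to lists of topic names."""
    results = []
    for row in prob_rows:
        if len(row) != len(TOPICS):
            raise ValueError(f"expected {len(TOPICS)} probabilities, got {len(row)}")
        matched = [t for t, p in zip(TOPICS, row) if p >= threshold]
        # Same fallback as the single-text case when nothing reaches the threshold.
        results.append(matched or ["other topic"])
    return results

# Hypothetical probabilities for three documents, not real model output.
rows = [
    [0.01, 0.02, 0.03, 0.01, 0.97, 0.04, 0.02],
    [0.02, 0.10, 0.05, 0.01, 0.03, 0.91, 0.62],
    [0.05, 0.20, 0.10, 0.02, 0.08, 0.12, 0.03],
]
for labels in decode_batch(rows):
    print(labels)
```

The length check guards against silently truncated results, since `zip` stops at the shorter of its inputs.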
## Intended Use

- Academic research on political agenda-setting and issue salience
- Topic trend analysis across congressional speeches and social media
- Cross-platform comparison of elite vs. citizen political communication

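As one possible way to turn per-document predictions into a trend series, the sketch below computes, for each period, the share of documents tagged with each topic. It assumes predictions have already been collected as `(period, topics)` pairs; the `topic_shares` helper, the period format, and the records are hypothetical.

```python
from collections import Counter, defaultdict

def topic_shares(records):
    """Compute, per period, the share of documents tagged with each topic.

    `records` is an iterable of (period, topics) pairs, where `topics`
    is the list of labels predicted for one document.
    """
    docs = Counter()              # number of documents per period
    hits = defaultdict(Counter)   # per period: documents tagged with each topic
    for period, topics in records:
        docs[period] += 1
        for topic in set(topics):  # count each topic once per document
            hits[period][topic] += 1
    return {
        period: {t: n / docs[period] for t, n in hits[period].items()}
        for period in sorted(docs)
    }

# Hypothetical predictions, not real model output.
records = [
    ("2024-01", ["immigration"]),
    ("2024-01", ["immigration", "trade"]),
    ("2024-01", ["other topic"]),
    ("2024-02", ["trade"]),
    ("2024-02", ["trade", "tax and inequality"]),
]
shares = topic_shares(records)
print(shares["2024-01"]["immigration"])  # 2 of 3 documents
print(shares["2024-02"]["trade"])        # 2 of 2 documents
```

Because the classifier is multi-label, the shares within a period can sum to more than 1.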
## Limitations

- Trained on **US English political text**; it may not generalise to other political systems or languages
- GPT-4o-mini annotation introduces model-specific biases in where topic boundaries are drawn
- Topics reflect the specific research agenda of the parent project; other salient topics (healthcare, climate, etc.) are out of scope

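One way to gauge how much the annotation bias matters for a given corpus is to compare predictions against a small hand-labeled sample using F1, the metric listed in this card's metadata. No human-annotated evaluation set is described here, so the sketch below is only an assumption about how such a check could look; `f1_scores` and the three-label toy matrices are hypothetical.

```python
def f1_scores(y_true, y_pred):
    """Per-label F1 and micro-averaged F1 for multi-hot label matrices."""
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                tp[j] += 1
            elif p:
                fp[j] += 1
            elif t:
                fn[j] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        # A label with no true or predicted positives scores 0.0 in this sketch.
        return 2 * tp_ / denom if denom else 0.0

    per_label = [f1(a, b, c) for a, b, c in zip(tp, fp, fn)]
    micro = f1(sum(tp), sum(fp), sum(fn))
    return per_label, micro

# Toy example with three labels and four documents (hypothetical data).
human = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 0]]
model = [[1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0]]
per_label, micro = f1_scores(human, model)
print(per_label)  # [1.0, 0.5, 0.0]
print(micro)      # 0.75
```

Per-label scores show which topic boundaries disagree most with human judgment, while the micro average summarises overall agreement.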
## Citation

If you use this model, please cite:

```
@misc{renault2025topic,
  author = {Renault, Thomas},
  title = {thomasrenault/topic: Multi-label political topic classifier for US political text},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/thomasrenault/topic}
}
```