---
language: en
license: mit
tags:
- text-classification
- multi-label-classification
- topic-classification
- political-text
- tweets
- distilbert
datasets:
- thomasrenault/us_tweet_speech_congress
metrics:
- f1
base_model: distilbert-base-uncased
pipeline_tag: text-classification
---
# thomasrenault/topic
A multi-label political topic classifier fine-tuned on US tweets, campaign speeches, and congressional speeches. It is built on `distilbert-base-uncased`; the training labels were produced by GPT-4o-mini through the OpenAI Batch API.
## Labels
The model predicts **7 independent topic indicators**: each label gets its own sigmoid probability and is assigned when that probability is at least 0.5.
A document can belong to **zero or multiple topics simultaneously**.
| Label | Description |
|---|---|
| `abortion` | Abortion rights and reproductive policy |
| `democracy` | Elections, voting rights, democratic institutions |
| `gender equality` | Gender rights, feminism, LGBTQ+ issues |
| `gun control` | Firearms regulation, Second Amendment |
| `immigration` | Immigration policy, border control, citizenship |
| `tax and inequality` | Tax policy, economic inequality, redistribution |
| `trade` | Trade policy, tariffs, international commerce |

Documents that match none of the above are implicitly classified as `other topic`.
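
The decision rule described above can be sketched without loading the model. This is a minimal illustration, not the model's own code: the logit values are made up, and `sigmoid` and `decode` are hypothetical helper names. The label order follows the table above.

```python
import math

# Label order as listed in the table above.
TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]
THRESHOLD = 0.5


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1), stable for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode(logits, threshold=THRESHOLD):
    """Turn one row of 7 logits into a list of topic labels."""
    probs = [sigmoid(z) for z in logits]
    # Each label is decided on its own; the probabilities need not sum to 1.
    matched = [t for t, p in zip(TOPICS, probs) if p >= threshold]
    # No label cleared the threshold: fall back to the implicit class.
    return matched or ["other topic"]


# Hypothetical logits, chosen only to show the multi-label and empty cases.
print(decode([-3.0, 2.1, -1.5, -4.0, 1.2, -2.2, -0.7]))
print(decode([-3.0, -2.1, -1.5, -4.0, -1.2, -2.2, -0.7]))
```

Because the sigmoid of 0 is exactly 0.5, a threshold of 0.5 on the probability is equivalent to checking whether the logit is non-negative.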
## Training
| Setting | Value |
|---|---|
| Base model | `distilbert-base-uncased` |
| Architecture | `DistilBertForSequenceClassification` (multi-label) |
| Problem type | `multi_label_classification` |
| Training data | ~200,000 labeled documents |
| Annotation | GPT-4o-mini (temperature=0) via OpenAI Batch API |
| Epochs | 4 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Max length | 512 tokens |
| Classification threshold | 0.5 |
| Domain | US tweets about policy, campaign speeches and congressional floor speeches |
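
With `problem_type` set to `multi_label_classification`, the Transformers sequence-classification heads train against multi-hot float targets using a binary cross-entropy loss over logits (`BCEWithLogitsLoss` in PyTorch). The sketch below reproduces that target encoding and loss in plain Python to make the setup concrete. It is an illustration under those assumptions, not the author's training script; `multi_hot` and `bce_with_logits` are hypothetical helper names.

```python
import math

TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]


def multi_hot(labels):
    """Encode a list of topic names as a 7-dimensional 0.0/1.0 vector."""
    unknown = set(labels) - set(TOPICS)
    if unknown:
        raise ValueError(f"unknown topics: {sorted(unknown)}")
    return [1.0 if t in labels else 0.0 for t in TOPICS]


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed stably from logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


target = multi_hot(["immigration", "trade"])
print(target)                      # two positions set to 1.0
print(multi_hot([]))               # a document matching no topic: all zeros
print(bce_with_logits([0.0] * 7, target))
```

A document that matches no topic is simply an all-zero target, which is why `other topic` does not need an output unit of its own.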
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "thomasrenault/topic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode: disables dropout

# Topic names, paired position by position with the model's output logits.
TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]
THRESHOLD = 0.5

def predict(text):
    """Return the topics whose sigmoid probability is at least THRESHOLD."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        # logits has shape (1, 7) for a single text; take the only row
        probs = torch.sigmoid(model(**enc).logits)[0].tolist()
    matched = [t for t, p in zip(TOPICS, probs) if p >= THRESHOLD]
    return matched or ["other topic"]

print(predict("We need stronger border security and immigration reform."))
# ["immigration"]
print(predict("Tax cuts for the wealthy only increase inequality in America."))
# ["tax and inequality"]
```
## Intended Use
- Academic research on political agenda-setting and issue salience
- Topic trend analysis across congressional speeches and social media
- Cross-platform comparison of elite vs. citizen political communication
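
For trend analysis or cross-platform comparison, predicted label lists can be aggregated into per-group topic shares. The following is a minimal sketch under stated assumptions: `topic_shares` is a hypothetical helper, the grouping key (a year, a platform, a speaker type) is whatever the analysis needs, and the records are made-up examples rather than model output.

```python
from collections import Counter, defaultdict


def topic_shares(records):
    """Compute, per group, the share of documents tagged with each topic.

    records: iterable of (group, labels) pairs, where labels is a list of
    topic names such as the one returned by a predict() function.
    """
    counts = defaultdict(Counter)
    totals = Counter()
    for group, labels in records:
        totals[group] += 1
        # set() guards against a label being counted twice for one document.
        counts[group].update(set(labels))
    return {g: {t: n / totals[g] for t, n in counts[g].items()}
            for g in totals}


# Made-up records for illustration only.
records = [
    ("2019", ["immigration"]),
    ("2019", ["immigration", "trade"]),
    ("2020", ["other topic"]),
    ("2020", ["democracy"]),
]
shares = topic_shares(records)
print(shares["2019"])
print(shares["2020"])
```

Because the classifier is multi-label, the shares within a group can sum to more than 1.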
## Limitations
- Trained on **US English political text**; may not generalise to other political systems or languages
- Annotation by GPT-4o-mini introduces model-specific biases in topic boundaries
- Topics reflect the specific research agenda of the parent project; other salient topics (healthcare, climate, etc.) are out of scope
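
Since the training labels come from GPT-4o-mini, one way to gauge how these limitations affect a given corpus is to score predictions against a small hand-labelled sample using per-label F1, the metric listed in the card metadata. The sketch below assumes such a sample exists; `per_label_f1` is a hypothetical helper and the gold and predicted sets are made up.

```python
def per_label_f1(gold, pred, topics):
    """Per-topic F1 for two aligned lists of label sets.

    Returns None for a topic that appears in neither gold nor pred,
    because F1 is undefined in that case.
    """
    scores = {}
    for t in topics:
        tp = sum(1 for g, p in zip(gold, pred) if t in g and t in p)
        fp = sum(1 for g, p in zip(gold, pred) if t not in g and t in p)
        fn = sum(1 for g, p in zip(gold, pred) if t in g and t not in p)
        denom = 2 * tp + fp + fn
        scores[t] = 2 * tp / denom if denom else None
    return scores


# Made-up hand labels and predictions for four documents.
gold = [{"immigration"}, {"trade"}, set(), {"immigration", "trade"}]
pred = [{"immigration"}, set(), {"trade"}, {"immigration", "trade"}]
scores = per_label_f1(gold, pred, ["immigration", "trade", "abortion"])
print(scores)
```

Reporting the score per topic rather than a single average makes it easier to see which topic boundaries diverge from human judgment.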
## Citation
If you use this model, please cite:
```
@misc{renault2025topic,
author = {Renault, Thomas},
title = {thomasrenault/topic: Multi-label political topic classifier for US political text},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/thomasrenault/topic}
}
```