---
language: en
license: mit
tags:
- text-classification
- multi-label-classification
- topic-classification
- political-text
- tweets
- distilbert
datasets:
- thomasrenault/us_tweet_speech_congress
metrics:
- f1
base_model: distilbert-base-uncased
pipeline_tag: text-classification
---

# thomasrenault/topic

A multi-label political topic classifier fine-tuned on US tweets, campaign speeches, and congressional speeches. Built on `distilbert-base-uncased`, with training labels produced by GPT-4o-mini via the OpenAI Batch API.

## Labels

The model predicts **7 independent topic indicators** (sigmoid outputs, threshold 0.5). A document can belong to **zero, one, or multiple topics simultaneously**.

| Label | Description |
|---|---|
| `abortion` | Abortion rights and reproductive policy |
| `democracy` | Elections, voting rights, democratic institutions |
| `gender equality` | Gender rights, feminism, LGBTQ+ issues |
| `gun control` | Firearms regulation, Second Amendment |
| `immigration` | Immigration policy, border control, citizenship |
| `tax and inequality` | Tax policy, economic inequality, redistribution |
| `trade` | Trade policy, tariffs, international commerce |

Documents that match none of the above are implicitly classified as `other topic`.
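The sigmoid-and-threshold decoding described above can be sketched independently of the model; the logits below are made-up values purely for illustration:

```python
import math

TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]
THRESHOLD = 0.5

def decode(logits):
    """Map 7 raw logits to topic names; zero matches fall back to 'other topic'."""
    probs = [1 / (1 + math.exp(-z)) for z in logits]  # independent sigmoids
    matched = [t for t, p in zip(TOPICS, probs) if p >= THRESHOLD]
    return matched or ["other topic"]

# Hypothetical logits: positive for two topics, negative for the rest.
print(decode([-3.0, 2.5, -4.0, -2.0, 3.1, -1.5, -3.5]))
# ['democracy', 'immigration']
print(decode([-3.0] * 7))
# ['other topic']
```

Note that a probability of 0.5 corresponds to a logit of 0, so the 0.5 threshold is equivalent to keeping every topic with a positive logit.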
## Training

| Setting | Value |
|---|---|
| Base model | `distilbert-base-uncased` |
| Architecture | `DistilBertForSequenceClassification` (multi-label) |
| Problem type | `multi_label_classification` |
| Training data | ~200,000 labeled documents |
| Annotation | GPT-4o-mini (temperature=0) via the OpenAI Batch API |
| Epochs | 4 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Max length | 512 tokens |
| Classification threshold | 0.5 |
| Domain | US tweets about policy, campaign speeches, and congressional floor speeches |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "thomasrenault/topic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]
THRESHOLD = 0.5

def predict(text):
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        probs = torch.sigmoid(model(**enc).logits).squeeze().tolist()
    matched = [t for t, p in zip(TOPICS, probs) if p >= THRESHOLD]
    return matched or ["other topic"]

print(predict("We need stronger border security and immigration reform."))
# ["immigration"]
print(predict("Tax cuts for the wealthy only increase inequality in America."))
# ["tax and inequality"]
```

## Intended Use

- Academic research on political agenda-setting and issue salience
- Topic trend analysis across congressional speeches and social media
- Cross-platform comparison of elite vs. citizen political communication

## Limitations

- Trained on **US English political text**; it may not generalise to other political systems or languages.
- Annotation by GPT-4o-mini introduces model-specific biases in topic boundaries.
- Topics reflect the research agenda of the parent project; other salient topics (healthcare, climate, etc.) are out of scope.

## Citation

If you use this model, please cite:

```
@misc{renault2025topic,
  author    = {Renault, Thomas},
  title     = {thomasrenault/topic: Multi-label political topic classifier for US political text},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/thomasrenault/topic}
}
```
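The card lists F1 as its evaluation metric. For a multi-label classifier like this one, F1 is typically computed per label over binary indicator vectors and then macro-averaged. A minimal stdlib sketch with toy gold/predicted vectors (not the model's actual scores):

```python
TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]

def per_label_f1(y_true, y_pred, i):
    """Binary F1 for label i over lists of 7-dim 0/1 indicator vectors."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[i] and p[i])
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t[i] and p[i])
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[i] and not p[i])
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data: two documents, indicator order matches TOPICS.
y_true = [[0, 1, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 1, 0]]
y_pred = [[0, 1, 0, 0, 1, 0, 0], [0, 0, 0, 0, 1, 1, 0]]

macro_f1 = sum(per_label_f1(y_true, y_pred, i)
               for i in range(len(TOPICS))) / len(TOPICS)
```

This toy convention scores labels with no true positives as 0.0; a library implementation such as scikit-learn's `f1_score(average="macro")` offers configurable handling of that edge case.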