---
language: en
license: mit
tags:
- text-classification
- multi-label-classification
- topic-classification
- political-text
- tweets
- distilbert
datasets:
- thomasrenault/us_tweet_speech_congress
metrics:
- f1
base_model: distilbert-base-uncased
pipeline_tag: text-classification
---

# thomasrenault/topic

A multi-label political topic classifier fine-tuned on US tweets, campaign speeches, and congressional speeches. Built on `distilbert-base-uncased`, with training labels produced by GPT-4o-mini via the OpenAI Batch API.

## Labels

The model predicts **7 independent topic indicators** (sigmoid outputs, threshold 0.5). A document can belong to **zero, one, or multiple topics simultaneously**.

| Label | Description |
|---|---|
| `abortion` | Abortion rights and reproductive policy |
| `democracy` | Elections, voting rights, democratic institutions |
| `gender equality` | Gender rights, feminism, LGBTQ+ issues |
| `gun control` | Firearms regulation, Second Amendment |
| `immigration` | Immigration policy, border control, citizenship |
| `tax and inequality` | Tax policy, economic inequality, redistribution |
| `trade` | Trade policy, tariffs, international commerce |

Documents that match none of the above are implicitly classified as `other topic`.
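The sigmoid-and-threshold decoding described above can be sketched independently of the model; the logits below are made-up values purely for illustration:

```python
import math

TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]
THRESHOLD = 0.5

def decode(logits):
    """Map 7 raw logits to topic names; zero matches fall back to 'other topic'."""
    probs = [1 / (1 + math.exp(-z)) for z in logits]  # independent sigmoids
    matched = [t for t, p in zip(TOPICS, probs) if p >= THRESHOLD]
    return matched or ["other topic"]

# Hypothetical logits: positive for two topics, negative for the rest.
print(decode([-3.0, 2.5, -4.0, -2.0, 3.1, -1.5, -3.5]))
# ['democracy', 'immigration']
print(decode([-3.0] * 7))
# ['other topic']
```

Note that a probability of 0.5 corresponds to a logit of 0, so the 0.5 threshold is equivalent to keeping every topic with a positive logit.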
## Training

| Setting | Value |
|---|---|
| Base model | `distilbert-base-uncased` |
| Architecture | `DistilBertForSequenceClassification` (multi-label) |
| Problem type | `multi_label_classification` |
| Training data | ~200,000 labeled documents |
| Annotation | GPT-4o-mini (temperature=0) via the OpenAI Batch API |
| Epochs | 4 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Max length | 512 tokens |
| Classification threshold | 0.5 |
| Domain | US tweets about policy, campaign speeches, and congressional floor speeches |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "thomasrenault/topic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]
THRESHOLD = 0.5

def predict(text):
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        probs = torch.sigmoid(model(**enc).logits).squeeze().tolist()
    matched = [t for t, p in zip(TOPICS, probs) if p >= THRESHOLD]
    return matched or ["other topic"]

print(predict("We need stronger border security and immigration reform."))
# ["immigration"]
print(predict("Tax cuts for the wealthy only increase inequality in America."))
# ["tax and inequality"]
```

## Intended Use

- Academic research on political agenda-setting and issue salience
- Topic trend analysis across congressional speeches and social media
- Cross-platform comparison of elite vs. citizen political communication

## Limitations

- Trained on **US English political text**; it may not generalise to other political systems or languages.
- Annotation by GPT-4o-mini introduces model-specific biases in topic boundaries.
- Topics reflect the research agenda of the parent project; other salient topics (healthcare, climate, etc.) are out of scope.

## Citation

If you use this model, please cite:

```
@misc{renault2025topic,
  author    = {Renault, Thomas},
  title     = {thomasrenault/topic: Multi-label political topic classifier for US political text},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/thomasrenault/topic}
}
```
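The card lists F1 as its evaluation metric. For a multi-label classifier like this one, F1 is typically computed per label over binary indicator vectors and then macro-averaged. A minimal stdlib sketch with toy gold/predicted vectors (not the model's actual scores):

```python
TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]

def per_label_f1(y_true, y_pred, i):
    """Binary F1 for label i over lists of 7-dim 0/1 indicator vectors."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[i] and p[i])
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t[i] and p[i])
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[i] and not p[i])
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data: two documents, indicator order matches TOPICS.
y_true = [[0, 1, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 1, 0]]
y_pred = [[0, 1, 0, 0, 1, 0, 0], [0, 0, 0, 0, 1, 1, 0]]

macro_f1 = sum(per_label_f1(y_true, y_pred, i)
               for i in range(len(TOPICS))) / len(TOPICS)
```

This toy convention scores labels with no true positives as 0.0; a library implementation such as scikit-learn's `f1_score(average="macro")` offers configurable handling of that edge case.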