	---
	language: en
	license: mit
	tags:
	- text-classification
	- multi-label-classification
	- topic-classification
	- political-text
	- tweets
	- distilbert
	datasets:
	- thomasrenault/us_tweet_speech_congress
	metrics:
	- f1
	base_model: distilbert-base-uncased
	pipeline_tag: text-classification
	---

	# thomasrenault/topic

	A multi-label political topic classifier fine-tuned on US tweets, campaign speeches, and congressional speeches. It is built on `distilbert-base-uncased`; the training labels were produced by GPT-4o-mini through the OpenAI Batch API.

	## Labels

	The model outputs 7 independent topic indicators: each logit is passed through a sigmoid, and a topic is assigned when its probability is at least 0.5.
	A document can therefore belong to zero, one, or several topics at once.

	| Label | Description |
	|---|---|
	| `abortion` | Abortion rights and reproductive policy |
	| `democracy` | Elections, voting rights, democratic institutions |
	| `gender equality` | Gender rights, feminism, LGBTQ+ issues |
	| `gun control` | Firearms regulation, Second Amendment |
	| `immigration` | Immigration policy, border control, citizenship |
	| `tax and inequality` | Tax policy, economic inequality, redistribution |
	| `trade` | Trade policy, tariffs, international commerce |

	Documents for which no topic reaches the threshold are treated as `other topic`; this is a fallback, not an eighth model output.
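
To make the decision rule concrete, the sketch below applies a sigmoid to each of the seven logits independently and keeps every topic whose probability reaches 0.5. It uses only the standard library; the function name `decode_topics` and the example logits are illustrative assumptions, not part of the released code.

```python
import math

# Label order as listed in the table above (assumed to match the order of
# the model's output logits, as in the Usage snippet).
TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1), avoiding overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_topics(logits, threshold=0.5):
    """Return every topic whose sigmoid probability is >= threshold.

    Each label is decided independently, so the result may contain zero,
    one, or several topics. An empty match falls back to "other topic".
    """
    if len(logits) != len(TOPICS):
        raise ValueError(f"expected {len(TOPICS)} logits, got {len(logits)}")
    matched = [t for t, z in zip(TOPICS, logits) if sigmoid(z) >= threshold]
    return matched or ["other topic"]


# Hypothetical logits, chosen only to show the possible outcomes.
print(decode_topics([-4.0, 1.2, -3.0, -2.5, 2.1, -1.0, -3.3]))
# ['democracy', 'immigration']
print(decode_topics([-4.0] * 7))
# ['other topic']
```

Since a logit of 0 corresponds to a probability of exactly 0.5, thresholding at 0.5 is equivalent to checking whether the logit is non-negative.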

	## Training

	| Setting | Value |
	|---|---|
	| Base model | `distilbert-base-uncased` |
	| Architecture | `DistilBertForSequenceClassification` (multi-label) |
	| Problem type | `multi_label_classification` |
	| Training data | ~200,000 labeled documents |
	| Annotation | GPT-4o-mini (temperature=0) via OpenAI Batch API |
	| Epochs | 4 |
	| Learning rate | 2e-5 |
	| Batch size | 16 |
	| Max length | 512 tokens |
	| Classification threshold | 0.5 |
	| Domain | US tweets about policy, campaign speeches and congressional floor speeches |
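
With `problem_type` set to `multi_label_classification`, the Transformers sequence-classification heads train with a binary cross-entropy loss on the logits (`BCEWithLogitsLoss`), so each document needs a multi-hot target vector rather than a single class index. The standard-library sketch below shows that encoding and the loss for one document. The names `to_multi_hot` and `bce_with_logits` are illustrative, and the example annotation is made up.

```python
import math

TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]


def to_multi_hot(topics):
    """Encode a list of topic names as a 0/1 vector in TOPICS order."""
    unknown = set(topics) - set(TOPICS)
    if unknown:
        raise ValueError(f"unknown topics: {sorted(unknown)}")
    return [1.0 if t in topics else 0.0 for t in TOPICS]


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    losses = [max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
              for z, y in zip(logits, targets)]
    return sum(losses) / len(losses)


# Hypothetical annotation: a document tagged with two topics.
target = to_multi_hot(["immigration", "trade"])
print(target)
# [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0]

# All-zero logits mean p = 0.5 for every label, so the loss is ln(2).
print(round(bce_with_logits([0.0] * 7, target), 4))
# 0.6931
```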

	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	model_id = "thomasrenault/topic"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForSequenceClassification.from_pretrained(model_id)
	model.eval()

	# Topic names in the order of the model's output logits.
	TOPICS = ["abortion", "democracy", "gender equality", "gun control",
	          "immigration", "tax and inequality", "trade"]
	THRESHOLD = 0.5

	def predict(text):
	    """Return the topics whose sigmoid probability reaches THRESHOLD."""
	    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
	    with torch.no_grad():
	        # One independent probability per topic (multi-label, so no softmax).
	        probs = torch.sigmoid(model(**enc).logits).squeeze(0).tolist()
	    matched = [t for t, p in zip(TOPICS, probs) if p >= THRESHOLD]
	    return matched or ["other topic"]

	print(predict("We need stronger border security and immigration reform."))
	# ['immigration']

	print(predict("Tax cuts for the wealthy only increase inequality in America."))
	# ['tax and inequality']
	```
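
For corpus-scale work it is usually faster to score texts in padded batches than one at a time. The sketch below separates batching and thresholding (`predict_batch`) from the model call (`make_score_fn`), so the former can be checked without downloading weights. Both helper names, the batch size of 32, and the stand-in scorer are assumptions for illustration; only the tokenizer and model calls mirror the snippet above.

```python
TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]


def make_score_fn(tokenizer, model):
    """Wrap a loaded tokenizer/model pair as: texts -> rows of probabilities."""
    import torch  # imported lazily so the helpers below run without torch

    def score(texts):
        enc = tokenizer(list(texts), return_tensors="pt", padding=True,
                        truncation=True, max_length=512)
        with torch.no_grad():
            # No squeeze here: keep one row of 7 probabilities per text.
            return torch.sigmoid(model(**enc).logits).tolist()

    return score


def predict_batch(texts, score_fn, threshold=0.5, batch_size=32):
    """Return one list of topics per input text, scoring in batches."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for probs in score_fn(batch):
            matched = [t for t, p in zip(TOPICS, probs) if p >= threshold]
            results.append(matched or ["other topic"])
    return results


# Stand-in scorer with made-up probabilities, used only to exercise the logic.
def fake_score_fn(texts):
    return [[0.9 if "border" in text and topic == "immigration" else 0.1
             for topic in TOPICS] for text in texts]


print(predict_batch(["Secure the border.", "Good morning."], fake_score_fn))
# [['immigration'], ['other topic']]
```

With the real model, pass `make_score_fn(tokenizer, model)` in place of `fake_score_fn`.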

	## Intended Use

	- Academic research on political agenda-setting and issue salience
	- Topic trend analysis across congressional speeches and social media
	- Cross-platform comparison of elite vs. citizen political communication
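
As one way to approach the trend-analysis use case, the sketch below turns per-document predictions into topic shares per period. The record layout (`(period, topics)` pairs), the function name `topic_shares`, and the sample data are hypothetical; in practice the topic lists would come from the classifier.

```python
from collections import Counter, defaultdict


def topic_shares(records):
    """For each period, compute the fraction of documents tagged with each topic.

    records: iterable of (period, topics) pairs, where topics is a list of labels.
    Because the task is multi-label, shares within a period need not sum to 1.
    """
    doc_counts = Counter()
    topic_counts = defaultdict(Counter)
    for period, topics in records:
        doc_counts[period] += 1
        for topic in set(topics):  # count each topic once per document
            topic_counts[period][topic] += 1
    return {
        period: {t: n / doc_counts[period]
                 for t, n in sorted(topic_counts[period].items())}
        for period in sorted(doc_counts)
    }


# Made-up predictions for illustration only.
records = [
    ("2024-01", ["immigration"]),
    ("2024-01", ["immigration", "trade"]),
    ("2024-02", ["trade"]),
    ("2024-02", ["other topic"]),
]
print(topic_shares(records))
# {'2024-01': {'immigration': 1.0, 'trade': 0.5}, '2024-02': {'other topic': 0.5, 'trade': 0.5}}
```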

	## Limitations

	- Trained on US English political text, so it may not generalise to other political systems or languages
	- Annotation by GPT-4o-mini introduces model-specific biases in topic boundaries
	- Topics reflect the specific research agenda of the parent project; other salient topics (healthcare, climate, etc.) are out of scope
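
Because the training labels come from GPT-4o-mini, it can be worth checking predictions against a small hand-labelled sample before relying on them. The sketch below computes per-label and macro-averaged F1 for multi-label output using only the standard library. The function names and the three-document gold sample are hypothetical and invented for illustration.

```python
TOPICS = ["abortion", "democracy", "gender equality", "gun control",
          "immigration", "tax and inequality", "trade"]


def per_label_f1(gold, pred):
    """Return {topic: F1} given parallel lists of gold and predicted topic lists."""
    scores = {}
    for topic in TOPICS:
        tp = sum(topic in g and topic in p for g, p in zip(gold, pred))
        fp = sum(topic not in g and topic in p for g, p in zip(gold, pred))
        fn = sum(topic in g and topic not in p for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # Convention assumed here: F1 is 0.0 when a topic never appears
        # in either the gold labels or the predictions.
        scores[topic] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of the per-label F1 scores over all 7 topics."""
    scores = per_label_f1(gold, pred)
    return sum(scores.values()) / len(scores)


# Hypothetical hand labels and model predictions for three documents.
gold = [["immigration"], ["trade", "tax and inequality"], []]
pred = [["immigration"], ["trade"], ["gun control"]]
scores = per_label_f1(gold, pred)
print(scores["immigration"], scores["trade"],
      scores["tax and inequality"], scores["gun control"])
# 1.0 1.0 0.0 0.0
```

Topics that never occur in a small sample pull the macro average down under this convention, so a sample that covers every label gives a more informative number.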

	## Citation

	If you use this model, please cite:

	```bibtex
	@misc{renault2025topic,
	  author    = {Renault, Thomas},
	  title     = {thomasrenault/topic: Multi-label political topic classifier for US political text},
	  year      = {2025},
	  publisher = {HuggingFace},
	  url       = {https://huggingface.co/thomasrenault/topic}
	}
	```