Commit 619aa7c by ddore14 (verified, parent 6113166): Update README.md

---
language:
- en
license: apache-2.0
tags:
- bert
- sentence-transformers
- sentence-similarity
- political-nlp
- domain-adaptation
- political-debates
library_name: sentence-transformers
pipeline_tag: sentence-similarity
---

# Sentence-RooseBERT

**Sentence-RooseBERT** is a Sentence-BERT adaptation of [RooseBERT](https://arxiv.org/abs/2508.03250), a domain-specific language model pre-trained on English political debates and parliamentary speeches. It produces fixed-size sentence embeddings suited for semantic similarity, clustering, and retrieval tasks over political text.

> ⚠️ **This model has not yet been formally evaluated.** It is released as an experimental variant for the community to explore.

> 📄 **Paper:** [RooseBERT: A New Deal For Political Language Modelling](https://arxiv.org/abs/2508.03250)
> 💻 **GitHub:** [https://github.com/deborahdore/RooseBERT](https://github.com/deborahdore/RooseBERT)

---

## Training Data

Sentence-RooseBERT was pre-trained on **11 GB** of English political debate transcripts (1919–2025), including debates from Africa, Australia, Canada, Europe, Ireland, New Zealand, Scotland, the United Kingdom, the United States, the UN General Assembly, and the UN Security Council. See the base RooseBERT model cards for full details.

32
+ ## Intended Use
33
+
34
+ This model is intended for sentence-level tasks over political text, such as:
35
+
36
+ - **Semantic textual similarity** between debate passages or speeches
37
+ - **Semantic search and retrieval** over political corpora
38
+ - **Clustering** of political arguments or speeches by topic
39
+ - **Classification via embedding similarity** (e.g., zero-shot or few-shot)
40
+
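Classification via embedding similarity reduces to a nearest-label search over label-description embeddings. A minimal stdlib sketch of the idea; the vectors below are illustrative stand-ins for `model.encode(...)` outputs, and the label names are hypothetical:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stand-ins for embeddings of short label descriptions, e.g. model.encode(...)
label_embeddings = {
    "climate policy": [0.9, 0.1, 0.0],
    "healthcare": [0.1, 0.9, 0.0],
}
# Stand-in for the embedding of the sentence to classify.
query = [0.8, 0.2, 0.1]

# Zero-shot label: the label whose description embedding is closest to the query.
best = max(label_embeddings, key=lambda k: cosine_sim(query, label_embeddings[k]))
print(best)  # climate policy
```

With real embeddings the same argmax-over-cosine pattern applies; only the vectors change.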
---

## How to Use

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ddore14/Sentence-RooseBERT")

sentences = [
    "We must invest in renewable energy to combat climate change.",
    "The government's climate policy is failing future generations.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768)
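The embeddings returned above can drive semantic search directly: L2-normalize the vectors, then rank corpus entries by dot product with the query. A stdlib sketch; the corpus titles and vectors are illustrative stand-ins for real `model.encode(...)` outputs:

```python
import math

def normalize(v):
    # L2-normalize so that a plain dot product equals cosine similarity.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Stand-ins for embeddings of a small political-debate corpus.
corpus = {
    "Renewable energy investment speech": [0.9, 0.2, 0.1],
    "Healthcare funding debate": [0.1, 0.8, 0.3],
    "Climate policy criticism": [0.8, 0.1, 0.2],
}
# Stand-in for the embedding of the search query.
query_vec = normalize([0.85, 0.15, 0.1])

# Rank corpus entries by cosine similarity to the query, highest first.
ranked = sorted(
    corpus,
    key=lambda k: sum(q * d for q, d in zip(query_vec, normalize(corpus[k]))),
    reverse=True,
)
print(ranked[0])  # Renewable energy investment speech
```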

---

## Limitations

- This model has **not been formally evaluated** on any downstream benchmark; performance on political NLP tasks is unknown.
- It inherits any biases present in official political speech corpora, including geopolitical and linguistic over-representation.
- It is not suitable for generative tasks or token-level labelling.

---

## Related Models

| Model | Training | Casing | HuggingFace ID |
|---|---|---|---|
| RooseBERT-cont-cased | Continued pre-training | Cased | `ddore14/RooseBERT-cont-cased` |
| RooseBERT-cont-uncased | Continued pre-training | Uncased | `ddore14/RooseBERT-cont-uncased` |
| RooseBERT-scr-cased | From scratch | Cased | `ddore14/RooseBERT-scr-cased` |
| RooseBERT-scr-uncased | From scratch | Uncased | `ddore14/RooseBERT-scr-uncased` |

---

## Citation

If you use RooseBERT in your research, please cite:

```bibtex
@article{dore2025roosebert,
  title={RooseBERT: A New Deal For Political Language Modelling},
  author={Dore, Deborah and Cabrio, Elena and Villata, Serena},
  journal={arXiv preprint arXiv:2508.03250},
  year={2025}
}
```