Commit 619aa7c by ddore14 (verified, parent 6113166): Update README.md

---
language:
- en
license: apache-2.0
tags:
- bert
- sentence-transformers
- sentence-similarity
- political-nlp
- domain-adaptation
- political-debates
library_name: sentence-transformers
pipeline_tag: sentence-similarity
---

# Sentence-RooseBERT

**Sentence-RooseBERT** is a Sentence-BERT adaptation of [RooseBERT](https://arxiv.org/abs/2508.03250), a domain-specific language model pre-trained on English political debates and parliamentary speeches. It produces fixed-size sentence embeddings suited for semantic similarity, clustering, and retrieval tasks over political text.

> ⚠️ **This model has not yet been formally evaluated.** It is released as an experimental variant for the community to explore.

> 📄 **Paper:** [RooseBERT: A New Deal For Political Language Modelling](https://arxiv.org/abs/2508.03250)
> 💻 **GitHub:** [https://github.com/deborahdore/RooseBERT](https://github.com/deborahdore/RooseBERT)

---

## Training Data

Sentence-RooseBERT was pre-trained on **11 GB** of English political debate transcripts (1919–2025), including debates from Africa, Australia, Canada, Europe, Ireland, New Zealand, Scotland, the United Kingdom, the United States, the UN General Assembly, and the UN Security Council. See the base RooseBERT model cards for full details.

32
+ ## Intended Use
33
+
34
+ This model is intended for sentence-level tasks over political text, such as:
35
+
36
+ - **Semantic textual similarity** between debate passages or speeches
37
+ - **Semantic search and retrieval** over political corpora
38
+ - **Clustering** of political arguments or speeches by topic
39
+ - **Classification via embedding similarity** (e.g., zero-shot or few-shot)
40
+
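Classification via embedding similarity reduces to a nearest-label search over label-description embeddings. A minimal stdlib sketch of the idea; the vectors below are illustrative stand-ins for `model.encode(...)` outputs, and the label names are hypothetical:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stand-ins for embeddings of short label descriptions, e.g. model.encode(...)
label_embeddings = {
    "climate policy": [0.9, 0.1, 0.0],
    "healthcare": [0.1, 0.9, 0.0],
}
# Stand-in for the embedding of the sentence to classify.
query = [0.8, 0.2, 0.1]

# Zero-shot label: the label whose description embedding is closest to the query.
best = max(label_embeddings, key=lambda k: cosine_sim(query, label_embeddings[k]))
print(best)  # climate policy
```

With real embeddings the same argmax-over-cosine pattern applies; only the vectors change.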
---

## How to Use

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ddore14/Sentence-RooseBERT")

sentences = [
    "We must invest in renewable energy to combat climate change.",
    "The government's climate policy is failing future generations.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 768)
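The embeddings returned above can drive semantic search directly: L2-normalize the vectors, then rank corpus entries by dot product with the query. A stdlib sketch; the corpus titles and vectors are illustrative stand-ins for real `model.encode(...)` outputs:

```python
import math

def normalize(v):
    # L2-normalize so that a plain dot product equals cosine similarity.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Stand-ins for embeddings of a small political-debate corpus.
corpus = {
    "Renewable energy investment speech": [0.9, 0.2, 0.1],
    "Healthcare funding debate": [0.1, 0.8, 0.3],
    "Climate policy criticism": [0.8, 0.1, 0.2],
}
# Stand-in for the embedding of the search query.
query_vec = normalize([0.85, 0.15, 0.1])

# Rank corpus entries by cosine similarity to the query, highest first.
ranked = sorted(
    corpus,
    key=lambda k: sum(q * d for q, d in zip(query_vec, normalize(corpus[k]))),
    reverse=True,
)
print(ranked[0])  # Renewable energy investment speech
```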

---

## Limitations

- This model has **not been formally evaluated** on any downstream benchmark; performance on political NLP tasks is unknown.
- It inherits any biases present in official political speech corpora, including geopolitical and linguistic over-representation.
- It is not suitable for generative tasks or token-level labelling.

---

## Related Models

| Model | Training | Casing | HuggingFace ID |
|---|---|---|---|
| RooseBERT-cont-cased | Continued pre-training | Cased | `ddore14/RooseBERT-cont-cased` |
| RooseBERT-cont-uncased | Continued pre-training | Uncased | `ddore14/RooseBERT-cont-uncased` |
| RooseBERT-scr-cased | From scratch | Cased | `ddore14/RooseBERT-scr-cased` |
| RooseBERT-scr-uncased | From scratch | Uncased | `ddore14/RooseBERT-scr-uncased` |

---

## Citation

If you use RooseBERT in your research, please cite:

```bibtex
@article{dore2025roosebert,
  title={RooseBERT: A New Deal For Political Language Modelling},
  author={Dore, Deborah and Cabrio, Elena and Villata, Serena},
  journal={arXiv preprint arXiv:2508.03250},
  year={2025}
}
```