# DeRooseBERTa
DeRooseBERTa is a domain-specific language model pre-trained on English political debates and parliamentary speeches, built on the DeBERTa-v2 architecture. It extends the RooseBERT family by leveraging DeBERTa-v2's disentangled attention mechanism and enhanced mask decoder, which improve upon standard BERT by separately encoding content and positional information at each attention layer.
This variant was trained via continued pre-training (CONT) of microsoft/deberta-v2-base on the same 11GB political debate corpus used to train RooseBERT, adapting its representations to the distinctive features of political discourse: domain-specific terminology, implicit argumentation, and strategic communication patterns.
⚠️ This model has not yet been formally evaluated. It is released as an experimental variant for the community to explore.
📄 Paper: RooseBERT: A New Deal For Political Language Modelling
💻 GitHub: https://github.com/deborahdore/RooseBERT
## Training Data
DeRooseBERTa was pre-trained on 11GB of English political debate transcripts spanning 1919–2025, drawn from:
| Source | Coverage | Size |
|---|---|---|
| African Parliamentary Debates (Ghana & South Africa) | 1999–2024 | 573 MB |
| Australian Parliamentary Debates | 1998–2025 | 1 GB |
| Canadian Parliamentary Debates | 1994–2025 | 1.1 GB |
| European Parliamentary Debates (EUSpeech) | 2007–2015 | 110 MB |
| Irish Parliamentary Debates | 1919–2019 | ~3.4 GB |
| New Zealand Parliamentary Debates (ParlSpeech) | 1987–2019 | 791 MB |
| Scottish Parliamentary Debates (ParlScot) | –2021 | 443 MB |
| UK House of Commons Debates | 1979–2019 | 2.6 GB |
| UN General Debate Corpus (UNGDC) | 1946–2023 | 186 MB |
| UN Security Council Debates (UNSC) | 1992–2023 | 387 MB |
| US Presidential & Primary Debates | 1960–2024 | 16 MB |
All datasets were sourced from authoritative, official political sources. Pre-processing removed hyperlinks and markup tags and collapsed runs of whitespace.
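The cleaning steps described above can be approximated with a short stdlib-only sketch (the function name and regexes are illustrative, not the exact pipeline used for training):

```python
import re

def clean_transcript(text: str) -> str:
    """Strip hyperlinks and markup tags, then collapse whitespace,
    mirroring the pre-processing steps described above (approximate)."""
    text = re.sub(r"https?://\S+", " ", text)  # remove hyperlinks
    text = re.sub(r"<[^>]+>", " ", text)       # remove markup tags
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

print(clean_transcript("<p>See https://example.com  now</p>"))  # → "See now"
```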
## Intended Use
DeRooseBERTa is intended as a base model for fine-tuning on downstream NLP tasks related to political discourse analysis. It is especially well-suited for:
- Sentiment Analysis of parliamentary speeches and debates
- Stance Detection (support/oppose classification)
- Argument Component Detection and Classification (claims and premises)
- Argument Relation Prediction and Classification (support/attack/no-relation)
- Motion Policy Classification
- Named Entity Recognition in political texts
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ddore14/DeRooseBERTa")
model = AutoModelForMaskedLM.from_pretrained("ddore14/DeRooseBERTa")
```
For fine-tuning on a downstream classification task:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ddore14/DeRooseBERTa")
model = AutoModelForSequenceClassification.from_pretrained(
    "ddore14/DeRooseBERTa",
    num_labels=2,
)

# Recommended fine-tuning hyperparameters (following the RooseBERT paper):
# learning_rate ∈ {2e-5, 3e-5, 5e-5}
# batch_size    ∈ {8, 16, 32}
# epochs        ∈ {2, 3, 4}
```
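The recommended ranges above define a small grid of 27 runs. A minimal pure-Python sketch for enumerating it (the helper name is illustrative; each resulting config would feed one fine-tuning run):

```python
from itertools import product

# Hyperparameter grid from the RooseBERT paper's fine-tuning recipe.
GRID = {
    "learning_rate": [2e-5, 3e-5, 5e-5],
    "batch_size": [8, 16, 32],
    "epochs": [2, 3, 4],
}

def iter_configs(grid):
    """Yield every hyperparameter combination in the grid as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(GRID))
print(len(configs))  # → 27
```

Each config can then be passed to a standard fine-tuning loop, keeping the run with the best validation score.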
Note: DeBERTa-v2 uses a SentencePiece tokenizer. Ensure the `sentencepiece` package is installed: `pip install sentencepiece`.
## Limitations
- This model has not been formally evaluated on any downstream benchmark. Performance on political NLP tasks is unknown.
- DeRooseBERTa inherits any biases present in official political speech corpora, including geopolitical and linguistic over-representation.
- Because this is a CONT model, it retains DeBERTa-v2's standard SentencePiece vocabulary. Domain-specific political terms may still be split into sub-tokens.
- The model was trained exclusively on English political debates. Cross-lingual use is not supported.
- As with all encoder-only models, DeRooseBERTa is best suited to classification and labelling tasks rather than generation.
## Related Models
| Model | Architecture | Training | HuggingFace ID |
|---|---|---|---|
| RooseBERT-cont-cased | BERT-base | Continued pre-training, cased | ddore14/RooseBERT-cont-cased |
| RooseBERT-cont-uncased | BERT-base | Continued pre-training, uncased | ddore14/RooseBERT-cont-uncased |
| RooseBERT-scr-cased | BERT-base | From scratch, cased | ddore14/RooseBERT-scr-cased |
| RooseBERT-scr-uncased | BERT-base | From scratch, uncased | ddore14/RooseBERT-scr-uncased |
| RooseBERT-SBERT | BERT-base | Sentence-BERT adaptation | ddore14/RooseBERT-SBERT |
| DeRooseBERTa (this model) | DeBERTa-v2-base | Continued pre-training | ddore14/DeRooseBERTa |
## Citation
If you use DeRooseBERTa in your research, please cite:
```bibtex
@article{dore2025roosebert,
  title={RooseBERT: A New Deal For Political Language Modelling},
  author={Dore, Deborah and Cabrio, Elena and Villata, Serena},
  journal={arXiv preprint arXiv:2508.03250},
  year={2025}
}
```