# DeRooseBERTa
DeRooseBERTa is a domain-specific language model pre-trained on English political debates and parliamentary speeches, built on the DeBERTa-v2 architecture. It extends the RooseBERT family by leveraging DeBERTa-v2's disentangled attention mechanism and enhanced mask decoder, which improve upon standard BERT by separately encoding content and positional information at each attention layer.
This variant was trained via continued pre-training (CONT) of microsoft/deberta-v2-base on the same 11GB political debate corpus used to train RooseBERT, adapting its representations to the distinctive features of political discourse: domain-specific terminology, implicit argumentation, and strategic communication patterns.
⚠️ This model has not yet been formally evaluated. It is released as an experimental variant for the community to explore.
📄 Paper: RooseBERT: A New Deal For Political Language Modelling
💻 GitHub: https://github.com/deborahdore/RooseBERT
## Training Data
DeRooseBERTa was pre-trained on 11GB of English political debate transcripts spanning 1919–2025, drawn from:
| Source | Coverage | Size |
|---|---|---|
| African Parliamentary Debates (Ghana & South Africa) | 1999–2024 | 573 MB |
| Australian Parliamentary Debates | 1998–2025 | 1 GB |
| Canadian Parliamentary Debates | 1994–2025 | 1.1 GB |
| European Parliamentary Debates (EUSpeech) | 2007–2015 | 110 MB |
| Irish Parliamentary Debates | 1919–2019 | ~3.4 GB |
| New Zealand Parliamentary Debates (ParlSpeech) | 1987–2019 | 791 MB |
| Scottish Parliamentary Debates (ParlScot) | –2021 | 443 MB |
| UK House of Commons Debates | 1979–2019 | 2.6 GB |
| UN General Debate Corpus (UNGDC) | 1946–2023 | 186 MB |
| UN Security Council Debates (UNSC) | 1992–2023 | 387 MB |
| US Presidential & Primary Debates | 1960–2024 | 16 MB |
All datasets were sourced from authoritative, official political sources. Pre-processing removed hyperlinks and markup tags and collapsed runs of whitespace.
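The cleaning steps described above can be approximated with a short stdlib-only sketch (the function name and regexes are illustrative, not the exact pipeline used for training):

```python
import re

def clean_transcript(text: str) -> str:
    """Strip hyperlinks and markup tags, then collapse whitespace,
    mirroring the pre-processing steps described above (approximate)."""
    text = re.sub(r"https?://\S+", " ", text)  # remove hyperlinks
    text = re.sub(r"<[^>]+>", " ", text)       # remove markup tags
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

print(clean_transcript("<p>See https://example.com  now</p>"))  # → "See now"
```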
## Intended Use
DeRooseBERTa is intended as a base model for fine-tuning on downstream NLP tasks related to political discourse analysis. It is especially well-suited for:
- Sentiment Analysis of parliamentary speeches and debates
- Stance Detection (support/oppose classification)
- Argument Component Detection and Classification (claims and premises)
- Argument Relation Prediction and Classification (support/attack/no-relation)
- Motion Policy Classification
- Named Entity Recognition in political texts
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ddore14/DeRooseBERTa")
model = AutoModelForMaskedLM.from_pretrained("ddore14/DeRooseBERTa")
```
For fine-tuning on a downstream classification task:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ddore14/DeRooseBERTa")
model = AutoModelForSequenceClassification.from_pretrained(
    "ddore14/DeRooseBERTa",
    num_labels=2,
)

# Recommended fine-tuning hyperparameters (following the RooseBERT paper):
# learning_rate ∈ {2e-5, 3e-5, 5e-5}
# batch_size    ∈ {8, 16, 32}
# epochs        ∈ {2, 3, 4}
```
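The recommended ranges above define a small grid of 27 runs. A minimal pure-Python sketch for enumerating it (the helper name is illustrative; each resulting config would feed one fine-tuning run):

```python
from itertools import product

# Hyperparameter grid from the RooseBERT paper's fine-tuning recipe.
GRID = {
    "learning_rate": [2e-5, 3e-5, 5e-5],
    "batch_size": [8, 16, 32],
    "epochs": [2, 3, 4],
}

def iter_configs(grid):
    """Yield every hyperparameter combination in the grid as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(GRID))
print(len(configs))  # → 27
```

Each config can then be passed to a standard fine-tuning loop, keeping the run with the best validation score.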
Note: DeBERTa-v2 uses a SentencePiece tokenizer. Ensure the `sentencepiece` package is installed: `pip install sentencepiece`.
## Limitations
- This model has not been formally evaluated on any downstream benchmark. Performance on political NLP tasks is unknown.
- DeRooseBERTa inherits any biases present in official political speech corpora, including geopolitical and linguistic over-representation.
- Because this is a CONT model, it retains DeBERTa-v2's standard SentencePiece vocabulary. Domain-specific political terms may still be split into sub-tokens.
- The model was trained exclusively on English political debates. Cross-lingual use is not supported.
- As with all encoder-only models, DeRooseBERTa is best suited to classification and labelling tasks rather than generation.
## Related Models
| Model | Architecture | Training | HuggingFace ID |
|---|---|---|---|
| RooseBERT-cont-cased | BERT-base | Continued pre-training, cased | ddore14/RooseBERT-cont-cased |
| RooseBERT-cont-uncased | BERT-base | Continued pre-training, uncased | ddore14/RooseBERT-cont-uncased |
| RooseBERT-scr-cased | BERT-base | From scratch, cased | ddore14/RooseBERT-scr-cased |
| RooseBERT-scr-uncased | BERT-base | From scratch, uncased | ddore14/RooseBERT-scr-uncased |
| RooseBERT-SBERT | BERT-base | Sentence-BERT adaptation | ddore14/RooseBERT-SBERT |
| DeRooseBERTa (this model) | DeBERTa-v2-base | Continued pre-training | ddore14/DeRooseBERTa |
## Citation
If you use DeRooseBERTa in your research, please cite:
```bibtex
@article{dore2025roosebert,
  title={RooseBERT: A New Deal For Political Language Modelling},
  author={Dore, Deborah and Cabrio, Elena and Villata, Serena},
  journal={arXiv preprint arXiv:2508.03250},
  year={2025}
}
```