DeRooseBERTa

DeRooseBERTa is a domain-specific language model pre-trained on English political debates and parliamentary speeches, built on the DeBERTa-v2 architecture. It extends the RooseBERT family by leveraging DeBERTa-v2's disentangled attention mechanism and enhanced mask decoder, which improve upon standard BERT by separately encoding content and positional information at each attention layer.

This variant was trained via continued pre-training (CONT) of microsoft/deberta-v2-base on the same 11GB political debate corpus used to train RooseBERT, adapting its representations to the distinctive features of political discourse β€” domain-specific terminology, implicit argumentation, and strategic communication patterns.

⚠️ This model has not yet been formally evaluated. It is released as an experimental variant for the community to explore.

πŸ“„ Paper: RooseBERT: A New Deal For Political Language Modelling
πŸ’» GitHub: https://github.com/deborahdore/RooseBERT


Training Data

DeRooseBERTa was pre-trained on 11GB of English political debate transcripts spanning 1919–2025, drawn from:

| Source | Coverage | Size |
|---|---|---|
| African Parliamentary Debates (Ghana & South Africa) | 1999–2024 | 573 MB |
| Australian Parliamentary Debates | 1998–2025 | 1 GB |
| Canadian Parliamentary Debates | 1994–2025 | 1.1 GB |
| European Parliamentary Debates (EUSpeech) | 2007–2015 | 110 MB |
| Irish Parliamentary Debates | 1919–2019 | ~3.4 GB |
| New Zealand Parliamentary Debates (ParlSpeech) | 1987–2019 | 791 MB |
| Scottish Parliamentary Debates (ParlScot) | –2021 | 443 MB |
| UK House of Commons Debates | 1979–2019 | 2.6 GB |
| UN General Debate Corpus (UNGDC) | 1946–2023 | 186 MB |
| UN Security Council Debates (UNSC) | 1992–2023 | 387 MB |
| US Presidential & Primary Debates | 1960–2024 | 16 MB |

All datasets were sourced from authoritative, official political settings. Pre-processing removed hyperlinks and markup tags and collapsed whitespace.
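The cleaning steps described above can be approximated with a few regular expressions. This is a minimal sketch, not the authors' exact pipeline; `clean_transcript` is a hypothetical helper name.

```python
import re

def clean_transcript(text: str) -> str:
    """Sketch of the described pre-processing: drop hyperlinks and
    markup tags, then collapse runs of whitespace."""
    text = re.sub(r"https?://\S+", " ", text)  # remove hyperlinks
    text = re.sub(r"<[^>]+>", " ", text)       # remove markup tags
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

print(clean_transcript("<p>See  https://example.org   the\tHouse rose.</p>"))
# -> See the House rose.
```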


Intended Use

DeRooseBERTa is intended as a base model for fine-tuning on downstream NLP tasks related to political discourse analysis. It is especially well-suited for:

  • Sentiment Analysis of parliamentary speeches and debates
  • Stance Detection (support/oppose classification)
  • Argument Component Detection and Classification (claims and premises)
  • Argument Relation Prediction and Classification (support/attack/no-relation)
  • Motion Policy Classification
  • Named Entity Recognition in political texts

How to Use

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ddore14/DeRooseBERTa")
model = AutoModelForMaskedLM.from_pretrained("ddore14/DeRooseBERTa")
```
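For a quick smoke test of the masked-language-model head, the `fill-mask` pipeline can be used directly. This is a sketch; the example sentence is illustrative and the top predictions depend on the released checkpoint's weights.

```python
from transformers import pipeline

# Load the model behind a fill-mask pipeline (downloads the checkpoint).
fill = pipeline("fill-mask", model="ddore14/DeRooseBERTa")

# Ask the model to fill the masked token in a parliamentary-style sentence.
preds = fill("The honourable member for [MASK] has the floor.")
for pred in preds:
    print(f"{pred['token_str']!r}: {pred['score']:.3f}")
```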

For fine-tuning on a downstream classification task:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ddore14/DeRooseBERTa")
model = AutoModelForSequenceClassification.from_pretrained(
    "ddore14/DeRooseBERTa",
    num_labels=2
)

# Recommended fine-tuning hyperparameters (following RooseBERT paper):
# learning_rate ∈ {2e-5, 3e-5, 5e-5}
# batch_size ∈ {8, 16, 32}
# epochs ∈ {2, 3, 4}
```

Note: DeBERTa-v2 uses a SentencePiece tokenizer. Ensure you install the sentencepiece package: pip install sentencepiece.
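Before launching a fine-tuning run, it can be useful to sanity-check that the freshly initialized classification head produces logits of the expected shape. A minimal sketch (the example sentences are illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ddore14/DeRooseBERTa")
model = AutoModelForSequenceClassification.from_pretrained(
    "ddore14/DeRooseBERTa", num_labels=2
)

texts = ["I rise in support of this motion.", "This bill must be rejected."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# One forward pass, no gradients: logits should be (batch_size, num_labels).
with torch.no_grad():
    logits = model(**batch).logits

print(logits.shape)  # torch.Size([2, 2])
```

Note that the classification head is randomly initialized until fine-tuned, so the logit values themselves are not meaningful at this stage.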


Limitations

  • This model has not been formally evaluated on any downstream benchmark. Performance on political NLP tasks is unknown.
  • DeRooseBERTa inherits any biases present in official political speech corpora, including geopolitical and linguistic over-representation.
  • Because this is a CONT model, it retains DeBERTa-v2's standard SentencePiece vocabulary. Domain-specific political terms may still be split into sub-tokens.
  • Trained exclusively on English political debates. Cross-lingual use is not supported.
  • As with all encoder-only models, DeRooseBERTa is best suited to classification and labelling tasks rather than generation.

Related Models

| Model | Architecture | Training | HuggingFace ID |
|---|---|---|---|
| RooseBERT-cont-cased | BERT-base | Continued pre-training, cased | ddore14/RooseBERT-cont-cased |
| RooseBERT-cont-uncased | BERT-base | Continued pre-training, uncased | ddore14/RooseBERT-cont-uncased |
| RooseBERT-scr-cased | BERT-base | From scratch, cased | ddore14/RooseBERT-scr-cased |
| RooseBERT-scr-uncased | BERT-base | From scratch, uncased | ddore14/RooseBERT-scr-uncased |
| RooseBERT-SBERT | BERT-base | Sentence-BERT adaptation | ddore14/RooseBERT-SBERT |
| DeRooseBERTa (this model) | DeBERTa-v2-base | Continued pre-training | ddore14/DeRooseBERTa |

Citation

If you use DeRooseBERTa in your research, please cite:

@article{dore2025roosebert,
  title={RooseBERT: A New Deal For Political Language Modelling},
  author={Dore, Deborah and Cabrio, Elena and Villata, Serena},
  journal={arXiv preprint arXiv:2508.03250},
  year={2025}
}