DeBERTa-CRF-VotIE: Portuguese Voting Information Extraction

This model is a fine-tuned DeBERTa v3 Base model with a Conditional Random Field (CRF) layer for extracting structured voting information from Portuguese municipal meeting minutes. It achieves state-of-the-art performance on the VotIE benchmark dataset.

Model Description

DeBERTa-CRF-VotIE combines the robust contextual representations of Microsoft's DeBERTa v3 multilingual base model with a CRF layer for structured sequence prediction. The model performs token-level classification to identify and extract voting-related entities from Portuguese administrative text.

Key Features

  • Architecture: DeBERTa v3 Base (768-dim, 12 layers) + Linear + CRF
  • Task: Sequence labeling with BIO tagging
  • Language: Portuguese (Portugal)
  • Domain: Municipal meeting minutes and voting records
  • Entity Types: 8 types (17 labels with BIO encoding)
  • Performance: 93.00% entity-level F1 score

Intended Uses

This model is designed for:

  • Extracting voting information from Portuguese municipal documents
  • Identifying participants and their voting positions (favor, against, abstention, absent)
  • Recognizing voting subjects and counting methods
  • Structuring unstructured administrative text
  • Research in information extraction from Portuguese administrative documents

Entity Types

The model recognizes 8 entity types in BIO format (17 labels total):

| Entity Type | Description | Example |
|---|---|---|
| VOTER-FAVOR | Participants who voted in favor | "The Municipal Executive" |
| VOTER-AGAINST | Participants who voted against | "João Silva" |
| VOTER-ABSTENTION | Participants who abstained | "The councilor from PS" |
| VOTER-ABSENT | Participants who were absent | "Ana Simões" |
| VOTING | Voting action expressions | "deliberado", "aprovado" |
| SUBJECT | The subject matter being voted on | "budget changes" |
| COUNTING-UNANIMITY | Unanimous vote indicators | "unanimously" |
| COUNTING-MAJORITY | Majority vote indicators | "by majority" |
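The 17-label count follows from BIO encoding: each of the 8 entity types gets a B- and an I- prefix, plus the single O tag. A quick sketch (the label order in the actual model config may differ):

```python
# Reconstruct the BIO label set from the 8 entity types above.
ENTITY_TYPES = [
    "VOTER-FAVOR", "VOTER-AGAINST", "VOTER-ABSTENTION", "VOTER-ABSENT",
    "VOTING", "SUBJECT", "COUNTING-UNANIMITY", "COUNTING-MAJORITY",
]

# One B- and one I- label per type, plus the O (outside) tag.
LABELS = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]

assert len(LABELS) == 17  # 8 types x 2 prefixes + O
```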

Training Details

Training Data

The model was trained on the VotIE dataset, which consists of Portuguese municipal meeting minutes annotated with voting information:

  • Training set: 1,737 examples
  • Validation set: 433 examples
  • Test set: 433 examples
  • Total tokens: ~300K tokens
  • Total entities: ~5K entities

Training Procedure

Hyperparameters:

  • Base model: microsoft/deberta-v3-base
  • Batch size: 16
  • Learning rate: 5e-5 (linear decay with warmup)
  • Warmup proportion: 10%
  • Weight decay: 0.01
  • Dropout: 0.1
  • Max sequence length: 512 tokens
  • Epochs: 10
  • Optimizer: AdamW
  • Training time: ~1.5 hours on NVIDIA L40 GPU
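The linear-decay-with-warmup schedule can be sketched in plain Python. The step counts below are estimated from the dataset size, batch size, and epoch count above; they are illustrative, not taken from the training code:

```python
def lr_at_step(step, base_lr=5e-5, warmup_prop=0.1, total_steps=1090):
    """Linear warmup followed by linear decay.

    total_steps ~= ceil(1737 / 16) * 10 epochs = 1090 (an estimate).
    """
    warmup_steps = int(warmup_prop * total_steps)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr back to 0 over the remaining steps.
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```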

Training Details:

  • Class imbalance handling with weighted loss (O-tag weight: 0.01)
  • O-tag bias initialization (bias: 6.0) to prevent model collapse
  • Windowing for long documents (512 tokens with 50-token overlap)
  • Early stopping with patience=3 epochs
  • BIO constraint validation during evaluation
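The windowing step for long documents can be illustrated with a minimal helper. The function name and exact boundary handling are assumptions for illustration, not the training code:

```python
def make_windows(token_ids, window=512, overlap=50):
    """Split a long token sequence into overlapping windows.

    Mirrors the scheme described above: each window holds up to
    `window` tokens, and consecutive windows share `overlap` tokens.
    """
    stride = window - overlap
    windows = []
    for start in range(0, max(len(token_ids) - overlap, 1), stride):
        windows.append(token_ids[start:start + window])
    return windows
```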

Results

Entity-Level Performance (Test Set)

| Metric | Score |
|---|---|
| F1 Score | 93.00% |
| Precision | 91.08% |
| Recall | 95.01% |
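As a quick sanity check, the reported F1 score is the harmonic mean of the precision and recall above:

```python
# F1 = 2PR / (P + R), using the precision and recall from the table.
p, r = 0.9108, 0.9501
f1 = 2 * p * r / (p + r)
print(round(f1 * 100, 2))  # → 93.0
```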

Per-Entity Performance

| Entity Type | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| COUNTING-MAJORITY | 92.86% | 100.00% | 96.30% | 52 |
| COUNTING-UNANIMITY | 94.47% | 100.00% | 97.16% | 222 |
| SUBJECT | 84.22% | 84.45% | 84.34% | 373 |
| VOTER-ABSENT | 95.45% | 95.45% | 95.45% | 22 |
| VOTER-ABSTENTION | 88.46% | 100.00% | 93.88% | 138 |
| VOTER-AGAINST | 97.44% | 95.00% | 96.20% | 40 |
| VOTER-FAVOR | 92.19% | 97.25% | 94.66% | 255 |
| VOTING | 94.50% | 98.26% | 96.34% | 402 |

Comparison with Other Models

This model achieves the best performance among all tested architectures on the VotIE dataset:

| Model | Architecture | Entity F1 | Event F1 |
|---|---|---|---|
| DeBERTa-CRF | DeBERTa v3 + CRF | 93.0% | 90.8% |
| XLM-R-CRF | XLM-RoBERTa + CRF | 92.6% | 90.3% |
| BERTimbau-CRF | BERTimbau + CRF | 92.4% | 89.9% |
| DeBERTa-Linear | DeBERTa v3 + Linear | 92.1% | 88.7% |

Full results are available in the VotIE paper.

Usage

Quick Start

The simplest way to use the model:

from transformers import AutoTokenizer, AutoModel

# Load model
model_name = "Anonymous3445/DeBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Analyze text
text = "O Executivo deliberou aprovar o projeto por unanimidade."
inputs = tokenizer(text, return_tensors="pt")
predictions = model.decode(**inputs, tokenizer=tokenizer, text=text)

# Print results
for pred in predictions:
    print(f"{pred['word']:20} {pred['label']}")

Output:

O                    B-VOTER-FAVOR
Executivo            I-VOTER-FAVOR
deliberou            B-VOTING
aprovar              O
o                    O
projeto              O
por                  B-COUNTING-UNANIMITY
unanimidade.         I-COUNTING-UNANIMITY

Extract Entities

Get structured entities from voting documents:

from transformers import AutoTokenizer, AutoModel

model_name = "Anonymous3445/DeBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = """A Câmara Municipal deliberou aprovar a proposta apresentada pelo
Senhor Presidente. Votaram a favor os Senhores Vereadores João Silva e
Maria Costa. Votou contra o Senhor Vereador Pedro Santos."""

inputs = tokenizer(text, return_tensors="pt")
predictions = model.decode(**inputs, tokenizer=tokenizer, text=text)

# Extract entities by type
entities = {}
current_entity = []
current_type = None

def flush():
    """Close the entity currently being built, if any, and reset state."""
    global current_entity, current_type
    if current_entity:
        entity_type = current_type[2:]  # strip the "B-"/"I-" prefix
        entities.setdefault(entity_type, []).append(' '.join(current_entity))
    current_entity = []
    current_type = None

for pred in predictions:
    label = pred['label']
    word = pred['word']

    if label.startswith('B-'):
        flush()  # save the previous entity, then start a new one
        current_entity = [word]
        current_type = label

    elif label.startswith('I-') and current_entity:
        current_entity.append(word)

    else:  # O tag
        flush()

flush()  # save the last entity

# Print entities
for entity_type, entity_list in entities.items():
    print(f"\n{entity_type}:")
    for entity in entity_list:
        print(f"  - {entity}")

Output:

VOTER-FAVOR:
  - A Câmara Municipal
  - João Silva
  - Maria Costa

VOTING:
  - deliberou

VOTER-AGAINST:
  - Pedro Santos

With Character Offsets

Useful for highlighting entities in your UI:

from transformers import AutoTokenizer, AutoModel

model_name = "Anonymous3445/DeBERTa-CRF-VotIE"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

text = "O Executivo deliberou aprovar o projeto por unanimidade."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions with character positions
predictions = model.decode(**inputs, tokenizer=tokenizer, text=text, return_offsets=True)

# Show only entities (non-O tags)
for pred in predictions:
    if pred['label'] != 'O':
        print(f"{pred['word']:20} {pred['label']:25} [{pred['start']}:{pred['end']}]")

Output:

O                    B-VOTER-FAVOR             [0:1]
Executivo            I-VOTER-FAVOR             [1:11]
deliberou            B-VOTING                  [11:21]
por                  B-COUNTING-UNANIMITY      [39:43]
unanimidade.         I-COUNTING-UNANIMITY      [43:56]
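One hypothetical way to use these offsets for highlighting is to wrap each predicted span in inline brackets. The `highlight` helper and the hard-coded spans below are illustrative assumptions, not model output:

```python
def highlight(text, spans):
    """Wrap each span in [LABEL ...] brackets, working right to left
    so earlier character offsets stay valid as the string grows."""
    out = text
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        out = (out[:span["start"]]
               + f"[{span['label']} {out[span['start']:span['end']]}]"
               + out[span["end"]:])
    return out

text = "O Executivo deliberou aprovar o projeto por unanimidade."
# Hard-coded example spans (entity-level, offsets counted by hand):
spans = [
    {"label": "VOTER-FAVOR", "start": 0, "end": 11},
    {"label": "VOTING", "start": 12, "end": 21},
]
print(highlight(text, spans))
# → [VOTER-FAVOR O Executivo] [VOTING deliberou] aprovar o projeto por unanimidade.
```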

Limitations and Bias

Limitations

  • Domain-specific: Trained specifically on Portuguese municipal meeting minutes; may not generalize well to other document types
  • Portuguese only: Optimized for European Portuguese
  • Sequence length: Limited to 512 tokens per window (handles longer documents via windowing)
  • Entity types: Limited to 8 predefined voting-related entity types
  • Complex sentences: May struggle with highly complex or nested voting descriptions

Bias Considerations

  • Geographic bias: Training data predominantly from Portuguese municipalities; may not capture regional variations
  • Temporal bias: Training data from municipal minutes of specific time periods
  • Formality bias: Trained on formal administrative language; informal voting descriptions may be less accurate
  • Class imbalance: O-tag (non-entity) tokens significantly outnumber entity tokens, and some voter types are rare; addressed with class weighting

Model Card Authors

  • Anonymous Authors (for blind review)

Model Card Contact

For questions or issues, please open an issue in the GitHub repository.

Additional Resources

License

This project uses a custom dual-license based on AGPL v3.

See the full license terms here: LICENSE

Acknowledgments

This work builds upon:

  • DeBERTa v3: Microsoft's DeBERTa v3 multilingual base model
  • pytorch-crf: CRF implementation by kmkurn
  • Transformers: Hugging Face Transformers library

Model training was conducted on NVIDIA L40 GPU infrastructure.


Version: 1.0
Last Updated: 2025-10-17
Framework: PyTorch + Transformers + torchcrf
