Citilink-XLMR-large-Metadata-en-baseline: Metadata Extraction for Municipal Meeting Minutes

Model Description

Citilink-XLMR-large-Metadata-en-baseline is a baseline implementation, consisting of a fine-tuned XLM-RoBERTa model for Named Entity Recognition (NER), that automatically extracts metadata such as meeting number, date, location, participants, and time expressions from English municipal meeting minutes.
It was developed as part of a study on information extraction and indexing of administrative documents.

Key Features

  • 🏛️ Domain-Specific: Trained exclusively on authentic municipal meeting minutes (translated to English)
  • 🧠 Entity-Level Extraction: Identifies key metadata (minute ID, date, location, start/end times, participants)
  • ⚙️ Transformer-based Architecture: XLM-RoBERTa backbone with fine-tuning for token classification
  • 📈 Strong NER Performance: Achieves a micro F1-score of 0.94 on the English test set

Model Details

  • Base Model: xlm-roberta-large
  • Architecture: XLM-RoBERTa for token classification (NER)
  • Parameters: ~560M
    • 24 transformer layers
    • 1024 hidden dimensions
    • 16 attention heads
    • 4096 intermediate size
  • Max Sequence Length: 512 tokens
  • Learning Rate: 5e-5
  • Warmup Ratio: 0.1
  • Batch Size: 16
  • Optimizer: AdamW
  • Weight Decay: 0.01
  • Entity Types: minute_id, date, meeting_type, location, begin_time, end_time, participant
  • Training Framework: PyTorch + Hugging Face Transformers
  • Evaluation Metric: Micro F1-score, Precision and Recall (seqeval)
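Under the BIO scheme, the seven entity types above expand into a 15-label inventory (a B- and an I- tag per type, plus O). A minimal sketch of that label mapping — the exact label ordering in the released checkpoint may differ:

```python
# BIO label inventory for the seven entity types listed above.
ENTITY_TYPES = [
    "minute_id", "date", "meeting_type", "location",
    "begin_time", "end_time", "participant",
]
labels = ["O"] + [f"{prefix}-{etype}"
                  for etype in ENTITY_TYPES
                  for prefix in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}

print(len(labels))  # 7 types * 2 (B/I) + "O" = 15 labels
```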

How It Works

The model assigns a label to each token in the input sequence, using the BIO scheme (Begin–Inside–Outside).
It can recognize and extract structured metadata from free-form text, even when expressed with stylistic variation.
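For instance, under BIO a multi-token date receives a B- tag on its first token and I- tags on the rest (the tokens and labels below are illustrative):

```python
# Illustrative BIO tagging of a short sentence, one label per token.
tokens = ["Meeting", "held", "on", "October", "22", ",", "2021", "at", "Alandroal"]
labels = ["O", "O", "O", "B-date", "I-date", "I-date", "I-date", "O", "B-location"]

for tok, lab in zip(tokens, labels):
    print(f"{tok:>10}  {lab}")
```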

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
MODEL_NAME = "inesctec/Citilink-XLMR-large-Metadata-en-baseline"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
model.eval()

# Example text
text = "MINUTES NO. 2 – Mandate 2021-2025\nEXTRAORDINARY MEETING 10/22/2021\nMUNICIPALITY OF ALANDROAL\nMr. João Maria Aranha Grilo, Mayor of Alandroal, presided.\nCouncillors João Carlos Camões Roma Balsante\nPaulo Jorge da Silva Gonçalves\nFernanda Manuela Brites Romão\nJosé Francisco Figueira Andrezo Rodrigues\nHe was the secretary of the meeting ************************************************\nIn the Headquarters Building of the Municipality of Alandroal, the Mayor, João Maria Aranha Grilo, declared the meeting open, it was 2.15 pm. \n \n1. REQUEST FOR SCHEDULING AN EXTRAORDINARY MEETING OF THE MUNICIPAL ASSEMBLY.\n"

# Tokenize with offset mapping
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    return_offsets_mapping=True
)
offsets = inputs.pop("offset_mapping")[0]

# Predict
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=2)[0]
labels = [model.config.id2label[p.item()] for p in predictions]

# Extract entities using character spans
entities = []
current = None

for (start, end), label in zip(offsets.tolist(), labels):
    if label == "O" or start == end:
        if current:
            entities.append(current)
            current = None
        continue

    if label.startswith("B-"):
        if current:
            entities.append(current)
        current = {"label": label[2:], "start": start, "end": end}
    elif label.startswith("I-") and current and label[2:] == current["label"]:
        current["end"] = end
    else:
        if current:
            entities.append(current)
        current = None

if current:
    entities.append(current)

# Print results
print("\nDetected Entities:")
for ent in entities:
    span = text[ent["start"]:ent["end"]]
    print(f"- {ent['label']}: {span}")
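The flat entity list can then be folded into a single metadata record per document. A sketch, assuming the `{"label", "start", "end"}` span format produced above (the sample string and spans here are illustrative):

```python
from collections import defaultdict

def entities_to_metadata(text, entities):
    """Group extracted spans by label; repeated labels
    (e.g. participant) collect into a list."""
    record = defaultdict(list)
    for ent in entities:
        record[ent["label"]].append(text[ent["start"]:ent["end"]])
    return dict(record)

# Hypothetical spans over a short string:
sample = "MINUTES NO. 2 - EXTRAORDINARY MEETING 10/22/2021"
spans = [
    {"label": "minute_id", "start": 0, "end": 13},
    {"label": "date", "start": 38, "end": 48},
]
print(entities_to_metadata(sample, spans))
# {'minute_id': ['MINUTES NO. 2'], 'date': ['10/22/2021']}
```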

Evaluation Results

Municipal Meeting Minutes Test Set

Metric     Score
F1-score   0.94
Precision  0.94
Recall     0.94

Limitations

  • Domain Specificity: Best performance on administrative/governmental meeting minutes
  • Language: Optimized for English; Portuguese performance may vary
  • Context Window: Limited to 512 tokens
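One common workaround for the 512-token limit is to slide overlapping windows over long minutes and run the model on each window. A pure-Python sketch of the windowing step (window size and overlap are illustrative defaults, not values used in the study):

```python
def chunk_tokens(tokens, max_len=512, overlap=64):
    """Split a long token sequence into overlapping windows of at most
    max_len tokens; adjacent windows share `overlap` tokens."""
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break
    return chunks

windows = chunk_tokens(list(range(1200)))
print([(start, len(win)) for start, win in windows])
# [(0, 512), (448, 512), (896, 304)]
```

When merging predictions back together, labels in the overlapping regions can be resolved by keeping the prediction from the window in which the token lies farther from a window boundary.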

License

This project uses a custom dual-license based on AGPL v3.

See the full license terms here: LICENSE
