ABB Dual-Head Location Component NER

A dual-head Named Entity Recognition model for decomposing municipal decision location strings (English, Belgian Dutch and German) into structured address components. Built on xlm-roberta-base with two independent CRF-decoded heads.

Model Description

This model simultaneously:

  1. Component head: tags each token with one of 12 address component types (street, city, postcode, …)
  2. Location head: groups tokens into distinct physical location spans (B-LOCATION / I-LOCATION)

Post-processing then nests the component spans inside their parent location spans, producing a structured JSON output identical to the format used in training.
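As an illustration of that nested format, the first example from the Usage section below parses to a structure like this (field names are the lowercased component labels):

```python
# Illustrative shape of the post-processed output for a single-location
# string; values taken from the expected-output example further down.
parsed = {
    "original": "Scaldisstraat 23-25, 2000 Antwerpen",
    "locations": [
        {
            "location": "Scaldisstraat 23-25, 2000 Antwerpen",
            "street": "Scaldisstraat",
            "housenumber": "23-25",
            "housenumber_type": "range",
            "postcode": "2000",
            "city": "Antwerpen",
        }
    ],
}
```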

Entity Types (Component Head)

Label             Description
STREET            Street name (no house number)
ROAD              Road or route name
HOUSENUMBER       House/building number(s), ranges or sequences
POSTCODE          Postal or ZIP code
CITY              City or municipality name
PROVINCE          Province or region name
BUILDING          Named building, site or facility
INTERSECTION      Crossing or intersection of roads
PARCEL            Land parcel, section or lot number
DISTRICT          District, neighbourhood or borough
GRAVE_LOCATION    Plot/row/number within a cemetery
DOMAIN_ZONE_AREA  Domain, zone or area name

Location Head

B-LOCATION · I-LOCATION · O: delimits the individual distinct locations within a multi-location string.
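As a sketch of how the two heads tag the same tokens in parallel, consider the two-location example from the Usage section (tags hand-written for illustration, not actual model output):

```python
tokens = ["Heikeesstraat", "2", ",", "9240", "Zele", "and",
          "Dorpstraat", "7", ",", "8040", "Mariakerke"]

# Component head: one BIO tag per token over the 12 component types.
component_tags = ["B-STREET", "B-HOUSENUMBER", "O", "B-POSTCODE", "B-CITY", "O",
                  "B-STREET", "B-HOUSENUMBER", "O", "B-POSTCODE", "B-CITY"]

# Location head: delimits the two distinct physical locations.
location_tags = ["B-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION", "O",
                 "B-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION"]

# Both heads emit one tag per token over the same tokenization.
assert len(tokens) == len(component_tags) == len(location_tags)
```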

Evaluation Results

Evaluated on a held-out 10 % split of ~10 000 Belgian municipal decision location strings.

Metric        Score
Combined F1   0.9435
Component F1  0.9295
Location F1   0.9576

Component-level report

                  precision    recall  f1-score   support

        BUILDING       0.82      0.84      0.83       166
            CITY       0.94      0.94      0.94       344
        DISTRICT       0.82      0.95      0.88        57
DOMAIN_ZONE_AREA       0.61      0.61      0.61        84
  GRAVE_LOCATION       0.93      1.00      0.97        14
     HOUSENUMBER       0.98      0.99      0.99       366
    INTERSECTION       0.95      0.98      0.96        53
          PARCEL       0.86      0.92      0.89        65
        POSTCODE       0.99      0.99      0.99       150
        PROVINCE       1.00      1.00      1.00        93
            ROAD       0.81      0.91      0.85        55
          STREET       0.95      0.95      0.95       586

       micro avg       0.92      0.94      0.93      2033
       macro avg       0.89      0.92      0.90      2033
    weighted avg       0.92      0.94      0.93      2033

Location-level report

              precision    recall  f1-score   support

    LOCATION       0.96      0.96      0.96      1109

   micro avg       0.96      0.96      0.96      1109
   macro avg       0.96      0.96      0.96      1109
weighted avg       0.96      0.96      0.96      1109

Installation

pip install transformers torch pytorch-crf

Usage

Because this model uses a custom architecture, you need to copy the class definitions below before loading it.

Minimal example

import re, json, torch, torch.nn as nn
from dataclasses import dataclass
from typing import List, Tuple, Optional
from transformers import XLMRobertaModel, XLMRobertaConfig, PreTrainedModel, AutoTokenizer
from transformers.modeling_outputs import ModelOutput
from torchcrf import CRF

# ── Label sets ───────────────────────────────────────────────────────────────
ENTITY_TYPES = [
    "STREET", "ROAD", "HOUSENUMBER", "POSTCODE", "CITY", "PROVINCE",
    "BUILDING", "INTERSECTION", "PARCEL", "DISTRICT",
    "GRAVE_LOCATION", "DOMAIN_ZONE_AREA",
]
BIO_LABELS = ["O"] + [f"{p}-{e}" for e in ENTITY_TYPES for p in ("B", "I")]
LOC_BIO_LABELS = ["O", "B-LOCATION", "I-LOCATION"]
label2id = {l: i for i, l in enumerate(BIO_LABELS)}
id2label = {i: l for i, l in enumerate(BIO_LABELS)}
loc_label2id = {l: i for i, l in enumerate(LOC_BIO_LABELS)}
loc_id2label = {i: l for i, l in enumerate(LOC_BIO_LABELS)}
MAX_LENGTH = 256

# ── Model classes ─────────────────────────────────────────────────────────────
class DualNERConfig(XLMRobertaConfig):
    model_type = "dual_ner_xlm_roberta"
    def __init__(self, num_component_labels=len(BIO_LABELS),
                 num_location_labels=len(LOC_BIO_LABELS), **kwargs):
        super().__init__(**kwargs)
        self.num_component_labels = num_component_labels
        self.num_location_labels = num_location_labels

@dataclass
class DualNEROutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    component_logits: torch.FloatTensor = None
    location_logits: torch.FloatTensor = None

class DualHeadLocationNER(PreTrainedModel):
    config_class = DualNERConfig
    base_model_prefix = "roberta"

    def __init__(self, config):
        super().__init__(config)
        self.roberta = XLMRobertaModel(config, add_pooling_layer=False)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.component_head = nn.Sequential(
            nn.Linear(config.hidden_size, 256), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(256, config.num_component_labels),
        )
        self.location_head = nn.Sequential(
            nn.Linear(config.hidden_size, 256), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(256, config.num_location_labels),
        )
        self.component_crf = CRF(config.num_component_labels, batch_first=True)
        self.location_crf = CRF(config.num_location_labels, batch_first=True)
        self.post_init()

    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        h = self.dropout(
            self.roberta(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        )
        return DualNEROutput(
            component_logits=self.component_head(h),
            location_logits=self.location_head(h),
        )

# ── Tokenizer helpers ─────────────────────────────────────────────────────────
def tokenize_location(text: str) -> Tuple[List[str], List[Tuple[int, int]]]:
    tokens, offsets = [], []
    for m in re.finditer(r'[,;()\[\]{}]|[^\s,;()\[\]{}]+', text):
        tokens.append(m.group())
        offsets.append((m.start(), m.end()))
    return tokens, offsets

def classify_housenumber_type(hn: str) -> str:
    if re.search(r'\d\s*[-–]\s*\d', hn): return "range"
    if re.search(r'[,;]|\band\b|\ben\b', hn): return "sequence"
    return "single"

def extract_bio_spans(tokens, tag_ids, id2l, offsets):
    spans, ent, start = [], None, 0
    for i, tid in enumerate(tag_ids):
        tag = id2l[tid]
        if tag.startswith("B-"):
            if ent: spans.append({"entity": ent, "start_tok": start, "end_tok": i - 1,
                                   "char_start": offsets[start][0], "char_end": offsets[i-1][1]})
            ent, start = tag[2:], i
        elif not (tag.startswith("I-") and ent == tag[2:]):
            if ent: spans.append({"entity": ent, "start_tok": start, "end_tok": i - 1,
                                   "char_start": offsets[start][0], "char_end": offsets[i-1][1]})
            ent = None
    if ent: spans.append({"entity": ent, "start_tok": start, "end_tok": len(tokens)-1,
                           "char_start": offsets[start][0], "char_end": offsets[-1][1]})
    return spans

# ── Inference ─────────────────────────────────────────────────────────────────
def predict_locations(text, model, tokenizer, device="cpu"):
    tokens, offsets = tokenize_location(text)
    if not tokens:
        return {"original": text, "locations": []}
    enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt",
                    truncation=True, max_length=MAX_LENGTH)
    word_ids = enc.word_ids()
    with torch.no_grad():
        out = model(**{k: v.to(device) for k, v in enc.items()})
    mask = enc["attention_mask"].bool().to(device)
    cpreds = model.component_crf.decode(out.component_logits, mask=mask)[0]
    lpreds = model.location_crf.decode(out.location_logits, mask=mask)[0]
    wcomp, wloc, prev = [], [], None
    for idx, wid in enumerate(word_ids):
        if wid is None: continue
        if wid != prev:
            wcomp.append(cpreds[idx]); wloc.append(lpreds[idx])
        prev = wid
    loc_spans  = extract_bio_spans(tokens, wloc,  loc_id2label, offsets)
    comp_spans = extract_bio_spans(tokens, wcomp, id2label,     offsets)
    locations, assigned = [], set()
    for ls in loc_spans:
        loc = {"location": text[ls["char_start"]:ls["char_end"]]}
        for ci, cs in enumerate(comp_spans):
            if cs["start_tok"] >= ls["start_tok"] and cs["end_tok"] <= ls["end_tok"]:
                loc[cs["entity"].lower()] = text[cs["char_start"]:cs["char_end"]]
                assigned.add(ci)
        if "housenumber" in loc:
            loc["housenumber_type"] = classify_housenumber_type(loc["housenumber"])
        locations.append(loc)
    for ci, cs in enumerate(comp_spans):
        if ci not in assigned:
            loc = {"location": text[cs["char_start"]:cs["char_end"]],
                   cs["entity"].lower(): text[cs["char_start"]:cs["char_end"]]}
            if "housenumber" in loc:
                loc["housenumber_type"] = classify_housenumber_type(loc["housenumber"])
            locations.append(loc)
    return {"original": text, "locations": locations}

# ── Load & run ────────────────────────────────────────────────────────────────
MODEL_REPO = "svercoutere/abb-dual-location-component-ner"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)
config    = DualNERConfig.from_pretrained(MODEL_REPO,
                num_component_labels=len(BIO_LABELS),
                num_location_labels=len(LOC_BIO_LABELS))
model     = DualHeadLocationNER.from_pretrained(MODEL_REPO, config=config)
model.to(device).eval()

texts = [
    "Scaldisstraat 23-25, 2000 Antwerpen",
    "Cafe den Draak, Lovegemlaan 7, 9000 Gent",
    "Heikeesstraat 2, 9240 Zele and Dorpstraat 7, 8040 Mariakerke",
    "begraafplaats Schoonselhof, perk 27, rij 3",
    "politiezone Antwerpen",
]
for text in texts:
    result = predict_locations(text, model, tokenizer, device)
    print(f"\nInput : {text}")
    for loc in result["locations"]:
        parts = {k: v for k, v in loc.items() if k != "location"}
        print(f"  LOC : {loc['location']}")
        print(f"        {json.dumps(parts, ensure_ascii=False)}")

Expected output

Input : Scaldisstraat 23-25, 2000 Antwerpen
  LOC : Scaldisstraat 23-25, 2000 Antwerpen
        {"street": "Scaldisstraat", "housenumber": "23-25", "housenumber_type": "range", "postcode": "2000", "city": "Antwerpen"}

Input : Cafe den Draak, Lovegemlaan 7, 9000 Gent
  LOC : Cafe den Draak, Lovegemlaan 7, 9000 Gent
        {"building": "Cafe den Draak", "street": "Lovegemlaan", "housenumber": "7", "housenumber_type": "single", "postcode": "9000", "city": "Gent"}

Input : Heikeesstraat 2, 9240 Zele and Dorpstraat 7, 8040 Mariakerke
  LOC : Heikeesstraat 2, 9240 Zele
        {"street": "Heikeesstraat", "housenumber": "2", "housenumber_type": "single", "postcode": "9240", "city": "Zele"}
  LOC : Dorpstraat 7, 8040 Mariakerke
        {"street": "Dorpstraat", "housenumber": "7", "housenumber_type": "single", "postcode": "8040", "city": "Mariakerke"}

Deployment

Local

git clone https://huggingface.co/svercoutere/abb-dual-location-component-ner
pip install transformers torch pytorch-crf
python test_dual_ner.py

Docker / API

Wrap predict_locations in a FastAPI endpoint:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ParseRequest(BaseModel):
    text: str

@app.post("/parse")
def parse(req: ParseRequest):
    return predict_locations(req.text, model, tokenizer, device)

Training Details

Setting              Value
Base model           xlm-roberta-base
Max sequence length  256 tokens
Batch size (train)   16
Batch size (eval)    32
Epochs               50
Learning rate        2e-5
Weight decay         0.01
Warmup ratio         0.1
Loss weighting       1.5 × component + 0.5 × location
Decoding             CRF Viterbi (both heads)
Training data        ~10 000 Belgian municipal decision location strings
Class balancing      Minority-class oversampling to median token count
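The loss weighting above combines the two heads into a single training objective. A minimal sketch, with placeholder scalars standing in for the per-head CRF negative log-likelihoods:

```python
# Sketch of the 1.5 x component + 0.5 x location loss weighting.
# The two scalars below are hypothetical stand-ins for the CRF
# negative log-likelihoods produced by each head during training.
component_nll = 0.8  # hypothetical component-head CRF NLL
location_nll = 0.4   # hypothetical location-head CRF NLL

total_loss = 1.5 * component_nll + 0.5 * location_nll
```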

Limitations

  • Primarily trained on Belgian/Flemish location strings; German accuracy may be lower for rare component types.
  • DOMAIN_ZONE_AREA F1 is lower (0.61) due to high lexical variability and limited training examples.