ABB Dual-Head Location Component NER

A dual-head Named Entity Recognition model for decomposing municipal decision location strings (English, Belgian Dutch and German) into structured address components. Built on xlm-roberta-base with two independent CRF-decoded heads.

Model Description

This model simultaneously:

  1. Component head: tags each token with one of 12 address component types (street, city, postcode, …)
  2. Location head: groups tokens into distinct physical location spans (B-LOCATION / I-LOCATION)

Post-processing then nests the component spans inside their parent location spans, producing a structured JSON output identical to the format used in training.
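As an illustration of that nested format, the first example from the Usage section below parses to a structure like this (field names are the lowercased component labels):

```python
# Illustrative shape of the post-processed output for a single-location
# string; values taken from the expected-output example further down.
parsed = {
    "original": "Scaldisstraat 23-25, 2000 Antwerpen",
    "locations": [
        {
            "location": "Scaldisstraat 23-25, 2000 Antwerpen",
            "street": "Scaldisstraat",
            "housenumber": "23-25",
            "housenumber_type": "range",
            "postcode": "2000",
            "city": "Antwerpen",
        }
    ],
}
```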

Entity Types (Component Head)

Label             Description
STREET            Street name (no house number)
ROAD              Road or route name
HOUSENUMBER       House/building number(s), ranges or sequences
POSTCODE          Postal or ZIP code
CITY              City or municipality name
PROVINCE          Province or region name
BUILDING          Named building, site or facility
INTERSECTION      Crossing or intersection of roads
PARCEL            Land parcel, section or lot number
DISTRICT          District, neighbourhood or borough
GRAVE_LOCATION    Plot/row/number within a cemetery
DOMAIN_ZONE_AREA  Domain, zone or area name

Location Head

B-LOCATION · I-LOCATION · O: delimits the individual distinct locations within a multi-location string.
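As a sketch of how the two heads tag the same tokens in parallel, consider the two-location example from the Usage section (tags hand-written for illustration, not actual model output):

```python
tokens = ["Heikeesstraat", "2", ",", "9240", "Zele", "and",
          "Dorpstraat", "7", ",", "8040", "Mariakerke"]

# Component head: one BIO tag per token over the 12 component types.
component_tags = ["B-STREET", "B-HOUSENUMBER", "O", "B-POSTCODE", "B-CITY", "O",
                  "B-STREET", "B-HOUSENUMBER", "O", "B-POSTCODE", "B-CITY"]

# Location head: delimits the two distinct physical locations.
location_tags = ["B-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION", "O",
                 "B-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION", "I-LOCATION"]

# Both heads emit one tag per token over the same tokenization.
assert len(tokens) == len(component_tags) == len(location_tags)
```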

Evaluation Results

Evaluated on a held-out 10 % split of ~10 000 Belgian municipal decision location strings.

Metric        Score
Combined F1   0.9435
Component F1  0.9295
Location F1   0.9576

Component-level report

                  precision    recall  f1-score   support

        BUILDING       0.82      0.84      0.83       166
            CITY       0.94      0.94      0.94       344
        DISTRICT       0.82      0.95      0.88        57
DOMAIN_ZONE_AREA       0.61      0.61      0.61        84
  GRAVE_LOCATION       0.93      1.00      0.97        14
     HOUSENUMBER       0.98      0.99      0.99       366
    INTERSECTION       0.95      0.98      0.96        53
          PARCEL       0.86      0.92      0.89        65
        POSTCODE       0.99      0.99      0.99       150
        PROVINCE       1.00      1.00      1.00        93
            ROAD       0.81      0.91      0.85        55
          STREET       0.95      0.95      0.95       586

       micro avg       0.92      0.94      0.93      2033
       macro avg       0.89      0.92      0.90      2033
    weighted avg       0.92      0.94      0.93      2033

Location-level report

              precision    recall  f1-score   support

    LOCATION       0.96      0.96      0.96      1109

   micro avg       0.96      0.96      0.96      1109
   macro avg       0.96      0.96      0.96      1109
weighted avg       0.96      0.96      0.96      1109

Installation

pip install transformers torch pytorch-crf

Usage

Because this model uses a custom architecture, you need to copy the class definitions below before loading it.

Minimal example

import re, json, torch, torch.nn as nn
from dataclasses import dataclass
from typing import List, Tuple, Optional
from transformers import XLMRobertaModel, XLMRobertaConfig, PreTrainedModel, AutoTokenizer
from transformers.modeling_outputs import ModelOutput
from torchcrf import CRF

# ── Label sets ───────────────────────────────────────────────────────────────
ENTITY_TYPES = [
    "STREET", "ROAD", "HOUSENUMBER", "POSTCODE", "CITY", "PROVINCE",
    "BUILDING", "INTERSECTION", "PARCEL", "DISTRICT",
    "GRAVE_LOCATION", "DOMAIN_ZONE_AREA",
]
BIO_LABELS = ["O"] + [f"{p}-{e}" for e in ENTITY_TYPES for p in ("B", "I")]
LOC_BIO_LABELS = ["O", "B-LOCATION", "I-LOCATION"]
label2id = {l: i for i, l in enumerate(BIO_LABELS)}
id2label = {i: l for i, l in enumerate(BIO_LABELS)}
loc_label2id = {l: i for i, l in enumerate(LOC_BIO_LABELS)}
loc_id2label = {i: l for i, l in enumerate(LOC_BIO_LABELS)}
MAX_LENGTH = 256

# ── Model classes ─────────────────────────────────────────────────────────────
class DualNERConfig(XLMRobertaConfig):
    model_type = "dual_ner_xlm_roberta"
    def __init__(self, num_component_labels=len(BIO_LABELS),
                 num_location_labels=len(LOC_BIO_LABELS), **kwargs):
        super().__init__(**kwargs)
        self.num_component_labels = num_component_labels
        self.num_location_labels = num_location_labels

@dataclass
class DualNEROutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    component_logits: torch.FloatTensor = None
    location_logits: torch.FloatTensor = None

class DualHeadLocationNER(PreTrainedModel):
    config_class = DualNERConfig
    base_model_prefix = "roberta"

    def __init__(self, config):
        super().__init__(config)
        self.roberta = XLMRobertaModel(config, add_pooling_layer=False)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.component_head = nn.Sequential(
            nn.Linear(config.hidden_size, 256), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(256, config.num_component_labels),
        )
        self.location_head = nn.Sequential(
            nn.Linear(config.hidden_size, 256), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(256, config.num_location_labels),
        )
        self.component_crf = CRF(config.num_component_labels, batch_first=True)
        self.location_crf = CRF(config.num_location_labels, batch_first=True)
        self.post_init()

    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        h = self.dropout(
            self.roberta(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        )
        return DualNEROutput(
            component_logits=self.component_head(h),
            location_logits=self.location_head(h),
        )

# ── Tokenizer helpers ─────────────────────────────────────────────────────────
def tokenize_location(text: str) -> Tuple[List[str], List[Tuple[int, int]]]:
    tokens, offsets = [], []
    for m in re.finditer(r'[,;()\[\]{}]|[^\s,;()\[\]{}]+', text):
        tokens.append(m.group())
        offsets.append((m.start(), m.end()))
    return tokens, offsets

def classify_housenumber_type(hn: str) -> str:
    if re.search(r'\d\s*[-–]\s*\d', hn): return "range"
    if re.search(r'[,;]|\band\b|\ben\b', hn): return "sequence"
    return "single"

def extract_bio_spans(tokens, tag_ids, id2l, offsets):
    spans, ent, start = [], None, 0
    for i, tid in enumerate(tag_ids):
        tag = id2l[tid]
        if tag.startswith("B-"):
            if ent: spans.append({"entity": ent, "start_tok": start, "end_tok": i - 1,
                                   "char_start": offsets[start][0], "char_end": offsets[i-1][1]})
            ent, start = tag[2:], i
        elif not (tag.startswith("I-") and ent == tag[2:]):
            if ent: spans.append({"entity": ent, "start_tok": start, "end_tok": i - 1,
                                   "char_start": offsets[start][0], "char_end": offsets[i-1][1]})
            ent = None
    if ent: spans.append({"entity": ent, "start_tok": start, "end_tok": len(tokens)-1,
                           "char_start": offsets[start][0], "char_end": offsets[-1][1]})
    return spans

# ── Inference ─────────────────────────────────────────────────────────────────
def predict_locations(text, model, tokenizer, device="cpu"):
    tokens, offsets = tokenize_location(text)
    if not tokens:
        return {"original": text, "locations": []}
    enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt",
                    truncation=True, max_length=MAX_LENGTH)
    word_ids = enc.word_ids()
    with torch.no_grad():
        out = model(**{k: v.to(device) for k, v in enc.items()})
    mask = enc["attention_mask"].bool().to(device)
    cpreds = model.component_crf.decode(out.component_logits, mask=mask)[0]
    lpreds = model.location_crf.decode(out.location_logits, mask=mask)[0]
    wcomp, wloc, prev = [], [], None
    for idx, wid in enumerate(word_ids):
        if wid is None: continue
        if wid != prev:
            wcomp.append(cpreds[idx]); wloc.append(lpreds[idx])
        prev = wid
    loc_spans  = extract_bio_spans(tokens, wloc,  loc_id2label, offsets)
    comp_spans = extract_bio_spans(tokens, wcomp, id2label,     offsets)
    locations, assigned = [], set()
    for ls in loc_spans:
        loc = {"location": text[ls["char_start"]:ls["char_end"]]}
        for ci, cs in enumerate(comp_spans):
            if cs["start_tok"] >= ls["start_tok"] and cs["end_tok"] <= ls["end_tok"]:
                loc[cs["entity"].lower()] = text[cs["char_start"]:cs["char_end"]]
                assigned.add(ci)
        if "housenumber" in loc:
            loc["housenumber_type"] = classify_housenumber_type(loc["housenumber"])
        locations.append(loc)
    for ci, cs in enumerate(comp_spans):
        if ci not in assigned:
            loc = {"location": text[cs["char_start"]:cs["char_end"]],
                   cs["entity"].lower(): text[cs["char_start"]:cs["char_end"]]}
            if "housenumber" in loc:
                loc["housenumber_type"] = classify_housenumber_type(loc["housenumber"])
            locations.append(loc)
    return {"original": text, "locations": locations}

# ── Load & run ────────────────────────────────────────────────────────────────
MODEL_REPO = "svercoutere/abb-dual-location-component-ner"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)
config    = DualNERConfig.from_pretrained(MODEL_REPO,
                num_component_labels=len(BIO_LABELS),
                num_location_labels=len(LOC_BIO_LABELS))
model     = DualHeadLocationNER.from_pretrained(MODEL_REPO, config=config)
model.to(device).eval()

texts = [
    "Scaldisstraat 23-25, 2000 Antwerpen",
    "Cafe den Draak, Lovegemlaan 7, 9000 Gent",
    "Heikeesstraat 2, 9240 Zele and Dorpstraat 7, 8040 Mariakerke",
    "begraafplaats Schoonselhof, perk 27, rij 3",
    "politiezone Antwerpen",
]
for text in texts:
    result = predict_locations(text, model, tokenizer, device)
    print(f"\nInput : {text}")
    for loc in result["locations"]:
        parts = {k: v for k, v in loc.items() if k != "location"}
        print(f"  LOC : {loc['location']}")
        print(f"        {json.dumps(parts, ensure_ascii=False)}")

Expected output

Input : Scaldisstraat 23-25, 2000 Antwerpen
  LOC : Scaldisstraat 23-25, 2000 Antwerpen
        {"street": "Scaldisstraat", "housenumber": "23-25", "housenumber_type": "range", "postcode": "2000", "city": "Antwerpen"}

Input : Cafe den Draak, Lovegemlaan 7, 9000 Gent
  LOC : Cafe den Draak, Lovegemlaan 7, 9000 Gent
        {"building": "Cafe den Draak", "street": "Lovegemlaan", "housenumber": "7", "housenumber_type": "single", "postcode": "9000", "city": "Gent"}

Input : Heikeesstraat 2, 9240 Zele and Dorpstraat 7, 8040 Mariakerke
  LOC : Heikeesstraat 2, 9240 Zele
        {"street": "Heikeesstraat", "housenumber": "2", "housenumber_type": "single", "postcode": "9240", "city": "Zele"}
  LOC : Dorpstraat 7, 8040 Mariakerke
        {"street": "Dorpstraat", "housenumber": "7", "housenumber_type": "single", "postcode": "8040", "city": "Mariakerke"}

Deployment

Local

git clone https://huggingface.co/svercoutere/abb-dual-location-component-ner
pip install transformers torch pytorch-crf
python test_dual_ner.py

Docker / API

Wrap predict_locations in a FastAPI endpoint:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ParseRequest(BaseModel):
    text: str

@app.post("/parse")
def parse(req: ParseRequest):
    return predict_locations(req.text, model, tokenizer, device)

Training Details

Setting              Value
Base model           xlm-roberta-base
Max sequence length  256 tokens
Batch size (train)   16
Batch size (eval)    32
Epochs               50
Learning rate        2e-5
Weight decay         0.01
Warmup ratio         0.1
Loss weighting       1.5 × component + 0.5 × location
Decoding             CRF Viterbi (both heads)
Training data        ~10 000 Belgian municipal decision location strings
Class balancing      Minority-class oversampling to median token count
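The loss weighting above combines the two heads into a single training objective. A minimal sketch, with placeholder scalars standing in for the per-head CRF negative log-likelihoods:

```python
# Sketch of the 1.5 x component + 0.5 x location loss weighting.
# The two scalars below are hypothetical stand-ins for the CRF
# negative log-likelihoods produced by each head during training.
component_nll = 0.8  # hypothetical component-head CRF NLL
location_nll = 0.4   # hypothetical location-head CRF NLL

total_loss = 1.5 * component_nll + 0.5 * location_nll
```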

Limitations

  • Primarily trained on Belgian/Flemish location strings; German accuracy may be lower for rare component types.
  • DOMAIN_ZONE_AREA F1 is lower (0.61) due to high lexical variability and limited training examples.