# ABB Dual-Head Location Component NER
A dual-head Named Entity Recognition model for decomposing Dutch (Belgian), English, and German municipal-decision location strings into structured address components. Built on XLM-RoBERTa base with two independent CRF-decoded heads.
## Model Description

This model simultaneously:

- **Component head** – tags each token as one of 12 address component types (street, city, postcode, …)
- **Location head** – groups tokens into distinct physical location spans (`B-LOCATION`/`I-LOCATION`)

Post-processing then nests the component spans inside their parent location spans, producing structured JSON output identical to the format used in training.
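For reference, the nested output for a single-location string looks like this (taken from the expected output shown under Usage below):

```python
import json

# Structured output for "Scaldisstraat 23-25, 2000 Antwerpen":
# one location span, with its component spans nested as lowercase keys.
result = {
    "original": "Scaldisstraat 23-25, 2000 Antwerpen",
    "locations": [
        {
            "location": "Scaldisstraat 23-25, 2000 Antwerpen",
            "street": "Scaldisstraat",
            "housenumber": "23-25",
            "housenumber_type": "range",  # derived in post-processing
            "postcode": "2000",
            "city": "Antwerpen",
        }
    ],
}
print(json.dumps(result, ensure_ascii=False, indent=2))
```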
## Entity Types (Component Head)

| Label | Description |
|---|---|
| `STREET` | Street name (no house number) |
| `ROAD` | Road or route name |
| `HOUSENUMBER` | House/building number(s), ranges or sequences |
| `POSTCODE` | Postal or ZIP code |
| `CITY` | City or municipality name |
| `PROVINCE` | Province or region name |
| `BUILDING` | Named building, site or facility |
| `INTERSECTION` | Crossing or intersection of roads |
| `PARCEL` | Land parcel, section or lot number |
| `DISTRICT` | District, neighbourhood or borough |
| `GRAVE_LOCATION` | Plot/row/number within a cemetery |
| `DOMAIN_ZONE_AREA` | Domain, zone or area name |
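Under the BIO scheme used by the component head, these 12 types expand to 25 tags (an `O` tag plus a `B-`/`I-` pair per type), mirroring the label construction in the usage code below:

```python
ENTITY_TYPES = [
    "STREET", "ROAD", "HOUSENUMBER", "POSTCODE", "CITY", "PROVINCE",
    "BUILDING", "INTERSECTION", "PARCEL", "DISTRICT",
    "GRAVE_LOCATION", "DOMAIN_ZONE_AREA",
]
# "O" plus a B-/I- pair per type: 1 + 2 * 12 = 25 component tags
BIO_LABELS = ["O"] + [f"{p}-{e}" for e in ENTITY_TYPES for p in ("B", "I")]
print(len(BIO_LABELS))  # 25
```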
## Location Head

`B-LOCATION` · `I-LOCATION` · `O` – delimits individual distinct locations within a multi-location string.
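A minimal sketch (assuming perfect tags, outside the model) of how these tags delimit the two locations in a multi-location string:

```python
def group_locations(tokens, tags):
    """Group word tokens into location spans from B-LOCATION/I-LOCATION/O tags."""
    spans, cur = [], None
    for tok, tag in zip(tokens, tags):
        if tag == "B-LOCATION":          # a new span starts
            if cur:
                spans.append(cur)
            cur = [tok]
        elif tag == "I-LOCATION" and cur is not None:
            cur.append(tok)              # continue the open span
        else:                            # "O" closes any open span
            if cur:
                spans.append(cur)
            cur = None
    if cur:
        spans.append(cur)
    return [" ".join(s) for s in spans]

tokens = ["Heikeesstraat", "2", ",", "9240", "Zele", "and",
          "Dorpstraat", "7", ",", "8040", "Mariakerke"]
tags = ["B-LOCATION"] + ["I-LOCATION"] * 4 + ["O"] + ["B-LOCATION"] + ["I-LOCATION"] * 4
print(group_locations(tokens, tags))
# ['Heikeesstraat 2 , 9240 Zele', 'Dorpstraat 7 , 8040 Mariakerke']
```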
## Evaluation Results

Evaluated on a held-out 10 % split of ~10 000 Belgian municipal-decision location strings.
| Metric | Score |
|---|---|
| Combined F1 | 0.9435 |
| Component F1 | 0.9295 |
| Location F1 | 0.9576 |
### Component-level report

```text
                  precision    recall  f1-score   support

        BUILDING       0.82      0.84      0.83       166
            CITY       0.94      0.94      0.94       344
        DISTRICT       0.82      0.95      0.88        57
DOMAIN_ZONE_AREA       0.61      0.61      0.61        84
  GRAVE_LOCATION       0.93      1.00      0.97        14
     HOUSENUMBER       0.98      0.99      0.99       366
    INTERSECTION       0.95      0.98      0.96        53
          PARCEL       0.86      0.92      0.89        65
        POSTCODE       0.99      0.99      0.99       150
        PROVINCE       1.00      1.00      1.00        93
            ROAD       0.81      0.91      0.85        55
          STREET       0.95      0.95      0.95       586

       micro avg       0.92      0.94      0.93      2033
       macro avg       0.89      0.92      0.90      2033
    weighted avg       0.92      0.94      0.93      2033
```
### Location-level report

```text
              precision    recall  f1-score   support

    LOCATION       0.96      0.96      0.96      1109

   micro avg       0.96      0.96      0.96      1109
   macro avg       0.96      0.96      0.96      1109
weighted avg       0.96      0.96      0.96      1109
```
## Installation

```bash
pip install transformers torch pytorch-crf
```
## Usage

Because this model uses a custom architecture, copy the class definitions below before loading it.

### Minimal example
```python
import re, json, torch, torch.nn as nn
from dataclasses import dataclass
from typing import List, Tuple, Optional

from transformers import XLMRobertaModel, XLMRobertaConfig, PreTrainedModel, AutoTokenizer
from transformers.modeling_outputs import ModelOutput
from torchcrf import CRF

# ── Label sets ───────────────────────────────────────────────────────────────
ENTITY_TYPES = [
    "STREET", "ROAD", "HOUSENUMBER", "POSTCODE", "CITY", "PROVINCE",
    "BUILDING", "INTERSECTION", "PARCEL", "DISTRICT",
    "GRAVE_LOCATION", "DOMAIN_ZONE_AREA",
]
BIO_LABELS = ["O"] + [f"{p}-{e}" for e in ENTITY_TYPES for p in ("B", "I")]
LOC_BIO_LABELS = ["O", "B-LOCATION", "I-LOCATION"]
label2id = {l: i for i, l in enumerate(BIO_LABELS)}
id2label = {i: l for i, l in enumerate(BIO_LABELS)}
loc_label2id = {l: i for i, l in enumerate(LOC_BIO_LABELS)}
loc_id2label = {i: l for i, l in enumerate(LOC_BIO_LABELS)}
MAX_LENGTH = 256

# ── Model classes ────────────────────────────────────────────────────────────
class DualNERConfig(XLMRobertaConfig):
    model_type = "dual_ner_xlm_roberta"

    def __init__(self, num_component_labels=len(BIO_LABELS),
                 num_location_labels=len(LOC_BIO_LABELS), **kwargs):
        super().__init__(**kwargs)
        self.num_component_labels = num_component_labels
        self.num_location_labels = num_location_labels


@dataclass
class DualNEROutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    component_logits: torch.FloatTensor = None
    location_logits: torch.FloatTensor = None


class DualHeadLocationNER(PreTrainedModel):
    config_class = DualNERConfig
    base_model_prefix = "roberta"

    def __init__(self, config):
        super().__init__(config)
        self.roberta = XLMRobertaModel(config, add_pooling_layer=False)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.component_head = nn.Sequential(
            nn.Linear(config.hidden_size, 256), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(256, config.num_component_labels),
        )
        self.location_head = nn.Sequential(
            nn.Linear(config.hidden_size, 256), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(256, config.num_location_labels),
        )
        self.component_crf = CRF(config.num_component_labels, batch_first=True)
        self.location_crf = CRF(config.num_location_labels, batch_first=True)
        self.post_init()

    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        h = self.dropout(
            self.roberta(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        )
        return DualNEROutput(
            component_logits=self.component_head(h),
            location_logits=self.location_head(h),
        )

# ── Tokenizer helpers ────────────────────────────────────────────────────────
def tokenize_location(text: str) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Split a location string into word tokens and their character offsets."""
    tokens, offsets = [], []
    for m in re.finditer(r'[,;()\[\]{}]|[^\s,;()\[\]{}]+', text):
        tokens.append(m.group())
        offsets.append((m.start(), m.end()))
    return tokens, offsets


def classify_housenumber_type(hn: str) -> str:
    if re.search(r'\d\s*[-–]\s*\d', hn):
        return "range"
    if re.search(r'[,;]|\band\b|\ben\b', hn):
        return "sequence"
    return "single"


def extract_bio_spans(tokens, tag_ids, id2l, offsets):
    """Collect contiguous BIO spans as token and character ranges."""
    spans, ent, start = [], None, 0
    for i, tid in enumerate(tag_ids):
        tag = id2l[tid]
        if tag.startswith("B-"):
            if ent:
                spans.append({"entity": ent, "start_tok": start, "end_tok": i - 1,
                              "char_start": offsets[start][0], "char_end": offsets[i - 1][1]})
            ent, start = tag[2:], i
        elif not (tag.startswith("I-") and ent == tag[2:]):
            if ent:
                spans.append({"entity": ent, "start_tok": start, "end_tok": i - 1,
                              "char_start": offsets[start][0], "char_end": offsets[i - 1][1]})
            ent = None
    if ent:
        spans.append({"entity": ent, "start_tok": start, "end_tok": len(tokens) - 1,
                      "char_start": offsets[start][0], "char_end": offsets[-1][1]})
    return spans

# ── Inference ────────────────────────────────────────────────────────────────
def predict_locations(text, model, tokenizer, device="cpu"):
    tokens, offsets = tokenize_location(text)
    if not tokens:
        return {"original": text, "locations": []}
    enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt",
                    truncation=True, max_length=MAX_LENGTH)
    word_ids = enc.word_ids()
    with torch.no_grad():
        out = model(**{k: v.to(device) for k, v in enc.items()})
    mask = enc["attention_mask"].bool().to(device)
    cpreds = model.component_crf.decode(out.component_logits, mask=mask)[0]
    lpreds = model.location_crf.decode(out.location_logits, mask=mask)[0]

    # Keep only the prediction of the first sub-token of each word
    wcomp, wloc, prev = [], [], None
    for idx, wid in enumerate(word_ids):
        if wid is None:
            continue
        if wid != prev:
            wcomp.append(cpreds[idx])
            wloc.append(lpreds[idx])
        prev = wid

    loc_spans = extract_bio_spans(tokens, wloc, loc_id2label, offsets)
    comp_spans = extract_bio_spans(tokens, wcomp, id2label, offsets)

    # Nest each component span inside its parent location span
    locations, assigned = [], set()
    for ls in loc_spans:
        loc = {"location": text[ls["char_start"]:ls["char_end"]]}
        for ci, cs in enumerate(comp_spans):
            if cs["start_tok"] >= ls["start_tok"] and cs["end_tok"] <= ls["end_tok"]:
                loc[cs["entity"].lower()] = text[cs["char_start"]:cs["char_end"]]
                assigned.add(ci)
        if "housenumber" in loc:
            loc["housenumber_type"] = classify_housenumber_type(loc["housenumber"])
        locations.append(loc)

    # Components outside any predicted location become standalone locations
    for ci, cs in enumerate(comp_spans):
        if ci not in assigned:
            loc = {"location": text[cs["char_start"]:cs["char_end"]],
                   cs["entity"].lower(): text[cs["char_start"]:cs["char_end"]]}
            if "housenumber" in loc:
                loc["housenumber_type"] = classify_housenumber_type(loc["housenumber"])
            locations.append(loc)
    return {"original": text, "locations": locations}

# ── Load & run ───────────────────────────────────────────────────────────────
MODEL_REPO = "svercoutere/abb-dual-location-component-ner"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)
config = DualNERConfig.from_pretrained(MODEL_REPO,
                                       num_component_labels=len(BIO_LABELS),
                                       num_location_labels=len(LOC_BIO_LABELS))
model = DualHeadLocationNER.from_pretrained(MODEL_REPO, config=config)
model.to(device).eval()

texts = [
    "Scaldisstraat 23-25, 2000 Antwerpen",
    "Cafe den Draak, Lovegemlaan 7, 9000 Gent",
    "Heikeesstraat 2, 9240 Zele and Dorpstraat 7, 8040 Mariakerke",
    "begraafplaats Schoonselhof, perk 27, rij 3",
    "politiezone Antwerpen",
]
for text in texts:
    result = predict_locations(text, model, tokenizer, device)
    print(f"\nInput : {text}")
    for loc in result["locations"]:
        parts = {k: v for k, v in loc.items() if k != "location"}
        print(f"  LOC : {loc['location']}")
        print(f"        {json.dumps(parts, ensure_ascii=False)}")
```
### Expected output

```text
Input : Scaldisstraat 23-25, 2000 Antwerpen
  LOC : Scaldisstraat 23-25, 2000 Antwerpen
        {"street": "Scaldisstraat", "housenumber": "23-25", "housenumber_type": "range", "postcode": "2000", "city": "Antwerpen"}

Input : Cafe den Draak, Lovegemlaan 7, 9000 Gent
  LOC : Cafe den Draak, Lovegemlaan 7, 9000 Gent
        {"building": "Cafe den Draak", "street": "Lovegemlaan", "housenumber": "7", "housenumber_type": "single", "postcode": "9000", "city": "Gent"}

Input : Heikeesstraat 2, 9240 Zele and Dorpstraat 7, 8040 Mariakerke
  LOC : Heikeesstraat 2, 9240 Zele
        {"street": "Heikeesstraat", "housenumber": "2", "housenumber_type": "single", "postcode": "9240", "city": "Zele"}
  LOC : Dorpstraat 7, 8040 Mariakerke
        {"street": "Dorpstraat", "housenumber": "7", "housenumber_type": "single", "postcode": "8040", "city": "Mariakerke"}
```
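The word-level pre-tokenizer and house-number heuristic from the minimal example can be exercised on their own, without loading the model (definitions copied from the code above):

```python
import re

def tokenize_location(text):
    """Punctuation-aware word splitter applied before the subword tokenizer."""
    tokens, offsets = [], []
    for m in re.finditer(r'[,;()\[\]{}]|[^\s,;()\[\]{}]+', text):
        tokens.append(m.group())
        offsets.append((m.start(), m.end()))
    return tokens, offsets

def classify_housenumber_type(hn):
    if re.search(r'\d\s*[-–]\s*\d', hn):        # e.g. "23-25"
        return "range"
    if re.search(r'[,;]|\band\b|\ben\b', hn):   # e.g. "3, 5" or "3 en 5"
        return "sequence"
    return "single"

tokens, _ = tokenize_location("Scaldisstraat 23-25, 2000 Antwerpen")
print(tokens)  # ['Scaldisstraat', '23-25', ',', '2000', 'Antwerpen']
print(classify_housenumber_type("23-25"))   # range
print(classify_housenumber_type("7"))       # single
print(classify_housenumber_type("3 en 5"))  # sequence
```

Note that the comma is kept as its own token, which is why location spans can end cleanly before a separator.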
## Deployment

### Local

```bash
git clone https://huggingface.co/svercoutere/abb-dual-location-component-ner
pip install transformers torch pytorch-crf
python test_dual_ner.py
```
### Docker / API

Wrap `predict_locations` in a FastAPI endpoint:

```python
from fastapi import FastAPI

app = FastAPI()

@app.post("/parse")
def parse(text: str):
    return predict_locations(text, model, tokenizer, device)
```
## Training Details
| Setting | Value |
|---|---|
| Base model | xlm-roberta-base |
| Max sequence length | 256 tokens |
| Batch size (train) | 16 |
| Batch size (eval) | 32 |
| Epochs | 50 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Loss weighting | 1.5 × component + 0.5 × location |
| Decoding | CRF Viterbi (both heads) |
| Training data | ~10 000 Belgian municipal decision location strings |
| Class balancing | Minority-class oversampling to median token count |
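The class-balancing row above is terse; one plausible reading (a hypothetical sketch, not the actual training code) is to duplicate examples containing under-represented entity types until each type's token count reaches the median count across all types:

```python
import random
from collections import Counter

def oversample_to_median(examples, rng=random.Random(0)):
    """Hypothetical minority-class oversampling sketch.
    examples: list of (text, bio_tags) pairs.
    """
    def type_counts(exs):
        c = Counter()
        for _, tags in exs:
            for tag in tags:
                if tag != "O":
                    c[tag.split("-", 1)[1]] += 1  # strip B-/I- prefix
        return c

    counts = type_counts(examples)
    median = sorted(counts.values())[len(counts) // 2]
    out = list(examples)
    for ent, n in counts.items():
        pool = [ex for ex in examples if any(t.endswith("-" + ent) for t in ex[1])]
        # Duplicate examples with this type until its token count hits the median
        while n < median and pool:
            ex = rng.choice(pool)
            out.append(ex)
            n += sum(1 for t in ex[1] if t.endswith("-" + ent))
    return out

examples = [("a", ["B-CITY"]), ("b", ["B-CITY"]), ("c", ["B-CITY"]), ("d", ["B-STREET"])]
print(len(oversample_to_median(examples)))  # 6: "d" is duplicated twice
```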
## Limitations

- Primarily trained on Belgian/Flemish location strings; German accuracy may be lower for rare component types.
- `DOMAIN_ZONE_AREA` F1 is lower (0.61) due to high lexical variability and limited training examples.