# TEI Entity Linker – Qwen3-14B LoRA
A fine-tuned LoRA adapter for authority file linking in historical TEI editions. Given an entity from a TEI register (person, place, organization) and a list of candidates from Wikidata/GND, the model decides whether a candidate matches, partially matches, or does not match.
## Task
The model solves a core problem in digital scholarly editing: linking named entities in historical registers to authority files (GND, Wikidata). This requires:
- Disambiguating entities by biographical data, descriptions, and context
- Handling historical spelling variants (Creveld → Krefeld, Coeln → Köln, Alderkyrchen → Aldekerk)
- Distinguishing mythological figures from literary works of the same name
- Recognizing ethnic groups, allegories, and personifications as valid entities
- Rejecting false matches when candidates don't fit
## Usage

### Start the server

```bash
pip install mlx-lm
python3 -m mlx_lm.server \
  --model Qwen/Qwen3-14B-MLX-4bit \
  --adapter-path ./adapters-14b \
  --port 1234
```
### Recommended inference parameters

| Parameter | Value | Notes |
|---|---|---|
| `temperature` | 0.1 | Low temperature for deterministic JSON output. Higher values cause malformed JSON. |
| `top_p` | 0.9 | Nucleus sampling, slightly narrowed |
| `max_tokens` | 2048 | Sufficient for batches of 5 entities. Scale up for larger batches. |
| `repetition_penalty` | 1.0 | No penalty needed – the model was trained on structured output |
| `stop` | `]` (optional) | Can be used to stop after the JSON array closes |
The model uses Qwen3's built-in thinking mode (`<think>...</think>`). With these parameters, the thinking block is typically empty and the model responds directly with JSON. To disable thinking explicitly, set `enable_thinking: false` in the chat template or prepend `<think>\n\n</think>\n\n` to the assistant message.
### Send a request

The model expects an OpenAI-compatible chat format with a system prompt defining the task and a user prompt listing entities with candidates:

```bash
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-14B-MLX-4bit",
    "temperature": 0.1,
    "messages": [
      {
        "role": "system",
        "content": "Du bist ein Normdaten-Experte fuer historische TEI-Editionen.\nEntscheide ob Normdaten-Kandidaten (Wikidata/GND) zu Entities aus historischen Registern passen.\n\nRegeln:\n- MATCH: >=90% sicher. PARTIAL: 75-90%. NONE: kein Kandidat passt.\n- Lebensdaten und Register-Notizen sind die staerksten Indikatoren.\n- Abstammung/Verwandtschaft MUSS uebereinstimmen.\n- Antworte NUR als JSON-Array.\n\nFormat pro Entity:\n{\"idx\": 1, \"verdict\": \"MATCH\"|\"PARTIAL\"|\"NONE\", \"best\": 1, \"confidence\": 0.95, \"reason\": \"Max 1 Satz\", \"modern_name\": null}"
      },
      {
        "role": "user",
        "content": "--- Entity 1 ---\nName: Philipp Melanchthon\nTyp: person\nNote: nachantik/fruehneuzeitlich; Lebensdaten: 1497-1560; Reformator und Humanist\n\nKandidaten:\n 1. GND 118580485 \"Melanchthon, Philipp\" – Theologe, Humanist, Reformator; 1497-1560\n 2. Wikidata Q76325 \"Philipp Melanchthon\" – German Protestant reformer (1497-1560)\n 3. GND 1089928572 \"Melanchthon, Philipp\" – Sohn des vorigen"
      }
    ]
  }'
```
### Response

```json
[
  {
    "idx": 1,
    "verdict": "MATCH",
    "best": 1,
    "confidence": 0.99,
    "reason": "Lebensdaten (1497-1560) und Beruf (Reformator, Humanist) stimmen ueberein.",
    "modern_name": null
  }
]
```
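When calling the server programmatically, the completion text still has to be parsed into verdict objects. A minimal sketch, assuming the completion string has already been extracted from the OpenAI-style response (`parse_verdicts` is an illustrative helper, not part of any published API):

```python
import json
import re

def parse_verdicts(raw: str) -> list[dict]:
    """Extract the JSON verdict array from a raw model completion.

    Strips an optional (usually empty) <think>...</think> block and
    tolerates stray text around the array.
    """
    # Remove Qwen3's thinking block if present
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # Take everything from the first '[' to the last ']'
    start, end = cleaned.find("["), cleaned.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("no JSON array found in model output")
    return json.loads(cleaned[start:end + 1])
```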
## Input Format

### Entity block structure

Each entity in the user prompt follows this structure:

```text
--- Entity N ---
Name: <register name>
Typ: person|place|org
Alternativnamen: <optional, e.g. Greek/Latin forms>
Lebensdaten (Register): geb. 1533, gest. 1613
Amt/Beruf (Register): <optional>
Register-Notiz: <optional biographical/geographical context>
Quelltext-Kontext: <optional, surrounding text from TEI source>
Kandidaten:
  1. [wikidata] <label> – <description> (Score 0.80)
     ID: Q128176
  2. [gnd] <label> – <description> (geb. 1497, gest. 1560)
     ID: 118580485
```
### Candidate format

Candidates are listed with their source (`[wikidata]` or `[gnd]`), a label, a description, and an optional pre-computed similarity score. The authority file ID (Wikidata Q-ID or GND number) is passed on a separate `ID:` line below each candidate.
**Important:** The model was trained on single-entity prompts (one entity per request). It also works well with batches of 2-5 entities per call, but single-entity prompts match the training distribution most closely.
### Entity fields

| Field | Required | Description |
|---|---|---|
| `Name` | yes | Entity name as it appears in the register |
| `Typ` | yes | `person`, `place`, or `org`. Subtypes can be annotated: `person (PERSONENGRUPPE/VOLK)`, `person (KOERPERSCHAFT)` |
| `Alternativnamen` | no | Alternative name forms (Greek, Latin, historical spellings) |
| `Lebensdaten (Register)` | no | Birth/death dates from the register |
| `Amt/Beruf (Register)` | no | Occupation or title from the register |
| `Register-Notiz` | no | Free-text note with biographical/geographical context. This is the strongest disambiguation signal. |
| `Quelltext-Kontext` | no | Surrounding text from the TEI source document |
| `Kandidaten` | yes | Numbered list of candidates with source, label, description, score, and ID |
### Candidate fields

| Field | Required | Example |
|---|---|---|
| Source tag | yes | `[wikidata]` or `[gnd]` |
| Label | yes | `Melanchthon, Philipp` |
| Description | no | `Theologe, Humanist, Reformator; 1497-1560` |
| Life dates | no | `(geb. 1497, gest. 1560)` – appended after the description |
| Score | no | `(Score 0.77)` – pre-computed similarity score |
| ID line | yes | `ID: Q76325` or `ID: 118580485` on the next line |
If no candidates were found, use: `(Keine Kandidaten gefunden)`
### Prompt header

The user prompt should start with a verification instruction, optionally including project-specific context:

```text
Verifiziere diese Zuordnungen.
WICHTIG: KEINE zeitliche Filterung durchfuehren!
Pruefe NUR ob der Kandidat zur Register-Note passt.
Antike, mittelalterliche und biblische Figuren sind gueltig.
Pruefe auf historische Schreibweisen!

--- Entity 1 ---
...

Antwort als JSON-Array:
```
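Assembling the full user prompt from the header, pre-formatted entity blocks, and the closing instruction is then a simple join. A sketch (the header text is copied from the example above; `build_user_prompt` is an illustrative name, not a published function):

```python
# Header copied verbatim from the prompt-header example above
PROMPT_HEADER = """Verifiziere diese Zuordnungen.
WICHTIG: KEINE zeitliche Filterung durchfuehren!
Pruefe NUR ob der Kandidat zur Register-Note passt.
Antike, mittelalterliche und biblische Figuren sind gueltig.
Pruefe auf historische Schreibweisen!"""

def build_user_prompt(entity_blocks: list[str]) -> str:
    """Join header, pre-formatted entity blocks, and the closing instruction."""
    parts = [PROMPT_HEADER, *entity_blocks, "Antwort als JSON-Array:"]
    return "\n\n".join(parts)
```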
## Output Format

JSON array with one object per entity:

| Field | Type | Description |
|---|---|---|
| `idx` | int | Entity index (1-based) |
| `verdict` | string | `MATCH` (>=90%), `PARTIAL` (75-90%), `NONE` |
| `best` | int | 1-based index of the best candidate (0 if `NONE`) |
| `confidence` | float | 0.0 - 1.0 |
| `reason` | string | Brief justification (1 sentence) |
| `modern_name` | string\|null | Modern spelling if the register name is a historical variant (e.g. `"Krefeld"` for `"Creveld"`) |
## Training

### Method
The model was fine-tuned using LoRA (Low-Rank Adaptation) on Apple Silicon via MLX. Training data was generated through a teacher-student approach: Claude Sonnet 4 served as the verification oracle on real TEI register data, producing high-quality labeled examples. These were deduplicated, quality-filtered, and converted to ChatML format for supervised fine-tuning.
Each training example is a single entity with its candidates – the model learns to evaluate candidate quality based on name matching, biographical data, descriptions, and register context.
### Hyperparameters
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-14B-MLX-4bit |
| Method | LoRA (rank 16, alpha 32, dropout 0.05) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trained layers | 16 |
| Training examples | 7,098 |
| Validation examples | 789 |
| Iterations | 1,500 (checkpoints every 300) |
| Learning rate | 1.5e-4 |
| Max sequence length | 2,048 |
| Optimizer | Adam |
| Gradient checkpointing | Enabled |
| Batch size | 1 (with grad accumulation) |
### Training Data Composition
| Category | Count |
|---|---|
| Persons | 7,398 |
| Places | 489 |
| MATCH verdicts | 6,343 |
| NONE verdicts | 1,544 |
| Total (after dedup) | 7,887 |
Training data sources:
- Historical TEI editions (early modern German correspondence, 16th century)
- Classical philology registers (Greek/Latin named entities)
- Enrichment pipeline output verified by Claude Sonnet 4
The data covers a wide range of entity types: historical persons with life dates, mythological/biblical figures, ethnic groups and peoples, geographic locations with historical spelling variants, and personifications/allegories.
## Capabilities
Tested on representative entities from a 16th-century TEI edition:
- Biographical disambiguation: Correctly distinguishes father/son with same name by life dates
- Historical spelling: Recognizes Creveld → Krefeld, Coeln → Köln, Alderkyrchen → Aldekerk
- Mythology vs. literature: Picks Q202990 (Hecuba the queen) over Q1421115 (Hecuba the play)
- Ethnic groups/peoples: Matches "Trojaner" to GND 4060976-3 (Volk), not the city
- Correct rejection: Returns NONE for underspecified entries like "Fridericus, Ilfelder Schรผler"
- Geographic disambiguation: Dillingen an der Donau vs. Dillingen/Saar, Wesel (NRW) vs. Wesel (Niedersachsen)
## Integration

This adapter was built for the TEI-NER-Stack (GitHub repository TBA), an automated entity linking pipeline for historical TEI editions. The pipeline handles candidate retrieval (Wikidata/GND API), batching, confidence scoring, and XML output – the model serves as the verification step.
## Requirements

- Apple Silicon Mac (M1 or later) for MLX inference
- ~10 GB RAM for the 4-bit model + adapter
- Python 3.10+, `mlx-lm` package
For GPU inference on other platforms, convert the adapter to Hugging Face PEFT format and use vLLM or text-generation-inference with the full-precision Qwen3-14B base model.
## License
Apache 2.0 (inherited from Qwen3-14B)
## Citation

```bibtex
@software{makowski2026tei_entity_linker,
  author = {Makowski, Stephan},
  title  = {TEI Entity Linker: Fine-tuned Qwen3-14B for Authority File Linking in Historical Editions},
  year   = {2026},
  note   = {GitHub repository TBA}
}
```