# TEI Entity Linker – Qwen3-14B LoRA
A fine-tuned LoRA adapter for authority file linking in historical TEI editions. Given an entity from a TEI register (person, place, organization) and a list of candidates from Wikidata/GND, the model decides whether a candidate matches, partially matches, or does not match.
## Task
The model solves a core problem in digital scholarly editing: linking named entities in historical registers to authority files (GND, Wikidata). This requires:
- Disambiguating entities by biographical data, descriptions, and context
- Handling historical spelling variants (Creveld → Krefeld, Coeln → Köln, Alderkyrchen → Aldekerk)
- Distinguishing mythological figures from literary works of the same name
- Recognizing ethnic groups, allegories, and personifications as valid entities
- Rejecting false matches when candidates don't fit
## Usage

### Start the server

```bash
pip install mlx-lm
python3 -m mlx_lm.server \
  --model Qwen/Qwen3-14B-MLX-4bit \
  --adapter-path ./adapters-14b \
  --port 1234
```
### Recommended inference parameters

| Parameter | Value | Notes |
|---|---|---|
| `temperature` | 0.1 | Low temperature for deterministic JSON output. Higher values cause malformed JSON. |
| `top_p` | 0.9 | Nucleus sampling, slightly narrowed |
| `max_tokens` | 2048 | Sufficient for batches of 5 entities. Scale up for larger batches. |
| `repetition_penalty` | 1.0 | No penalty needed – the model was trained on structured output |
| `stop` | `]` (optional) | Can be used to stop after the JSON array closes |
The model uses Qwen3's built-in thinking mode (`<think>...</think>`). With these parameters, the thinking block is typically empty and the model responds directly with JSON. To disable thinking explicitly, set `enable_thinking: false` in the chat template or prepend `<think>\n\n</think>\n\n` to the assistant message.
### Send a request

The model expects an OpenAI-compatible chat format with a system prompt defining the task and a user prompt listing entities with candidates:

```bash
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-14B-MLX-4bit",
    "temperature": 0.1,
    "messages": [
      {
        "role": "system",
        "content": "Du bist ein Normdaten-Experte fuer historische TEI-Editionen.\nEntscheide ob Normdaten-Kandidaten (Wikidata/GND) zu Entities aus historischen Registern passen.\n\nRegeln:\n- MATCH: >=90% sicher. PARTIAL: 75-90%. NONE: kein Kandidat passt.\n- Lebensdaten und Register-Notizen sind die staerksten Indikatoren.\n- Abstammung/Verwandtschaft MUSS uebereinstimmen.\n- Antworte NUR als JSON-Array.\n\nFormat pro Entity:\n{\"idx\": 1, \"verdict\": \"MATCH\"|\"PARTIAL\"|\"NONE\", \"best\": 1, \"confidence\": 0.95, \"reason\": \"Max 1 Satz\", \"modern_name\": null}"
      },
      {
        "role": "user",
        "content": "--- Entity 1 ---\nName: Philipp Melanchthon\nTyp: person\nNote: nachantik/fruehneuzeitlich; Lebensdaten: 1497-1560; Reformator und Humanist\n\nKandidaten:\n 1. GND 118580485 \"Melanchthon, Philipp\" – Theologe, Humanist, Reformator; 1497-1560\n 2. Wikidata Q76325 \"Philipp Melanchthon\" – German Protestant reformer (1497-1560)\n 3. GND 1089928572 \"Melanchthon, Philipp\" – Sohn des vorigen"
      }
    ]
  }'
```
### Response

```json
[
  {
    "idx": 1,
    "verdict": "MATCH",
    "best": 1,
    "confidence": 0.99,
    "reason": "Lebensdaten (1497-1560) und Beruf (Reformator, Humanist) stimmen ueberein.",
    "modern_name": null
  }
]
```
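When calling the server programmatically, the completion text still has to be parsed into verdict objects. A minimal sketch, assuming the completion string has already been extracted from the OpenAI-style response (`parse_verdicts` is an illustrative helper, not part of any published API):

```python
import json
import re

def parse_verdicts(raw: str) -> list[dict]:
    """Extract the JSON verdict array from a raw model completion.

    Strips an optional (usually empty) <think>...</think> block and
    tolerates stray text around the array.
    """
    # Remove Qwen3's thinking block if present
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # Take everything from the first '[' to the last ']'
    start, end = cleaned.find("["), cleaned.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("no JSON array found in model output")
    return json.loads(cleaned[start:end + 1])
```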
## Input Format

### Entity block structure

Each entity in the user prompt follows this structure:

```text
--- Entity N ---
Name: <register name>
Typ: person|place|org
Alternativnamen: <optional, e.g. Greek/Latin forms>
Lebensdaten (Register): geb. 1533, gest. 1613
Amt/Beruf (Register): <optional>
Register-Notiz: <optional biographical/geographical context>
Quelltext-Kontext: <optional, surrounding text from TEI source>
Kandidaten:
  1. [wikidata] <label> – <description> (Score 0.80)
     ID: Q128176
  2. [gnd] <label> – <description> (geb. 1497, gest. 1560)
     ID: 118580485
```
### Candidate format

Candidates are listed with their source (`[wikidata]` or `[gnd]`), a label, a description, and an optional pre-computed similarity score. The authority file ID (Wikidata Q-ID or GND number) is passed on a separate `ID:` line below each candidate.
**Important:** The model was trained on single-entity prompts (one entity per request). It also works well with batches of 2-5 entities per call, but single-entity prompts match the training distribution most closely.
### Entity fields

| Field | Required | Description |
|---|---|---|
| `Name` | yes | Entity name as it appears in the register |
| `Typ` | yes | `person`, `place`, or `org`. Subtypes can be annotated: `person (PERSONENGRUPPE/VOLK)`, `person (KOERPERSCHAFT)` |
| `Alternativnamen` | no | Alternative name forms (Greek, Latin, historical spellings) |
| `Lebensdaten (Register)` | no | Birth/death dates from the register |
| `Amt/Beruf (Register)` | no | Occupation or title from the register |
| `Register-Notiz` | no | Free-text note with biographical/geographical context. This is the strongest disambiguation signal. |
| `Quelltext-Kontext` | no | Surrounding text from the TEI source document |
| `Kandidaten` | yes | Numbered list of candidates with source, label, description, score, and ID |
### Candidate fields

| Field | Required | Example |
|---|---|---|
| Source tag | yes | `[wikidata]` or `[gnd]` |
| Label | yes | `Melanchthon, Philipp` |
| Description | no | `Theologe, Humanist, Reformator; 1497-1560` |
| Life dates | no | `(geb. 1497, gest. 1560)` – appended after the description |
| Score | no | `(Score 0.77)` – pre-computed similarity score |
| ID line | yes | `ID: Q76325` or `ID: 118580485` on the next line |
If no candidates were found, use: `(Keine Kandidaten gefunden)`
### Prompt header

The user prompt should start with a verification instruction, optionally including project-specific context:

```text
Verifiziere diese Zuordnungen.
WICHTIG: KEINE zeitliche Filterung durchfuehren!
Pruefe NUR ob der Kandidat zur Register-Note passt.
Antike, mittelalterliche und biblische Figuren sind gueltig.
Pruefe auf historische Schreibweisen!

--- Entity 1 ---
...

Antwort als JSON-Array:
```
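Assembling the full user prompt from the header, pre-formatted entity blocks, and the closing instruction is then a simple join. A sketch (the header text is copied from the example above; `build_user_prompt` is an illustrative name, not a published function):

```python
# Header copied verbatim from the prompt-header example above
PROMPT_HEADER = """Verifiziere diese Zuordnungen.
WICHTIG: KEINE zeitliche Filterung durchfuehren!
Pruefe NUR ob der Kandidat zur Register-Note passt.
Antike, mittelalterliche und biblische Figuren sind gueltig.
Pruefe auf historische Schreibweisen!"""

def build_user_prompt(entity_blocks: list[str]) -> str:
    """Join header, pre-formatted entity blocks, and the closing instruction."""
    parts = [PROMPT_HEADER, *entity_blocks, "Antwort als JSON-Array:"]
    return "\n\n".join(parts)
```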
## Output Format

JSON array with one object per entity:

| Field | Type | Description |
|---|---|---|
| `idx` | int | Entity index (1-based) |
| `verdict` | string | `MATCH` (>=90%), `PARTIAL` (75-90%), `NONE` |
| `best` | int | 1-based index of the best candidate (0 if `NONE`) |
| `confidence` | float | 0.0 - 1.0 |
| `reason` | string | Brief justification (1 sentence) |
| `modern_name` | string\|null | Modern spelling if the register name is a historical variant (e.g. `"Krefeld"` for `"Creveld"`) |
## Training

### Method
The model was fine-tuned using LoRA (Low-Rank Adaptation) on Apple Silicon via MLX. Training data was generated through a teacher-student approach: Claude Sonnet 4 served as the verification oracle on real TEI register data, producing high-quality labeled examples. These were deduplicated, quality-filtered, and converted to ChatML format for supervised fine-tuning.
Each training example is a single entity with its candidates – the model learns to evaluate candidate quality based on name matching, biographical data, descriptions, and register context.
### Hyperparameters
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-14B-MLX-4bit |
| Method | LoRA (rank 16, alpha 32, dropout 0.05) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trained layers | 16 |
| Training examples | 7,098 |
| Validation examples | 789 |
| Iterations | 1,500 (checkpoints every 300) |
| Learning rate | 1.5e-4 |
| Max sequence length | 2,048 |
| Optimizer | Adam |
| Gradient checkpointing | Enabled |
| Batch size | 1 (with grad accumulation) |
### Training Data Composition
| Category | Count |
|---|---|
| Persons | 7,398 |
| Places | 489 |
| MATCH verdicts | 6,343 |
| NONE verdicts | 1,544 |
| Total (after dedup) | 7,887 |
Training data sources:
- Historical TEI editions (early modern German correspondence, 16th century)
- Classical philology registers (Greek/Latin named entities)
- Enrichment pipeline output verified by Claude Sonnet 4
The data covers a wide range of entity types: historical persons with life dates, mythological/biblical figures, ethnic groups and peoples, geographic locations with historical spelling variants, and personifications/allegories.
## Capabilities
Tested on representative entities from a 16th-century TEI edition:
- Biographical disambiguation: Correctly distinguishes father/son with same name by life dates
- Historical spelling: Recognizes Creveld → Krefeld, Coeln → Köln, Alderkyrchen → Aldekerk
- Mythology vs. literature: Picks Q202990 (Hecuba the queen) over Q1421115 (Hecuba the play)
- Ethnic groups/peoples: Matches "Trojaner" to GND 4060976-3 (Volk), not the city
- Correct rejection: Returns NONE for underspecified entries like "Fridericus, Ilfelder Schรผler"
- Geographic disambiguation: Dillingen an der Donau vs. Dillingen/Saar, Wesel (NRW) vs. Wesel (Niedersachsen)
## Integration

This adapter was built for the TEI-NER-Stack (GitHub repository TBA), an automated entity linking pipeline for historical TEI editions. The pipeline handles candidate retrieval (Wikidata/GND API), batching, confidence scoring, and XML output – the model serves as the verification step.
## Requirements

- Apple Silicon Mac (M1 or later) for MLX inference
- ~10 GB RAM for the 4-bit model + adapter
- Python 3.10+, `mlx-lm` package
For GPU inference on other platforms, convert the adapter to Hugging Face PEFT format and use vLLM or text-generation-inference with the full-precision Qwen3-14B base model.
## License
Apache 2.0 (inherited from Qwen3-14B)
## Citation

```bibtex
@software{makowski2026tei_entity_linker,
  author = {Makowski, Stephan},
  title  = {TEI Entity Linker: Fine-tuned Qwen3-14B for Authority File Linking in Historical Editions},
  year   = {2026},
  note   = {GitHub repository TBA}
}
```