TEI Entity Linker — Qwen3-14B LoRA

A fine-tuned LoRA adapter for authority file linking in historical TEI editions. Given an entity from a TEI register (person, place, organization) and a list of candidates from Wikidata/GND, the model decides whether a candidate matches, partially matches, or does not match.

Task

The model solves a core problem in digital scholarly editing: linking named entities in historical registers to authority files (GND, Wikidata). This requires:

  • Disambiguating entities by biographical data, descriptions, and context
  • Handling historical spelling variants (Creveld → Krefeld, Coeln → Köln, Alderkyrchen → Aldekerk)
  • Distinguishing mythological figures from literary works of the same name
  • Recognizing ethnic groups, allegories, and personifications as valid entities
  • Rejecting false matches when candidates don't fit

Usage

Start the server

pip install mlx-lm

python3 -m mlx_lm.server \
  --model Qwen/Qwen3-14B-MLX-4bit \
  --adapter-path ./adapters-14b \
  --port 1234

Recommended inference parameters

| Parameter | Value | Notes |
|---|---|---|
| temperature | 0.1 | Low temperature for deterministic JSON output; higher values cause malformed JSON. |
| top_p | 0.9 | Nucleus sampling, slightly narrowed. |
| max_tokens | 2048 | Sufficient for batches of 5 entities; scale up for larger batches. |
| repetition_penalty | 1.0 | No penalty needed; the model was trained on structured output. |
| stop | ] | Optional; stops generation after the JSON array closes. |

The model uses Qwen3's built-in thinking mode (<think>...</think>). With these parameters, the thinking block is typically empty and the model responds directly with JSON. If you want to explicitly disable thinking, set enable_thinking: false in the chat template or prepend <think>\n\n</think>\n\n to the assistant message.

Send a request

The model expects an OpenAI-compatible chat format with a system prompt defining the task and a user prompt listing entities with candidates:

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "Qwen/Qwen3-14B-MLX-4bit",
  "temperature": 0.1,
  "messages": [
    {
      "role": "system",
      "content": "Du bist ein Normdaten-Experte fuer historische TEI-Editionen.\nEntscheide ob Normdaten-Kandidaten (Wikidata/GND) zu Entities aus historischen Registern passen.\n\nRegeln:\n- MATCH: >=90% sicher. PARTIAL: 75-90%. NONE: kein Kandidat passt.\n- Lebensdaten und Register-Notizen sind die staerksten Indikatoren.\n- Abstammung/Verwandtschaft MUSS uebereinstimmen.\n- Antworte NUR als JSON-Array.\n\nFormat pro Entity:\n{\"idx\": 1, \"verdict\": \"MATCH\"|\"PARTIAL\"|\"NONE\", \"best\": 1, \"confidence\": 0.95, \"reason\": \"Max 1 Satz\", \"modern_name\": null}"
    },
    {
      "role": "user",
      "content": "--- Entity 1 ---\nName: Philipp Melanchthon\nTyp: person\nNote: nachantik/fruehneuzeitlich; Lebensdaten: 1497-1560; Reformator und Humanist\n\nKandidaten:\n  1. GND 118580485 \"Melanchthon, Philipp\" — Theologe, Humanist, Reformator; 1497-1560\n  2. Wikidata Q76325 \"Philipp Melanchthon\" — German Protestant reformer (1497-1560)\n  3. GND 1089928572 \"Melanchthon, Philipp\" — Sohn des vorigen"
    }
  ]
}'
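The same request can be issued from Python using only the standard library. A minimal sketch, assuming the server from the section above is running on port 1234; the system prompt is abbreviated here and should be replaced with the full prompt from the curl example:

```python
import json
import urllib.request

SERVER_URL = "http://localhost:1234/v1/chat/completions"  # mlx_lm.server endpoint

# Abbreviated system prompt; use the full prompt from the curl example in practice.
SYSTEM_PROMPT = (
    "Du bist ein Normdaten-Experte fuer historische TEI-Editionen.\n"
    "Entscheide ob Normdaten-Kandidaten (Wikidata/GND) zu Entities aus "
    "historischen Registern passen.\n\n"
    "Regeln:\n"
    "- MATCH: >=90% sicher. PARTIAL: 75-90%. NONE: kein Kandidat passt.\n"
    "- Antworte NUR als JSON-Array."
)

def build_payload(user_prompt: str) -> dict:
    """Assemble an OpenAI-compatible chat payload with the recommended parameters."""
    return {
        "model": "Qwen/Qwen3-14B-MLX-4bit",
        "temperature": 0.1,
        "top_p": 0.9,
        "max_tokens": 2048,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    }

def post_chat(payload: dict) -> dict:
    """POST the payload to the local server and return the parsed JSON response."""
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)

if __name__ == "__main__":
    answer = post_chat(build_payload("--- Entity 1 ---\nName: Philipp Melanchthon\n..."))
    print(answer["choices"][0]["message"]["content"])
```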

Response

[
  {
    "idx": 1,
    "verdict": "MATCH",
    "best": 1,
    "confidence": 0.99,
    "reason": "Lebensdaten (1497-1560) und Beruf (Reformator, Humanist) stimmen ueberein.",
    "modern_name": null
  }
]
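With the recommended parameters the thinking block is usually empty, but a robust client should still strip it before parsing. A minimal sketch of such a parser (the function name is illustrative, not part of the adapter or pipeline):

```python
import json
import re

def parse_verdicts(raw: str) -> list[dict]:
    """Strip an optional <think>...</think> block and parse the JSON verdict array."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    # If "]" is used as a stop string, the output ends exactly at the array close;
    # locate the array explicitly to tolerate surrounding whitespace or prose.
    start, end = cleaned.find("["), cleaned.rfind("]")
    if start == -1 or end == -1:
        raise ValueError(f"No JSON array found in model output: {raw[:80]!r}")
    return json.loads(cleaned[start : end + 1])
```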

Input Format

Entity block structure

Each entity in the user prompt follows this structure:

--- Entity N ---
Name: <register name>
Typ: person|place|org
Alternativnamen: <optional, e.g. Greek/Latin forms>
Lebensdaten (Register): geb. 1533, gest. 1613
Amt/Beruf (Register): <optional>
Register-Notiz: <optional biographical/geographical context>
Quelltext-Kontext: <optional, surrounding text from TEI source>
Kandidaten:
  1. [wikidata] <label> — <description> (Score 0.80)
     ID: Q128176
  2. [gnd] <label> — <description> (geb. 1497, gest. 1560)
     ID: 118580485

Candidate format

Candidates are listed with their source ([wikidata] or [gnd]), a label, description, and an optional pre-computed similarity score. The authority file ID (Wikidata Q-ID or GND number) is passed on a separate ID: line below each candidate.

Important: The model was trained on single-entity prompts (one entity per request). It also works well with batches of 2-5 entities per call, but single-entity prompts match the training distribution most closely.

Entity fields

| Field | Required | Description |
|---|---|---|
| Name | yes | Entity name as it appears in the register |
| Typ | yes | person, place, or org. Subtypes can be annotated: person (PERSONENGRUPPE/VOLK), person (KOERPERSCHAFT) |
| Alternativnamen | no | Alternative name forms (Greek, Latin, historical spellings) |
| Lebensdaten (Register) | no | Birth/death dates from the register |
| Amt/Beruf (Register) | no | Occupation or title from the register |
| Register-Notiz | no | Free-text note with biographical/geographical context; the strongest disambiguation signal |
| Quelltext-Kontext | no | Surrounding text from the TEI source document |
| Kandidaten | yes | Numbered list of candidates with source, label, description, score, and ID |

Candidate fields

| Field | Required | Example |
|---|---|---|
| Source tag | yes | [wikidata] or [gnd] |
| Label | yes | Melanchthon, Philipp |
| Description | no | Theologe, Humanist, Reformator; 1497-1560 |
| Life dates | no | (geb. 1497, gest. 1560), appended after the description |
| Score | no | (Score 0.77), pre-computed similarity score |
| ID line | yes | ID: Q76325 or ID: 118580485 on the next line |

If no candidates were found, use: (Keine Kandidaten gefunden)

Prompt header

The user prompt should start with a verification instruction, optionally including project-specific context:

Verifiziere diese Zuordnungen.

WICHTIG: KEINE zeitliche Filterung durchfuehren!
Pruefe NUR ob der Kandidat zur Register-Note passt.
Antike, mittelalterliche und biblische Figuren sind gueltig.

Prüfe auf historische Schreibweisen!

--- Entity 1 ---
...

Antwort als JSON-Array:

Output Format

JSON array with one object per entity:

| Field | Type | Description |
|---|---|---|
| idx | int | Entity index (1-based) |
| verdict | string | MATCH (>=90%), PARTIAL (75-90%), NONE |
| best | int | 1-based index of the best candidate (0 if NONE) |
| confidence | float | 0.0 - 1.0 |
| reason | string | Brief justification (one sentence) |
| modern_name | string or null | Modern spelling if the register name is a historical variant (e.g. "Krefeld" for "Creveld") |
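Since downstream processing depends on well-formed verdicts, it is worth checking each object against this schema before use. A minimal validation sketch (the rules are derived from the table above; the helper name is illustrative):

```python
VALID_VERDICTS = {"MATCH", "PARTIAL", "NONE"}

def validate_verdict(obj: dict, n_candidates: int) -> list[str]:
    """Return a list of problems; an empty list means the object conforms."""
    errors = []
    if not isinstance(obj.get("idx"), int) or obj["idx"] < 1:
        errors.append("idx must be a 1-based integer")
    if obj.get("verdict") not in VALID_VERDICTS:
        errors.append("verdict must be MATCH, PARTIAL, or NONE")
    best = obj.get("best")
    if not isinstance(best, int) or not 0 <= best <= n_candidates:
        errors.append("best must be 0 (NONE) or a 1-based candidate index")
    elif obj.get("verdict") == "NONE" and best != 0:
        errors.append("best must be 0 when verdict is NONE")
    conf = obj.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        errors.append("confidence must be in [0.0, 1.0]")
    return errors
```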

Training

Method

The model was fine-tuned using LoRA (Low-Rank Adaptation) on Apple Silicon via MLX. Training data was generated through a teacher-student approach: Claude Sonnet 4 served as the verification oracle on real TEI register data, producing high-quality labeled examples. These were deduplicated, quality-filtered, and converted to ChatML format for supervised fine-tuning.

Each training example is a single entity with its candidates โ€” the model learns to evaluate candidate quality based on name matching, biographical data, descriptions, and register context.
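In mlx-lm's chat data format, such an example would be serialized roughly as follows; this is a sketch of the JSONL layout with abbreviated prompt text, not the actual training file:

```python
import json

def make_record(system: str, entity_block: str, verdicts: list[dict]) -> str:
    """Serialize one supervised example as a single JSONL line (chat format)."""
    record = {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": entity_block + "\n\nAntwort als JSON-Array:"},
            # The target completion is the JSON verdict array as a string.
            {"role": "assistant", "content": json.dumps(verdicts, ensure_ascii=False)},
        ]
    }
    return json.dumps(record, ensure_ascii=False)

line = make_record(
    "Du bist ein Normdaten-Experte ...",
    "--- Entity 1 ---\nName: Philipp Melanchthon\n...",
    [{"idx": 1, "verdict": "MATCH", "best": 1, "confidence": 0.99}],
)
```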

Hyperparameters

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-14B-MLX-4bit |
| Method | LoRA (rank 16, alpha 32, dropout 0.05) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trained layers | 16 |
| Training examples | 7,098 |
| Validation examples | 789 |
| Iterations | 1,500 (checkpoints every 300) |
| Learning rate | 1.5e-4 |
| Max sequence length | 2,048 |
| Optimizer | Adam |
| Gradient checkpointing | Enabled |
| Batch size | 1 (with gradient accumulation) |
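A run with these hyperparameters can be launched via mlx-lm's LoRA trainer along these lines; flag names reflect recent mlx-lm versions and may differ in older releases, and LoRA rank/alpha/dropout and target modules are typically set via a YAML config file rather than CLI flags:

```shell
python3 -m mlx_lm.lora \
  --model Qwen/Qwen3-14B-MLX-4bit \
  --train \
  --data ./data \
  --iters 1500 \
  --batch-size 1 \
  --learning-rate 1.5e-4 \
  --num-layers 16 \
  --max-seq-length 2048 \
  --grad-checkpoint \
  --save-every 300 \
  --adapter-path ./adapters-14b
```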

Training Data Composition

| Category | Count |
|---|---|
| Persons | 7,398 |
| Places | 489 |
| MATCH verdicts | 6,343 |
| NONE verdicts | 1,544 |
| Total (after dedup) | 7,887 |

Training data sources:

  • Historical TEI editions (early modern German correspondence, 16th century)
  • Classical philology registers (Greek/Latin named entities)
  • Enrichment pipeline output verified by Claude Sonnet 4

The data covers a wide range of entity types: historical persons with life dates, mythological/biblical figures, ethnic groups and peoples, geographic locations with historical spelling variants, and personifications/allegories.

Capabilities

Tested on representative entities from a 16th-century TEI edition:

  • Biographical disambiguation: Correctly distinguishes father/son with same name by life dates
  • Historical spelling: Recognizes Creveld→Krefeld, Coeln→Köln, Alderkyrchen→Aldekerk
  • Mythology vs. literature: Picks Q202990 (Hecuba the queen) over Q1421115 (Hecuba the play)
  • Ethnic groups/peoples: Matches "Trojaner" to GND 4060976-3 (Volk), not the city
  • Correct rejection: Returns NONE for underspecified entries like "Fridericus, Ilfelder Schüler"
  • Geographic disambiguation: Dillingen an der Donau vs. Dillingen/Saar, Wesel (NRW) vs. Wesel (Niedersachsen)

Integration

This adapter was built for the TEI-NER-Stack (GitHub repository TBA), an automated entity linking pipeline for historical TEI editions. The pipeline handles candidate retrieval (Wikidata/GND API), batching, confidence scoring, and XML output; the model serves as the verification step.

Requirements

  • Apple Silicon Mac (M1 or later) for MLX inference
  • ~10 GB RAM for the 4-bit model + adapter
  • Python 3.10+, mlx-lm package

For GPU inference on other platforms, convert the adapter to Hugging Face PEFT format and use vLLM or text-generation-inference with the full-precision Qwen3-14B base model.

License

Apache 2.0 (inherited from Qwen3-14B)

Citation

@software{makowski2026tei_entity_linker,
  author = {Makowski, Stephan},
  title = {TEI Entity Linker: Fine-tuned Qwen3-14B for Authority File Linking in Historical Editions},
  year = {2026},
  note = {GitHub repository TBA}
}