Qwen2.5-7B · CoNLL-2003 English NER

This model (stefan-it/Qwen2.5-7B-CoNLL-2003) is a LoRA fine-tune of Qwen/Qwen2.5-7B for Named Entity Recognition (NER) on the original English CoNLL-2003 dataset. It was trained with a bracketed inline output format in an Alpaca-style instruction-tuning setup, reproducing the experimental configuration of Zhan et al. (2026).

Training code, evaluation scripts, and full configuration are available at: https://github.com/stefan-it/llms-meet-ner


Bracketed Inline Format

Instead of producing a label sequence, the model rewrites the input sentence by wrapping each named entity in [Entity Text | LABEL] brackets. Plain (non-entity) tokens are left unchanged.

Input:

Hussain , considered surplus to England 's one-day requirements , struck 158 , his first championship century of the season , as Essex reached 372 .

Output:

[Hussain | PER] , considered surplus to [England | LOC] 's one-day requirements , struck 158 , his first championship century of the season , as [Essex | ORG] reached 372 .

Multi-token entities are supported naturally — the entire span appears inside a single bracket pair. Everything after the first newline in the model output is discarded to handle hallucinated continuations.
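The bracketed output can be mapped back to entity spans with a small amount of regex post-processing. A minimal sketch (the pattern and the `parse_bracketed` helper are illustrative, not taken from the linked repository):

```python
import re

# Matches "[Entity Text | LABEL]"; entity text may not contain brackets or "|".
BRACKET_RE = re.compile(r"\[([^\[\]|]+?)\s*\|\s*([A-Z]+)\]")

def parse_bracketed(output: str):
    """Return (plain_sentence, entities) where entities are (text, label) pairs."""
    # Discard everything after the first newline to drop hallucinated continuations,
    # as described above.
    line = output.split("\n", 1)[0]
    entities = [(m.group(1).strip(), m.group(2)) for m in BRACKET_RE.finditer(line)]
    plain = BRACKET_RE.sub(lambda m: m.group(1).strip(), line)
    return plain, entities

plain, ents = parse_bracketed(
    "[Hussain | PER] , considered surplus to [England | LOC] 's one-day requirements"
)
# ents == [("Hussain", "PER"), ("England", "LOC")]
```

Stripping the brackets recovers the original token sequence, so the predicted spans can be aligned one-to-one against the gold tokens.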


Instruction Format

The model was fine-tuned in Alpaca-style instruction format. Each training example consists of a system instruction defining the task and label set, followed by the input sentence as the user turn and the bracketed inline annotation as the expected response.

The instruction used for CoNLL-2003 English:

Your task is to identify all named entities in the input sentence and rewrite
the sentence by enclosing each entity using the format [Entity Text | LABEL].
Use only the label tags defined in the Label Set below.
Label Set:
ORG(organization): A collective entity such as a company, institution, brand,
  political or governmental body, publication, or any organized group of people
  acting as a unit.
PER(person): A named individual, including humans, animals, fictional
  characters, and their aliases.
LOC(location): A geographical or spatial entity, including natural features,
  built structures, regions, public or commercial places, assorted buildings,
  and abstract or metaphorical places.
MISC(miscellaneous): Named entities that are not persons, organizations, or
  locations, including derived adjectives, religions, ideologies,
  nationalities, languages, events, programs, wars, titles of works, slogans,
  eras, and types of objects.
Now process the input sentence:
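Assembling one training record from this instruction is straightforward. A hedged sketch using the common Alpaca field names (`instruction`/`input`/`output`); the exact keys and the `make_example` helper are assumptions, and the repository's data-preparation code may differ:

```python
# First sentence of the instruction shown above; the full prompt includes the
# label-set definitions as well.
INSTRUCTION = (
    "Your task is to identify all named entities in the input sentence and "
    "rewrite the sentence by enclosing each entity using the "
    "format [Entity Text | LABEL]."
)

def make_example(sentence: str, bracketed: str) -> dict:
    """Build one Alpaca-style record: instruction, raw sentence, bracketed target."""
    return {
        "instruction": INSTRUCTION,
        "input": sentence,
        "output": bracketed,
    }

example = make_example(
    "Hussain struck 158 as Essex reached 372 .",
    "[Hussain | PER] struck 158 as [Essex | ORG] reached 372 .",
)
```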

Training

Hyperparameter                Value
Base model                    Qwen/Qwen2.5-7B
Fine-tuning method            LoRA (via LLaMA-Factory)
LoRA rank                     256
LoRA alpha                    512
LoRA target                   all
Training dataset              CoNLL-2003 English train split
Epochs                        2
Learning rate                 2.0e-5
LR scheduler                  cosine
Warmup ratio                  0.1
Per-device batch size         1
Gradient accumulation steps   8
Effective batch size          8
Max sequence length           2048
Precision                     bfloat16
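In LLaMA-Factory these hyperparameters are typically supplied as a YAML config. A sketch of how the table above would map onto that schema; key names follow LLaMA-Factory's documented options, but the dataset name is illustrative and the actual config in the linked repository may differ:

```yaml
model_name_or_path: Qwen/Qwen2.5-7B
stage: sft
finetuning_type: lora
lora_rank: 256
lora_alpha: 512
lora_target: all
dataset: conll2003_en          # illustrative name, see the repository
num_train_epochs: 2
learning_rate: 2.0e-5
lr_scheduler_type: cosine
warmup_ratio: 0.1
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
cutoff_len: 2048
bf16: true
```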

Evaluation Setup

Evaluation works from the raw model output (bracketed inline predictions), aligned against gold labels taken directly from the original CoNLL-2003 IOB1 dataset rather than from the converted training format, to avoid any annotation artefacts. Two complementary evaluations are reported:

  • seqeval — entity-level strict span matching: a span is correct only if both its boundaries and entity type match exactly.
  • nervaluate — span-level evaluation reporting four scenarios (strict, exact, partial, ent_type) following the SemEval 2013 Task 9.1 metrics.
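The strict scenario reduces to set intersection over (start, end, label) triples. A minimal sketch of that computation; `compute_strict_f1` is an illustrative helper, not part of seqeval or nervaluate:

```python
def compute_strict_f1(gold_spans, pred_spans):
    """Micro P/R/F1 over sets of (start, end, label) tuples for one corpus.

    A predicted span counts as a true positive only if its boundaries AND
    entity type exactly match a gold span (seqeval-style strict matching).
    """
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(0, 1, "PER"), (6, 7, "LOC"), (20, 21, "ORG")}
pred = {(0, 1, "PER"), (6, 7, "ORG")}  # second span: boundaries right, type wrong
p, r, f1 = compute_strict_f1(gold, pred)
# p == 0.5, r == 1/3
```

nervaluate's exact, partial, and ent_type scenarios relax this in different ways (ignoring type, allowing boundary overlap, or ignoring boundaries, respectively), which is why their scores are consistently higher than strict in the tables below.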

Results

Development Set (eng.testa) — 3,466 sentences, 51,578 tokens

seqeval

Entity      Precision   Recall   F1
LOC         0.98        0.98     0.98
MISC        0.92        0.93     0.93
ORG         0.95        0.96     0.96
PER         0.98        0.99     0.99
micro avg   0.9660      0.9704   0.9682

nervaluate (aggregated)

Scenario   Precision   Recall   F1
strict     0.9660      0.9704   0.9682
exact      0.9782      0.9827   0.9804
partial    0.9834      0.9879   0.9856
ent_type   0.9744      0.9788   0.9766

Test Set (eng.testb) — 3,684 sentences, 46,666 tokens

seqeval

Entity      Precision   Recall   F1
LOC         0.95        0.94     0.95
MISC        0.82        0.84     0.83
ORG         0.92        0.95     0.93
PER         0.98        0.97     0.98
micro avg   0.9350      0.9396   0.9373

nervaluate (aggregated)

Scenario   Precision   Recall   F1
strict     0.9350      0.9396   0.9373
exact      0.9639      0.9687   0.9663
partial    0.9716      0.9765   0.9740
ent_type   0.9436      0.9483   0.9460

The test set F1 of 93.73 exactly matches the result reported by Zhan et al. for this model configuration.

