# Qwen2.5-7B · CoNLL-2003 English NER
This model is a LoRA fine-tune of Qwen/Qwen2.5-7B for Named Entity Recognition (NER) on the original English CoNLL-2003 dataset. It was trained using a bracketed inline output format and an Alpaca-style instruction-tuning setup, reproducing the experimental configuration from Zhan et al. (2026).
Training code, evaluation scripts, and full configuration are available at: https://github.com/stefan-it/llms-meet-ner
## Bracketed Inline Format
Instead of producing a label sequence, the model rewrites the input sentence by wrapping each named entity in [Entity Text | LABEL] brackets. Plain (non-entity) tokens are left unchanged.
Input:
Hussain , considered surplus to England 's one-day requirements , struck 158 , his first championship century of the season , as Essex reached 372 .
Output:
[Hussain | PER] , considered surplus to [England | LOC] 's one-day requirements , struck 158 , his first championship century of the season , as [Essex | ORG] reached 372 .
Multi-token entities are supported naturally — the entire span appears inside a single bracket pair. Everything after the first newline in the model output is discarded to handle hallucinated continuations.
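Predictions in this format can be parsed back into character-level entity spans with a single regular-expression pass. A minimal sketch (the function name and exact regex are illustrative, not taken from the linked repository); it keeps only the first output line, matching the truncation rule above:

```python
import re

# Matches "[Entity Text | LABEL]" with optional spaces around the pipe.
BRACKET = re.compile(r"\[([^\[\]|]+?)\s*\|\s*([A-Z]+)\]")

def parse_bracketed(output: str):
    """Return (plain_text, [(start, end, label), ...]) from a bracketed prediction.

    Only the first line of the model output is kept.
    """
    line = output.split("\n", 1)[0]
    plain_parts, entities = [], []
    pos = 0     # cursor in the bracketed string
    offset = 0  # length of plain text emitted so far
    for m in BRACKET.finditer(line):
        plain_parts.append(line[pos:m.start()])
        offset += m.start() - pos
        text = m.group(1).strip()
        entities.append((offset, offset + len(text), m.group(2)))
        plain_parts.append(text)
        offset += len(text)
        pos = m.end()
    plain_parts.append(line[pos:])
    return "".join(plain_parts), entities
```

For the example above, `parse_bracketed` recovers the original sentence plus character-offset spans such as `(0, 7, "PER")` for `Hussain`.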
## Instruction Format
The model was fine-tuned in Alpaca-style instruction format. Each training example consists of a system instruction defining the task and label set, followed by the input sentence as the user turn and the bracketed inline annotation as the expected response.
The instruction used for CoNLL-2003 English:
Your task is to identify all named entities in the input sentence and rewrite
the sentence by enclosing each entity using the format [Entity Text | LABEL].
Use only the label tags defined in the Label Set below.
Label Set:
ORG(organization): A collective entity such as a company, institution, brand,
political or governmental body, publication, or any organized group of people
acting as a unit.
PER(person): A named individual, including humans, animals, fictional
characters, and their aliases.
LOC(location): A geographical or spatial entity, including natural features,
built structures, regions, public or commercial places, assorted buildings,
and abstract or metaphorical places.
MISC(miscellaneous): Named entities that are not persons, organizations, or
locations, including derived adjectives, religions, ideologies,
nationalities, languages, events, programs, wars, titles of works, slogans,
eras, and types of objects.
Now process the input sentence:
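A single training example in the standard Alpaca schema (`instruction` / `input` / `output` fields) would then look roughly as follows; this is a sketch of the record shape, with the instruction text abbreviated, not a dump of the actual training file:

```python
# One illustrative training record in Alpaca-style instruction format.
# The full instruction (with the complete label-set definitions above)
# goes in "instruction"; the example sentence is the one shown earlier.
record = {
    "instruction": (
        "Your task is to identify all named entities in the input sentence "
        "and rewrite the sentence by enclosing each entity using the format "
        "[Entity Text | LABEL]. ..."  # label-set definitions elided here
    ),
    "input": (
        "Hussain , considered surplus to England 's one-day requirements , "
        "struck 158 , his first championship century of the season , "
        "as Essex reached 372 ."
    ),
    "output": (
        "[Hussain | PER] , considered surplus to [England | LOC] 's one-day "
        "requirements , struck 158 , his first championship century of the "
        "season , as [Essex | ORG] reached 372 ."
    ),
}
```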
## Training
| Hyperparameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-7B |
| Fine-tuning method | LoRA (via LLaMA-Factory) |
| LoRA rank | 256 |
| LoRA alpha | 512 |
| LoRA target | all |
| Training dataset | CoNLL-2003 English train split |
| Epochs | 2 |
| Learning rate | 2.0e-5 |
| LR scheduler | cosine |
| Warmup ratio | 0.1 |
| Per-device batch size | 1 |
| Gradient accumulation steps | 8 |
| Effective batch size | 8 |
| Max sequence length | 2048 |
| Precision | bfloat16 |
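The table above maps onto a LLaMA-Factory SFT config roughly like the following. This is a sketch assembled from the hyperparameter table, not the repository's actual file; the dataset name is a placeholder, and the `template` choice is an assumption:

```yaml
### model
model_name_or_path: Qwen/Qwen2.5-7B

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 256
lora_alpha: 512
lora_target: all

### dataset
dataset: conll2003_en_train   # placeholder name
template: qwen                # assumed template for Qwen2.5
cutoff_len: 2048

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 2.0e-5
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
```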
## Evaluation Setup
Evaluation is performed in two complementary ways. Both work from the raw model output (the bracketed inline predictions), aligned against gold labels taken directly from the original CoNLL-2003 IOB1 dataset, never from the converted training format, to avoid any annotation artefacts.
- seqeval — strict entity-level matching computed from the token-level tag sequences: a span is correct only if both its boundaries and its entity type match exactly.
- nervaluate — span-level evaluation reporting four scenarios (strict, exact, partial, ent_type), following the SemEval-2013 Task 9.1 metrics.
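The strict scenario (shared by seqeval and nervaluate) can be illustrated with a small pure-Python sketch, deliberately not using either library: a predicted span counts as a true positive only if its exact `(start, end, label)` triple appears in the gold annotation.

```python
def strict_f1(gold_spans, pred_spans):
    """Micro-averaged precision/recall/F1 under strict span matching.

    gold_spans and pred_spans are parallel lists (one entry per sentence)
    of sets of (start, end, label) tuples.
    """
    tp = sum(len(g & p) for g, p in zip(gold_spans, pred_spans))
    n_gold = sum(len(g) for g in gold_spans)
    n_pred = sum(len(p) for p in pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    return precision, recall, (2 * precision * recall / denom if denom else 0.0)
```

A span with correct boundaries but the wrong type (e.g. predicting `ORG` where the gold label is `LOC`) counts as an error under strict matching, which is why the exact and partial scenarios in the tables below score higher than strict.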
## Results
### Development Set (eng.testa) — 3,466 sentences, 51,578 tokens
#### seqeval
| Entity | Precision | Recall | F1 |
|---|---|---|---|
| LOC | 0.98 | 0.98 | 0.98 |
| MISC | 0.92 | 0.93 | 0.93 |
| ORG | 0.95 | 0.96 | 0.96 |
| PER | 0.98 | 0.99 | 0.99 |
| micro avg | 0.9660 | 0.9704 | 0.9682 |
#### nervaluate (aggregated)
| Scenario | Precision | Recall | F1 |
|---|---|---|---|
| strict | 0.9660 | 0.9704 | 0.9682 |
| exact | 0.9782 | 0.9827 | 0.9804 |
| partial | 0.9834 | 0.9879 | 0.9856 |
| ent_type | 0.9744 | 0.9788 | 0.9766 |
### Test Set (eng.testb) — 3,684 sentences, 46,666 tokens
#### seqeval
| Entity | Precision | Recall | F1 |
|---|---|---|---|
| LOC | 0.95 | 0.94 | 0.95 |
| MISC | 0.82 | 0.84 | 0.83 |
| ORG | 0.92 | 0.95 | 0.93 |
| PER | 0.98 | 0.97 | 0.98 |
| micro avg | 0.9350 | 0.9396 | 0.9373 |
#### nervaluate (aggregated)
| Scenario | Precision | Recall | F1 |
|---|---|---|---|
| strict | 0.9350 | 0.9396 | 0.9373 |
| exact | 0.9639 | 0.9687 | 0.9663 |
| partial | 0.9716 | 0.9765 | 0.9740 |
| ent_type | 0.9436 | 0.9483 | 0.9460 |
The test set F1 of 93.73 exactly matches the result reported by Zhan et al. for this model configuration.