Model card for OpenBioNER-base-v2-deid
We introduce OpenBioNER-base-v2-deid, a compact, description-driven encoder for zero-shot clinical de-identification (PHI removal).
Like its biomedical counterpart openbioner-base-v2, it recognizes unseen entity types using only their natural language descriptions, eliminating the need for retraining or label-specific fine-tuning.
Available Models
| Release | Model Name | Size | Domain | Language | License |
|---|---|---|---|---|---|
| v1 | openbioner-base | 110M | Biomedicine | English | MIT |
| v2 | openbioner-base-v2 | 110M | Biomedicine | English | MIT |
| v2 | openbioner-compact-v2 | 65M | Biomedicine | English | MIT |
| v2 | openbioner-tiny-v2 | 15M | Biomedicine | English | MIT |
| v2 | openbioner-base-v2-deid | 110M | PHI de-identification | English | MIT |
Model Details
This model extends the OpenBioNER-v2 design with a specialized pretraining and refinement pipeline, PHIRefine, optimized for identifying personal health information (PHI) in clinical text. We fine-tune the OpenBioNER-base-v2 model from the BroadScan checkpoint to specialize in PHI extraction.
It is trained on disi-unibo-nlp/pii-masking-400k-IOB-med, resulting in state-of-the-art zero-shot de-identification performance despite having only 110M parameters.
Evaluated on the newly proposed disi-unibo-nlp/physionet-deid-i2b2-2014 benchmark (the most curated dataset for zero-shot PHI detection), OpenBioNER-base-v2-deid demonstrates strong generalization and outperforms larger LLM-based and domain-specific encoder approaches while maintaining a very lightweight footprint. This makes it a practical, open-source solution for scalable PHI protection in clinical workflows.
Installation
To use this model, you must install the IBM Zshot library:
!pip install -U zshot==0.0.11 datasets==3.6.0 gliner
!python -m spacy download en_core_web_sm
Usage
import spacy
from zshot import PipelineConfig, displacy
from zshot.linker import LinkerSMXM
from zshot.evaluation.metrics._seqeval._seqeval import Seqeval
from zshot.utils.data_models import Entity
from zshot.evaluation.zshot_evaluate import evaluate, prettify_evaluate_report
# define your list of candidate entity types
entities = [
Entity(
name='AGE',
description="Identifies numerical or descriptive expressions indicating a person's age. This includes numbers, optionally followed by units like 'YEAR OLD', 'YR', 'YO' or 'Y.O.'. Also includes variations like '57yo' or '74y'. Do not include birthdates or age ranges unless a specific age is mentioned. Examples: '58', '57yo', '74', '67 y.o.'.",
vocabulary=None
),
Entity(
name='HOSPITAL',
description="Identifies names or abbreviations of healthcare facilities where a patient receives medical care. This includes full names (e.g., 'CALVERT HOSPITAL', 'Holy Cross'), abbreviations (e.g., 'GH', 'CALVERT-'), and misspellings (e.g., 'CALVERT HOSPIATAL'). Do not include general hospital types (e.g., 'hospital', 'ER', 'ICU') unless they are part of a specific named entity (e.g., 'kernan hosp').",
vocabulary=None
),
Entity(
name='DATE',
description="Mark all explicit and implied dates, including full dates (e.g., '1992', '7/22', '7/23'), two-digit years (e.g., '92', ''92', ''95'), and days of the week or parts of the day (e.g., 'Monday', 'MN' for midnight). Do not include relative date expressions like '2 WEEK HISTORY'.",
vocabulary=None
),
Entity(
name='LOCATION',
description="Identifies specific, named locations within a healthcare facility, typically referring to specialized departments, units, or rooms where patient care is provided. This includes common abbreviations for units. Do not tag general terms like 'hospital' or 'clinic' unless they refer to a specific, named part of that facility. Examples: 'ICU', 'CCU', 'cath lab', 'PCU', 'quartermain 2/3', 'IR'.",
vocabulary=None
),
]
nlp = spacy.blank("en")
nlp_config = PipelineConfig(
linker=LinkerSMXM(model_name="disi-unibo-nlp/openbioner-base-v2-deid"),
entities=entities,
device='cuda' # or 'cpu' if GPU not available
)
nlp.add_pipe("zshot", config=nlp_config, last=True)
sentence = "80-year-old female admitted in transfer GH for mental status changes post fall in the bathroom; HPI: 6-week history of leg weakness; on 3/20, found by on floor—awake but with mental status changes."
doc = nlp(sentence)
displacy.render(doc, style="ent")
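Once the pipeline has run, the predicted spans can be used to redact the text. The following is a minimal sketch, assuming each prediction is available as a (start_char, end_char, label) triple, which spaCy exposes on each entity as `ent.start_char`, `ent.end_char`, and `ent.label_`. The helper `mask_phi` and the hand-written example spans are hypothetical and not part of zshot; the sketch also assumes the spans do not overlap.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def mask_phi(text: str, spans: List[Span]) -> str:
    """Replace each character span with a [LABEL] placeholder."""
    masked = text
    # Process spans right to left so earlier offsets stay valid.
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked


# Illustrative spans, written by hand; with the pipeline above they could be
# built as [(e.start_char, e.end_char, e.label_) for e in doc.ents].
example_text = "80-year-old female admitted in transfer GH on 3/20."
example_spans = [(0, 2, "AGE"), (40, 42, "HOSPITAL"), (46, 50, "DATE")]
masked_text = mask_phi(example_text, example_spans)
print(masked_text)
```

Replacing from the end of the string backwards avoids recomputing offsets after each substitution.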
Run Evaluation
Let's evaluate the model on the PhysioNet-deid-i2b2-2014 dataset.
To use this dataset, you must import the id.text file into your local folder (more information at this link).
from datasets import load_dataset
ds = load_dataset('disi-unibo-nlp/physionet-deid-i2b2-2014', split='train')
print("Tokens:", ds['tokens'][0])
print("Tags:", ds['ner_tags'][0])
print("Unique labels:", set([t[2:] for tag in ds['ner_tags'] for t in tag if t != 'O']))
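The label-extraction line above strips the two-character `B-`/`I-` prefix from each tag. Below is a minimal sketch of the same idea on hand-written toy data, assuming the tags are strings in IOB2 format (if the dataset stores integer class ids, they would first need to be mapped to names). The helper `iob_to_spans` is hypothetical.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert IOB2 tags to (start_token, end_token_exclusive, label) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # A span ends on 'O', on a new 'B-' tag, or on an 'I-' tag of another type.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Toy example, not taken from the dataset.
tokens = ["Seen", "at", "Holy", "Cross", "on", "7/22", "."]
tags = ["O", "O", "B-HOSPITAL", "I-HOSPITAL", "O", "B-DATE", "O"]
spans = iob_to_spans(tags)
labels = {label for _, _, label in spans}
print(spans)
print(sorted(labels))
```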
Evaluation with zshot library
import spacy
from zshot import PipelineConfig, displacy
from zshot.linker import LinkerSMXM
from zshot.evaluation.metrics._seqeval._seqeval import Seqeval
from zshot.utils.data_models import Entity
from zshot.evaluation.zshot_evaluate import evaluate, prettify_evaluate_report
entities = [
Entity(
name='AGE',
description="Identifies numerical or descriptive expressions indicating a person's age. This includes numbers, optionally followed by units like 'YEAR OLD', 'YR', 'YO' or 'Y.O.'. Also includes variations like '57yo' or '74y'. Do not include birthdates or age ranges unless a specific age is mentioned. Examples: '58', '57yo', '74', '67 y.o.'.",
vocabulary=None
),
Entity(
name='HOSPITAL',
description="Identifies names or abbreviations of healthcare facilities where a patient receives medical care. This includes full names (e.g., 'CALVERT HOSPITAL', 'Holy Cross'), abbreviations (e.g., 'GH', 'CALVERT-'), and misspellings (e.g., 'CALVERT HOSPIATAL'). Do not include general hospital types (e.g., 'hospital', 'ER', 'ICU') unless they are part of a specific named entity (e.g., 'kernan hosp').",
vocabulary=None
),
Entity(
name='DATE',
description="Mark all explicit and implied dates, including full dates (e.g., '1992', '7/22', '7/23'), two-digit years (e.g., '92', ''92', ''95'), and days of the week or parts of the day (e.g., 'Monday', 'MN' for midnight). Do not include relative date expressions like '2 WEEK HISTORY'.",
vocabulary=None
),
Entity(
name='LOCATION',
description="Identifies specific, named locations within a healthcare facility, typically referring to specialized departments, units, or rooms where patient care is provided. This includes common abbreviations for units. Do not tag general terms like 'hospital' or 'clinic' unless they refer to a specific, named part of that facility. Examples: 'ICU', 'CCU', 'cath lab', 'PCU', 'quartermain 2/3', 'IR'.",
vocabulary=None
),
Entity(
name='NAME',
description="Personal names of individuals (patients, healthcare providers, family, etc.). This includes full names (e.g., 'Mary Souza') and last names (e.g., 'Healey', 'Rizzo', 'Bean', 'Vasquez'). Do not include titles (e.g. 'Dr.') unless they are part of a salutation that includes the name and cannot be easily separated.",
vocabulary=None
),
Entity(
name='CITY',
description="A specific, named location of human habitation, generally smaller than a state or country, where the patient or a related person lives or has visited. This includes towns, boroughs, or any named municipality. Do not include general geographical regions (e.g., 'the South'). Examples: 'catonsville', 'San Diego', 'Hampton', 'rome'.",
vocabulary=None
),
Entity(
name='STATE',
description="Identify any mention of a U.S. state, commonwealth, or territory. This includes full names, common abbreviations (e.g., 'CA', 'NY'), and often-used nicknames, even if ambiguous in isolation (e.g., 'NH' for New Hampshire, 'GA' for Georgia). Do not include cities, counties, or other sub-state geographical entities. Examples: 'Florida', 'NY', 'Calif', 'NH', 'Massachusetts'.",
vocabulary=None
),
Entity(
name='ORGANIZATION',
description="Identifies names of companies, institutions, or other organized groups. This includes healthcare providers (e.g., 'GBMC', 'Hospice/VNA', 'S. DOMINICO NURSING AGENCY'), corporations (e.g., 'IBM', 'Genentech'), and product brands/manufacturers (e.g., 'MCDONALDS', 'Big Boy BariAir', 'CRITICARE'). Capture the full official name or recognizable short forms.",
vocabulary=None
),
Entity(
name='PROFESSION',
description="A person's occupation, job, or role, typically implying a regular activity or source of income. This includes specific job titles, general occupational categories, or formal/informal roles within a professional context. Exclude temporary states (e.g., 'UNEMPLOYED' when referring to a lack of employment, not a job title), but include roles like 'CAREGIVER' if it denotes an ongoing, defined responsibility. Examples: 'CEO', 'chaplain', 'interpreter', 'Rabbi', 'neurologist'.",
vocabulary=None
),
Entity(
name='ID',
description="Identifies unique numerical or alphanumeric sequences used for identification. This includes phone numbers, fax numbers, social security numbers, medical record numbers, account numbers, and internal tracking codes like visit numbers. Exclude generic numerical references not intended for identification (e.g., drug dosages, room numbers, dates without a time component). Examples: '201/324/1423', '07-1530', 'MRN: 12345', 'SSN-987-65-4321'.",
vocabulary=None
),
Entity(
name='PHONE',
description="Identifies any sequence of digits representing a telephone, pager, or fax number. This includes standard 10-digit numbers (e.g., '201-561-8910'), shorter pager numbers (e.g., '54321'), and numbers preceded by 'tel', 'PG', or 'Pager #'. Do not include surrounding words or symbols like '#' unless they are part of the number itself.",
vocabulary=None
),
Entity(
name='COUNTRY',
description="Identify names of sovereign states or recognized geopolitical entities with independent governance. Include definite articles (e.g., 'the') if they are part of the common name or closely precede it. Exclude cities, states, or regions within a country. Examples: 'Canada', 'the United States', 'France', 'United Kingdom', 'South Korea'.",
vocabulary=None
),
]
nlp = spacy.blank("en")
nlp_config = PipelineConfig(
linker=LinkerSMXM(model_name="disi-unibo-nlp/openbioner-base-v2-deid"),
entities=entities,
device='cuda'
)
nlp.add_pipe("zshot", config=nlp_config, last=True)
print("Evaluating...")
evaluation = evaluate(nlp, ds, metric=Seqeval())
print("Done!")
print(prettify_evaluate_report(evaluation)[0])
Evaluating...
Done!
+------------------------------------+
| linker - |
| General - span-based |
+-------------------------+----------+
| Metric | |
+-------------------------+----------+
| overall_precision_micro | 0.6493 |
| overall_recall_micro | 0.4598 |
| overall_f1_micro | 0.5382 |
| overall_precision_macro | 0.5106 |
| overall_recall_macro | 0.3175 |
| overall_f1_macro | 0.3706 |
| overall_accuracy | 0.9384 |
| total_time_in_seconds | 254.6109 |
| samples_per_second | 5.5575 |
| latency_in_seconds | 0.1799 |
+-------------------------+----------+
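As a sanity check on the report above, micro F1 is conventionally the harmonic mean of micro precision and micro recall. The short worked example below plugs in the reported numbers and agrees with the reported micro F1 to about three decimal places. Macro F1, by contrast, is typically averaged per class, so it cannot be recovered from macro precision and macro recall alone. The function name `f1_score` is only an illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation report above.
micro_precision = 0.6493
micro_recall = 0.4598
reported_micro_f1 = 0.5382

micro_f1 = f1_score(micro_precision, micro_recall)
print(round(micro_f1, 3))
```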
Performance
The updated OpenBioNER-base-v2-deid model achieves state-of-the-art performance, surpassing all the considered baselines across the benchmark, at both the token and span levels.
⚠️ Note: All results above were computed using the zshot library (v0.0.11), which supports both GLiNER and OpenBioNER architectures. For all GLiNER models, evaluations were performed using lowercase type names and a threshold of 0.5. To reproduce the results, please refer to our GitHub repository.
⚠️ Disclaimer: Please note that running evaluations using the zshot library may lead to slightly different results on certain benchmarks compared to those reported in the paper (above). This discrepancy is due to differences in token alignment: zshot uses spaCy's character-based span matching, while our experiments use token-level alignment as handled by BERT-based NER pipelines. These differences can affect how entity spans are matched and evaluated, particularly in cases with subword tokenization or punctuation.
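The disclaimer above distinguishes span-based matching from token-level alignment. The toy sketch below, on hand-written IOB tags rather than model output, shows one way the two can give different scores: a prediction that covers only part of a gold entity earns partial credit per token but no credit as a span. The helper names are hypothetical, and the strict exact-match rule is an assumption chosen for illustration; it is not meant to reproduce the scoring code of the paper or of zshot.

```python
from typing import List, Set, Tuple


def spans_of(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, label) spans from well-formed IOB2 tags."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            spans.add((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans


def prf(n_correct: int, n_pred: int, n_gold: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from raw counts."""
    p = n_correct / n_pred if n_pred else 0.0
    r = n_correct / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = ["O", "B-HOSPITAL", "I-HOSPITAL", "O", "B-DATE"]
pred = ["O", "B-HOSPITAL", "O", "O", "B-DATE"]  # second hospital token missed

# Span level: a predicted span counts only if boundaries and label match exactly.
gold_spans, pred_spans = spans_of(gold), spans_of(pred)
span_scores = prf(len(gold_spans & pred_spans), len(pred_spans), len(gold_spans))

# Token level: each non-'O' token is scored on its label alone.
gold_tok = {(i, t[2:]) for i, t in enumerate(gold) if t != "O"}
pred_tok = {(i, t[2:]) for i, t in enumerate(pred) if t != "O"}
token_scores = prf(len(gold_tok & pred_tok), len(pred_tok), len(gold_tok))

print("span  P/R/F1:", span_scores)
print("token P/R/F1:", token_scores)
```

In this toy case the truncated HOSPITAL prediction is a miss at the span level, while two of the three gold tokens still count at the token level.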
Descriptions
Below we provide all the descriptions used to evaluate openbioner-base-v2-deid for each dataset.
Negative Class
This is the description used as the NEG class (i.e., not an entity) for the dataset:
Coal, water, oil, etc. are normally used for traditional electricity generation. However using liquefied natural gas as fuel for joint circulatory electricity generation has advantages. The chief financial officer is the only one there taking the fall. It has a very talented team, eh. What will happen to the wildlife? I just tell them, you've got to change. They're here to stay. They have no insurance on their cars. What else would you like? Whether holding an international cultural event or setting the city's cultural policies, she always asks for the participation or input of other cities and counties.
Physionet-DEID
| Type | Description |
|---|---|
| AGE | Identifies numerical or descriptive expressions indicating a person's age. This includes numbers, optionally followed by units like "YEAR OLD", "YR", "YO" or "Y.O.". Also includes variations like "57yo" or "74y". Do not include birthdates or age ranges unless a specific age is mentioned. Examples: "58", "57yo", "74", "67 y.o.". |
| HOSPITAL | Identifies names or abbreviations of healthcare facilities where a patient receives medical care. This includes full names (e.g., "CALVERT HOSPITAL", "Holy Cross"), abbreviations (e.g., "GH", "CALVERT-"), and misspellings (e.g., "CALVERT HOSPIATAL"). Do not include general hospital types (e.g., "hospital", "ER", "ICU") unless they are part of a specific named entity (e.g., "kernan hosp"). |
| DATE | Mark all explicit and implied dates, including full dates (e.g. "1992", "7/22", "7/23"), two-digit years (e.g., "92", "'92", "'95"), and days of the week or parts of the day (e.g., "Monday", "MN" for midnight). Do not include relative date expressions like "2 WEEK HISTORY". |
| LOCATION | Identifies specific, named locations within a healthcare facility, typically referring to specialized departments, units, or rooms where patient care is provided. This includes common abbreviations for units. Do not tag general terms like "hospital" or "clinic" unless they refer to a specific, named part of that facility. Examples: "ICU", "CCU", "cath lab", "PCU", "quartermain 2/3", "IR". |
| NAME | Personal names of individuals (patients, healthcare providers, family, etc.). This includes full names (e.g., "Mary Souza") and last names (e.g., "Healey", "Rizzo", "Bean", "Vasquez"). Do not include titles (e.g. "Dr.") unless they are part of a salutation that includes the name and cannot be easily separated. |
| CITY | A specific, named location of human habitation, generally smaller than a state or country, where the patient or a related person lives or has visited. This includes towns, boroughs, or any named municipality. Do not include general geographical regions (e.g., "the South"). Examples: "catonsville", "San Diego", "Hampton", "rome". |
| STATE | Identify any mention of a U.S. state, commonwealth, or territory. This includes full names, common abbreviations (e.g., "CA", "NY"), and often-used nicknames, even if ambiguous in isolation (e.g., "NH" for New Hampshire, "GA" for Georgia). Do not include cities, counties, or other sub-state geographical entities. Examples: "Florida", "NY", "Calif", "NH", "Massachusetts". |
| ORGANIZATION | Identifies names of companies, institutions, or other organized groups. This includes healthcare providers (e.g., "GBMC", "Hospice/VNA", "S. DOMINICO NURSING AGENCY"), corporations (e.g., "IBM", "Genentech"), and product brands/manufacturers (e.g., "MCDONALDS", "Big Boy BariAir", "CRITICARE"). Capture the full official name or recognizable short forms. |
| PROFESSION | A person's occupation, job, or role, typically implying a regular activity or source of income. This includes specific job titles, general occupational categories, or formal/informal roles within a professional context. Exclude temporary states (e.g., "UNEMPLOYED" when referring to a lack of employment, not a job title), but include roles like "CAREGIVER" if it denotes an ongoing, defined responsibility. Examples: "CEO", "chaplain", "interpreter", "Rabbi", "neurologist". |
| ID | Identifies unique numerical or alphanumeric sequences used for identification. This includes phone numbers, fax numbers, social security numbers, medical record numbers, account numbers, and internal tracking codes like visit numbers. Exclude generic numerical references not intended for identification (e.g., drug dosages, room numbers, dates without a time component). Examples: "201/324/1423", "07-1530", "MRN: 12345", "SSN-987-65-4321". |
| PHONE | Identifies any sequence of digits representing a telephone, pager, or fax number. This includes standard 10-digit numbers (e.g., "201-561-8910"), shorter pager numbers (e.g., "54321"), and numbers preceded by "tel", "PG", or "Pager #". Do not include surrounding words or symbols like "#" unless they are part of the number itself. |
| COUNTRY | Identify names of sovereign states or recognized geopolitical entities with independent governance. Include definite articles (e.g., "the") if they are part of the common name or closely precede it. Exclude cities, states, or regions within a country. Examples: "Canada", "the United States", "France", "United Kingdom", "South Korea". |
🧬 How to Write Effective Entity Type Descriptions
Entity type descriptions are crucial for improving generalization in OpenBioNER. Well-written descriptions help models disambiguate types, handle rare classes, and align with real-world usage across diverse datasets.
✅ Best Practices
- Start with a clear definition: Briefly explain what the entity type is.
- Include functions or context: Add what it does, its purpose, or where it appears.
- List 3–5 concrete examples: Use domain-relevant examples (e.g., real diseases, proteins, or food items).
- Mention subtypes or synonyms (optional): Helps capture lexical variation and rare mentions.
- Keep it concise: 1–3 well-structured sentences are ideal.
⚠️ Common Mistakes to Avoid
- Vague or overly generic descriptions
- No examples
- Just a list of terms
- Redundant or circular wording
🧪 Template (Recommended Format)
A [TYPE] refers to [concise definition]. It includes examples such as [example1], [example2], and [example3].
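The template above can be filled programmatically when many types need descriptions. This is a minimal sketch: `build_description` is a hypothetical helper, and the COUNTRY definition and examples are adapted from the description table in this card.

```python
from typing import List


def build_description(type_name: str, definition: str, examples: List[str]) -> str:
    """Fill the recommended template: definition first, then 3-5 quoted examples."""
    if not 3 <= len(examples) <= 5:
        raise ValueError("provide between 3 and 5 concrete examples")
    quoted = [f"'{ex}'" for ex in examples]
    example_list = ", ".join(quoted[:-1]) + f", and {quoted[-1]}"
    return (
        f"A {type_name} refers to {definition}. "
        f"It includes examples such as {example_list}."
    )


description = build_description(
    "COUNTRY",
    "a sovereign state or recognized geopolitical entity with independent governance",
    ["Canada", "France", "South Korea"],
)
print(description)
```

The resulting string can then be passed as the `description` argument when constructing an `Entity`, as in the usage code earlier in this card.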
Authors
- Alessio Cocchieri
- Giacomo Frisoni
- Francesco Zangrillo
- Luca Ragazzi
- Marcos Martinez Galindo
- Gianluca Moro
- Giuseppe Tagliavini
📬 Contacts
For questions, collaborations, or feedback, feel free to reach out:
- Alessio: a.cocchieri@unibo.it
- Giacomo: giacomo.frisoni@unibo.it
- Francesco: f.zangrillo@unibo.it
Model tree for disi-unibo-nlp/openbioner-base-v2-deid
Base model
dmis-lab/biobert-v1.1
