medical-mt5-clinical-el-spanish

A generative model for clinical entity linking (ICD-10-CM/ES coding) in Spanish, fine-tuned from HiTZ/Medical-mT5-large on Spanish clinical text corpora.

Given a clinical mention extracted from a medical document, the model generates the corresponding ICD-10 code in a sequence-to-sequence fashion.


Model Description

Attribute     Details
Architecture  mT5-large (encoder-decoder)
Base model    HiTZ/Medical-mT5-large
Task          Clinical entity linking / ICD-10 coding
Language      Spanish (es)
Parameters    ~1.2B
Precision     float32
License       CC BY-NC 4.0

This model is part of a comparative study on generative approaches to clinical entity linking in Spanish. It is one of several fine-tuned variants explored in the associated paper.


Intended Use

Primary Use

  • Automated ICD-10 code assignment from Spanish clinical text mentions
  • Clinical NLP pipelines for Spanish electronic health records (EHRs)
  • Research on generative entity linking in low-resource medical settings

Out-of-Scope Use

  • Non-clinical or general-domain text
  • Languages other than Spanish
  • Direct clinical decision-making without human expert oversight

Training Data

The model was fine-tuned on four clinical NLP datasets containing Spanish text:

Dataset                         Description
Biomedical-TeMU/CodiEsp_corpus  CodiEsp shared task corpus: Spanish clinical cases annotated with ICD-10 codes (diagnoses and procedures)
DrBenchmark/E3C                 European clinical corpus with entity annotations across multiple languages, including Spanish
IIC/CT-EBM-SP                   Spanish clinical trial corpus annotated for biomedical entities
Mantra GSC                      Medline abstract titles, drug labels, and biomedical patent claims in Spanish

Evaluation

Performance is reported as mean ± standard deviation across three runs with different random seeds on the CodiEsp corpus.

Split        Model            PLI           PLII
TEST         Medical-mT5-C+M  70.46 ± 0.32  77.04 ± 0.35
TEST UNSEEN  Medical-mT5-C+M  42.46 ± 1.46  53.49 ± 1.34
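The mean ± standard deviation format above can be reproduced from per-seed scores as follows (the scores in this sketch are placeholders, not the actual per-run values, which are not published in this card):

```python
from statistics import mean, stdev

# Hypothetical per-seed PLI scores on the TEST split (illustrative only).
scores = [70.1, 70.5, 70.8]

# Sample standard deviation across the three runs, as typically reported.
print(f"{mean(scores):.2f} ± {stdev(scores):.2f}")
```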

The TEST UNSEEN split contains entity mentions whose surface forms were not seen during training, providing a stricter evaluation of generalization.

"C" indicates that context was used: each training instance is the concatenation of the entity mention and the sentence in which it occurs.

"+M" indicates that mapped corpora were added to training: corpora originally annotated with UMLS CUIs whose annotations were automatically mapped to ICD-10 with ClinIDMap.

PLI (Pipeline I) — the model's generated text is compared against all ICD-10 code descriptions in the knowledge base using normalized string matching (diacritics removed, lowercased). A code is assigned when an exact match is found.
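The normalization and exact-match step of Pipeline I can be sketched as follows. This is a minimal illustration, not the exact implementation from the paper; the knowledge-base entries are toy examples.

```python
import unicodedata

def normalize(text: str) -> str:
    """Lowercase and strip diacritics (e.g. 'Émbolo' -> 'embolo')."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Toy knowledge base: ICD-10 code -> description (illustrative entries only).
kb = {
    "BW20": "tomografía computarizada (scanner tc) de abdomen",
    "R51": "cefalea",
}

def link_exact(generated: str):
    """Assign a code only when the normalized strings match exactly."""
    norm = normalize(generated)
    for code, desc in kb.items():
        if normalize(desc) == norm:
            return code
    return None  # no exact match -> no code assigned in Pipeline I

print(link_exact("Tomografía computarizada (scanner TC) de abdomen"))  # BW20
```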

PLII (Pipeline II) — extends Pipeline I with semantic text similarity (STS). The generated text and all ICD-10 descriptions are embedded using multilingual SapBERT; the nearest neighbour in the vector space (L2 distance) determines the predicted code. This handles cases where the generated output is semantically close to the correct description but not an exact string match.
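The nearest-neighbour step of Pipeline II can be sketched with toy vectors standing in for the multilingual SapBERT embeddings (the three-dimensional vectors below are illustrative placeholders, not real embeddings):

```python
import math

# Toy embedding table: ICD-10 code -> description embedding.
# In the actual pipeline these would be SapBERT embeddings of the descriptions.
code_vectors = {
    "BW20": (0.9, 0.1, 0.0),
    "R51":  (0.0, 0.8, 0.2),
}

def l2(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_code(generated_vec):
    """Predict the code whose description embedding is closest in L2 distance."""
    return min(code_vectors, key=lambda c: l2(code_vectors[c], generated_vec))

# A generated output whose (toy) embedding lies near the BW20 description:
print(nearest_code((0.85, 0.15, 0.05)))  # BW20
```

Unlike Pipeline I, this step always returns some code, so it resolves mentions whose generated description is close to, but not identical with, a knowledge-base entry.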


How to Use

from transformers import MT5ForConditionalGeneration, AutoTokenizer

model_name = "ezotova/medical-mt5-clinical-el-spanish"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Example: entity mention from a clinical note
def make_prompt(term, sentence):
    return f"Genera una definición para el término: {term} - en la frase: {sentence}"

term = "TC abdominal"
sentence = "La TC abdominal es normal."

prompt = make_prompt(term, sentence)

inputs = tokenizer(prompt, return_tensors="pt", padding=True)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    num_beams=5,
    early_stopping=True
)

prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(prediction)

# Expected: tomografía computarizada (scanner tc) de abdomen [Code: BW20]

Citation

@article{zotova2025clinical,
  title     = {Generative Models for Clinical Entity Linking in Spanish},
  author    = {Zotova, Elena and Cuadros, Montse and Rigau, German},
  journal   = {Preprint submitted to Elsevier: Available at SSRN 5976036},
  year      = {2026},
  doi       = {[DOI]}
}

Ethical Considerations

  • This model is intended for research purposes only and should not be used as a standalone clinical coding tool without expert review.
  • Training data consists of de-identified or synthetic clinical text; however, users should exercise caution when applying the model to real patient data.
  • Performance may vary across clinical specialties, document types, and ICD-10 coding guidelines versions.

Acknowledgements

This work was carried out at Vicomtech in collaboration with the HiTZ Center (UPV/EHU).

The base model HiTZ/Medical-mT5-large was developed at HiTZ/IXA and is gratefully acknowledged.
