GLiNER2 — Mexican Legal NER (gliner2-mx-legal-ner)
A fine-tuned GLiNER2 model specialized in Named Entity Recognition for Mexican legal documents in Spanish. It extracts personal identifiers, financial data, legal references, and document numbers from notarial instruments, contracts, labor agreements, amparo suits, and criminal judgments.
Privacy by design: GLiNER2 is a 205M-parameter encoder model that runs entirely locally on CPU or GPU — sensitive legal documents never leave your server.
Quick Start
from gliner2 import GLiNER2
model = GLiNER2.from_pretrained("maik20100/gliner2-mx-legal-ner")
text = """
En la Ciudad de México, el día 15 de marzo de 2024, ante mí, licenciado Roberto Mendoza López,
Notario Público número 124 del Distrito Federal, comparece el señor Javier Ramírez González,
con RFC RAGJ850115KL4 y CURP RGGJ850115HDFRMV09, titular de la cuenta CLABE 032180000118359719.
"""
entity_descriptions = {
    "rfc": "Registro Federal de Contribuyentes, clave alfanumérica de 12 o 13 caracteres",
    "curp": "Clave Única de Registro de Población, cadena de 18 caracteres alfanuméricos",
    "nombre_persona": "Nombre completo de persona física incluyendo nombre(s) de pila y apellidos",
    "fecha": "Fecha en cualquier formato utilizado en documentos legales mexicanos",
    "cuenta_bancaria": "Número de cuenta bancaria o CLABE interbancaria (18 dígitos)",
}
entities = model.extract_entities(text, entity_descriptions)
print(entities)
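Since the model is meant for local, privacy-preserving processing, a natural follow-up to the Quick Start is masking the extracted spans. The sketch below is illustrative only and not part of the model's API: `redact` is a hypothetical helper, and it assumes the result has already been reduced to a mapping from label to a list of matched strings. Check the actual return shape of `extract_entities` in your installed `gliner2` version before wiring it in.

```python
def redact(text: str, entities_by_label: dict[str, list[str]]) -> str:
    """Replace each extracted span with a [LABEL] placeholder.

    Assumption: entities_by_label maps a label to plain matched strings,
    e.g. {"rfc": ["RAGJ850115KL4"]}. Adapt this if your output differs.
    """
    # Process longer spans first so a short span never clobbers part of a longer one.
    pairs = sorted(
        ((span, label) for label, spans in entities_by_label.items() for span in spans),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    for span, label in pairs:
        if span:  # skip empty strings, which would match everywhere
            text = text.replace(span, f"[{label.upper()}]")
    return text


sample_entities = {
    "nombre_persona": ["Javier Ramírez González"],
    "rfc": ["RAGJ850115KL4"],
}
print(redact("el señor Javier Ramírez González, con RFC RAGJ850115KL4", sample_entities))
# → el señor [NOMBRE_PERSONA], con RFC [RFC]
```

String replacement masks every occurrence of a span, which is usually what redaction wants. If character offsets are available, replacing by offset is more precise.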
Supported Entity Types
Entities with Validated High Performance (F1 ≥ 0.90)
These entities are considered production-ready based on evaluation against the held-out test split, which is synthetic like the rest of the dataset.
| Entity | Description | This Model (F1) | Base Model (F1) |
|---|---|---|---|
| correo_electronico | Email address for legal notifications | 1.000 | 1.000 |
| cuenta_bancaria | Bank account number or 18-digit CLABE | 1.000 | 0.739 |
| curp | Clave Única de Registro de Población (18 chars) | 1.000 | 0.859 |
| identificacion_oficial | INE/IFE credential folio, passport number, cartilla | 1.000 | 0.008 |
| nss | Número de Seguridad Social / IMSS affiliation (11 digits) | 1.000 | 1.000 |
| numero_contrato | Contract, policy, credit, or agreement identifier | 1.000 | 0.074 |
| numero_escritura | Public instrument, escritura pública, or acta number | 1.000 | n/a |
| numero_expediente | Court case or docket number | 1.000 | 0.614 |
| numero_telefono | 10-digit mobile or landline number | 1.000 | 0.933 |
| rfc | Registro Federal de Contribuyentes (12–13 chars) | 1.000 | 0.878 |
| nacionalidad | Declared nationality | 0.987 | 0.984 |
| nombre_persona | Full name of a natural person | 0.977 | 0.928 |
| fecha | Date in any format used in Mexican legal documents | 0.904 | 0.703 |
Macro F1 (all entities): 0.802; Micro F1: 0.793 (base model: macro 0.565, micro 0.610)
Entities Targeted for Future Improvement
The following entities are extracted but performance is not yet production-ready. Contributions and targeted training data are welcome.
| Entity | Description | This Model (F1) |
|---|---|---|
| domicilio | Full physical address (street, colonia, municipality, state, postal code) | 0.062 |
| monto | Transactional monetary amount (price, salary, capital contribution) | 0.163 |
| notaria | Notary public identifier including number and jurisdiction | 0.388 |
| organizacion | Legal entity name / razón social | 0.473 |
| referencia_legal | Citation to a Mexican law, code, or article | 0.476 |
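The macro F1 reported above can be re-derived from the two tables, because macro averaging is the unweighted mean of the per-entity F1 scores. The short check below copies the 18 "This Model" values from the tables. The micro F1 cannot be reproduced this way, since it depends on per-entity support counts that are not listed here.

```python
from statistics import mean

# Per-entity F1 for this model, copied from the two tables above.
f1_scores = {
    "correo_electronico": 1.000, "cuenta_bancaria": 1.000, "curp": 1.000,
    "identificacion_oficial": 1.000, "nss": 1.000, "numero_contrato": 1.000,
    "numero_escritura": 1.000, "numero_expediente": 1.000,
    "numero_telefono": 1.000, "rfc": 1.000,
    "nacionalidad": 0.987, "nombre_persona": 0.977, "fecha": 0.904,
    "domicilio": 0.062, "monto": 0.163, "notaria": 0.388,
    "organizacion": 0.473, "referencia_legal": 0.476,
}

# Macro F1: every entity type counts equally, regardless of how often it occurs.
macro_f1 = mean(list(f1_scores.values()))
print(round(macro_f1, 3))  # → 0.802

# Entity types at or above the 0.90 threshold used for the first table.
high_performers = [name for name, f1 in f1_scores.items() if f1 >= 0.90]
print(len(high_performers))  # → 13
```

The five low-scoring entity types pull the macro average well below the scores in the first table, which is why the per-entity numbers are more informative than the aggregate when deciding what to extract.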
Entity Descriptions for Inference
GLiNER2 uses natural language descriptions as semantic anchors. Always pass these at inference time for best results:
entity_descriptions = {
    # High-performance entities
    "rfc": "Registro Federal de Contribuyentes, clave alfanumérica de 12 caracteres para personas morales o 13 para personas físicas, ejemplo: GARC850101AB1",
    "curp": "Clave Única de Registro de Población, cadena de 18 caracteres alfanuméricos que identifica de forma única a un ciudadano mexicano, ejemplo: GARC850101HGTRRR09",
    "nombre_persona": "Nombre completo de persona física incluyendo nombre(s) de pila y apellidos paterno y materno, como aparece en documentos legales mexicanos",
    "fecha": "Fecha en cualquier formato utilizado en documentos legales mexicanos, incluyendo formato textual completo como 'quince de marzo del año dos mil veinticuatro'",
    "cuenta_bancaria": "Número de cuenta bancaria o CLABE interbancaria (18 dígitos) utilizada para depósitos financieros",
    "nss": "Número de Seguridad Social (NSS) o Número de Afiliación al IMSS, asignado a trabajadores en México (11 dígitos)",
    "correo_electronico": "Dirección de correo electrónico / email para oír y recibir notificaciones",
    "numero_telefono": "Número telefónico fijo o móvil, típicamente a 10 dígitos",
    "identificacion_oficial": "Identificador de documento oficial (folio de credencial INE/IFE, número de pasaporte, cartilla)",
    "numero_contrato": "Número identificador de un contrato, póliza de seguro, crédito, cuenta o convenio",
    "numero_escritura": "Número de instrumento público, escritura pública, acta o póliza pasada ante la fe pública",
    "numero_expediente": "Número de expediente, causa o juicio asignado por un juzgado o tribunal",
    "nacionalidad": "Nacionalidad declarada de la persona compareciente",
    # Future-improvement entities (include if needed)
    "domicilio": "Dirección física completa incluyendo calle, número exterior e interior, colonia, municipio o alcaldía, estado y código postal",
    "monto": "Cantidad monetaria particular acordada, pagada o reclamada por las partes (precio, salario, aportación de capital). Excluye umbrales legales o tarifas oficiales.",
    "notaria": "Identificación de notaría pública incluyendo número y jurisdicción",
    "organizacion": "Nombre de persona moral (razón social), empresa, asociación civil o dependencia gubernamental",
    "referencia_legal": "Cita a una ley, código, artículo o disposición normativa mexicana",
}
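The lengths stated in these descriptions (12 or 13 characters for RFC, 18 for CURP, 11 digits for NSS) can also serve as a cheap sanity filter on extracted values. The following is a minimal sketch under that assumption. `plausible` and `SHAPES` are hypothetical names, not part of `gliner2`, and the patterns check only length and character class, not checksums or internal structure.

```python
import re

# Coarse shape checks derived from the lengths given in the descriptions above.
SHAPES = {
    "rfc": re.compile(r"[A-ZÑ&0-9]{12,13}"),  # 12 (persona moral) or 13 (persona física)
    "curp": re.compile(r"[A-Z0-9]{18}"),
    "nss": re.compile(r"[0-9]{11}"),
}


def plausible(label: str, value: str) -> bool:
    """Return False only when a value clearly violates the expected shape."""
    pattern = SHAPES.get(label)
    if pattern is None:
        return True  # no shape rule for this label, so do not reject it
    # Ignore spaces and hyphens, which sometimes appear inside printed identifiers.
    normalized = re.sub(r"[\s-]", "", value).upper()
    return pattern.fullmatch(normalized) is not None


print(plausible("rfc", "RAGJ850115KL4"))        # → True
print(plausible("curp", "RGGJ850115HDFRMV09"))  # → True
print(plausible("nss", "1234567890"))           # → False (only 10 digits)
```

A filter like this can flag suspicious extractions for review. It is deliberately permissive for labels without a fixed shape, such as `fecha` or `nombre_persona`.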
Document Types Covered
Training data was generated across 7 Mexican legal document categories:
- Poderes Notariales: powers of attorney
- Actas Constitutivas: corporate formation instruments
- Contratos de Compraventa: purchase/sale agreements
- Juicios de Amparo: constitutional relief petitions
- Contratos Individuales de Trabajo: individual labor contracts
- Sentencias Penales de Primera Instancia: first-instance criminal judgments
- Expedientes / Consultas de Cliente: client intake files
Training Data & Methodology
Training data is fully synthetic — no real PII was used at any stage.
A custom pipeline generates realistic Mexican legal Spanish text using:
- Template generation: LLM-generated document chunks (150–300 words each) for 47 doc-type × section combinations, in two register tones (clasico/moderno), generated via mistralai/mistral-large-2512 through OpenRouter.
- Faker injection: each placeholder is replaced with structurally valid synthetic data using Faker (es_MX) extended with custom Mexican validators (RFC checksum algorithm, CURP structure, CLABE check digit, NSS format).
- Export: 22,770 active annotated chunks split 70/15/15 into train/val/test sets (~45 MB of JSONL).
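To make "structurally valid" concrete, here is a sketch of one such check, the CLABE control digit. This is not the pipeline's own validator, which is not yet published. It is an independent illustration of the standard CLABE scheme, and the function names are hypothetical. Each of the first 17 digits is multiplied by a repeating weight of 3, 7, 1, each product is reduced modulo 10, and the control digit is 10 minus the sum modulo 10, itself taken modulo 10.

```python
def clabe_control_digit(first17: str) -> int:
    """Compute the 18th (control) digit from the first 17 digits of a CLABE."""
    weights = (3, 7, 1)
    total = sum((int(digit) * weights[i % 3]) % 10 for i, digit in enumerate(first17))
    return (10 - total % 10) % 10


def is_valid_clabe(clabe: str) -> bool:
    """Check length, digit-only content, and the control digit."""
    if len(clabe) != 18 or not (clabe.isascii() and clabe.isdigit()):
        return False
    return clabe_control_digit(clabe[:17]) == int(clabe[17])


# The CLABE used in the Quick Start text passes the check.
print(is_valid_clabe("032180000118359719"))  # → True
# Changing the last digit breaks it.
print(is_valid_clabe("032180000118359710"))  # → False
```

A generator can reuse `clabe_control_digit` to append the correct final digit to 17 random digits, so every synthetic CLABE passes the same check a real one would.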
The full pipeline source code and DB schema will be published separately on GitHub (link forthcoming).
Base model: fastino/gliner2-base-v1
Training hardware: NVIDIA RTX 5070 Ti
Training config: full fine-tune, 10 epochs, encoder LR 1e-5, task LR 5e-4, batch size 8
Inference Notes
- Token limit: GLiNER2 has a strict 2,048 token limit (~1,500 words). For longer documents, split on natural section boundaries (e.g., "DECLARACIONES", "CLÁUSULAS") with ~100-word overlap between chunks.
- Entity descriptions matter: This model was trained with rich Spanish-language descriptions. Passing minimal or English descriptions will degrade performance significantly.
- monto contrastive rule: The model is trained to label only transactional amounts (precio, salario, capital) as monto, and to leave statutory/regulatory thresholds (e.g., UMA multiples, fine ceilings) untagged. This is intentional behavior for privacy-safe financial redaction.
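The overlap strategy from the token-limit note can be sketched as a sliding window over words. This is a simplified illustration: `chunk_words` is a hypothetical helper, it counts whitespace-separated words rather than model tokens (the ~1,500-word figure is only an approximation of 2,048 tokens), and it does not look for section headings such as "DECLARACIONES". Splitting on those first and windowing only the oversized sections would be closer to the recommendation.

```python
def chunk_words(text: str, max_words: int = 1500, overlap: int = 100) -> list[str]:
    """Split text into word windows where consecutive chunks share `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()  # note: collapses line breaks and repeated spaces
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


# A 3,000-word dummy document yields three overlapping chunks.
document = " ".join(f"w{i}" for i in range(3000))
chunks = chunk_words(document)
print(len(chunks))  # → 3
# The last 100 words of one chunk open the next one.
print(chunks[0].split()[-100:] == chunks[1].split()[:100])  # → True
```

Each chunk can then be passed to `model.extract_entities(chunk, entity_descriptions)` as in the Quick Start. Entities that fall inside an overlap region may be returned twice, so results should be deduplicated when merging.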
Limitations
- Trained exclusively on synthetic data; real-world performance on authentic notarial prose may vary, particularly for domicilio (complex multi-line addresses) and organizacion (varied razón social formats).
- Performance on document types outside the 7 covered categories is untested.
- Not evaluated on scanned PDFs or OCR output — text quality directly affects extraction.
Contributing
Issues, test results on real documents, and targeted training data contributions are welcome. The generation pipeline (templates, Faker injection, export) will be open-sourced to allow community-driven data expansion targeting the underperforming entity types.
Citation
If you use this model in your work, please cite:
@misc{gliner2-mx-legal-ner,
title = {GLiNER2 Fine-tuned for Mexican Legal NER},
author = {maik20100},
year = {2026},
url = {https://huggingface.co/maik20100/gliner2-mx-legal-ner}
}
License
Apache 2.0 — same as the base GLiNER2 model.