GLiNER2 — Mexican Legal NER (gliner2-mx-legal-ner)

A fine-tuned GLiNER2 model specialized in Named Entity Recognition for Mexican legal documents in Spanish. Extracts personal identifiers, financial data, legal references, and document numbers from notarial instruments, contracts, labor agreements, amparo suits, and criminal sentences.

Privacy by design: GLiNER2 is a 205M-parameter encoder model that runs entirely locally on CPU or GPU — sensitive legal documents never leave your server.

Quick Start

from gliner2 import GLiNER2

model = GLiNER2.from_pretrained("maik20100/gliner2-mx-legal-ner")

text = """
En la Ciudad de México, el día 15 de marzo de 2024, ante mí, licenciado Roberto Mendoza López,
Notario Público número 124 del Distrito Federal, comparece el señor Javier Ramírez González,
con RFC RAGJ850115KL4 y CURP RGGJ850115HDFRMV09, titular de la cuenta CLABE 032180000118359719.
"""

entity_descriptions = {
    "rfc": "Registro Federal de Contribuyentes, clave alfanumérica de 12 o 13 caracteres",
    "curp": "Clave Única de Registro de Población, cadena de 18 caracteres alfanuméricos",
    "nombre_persona": "Nombre completo de persona física incluyendo nombre(s) de pila y apellidos",
    "fecha": "Fecha en cualquier formato utilizado en documentos legales mexicanos",
    "cuenta_bancaria": "Número de cuenta bancaria o CLABE interbancaria (18 dígitos)",
}

# Pass the label -> description mapping; the descriptions act as semantic anchors.
entities = model.extract_entities(text, entity_descriptions)
print(entities)
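For the redaction use case, extracted spans can be masked in the source text. The following is a minimal sketch, not part of the model's API: the `redact` helper is hypothetical, and it assumes the spans have already been collected into a plain mapping of label to list of strings (adapt this to whatever structure `extract_entities` returns in your installed version). The sample spans are hard-coded for illustration and are not actual model output.

```python
def redact(text: str, spans_by_label: dict[str, list[str]]) -> str:
    """Replace every extracted span with a [LABEL] placeholder (hypothetical helper)."""
    pairs = [
        (span, label)
        for label, spans in spans_by_label.items()
        for span in spans
        if span  # skip empty strings, which str.replace would match everywhere
    ]
    # Replace longer spans first so a short span never clips a longer one.
    pairs.sort(key=lambda pair: len(pair[0]), reverse=True)
    for span, label in pairs:
        text = text.replace(span, f"[{label.upper()}]")
    return text


sample_text = "comparece el señor Javier Ramírez González, con RFC RAGJ850115KL4"
# Hard-coded illustration; in practice these spans would come from the model.
sample_spans = {
    "nombre_persona": ["Javier Ramírez González"],
    "rfc": ["RAGJ850115KL4"],
}
redacted = redact(sample_text, sample_spans)
print(redacted)  # comparece el señor [NOMBRE_PERSONA], con RFC [RFC]
```

Plain string replacement masks every occurrence of a span, which is usually what a redaction pass wants; if only specific occurrences should be masked, character offsets would be needed instead.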

Supported Entity Types

Entities with Validated High Performance (F1 ≥ 0.90)

These entities are considered production-ready based on evaluation against the held-out test split, which is synthetic like the rest of the data.

Entity	Description	This Model (F1)	Base Model (F1)
`correo_electronico`	Email address for legal notifications	1.000	1.000
`cuenta_bancaria`	Bank account number or 18-digit CLABE	1.000	0.739
`curp`	Clave Única de Registro de Población (18 chars)	1.000	0.859
`identificacion_oficial`	INE/IFE credential folio, passport number, cartilla	1.000	0.008
`nss`	Número de Seguridad Social / IMSS affiliation (11 digits)	1.000	1.000
`numero_contrato`	Contract, policy, credit, or agreement identifier	1.000	0.074
`numero_escritura`	Public instrument, escritura pública, or acta number	1.000	—
`numero_expediente`	Court case or docket number	1.000	0.614
`numero_telefono`	10-digit mobile or landline number	1.000	0.933
`rfc`	Registro Federal de Contribuyentes (12–13 chars)	1.000	0.878
`nacionalidad`	Declared nationality	0.987	0.984
`nombre_persona`	Full name of a natural person	0.977	0.928
`fecha`	Date in any format used in Mexican legal documents	0.904	0.703

Macro F1 (all 18 entities): 0.802. Micro F1: 0.793. Base model: 0.565 macro / 0.610 micro.

Entities Targeted for Future Improvement

The following entities are extracted but performance is not yet production-ready. Contributions and targeted training data are welcome.

Entity	Description	This Model (F1)
`domicilio`	Full physical address (street, colonia, municipality, state, postal code)	0.062
`monto`	Transactional monetary amount (price, salary, capital contribution)	0.163
`notaria`	Notary public identifier including number and jurisdiction	0.388
`organizacion`	Legal entity name / razón social	0.473
`referencia_legal`	Citation to a Mexican law, code, or article	0.476
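
As a sanity check, the macro F1 reported above is consistent with the unweighted mean of the 18 per-entity scores in the two tables. The sketch below recomputes it from the table values. Micro F1 cannot be reproduced this way because it depends on per-entity span counts, which are not listed here.

```python
from statistics import mean

# Per-entity F1 scores copied from the two tables above (this model).
high_performance = {
    "correo_electronico": 1.000, "cuenta_bancaria": 1.000, "curp": 1.000,
    "identificacion_oficial": 1.000, "nss": 1.000, "numero_contrato": 1.000,
    "numero_escritura": 1.000, "numero_expediente": 1.000,
    "numero_telefono": 1.000, "rfc": 1.000,
    "nacionalidad": 0.987, "nombre_persona": 0.977, "fecha": 0.904,
}
future_improvement = {
    "domicilio": 0.062, "monto": 0.163, "notaria": 0.388,
    "organizacion": 0.473, "referencia_legal": 0.476,
}

all_scores = {**high_performance, **future_improvement}

# Macro F1 weights every entity type equally, regardless of how often it occurs.
macro_f1 = mean(list(all_scores.values()))
print(len(all_scores), round(macro_f1, 3))  # 18 0.802
```

Because every entity type counts equally, the five low-scoring types pull the macro average well below the scores of the thirteen validated ones.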

Entity Descriptions for Inference

GLiNER2 uses natural language descriptions as semantic anchors. Always pass these at inference time for best results:

entity_descriptions = {
    # High-performance entities
    "rfc": "Registro Federal de Contribuyentes, clave alfanumérica de 12 caracteres para personas morales o 13 para personas físicas, ejemplo: GARC850101AB1",
    "curp": "Clave Única de Registro de Población, cadena de 18 caracteres alfanuméricos que identifica de forma única a un ciudadano mexicano, ejemplo: GARC850101HGTRRR09",
    "nombre_persona": "Nombre completo de persona física incluyendo nombre(s) de pila y apellidos paterno y materno, como aparece en documentos legales mexicanos",
    "fecha": "Fecha en cualquier formato utilizado en documentos legales mexicanos, incluyendo formato textual completo como 'quince de marzo del año dos mil veinticuatro'",
    "cuenta_bancaria": "Número de cuenta bancaria o CLABE interbancaria (18 dígitos) utilizada para depósitos financieros",
    "nss": "Número de Seguridad Social (NSS) o Número de Afiliación al IMSS, asignado a trabajadores en México (11 dígitos)",
    "correo_electronico": "Dirección de correo electrónico / email para oír y recibir notificaciones",
    "numero_telefono": "Número telefónico fijo o móvil, típicamente a 10 dígitos",
    "identificacion_oficial": "Identificador de documento oficial (folio de credencial INE/IFE, número de pasaporte, cartilla)",
    "numero_contrato": "Número identificador de un contrato, póliza de seguro, crédito, cuenta o convenio",
    "numero_escritura": "Número de instrumento público, escritura pública, acta o póliza pasada ante la fe pública",
    "numero_expediente": "Número de expediente, causa o juicio asignado por un juzgado o tribunal",
    "nacionalidad": "Nacionalidad declarada de la persona compareciente",
    # Future-improvement entities (include if needed)
    "domicilio": "Dirección física completa incluyendo calle, número exterior e interior, colonia, municipio o alcaldía, estado y código postal",
    "monto": "Cantidad monetaria particular acordada, pagada o reclamada por las partes (precio, salario, aportación de capital). Excluye umbrales legales o tarifas oficiales.",
    "notaria": "Identificación de notaría pública incluyendo número y jurisdicción",
    "organizacion": "Nombre de persona moral (razón social), empresa, asociación civil o dependencia gubernamental",
    "referencia_legal": "Cita a una ley, código, artículo o disposición normativa mexicana",
}
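
Several of these descriptions state a fixed format (RFC of 12 or 13 characters, CURP of 18, NSS of 11 digits), which can double as a cheap structural check on extracted values. The sketch below illustrates that idea under stated assumptions: the patterns and the `looks_well_formed` helper are hypothetical, they check shape only, and they do not verify check digits or that an identifier actually exists.

```python
import re

# Shape-only patterns (assumed for illustration, deliberately lenient).
FORMAT_PATTERNS = {
    # 3 letters (persona moral) or 4 (persona física) + 6-digit date + 3-char homoclave
    "rfc": re.compile(r"[A-ZÑ&]{3,4}\d{6}[A-Z0-9]{3}"),
    # 4 letters + 6-digit date + sex (H/M) + 2-letter state + 3 letters + 2 trailing chars
    "curp": re.compile(r"[A-Z]{4}\d{6}[HM][A-Z]{2}[A-Z]{3}[A-Z0-9]\d"),
    # 11 digits
    "nss": re.compile(r"\d{11}"),
}


def looks_well_formed(label: str, value: str) -> bool:
    """Return False only when a known format rule exists and the value breaks it."""
    pattern = FORMAT_PATTERNS.get(label)
    if pattern is None:
        return True  # no structural rule for this label
    return pattern.fullmatch(value.strip().upper()) is not None


print(looks_well_formed("rfc", "GARC850101AB1"))        # True
print(looks_well_formed("curp", "GARC850101HGTRRR09"))  # True
print(looks_well_formed("nss", "1234"))                 # False
```

A failed check is best treated as a flag for review rather than a reason to silently drop the span, since real documents contain typos that should still be redacted.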

Document Types Covered

Training data was generated across 7 Mexican legal document categories:

Poderes Notariales — powers of attorney
Actas Constitutivas — corporate formation instruments
Contratos de Compraventa — purchase/sale agreements
Juicios de Amparo — constitutional relief petitions
Contratos Individuales de Trabajo — individual labor contracts
Sentencias Penales de Primera Instancia — first-instance criminal sentences
Expedientes / Consultas de Cliente — client intake files

Training Data & Methodology

Training data is fully synthetic — no real PII was used at any stage.

A custom pipeline generates realistic Mexican legal Spanish text in three stages:

Template generation: LLM-generated document chunks (150–300 words each) covering 47 doc-type × section combinations in two register tones (clasico / moderno), produced with mistralai/mistral-large-2512 through OpenRouter.
Faker injection: each placeholder is replaced with structurally valid synthetic data from Faker (es_MX), extended with custom Mexican validators (RFC checksum algorithm, CURP structure, CLABE check digit, NSS format).
Export: 22,770 active annotated chunks, split 70/15/15 into train/val/test sets (~45 MB of JSONL).
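
The pipeline's validators are not yet published, so the following is only a sketch of what a CLABE validator could look like, assuming it refers to the standard CLABE control digit: weights 3, 7, 1 repeated over the first 17 digits, each product taken modulo 10. The function names are hypothetical. The CLABE from the Quick Start example passes this check.

```python
def clabe_check_digit(first17: str) -> int:
    """Compute the control digit for the first 17 digits of a CLABE."""
    if len(first17) != 17 or not (first17.isascii() and first17.isdigit()):
        raise ValueError("expected exactly 17 ASCII digits")
    weights = (3, 7, 1)
    # Multiply each digit by its weight, keep only the last digit of each product.
    total = sum((int(d) * weights[i % 3]) % 10 for i, d in enumerate(first17))
    return (10 - total % 10) % 10


def is_valid_clabe(clabe: str) -> bool:
    """True when the string is 18 digits and its last digit matches the control digit."""
    if len(clabe) != 18 or not (clabe.isascii() and clabe.isdigit()):
        return False
    return clabe_check_digit(clabe[:17]) == int(clabe[17])


print(is_valid_clabe("032180000118359719"))  # True
print(is_valid_clabe("032180000118359710"))  # False
```

Generating synthetic CLABEs is the reverse of the same step: draw 17 digits, then append `clabe_check_digit(...)`.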

The full pipeline source code and DB schema will be published separately on GitHub (link forthcoming).

Base model: fastino/gliner2-base-v1
Training hardware: NVIDIA RTX 5070 Ti
Training config: full fine-tune, 10 epochs, encoder LR 1e-5, task LR 5e-4, batch size 8

Inference Notes

Token limit: GLiNER2 has a strict 2,048-token limit (roughly 1,500 words). For longer documents, split on natural section boundaries (e.g., "DECLARACIONES", "CLÁUSULAS") and keep an overlap of about 100 words between consecutive chunks.
Entity descriptions matter: This model was trained with rich Spanish-language descriptions. Passing minimal or English descriptions will degrade performance significantly.
monto contrastive rule: The model is trained to label only transactional amounts (precio, salario, capital) as monto, and to leave statutory/regulatory thresholds (e.g., UMA multiples, fine ceilings) untagged. This is intentional behavior for privacy-safe financial redaction.
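
The overlap strategy from the token-limit note can be sketched as a sliding window over words. This is a simplified illustration, not the author's preprocessing: the `chunk_words` helper is hypothetical, it uses word count as a rough proxy for tokens, and it does not detect section headings such as "DECLARACIONES". The default of 1,200 words is an assumed safety margin below the roughly 1,500-word limit.

```python
def chunk_words(text: str, max_words: int = 1200, overlap: int = 100) -> list[str]:
    """Split text into word windows where consecutive windows share `overlap` words."""
    if max_words <= 0 or overlap < 0 or overlap >= max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    if not words:
        return []
    chunks = []
    step = max_words - overlap
    start = 0
    while True:
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window reached the end of the document
        start += step
    return chunks


# Small demonstration with tiny windows so the overlap is easy to see.
demo_text = " ".join(f"w{i}" for i in range(25))
demo_chunks = chunk_words(demo_text, max_words=10, overlap=2)
print(len(demo_chunks))               # 3
print(demo_chunks[0].split()[-2:])    # ['w8', 'w9']
print(demo_chunks[1].split()[:2])     # ['w8', 'w9']
```

Entities that fall inside an overlap region can be extracted twice, once per chunk, so results from consecutive chunks should be deduplicated before use.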

Limitations

Trained exclusively on synthetic data; real-world performance on authentic notarial prose may vary, particularly for domicilio (complex multi-line addresses) and organizacion (varied razón social formats).
Performance on document types outside the 7 covered categories is untested.
Not evaluated on scanned PDFs or OCR output — text quality directly affects extraction.

Contributing

Issues, test results on real documents, and targeted training data contributions are welcome. The generation pipeline (templates, Faker injection, export) will be open-sourced to allow community-driven data expansion targeting the underperforming entity types.

Citation

If you use this model in your work, please cite:

@misc{gliner2-mx-legal-ner,
  title  = {GLiNER2 Fine-tuned for Mexican Legal NER},
  author = {maik20100},
  year   = {2026},
  url    = {https://huggingface.co/maik20100/gliner2-mx-legal-ner}
}

License

Apache 2.0 — same as the base GLiNER2 model.
