GLiNER2 — Mexican Legal NER (gliner2-mx-legal-ner)

A fine-tuned GLiNER2 model specialized in Named Entity Recognition for Mexican legal documents in Spanish. It extracts personal identifiers, financial data, legal references, and document numbers from notarial instruments, contracts, labor agreements, amparo suits, and criminal sentences.

Privacy by design: GLiNER2 is a 205M-parameter encoder model that runs entirely locally on CPU or GPU, so sensitive legal documents never leave your server.


Quick Start

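# Install first with: pip install gliner2
from gliner2 import GLiNER2

# Fetches the checkpoint from the Hugging Face Hub on first use; inference then runs locally.
model = GLiNER2.from_pretrained("maik20100/gliner2-mx-legal-ner")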

text = """
En la Ciudad de México, el día 15 de marzo de 2024, ante mí, licenciado Roberto Mendoza López,
Notario Público número 124 del Distrito Federal, comparece el señor Javier Ramírez González,
con RFC RAGJ850115KL4 y CURP RGGJ850115HDFRMV09, titular de la cuenta CLABE 032180000118359719.
"""

entity_descriptions = {
    "rfc": "Registro Federal de Contribuyentes, clave alfanumérica de 12 o 13 caracteres",
    "curp": "Clave Única de Registro de Población, cadena de 18 caracteres alfanuméricos",
    "nombre_persona": "Nombre completo de persona física incluyendo nombre(s) de pila y apellidos",
    "fecha": "Fecha en cualquier formato utilizado en documentos legales mexicanos",
    "cuenta_bancaria": "Número de cuenta bancaria o CLABE interbancaria (18 dígitos)",
}

# Dict keys are the labels to extract; the values are the descriptions the model conditions on.
entities = model.extract_entities(text, entity_descriptions)
print(entities)
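
Because the model is intended for local, privacy-sensitive use, a common next step is to redact the extracted values. The sketch below is not part of this model or of the gliner2 package: `redact` is a hypothetical helper, and it assumes the result is shaped like `{"entities": {label: [matched strings]}}`. Check the actual structure printed above before relying on it.

```python
from typing import Dict, List


def redact(text: str, entities: Dict[str, List[str]]) -> str:
    """Replace every extracted surface string with a [LABEL] placeholder.

    `entities` is assumed to map each label to the list of strings found
    for it (hypothetical shape; adapt to the real model output).
    """
    # Longest values first, so a short value never splits a longer one.
    pairs = sorted(
        ((value, label) for label, values in entities.items() for value in values),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    for value, label in pairs:
        if value:  # skip empty strings, which would match everywhere
            text = text.replace(value, f"[{label.upper()}]")
    return text


# Hand-written sample in the assumed output shape (not real model output).
sample = {
    "entities": {
        "rfc": ["RAGJ850115KL4"],
        "nombre_persona": ["Javier Ramírez González"],
    }
}
snippet = "comparece el señor Javier Ramírez González, con RFC RAGJ850115KL4"
print(redact(snippet, sample["entities"]))
```

Plain string replacement redacts every occurrence of a matched value, including ones the model did not flag. That is usually the safer default for PII, but it is a design choice rather than something the model guarantees.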

Supported Entity Types

Entities with Validated High Performance (F1 ≥ 0.90)

These entities are considered production-ready based on evaluation against a held-out synthetic test set.

| Entity | Description | This Model (F1) | Base Model (F1) |
|---|---|---|---|
| correo_electronico | Email address for legal notifications | 1.000 | 1.000 |
| cuenta_bancaria | Bank account number or 18-digit CLABE | 1.000 | 0.739 |
| curp | Clave Única de Registro de Población (18 chars) | 1.000 | 0.859 |
| identificacion_oficial | INE/IFE credential folio, passport number, cartilla | 1.000 | 0.008 |
| nss | Número de Seguridad Social / IMSS affiliation (11 digits) | 1.000 | 1.000 |
| numero_contrato | Contract, policy, credit, or agreement identifier | 1.000 | 0.074 |
| numero_escritura | Public instrument, escritura pública, or acta number | 1.000 | not reported |
| numero_expediente | Court case or docket number | 1.000 | 0.614 |
| numero_telefono | 10-digit mobile or landline number | 1.000 | 0.933 |
| rfc | Registro Federal de Contribuyentes (12–13 chars) | 1.000 | 0.878 |
| nacionalidad | Declared nationality | 0.987 | 0.984 |
| nombre_persona | Full name of a natural person | 0.977 | 0.928 |
| fecha | Date in any format used in Mexican legal documents | 0.904 | 0.703 |

Macro F1 over all 18 entity types: 0.802. Micro F1: 0.793. (Base model: 0.565 macro / 0.610 micro.)

Entities Targeted for Future Improvement

The following entities are extracted but performance is not yet production-ready. Contributions and targeted training data are welcome.

| Entity | Description | This Model (F1) |
|---|---|---|
| domicilio | Full physical address (street, colonia, municipality, state, postal code) | 0.062 |
| monto | Transactional monetary amount (price, salary, capital contribution) | 0.163 |
| notaria | Notary public identifier including number and jurisdiction | 0.388 |
| organizacion | Legal entity name / razón social | 0.473 |
| referencia_legal | Citation to a Mexican law, code, or article | 0.476 |
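
The macro F1 reported above is the unweighted mean of the per-entity F1 scores, so it can be reproduced directly from the two tables. The short check below uses only the values listed in this card. The micro F1 cannot be recomputed this way, because it needs per-entity true-positive, false-positive, and false-negative counts that are not published here.

```python
from statistics import mean

# Per-entity F1 for this model, copied from the two tables in this card.
f1_scores = {
    "correo_electronico": 1.000,
    "cuenta_bancaria": 1.000,
    "curp": 1.000,
    "identificacion_oficial": 1.000,
    "nss": 1.000,
    "numero_contrato": 1.000,
    "numero_escritura": 1.000,
    "numero_expediente": 1.000,
    "numero_telefono": 1.000,
    "rfc": 1.000,
    "nacionalidad": 0.987,
    "nombre_persona": 0.977,
    "fecha": 0.904,
    "domicilio": 0.062,
    "monto": 0.163,
    "notaria": 0.388,
    "organizacion": 0.473,
    "referencia_legal": 0.476,
}

# Macro F1: every entity type counts equally, regardless of frequency.
macro_f1 = mean(f1_scores.values())
print(f"{len(f1_scores)} entity types, macro F1 = {macro_f1:.3f}")
```

The five low-scoring entity types pull the macro average well below the scores in the first table, which is why the headline figure should be read together with the per-entity breakdown.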

Entity Descriptions for Inference

GLiNER2 uses natural language descriptions as semantic anchors. Always pass these at inference time for best results:

entity_descriptions = {
    # High-performance entities
    "rfc": "Registro Federal de Contribuyentes, clave alfanumérica de 12 caracteres para personas morales o 13 para personas físicas, ejemplo: GARC850101AB1",
    "curp": "Clave Única de Registro de Población, cadena de 18 caracteres alfanuméricos que identifica de forma única a un ciudadano mexicano, ejemplo: GARC850101HGTRRR09",
    "nombre_persona": "Nombre completo de persona física incluyendo nombre(s) de pila y apellidos paterno y materno, como aparece en documentos legales mexicanos",
    "fecha": "Fecha en cualquier formato utilizado en documentos legales mexicanos, incluyendo formato textual completo como 'quince de marzo del año dos mil veinticuatro'",
    "cuenta_bancaria": "Número de cuenta bancaria o CLABE interbancaria (18 dígitos) utilizada para depósitos financieros",
    "nss": "Número de Seguridad Social (NSS) o Número de Afiliación al IMSS, asignado a trabajadores en México (11 dígitos)",
    "correo_electronico": "Dirección de correo electrónico / email para oír y recibir notificaciones",
    "numero_telefono": "Número telefónico fijo o móvil, típicamente a 10 dígitos",
    "identificacion_oficial": "Identificador de documento oficial (folio de credencial INE/IFE, número de pasaporte, cartilla)",
    "numero_contrato": "Número identificador de un contrato, póliza de seguro, crédito, cuenta o convenio",
    "numero_escritura": "Número de instrumento público, escritura pública, acta o póliza pasada ante la fe pública",
    "numero_expediente": "Número de expediente, causa o juicio asignado por un juzgado o tribunal",
    "nacionalidad": "Nacionalidad declarada de la persona compareciente",
    # Future-improvement entities (include if needed)
    "domicilio": "Dirección física completa incluyendo calle, número exterior e interior, colonia, municipio o alcaldía, estado y código postal",
    "monto": "Cantidad monetaria particular acordada, pagada o reclamada por las partes (precio, salario, aportación de capital). Excluye umbrales legales o tarifas oficiales.",
    "notaria": "Identificación de notaría pública incluyendo número y jurisdicción",
    "organizacion": "Nombre de persona moral (razón social), empresa, asociación civil o dependencia gubernamental",
    "referencia_legal": "Cita a una ley, código, artículo o disposición normativa mexicana",
}
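
Several of these descriptions state a fixed structure (12 or 13 characters for RFC, 18 for CURP, 11 digits for NSS), which makes a cheap post-extraction sanity check possible. The sketch below is an illustration, not part of the model: `PATTERNS` and `is_plausible` are hypothetical names, and the regular expressions are simplified shape checks that do not verify check digits or official catalogs.

```python
import re

# Simplified structural patterns (shape only, no checksum validation).
PATTERNS = {
    # 3 letters (persona moral) or 4 (persona física) + YYMMDD + 3-char homoclave
    "rfc": re.compile(r"[A-ZÑ&]{3,4}\d{6}[A-Z0-9]{3}"),
    # 4 letters + YYMMDD + sex + 2-letter state + 3 consonants + 2 trailing chars
    "curp": re.compile(r"[A-Z]{4}\d{6}[HMX][A-Z]{5}[A-Z0-9]\d"),
    # 11 digits
    "nss": re.compile(r"\d{11}"),
}


def is_plausible(label: str, value: str) -> bool:
    """Return False only when a value clearly violates its label's known shape."""
    pattern = PATTERNS.get(label)
    if pattern is None:
        return True  # no structural rule for this label
    return pattern.fullmatch(value.strip().upper()) is not None


print(is_plausible("rfc", "GARC850101AB1"))        # example from the description above
print(is_plausible("curp", "GARC850101HGTRRR09"))  # example from the description above
print(is_plausible("rfc", "GARC8501"))             # too short to be an RFC
```

Values that fail the check are better flagged for review than silently dropped, since OCR noise or unusual formatting can break an otherwise correct extraction.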

Document Types Covered

Training data was generated across 7 Mexican legal document categories:

  • Poderes Notariales — powers of attorney
  • Actas Constitutivas — corporate formation instruments
  • Contratos de Compraventa — purchase/sale agreements
  • Juicios de Amparo — constitutional relief petitions
  • Contratos Individuales de Trabajo — individual labor contracts
  • Sentencias Penales de Primera Instancia — first-instance criminal sentences
  • Expedientes / Consultas de Cliente — client intake files

Training Data & Methodology

Training data is fully synthetic — no real PII was used at any stage.

A custom pipeline generates realistic Mexican legal Spanish text using:

  1. Template generation: LLM-generated document chunks (150–300 words each) for 47 doc-type × section combinations, each in two registers (clasico / moderno), generated via mistralai/mistral-large-2512 through OpenRouter.
  2. Faker injection: each placeholder is replaced with structurally valid synthetic data using Faker (es_MX) extended with custom Mexican validators (RFC checksum algorithm, CURP structure, CLABE check digit, NSS format).
  3. Export — 22,770 active annotated chunks split 70/15/15 into train/val/test sets (~45 MB of JSONL).
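
As an illustration of what a validator in step 2 involves, here is a minimal sketch of the standard CLABE check-digit rule: the first 17 digits are weighted 3, 7, 1 cyclically, each product is reduced modulo 10, and the check digit brings the total to a multiple of 10. This is not the pipeline's own code, which has not been published yet; the function names are hypothetical.

```python
CLABE_WEIGHTS = (3, 7, 1)


def clabe_check_digit(first17: str) -> int:
    """Compute the 18th (check) digit from the first 17 digits of a CLABE."""
    if len(first17) != 17 or not first17.isdigit():
        raise ValueError("expected exactly 17 digits")
    total = sum(
        (int(digit) * CLABE_WEIGHTS[i % 3]) % 10
        for i, digit in enumerate(first17)
    )
    return (10 - total % 10) % 10


def is_valid_clabe(clabe: str) -> bool:
    """True when the string has 18 digits and the last one is the right check digit."""
    return (
        len(clabe) == 18
        and clabe.isdigit()
        and clabe_check_digit(clabe[:17]) == int(clabe[17])
    )


# The CLABE used in the Quick Start text passes the check.
print(is_valid_clabe("032180000118359719"))
```

A generator can use `clabe_check_digit` to append a valid final digit to 17 random digits, while the same rule can screen extracted `cuenta_bancaria` values at inference time.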

The full pipeline source code and DB schema will be published separately on GitHub (link forthcoming).

Base model: fastino/gliner2-base-v1
Training hardware: NVIDIA RTX 5070 Ti
Training config: full fine-tune, 10 epochs, encoder LR 1e-5, task LR 5e-4, batch size 8


Inference Notes

  • Token limit: GLiNER2 has a strict 2,048 token limit (~1,500 words). For longer documents, split on natural section boundaries (e.g., "DECLARACIONES", "CLÁUSULAS") with ~100-word overlap between chunks.
  • Entity descriptions matter: This model was trained with rich Spanish-language descriptions. Passing minimal or English descriptions will degrade performance significantly.
  • monto contrastive rule: The model is trained to label only transactional amounts (precio, salario, capital) as monto, and to leave statutory/regulatory thresholds (e.g., UMA multiples, fine ceilings) untagged. This is intentional behavior for privacy-safe financial redaction.
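
The overlap recommended in the token-limit note can be sketched as a sliding window over words. This is a simplified fallback, assuming fixed-size windows rather than the section-boundary split suggested above; `chunk_words` is a hypothetical helper, and the 1,500-word default only approximates the 2,048-token limit because the real count depends on the tokenizer.

```python
from typing import List


def chunk_words(text: str, max_words: int = 1500, overlap: int = 100) -> List[str]:
    """Split text into windows of at most `max_words` words.

    Consecutive windows share `overlap` words, so an entity that straddles
    a boundary appears whole in at least one chunk.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()  # note: collapses whitespace and line breaks
    chunks: List[str] = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Toy document of 3,000 numbered words to show the window layout.
document = " ".join(f"w{i}" for i in range(3000))
chunks = chunk_words(document)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk can then be passed to the model separately. Entities found inside the overlapping region may be returned twice, so results should be de-duplicated when the per-chunk outputs are merged.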

Limitations

  • Trained exclusively on synthetic data; real-world performance on authentic notarial prose may vary, particularly for domicilio (complex multi-line addresses) and organizacion (varied razón social formats).
  • Performance on document types outside the 7 covered categories is untested.
  • Not evaluated on scanned PDFs or OCR output — text quality directly affects extraction.

Contributing

Issues, test results on real documents, and targeted training data contributions are welcome. The generation pipeline (templates, Faker injection, export) will be open-sourced to allow community-driven data expansion targeting the underperforming entity types.


Citation

If you use this model in your work, please cite:

@misc{gliner2-mx-legal-ner,
  title  = {GLiNER2 Fine-tuned for Mexican Legal NER},
  author = {maik20100},
  year   = {2026},
  url    = {https://huggingface.co/maik20100/gliner2-mx-legal-ner}
}

License

Apache 2.0 — same as the base GLiNER2 model.
