---
license: apache-2.0
base_model: openai/privacy-filter
library_name: peft
pipeline_tag: token-classification
language:
- en
tags:
- privacy
- pii
- token-classification
- lora
- peft
- nigeria
- privacy-filter
model-index:
- name: privacy-filter-nigeria
  results:
  - task:
      type: token-classification
      name: PII Span Detection
    dataset:
      name: stage2_v5 private mixed dataset (validation)
      type: custom
    metrics:
    - type: f1
      name: Typed Span F1
      value: 0.9763
    - type: precision
      name: Typed Span Precision
      value: 0.9707
    - type: recall
      name: Typed Span Recall
      value: 0.9820
  - task:
      type: token-classification
      name: PII Span Detection
    dataset:
      name: stage2_v5 private mixed dataset (test)
      type: custom
    metrics:
    - type: f1
      name: Typed Span F1
      value: 0.9640
    - type: precision
      name: Typed Span Precision
      value: 0.9593
    - type: recall
      name: Typed Span Recall
      value: 0.9688
  - task:
      type: token-classification
      name: Hard-Negative False Positive Audit
    dataset:
      name: stage2_v5 hard-negative challenge
      type: custom
    metrics:
    - type: false_positive_rate
      name: False-positive example rate
      value: 0.72
---

# Privacy Filter Nigeria LoRA


A LoRA adapter on top of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
for Nigerian-domain PII span detection.


- **Adapter repo:** `iamSamurai/privacy-filter-nigeria`
- **Base model:** `openai/privacy-filter`
- **Adapter type:** LoRA (PEFT) for token classification
- **Source repo:** https://github.com/iamNarcisse/naija-privacy-filter
- **Eval artifacts:** [`iamSamurai/openai-privacy-filter-naija-eval-artifacts`](https://huggingface.co/iamSamurai/openai-privacy-filter-naija-eval-artifacts)
- **License:** Apache-2.0 (this adapter). Use of the base model remains
  subject to the upstream model's terms.


## Evaluation


Latest v5 eval against the internal stage2 v5 private mixed dataset, after
deterministic span postprocessing:


| Split | Typed span F1 | Precision | Recall | TP | FP | FN |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Validation | 0.9763 | 0.9707 | 0.9820 | 762 | 23 | 14 |
| Test | 0.9640 | 0.9593 | 0.9688 | 777 | 33 | 25 |
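
The typed-span metrics are derived from the TP/FP/FN counts in the usual way; a plain-Python sketch (counts copied from the table above) reproduces the reported numbers:

```python
def span_prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Typed-span precision, recall, and F1 from raw match counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the validation and test rows above.
val = span_prf(762, 23, 14)   # rounds to (0.9707, 0.9820, 0.9763)
test = span_prf(777, 33, 25)  # rounds to (0.9593, 0.9688, 0.9640)
```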
|
|
The v5 challenge split is hard-negative-only. Typed F1 is not meaningful
because there are no gold positive spans. Use false-positive diagnostics:


| Challenge diagnostic | Value |
| --- | ---: |
| Examples | 250 |
| Examples with predictions | 180 |
| False-positive example rate | 0.72 |
| Predicted false-positive spans | 456 |
|
|
This is a recall-oriented research adapter. The hard-negative audit shows that
benign identifier-like text can be over-redacted. Precision-sensitive users
should add deterministic filters, tune thresholds where applicable, or
finetune on representative local negatives.
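
For precision-sensitive deployments, one cheap deterministic filter is a format gate on structured-identifier labels: drop predicted spans whose text cannot plausibly be that identifier. The digit lengths below are illustrative assumptions read off the examples in the label table below (11-digit NIN/BVN values, 10-digit NUBAN-style account numbers), not an official specification; verify them against authoritative sources and local policy before use.

```python
import re

# Illustrative format gates, inferred from this card's label examples.
# These are assumptions, not an official identifier specification.
FORMAT_GATES = {
    "private_nin": re.compile(r"^\d{11}$"),
    "private_bvn": re.compile(r"^\d{11}$"),
    "account_number": re.compile(r"^\d{10}$"),
}

def gate_spans(spans):
    """Drop structured-ID spans whose text fails the format gate.

    `spans` is a list of dicts with at least `label` and `text` keys,
    matching the span shape shown in the example-result table.
    """
    kept = []
    for span in spans:
        gate = FORMAT_GATES.get(span["label"])
        normalized = re.sub(r"[ -]", "", span["text"])  # tolerate spacing/dashes
        if gate is None or gate.match(normalized):
            kept.append(span)
    return kept

spans = [
    {"label": "private_nin", "text": "12345678901"},    # kept: 11 digits
    {"label": "private_nin", "text": "ORDER-2024-77"},  # dropped: not NIN-shaped
    {"label": "private_person", "text": "Amina Yusuf"}, # kept: no gate for names
]
filtered = gate_spans(spans)  # drops the second span
```

In a real pipeline this gate would run after inference and before redaction, and dropped spans should be logged so the filter remains auditable.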
|
|
## Supported Label Spans


The adapter emits these span labels. `O` is the background token label and is
not returned as a detected span.


| Label | Detects | Example |
| --- | --- | --- |
| `account_number` | Nigerian bank account/NUBAN-style account numbers when context indicates an account | `6318826391` |
| `private_address` | Street, city, state, or postal address spans tied to a person or record | `42 Unity Road, Ikeja, Lagos 100271` |
| `private_bvn` | Nigerian Bank Verification Number references and values | `22334455667` |
| `private_date` | Dates tied to a person, record, document, or event in a private workflow | `12 April 1988` |
| `private_drivers_license_number` | Nigerian driver license identifiers | `K2BHY7F6FEA0` |
| `private_email` | Email addresses | `amina.yusuf@example.ng` |
| `private_nin` | Nigerian National Identification Number references and values | `12345678901` |
| `private_passport_number` | Nigerian passport identifiers | `B05995318` |
| `private_person` | Person names and name-like references | `Amina Yusuf` |
| `private_phone` | Nigerian local and international phone-number formats | `+234 802 111 3344` |
| `private_url` | URLs tied to private records, claims, documents, or workflows | `https://claims.example/record/1234` |
| `private_voters_card_number` | Nigerian voter card identifiers | `ABCD 1234 5678 9012 345` |
| `secret` | Known-format credentials, authorization codes, session tokens, and similar secrets | `S3cure!9037Ops` |
|
|
## How To Use


> **Note on the classifier head.** This adapter ships a resized
> token-classification head for the Nigerian-domain label taxonomy
> (`label_map.json` / `token_label_names` in `adapter_config.json`). Loading
> the adapter on top of the unmodified base model with vanilla
> `peft.PeftModel.from_pretrained` will not resize the head automatically.
> Use the project runner (`privacy_filter.py`) or replicate its head-resize
> logic to get correct predictions.
|
|
### Recommended: project runner


```bash
pip install "torch>=2.8" "transformers>=4.56" "peft>=0.17" "huggingface-hub>=0.34"
git clone https://github.com/iamNarcisse/naija-privacy-filter
cd naija-privacy-filter

python main.py \
  --model-name openai/privacy-filter \
  --adapter-name iamSamurai/privacy-filter-nigeria \
  "Amina Yusuf can be reached at +234 802 111 3344."
```
|
|
Or via `uv`:


```bash
uv run python main.py \
  --model-name openai/privacy-filter \
  --adapter-name iamSamurai/privacy-filter-nigeria \
  "Amina Yusuf can be reached at +234 802 111 3344."
```
|
|
### Example result


For the adapter command above, the cleaned output should contain:


| Field | Value |
| --- | --- |
| Status | `PII detected` |
| Detected spans | `2` |
| Mode | `cleaned` |
| Adapter | `iamSamurai/privacy-filter-nigeria` |


| Label | Text | Start | End |
| --- | --- | ---: | ---: |
| `private_person` | `Amina Yusuf` | 0 | 11 |
| `private_phone` | `+234 802 111 3344` | 30 | 47 |


Confidence scores are model outputs and are not privacy, security, or
compliance guarantees.
|
|
### REST API


```bash
# Terminal 1: start the API with the adapter configured via environment.
PRIVACY_FILTER_MODEL_NAME=openai/privacy-filter \
PRIVACY_FILTER_ADAPTER_NAME=iamSamurai/privacy-filter-nigeria \
uv run uvicorn api:app --reload

# Terminal 2: query the running server.
curl -X POST http://127.0.0.1:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text":"Amina Yusuf can be reached at +234 802 111 3344.","mode":"cleaned"}'
```
|
|
### Direct `transformers + peft` (advanced)


If you want to bypass the project runner, you must resize the base model's
classification head to match `token_label_names` from the adapter's
`adapter_config.json` before applying the LoRA weights. See
[`privacy_filter.py`](https://github.com/iamNarcisse/naija-privacy-filter/blob/main/privacy_filter.py)
for the reference implementation.
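
As a rough sketch of what the head-resize step looks like, the toy `torch` module below stands in for the base model. This is illustrative only, not the `privacy_filter.py` implementation: the label subset is hypothetical, and the real taxonomy comes from `token_label_names` in `adapter_config.json`.

```python
import torch

# Hypothetical subset of the adapter taxonomy, for illustration only;
# the real list ships as `token_label_names` in adapter_config.json.
labels = ["O", "private_person", "private_phone", "private_nin"]

class ToyTokenClassifier(torch.nn.Module):
    """Stand-in for the base model: encoder plus token-classification head."""
    def __init__(self, hidden: int, num_labels: int):
        super().__init__()
        self.encoder = torch.nn.Linear(hidden, hidden)  # placeholder encoder
        self.classifier = torch.nn.Linear(hidden, num_labels)

base = ToyTokenClassifier(hidden=32, num_labels=2)  # base model's original head size

# The resize step: swap in a fresh head sized to the adapter taxonomy
# before LoRA weights are applied, keeping the id/label maps consistent.
base.classifier = torch.nn.Linear(32, len(labels))
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}

# Per-token logits now have one column per adapter label.
logits = base.classifier(base.encoder(torch.zeros(1, 5, 32)))
```

The real runner performs the equivalent swap on the base checkpoint's actual classifier module before calling `PeftModel.from_pretrained`; consult `privacy_filter.py` for the authoritative details.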
|
|
## Intended Use


Use this adapter for:


- Research and evaluation of Nigerian-domain PII detection.
- Prototyping local inference or REST API integration for privacy-filtering
  workflows.
- Studying LoRA adaptation and deterministic span postprocessing for
  token-classification models.
- Producing candidate spans for downstream review, redaction, or policy
  engines.


Do **not** use this adapter as the only control for regulatory, legal,
medical, financial, or irreversible privacy decisions.
|
|
## Training Data


The source repository includes a tiny public synthetic example bundle,
`data/examples`:


| Split | Examples |
| --- | ---: |
| Train | 5 |
| Validation | 5 |
| Test | 5 |
| Challenge | 5 |


This example bundle is for schema inspection and smoke tests only. It is not a
training or evaluation release.
|
|
The current v5 adapter was trained and evaluated against a later private
stage2 v5 mixed dataset that is not distributed with this model card. That
private mix includes synthetic examples, OCR-derived Nigerian
identity-document samples used to test document-layout and OCR behavior, and
real-world domain samples. Direct identifiers and sensitive fields were
annotated and redacted from model-use fields. Source materials and derived
artifacts remain private and are not distributed.


Supported span labels are listed in [Supported Label Spans](#supported-label-spans).


The committed public examples are **synthetic**. The private v5 mix is
broader and includes reviewed non-synthetic source material; it is not a
public corpus of real user records.
|
|
## Bias, Risks, And Limitations


This adapter is built on `openai/privacy-filter` and inherits the upstream
model's bias, risk, and limitation profile. The notes below summarize and
extend that profile for the Naija research preview. Consult the upstream
model card for the authoritative description of base-model behavior.


### Over-Reliance


This adapter, like the base model, is a redaction and data-minimization aid.
It is not an anonymization, compliance, or safety guarantee. Treating its
output as blanket anonymization risks defeating the privacy objectives the
system is deployed to support. Use it as one layer in a privacy-by-design
pipeline alongside policy controls, access controls, logging discipline, and
human review where mistakes have material impact. The model detects spans; it
does not by itself enforce retention, access control, consent, or
data-subject rights.
|
|
### Static Label Policy


The model only detects spans that match its trained label taxonomy.
Real-world privacy policies vary, and label boundaries appropriate for one
organization may not be appropriate for another. Adjusting the policy
requires further finetuning, not runtime configuration. The Naija adapter
shifts boundaries for Nigerian-domain identifiers (NIN, BVN, NUBAN account
numbers, voter card, driver license, passport, addresses, phone formats),
but does not introduce a runtime policy configuration mechanism.
|
|
### Domain And Language Coverage


Performance can drop on:


- non-English text and non-Latin scripts;
- naming patterns or identifier formats not represented in training data;
- domains outside the evaluated private mix, including unseen OCR layouts,
  noisy chat logs, code-switched multilingual text, and organization-specific
  record formats.


The private evaluation mix cannot fully represent every deployment
distribution. High typed-span F1 on the included splits should not be read as
evidence of production readiness on all real records.
|
|
### Failure Modes


Like all models, this adapter can make mistakes. Common failure modes
include:


- under-detection of uncommon personal names, regional naming conventions,
  initials, or honorific-heavy references;
- over-redaction of organizations, locations, or common nouns when local
  context is ambiguous;
- fragmented or shifted span boundaries in mixed-format text, long
  documents, or text with heavy punctuation and layout artifacts;
- structured-identifier ambiguity: a numeric string may be a NIN, BVN,
  account number, invoice number, order ID, or unrelated code, and the
  model cannot always disambiguate without surrounding context;
- missed secrets for novel credential formats, project-specific token
  patterns, or secrets split across surrounding syntax;
- over-redaction of benign high-entropy strings, hashes, placeholders, sample
  credentials, synthetic examples, dates, checksums, and routing IDs that
  resemble real secrets or identity numbers.
|
|
Deterministic span postprocessing (`span_postprocess.py` in the source repo)
reduces some boundary and known-format failures, but it is tuned for the
synthetic Naija release. Applied outside that distribution, it may itself
introduce false positives.
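
As an illustration of the kind of deterministic step involved, the sketch below merges fragmented same-label spans. It is an assumed stand-in for one common postprocessing move, not the actual `span_postprocess.py` logic:

```python
def merge_spans(spans, max_gap=1):
    """Merge same-label spans that overlap or sit within `max_gap` chars.

    `spans` is a list of dicts with `label`, `start`, and `end` keys, as
    in the example-result table. Purely illustrative; the real
    span_postprocess.py behavior may differ.
    """
    merged = []
    for span in sorted(spans, key=lambda s: (s["label"], s["start"])):
        prev = merged[-1] if merged else None
        if prev and prev["label"] == span["label"] and span["start"] <= prev["end"] + max_gap:
            prev["end"] = max(prev["end"], span["end"])  # extend the previous span
        else:
            merged.append(dict(span))  # copy so inputs are not mutated
    return merged

fragments = [
    {"label": "private_phone", "start": 30, "end": 38},
    {"label": "private_phone", "start": 39, "end": 47},  # gap of 1: merged
]
merged = merge_spans(fragments)  # one span covering 30..47
```

A step like this trades boundary precision for recall, which is why applying it outside the distribution it was tuned on can create new false positives.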
|
|
These failure modes can interact with demographic, regional, and domain
variation. Names and identifiers underrepresented in the training data, or
that follow conventions different from the dominant training distribution,
are more likely to be missed or inconsistently bounded.
|
|
### High-Risk Deployment Caution


Additional caution is warranted in medical, legal, financial, human
resources, education, and government workflows. In these settings, both
false negatives and false positives can be costly: missed spans may expose
sensitive information, while excess masking can remove material context
needed for review, auditing, or downstream decisions. Do not use this adapter
as the only control in such workflows.


### Recommendations


- Use the model as part of a privacy-by-design pipeline, not as a standalone
  anonymization claim.
- Evaluate on representative in-domain data under local policy before
  production use.
- Add deterministic filters for high-precision structured IDs where local
  policy allows it.
- Finetune further when your policy boundaries differ from the trained
  taxonomy or when hard-negative precision matters.
- Preserve human review paths for high-sensitivity workflows.
|
|
## Privacy And Safety


The public example bundle is synthetic and is not intended to contain real
personal data. Before publishing any new dataset, predictions, logs, or
eval artifacts derived from this adapter, inspect them for accidental real PII
or secrets.


Do not publish:


- raw real records;
- production prompts, logs, tickets, emails, or support transcripts
  containing personal data;
- API keys, access tokens, cookies, or credentials;
- model outputs that include unredacted real PII from private systems;
- raw private/internal prediction JSONL or configs containing absolute paths,
  private dataset IDs, or temporary directory names.
|
|
## Citation And Attribution


If you use this adapter, please cite both the base model and this repository.
Preserve the adapter repo ID, dataset version, code commit, and evaluation
artifact commit in experiment reports so results are reproducible.


```bibtex
@misc{egonu2026privacyfilternigeria,
  author = {Egonu Narcisse},
  title = {Privacy Filter Nigeria LoRA (v0.1 research preview)},
  year = {2026},
  url = {https://github.com/iamNarcisse/naija-privacy-filter},
  note = {Adapter on top of openai/privacy-filter}
}
```
|
|
## License


This adapter is released under the Apache License, Version 2.0. The base
model `openai/privacy-filter` is governed by its own license; consult the
upstream model card for terms.
|
|