---
license: apache-2.0
base_model: openai/privacy-filter
library_name: peft
pipeline_tag: token-classification
language:
- en
tags:
- privacy
- pii
- token-classification
- lora
- peft
- nigeria
- privacy-filter
model-index:
- name: privacy-filter-nigeria
results:
- task:
type: token-classification
name: PII Span Detection
dataset:
name: stage2_v5 private mixed dataset (validation)
type: custom
metrics:
- type: f1
name: Typed Span F1
value: 0.9763
- type: precision
name: Typed Span Precision
value: 0.9707
- type: recall
name: Typed Span Recall
value: 0.9820
- task:
type: token-classification
name: PII Span Detection
dataset:
name: stage2_v5 private mixed dataset (test)
type: custom
metrics:
- type: f1
name: Typed Span F1
value: 0.9640
- type: precision
name: Typed Span Precision
value: 0.9593
- type: recall
name: Typed Span Recall
value: 0.9688
- task:
type: token-classification
name: Hard-Negative False Positive Audit
dataset:
name: stage2_v5 hard-negative challenge
type: custom
metrics:
- type: false_positive_rate
name: False-positive example rate
value: 0.72
---
# Privacy Filter Nigeria LoRA
A LoRA adapter on top of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
for Nigerian-domain PII span detection.
- **Adapter repo:** `iamSamurai/privacy-filter-nigeria`
- **Base model:** `openai/privacy-filter`
- **Adapter type:** LoRA (PEFT) for token classification
- **Source repo:** https://github.com/iamNarcisse/naija-privacy-filter
- **Eval artifacts:** [`iamSamurai/openai-privacy-filter-naija-eval-artifacts`](https://huggingface.co/iamSamurai/openai-privacy-filter-naija-eval-artifacts)
- **License:** Apache-2.0 (this adapter). Use of the base model remains
subject to the upstream model's terms.
## Evaluation
Latest v5 evaluation on the internal stage2 v5 private mixed dataset, after
deterministic span postprocessing:

| Split | Typed span F1 | Precision | Recall | TP | FP | FN |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Validation | 0.9763 | 0.9707 | 0.9820 | 762 | 23 | 14 |
| Test | 0.9640 | 0.9593 | 0.9688 | 777 | 33 | 25 |
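
The reported precision, recall, and typed F1 follow directly from the TP/FP/FN counts; a quick Python check against the validation row:

```python
# Recompute the typed-span metrics from the raw counts in the table above.
def span_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Validation split: TP=762, FP=23, FN=14.
p, r, f1 = span_metrics(762, 23, 14)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
# precision=0.9707 recall=0.9820 f1=0.9763
```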
The v5 challenge split is hard-negative-only. Typed F1 is not meaningful
because there are no gold positive spans. Use false-positive diagnostics:

| Challenge diagnostic | Value |
| --- | ---: |
| Examples | 250 |
| Examples with predictions | 180 |
| False-positive example rate | 0.72 |
| Predicted false-positive spans | 456 |
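
The rate is simply the share of hard-negative examples that drew at least one prediction; note that the flagged examples average roughly 2.5 spurious spans each:

```python
# Derive the challenge diagnostics from the raw counts in the table above.
examples = 250          # hard-negative examples, none containing gold PII spans
flagged = 180           # examples with at least one predicted span
fp_spans = 456          # every predicted span on this split is a false positive

fp_example_rate = flagged / examples
spans_per_flagged = fp_spans / flagged
print(f"fp_example_rate={fp_example_rate:.2f}")              # 0.72
print(f"spans_per_flagged_example={spans_per_flagged:.2f}")  # 2.53
```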
This is a recall-oriented research adapter. The hard-negative audit shows that
benign identifier-like text can be over-redacted. Precision-sensitive users
should add deterministic filters, tune thresholds where applicable, or
finetune on representative local negatives.
## Supported Label Spans
The adapter emits these span labels. `O` is the background token label and is
not returned as a detected span.

| Label | Detects | Example |
| --- | --- | --- |
| `account_number` | Nigerian bank account/NUBAN-style account numbers when context indicates an account | `6318826391` |
| `private_address` | Street, city, state, or postal address spans tied to a person or record | `42 Unity Road, Ikeja, Lagos 100271` |
| `private_bvn` | Nigerian Bank Verification Number references and values | `22334455667` |
| `private_date` | Dates tied to a person, record, document, or event in a private workflow | `12 April 1988` |
| `private_drivers_license_number` | Nigerian driver license identifiers | `K2BHY7F6FEA0` |
| `private_email` | Email addresses | `amina.yusuf@example.ng` |
| `private_nin` | Nigerian National Identification Number references and values | `12345678901` |
| `private_passport_number` | Nigerian passport identifiers | `B05995318` |
| `private_person` | Person names and name-like references | `Amina Yusuf` |
| `private_phone` | Nigerian local and international phone-number formats | `+234 802 111 3344` |
| `private_url` | URLs tied to private records, claims, documents, or workflows | `https://claims.example/record/1234` |
| `private_voters_card_number` | Nigerian voter card identifiers | `ABCD 1234 5678 9012 345` |
| `secret` | Known-format credentials, authorization codes, session tokens, and similar secrets | `S3cure!9037Ops` |
## How To Use
> **Note on the classifier head.** This adapter ships a resized
> token-classification head for the Nigerian-domain label taxonomy
> (`label_map.json` / `token_label_names` in `adapter_config.json`). Loading
> the adapter on top of the unmodified base model with vanilla
> `peft.PeftModel.from_pretrained` will not resize the head automatically.
> Use the project runner (`privacy_filter.py`) or replicate its head-resize
> logic to get correct predictions.
### Recommended: project runner
```bash
pip install "torch>=2.8" "transformers>=4.56" "peft>=0.17" "huggingface-hub>=0.34"
git clone https://github.com/iamNarcisse/naija-privacy-filter
cd naija-privacy-filter
python main.py \
--model-name openai/privacy-filter \
--adapter-name iamSamurai/privacy-filter-nigeria \
"Amina Yusuf can be reached at +234 802 111 3344."
```
Or via `uv`:
```bash
uv run python main.py \
--model-name openai/privacy-filter \
--adapter-name iamSamurai/privacy-filter-nigeria \
"Amina Yusuf can be reached at +234 802 111 3344."
```
### Example result
For the adapter command above, the cleaned output should contain:

| Field | Value |
| --- | --- |
| Status | `PII detected` |
| Detected spans | `2` |
| Mode | `cleaned` |
| Adapter | `iamSamurai/privacy-filter-nigeria` |

| Label | Text | Start | End |
| --- | --- | ---: | ---: |
| `private_person` | `Amina Yusuf` | 0 | 11 |
| `private_phone` | `+234 802 111 3344` | 30 | 47 |
Confidence scores are model outputs and are not privacy, security, or
compliance guarantees.
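
Conceptually, `cleaned` mode replaces each detected span with its label; a minimal sketch of that replacement step (illustrative only, the runner's exact output format may differ):

```python
# Illustrative span-based redaction using the spans from the table above.
# The project's actual cleaned-mode formatting may differ.
def redact(text: str, spans: list[tuple[str, int, int]]) -> str:
    # Apply right-to-left so earlier character offsets stay valid.
    for label, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

text = "Amina Yusuf can be reached at +234 802 111 3344."
spans = [("private_person", 0, 11), ("private_phone", 30, 47)]
print(redact(text, spans))
# [private_person] can be reached at [private_phone].
```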
### REST API
```bash
PRIVACY_FILTER_MODEL_NAME=openai/privacy-filter \
PRIVACY_FILTER_ADAPTER_NAME=iamSamurai/privacy-filter-nigeria \
uv run uvicorn api:app --reload
curl -X POST http://127.0.0.1:8000/predict \
-H "Content-Type: application/json" \
-d '{"text":"Amina Yusuf can be reached at +234 802 111 3344.","mode":"cleaned"}'
```
### Direct `transformers + peft` (advanced)
If you want to bypass the project runner, you must resize the base model's
classification head to match `token_label_names` from the adapter's
`adapter_config.json` before applying the LoRA weights. See
[`privacy_filter.py`](https://github.com/iamNarcisse/naija-privacy-filter/blob/main/privacy_filter.py)
for the reference implementation.
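
The core of that head-resize step can be sketched with plain `torch`; the label list below is an illustrative stand-in, not the adapter's full taxonomy, and the real loader reads it from `adapter_config.json`:

```python
import torch.nn as nn

# Illustrative head resize: swap the base model's token-classification head
# for one sized to the adapter's label set *before* applying LoRA weights,
# so the adapter's saved head tensors line up. Values here are stand-ins.
token_label_names = ["O", "private_person", "private_phone", "private_email"]

base_head = nn.Linear(in_features=768, out_features=2)  # stand-in upstream head
new_head = nn.Linear(base_head.in_features, len(token_label_names))

# In a real script you would then assign the new head to the model, update
# num_labels in its config, and only then call PeftModel.from_pretrained.
print(new_head.weight.shape)  # torch.Size([4, 768])
```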
## Intended Use
Use this adapter for:
- Research and evaluation of Nigerian-domain PII detection.
- Prototyping local inference or REST API integration for privacy-filtering
workflows.
- Studying LoRA adaptation and deterministic span postprocessing for
token-classification models.
- Producing candidate spans for downstream review, redaction, or policy
engines.
Do **not** use this adapter as the only control for regulatory, legal,
medical, financial, or irreversible privacy decisions.
## Training Data
The source repository includes a tiny public synthetic example bundle,
`data/examples`:

| Split | Examples |
| --- | ---: |
| Train | 5 |
| Validation | 5 |
| Test | 5 |
| Challenge | 5 |
This example bundle is for schema inspection and smoke tests only. It is not a
training or evaluation release.
The current v5 adapter was trained and evaluated against a later private
stage2 v5 mixed dataset that is not distributed with this model card. That
private mix includes synthetic examples, OCR-derived Nigerian
identity-document samples used to test document-layout and OCR behavior, and
real-world domain samples. Direct identifiers and sensitive fields were
annotated and redacted from model-use fields. Source materials and derived
artifacts remain private and are not distributed.
Supported span labels are listed in [Supported Label Spans](#supported-label-spans).
The committed public examples are **synthetic**. The private v5 mix is broader
and includes reviewed non-synthetic source material with direct identifiers
and sensitive fields redacted from model-use fields. It is not a public
corpus of real user records.
## Bias, Risks, And Limitations
This adapter is built on `openai/privacy-filter` and inherits the upstream
model's bias, risk, and limitation profile. The notes below summarize and
extend that profile for the Naija research preview. Consult the upstream
model card for the authoritative description of base-model behavior.
### Over-Reliance
This adapter, like the base model, is a redaction and data-minimization aid.
It is not an anonymization, compliance, or safety guarantee. Treating its
output as a blanket anonymization guarantee risks undermining the privacy
objectives the system is deployed to support. Use it as one layer in a
privacy-by-design pipeline alongside policy controls, access controls,
logging discipline, and human review where mistakes have material impact.
The model detects spans; it does not by itself enforce retention, access
control, consent, or data-subject rights.
### Static Label Policy
The model only detects spans that match its trained label taxonomy.
Real-world privacy policies vary, and label boundaries appropriate for one
organization may not be appropriate for another. Adjusting the policy
requires further finetuning, not runtime configuration. The Naija adapter
shifts boundaries for Nigerian-domain identifiers (NIN, BVN, NUBAN account
numbers, voter card, driver license, passport, addresses, phone formats),
but does not introduce a runtime policy configuration mechanism.
### Domain And Language Coverage
Performance can drop on:
- non-English text and non-Latin scripts;
- naming patterns or identifier formats not represented in training data;
- domains outside the evaluated private mix, including unseen OCR layouts,
noisy chat logs, code-switched multilingual text, and organization-specific
record formats.
The private evaluation mix cannot fully represent every deployment
distribution. High typed-span F1 on the included splits should not be read as
evidence of production readiness on all real records.
### Failure Modes
Like all models, this adapter can make mistakes. Common failure modes
include:
- under-detection of uncommon personal names, regional naming conventions,
initials, or honorific-heavy references;
- over-redaction of organizations, locations, or common nouns when local
context is ambiguous;
- fragmented or shifted span boundaries in mixed-format text, long
documents, or text with heavy punctuation and layout artifacts;
- structured-identifier ambiguity: a numeric string may be a NIN, BVN,
account number, invoice number, order ID, or unrelated code, and the
model cannot always disambiguate without surrounding context;
- missed secrets for novel credential formats, project-specific token
patterns, or secrets split across surrounding syntax;
- over-redaction of benign high-entropy strings, hashes, placeholders, sample
credentials, synthetic examples, dates, checksums, and routing IDs that
resemble real secrets or identity numbers.
Deterministic span postprocessing (`span_postprocess.py` in the source repo)
reduces some boundary and known-format failures, but it is tuned for the
synthetic Naija release. Applied outside that distribution it may itself
introduce false positives.
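
As an illustration of this kind of deterministic postprocessing (not the repo's actual `span_postprocess.py` rules), same-label fragments separated only by whitespace can be merged into one span:

```python
# Illustrative boundary repair: merge same-label spans that are adjacent or
# separated only by whitespace. Not the repo's actual postprocessing rules.
def merge_fragments(text: str,
                    spans: list[tuple[str, int, int]]) -> list[tuple[str, int, int]]:
    spans = sorted(spans, key=lambda s: s[1])
    merged: list[tuple[str, int, int]] = []
    for label, start, end in spans:
        if merged:
            prev_label, prev_start, prev_end = merged[-1]
            gap = text[prev_end:start]
            if label == prev_label and gap.strip() == "":
                merged[-1] = (label, prev_start, end)  # extend previous span
                continue
        merged.append((label, start, end))
    return merged

text = "Call +234 802 111 3344 now"
fragments = [("private_phone", 5, 9), ("private_phone", 10, 22)]
print(merge_fragments(text, fragments))
# [('private_phone', 5, 22)]
```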
These failure modes can interact with demographic, regional, and domain
variation. Names and identifiers underrepresented in the training data, or
that follow conventions different from the dominant training distribution,
are more likely to be missed or inconsistently bounded.
### High-Risk Deployment Caution
Additional caution is warranted in medical, legal, financial, human
resources, education, and government workflows. In these settings, both
false negatives and false positives can be costly: missed spans may expose
sensitive information, while excess masking can remove material context
needed for review, auditing, or downstream decisions. Do not use this adapter
as the only control in such workflows.
### Recommendations
- Use the model as part of a privacy-by-design pipeline, not as a standalone
anonymization claim.
- Evaluate on representative in-domain data under local policy before
production use.
- Add deterministic filters for high-precision structured IDs where local
policy allows it.
- Finetune further when your policy boundaries differ from the trained
taxonomy or when hard-negative precision matters.
- Preserve human review paths for high-sensitivity workflows.
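
As one hypothetical example of a deterministic filter for a structured ID, NUBAN account numbers are exactly 10 digits, so `account_number` predictions that are not 10 digits can be dropped; real deployments should encode their own policy, not this toy rule:

```python
import re

# Hypothetical post-filter: keep `account_number` spans only if the span text
# is exactly 10 digits (the NUBAN length); other labels pass through. This is
# a toy precision filter, not the project's actual logic.
NUBAN_RE = re.compile(r"^\d{10}$")

def filter_account_numbers(text, spans):
    kept = []
    for label, start, end in spans:
        if label == "account_number" and not NUBAN_RE.match(text[start:end]):
            continue  # drop a likely false positive
        kept.append((label, start, end))
    return kept

text = "Acct 6318826391, order 12345"
spans = [("account_number", 5, 15), ("account_number", 23, 28)]
print(filter_account_numbers(text, spans))
# [('account_number', 5, 15)]
```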
## Privacy And Safety
The public example bundle is synthetic and should not intentionally contain
real personal data. Before publishing any new dataset, predictions, logs, or
eval artifacts derived from this adapter, inspect them for accidental real PII
or secrets.
Do not publish:
- raw real records;
- production prompts, logs, tickets, emails, or support transcripts
containing personal data;
- API keys, access tokens, cookies, or credentials;
- model outputs that include unredacted real PII from private systems;
- raw private/internal prediction JSONL or configs containing absolute paths,
private dataset IDs, or temporary directory names.
## Citation And Attribution
If you use this adapter, please cite both the base model and this repository.
Preserve the adapter repo ID, dataset version, code commit, and evaluation
artifact commit in experiment reports so results are reproducible.
```bibtex
@misc{egonu2026privacyfilternigeria,
author = {Egonu Narcisse},
title = {Privacy Filter Nigeria LoRA (v0.1 research preview)},
year = {2026},
url = {https://github.com/iamNarcisse/naija-privacy-filter},
note = {Adapter on top of openai/privacy-filter}
}
```
## License
This adapter is released under the Apache License, Version 2.0. The base
model `openai/privacy-filter` is governed by its own license; consult the
upstream model card for terms.