ko-pii-public-v1 (real-data-augmented release)
Korean PII (개인식별정보) detector across 23 categories spanning common /
public-sector / finance / medical domains. Fine-tuned from
`openai/privacy-filter` (Apache 2.0).
What changed from the synthetic-only release: training data now combines our synthetic Korean-domain corpus with the real-world dialogic KDPII corpus (CC BY 4.0, 17 mappable categories). This pushes typed F1 on the KDPII held-out test split from 0.44 → 0.88 at the recommended `score >= 0.9` operating point.
Quick start
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
import torch

model_id = "ehd0309/ko-pii-public-v1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id, dtype=torch.bfloat16,
).to("cuda").eval()
nlp = pipeline("token-classification", model=model, tokenizer=tok,
               aggregation_strategy="simple", device=0)

# Filter low-confidence predictions for production
preds = nlp("고객 김민수 (010-1234-5678) 계좌 110-998877-665544 로 송금")
preds = [p for p in preds if p["score"] >= 0.9]
print(preds)
```
Categories (23)
| Group | Labels |
|---|---|
| Common (8) | person_name, phone_number, email, address, date_of_birth, ip_address, url, credential_secret |
| Public-sector (5) | rrn (주민등록번호), foreigner_id, drivers_license, passport_number, vehicle_plate |
| Finance (6) | bank_account, card_number, card_cvc, card_expiry, business_reg_number, corporate_number |
| Medical (4) | patient_id (MRN), health_insurance_no, medical_license, prescription_id |
Military / defense categories present in the parent project are intentionally excluded from this public release.
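For reference, the 23 categories above expand into the 47-tag BIO scheme used for token classification (one `O` tag plus a B-/I- pair per category, per the training details below). A minimal sketch; the category identifiers are taken from the table above, the variable names are illustrative:

```python
# The 23 public categories expand to the 47-tag BIO scheme:
# one "O" tag plus a B-/I- pair per category.
CATEGORIES = [
    "person_name", "phone_number", "email", "address", "date_of_birth",
    "ip_address", "url", "credential_secret",
    "rrn", "foreigner_id", "drivers_license", "passport_number", "vehicle_plate",
    "bank_account", "card_number", "card_cvc", "card_expiry",
    "business_reg_number", "corporate_number",
    "patient_id", "health_insurance_no", "medical_license", "prescription_id",
]

def build_bio_labels(categories):
    labels = ["O"]
    for cat in categories:
        labels += [f"B-{cat}", f"I-{cat}"]
    return labels

BIO_LABELS = build_bio_labels(CATEGORIES)  # 47 = 1 + 23 * 2
id2label = dict(enumerate(BIO_LABELS))     # e.g. for a model config
```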
Training data
The training set mixes two complementary sources:
| Source | Records | Style | License |
|---|---|---|---|
| Our synthetic generator | 5 736 | Form / document style ("이름: 홍길동", application forms, card payment receipts, etc.) | Apache 2.0 (this repo) |
| KDPII train (mapped to our 23-category schema) | 11 919 | Real conversational Korean dialogues with PII labels | CC BY 4.0 (Zenodo: 10.5281/zenodo.10968609) |
| Combined train total | 17 655 | | |
| Validation (split mix) | 2 934 | | |
| Test = KDPII test split (held out) | 4 891 | | |
KDPII categories outside our schema (OG_*, OGG_*, CV_*, LCP_COUNTRY,
QT_AGE/LENGTH/WEIGHT/GRADE, TM_BLOOD_TYPE, PS_ID) are dropped during
mapping; their text is retained as natural-distribution context.
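The mapping step can be sketched as follows. `KDPII_TO_OURS` lists only the mappings shown in section 2b below; the `(start, end, label)` record format is an assumption for illustration:

```python
# Map KDPII entity labels onto our schema. Spans whose label has no
# mapping are dropped; the record text itself is kept unchanged, so the
# surrounding conversation remains as natural-distribution context.
KDPII_TO_OURS = {
    "QT_PHONE": "phone_number", "QT_MOBILE": "phone_number",
    "QT_CARD_NUMBER": "card_number", "TMI_EMAIL": "email",
    "DT_BIRTH": "date_of_birth", "QT_ACCOUNT_NUMBER": "bank_account",
    "QT_DRIVER_NUMBER": "drivers_license",
    "QT_PASSPORT_NUMBER": "passport_number", "TMI_SITE": "url",
    "QT_PLATE_NUMBER": "vehicle_plate", "QT_IP": "ip_address",
    "QT_RESIDENT_NUMBER": "rrn", "LC_ADDRESS": "address",
    "PS_NAME": "person_name", "PS_NICKNAME": "person_name",
}

def map_record(text, spans):
    """spans: list of (start, end, kdpii_label) character spans."""
    mapped = [(s, e, KDPII_TO_OURS[lab])
              for s, e, lab in spans if lab in KDPII_TO_OURS]
    return text, mapped
```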
Training details
- Base model: `openai/privacy-filter` (1.5B params, MoE, BF16)
- Trainer: Hugging Face `Trainer`, BIO labels (47 = 1 + 23 × 2)
- BIO label-overlap fix applied: a token whose character range overlaps a gold span is tagged correctly even when the token starts in leading whitespace (fixes the dropped-leading-character bug seen with naive char-to-token alignment on tiktoken)
- Hyperparameters: 3 epochs, batch 16, AdamW (lr 2e-4, wd 0.01, warmup 5%, cosine schedule), seed 42
- Hardware: 1× NVIDIA GB10 (DGX Spark), bfloat16
- Training time: ~2 h 50 m
Evaluation
1. Held-out test set: KDPII test split (CC BY 4.0)
Standard seqeval BIO scoring over all 23 labels in our schema:
| Precision | Recall | F1 | Accuracy |
|---|---|---|---|
| 0.673 | 0.835 | 0.745 | 0.988 |
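Entity-level BIO scoring of this kind can be sketched self-contained: decode entities from the tag sequences, then compute set-wise precision/recall/F1. This is a simplified stand-in for the actual seqeval call, shown only to make the metric concrete:

```python
def decode_entities(tags):
    """Decode (start, end, label) entities from a BIO tag sequence."""
    ents, i = set(), 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            label, start = tags[i][2:], i
            i += 1
            while i < len(tags) and tags[i] == "I-" + label:
                i += 1
            ents.add((start, i, label))
        else:
            i += 1
    return ents

def prf1(gold_tags, pred_tags):
    """Entity-level precision / recall / F1: an entity counts as correct
    only when its boundaries and label both match exactly."""
    gold, pred = decode_entities(gold_tags), decode_entities(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```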
2. Span-level evaluation (overlap ≥ 0.5 after whitespace trim)
2a. Untyped span detection
| Slice | Recall |
|---|---|
| All KDPII spans (incl. 16 out-of-scope KDPII labels) | 0.417 |
| Mappable categories only (apples-to-apples) | 0.899 (871 / 969) |
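The span-match criterion (overlap ≥ 0.5 after whitespace trim) can be sketched as below. The source does not state whether the overlap ratio is taken over the gold span, the prediction, or their union; the gold-span length is assumed here:

```python
def trim(text, start, end):
    """Shrink a character span to exclude leading/trailing whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end

def overlap_match(pred, gold, threshold=0.5):
    """True when the predicted span covers >= threshold of the gold span
    (denominator choice is an assumption, see lead-in)."""
    (ps, pe), (gs, ge) = pred, gold
    inter = max(0, min(pe, ge) - max(ps, gs))
    return ge > gs and inter / (ge - gs) >= threshold
```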
2b. Per-label recall on mappable subset
| KDPII label → ours | Recall | Notes |
|---|---|---|
| QT_PHONE → phone_number | 1.000 | |
| QT_MOBILE → phone_number | 1.000 | |
| QT_CARD_NUMBER → card_number | 1.000 | |
| TMI_EMAIL → email | 1.000 | |
| DT_BIRTH → date_of_birth | 1.000 | |
| QT_ACCOUNT_NUMBER → bank_account | 1.000 | |
| QT_DRIVER_NUMBER → drivers_license | 1.000 | |
| QT_PASSPORT_NUMBER → passport_number | 1.000 | |
| TMI_SITE → url | 1.000 | |
| QT_PLATE_NUMBER → vehicle_plate | 1.000 | |
| QT_IP → ip_address | 1.000 | |
| QT_RESIDENT_NUMBER → rrn | 0.944 | |
| LC_ADDRESS → address | 0.832 | KDPII includes country/region names ("캐나다" Canada, "서울" Seoul) |
| PS_NAME → person_name | 0.734 | conversational Korean names |
| PS_NICKNAME → person_name | 0.660 | Korean nicknames (mapped to person_name) |
2c. Typed F1 + threshold sweep: choose your operating point
A confidence threshold trades recall for precision and dramatically reduces
false positives on plain conversational Korean text. `score >= 0.9` is
recommended for production.
| Threshold | Precision | Recall | F1 | FP rate on PII-free conversational text |
|---|---|---|---|---|
| 0.0 (default) | 0.718 | 0.892 | 0.795 | 4.6% |
| 0.7 | 0.824 | 0.877 | 0.850 | 2.4% |
| 0.9 (recommended) | 0.941 | 0.821 | 0.877 | 0.6% |
| 0.95 | 0.966 | 0.766 | 0.854 | 0.3% |
(FP rate = fraction of PII-free KDPII records that produce ≥ 1 prediction.)
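A sketch of the sweep behind the table above; the record format (model predictions plus a has-PII flag) is an assumption for illustration:

```python
def sweep(records, thresholds=(0.0, 0.7, 0.9, 0.95)):
    """records: list of (preds, has_pii), where preds is a list of
    {"score": float, ...} model outputs for one record. For each
    threshold, report the FP rate: the fraction of PII-free records
    with >= 1 prediction surviving the score filter."""
    rows = []
    pii_free = [preds for preds, has_pii in records if not has_pii]
    for thr in thresholds:
        fp = sum(1 for preds in pii_free
                 if any(p["score"] >= thr for p in preds))
        rows.append((thr, fp / len(pii_free) if pii_free else 0.0))
    return rows
```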
3. Comparison vs. previous synthetic-only release
| Metric | Synthetic-only v1 | This release (real-augmented) |
|---|---|---|
| Untyped recall on mappable categories | 0.573 | 0.899 |
| Typed F1 @ thr=0 | 0.351 | 0.795 |
| Typed F1 @ thr=0.9 | 0.441 | 0.877 |
| FP rate on PII-free text @ thr=0.9 | 4.0% | 0.6% |
| LC_ADDRESS recall | 0.089 | 0.832 |
| PS_NAME recall | 0.345 | 0.734 |
| PS_NICKNAME recall | 0.124 | 0.660 |
Limitations
- Format-deterministic categories saturate at 1.000 on KDPII conversational text (email, URL, IP, phone, card, account, RRN, license, passport, plate). In adversarial real-world inputs (obfuscated PII, OCR noise) recall will be lower; pair with regex fallback for compliance-grade redaction.
- Address is now broad. With KDPII training, `address` matches both full Korean postal addresses and short geographic terms ("캐나다" Canada, "서울 강남구" Seoul Gangnam-gu). If you need strict postal-only addresses, post-filter by length / regex.
- `person_name` covers nicknames implicitly because KDPII labels them; acceptable for most privacy use cases, but it means a chat handle like "토깽이" will be flagged as a person_name.
- No regex fallback included: for absolute recall on RRN / card / phone, combine the model with a regex OR.
- Single training seed. No multi-seed variance reported.
- No real-world labelled production data has been used; KDPII is the closest natural-distribution evaluation we have for Korean.
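The regex-OR fallback recommended above can be sketched as follows. The patterns are illustrative only: they cover common dashed formats, are not exhaustive, and do not validate the RRN checksum:

```python
import re

# Illustrative patterns only -- not exhaustive, no checksum validation.
FALLBACK_PATTERNS = {
    "rrn": re.compile(r"\b\d{6}-[1-4]\d{6}\b"),               # 주민등록번호
    "phone_number": re.compile(r"\b01[016789]-\d{3,4}-\d{4}\b"),
    "card_number": re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"),
}

def regex_fallback(text, model_spans):
    """Union model spans with regex hits the model missed.
    model_spans: list of (start, end, label) from the model."""
    spans = list(model_spans)
    for label, pat in FALLBACK_PATTERNS.items():
        for m in pat.finditer(text):
            # Keep the regex hit only if no existing span covers it.
            if not any(s <= m.start() < e for s, e, _ in spans):
                spans.append((m.start(), m.end(), label))
    return spans
```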
Risks
- False sense of safety. Treat output as a first pass, not a guarantee. Combine with regex / human review / audit logs for sensitive use.
- Address breadth. Country/region names are flagged as `address`; in legal compliance contexts you may need a tighter definition.
License
Apache 2.0. Inherits from base `openai/privacy-filter`. Training-data attribution: KDPII used under CC BY 4.0 (see Citation).
Citation
```bibtex
@misc{ko-pii-public-v1,
  title  = {ko-pii-public-v1: Korean PII Detection (real-data-augmented)},
  author = {ehd0309},
  year   = {2026},
  note   = {Fine-tuned from openai/privacy-filter on synthetic Korean data
            spanning 23 categories across common / public / finance / medical,
            augmented with KDPII (real conversational Korean PII).
            Evaluated on KDPII test split (held out).},
  url    = {https://huggingface.co/ehd0309/ko-pii-public-v1},
}
```
KDPII benchmark used in training and evaluation:
```bibtex
@misc{kdpii2024,
  title = {KDPII: A New Korean Dialogic Dataset for the Deidentification
           of Personally Identifiable Information},
  year  = {2024},
  doi   = {10.5281/zenodo.10968609},
  note  = {CC BY 4.0},
}
```