ko-pii-public-v1 (real-data-augmented release)

Korean PII (κ°œμΈμ‹λ³„μ •λ³΄) detector across 23 categories spanning common / public-sector / finance / medical domains. Fine-tuned from openai/privacy-filter (Apache 2.0).

What changed from the synthetic-only release: training data now combines our synthetic Korean-domain corpus with the real-world dialogic KDPII (CC BY 4.0, 17 mappable categories). This pushes typed F1 on the KDPII held-out test split from 0.44 β†’ 0.88 at the recommended score >= 0.9 operating point.

Quick start

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
import torch

model_id = "ehd0309/ko-pii-public-v1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id, dtype=torch.bfloat16,
)
# device=0 places the model on cuda:0; no separate .to("cuda") needed
nlp = pipeline("token-classification", model=model, tokenizer=tok,
               aggregation_strategy="simple", device=0)

# Filter low-confidence predictions for production
# "Customer Kim Minsu (010-1234-5678), transfer to account 110-998877-665544"
preds = nlp("고객 κΉ€λ―Όμˆ˜ (010-1234-5678) κ³„μ’Œ 110-998877-665544 둜 μ†‘κΈˆ")
preds = [p for p in preds if p["score"] >= 0.9]
print(preds)
```
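
With `aggregation_strategy="simple"`, each prediction carries character offsets (`start`, `end`), so redaction is a straightforward splice. The helper below is an illustrative sketch, not shipped with this repo:

```python
def redact(text, preds, threshold=0.9):
    """Replace each detected PII span with a [LABEL] placeholder.

    `preds` is token-classification pipeline output: dicts carrying
    "start", "end", "entity_group", and "score" keys.
    """
    out, cursor = [], 0
    for p in sorted(preds, key=lambda p: p["start"]):
        if p["score"] < threshold or p["start"] < cursor:
            continue  # low confidence, or overlaps an already-redacted span
        out.append(text[cursor:p["start"]])
        out.append(f"[{p['entity_group'].upper()}]")
        cursor = p["end"]
    out.append(text[cursor:])
    return "".join(out)
```

Assuming the model flags the name and phone number in the quick-start sentence, `redact(text, preds)` returns the sentence with `[PERSON_NAME]` and `[PHONE_NUMBER]` placeholders in place of the detected spans.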

Categories (23)

| Group | Labels |
|---|---|
| Common (8) | person_name, phone_number, email, address, date_of_birth, ip_address, url, credential_secret |
| Public-sector (5) | rrn (resident registration number, μ£Όλ―Όλ“±λ‘λ²ˆν˜Έ), foreigner_id, drivers_license, passport_number, vehicle_plate |
| Finance (6) | bank_account, card_number, card_cvc, card_expiry, business_reg_number, corporate_number |
| Medical (4) | patient_id (MRN), health_insurance_no, medical_license, prescription_id |

Military / defense categories present in the parent project are intentionally excluded from this public release.

Training data

The training set mixes two complementary sources:

| Source | Records | Style | License |
|---|---|---|---|
| Our synthetic generator | 5,736 | Form / document style ("이름: 홍길동" i.e. "Name: Hong Gildong"; application forms, card payment receipts, etc.) | Apache 2.0 (this repo) |
| KDPII train (mapped to our 23-category schema) | 11,919 | Real conversational Korean dialogues with PII labels | CC BY 4.0 (Zenodo: 10.5281/zenodo.10968609) |
| Combined train total | 17,655 | | |
| Validation (split mix) | 2,934 | | |
| Test = KDPII test split (held out) | 4,891 | | |

KDPII categories outside our schema (OG_*, OGG_*, CV_*, LCP_COUNTRY, QT_AGE/LENGTH/WEIGHT/GRADE, TM_BLOOD_TYPE, PS_ID) are dropped during mapping; their text is retained as natural-distribution context.
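
As an illustration, the mapping can be expressed as a plain dict. The pairs below are the ones listed in the per-label evaluation table later in this card; the dict and helper are ours, not this release's actual preprocessing code:

```python
# Sketch of the KDPII -> 23-category schema mapping; pairs taken from the
# per-label table in the Evaluation section (may not be exhaustive).
KDPII_TO_OURS = {
    "QT_PHONE": "phone_number",
    "QT_MOBILE": "phone_number",
    "QT_CARD_NUMBER": "card_number",
    "TMI_EMAIL": "email",
    "DT_BIRTH": "date_of_birth",
    "QT_ACCOUNT_NUMBER": "bank_account",
    "QT_DRIVER_NUMBER": "drivers_license",
    "QT_PASSPORT_NUMBER": "passport_number",
    "TMI_SITE": "url",
    "QT_PLATE_NUMBER": "vehicle_plate",
    "QT_IP": "ip_address",
    "QT_RESIDENT_NUMBER": "rrn",
    "LC_ADDRESS": "address",
    "PS_NAME": "person_name",
    "PS_NICKNAME": "person_name",
}

def map_spans(spans):
    """Keep mappable spans; drop the rest (their text stays as context)."""
    return [
        {**s, "label": KDPII_TO_OURS[s["label"]]}
        for s in spans
        if s["label"] in KDPII_TO_OURS
    ]
```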

Training details

  • Base model: openai/privacy-filter (1.5B params, MoE, BF16)
  • Trainer: HuggingFace Trainer, BIO labels (47 = 1 + 23 Γ— 2)
  • BIO label-overlap fix applied: a token whose character range overlaps a gold span at all is tagged as part of that span, even when the token begins with leading whitespace. This fixes the dropped-leading-character bug that naive char-to-token alignment produces on tiktoken-style tokenizers.
  • Hyperparameters: 3 epochs, batch 16, AdamW (lr 2e-4, wd 0.01, warmup 5%, cosine schedule), seed 42
  • Hardware: 1Γ— NVIDIA GB10 (DGX Spark), bfloat16
  • Training time: ~2 h 50 m
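
The label-overlap fix above can be sketched as follows, assuming character-level gold spans and per-token `offset_mapping` ranges from a fast tokenizer; function and variable names are illustrative:

```python
def bio_tags(offsets, gold_spans):
    """Assign BIO tags by character-range overlap, not exact containment.

    offsets: per-token (start, end) character ranges; tiktoken-style
             tokens often start on the leading whitespace.
    gold_spans: list of (start, end, label) character spans.
    A token belongs to a span if their ranges overlap at all, which keeps
    the first token of a span even when that token begins with a space.
    """
    tags = ["O"] * len(offsets)
    for g_start, g_end, label in gold_spans:
        began = False
        for i, (t_start, t_end) in enumerate(offsets):
            if t_start < g_end and t_end > g_start:  # ranges overlap
                tags[i] = ("I-" if began else "B-") + label
                began = True
    return tags
```

With strict containment instead of overlap, a token spanning chars (3, 7) would miss a gold span starting at char 4 and the entity's first character would be dropped.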

Evaluation

1. Held-out test set β€” KDPII test split (CC BY 4.0)

Standard seqeval BIO scoring, all 23 our-schema labels:

| Precision | Recall | F1 | Accuracy |
|---|---|---|---|
| 0.673 | 0.835 | 0.745 | 0.988 |
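
These are entity-level seqeval scores. For readers who want to sanity-check without installing seqeval, a minimal dependency-free reimplementation of strict entity-level F1 looks like this (our sketch, not seqeval itself):

```python
def entities(tags):
    """Extract (start, end, label) spans from a BIO tag sequence."""
    out, start, label = [], None, None
    for i, t in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        if label is not None and (t == "O" or t.startswith("B-") or t[2:] != label):
            out.append((start, i, label))
            start, label = None, None
        if t.startswith("B-") or (t.startswith("I-") and label is None):
            start, label = i, t[2:]
    return set(out)

def entity_f1(gold, pred):
    """Strict entity-level F1: a prediction counts only on exact span + label match."""
    g, p = entities(gold), entities(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```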

2. Span-level evaluation (overlap β‰₯ 0.5 after whitespace trim)

2a. Untyped span detection

| Slice | Recall |
|---|---|
| All KDPII spans (incl. 16 out-of-scope KDPII labels) | 0.417 |
| Mappable categories only (apples-to-apples) | 0.899 (871 / 969) |
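
A sketch of the span-matching rule used here, assuming the 0.5 overlap is measured against the trimmed gold span's length (the card does not spell out the denominator):

```python
def trim(text, start, end):
    """Shrink a character span so it excludes leading/trailing whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end

def matches(text, pred, gold, min_overlap=0.5):
    """True if pred covers >= min_overlap of the trimmed gold span.

    pred and gold are (start, end) character spans.
    """
    ps, pe = trim(text, *pred)
    gs, ge = trim(text, *gold)
    inter = max(0, min(pe, ge) - max(ps, gs))
    return ge > gs and inter / (ge - gs) >= min_overlap
```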

2b. Per-label recall on mappable subset

| KDPII label β†’ ours | Recall | Notes |
|---|---|---|
| QT_PHONE β†’ phone_number | 1.000 | |
| QT_MOBILE β†’ phone_number | 1.000 | |
| QT_CARD_NUMBER β†’ card_number | 1.000 | |
| TMI_EMAIL β†’ email | 1.000 | |
| DT_BIRTH β†’ date_of_birth | 1.000 | |
| QT_ACCOUNT_NUMBER β†’ bank_account | 1.000 | |
| QT_DRIVER_NUMBER β†’ drivers_license | 1.000 | |
| QT_PASSPORT_NUMBER β†’ passport_number | 1.000 | |
| TMI_SITE β†’ url | 1.000 | |
| QT_PLATE_NUMBER β†’ vehicle_plate | 1.000 | |
| QT_IP β†’ ip_address | 1.000 | |
| QT_RESIDENT_NUMBER β†’ rrn | 0.944 | |
| LC_ADDRESS β†’ address | 0.832 | KDPII includes country/region names ("μΊλ‚˜λ‹€" / "Canada", "μ„œμšΈ" / "Seoul") |
| PS_NAME β†’ person_name | 0.734 | conversational Korean names |
| PS_NICKNAME β†’ person_name | 0.660 | Korean nicknames (mapped to person_name) |

2c. Typed F1 + threshold sweep β€” choose your operating point

Confidence threshold trades recall for precision and dramatically reduces false positives on plain conversational Korean text. score >= 0.9 is recommended for production.

| Threshold | Precision | Recall | F1 | FP rate on PII-free conversational text |
|---|---|---|---|---|
| 0.0 (default) | 0.718 | 0.892 | 0.795 | 4.6% |
| 0.7 | 0.824 | 0.877 | 0.850 | 2.4% |
| 0.9 (recommended) | 0.941 | 0.821 | 0.877 | 0.6% |
| 0.95 | 0.966 | 0.766 | 0.854 | 0.3% |

(FP rate = fraction of PII-free KDPII records that produce β‰₯1 prediction.)
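
Concretely, that FP-rate definition can be computed like this, where `detect` stands for any callable returning pipeline-style predictions (names are illustrative, not part of this repo):

```python
def fp_rate(records, detect, threshold=0.9):
    """Fraction of PII-free records that yield >= 1 prediction at the threshold.

    records: list of texts known to contain no PII.
    detect:  callable mapping text -> list of dicts with a "score" key.
    """
    flagged = sum(
        1 for r in records
        if any(p["score"] >= threshold for p in detect(r))
    )
    return flagged / len(records)
```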

3. Comparison vs. previous synthetic-only release

| Metric | Synthetic-only v1 | This release (real-augmented) |
|---|---|---|
| Untyped recall on mappable categories | 0.573 | 0.899 |
| Typed F1 @ thr=0 | 0.351 | 0.795 |
| Typed F1 @ thr=0.9 | 0.441 | 0.877 |
| FP rate on PII-free text @ thr=0.9 | 4.0% | 0.6% |
| LC_ADDRESS recall | 0.089 | 0.832 |
| PS_NAME recall | 0.345 | 0.734 |
| PS_NICKNAME recall | 0.124 | 0.660 |

Limitations

  1. Format-deterministic categories saturate at 1.000 on KDPII conversational text (email, URL, IP, phone, card, account, RRN, license, passport, plate). In adversarial real-world inputs (obfuscated PII, OCR noise) recall will be lower; pair with regex fallback for compliance-grade redaction.
  2. Address is now broad. With KDPII training, address matches both full Korean postal addresses and short geographic terms ("μΊλ‚˜λ‹€", "μ„œμšΈ 강남ꡬ"). If you need strict postal-only addresses, post-filter by length / regex.
  3. person_name covers nicknames implicitly because KDPII labels them; acceptable for most privacy use cases but means a chat handle like "토깽이" will be flagged as a person_name.
  4. No regex fallback included β€” for absolute recall on RRN / card / phone, combine the model with a regex OR.
  5. Single training seed. No multi-seed variance reported.
  6. No real-world labelled production data has been used; KDPII is the closest natural-distribution evaluation we have for Korean.
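
A minimal sketch of such a regex OR, with deliberately rough, illustrative patterns that must be tuned before any compliance use:

```python
import re

# Illustrative patterns only: they approximate common Korean formats and
# will both over- and under-match (e.g. they require hyphenated forms).
FALLBACK_PATTERNS = {
    "rrn": re.compile(r"\b\d{6}-\d{7}\b"),                       # μ£Όλ―Όλ“±λ‘λ²ˆν˜Έ
    "phone_number": re.compile(r"\b01[016789]-\d{3,4}-\d{4}\b"), # mobile numbers
    "card_number": re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"),
}

def regex_fallback(text):
    """Model-independent pass; union its spans with the model's (regex OR)."""
    return [
        {"entity_group": label, "start": m.start(), "end": m.end(), "score": 1.0}
        for label, pat in FALLBACK_PATTERNS.items()
        for m in pat.finditer(text)
    ]
```

Taking the union of these spans with the model's thresholded predictions gives the "regex OR" described above: the model handles free-form context, the patterns guarantee recall on rigid formats.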

Risks

  • False sense of safety. Treat output as a first pass, not a guarantee. Combine with regex / human review / audit logs for sensitive use.
  • Address breadth. Country/region names are flagged as address; in legal compliance contexts you may need a tighter definition.

License

Apache 2.0. Inherits from base openai/privacy-filter. Training-data attribution: KDPII used under CC BY 4.0 (see Citation).

Citation

@misc{ko-pii-public-v1,
  title  = {ko-pii-public-v1: Korean PII Detection (real-data-augmented)},
  author = {ehd0309},
  year   = {2026},
  note   = {Fine-tuned from openai/privacy-filter on synthetic Korean data
            spanning 23 categories across common / public / finance / medical,
            augmented with KDPII (real conversational Korean PII).
            Evaluated on KDPII test split (held out).},
  url    = {https://huggingface.co/ehd0309/ko-pii-public-v1},
}

KDPII benchmark used in training and evaluation:

@misc{kdpii2024,
  title  = {KDPII: A New Korean Dialogic Dataset for the Deidentification
            of Personally Identifiable Information},
  year   = {2024},
  doi    = {10.5281/zenodo.10968609},
  note   = {CC BY 4.0},
}