ko-pii-public-v1 (real-data-augmented release)

Korean PII (κ°œμΈμ‹λ³„μ •λ³΄) detector across 23 categories spanning common / public-sector / finance / medical domains. Fine-tuned from openai/privacy-filter (Apache 2.0).

What changed from the synthetic-only release: training data now combines our synthetic Korean-domain corpus with the real-world dialogic KDPII (CC BY 4.0, 17 mappable categories). This pushes typed F1 on the KDPII held-out test split from 0.44 β†’ 0.88 at the recommended score >= 0.9 operating point.

Quick start

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
import torch

model_id = "ehd0309/ko-pii-public-v1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id, dtype=torch.bfloat16,
)
# device=0 places the model on cuda:0; no separate .to("cuda") needed
nlp = pipeline("token-classification", model=model, tokenizer=tok,
               aggregation_strategy="simple", device=0)

# Filter low-confidence predictions for production
# "Customer Kim Minsu (010-1234-5678), transfer to account 110-998877-665544"
preds = nlp("고객 κΉ€λ―Όμˆ˜ (010-1234-5678) κ³„μ’Œ 110-998877-665544 둜 μ†‘κΈˆ")
preds = [p for p in preds if p["score"] >= 0.9]
print(preds)
```
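
With `aggregation_strategy="simple"`, each prediction carries character offsets (`start`, `end`), so redaction is a straightforward splice. The helper below is an illustrative sketch, not shipped with this repo:

```python
def redact(text, preds, threshold=0.9):
    """Replace each detected PII span with a [LABEL] placeholder.

    `preds` is token-classification pipeline output: dicts carrying
    "start", "end", "entity_group", and "score" keys.
    """
    out, cursor = [], 0
    for p in sorted(preds, key=lambda p: p["start"]):
        if p["score"] < threshold or p["start"] < cursor:
            continue  # low confidence, or overlaps an already-redacted span
        out.append(text[cursor:p["start"]])
        out.append(f"[{p['entity_group'].upper()}]")
        cursor = p["end"]
    out.append(text[cursor:])
    return "".join(out)
```

Assuming the model flags the name and phone number in the quick-start sentence, `redact(text, preds)` returns the sentence with `[PERSON_NAME]` and `[PHONE_NUMBER]` placeholders in place of the detected spans.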

Categories (23)

| Group | Labels |
|---|---|
| Common (8) | person_name, phone_number, email, address, date_of_birth, ip_address, url, credential_secret |
| Public-sector (5) | rrn (resident registration number, μ£Όλ―Όλ“±λ‘λ²ˆν˜Έ), foreigner_id, drivers_license, passport_number, vehicle_plate |
| Finance (6) | bank_account, card_number, card_cvc, card_expiry, business_reg_number, corporate_number |
| Medical (4) | patient_id (MRN), health_insurance_no, medical_license, prescription_id |

Military / defense categories present in the parent project are intentionally excluded from this public release.

Training data

The training set mixes two complementary sources:

| Source | Records | Style | License |
|---|---|---|---|
| Our synthetic generator | 5,736 | Form / document style ("이름: 홍길동" i.e. "Name: Hong Gildong"; application forms, card payment receipts, etc.) | Apache 2.0 (this repo) |
| KDPII train (mapped to our 23-category schema) | 11,919 | Real conversational Korean dialogues with PII labels | CC BY 4.0 (Zenodo: 10.5281/zenodo.10968609) |
| Combined train total | 17,655 | | |
| Validation (split mix) | 2,934 | | |
| Test = KDPII test split (held out) | 4,891 | | |

KDPII categories outside our schema (OG_*, OGG_*, CV_*, LCP_COUNTRY, QT_AGE/LENGTH/WEIGHT/GRADE, TM_BLOOD_TYPE, PS_ID) are dropped during mapping; their text is retained as natural-distribution context.
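
As an illustration, the mapping can be expressed as a plain dict. The pairs below are the ones listed in the per-label evaluation table later in this card; the dict and helper are ours, not this release's actual preprocessing code:

```python
# Sketch of the KDPII -> 23-category schema mapping; pairs taken from the
# per-label table in the Evaluation section (may not be exhaustive).
KDPII_TO_OURS = {
    "QT_PHONE": "phone_number",
    "QT_MOBILE": "phone_number",
    "QT_CARD_NUMBER": "card_number",
    "TMI_EMAIL": "email",
    "DT_BIRTH": "date_of_birth",
    "QT_ACCOUNT_NUMBER": "bank_account",
    "QT_DRIVER_NUMBER": "drivers_license",
    "QT_PASSPORT_NUMBER": "passport_number",
    "TMI_SITE": "url",
    "QT_PLATE_NUMBER": "vehicle_plate",
    "QT_IP": "ip_address",
    "QT_RESIDENT_NUMBER": "rrn",
    "LC_ADDRESS": "address",
    "PS_NAME": "person_name",
    "PS_NICKNAME": "person_name",
}

def map_spans(spans):
    """Keep mappable spans; drop the rest (their text stays as context)."""
    return [
        {**s, "label": KDPII_TO_OURS[s["label"]]}
        for s in spans
        if s["label"] in KDPII_TO_OURS
    ]
```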

Training details

  • Base model: openai/privacy-filter (1.5B params, MoE, BF16)
  • Trainer: HuggingFace Trainer, BIO labels (47 = 1 + 23 Γ— 2)
  • BIO label-overlap fix applied: a token whose character range overlaps a gold span at all is tagged as part of that span, even when the token begins with leading whitespace. This fixes the dropped-leading-character bug that naive char-to-token alignment produces on tiktoken-style tokenizers.
  • Hyperparameters: 3 epochs, batch 16, AdamW (lr 2e-4, wd 0.01, warmup 5%, cosine schedule), seed 42
  • Hardware: 1Γ— NVIDIA GB10 (DGX Spark), bfloat16
  • Training time: ~2 h 50 m
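
The label-overlap fix above can be sketched as follows, assuming character-level gold spans and per-token `offset_mapping` ranges from a fast tokenizer; function and variable names are illustrative:

```python
def bio_tags(offsets, gold_spans):
    """Assign BIO tags by character-range overlap, not exact containment.

    offsets: per-token (start, end) character ranges; tiktoken-style
             tokens often start on the leading whitespace.
    gold_spans: list of (start, end, label) character spans.
    A token belongs to a span if their ranges overlap at all, which keeps
    the first token of a span even when that token begins with a space.
    """
    tags = ["O"] * len(offsets)
    for g_start, g_end, label in gold_spans:
        began = False
        for i, (t_start, t_end) in enumerate(offsets):
            if t_start < g_end and t_end > g_start:  # ranges overlap
                tags[i] = ("I-" if began else "B-") + label
                began = True
    return tags
```

With strict containment instead of overlap, a token spanning chars (3, 7) would miss a gold span starting at char 4 and the entity's first character would be dropped.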

Evaluation

1. Held-out test set β€” KDPII test split (CC BY 4.0)

Standard seqeval BIO scoring, all 23 our-schema labels:

| Precision | Recall | F1 | Accuracy |
|---|---|---|---|
| 0.673 | 0.835 | 0.745 | 0.988 |
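
These are entity-level seqeval scores. For readers who want to sanity-check without installing seqeval, a minimal dependency-free reimplementation of strict entity-level F1 looks like this (our sketch, not seqeval itself):

```python
def entities(tags):
    """Extract (start, end, label) spans from a BIO tag sequence."""
    out, start, label = [], None, None
    for i, t in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        if label is not None and (t == "O" or t.startswith("B-") or t[2:] != label):
            out.append((start, i, label))
            start, label = None, None
        if t.startswith("B-") or (t.startswith("I-") and label is None):
            start, label = i, t[2:]
    return set(out)

def entity_f1(gold, pred):
    """Strict entity-level F1: a prediction counts only on exact span + label match."""
    g, p = entities(gold), entities(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```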

2. Span-level evaluation (overlap β‰₯ 0.5 after whitespace trim)

2a. Untyped span detection

| Slice | Recall |
|---|---|
| All KDPII spans (incl. 16 out-of-scope KDPII labels) | 0.417 |
| Mappable categories only (apples-to-apples) | 0.899 (871 / 969) |
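
A sketch of the span-matching rule used here, assuming the 0.5 overlap is measured against the trimmed gold span's length (the card does not spell out the denominator):

```python
def trim(text, start, end):
    """Shrink a character span so it excludes leading/trailing whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end

def matches(text, pred, gold, min_overlap=0.5):
    """True if pred covers >= min_overlap of the trimmed gold span.

    pred and gold are (start, end) character spans.
    """
    ps, pe = trim(text, *pred)
    gs, ge = trim(text, *gold)
    inter = max(0, min(pe, ge) - max(ps, gs))
    return ge > gs and inter / (ge - gs) >= min_overlap
```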

2b. Per-label recall on mappable subset

| KDPII label β†’ ours | Recall | Notes |
|---|---|---|
| QT_PHONE β†’ phone_number | 1.000 | |
| QT_MOBILE β†’ phone_number | 1.000 | |
| QT_CARD_NUMBER β†’ card_number | 1.000 | |
| TMI_EMAIL β†’ email | 1.000 | |
| DT_BIRTH β†’ date_of_birth | 1.000 | |
| QT_ACCOUNT_NUMBER β†’ bank_account | 1.000 | |
| QT_DRIVER_NUMBER β†’ drivers_license | 1.000 | |
| QT_PASSPORT_NUMBER β†’ passport_number | 1.000 | |
| TMI_SITE β†’ url | 1.000 | |
| QT_PLATE_NUMBER β†’ vehicle_plate | 1.000 | |
| QT_IP β†’ ip_address | 1.000 | |
| QT_RESIDENT_NUMBER β†’ rrn | 0.944 | |
| LC_ADDRESS β†’ address | 0.832 | KDPII includes country/region names ("μΊλ‚˜λ‹€" / "Canada", "μ„œμšΈ" / "Seoul") |
| PS_NAME β†’ person_name | 0.734 | conversational Korean names |
| PS_NICKNAME β†’ person_name | 0.660 | Korean nicknames (mapped to person_name) |

2c. Typed F1 + threshold sweep β€” choose your operating point

Confidence threshold trades recall for precision and dramatically reduces false positives on plain conversational Korean text. score >= 0.9 is recommended for production.

| Threshold | Precision | Recall | F1 | FP rate on PII-free conversational text |
|---|---|---|---|---|
| 0.0 (default) | 0.718 | 0.892 | 0.795 | 4.6% |
| 0.7 | 0.824 | 0.877 | 0.850 | 2.4% |
| 0.9 (recommended) | 0.941 | 0.821 | 0.877 | 0.6% |
| 0.95 | 0.966 | 0.766 | 0.854 | 0.3% |

(FP rate = fraction of PII-free KDPII records that produce β‰₯1 prediction.)
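
Concretely, that FP-rate definition can be computed like this, where `detect` stands for any callable returning pipeline-style predictions (names are illustrative, not part of this repo):

```python
def fp_rate(records, detect, threshold=0.9):
    """Fraction of PII-free records that yield >= 1 prediction at the threshold.

    records: list of texts known to contain no PII.
    detect:  callable mapping text -> list of dicts with a "score" key.
    """
    flagged = sum(
        1 for r in records
        if any(p["score"] >= threshold for p in detect(r))
    )
    return flagged / len(records)
```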

3. Comparison vs. previous synthetic-only release

| Metric | Synthetic-only v1 | This release (real-augmented) |
|---|---|---|
| Untyped recall on mappable categories | 0.573 | 0.899 |
| Typed F1 @ thr=0 | 0.351 | 0.795 |
| Typed F1 @ thr=0.9 | 0.441 | 0.877 |
| FP rate on PII-free text @ thr=0.9 | 4.0% | 0.6% |
| LC_ADDRESS recall | 0.089 | 0.832 |
| PS_NAME recall | 0.345 | 0.734 |
| PS_NICKNAME recall | 0.124 | 0.660 |

Limitations

  1. Format-deterministic categories saturate at 1.000 on KDPII conversational text (email, URL, IP, phone, card, account, RRN, license, passport, plate). In adversarial real-world inputs (obfuscated PII, OCR noise) recall will be lower; pair with regex fallback for compliance-grade redaction.
  2. Address is now broad. With KDPII training, address matches both full Korean postal addresses and short geographic terms ("μΊλ‚˜λ‹€", "μ„œμšΈ 강남ꡬ"). If you need strict postal-only addresses, post-filter by length / regex.
  3. person_name covers nicknames implicitly because KDPII labels them; acceptable for most privacy use cases but means a chat handle like "토깽이" will be flagged as a person_name.
  4. No regex fallback included β€” for absolute recall on RRN / card / phone, combine the model with a regex OR.
  5. Single training seed. No multi-seed variance reported.
  6. No real-world labelled production data has been used; KDPII is the closest natural-distribution evaluation we have for Korean.
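
A minimal sketch of such a regex OR, with deliberately rough, illustrative patterns that must be tuned before any compliance use:

```python
import re

# Illustrative patterns only: they approximate common Korean formats and
# will both over- and under-match (e.g. they require hyphenated forms).
FALLBACK_PATTERNS = {
    "rrn": re.compile(r"\b\d{6}-\d{7}\b"),                       # μ£Όλ―Όλ“±λ‘λ²ˆν˜Έ
    "phone_number": re.compile(r"\b01[016789]-\d{3,4}-\d{4}\b"), # mobile numbers
    "card_number": re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"),
}

def regex_fallback(text):
    """Model-independent pass; union its spans with the model's (regex OR)."""
    return [
        {"entity_group": label, "start": m.start(), "end": m.end(), "score": 1.0}
        for label, pat in FALLBACK_PATTERNS.items()
        for m in pat.finditer(text)
    ]
```

Taking the union of these spans with the model's thresholded predictions gives the "regex OR" described above: the model handles free-form context, the patterns guarantee recall on rigid formats.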

Risks

  • False sense of safety. Treat output as a first pass, not a guarantee. Combine with regex / human review / audit logs for sensitive use.
  • Address breadth. Country/region names are flagged as address; in legal compliance contexts you may need a tighter definition.

License

Apache 2.0. Inherits from base openai/privacy-filter. Training-data attribution: KDPII used under CC BY 4.0 (see Citation).

Citation

@misc{ko-pii-public-v1,
  title  = {ko-pii-public-v1: Korean PII Detection (real-data-augmented)},
  author = {ehd0309},
  year   = {2026},
  note   = {Fine-tuned from openai/privacy-filter on synthetic Korean data
            spanning 23 categories across common / public / finance / medical,
            augmented with KDPII (real conversational Korean PII).
            Evaluated on KDPII test split (held out).},
  url    = {https://huggingface.co/ehd0309/ko-pii-public-v1},
}

KDPII benchmark used in training and evaluation:

@misc{kdpii2024,
  title  = {KDPII: A New Korean Dialogic Dataset for the Deidentification
            of Personally Identifiable Information},
  year   = {2024},
  doi    = {10.5281/zenodo.10968609},
  note   = {CC BY 4.0},
}