Symio-ai/legal-pii-redactor
Model Description
Legal PII Redactor identifies and redacts personally identifiable information (PII) in legal documents while preserving legally necessary information. It distinguishes between PII that must be redacted (SSNs, financial account numbers, minor names) and PII that is legally required in filings (party names, addresses for service).
It implements court-mandated redaction requirements (FRCP 5.2, FL Rule 2.425) while ensuring that redacted filings remain valid.
Intended Use
- Primary: Redact PII from legal documents before filing per court rules
- Secondary: Prepare public versions of sealed or confidential documents
- Integration: Post-processing step in GLACIER Stage 6 before filing
Task Type
token-classification -- Named entity recognition for PII categories with context-aware redaction decisions
Base Model
microsoft/deberta-v3-base -- Efficient inference for high-throughput document processing
Training Data
| Source | Records | Description |
|---|---|---|
| Redacted Court Filings | ~200K | Filings with clerk-applied redactions (before/after pairs) |
| PII-Annotated Legal Docs | ~100K | Expert-annotated documents with PII labels |
| Court Redaction Orders | ~20K | Judicial orders specifying what to redact |
| FRCP 5.2 / Rule 2.425 Case Law | ~5K opinions | Rulings on redaction requirements |
| Synthetic PII Documents | ~500K | Generated documents with known PII for training |
PII Categories and Redaction Rules
| Label | Description | Redaction Rule |
|---|---|---|
| SSN | Social Security Number | ALWAYS redact (show last 4 only) |
| TAX_ID | Taxpayer ID | ALWAYS redact |
| FINANCIAL_ACCOUNT | Bank/credit account numbers | ALWAYS redact (last 4 only) |
| MINOR_NAME | Name of minor child | ALWAYS redact (use initials) |
| DOB_MINOR | Date of birth of minor | ALWAYS redact |
| HOME_ADDRESS | Home address | Redact unless needed for service |
| PHONE | Phone number | Redact unless in business context |
| EMAIL | Email address | Preserve if needed for certificate of service |
| MEDICAL | Medical information | Redact in public filings |
| PARTY_NAME | Named party | PRESERVE (required in caption) |
| ATTORNEY_INFO | Attorney contact | PRESERVE (required in filing) |
| CASE_NUMBER | Case number | PRESERVE |
| COURT_INFO | Court identification | PRESERVE |
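The category rules above can be read as a small decision function. The sketch below is illustrative only and is not the model's actual decision logic: the `Action` enum, the `decide` function, and its context flags are hypothetical names. The fallbacks for `EMAIL` (redact when not needed for service) and `MEDICAL` (preserve in non-public filings) are assumptions, since the card states only one side of each rule.

```python
from enum import Enum


class Action(Enum):
    REDACT = "redact"
    PRESERVE = "preserve"


# Labels whose rule does not depend on context, taken from the category list.
ALWAYS_REDACT = {"SSN", "TAX_ID", "FINANCIAL_ACCOUNT", "MINOR_NAME", "DOB_MINOR"}
ALWAYS_PRESERVE = {"PARTY_NAME", "ATTORNEY_INFO", "CASE_NUMBER", "COURT_INFO"}


def decide(
    label: str,
    *,
    needed_for_service: bool = False,
    business_context: bool = False,
    public_filing: bool = True,
) -> Action:
    """Map a PII label plus (hypothetical) context flags to an action."""
    if label in ALWAYS_REDACT:
        return Action.REDACT
    if label in ALWAYS_PRESERVE:
        return Action.PRESERVE
    if label == "HOME_ADDRESS":
        return Action.PRESERVE if needed_for_service else Action.REDACT
    if label == "PHONE":
        return Action.PRESERVE if business_context else Action.REDACT
    if label == "EMAIL":
        # Assumption: redact when not needed for a certificate of service.
        return Action.PRESERVE if needed_for_service else Action.REDACT
    if label == "MEDICAL":
        # Assumption: preserve when the filing is not public.
        return Action.REDACT if public_filing else Action.PRESERVE
    raise ValueError(f"unknown PII label: {label}")
```

For example, `decide("HOME_ADDRESS", needed_for_service=True)` returns `Action.PRESERVE`, while the same label with default flags returns `Action.REDACT`.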
Redaction Format
Original: "John Smith, SSN 123-45-6789, residing at 123 Main St"
Redacted: "John Smith, SSN XXX-XX-6789, residing at [ADDRESS REDACTED]"
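A minimal sketch of how the output format above could be produced once spans are known. In the real system the spans would come from the token-classification model. Here a regular expression stands in for SSN detection and the address span is located by string search, purely for illustration. The names `mask_ssn` and `redact_span` are hypothetical.

```python
import re

# Stand-in detector: matches SSNs written as NNN-NN-NNNN and captures the last 4 digits.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def mask_ssn(text: str) -> str:
    """Replace all but the last four digits of each SSN with X characters."""
    return SSN_RE.sub(r"XXX-XX-\1", text)


def redact_span(text: str, start: int, end: int, placeholder: str) -> str:
    """Replace text[start:end] with a bracketed placeholder."""
    return text[:start] + placeholder + text[end:]


original = "John Smith, SSN 123-45-6789, residing at 123 Main St"
masked = mask_ssn(original)

# Assumed span for the home address; a model prediction would supply this.
address = "123 Main St"
start = masked.index(address)
redacted = redact_span(masked, start, start + len(address), "[ADDRESS REDACTED]")

print(redacted)
# John Smith, SSN XXX-XX-6789, residing at [ADDRESS REDACTED]
```

The party name is left untouched, consistent with the rule that named parties are preserved.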
Benchmark Criteria (90%+ Target)
| Metric | Target | Description |
|---|---|---|
| PII Detection Recall | >= 98% | Must catch nearly all PII |
| SSN/Financial Recall | 100% | Zero tolerance for missed financial PII |
| Minor Name Recall | 100% | Zero tolerance for exposed minor information |
| False Redaction Rate | <= 2% | Must not redact legally required information |
| Court Rule Compliance | >= 95% | Redaction matches applicable court rule |
| Throughput | >= 50 pages/sec | Fast enough for bulk document processing |
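The recall and false-redaction metrics in the table could be computed roughly as follows. This is a sketch under stated assumptions: spans are `(start, end, label)` tuples compared by exact match, and the false redaction rate is taken to be the share of must-preserve spans that were redacted. The card does not define either convention, and the function names and sample spans are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, PII label)


def recall(gold: Set[Span], predicted: Set[Span]) -> float:
    """Fraction of gold PII spans that were detected (exact span match)."""
    if not gold:
        return 1.0  # nothing to find, so nothing was missed
    return len(gold & predicted) / len(gold)


def false_redaction_rate(must_preserve: Set[Span], redacted: Set[Span]) -> float:
    """Assumed definition: fraction of must-preserve spans that were redacted."""
    if not must_preserve:
        return 0.0
    return len(must_preserve & redacted) / len(must_preserve)


def by_label(spans: Set[Span], labels: Set[str]) -> Set[Span]:
    """Keep only spans whose label is in the given set."""
    return {s for s in spans if s[2] in labels}


# Made-up spans for illustration only.
gold = {(16, 27, "SSN"), (41, 52, "HOME_ADDRESS"), (60, 72, "PHONE"), (80, 95, "EMAIL")}
predicted = {(16, 27, "SSN"), (41, 52, "HOME_ADDRESS"), (60, 72, "PHONE")}
must_preserve = {(0, 10, "PARTY_NAME"), (100, 110, "CASE_NUMBER")}

overall = recall(gold, predicted)  # 3 of 4 spans found
ssn_fin = recall(by_label(gold, {"SSN", "FINANCIAL_ACCOUNT"}), predicted)
false_rate = false_redaction_rate(must_preserve, predicted)

print(overall >= 0.98, ssn_fin == 1.0, false_rate <= 0.02)
```

In this made-up case the SSN/financial and false-redaction targets pass while overall recall (0.75) falls short of the 98% target.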
GLACIER Pipeline Integration
STAGE 6 (Final Draft) --> pii-redactor processes document before filing
Input: final document text
Output: redacted version + redaction log
Redaction log shows: what was redacted, which rule required it, original value (encrypted)
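One possible shape for a redaction log entry, covering the three items listed above (what was redacted, which rule required it, and the encrypted original). The `RedactionLogEntry` class, its field names, and `make_entry` are hypothetical, and encryption is left to a caller-supplied function because the card does not specify a scheme. The stub used in the example is a placeholder, not encryption.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class RedactionLogEntry:
    label: str               # PII category, e.g. "SSN"
    start: int               # character offset in the input document
    end: int
    replacement: str         # text written into the redacted version
    rule: str                # court rule that required the redaction
    original_encrypted: str  # ciphertext of the original value


def make_entry(
    text: str,
    start: int,
    end: int,
    label: str,
    replacement: str,
    rule: str,
    encrypt: Callable[[str], str],
) -> RedactionLogEntry:
    """Build a log entry; the plaintext original is never stored."""
    return RedactionLogEntry(
        label=label,
        start=start,
        end=end,
        replacement=replacement,
        rule=rule,
        original_encrypted=encrypt(text[start:end]),
    )


def placeholder_encrypt(value: str) -> str:
    # Placeholder only, NOT encryption: swap in a real authenticated cipher.
    return f"<ciphertext for {len(value)} chars>"


doc = "John Smith, SSN 123-45-6789, residing at 123 Main St"
ssn = "123-45-6789"
start = doc.index(ssn)
entry = make_entry(
    doc, start, start + len(ssn), "SSN", "XXX-XX-6789", "FRCP 5.2", placeholder_encrypt
)
print(json.dumps(asdict(entry), indent=2))
```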
Court Rule Mapping:
- Federal: FRCP 5.2 (SSN, TIN, DOB of minors, financial accounts, minor names)
- Florida: Rule 2.425 (broader than federal -- includes home addresses)
- Mississippi: MRCP (follows federal standards)
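The mapping above could be represented as a lookup from jurisdiction to the rule citation and the labels it covers. This is a sketch, not the shipped configuration: the dictionary keys and the `rule_for` helper are hypothetical, and the Florida entry adds only `HOME_ADDRESS` to the federal set because that is the only difference the card names.

```python
from typing import Optional

# Labels listed for FRCP 5.2 in the mapping above.
FEDERAL_LABELS = frozenset(
    {"SSN", "TAX_ID", "DOB_MINOR", "FINANCIAL_ACCOUNT", "MINOR_NAME"}
)

COURT_RULES = {
    "federal": {"rule": "FRCP 5.2", "labels": FEDERAL_LABELS},
    # Assumption: Florida = federal set plus home addresses; the rule may cover more.
    "florida": {"rule": "Rule 2.425", "labels": FEDERAL_LABELS | {"HOME_ADDRESS"}},
    # The card states that Mississippi follows federal standards.
    "mississippi": {"rule": "MRCP", "labels": FEDERAL_LABELS},
}


def rule_for(jurisdiction: str, label: str) -> Optional[str]:
    """Return the rule citation requiring redaction of `label`, or None."""
    entry = COURT_RULES[jurisdiction.lower()]
    return entry["rule"] if label in entry["labels"] else None


print(rule_for("federal", "HOME_ADDRESS"))  # None
print(rule_for("florida", "HOME_ADDRESS"))  # Rule 2.425
```

Returning the citation rather than a boolean lets the same lookup fill the "which rule required it" field of the redaction log.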
Training Configuration
- Epochs: 10
- Learning rate: 3e-5
- Batch size: 32
- Max sequence length: 512
- Hardware: AWS SageMaker ml.g5.2xlarge
Limitations
- Context-dependent redaction decisions (e.g., when is an address needed for service?) require case-specific context
- Handwritten or poorly OCR'd documents may have lower PII detection rates
- Novel PII types (cryptocurrency addresses, biometric data) are less represented
- Does not handle image-based PII redaction (photos, scanned signatures)
- Sealed document handling requires additional judicial order analysis
Version History
| Version | Date | Notes |
|---|---|---|
| v0.1 | 2026-04-10 | Initial model card, repo created |