Symio-ai/legal-format-validator
Model Description
Legal Format Validator checks legal documents for formatting compliance with jurisdiction-specific rules. Given a document, it identifies formatting violations: incorrect margins, font, spacing, missing caption elements, missing verification blocks, missing certificate of service, improper signature blocks, and prohibited content.
It is designed to work alongside the legal-format-guard.sh hook as an ML-powered format checker that catches nuanced formatting issues the regex-based hook might miss.
Intended Use
- Primary: Validate document formatting before filing in GLACIER Stage 5
- Secondary: Quality assurance for all legal document output
- Integration: Supplements legal-format-guard.sh with ML-based validation
Task Type
text-classification -- Multi-label classification of formatting violations
Base Model
microsoft/deberta-v3-base -- Strong sequence classification with efficient inference
Training Data
| Source | Records | Description |
|---|---|---|
| Accepted Court Filings | ~200K | Successfully filed documents (positive examples) |
| Rejected Filings | ~50K | Filings rejected by clerk for formatting issues |
| Format Rule Annotations | ~100K | Expert-labeled formatting violations |
| Synthetic Violations | ~300K | Programmatically generated format violations |
Validation Checks (Labels)
- CAPTION_MISSING -- No court caption
- CAPTION_INCORRECT -- Caption has wrong court, parties, or case number
- VERIFICATION_MISSING -- No verification block (required in FL/MS)
- COS_MISSING -- No Certificate of Service
- COS_INCOMPLETE -- COS missing required information
- SIGNATURE_MISSING -- No signature block
- SIGNATURE_IMPROPER -- Signature block format incorrect
- AI_ATTRIBUTION -- Contains AI/Claude/Anthropic attribution (prohibited)
- PUNITIVE_VIOLATION -- Demands punitive without leave (FL 768.72)
- PAGE_LIMIT_EXCEEDED -- Exceeds applicable page limit
- FONT_VIOLATION -- Wrong font or size
- SPACING_VIOLATION -- Wrong line spacing
- MARGIN_VIOLATION -- Wrong margins
- EXHIBIT_MISMATCH -- Referenced exhibit not attached or wrong number
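Because the task is multi-label, each label is scored independently and a document can carry several violations at once. The sketch below shows one plausible way to turn per-label logits into label names. The index order of `LABELS`, the `decode_violations` helper, and the 0.5 threshold are assumptions for illustration, not details confirmed by this card.

```python
import math

# Label names come from the card; the index order is an assumption.
LABELS = [
    "CAPTION_MISSING", "CAPTION_INCORRECT", "VERIFICATION_MISSING",
    "COS_MISSING", "COS_INCOMPLETE", "SIGNATURE_MISSING",
    "SIGNATURE_IMPROPER", "AI_ATTRIBUTION", "PUNITIVE_VIOLATION",
    "PAGE_LIMIT_EXCEEDED", "FONT_VIOLATION", "SPACING_VIOLATION",
    "MARGIN_VIOLATION", "EXHIBIT_MISMATCH",
]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_violations(logits, threshold=0.5):
    """Return every label whose independent probability meets the threshold.

    `threshold` is a hypothetical default, not a value published for this model.
    """
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    return [name for name, z in zip(LABELS, logits) if sigmoid(z) >= threshold]
```

An empty result means no violation was flagged at the chosen threshold; labels are returned in `LABELS` order.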
Benchmark Criteria (90%+ Target)
| Metric | Target | Description |
|---|---|---|
| Violation Detection Recall | >= 95% | Must catch nearly all formatting violations |
| False Positive Rate | <= 5% | Must not flag compliant documents |
| AI_ATTRIBUTION Recall | 100% | Zero tolerance for AI attribution in filings |
| PUNITIVE_VIOLATION Recall | 100% | Zero tolerance for improper punitive demands |
| Latency | < 1s | Per-document validation time |
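The targets in the table can be checked against a labeled evaluation set with a few lines of code. The sketch below assumes that recall is counted over individual (document, label) pairs and that the false positive rate is measured at the document level, as the share of fully compliant documents that receive any flag. Both definitions are one reasonable reading of the table, and the helper names and sample data are hypothetical.

```python
def recall(true_sets, pred_sets, label=None):
    """Share of true violations that were predicted.

    With `label` set, only that label is counted (e.g. "AI_ATTRIBUTION").
    Returns None when there are no true violations to recall.
    """
    hits = total = 0
    for truth, pred in zip(true_sets, pred_sets):
        wanted = truth if label is None else truth & {label}
        total += len(wanted)
        hits += len(wanted & pred)
    return hits / total if total else None


def false_positive_rate(true_sets, pred_sets):
    """Share of compliant documents (no true violations) that were flagged."""
    compliant = [pred for truth, pred in zip(true_sets, pred_sets) if not truth]
    if not compliant:
        return None
    return sum(1 for pred in compliant if pred) / len(compliant)


# Hypothetical four-document evaluation set.
truth = [{"COS_MISSING"}, set(), {"AI_ATTRIBUTION", "FONT_VIOLATION"}, set()]
preds = [{"COS_MISSING"}, {"MARGIN_VIOLATION"}, {"AI_ATTRIBUTION"}, set()]

overall_recall = recall(truth, preds)
ai_recall = recall(truth, preds, label="AI_ATTRIBUTION")
fpr = false_positive_rate(truth, preds)
```

Reporting per-label recall separately matters here because the two zero-tolerance labels have a stricter target than the overall figure.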
GLACIER Pipeline Integration
STAGE 4 (First Draft) --> format-validator runs after document generation
STAGE 5 (WDC #2) --> format-validator is part of the full audit
STAGE 6 (Final Draft) --> format-validator confirms all fixes applied
Relationship to legal-format-guard.sh: The shell hook performs regex-based checks (fast, deterministic). This model handles semantic checks that regex cannot (e.g., "is this caption actually correct for this case?" or "does this verification block reference the right statute?").
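One way to picture the split between the two checkers is to take the union of their findings, so that a violation caught by either one blocks the filing. The sketch below is illustrative only: the regex is a hypothetical stand-in for a single hook check (the hook's actual patterns are not described in this card), and `model_labels` stands for whatever labels the model returned for the same document.

```python
import re

# Hypothetical pattern; the real legal-format-guard.sh rules are not shown here.
AI_ATTRIBUTION_RE = re.compile(r"\b(?:claude|anthropic)\b", re.IGNORECASE)


def regex_findings(text):
    """Deterministic pass: flag literal AI attribution strings."""
    return {"AI_ATTRIBUTION"} if AI_ATTRIBUTION_RE.search(text) else set()


def combined_findings(text, model_labels):
    """Union of the regex pass and the model's predicted labels."""
    return regex_findings(text) | set(model_labels)
```

Taking the union favors recall over precision, which fits the zero-tolerance labels but would need review if false positives become a problem.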
Training Configuration
- Epochs: 8
- Learning rate: 2e-5
- Batch size: 16
- Max sequence length: 2048
- Loss: Binary cross-entropy (multi-label)
- Hardware: AWS SageMaker ml.g5.2xlarge
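For reference, the multi-label binary cross-entropy listed above treats each label as its own yes/no problem and averages the per-label losses. The sketch below computes it from raw logits in a numerically stable form. Working from logits and averaging over labels are assumptions about the setup (they match a common default in deep learning frameworks), and the function name is hypothetical.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the stable identity: loss = max(z, 0) - z * y + log(1 + exp(-|z|)).
    """
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)
```

A logit of 0 against a positive target gives a loss of ln 2, and confident correct predictions push the loss toward zero.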
Limitations
- Training data is heavily weighted toward FL and MS; other jurisdictions may have gaps
- Cannot check actual font rendering (only textual indicators of font specification)
- Margin and spacing validation works on markdown/text indicators, not PDF measurement
- New local rules not in training data will not be caught
Version History
| Version | Date | Notes |
|---|---|---|
| v0.1 | 2026-04-10 | Initial model card, repo created |