Symio-ai/legal-format-validator

Model Description

Legal Format Validator checks legal documents for compliance with jurisdiction-specific formatting rules. Given a document, it flags formatting violations such as incorrect margins, font, or spacing; missing caption elements, verification blocks, or certificates of service; improper signature blocks; and prohibited content.

It is designed to run alongside the legal-format-guard.sh hook as an ML-based format checker that catches nuanced formatting issues the regex-based hook can miss.

Intended Use

Primary: Validate document formatting before filing in GLACIER Stage 5
Secondary: Quality assurance for all legal document output
Integration: Supplements legal-format-guard.sh with ML-based validation

Task Type

text-classification -- Multi-label classification of formatting violations

Base Model

microsoft/deberta-v3-base -- Strong sequence classification with efficient inference

Training Data

Source	Records	Description
Accepted Court Filings	~200K	Successfully filed documents (positive examples)
Rejected Filings	~50K	Filings rejected by clerk for formatting issues
Format Rule Annotations	~100K	Expert-labeled formatting violations
Synthetic Violations	~300K	Programmatically generated format violations
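
The card does not describe how the synthetic violations are generated. As a minimal sketch of one plausible approach, the snippet below takes a compliant document and removes a required section, emitting the matching label. The sample document, the function names, and the regular expressions are all hypothetical and only illustrate the idea.

```python
from __future__ import annotations

import re

# Hypothetical compliant filing, used only for illustration.
COMPLIANT_DOC = """IN THE CIRCUIT COURT OF THE EXAMPLE JUDICIAL CIRCUIT
Case No. 0000-CA-000000

MOTION TO COMPEL

Plaintiff moves to compel discovery responses.

CERTIFICATE OF SERVICE
I certify that a copy was served on opposing counsel by email.

/s/ Jane Doe
Counsel for Plaintiff
"""


def drop_certificate_of_service(text: str) -> tuple[str, list[str]]:
    """Remove the Certificate of Service section and label the result COS_MISSING."""
    # Delete from the heading through the blank line that ends the section.
    stripped = re.sub(r"CERTIFICATE OF SERVICE\n.*?\n\n", "", text, flags=re.DOTALL)
    return stripped, ["COS_MISSING"]


def drop_signature_block(text: str) -> tuple[str, list[str]]:
    """Remove a trailing '/s/' signature block and label the result SIGNATURE_MISSING."""
    stripped = re.sub(r"/s/ .*\Z", "", text, flags=re.DOTALL)
    return stripped, ["SIGNATURE_MISSING"]


if __name__ == "__main__":
    for transform in (drop_certificate_of_service, drop_signature_block):
        violated_text, labels = transform(COMPLIANT_DOC)
        print(transform.__name__, labels, len(violated_text))
```

Because each transform returns the labels it introduces, the generated pairs can be used directly as multi-label training examples.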

Validation Checks (Labels)

CAPTION_MISSING -- No court caption
CAPTION_INCORRECT -- Caption has wrong court, parties, or case number
VERIFICATION_MISSING -- No verification block (required in FL/MS)
COS_MISSING -- No Certificate of Service
COS_INCOMPLETE -- COS missing required information
SIGNATURE_MISSING -- No signature block
SIGNATURE_IMPROPER -- Signature block format incorrect
AI_ATTRIBUTION -- Contains AI/Claude/Anthropic attribution (prohibited)
PUNITIVE_VIOLATION -- Demands punitive damages without leave of court (Fla. Stat. § 768.72)
PAGE_LIMIT_EXCEEDED -- Exceeds applicable page limit
FONT_VIOLATION -- Wrong font or size
SPACING_VIOLATION -- Wrong line spacing
MARGIN_VIOLATION -- Wrong margins
EXHIBIT_MISMATCH -- Referenced exhibit not attached or wrong number
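
Since the task is multi-label, each label gets its own independent score, and a document can carry several violations at once. The sketch below shows how one logit per label could be turned into a list of flagged violations. The label order and the 0.5 threshold are assumptions; the card specifies neither.

```python
import math

# Label order is an assumption; the card lists the labels but not their indices.
LABELS = [
    "CAPTION_MISSING",
    "CAPTION_INCORRECT",
    "VERIFICATION_MISSING",
    "COS_MISSING",
    "COS_INCOMPLETE",
    "SIGNATURE_MISSING",
    "SIGNATURE_IMPROPER",
    "AI_ATTRIBUTION",
    "PUNITIVE_VIOLATION",
    "PAGE_LIMIT_EXCEEDED",
    "FONT_VIOLATION",
    "SPACING_VIOLATION",
    "MARGIN_VIOLATION",
    "EXHIBIT_MISMATCH",
]


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode(logits, threshold=0.5):
    """Return every label whose independent probability meets the threshold.

    The 0.5 default is a placeholder assumption, not a value from the card.
    """
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per label")
    return [label for label, logit in zip(LABELS, logits) if sigmoid(logit) >= threshold]


if __name__ == "__main__":
    example_logits = [-5.0] * len(LABELS)
    example_logits[3] = 3.0   # COS_MISSING
    example_logits[7] = 0.2   # AI_ATTRIBUTION
    print(decode(example_logits))
```

Unlike a softmax classifier, nothing forces the probabilities to sum to one, so zero, one, or many labels may be returned for a single document.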

Benchmark Criteria (90%+ Target)

Metric	Target	Description
Violation Detection Recall	>= 95%	Must catch nearly all formatting violations
False Positive Rate	<= 5%	Must not flag compliant documents
AI_ATTRIBUTION Recall	100%	Zero tolerance for AI attribution in filings
PUNITIVE_VIOLATION Recall	100%	Zero tolerance for improper punitive demands
Latency	< 1s	Per-document validation time
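
To make the targets above concrete, here is a small sketch of how per-label recall and a document-level false positive rate could be computed from gold and predicted label sets. Treating the false positive rate as "the share of fully compliant documents that receive any flag" is an interpretation of the table's description, not a definition given in the card. The function names are hypothetical.

```python
def label_recall(y_true, y_pred, label):
    """Recall for one label: TP / (TP + FN). Returns None if the label never occurs."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
    if tp + fn == 0:
        return None
    return tp / (tp + fn)


def document_false_positive_rate(y_true, y_pred):
    """Share of compliant documents (no gold labels) that received at least one flag."""
    compliant_preds = [p for t, p in zip(y_true, y_pred) if not t]
    if not compliant_preds:
        return None
    return sum(1 for p in compliant_preds if p) / len(compliant_preds)


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the model.
    gold = [set(), set(), {"COS_MISSING"}, {"AI_ATTRIBUTION"}, {"AI_ATTRIBUTION", "COS_MISSING"}]
    pred = [set(), {"FONT_VIOLATION"}, {"COS_MISSING"}, {"AI_ATTRIBUTION"}, {"AI_ATTRIBUTION"}]
    print(label_recall(gold, pred, "AI_ATTRIBUTION"))
    print(label_recall(gold, pred, "COS_MISSING"))
    print(document_false_positive_rate(gold, pred))
```

Reporting recall per label matters here because the AI_ATTRIBUTION and PUNITIVE_VIOLATION targets are stricter than the aggregate recall target.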

GLACIER Pipeline Integration

STAGE 4 (First Draft) --> format-validator runs after document generation
STAGE 5 (WDC #2) --> format-validator is part of the full audit
STAGE 6 (Final Draft) --> format-validator confirms all fixes applied

Relationship to legal-format-guard.sh: The shell hook performs regex-based checks (fast, deterministic). This model handles semantic checks that regex cannot (e.g., "is this caption actually correct for this case?" or "does this verification block reference the right statute?").
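
The division of labor described above can be sketched as a two-stage check whose findings are merged. The regular expression and the `regex_stage` function are hypothetical stand-ins, since the actual rules in legal-format-guard.sh are not given in this card, and `model_stage` is any callable that returns a set of labels for a document.

```python
import re

# Hypothetical pattern; the real hook's rules are not documented here.
AI_ATTRIBUTION_RE = re.compile(
    r"\b(?:generated|written|drafted) by (?:an? )?(?:AI|Claude|Anthropic)\b",
    re.IGNORECASE,
)


def regex_stage(text):
    """Fast, deterministic checks of the kind a shell hook could perform."""
    findings = set()
    if AI_ATTRIBUTION_RE.search(text):
        findings.add("AI_ATTRIBUTION")
    if "CERTIFICATE OF SERVICE" not in text.upper():
        findings.add("COS_MISSING")
    return findings


def validate(text, model_stage):
    """Union of regex findings and model findings for one document."""
    return regex_stage(text) | set(model_stage(text))


if __name__ == "__main__":
    # Stub standing in for the classifier; it always reports a caption problem.
    def fake_model(text):
        return {"CAPTION_INCORRECT"}

    sample = "MOTION TO COMPEL\nDrafted by Claude.\nCERTIFICATE OF SERVICE\nServed by email."
    print(sorted(validate(sample, fake_model)))
```

Taking the union means a violation reported by either stage blocks the filing, which suits labels where a missed violation is costlier than a false alarm.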

Training Configuration

Epochs: 8
Learning rate: 2e-5
Batch size: 16
Max sequence length: 2048
Loss: Binary cross-entropy (multi-label)
Hardware: AWS SageMaker ml.g5.2xlarge
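
For reference, the multi-label binary cross-entropy named above treats each label as an independent yes/no decision and averages the per-label losses. The following is a plain-Python sketch of that loss computed from raw logits in a numerically stable form. It illustrates the formula only and is not the training code used for this model.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Per label: -[y * log(s(x)) + (1 - y) * log(1 - s(x))], where s is the sigmoid,
    rewritten as max(x, 0) - x * y + log(1 + exp(-|x|)) to avoid overflow.
    """
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


if __name__ == "__main__":
    # An uninformative logit of 0 costs log(2) regardless of the target.
    print(bce_with_logits([0.0], [1.0]))
    # Confident and correct predictions cost almost nothing.
    print(bce_with_logits([10.0, -10.0], [1.0, 0.0]))
    # Confident and wrong predictions are penalized heavily.
    print(bce_with_logits([10.0, -10.0], [0.0, 1.0]))
```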

Limitations

Training data is heavily weighted toward FL and MS; other jurisdictions may have gaps
Cannot check actual font rendering (only textual indicators of font specification)
Margin and spacing validation works on markdown/text indicators, not PDF measurement
Violations of local rules that are not represented in the training data, including newly adopted rules, will not be caught

Version History

Version	Date	Notes
v0.1	2026-04-10	Initial model card, repo created
