Symio-ai/legal-format-validator

Model Description

Legal Format Validator checks legal documents for formatting compliance with jurisdiction-specific rules. Given a document, it identifies formatting violations: incorrect margins, font, spacing, missing caption elements, missing verification blocks, missing certificate of service, improper signature blocks, and prohibited content.

It is designed to work alongside the legal-format-guard.sh hook as an ML-powered format checker that catches nuanced formatting issues the regex-based hook might miss.

Intended Use

  • Primary: Validate document formatting before filing in GLACIER Stage 5
  • Secondary: Quality assurance for all legal document output
  • Integration: Supplements legal-format-guard.sh with ML-based validation

Task Type

text-classification -- Multi-label classification of formatting violations

Base Model

microsoft/deberta-v3-base -- Strong sequence classification with efficient inference

Training Data

| Source | Records | Description |
|---|---|---|
| Accepted Court Filings | ~200K | Successfully filed documents (positive examples) |
| Rejected Filings | ~50K | Filings rejected by clerk for formatting issues |
| Format Rule Annotations | ~100K | Expert-labeled formatting violations |
| Synthetic Violations | ~300K | Programmatically generated format violations |
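The card does not say how the synthetic violations are generated. As a rough illustration only, one plausible approach is to start from a compliant filing and apply a known edit that introduces exactly one labeled violation. The helper names `remove_section` and `synthesize`, the blank-line block splitting, and the injected attribution sentence are all assumptions made for this sketch.

```python
def remove_section(text: str, heading: str) -> str:
    """Drop every blank-line-separated block that starts with `heading`."""
    blocks = text.split("\n\n")
    kept = [b for b in blocks
            if not b.lstrip().upper().startswith(heading.upper())]
    return "\n\n".join(kept)


def synthesize(text: str) -> list[tuple[str, list[str]]]:
    """Return (document, labels) pairs derived from one compliant filing.

    Hypothetical generator: the real pipeline is not described in the card.
    """
    examples = [(text, [])]  # the untouched filing is a negative example

    without_cos = remove_section(text, "CERTIFICATE OF SERVICE")
    if without_cos != text:
        examples.append((without_cos, ["COS_MISSING"]))

    # Appending an attribution line yields an AI_ATTRIBUTION example.
    with_attribution = text + "\n\nDrafted with AI assistance."
    examples.append((with_attribution, ["AI_ATTRIBUTION"]))
    return examples
```

Because each edit is applied to a document known to be compliant, the resulting label is correct by construction, which is the main appeal of this kind of generation.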

Validation Checks (Labels)

  • CAPTION_MISSING -- No court caption
  • CAPTION_INCORRECT -- Caption has wrong court, parties, or case number
  • VERIFICATION_MISSING -- No verification block (required in FL/MS)
  • COS_MISSING -- No Certificate of Service
  • COS_INCOMPLETE -- COS missing required information
  • SIGNATURE_MISSING -- No signature block
  • SIGNATURE_IMPROPER -- Signature block format incorrect
  • AI_ATTRIBUTION -- Contains AI/Claude/Anthropic attribution (prohibited)
  • PUNITIVE_VIOLATION -- Demands punitive damages without leave of court (FL 768.72)
  • PAGE_LIMIT_EXCEEDED -- Exceeds applicable page limit
  • FONT_VIOLATION -- Wrong font or size
  • SPACING_VIOLATION -- Wrong line spacing
  • MARGIN_VIOLATION -- Wrong margins
  • EXHIBIT_MISMATCH -- Referenced exhibit not attached or wrong number

Benchmark Criteria (90%+ Target)

| Metric | Target | Description |
|---|---|---|
| Violation Detection Recall | >= 95% | Must catch nearly all formatting violations |
| False Positive Rate | <= 5% | Must not flag compliant documents |
| AI_ATTRIBUTION Recall | 100% | Zero tolerance for AI attribution in filings |
| PUNITIVE_VIOLATION Recall | 100% | Zero tolerance for improper punitive demands |
| Latency | < 1s | Per-document validation time |
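The card does not define how these metrics are computed. The sketch below uses one reasonable reading: recall is counted over individual (document, label) violations, and the false positive rate is the share of fully compliant documents that receive at least one flag. Both definitions and all function names are assumptions.

```python
def violation_recall(true_sets: list[set], pred_sets: list[set]) -> float:
    """Fraction of true violations (across all labels) that were predicted."""
    hits = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    total = sum(len(t) for t in true_sets)
    return hits / total if total else 1.0


def label_recall(label: str, true_sets: list[set], pred_sets: list[set]) -> float:
    """Recall restricted to a single label, e.g. AI_ATTRIBUTION."""
    relevant = [(t, p) for t, p in zip(true_sets, pred_sets) if label in t]
    if not relevant:
        return 1.0
    return sum(1 for _, p in relevant if label in p) / len(relevant)


def false_positive_rate(true_sets: list[set], pred_sets: list[set]) -> float:
    """Share of compliant documents (no true labels) that got any flag."""
    compliant = [p for t, p in zip(true_sets, pred_sets) if not t]
    if not compliant:
        return 0.0
    return sum(1 for p in compliant if p) / len(compliant)
```

For example, with four documents where one true violation is missed and one compliant document is wrongly flagged:

```python
truth = [{"COS_MISSING", "FONT_VIOLATION"}, set(), set(), {"AI_ATTRIBUTION"}]
preds = [{"COS_MISSING"}, set(), {"MARGIN_VIOLATION"}, {"AI_ATTRIBUTION"}]
print(violation_recall(truth, preds))     # 2 of 3 violations caught
print(false_positive_rate(truth, preds))  # 1 of 2 compliant documents flagged
```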

GLACIER Pipeline Integration

STAGE 4 (First Draft) --> format-validator runs after document generation
STAGE 5 (WDC #2) --> format-validator is part of the full audit
STAGE 6 (Final Draft) --> format-validator confirms all fixes applied

Relationship to legal-format-guard.sh: The shell hook performs regex-based checks (fast, deterministic). This model handles semantic checks that regex cannot (e.g., "is this caption actually correct for this case?" or "does this verification block reference the right statute?").
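A minimal sketch of how the two layers might be combined: deterministic pattern checks run first, and the model's labels are merged into the same result set. The contents of legal-format-guard.sh are not shown in this card, so the patterns below are illustrative stand-ins, and `model_check` is a hypothetical callable representing the classifier.

```python
import re
from typing import Callable, Iterable

# Illustrative patterns only; the real hook's rules are not documented here.
AI_ATTRIBUTION_RE = re.compile(
    r"generated (?:by|with) (?:ai|claude)|anthropic", re.IGNORECASE
)
COS_HEADING_RE = re.compile(r"certificate of service", re.IGNORECASE)


def regex_checks(text: str) -> set[str]:
    """Fast, deterministic checks in the spirit of the shell hook."""
    found = set()
    if AI_ATTRIBUTION_RE.search(text):
        found.add("AI_ATTRIBUTION")
    if not COS_HEADING_RE.search(text):
        found.add("COS_MISSING")
    return found


def validate(text: str, model_check: Callable[[str], Iterable[str]]) -> set[str]:
    """Union of regex findings and model findings for one document."""
    return regex_checks(text) | set(model_check(text))
```

Taking the union means either layer can raise a violation on its own, so the model adds semantic checks without weakening the deterministic guarantees of the regex pass.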

Training Configuration

  • Epochs: 8
  • Learning rate: 2e-5
  • Batch size: 16
  • Max sequence length: 2048
  • Loss: Binary cross-entropy (multi-label)
  • Hardware: AWS SageMaker ml.g5.2xlarge
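For reference, binary cross-entropy in the multi-label setting treats each label as an independent yes/no decision and averages the per-label losses. The sketch below computes it from raw logits in plain Python using the standard numerically stable form; it illustrates the loss itself and is not the project's training code. The `CONFIG` dictionary simply restates the hyperparameters listed above, with key names chosen for this example.

```python
import math

# Hyperparameters restated from the list above; key names are assumptions.
CONFIG = {
    "epochs": 8,
    "learning_rate": 2e-5,
    "batch_size": 16,
    "max_seq_length": 2048,
}


def bce_with_logits(logits: list[float], targets: list[float]) -> float:
    """Mean binary cross-entropy over labels for one example.

    Uses max(x, 0) - x*y + log(1 + exp(-|x|)), which avoids overflow
    for large positive or negative logits.
    """
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)
```

A logit of 0 corresponds to a probability of 0.5, so each label contributes log(2) regardless of its target; confident correct predictions drive the loss toward zero.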

Limitations

  • Training data is heavily weighted toward FL and MS; other jurisdictions may have gaps
  • Cannot check actual font rendering (only textual indicators of font specification)
  • Margin and spacing validation works on markdown/text indicators, not PDF measurement
  • Violations of new local rules that are not represented in the training data will not be caught

Version History

| Version | Date | Notes |
|---|---|---|
| v0.1 | 2026-04-10 | Initial model card, repo created |