Symio-ai/legal-citation-verifier

Model Description

Legal Citation Verifier validates legal citations against known case law databases. Given a citation string (e.g., "Smith v. Jones, 123 So. 3d 456 (Fla. 5th DCA 2020)"), the model classifies it as VALID, INVALID, PARTIALLY_VALID, or FABRICATED with a confidence score.

This model is purpose-built for the GLACIER legal production pipeline to catch hallucinated citations before they reach court filings.

Intended Use

  • Primary: Pre-filing citation validation in the GLACIER Stage 5 WDC audit
  • Secondary: Real-time citation checking during legal research and drafting
  • Integration: Called by bedrock-legal and mcp-troops during WDC validation panels

Task Type

text-classification -- Multi-class classification of citation validity

Base Model

microsoft/deberta-v3-large -- Selected for its strong NLI performance, which transfers well to citation verification (matching citation text against factual records).

Training Data

| Source | Records | Description |
|---|---|---|
| CourtListener API | ~5M opinions | Verified case citations with full metadata |
| Midpage Citation DB | ~13M opinions | Citation treatment signals (positive/negative/distinguished) |
| Fabricated Citations | ~500K synthetic | LLM-generated fake citations for negative training |
| Bar Discipline Cases | ~50K | Cases where attorneys were sanctioned for fabricated citations |

Data Pipeline

  1. Extract all citations from CourtListener opinions via regex + NER
  2. Cross-reference each citation against the full database
  3. Generate negative examples using controlled LLM hallucination
  4. Label: VALID (exact match), PARTIALLY_VALID (close match, wrong page/date), INVALID (no match found), FABRICATED (structurally valid but nonexistent)
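Steps 1 and 4 can be illustrated with a minimal sketch. It is not the pipeline's actual code. The regex covers only a handful of reporters, and `Citation`, `extract_citations`, and `label_citation` are hypothetical names. The "close match" rule (same volume and reporter, with either the page or the year agreeing) is a deliberately crude assumption, as is the choice to label synthetic examples FABRICATED by their origin rather than by lookup.

```python
import re
from typing import List, NamedTuple, Optional, Set


class Citation(NamedTuple):
    volume: int
    reporter: str
    page: int
    year: Optional[int]


# Illustrative pattern only: it covers a few reporters, not every citation format.
CITE_RE = re.compile(
    r"\b(?P<volume>\d+)\s+"
    r"(?P<reporter>So\. ?(?:2d|3d)|F\. ?(?:2d|3d|4th)|U\.S\.)\s+"
    r"(?P<page>\d+)"
    r"(?:\s*\([^()]*?(?P<year>\d{4})\))?"
)


def extract_citations(text: str) -> List[Citation]:
    """Step 1 (regex part only): pull reporter citations out of free text."""
    found = []
    for m in CITE_RE.finditer(text):
        year = int(m.group("year")) if m.group("year") else None
        reporter = m.group("reporter").replace(" ", "")  # "So. 3d" -> "So.3d"
        found.append(
            Citation(int(m.group("volume")), reporter, int(m.group("page")), year)
        )
    return found


def label_citation(cite: Citation, known: Set[Citation], synthetic: bool = False) -> str:
    """Step 4: assign a training label (assumed rules, see the note above)."""
    if synthetic:
        # Assumption: LLM-generated negatives are labeled by where they came from.
        return "FABRICATED"
    if cite in known:
        return "VALID"
    for rec in known:
        same_volume = (rec.volume, rec.reporter) == (cite.volume, cite.reporter)
        if same_volume and (rec.page == cite.page or rec.year == cite.year):
            return "PARTIALLY_VALID"
    return "INVALID"
```

For example, `extract_citations("Smith v. Jones, 123 So. 3d 456 (Fla. 5th DCA 2020)")` yields a single `Citation` with volume 123, reporter `So.3d`, page 456, and year 2020.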

Benchmark Criteria (90%+ Target)

| Metric | Target | Description |
|---|---|---|
| Accuracy | >= 92% | Overall classification accuracy |
| FABRICATED Recall | >= 97% | Must catch nearly all fabricated citations |
| VALID Precision | >= 95% | Must not reject real citations |
| Latency | < 200ms | Per-citation inference time for real-time use |
| F1 (macro) | >= 90% | Balanced performance across all classes |
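As a rough illustration of how the classification targets could be checked on a labeled evaluation set, the sketch below computes the four quality metrics in plain Python. The function names and the `TARGETS` dictionary are hypothetical. Latency is left out because it has to be measured against the deployed endpoint.

```python
from collections import Counter

LABELS = ("VALID", "INVALID", "PARTIALLY_VALID", "FABRICATED")

# Thresholds copied from the benchmark criteria.
TARGETS = {
    "accuracy": 0.92,
    "fabricated_recall": 0.97,
    "valid_precision": 0.95,
    "macro_f1": 0.90,
}


def _ratio(num: float, den: float) -> float:
    return num / den if den else 0.0


def evaluate(y_true, y_pred):
    """Return accuracy, FABRICATED recall, VALID precision, and macro F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    precision = {c: _ratio(tp[c], tp[c] + fp[c]) for c in LABELS}
    recall = {c: _ratio(tp[c], tp[c] + fn[c]) for c in LABELS}
    f1 = {
        c: _ratio(2 * precision[c] * recall[c], precision[c] + recall[c])
        for c in LABELS
    }
    return {
        "accuracy": _ratio(sum(tp.values()), len(y_true)),
        "fabricated_recall": recall["FABRICATED"],
        "valid_precision": precision["VALID"],
        "macro_f1": sum(f1.values()) / len(LABELS),
    }


def meets_targets(metrics):
    """Map each metric name to True if it reaches its target."""
    return {name: metrics[name] >= floor for name, floor in TARGETS.items()}
```

A single fabricated citation predicted as VALID lowers both FABRICATED recall and VALID precision, which is why those two metrics are tracked separately from overall accuracy.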

Critical threshold: A FABRICATED citation passing as VALID is a zero-tolerance failure. The model must err on the side of flagging uncertain citations for human review.
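One way to bias decisions toward review is an asymmetric rule applied to the class probabilities. The sketch below is an assumption, not the model's documented behavior: the `NEEDS_REVIEW` outcome, the `decide` function, and both thresholds are hypothetical and would need tuning on held-out data.

```python
def decide(probs, min_valid=0.95, max_fabricated=0.01):
    """Return a label, accepting VALID only when the evidence is strong.

    probs maps each class name to its predicted probability.
    NEEDS_REVIEW is a hypothetical extra outcome for human triage.
    """
    top = max(probs, key=probs.get)
    if top != "VALID":
        # Non-VALID predictions are already flagged downstream.
        return top
    confident = probs["VALID"] >= min_valid
    low_risk = probs.get("FABRICATED", 0.0) <= max_fabricated
    return "VALID" if confident and low_risk else "NEEDS_REVIEW"
```

Under this rule a citation scored 0.60 VALID and 0.30 FABRICATED is routed to review rather than accepted, trading some throughput for a lower chance that a fabricated citation passes as VALID.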

GLACIER Pipeline Integration

STAGE 2 (Research) --> citation-verifier checks all retrieved citations
STAGE 3 (WDC #1)  --> citation-verifier validates theory citations
STAGE 5 (WDC #2)  --> citation-verifier audits every citation in final draft

Integration point: Called via bedrock-legal verify_citation or directly through a SageMaker endpoint. Returns structured JSON:

{
  "citation": "Smith v. Jones, 123 So. 3d 456 (Fla. 5th DCA 2020)",
  "classification": "VALID",
  "confidence": 0.97,
  "matched_record": "courtlistener:op-12345",
  "treatment": "positive_cited"
}
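A caller might parse and sanity-check this payload before acting on it. The sketch below uses only the field names shown in the example response. `VerificationResult` and `parse_result` are hypothetical names, and treating `matched_record` and `treatment` as optional is an assumption (a citation with no match presumably has neither).

```python
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED = {"VALID", "INVALID", "PARTIALLY_VALID", "FABRICATED"}


@dataclass(frozen=True)
class VerificationResult:
    citation: str
    classification: str
    confidence: float
    matched_record: Optional[str] = None  # assumed absent when nothing matched
    treatment: Optional[str] = None       # assumed absent when nothing matched


def parse_result(payload: str) -> VerificationResult:
    """Decode the JSON response and reject malformed values."""
    data = json.loads(payload)
    result = VerificationResult(
        citation=data["citation"],
        classification=data["classification"],
        confidence=float(data["confidence"]),
        matched_record=data.get("matched_record"),
        treatment=data.get("treatment"),
    )
    if result.classification not in ALLOWED:
        raise ValueError(f"unknown classification: {result.classification!r}")
    if not 0.0 <= result.confidence <= 1.0:
        raise ValueError(f"confidence out of range: {result.confidence}")
    return result
```

Failing loudly on an unknown classification keeps a malformed response from being silently treated as a pass.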

Training Configuration

  • Epochs: 5
  • Learning rate: 2e-5 with linear warmup
  • Batch size: 32
  • Max sequence length: 256
  • Optimizer: AdamW with weight decay 0.01
  • Hardware: AWS SageMaker ml.g5.2xlarge
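To make "linear warmup" concrete, the sketch below shows one common shape for such a schedule: the learning rate rises linearly to the 2e-5 peak, then decays linearly to zero. The card specifies neither the warmup length nor the decay, so `warmup_steps`, `total_steps`, and the linear decay are assumptions.

```python
def lr_at(step, total_steps, warmup_steps, peak_lr=2e-5):
    """Learning rate at a given optimizer step.

    Linear warmup from 0 to peak_lr over warmup_steps, then (assumed)
    linear decay back to 0 at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Example: 1,000 total steps with a hypothetical 100-step warmup.
schedule = [lr_at(s, total_steps=1000, warmup_steps=100) for s in (0, 50, 100, 550, 1000)]
```

The warmup keeps early updates small while AdamW's moment estimates are still settling, which tends to matter when fine-tuning a large pretrained encoder.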

Limitations

  • Trained primarily on US federal and state court citations (FL, MS focus)
  • May underperform on administrative law, tribal court, or international citations
  • Citation format variations not in training data may cause false positives
  • Does not verify the substantive holding of the cited case, only its existence

Version History

| Version | Date | Notes |
|---|---|---|
| v0.1 | 2026-04-10 | Initial model card, repo created |