# Symio-ai/legal-citation-verifier
## Model Description
Legal Citation Verifier validates legal citations against known case law databases. Given a citation string (e.g., "Smith v. Jones, 123 So. 3d 456 (Fla. 5th DCA 2020)"), the model classifies it as VALID, INVALID, PARTIALLY_VALID, or FABRICATED with a confidence score.
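To illustrate the input the model expects, here is a minimal sketch of splitting a citation string of that shape into its components. The regex and field names are illustrative assumptions for this card's example format, not part of the model:

```python
import re

# Hypothetical parse of "<case name>, <volume> <reporter> <page> (<court> <year>)".
CITATION_RE = re.compile(
    r"^(?P<case>.+?),\s+"
    r"(?P<volume>\d+)\s+(?P<reporter>[A-Za-z0-9. ]+?)\s+(?P<page>\d+)\s+"
    r"\((?P<court>.+?)\s+(?P<year>\d{4})\)$"
)

def parse_citation(text: str) -> dict:
    """Return the citation's named parts, or an empty dict if it doesn't match."""
    m = CITATION_RE.match(text.strip())
    return m.groupdict() if m else {}

parts = parse_citation("Smith v. Jones, 123 So. 3d 456 (Fla. 5th DCA 2020)")
```

Note that the verifier consumes the raw string; parsing like this is only useful for pre-normalization or debugging on the client side.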
This model is purpose-built for the GLACIER legal production pipeline to catch hallucinated citations before they reach court filings.
## Intended Use
- Primary: Pre-filing citation validation in the GLACIER Stage 5 WDC audit
- Secondary: Real-time citation checking during legal research and drafting
- Integration: Called by `bedrock-legal` and `mcp-troops` during WDC validation panels
## Task Type
`text-classification` -- Multi-class classification of citation validity
## Base Model
`microsoft/deberta-v3-large` -- Selected for its strong NLI performance, which transfers well to citation verification (matching citation text against factual records)
## Training Data
| Source | Records | Description |
|---|---|---|
| CourtListener API | ~5M opinions | Verified case citations with full metadata |
| Midpage Citation DB | ~13M opinions | Citation treatment signals (positive/negative/distinguished) |
| Fabricated Citations | ~500K synthetic | LLM-generated fake citations for negative training |
| Bar Discipline Cases | ~50K | Cases where attorneys were sanctioned for fabricated citations |
## Data Pipeline
- Extract all citations from CourtListener opinions via regex + NER
- Cross-reference each citation against the full database
- Generate negative examples using controlled LLM hallucination
- Label: VALID (exact match), PARTIALLY_VALID (close match, wrong page/date), INVALID (no match found), FABRICATED (structurally valid but nonexistent)
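The labeling rules above can be restated as a small decision function. This is an illustrative sketch of the rules on this card, not the production pipeline code; the boolean inputs stand in for the database cross-reference results:

```python
from enum import Enum

class Label(str, Enum):
    VALID = "VALID"
    PARTIALLY_VALID = "PARTIALLY_VALID"
    INVALID = "INVALID"
    FABRICATED = "FABRICATED"

def label_citation(exact_match: bool, close_match: bool, well_formed: bool) -> Label:
    """Map cross-reference results to a training label:
    exact DB hit -> VALID; near hit (wrong page/date) -> PARTIALLY_VALID;
    structurally plausible but no record -> FABRICATED; else INVALID."""
    if exact_match:
        return Label.VALID
    if close_match:
        return Label.PARTIALLY_VALID
    if well_formed:
        return Label.FABRICATED
    return Label.INVALID
```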
## Benchmark Criteria (90%+ Target)
| Metric | Target | Description |
|---|---|---|
| Accuracy | >= 92% | Overall classification accuracy |
| FABRICATED Recall | >= 97% | Must catch nearly all fabricated citations |
| VALID Precision | >= 95% | Must not reject real citations |
| Latency | < 200ms | Per-citation inference time for real-time use |
| F1 (macro) | >= 90% | Balanced performance across all classes |
**Critical threshold:** a FABRICATED citation passing as VALID is a zero-tolerance failure. The model must err on the side of flagging uncertain citations for human review.
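One way a downstream consumer might encode that policy is a simple routing function. This is a sketch; the threshold value is an assumption, not stated on this card:

```python
REVIEW_THRESHOLD = 0.90  # assumed policy value, not from this card

def route(classification: str, confidence: float) -> str:
    """Err on the side of human review: only a high-confidence VALID
    passes automatically; FABRICATED is blocked outright."""
    if classification == "FABRICATED":
        return "block"  # zero-tolerance failure mode
    if classification == "VALID" and confidence >= REVIEW_THRESHOLD:
        return "pass"
    return "human_review"
```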
## GLACIER Pipeline Integration
```
STAGE 2 (Research) --> citation-verifier checks all retrieved citations
STAGE 3 (WDC #1)   --> citation-verifier validates theory citations
STAGE 5 (WDC #2)   --> citation-verifier audits every citation in final draft
```
Integration point: called via `bedrock-legal` `verify_citation` or directly through a SageMaker endpoint. Returns structured JSON:
```json
{
  "citation": "Smith v. Jones, 123 So. 3d 456 (Fla. 5th DCA 2020)",
  "classification": "VALID",
  "confidence": 0.97,
  "matched_record": "courtlistener:op-12345",
  "treatment": "positive_cited"
}
```
## Training Configuration
- Epochs: 5
- Learning rate: 2e-5 with linear warmup
- Batch size: 32
- Max sequence length: 256
- Optimizer: AdamW with weight decay 0.01
- Hardware: AWS SageMaker ml.g5.2xlarge
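For reproducibility, the hyperparameters above can be collected into a single config object. This is a plain restatement of this card as a dict, not an excerpt from the actual training script:

```python
# Hyperparameters as listed on this model card.
TRAIN_CONFIG = {
    "base_model": "microsoft/deberta-v3-large",
    "epochs": 5,
    "learning_rate": 2e-5,
    "lr_schedule": "linear_warmup",
    "batch_size": 32,
    "max_seq_length": 256,
    "optimizer": "AdamW",
    "weight_decay": 0.01,
    "instance_type": "ml.g5.2xlarge",  # AWS SageMaker
}
```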
## Limitations
- Trained primarily on US federal and state court citations (FL, MS focus)
- May underperform on administrative law, tribal court, or international citations
- Citation format variations not seen in training may cause real citations to be incorrectly flagged
- Does not verify the substantive holding of the cited case, only its existence
## Version History
| Version | Date | Notes |
|---|---|---|
| v0.1 | 2026-04-10 | Initial model card, repo created |