# Symio-ai/legal-research-ranker

## Model Description
Legal Research Ranker is a cross-encoder reranking model that scores the relevance of legal documents to a research query. Given a (query, document) pair, it produces a relevance score from 0 to 1. It is designed to replace the generic Cohere Rerank with a legal-domain-specific ranker, and is critical for ensuring the GLACIER pipeline surfaces the most relevant authorities first.
## Intended Use
- Primary: Rerank legal research results from CourtListener, Midpage, and bedrock-legal
- Secondary: Score case relevance for precedent matching and brief support
- Integration: Replaces/supplements Cohere Rerank in GLACIER Stage 2
## Task Type

`text-classification` -- cross-encoder relevance scoring (regression, 0-1 output)
## Base Model

`cross-encoder/ms-marco-MiniLM-L-12-v2` -- a strong baseline for passage reranking, to be fine-tuned on legal query-document pairs
## Training Data
| Source | Records | Description |
|---|---|---|
| Legal Research Queries | ~200K pairs | (query, relevant_document) from attorney research sessions |
| CourtListener Search Logs | ~500K pairs | Implicit feedback from search click-through data |
| Expert Annotations | ~50K pairs | Attorney-scored relevance (1-5 scale) |
| Negative Mining | ~1M pairs | Hard negatives from same practice area but wrong jurisdiction/posture |
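The expert annotations above are on a 1-5 scale while the model regresses to a 0-1 target, so the labels need normalizing before training. A minimal sketch, assuming a simple linear mapping (the card does not specify the exact scheme, and `expert_to_target` is an illustrative name):

```python
def expert_to_target(rating: int) -> float:
    """Linearly map a 1-5 expert relevance rating onto the 0-1 regression target.

    Assumes equal spacing between rating levels: 1 -> 0.0, 3 -> 0.5, 5 -> 1.0.
    """
    if not 1 <= rating <= 5:
        raise ValueError(f"expected a 1-5 rating, got {rating}")
    return (rating - 1) / 4
```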
## Relevance Scale
- 0.0-0.2: Irrelevant (different topic, jurisdiction, or legal issue)
- 0.2-0.5: Tangentially relevant (same practice area but different issue)
- 0.5-0.7: Relevant (same issue, persuasive authority)
- 0.7-0.9: Highly relevant (same issue, same jurisdiction, similar facts)
- 0.9-1.0: On-point (controlling authority, nearly identical facts)
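For downstream consumers, the scale above can be expressed as a small lookup. A sketch; the treatment of bucket boundaries (each threshold assigned to the higher bucket) is an assumption, as is the function name:

```python
def relevance_bucket(score: float) -> str:
    """Map a 0-1 relevance score to the card's qualitative buckets.

    Boundary values are assigned to the higher bucket (e.g. 0.5 -> "relevant").
    """
    if score < 0.2:
        return "irrelevant"
    elif score < 0.5:
        return "tangentially relevant"
    elif score < 0.7:
        return "relevant"
    elif score < 0.9:
        return "highly relevant"
    else:
        return "on-point"
```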
## Benchmark Criteria (90%+ Target)
| Metric | Target | Description |
|---|---|---|
| NDCG@10 | >= 0.85 | Normalized DCG for top-10 results |
| MRR | >= 0.90 | Mean reciprocal rank of first relevant result |
| Precision@5 | >= 0.80 | Precision in top-5 results |
| Latency | < 50ms/pair | Per-pair scoring speed |
| Correlation | >= 0.88 | Spearman correlation with expert ratings |
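For reference, the ranking metrics in the table can be computed from graded relevance labels in a few lines of plain Python. A sketch of NDCG@k and MRR under their standard definitions; an evaluation harness would typically use a library implementation instead:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one ranked list of graded relevance labels (ranked order as given)."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def mrr(relevant_flags_per_query):
    """Mean reciprocal rank of the first relevant result across queries.

    Each element is a ranked list of 0/1 flags; queries with no relevant
    result contribute 0.
    """
    total = 0.0
    for flags in relevant_flags_per_query:
        rank = next((i + 1 for i, rel in enumerate(flags) if rel), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(relevant_flags_per_query)
```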
## GLACIER Pipeline Integration

- Stage 2 (Research): research-ranker reranks all retrieval results before presentation
- Stage 3 (WDC #1): ranker scores supporting authorities for theory strength
- Stage 5 (WDC #2): ranker validates that cited authorities are actually the strongest available
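The Stage 2 reranking step can be sketched with the cross-encoder abstracted behind a scoring callable (`rerank` and `score_fn` are illustrative names, not part of the GLACIER API; in production `score_fn` would wrap the model's batched predict call):

```python
def rerank(query, documents, score_fn, top_k=10):
    """Score each document against the query and return the top_k by relevance.

    score_fn(query, doc) returns a relevance score where higher means more
    relevant; with this model it would be the 0-1 cross-encoder output.
    """
    scored = [(doc, score_fn(query, doc)) for doc in documents]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

With a toy keyword-overlap scorer standing in for the model, `rerank` promotes the document sharing the most query terms to the top of the list.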
## Training Configuration
- Epochs: 3
- Learning rate: 1e-5
- Batch size: 64
- Max sequence length: 512 (query: 64, document: 448)
- Loss: MSE regression loss
- Hardware: AWS SageMaker ml.g5.2xlarge
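The MSE regression loss listed above, written out for clarity (the actual training loop would compute this over GPU batches via the framework's loss module):

```python
def mse_loss(predictions, labels):
    """Mean squared error between predicted 0-1 relevance scores and targets."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(predictions)
```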
## Limitations

- Optimized for litigation research queries; transactional and regulatory queries may be ranked less accurately
- Jurisdiction weighting is implicit; a separate jurisdiction filter should be applied first
- Does not understand temporal relevance without explicit signals
- Hard negatives from same practice area are the most challenging failure mode
## Version History
| Version | Date | Notes |
|---|---|---|
| v0.1 | 2026-04-10 | Initial model card, repo created |
## Model tree for Symio-ai/legal-research-ranker

Base model lineage: `microsoft/MiniLM-L12-H384-uncased` -> `cross-encoder/ms-marco-MiniLM-L12-v2` -> `Symio-ai/legal-research-ranker`