Symio-ai/legal-research-ranker

Model Description

Legal Research Ranker is a cross-encoder reranking model that scores the relevance of legal documents to a research query. Given a (query, document) pair, it produces a relevance score between 0 and 1. It is designed to replace the generic Cohere Rerank model with a legal-domain-specific ranker.

Critical for ensuring the GLACIER pipeline surfaces the most relevant authorities first.

Intended Use

  • Primary: Rerank legal research results from CourtListener, Midpage, and bedrock-legal
  • Secondary: Score case relevance for precedent matching and brief support
  • Integration: Replaces/supplements Cohere Rerank in GLACIER Stage 2

Task Type

text-classification -- Cross-encoder relevance scoring (regression, 0-1 output)
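The 0-1 regression output is typically produced by projecting the encoder's pooled representation to a single logit and squashing it with a sigmoid. A minimal numpy sketch, assuming a 384-dimensional pooled vector (MiniLM's hidden size); the head weights here are random placeholders, not the trained model's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relevance_score(pooled, w, b):
    """Hypothetical scoring head: pooled encoder output -> single logit -> 0-1 score."""
    logit = pooled @ w + b
    return float(sigmoid(logit))

rng = np.random.default_rng(0)
pooled = rng.standard_normal(384)       # illustrative pooled [CLS] vector
w = rng.standard_normal(384) * 0.01     # illustrative head weights
score = relevance_score(pooled, w, 0.0)
assert 0.0 <= score <= 1.0
```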

Base Model

cross-encoder/ms-marco-MiniLM-L-12-v2 -- Strong baseline for passage reranking, to be fine-tuned on legal query-document pairs

Training Data

  • Legal Research Queries: ~200K (query, relevant_document) pairs from attorney research sessions
  • CourtListener Search Logs: ~500K pairs of implicit feedback from search click-through data
  • Expert Annotations: ~50K attorney-scored pairs (1-5 relevance scale)
  • Negative Mining: ~1M hard negatives from the same practice area but the wrong jurisdiction/posture
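The hard-negative strategy (same practice area, wrong jurisdiction or posture) can be sketched as a simple filter. The record fields below are illustrative, not the actual training schema:

```python
def mine_hard_negatives(query_meta, candidates, k=4):
    """Select documents from the same practice area but a different
    jurisdiction or procedural posture: superficially similar, so the
    ranker must learn more than topical overlap."""
    hard = [
        doc for doc in candidates
        if doc["practice_area"] == query_meta["practice_area"]
        and (doc["jurisdiction"] != query_meta["jurisdiction"]
             or doc["posture"] != query_meta["posture"])
    ]
    return hard[:k]

query_meta = {"practice_area": "employment", "jurisdiction": "9th Cir.",
              "posture": "summary judgment"}
candidates = [
    {"id": "A", "practice_area": "employment", "jurisdiction": "5th Cir.",
     "posture": "summary judgment"},       # wrong jurisdiction -> hard negative
    {"id": "B", "practice_area": "tax", "jurisdiction": "9th Cir.",
     "posture": "summary judgment"},       # different practice area -> excluded
    {"id": "C", "practice_area": "employment", "jurisdiction": "9th Cir.",
     "posture": "motion to dismiss"},      # wrong posture -> hard negative
]
hard = mine_hard_negatives(query_meta, candidates)
```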

Relevance Scale

  • 0.0-0.2: Irrelevant (different topic, jurisdiction, or legal issue)
  • 0.2-0.5: Tangentially relevant (same practice area but different issue)
  • 0.5-0.7: Relevant (same issue, persuasive authority)
  • 0.7-0.9: Highly relevant (same issue, same jurisdiction, similar facts)
  • 0.9-1.0: On-point (controlling authority, nearly identical facts)
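The bands above map directly to a bucketing helper, useful for turning raw scores into labels in logs or evaluation reports. A minimal sketch:

```python
def relevance_band(score):
    """Map a 0-1 ranker score to the model card's relevance bands."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if score < 0.2:
        return "irrelevant"
    if score < 0.5:
        return "tangentially relevant"
    if score < 0.7:
        return "relevant"
    if score < 0.9:
        return "highly relevant"
    return "on-point"
```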

Benchmark Criteria (90%+ Target)

  • NDCG@10: >= 0.85 (normalized DCG over the top-10 results)
  • MRR: >= 0.90 (mean reciprocal rank of the first relevant result)
  • Precision@5: >= 0.80 (precision in the top-5 results)
  • Latency: < 50 ms per (query, document) pair
  • Correlation: >= 0.88 (Spearman correlation with expert ratings)
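For reference, NDCG@10 and MRR are standard ranking metrics and can be computed directly from graded or binary relevance lists. A self-contained sketch:

```python
import math

def dcg(rels):
    """Discounted cumulative gain over a ranked list of relevance grades."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal if ideal > 0 else 0.0

def mrr(queries_ranked_binary):
    """Mean reciprocal rank of the first relevant result per query."""
    total = 0.0
    for rels in queries_ranked_binary:
        for i, r in enumerate(rels):
            if r:
                total += 1.0 / (i + 1)
                break
    return total / len(queries_ranked_binary)
```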

GLACIER Pipeline Integration

STAGE 2 (Research) --> research-ranker reranks all retrieval results before presentation
STAGE 3 (WDC #1)  --> ranker scores supporting authorities for theory strength
STAGE 5 (WDC #2)  --> ranker validates that cited authorities are actually the strongest available
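The Stage 2 rerank step reduces to scoring every (query, document) pair and sorting. A minimal sketch; `score_pair` stands in for a call to the deployed ranker (the toy keyword-overlap scorer below is purely illustrative):

```python
def rerank(query, documents, score_pair, top_k=10):
    """Score each document against the query and return the top_k best."""
    scored = [(score_pair(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def toy_score(query, doc):
    """Placeholder scorer: fraction of query tokens present in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

docs = [
    "employment discrimination summary judgment standard",
    "tax lien priority dispute",
    "employment discrimination pleading standard",
]
top = rerank("employment discrimination summary judgment", docs, toy_score, top_k=2)
```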

Training Configuration

  • Epochs: 3
  • Learning rate: 1e-5
  • Batch size: 64
  • Max sequence length: 512 (query: 64, document: 448)
  • Loss: MSE regression loss
  • Hardware: AWS SageMaker ml.g5.2xlarge
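The MSE regression objective above is simply the mean squared difference between predicted scores and the 0-1 relevance targets. A numpy sketch with illustrative batch values:

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error between predicted and target relevance scores."""
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.mean((pred - target) ** 2))

batch_pred   = [0.85, 0.40, 0.10]
batch_target = [0.90, 0.50, 0.00]
loss = mse_loss(batch_pred, batch_target)  # loss is approximately 0.0075
```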

Limitations

  • Optimized for litigation research queries; ranking quality may be lower for transactional/regulatory queries
  • Jurisdiction weighting is implicit; a separate jurisdiction filter should be applied first
  • Does not understand temporal relevance without explicit signals
  • Hard negatives from same practice area are the most challenging failure mode

Version History

  • v0.1 (2026-04-10): Initial model card, repo created