CanLex / precision-findings.md

Beemer

Expand eval set to 89 questions; add semantic-fusion weight knob

d72272a 5 days ago

CanLex retrieval — precision investigation (2026-05-21)

Investigation of the persistent eval misses, with a tested, recommended fix. No retrieval-algorithm change has been deployed — this is for review.

The question

The eval had a handful of persistent misses where the correct provision ranked outside the top 5. Why, and what fixes it?

Diagnosis

Stage-by-stage trace of each miss, showing the gold provision's rank from each retriever and after fusion:

Query	Gold	BM25 rank	Semantic rank	Fused rank
pre-removal risk assessment	IRPA s.112	45	35	35
report to a customs officer on arrival	Customs Act s.11	51	1	6
duty to report imported goods	Customs Act s.12	58	1	6
report large amounts of currency	PCMLTFA s.12	82	32	63
seize unreported currency	PCMLTFA s.18	51	3	14

Two distinct causes:

1. BM25 dilutes strong semantic hits. For Customs Act s.11 and s.12 and PCMLTFA s.18, the semantic retriever ranks the gold provision #1, #1 and #3, which is essentially perfect. BM25 ranks the same provisions #51, #58 and #51, because the query keywords ("report", "currency", "arrival") are common words that give it no distinctive term to latch onto. Equal-weight reciprocal-rank fusion (RRF) sums each retriever's reciprocal-rank score, so a #1 semantic hit paired with a #51 BM25 hit is overtaken by documents that score reasonably in both lists and lands around #6. The strong signal is diluted by the weak one.

2. The enacting statute is out-competed by elaborating material. IRPA s.112 (PRRA) gets only a mediocre rank from both retrievers (BM25 #45, semantic #35): the IRPR regulations (s.160 "Application for protection", s.161, s.165, s.232) elaborate the PRRA process across many focused sections. Likewise, the currency-forfeiture case law (Dokaj, Williams, Hociung) crowds out PCMLTFA s.12. One enacting section cannot out-rank a dozen elaborating chunks on a topical query. The _ensure_legislation guarantee added in this batch mitigates this at the production default top_k=6 (PCMLTFA s.18 reaches #2 there, vs #11 at the eval's top_k=20), but it does not fully fix cause #2.
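The dilution in cause #1 can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the CanLex implementation: the smoothing constant k = 60 is the value commonly used for RRF and is an assumption here (the write-up does not state the constant CanLex uses), and the competitor document is hypothetical.

```python
# Equal-weight reciprocal-rank fusion (RRF): each retriever contributes
# 1 / (k + rank) and the two contributions are summed.
K = 60  # ASSUMPTION: common RRF default; the CanLex value is not stated


def rrf_score(bm25_rank: int, semantic_rank: int, k: int = K) -> float:
    """Equal-weight RRF score for one document."""
    return 1.0 / (k + bm25_rank) + 1.0 / (k + semantic_rank)


# Customs Act s.11 from the trace table: BM25 #51, semantic #1.
gold = rrf_score(bm25_rank=51, semantic_rank=1)

# HYPOTHETICAL competitor that is merely decent (#10) in both lists.
competitor = rrf_score(bm25_rank=10, semantic_rank=10)

print(f"gold:       {gold:.4f}")
print(f"competitor: {competitor:.4f}")
print("competitor outranks gold:", competitor > gold)
```

Under these assumptions, a document that neither retriever considers the best answer still beats the provision that the semantic retriever ranked first, which is the pattern the trace shows for Customs Act s.11 and s.12.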

Tested fix — up-weight the semantic retriever

canlex/index.py now has a W_SEM constant: the weight on the semantic retriever's contribution to the RRF score. The default, 1.0, is equal weight, which leaves current behaviour unchanged. Sweep on the 89-question eval set:

W_SEM	Hit@1	Hit@3	Hit@5	Hit@10	MRR
1.0 (current)	0.573	0.787	0.876	0.921	0.701
1.5	0.629	0.798	0.888	0.933	0.737
2.0	0.652	0.809	0.899	0.933	0.752
3.0	0.652	0.820	0.910	0.933	0.754
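For readers unfamiliar with the column headings, the sketch below shows the usual definitions of Hit@k and MRR, computed from the rank of the gold provision for each query. It is an illustration only: the function names are hypothetical, and canlex.eval may differ in details such as how it scores a gold provision that is not retrieved at all (counted as 0 here).

```python
from typing import Optional, Sequence


def hit_at_k(gold_ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of queries whose gold provision is ranked k or better."""
    hits = sum(1 for r in gold_ranks if r is not None and r <= k)
    return hits / len(gold_ranks)


def mean_reciprocal_rank(gold_ranks: Sequence[Optional[int]]) -> float:
    """Mean of 1/rank; a gold that was never retrieved (None) scores 0."""
    total = sum(1.0 / r for r in gold_ranks if r is not None)
    return total / len(gold_ranks)


# Fused ranks of the five traced misses, taken from the diagnosis table.
ranks = [35, 6, 6, 63, 14]

print("Hit@5: ", hit_at_k(ranks, 5))
print("Hit@10:", hit_at_k(ranks, 10))
print("MRR:   ", round(mean_reciprocal_rank(ranks), 3))
```

On these five queries alone Hit@5 is zero, because every gold provision sits at rank 6 or worse; the table reports the same metrics averaged over all 89 questions.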

Up-weighting the semantic retriever causes no regression at any tested weight: each metric rises or holds steady as W_SEM increases (Hit@1 plateaus at 2.0, Hit@10 at 1.5). The gain is largest exactly where the diagnosis predicted (Hit@1 +0.08, MRR +0.05).
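One way a semantic weight can enter RRF is as a multiplier on the semantic retriever's reciprocal-rank term. The sketch below assumes that form, with k = 60 again assumed; the function name, the document IDs and the synthetic rankings are all hypothetical, and the actual W_SEM logic in canlex/index.py may be implemented differently.

```python
from collections import defaultdict
from typing import Dict, List

K = 60  # ASSUMPTION: common RRF default; the CanLex value is not stated


def weighted_rrf(bm25: List[str], semantic: List[str],
                 w_sem: float = 1.0, k: int = K) -> List[str]:
    """Fuse two ranked ID lists; w_sem scales the semantic contribution."""
    scores: Dict[str, float] = defaultdict(float)
    for rank, doc_id in enumerate(bm25, start=1):
        scores[doc_id] += 1.0 / (k + rank)
    for rank, doc_id in enumerate(semantic, start=1):
        scores[doc_id] += w_sem / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Synthetic rankings: "gold" is BM25 #51 / semantic #1 (like Customs Act
# s.11); the HYPOTHETICAL "rival" is BM25 #1 / semantic #20.
bm25 = ["rival"] + [f"b{i}" for i in range(2, 51)] + ["gold"]
semantic = ["gold"] + [f"s{i}" for i in range(2, 20)] + ["rival"]

for w in (1.0, 2.0):
    top = weighted_rrf(bm25, semantic, w_sem=w)[0]
    print(f"W_SEM={w}: top result is {top}")
```

With equal weight the keyword-favoured rival comes out on top; at a weight of 2.0 the gold provision does, while the BM25 term still contributes to every score rather than being switched off.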

Recommendation

Set W_SEM = 2.0 in canlex/index.py. It captures most of the gain (Hit@1 0.57 -> 0.65, Hit@5 0.88 -> 0.90, MRR 0.70 -> 0.75) while keeping a meaningful BM25 contribution. W_SEM=3.0 squeezes slightly more but tilts the fusion heavily toward semantic; 2.0 is the balanced choice.

To apply: change the one constant, run py -m canlex.eval to confirm, redeploy.

Caveat: measured on the 89-question eval. Semantic up-weighting is principled (the diagnostic shows semantic genuinely ranks these golds well), but keep an eye on exact-keyword and section-number lookups after adopting it.

Still hard after W_SEM=2.0

IRPA s.112 (PRRA) — cause #2 above; W_SEM does not fix it, because semantic itself ranks s.112 only #35. A later option: an Act-over-its-own-regulation tie-break, or accepting that the IRPR PRRA regulations are themselves a reasonable answer and broadening that gold.