CanLex / precision-findings.md

Beemer

Expand eval set to 89 questions; add semantic-fusion weight knob

d72272a 6 days ago

	# CanLex retrieval — precision investigation (2026-05-21)

	Investigation of the persistent eval misses, with a tested, recommended fix.
	No retrieval-algorithm change has been deployed — this is for review.

	## The question

	The eval had a handful of persistent misses where the correct provision ranked
	outside the top 5. Why, and what fixes it?

	## Diagnosis

	Stage-by-stage trace of each miss: the gold provision's rank from each
	retriever, and its rank after fusion.

	| Query | Gold | BM25 rank | Semantic rank | Fused rank |
	|---|---|---|---|---|
	| pre-removal risk assessment | IRPA s.112 | 45 | 35 | 35 |
	| report to a customs officer on arrival | Customs Act s.11 | 51 | 1 | 6 |
	| duty to report imported goods | Customs Act s.12 | 58 | 1 | 6 |
	| report large amounts of currency | PCMLTFA s.12 | 82 | 32 | 63 |
	| seize unreported currency | PCMLTFA s.18 | 51 | 3 | 14 |

	Two distinct causes:

	1. BM25 dilutes strong semantic hits. For Customs Act s.11, Customs Act s.12 and
	PCMLTFA s.18 the semantic retriever ranks the gold #1, #1 and #3, which is
	essentially perfect. BM25 ranks the same provisions #51, #58 and #51, because
	the query keywords ("report", "currency", "arrival") are common words with no
	distinctive term to latch onto. Equal-weight reciprocal-rank fusion adds the
	two retrievers' reciprocal-rank scores, so candidates that rank moderately well
	in both lists overtake a #1 semantic hit paired with a #51 BM25 hit, and the
	gold lands around #6. The strong signal is diluted by the weak one.

	2. The enacting statute is out-competed by elaborating material. IRPA s.112
	(PRRA) gets only a middling rank from both retrievers (BM25 #45, semantic #35):
	the IRPR regulations (s.160 "Application for protection", s.161, s.165, s.232)
	elaborate the PRRA process across many focused sections. Likewise, the
	currency-forfeiture case law (Dokaj, Williams, Hociung) crowds out PCMLTFA s.12.
	One enacting section cannot out-rank a dozen elaborating chunks on a topical
	query. The `_ensure_legislation` guarantee added in this batch mitigates this at
	the production default `top_k=6` (PCMLTFA s.18 reaches #2 there, vs #11 at the
	eval's `top_k=20`), but does not fully fix cause #2.
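
The dilution in cause 1 can be reproduced with a minimal weighted-RRF sketch. This is an illustration, not the code in `canlex/index.py`: the function names, the smoothing constant `k = 60` (a commonly used RRF default) and the synthetic rankings are all assumptions made for the example.

```python
from typing import Dict, List


def weighted_rrf(bm25: List[str], semantic: List[str],
                 w_sem: float = 1.0, k: int = 60) -> List[str]:
    """Fuse two ranked lists of ids; the highest fused score sorts first."""
    scores: Dict[str, float] = {}
    for rank, doc_id in enumerate(bm25, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    for rank, doc_id in enumerate(semantic, start=1):
        # w_sem scales only the semantic retriever's contribution.
        scores[doc_id] = scores.get(doc_id, 0.0) + w_sem / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


def fused_rank(doc_id: str, bm25: List[str], semantic: List[str],
               w_sem: float = 1.0) -> int:
    """1-based position of doc_id in the fused ranking."""
    return weighted_rrf(bm25, semantic, w_sem).index(doc_id) + 1


# Synthetic rankings (assumption): "gold" is #1 semantically but #51 for BM25,
# competing against fillers that both retrievers rank in the same order.
fillers = [f"d{i}" for i in range(1, 60)]
bm25_ranking = fillers[:50] + ["gold"] + fillers[50:]
semantic_ranking = ["gold"] + fillers

rank_equal = fused_rank("gold", bm25_ranking, semantic_ranking, w_sem=1.0)
rank_boosted = fused_rank("gold", bm25_ranking, semantic_ranking, w_sem=2.0)
print(rank_equal, rank_boosted)
```

In this toy setup the gold falls well below #1 at equal weight and moves up once `w_sem` is raised to 2.0. The exact fused rank depends entirely on the competing candidates, so these numbers are not expected to match the table above.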

	## Tested fix — up-weight the semantic retriever

	`canlex/index.py` now has a `W_SEM` constant: the weight on the semantic
	retriever's contribution to the RRF fusion (default 1.0 = equal weight =
	current, unchanged behaviour). Sweep on the 89-question eval set:

	| W_SEM | Hit@1 | Hit@3 | Hit@5 | Hit@10 | MRR |
	|---|---|---|---|---|---|
	| 1.0 (current) | 0.573 | 0.787 | 0.876 | 0.921 | 0.701 |
	| 1.5 | 0.629 | 0.798 | 0.888 | 0.933 | 0.737 |
	| 2.0 | 0.652 | 0.809 | 0.899 | 0.933 | 0.752 |
	| 3.0 | 0.652 | 0.820 | 0.910 | 0.933 | 0.754 |

	Up-weighting the semantic retriever never lowers a metric across the sweep:
	every column rises or holds steady (Hit@1 plateaus at 0.652 from 2.0, Hit@10 at
	0.933 from 1.5). The gain is largest exactly where the diagnosis predicted
	(Hit@1 +0.08, MRR +0.05).
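
For reference, the metrics in the table can be computed from per-question gold ranks as sketched below. This uses the standard textbook definitions and assumes one gold provision per question; `canlex.eval` may handle details differently (for example, a gold that falls outside `top_k`). The function names are hypothetical.

```python
from typing import List, Optional


def hit_at_k(gold_ranks: List[Optional[int]], k: int) -> float:
    """Fraction of questions whose gold provision ranks within the top k."""
    hits = sum(1 for r in gold_ranks if r is not None and r <= k)
    return hits / len(gold_ranks)


def mean_reciprocal_rank(gold_ranks: List[Optional[int]]) -> float:
    """Mean of 1/rank over all questions; a gold never retrieved counts as 0."""
    total = sum(1.0 / r for r in gold_ranks if r is not None)
    return total / len(gold_ranks)


# Hypothetical gold ranks for four questions; None means the gold was not retrieved.
example_ranks = [1, 3, 6, None]
print(hit_at_k(example_ranks, 1),
      hit_at_k(example_ranks, 5),
      mean_reciprocal_rank(example_ranks))
```

MRR rewards moving a gold from #6 to #1 far more than Hit@10 does, which is consistent with the sweep showing its largest gains in Hit@1 and MRR.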

	## Recommendation

	Set `W_SEM = 2.0` in `canlex/index.py`. It captures most of the gain
	(Hit@1 0.57 -> 0.65, Hit@5 0.88 -> 0.90, MRR 0.70 -> 0.75) while keeping a
	meaningful BM25 contribution. `W_SEM = 3.0` squeezes out slightly more
	(Hit@3 0.82, Hit@5 0.91) but tilts the fusion heavily toward semantic; 2.0 is
	the balanced choice.

	To apply: change the one constant, run `py -m canlex.eval` to confirm, redeploy.

	Caveat: measured on the 89-question eval. Semantic up-weighting is principled
	(the diagnostic shows semantic genuinely ranks these golds well), but keep an
	eye on exact-keyword and section-number lookups after adopting it.
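
To put the caveat in concrete terms, the Hit@k rates can be converted back into question counts. The sketch below assumes each of the 89 questions is scored as a single hit or miss, so that every rate is a whole number of questions divided by 89; the rates themselves are copied from the sweep table.

```python
N_QUESTIONS = 89

# Hit@k rates copied from the sweep table, keyed by W_SEM.
sweep = {
    1.0: {"hit@1": 0.573, "hit@3": 0.787, "hit@5": 0.876, "hit@10": 0.921},
    2.0: {"hit@1": 0.652, "hit@3": 0.809, "hit@5": 0.899, "hit@10": 0.933},
}


def as_counts(rates, n=N_QUESTIONS):
    """Convert hit rates to the nearest whole number of questions."""
    return {name: round(rate * n) for name, rate in rates.items()}


baseline = as_counts(sweep[1.0])
recommended = as_counts(sweep[2.0])
delta = {name: recommended[name] - baseline[name] for name in baseline}
print(baseline, recommended, delta)
```

Under that assumption, moving from 1.0 to 2.0 corresponds to 7 more questions answered at rank 1 and 2 more within the top 5. These are small absolute numbers, which is why rechecking on lookups outside the eval set is worthwhile.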

	## Still hard after W_SEM=2.0

	IRPA s.112 (PRRA) — cause #2 above; W_SEM does not fix it, because semantic
	itself ranks s.112 only #35. A later option: an Act-over-its-own-regulation
	tie-break, or accepting that the IRPR PRRA regulations are themselves a
	reasonable answer and broadening that gold.