| # CanLex retrieval: precision investigation (2026-05-21) |
|
|
| Investigation of the persistent eval misses, with a tested, recommended fix. |
| **No retrieval-algorithm change has been deployed**; this is for review. |
|
|
| ## The question |
|
|
| The eval had a handful of persistent misses where the correct provision ranked |
| outside the top 5. Why, and what fixes it? |
|
|
| ## Diagnosis |
|
|
| Stage-by-stage trace of each miss, showing the gold provision's rank from each |
| retriever and after fusion: |
|
|
| | Query | Gold | BM25 rank | Semantic rank | Fused rank | |
| |---|---|---|---|---| |
| | pre-removal risk assessment | IRPA s.112 | 45 | 35 | 35 | |
| | report to a customs officer on arrival | Customs Act s.11 | 51 | **1** | 6 | |
| | duty to report imported goods | Customs Act s.12 | 58 | **1** | 6 | |
| | report large amounts of currency | PCMLTFA s.12 | 82 | 32 | 63 | |
| | seize unreported currency | PCMLTFA s.18 | 51 | **3** | 14 | |
|
|
| Two distinct causes: |
|
|
| **1. BM25 dilutes strong semantic hits.** For Customs Act s.11 and s.12 and |
| PCMLTFA s.18 the *semantic* retriever ranks the gold #1, #1, #3, which is |
| essentially perfect. But BM25 ranks the same provisions #51, #58, #51, because |
| the query keywords ("report", "currency", "arrival") are common words with no |
| distinctive term to latch onto. Equal-weight reciprocal-rank fusion sums each |
| retriever's reciprocal-rank score, so a #1 semantic hit paired with a #51 BM25 |
| rank earns little from the BM25 side and lands around #6. The strong signal is |
| diluted by the weak one. |
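
The dilution can be checked with a few lines of arithmetic. The sketch below assumes the usual RRF form, a sum of `1 / (k + rank)` over retrievers, with `k = 60`. The fusion code in `canlex/index.py` is not reproduced in this note, so the constant `K`, the function `rrf_score`, and the competitor's ranks are illustrative assumptions.

```python
# Minimal sketch of equal-weight reciprocal-rank fusion (RRF).
# K = 60 is a commonly used constant and an assumption here,
# not a value taken from canlex/index.py.
K = 60


def rrf_score(bm25_rank: int, sem_rank: int, k: int = K) -> float:
    """Equal-weight RRF: sum of the reciprocal-rank terms from both retrievers."""
    return 1.0 / (k + bm25_rank) + 1.0 / (k + sem_rank)


# Gold provision from the table: BM25 #51, semantic #1 (Customs Act s.11).
gold = rrf_score(bm25_rank=51, sem_rank=1)

# Hypothetical competitor that is merely decent in both lists.
competitor = rrf_score(bm25_rank=8, sem_rank=8)

print(f"gold       {gold:.4f}")
print(f"competitor {competitor:.4f}")
print("competitor outranks gold:", competitor > gold)
```

Under these assumptions, any chunk that sits at rank 18 or better in both lists scores above the gold, which is how a #1 semantic hit can slip out of the top 5.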
|
|
| **2. The enacting statute is out-competed by elaborating material.** IRPA s.112 |
| (PRRA) ranks poorly in *both* retrievers (BM25 #45, semantic #35) because the |
| IRPR regulations (s.160 "Application for protection", s.161, s.165, s.232) |
| elaborate the PRRA process across many focused sections. Likewise, the |
| currency-forfeiture case law (Dokaj, Williams, Hociung) crowds out PCMLTFA |
| s.12. One enacting section cannot out-rank a dozen elaborating chunks on a |
| topical query. The `_ensure_legislation` guarantee added in this batch |
| mitigates this at the production default `top_k=6` (PCMLTFA s.18 reaches #2 |
| there, vs #11 at the eval's `top_k=20`), but does not fully fix cause #2. |
|
|
| ## Tested fix: up-weight the semantic retriever |
|
|
| `canlex/index.py` now has a `W_SEM` constant: the weight on the semantic |
| retriever's contribution to the RRF fusion (default **1.0** = equal weight = |
| current, unchanged behaviour). Sweep on the 89-question eval set: |
|
|
| | W_SEM | Hit@1 | Hit@3 | Hit@5 | Hit@10 | MRR | |
| |---|---|---|---|---|---| |
| | 1.0 (current) | 0.573 | 0.787 | 0.876 | 0.921 | 0.701 | |
| | 1.5 | 0.629 | 0.798 | 0.888 | 0.933 | 0.737 | |
| | 2.0 | 0.652 | 0.809 | 0.899 | 0.933 | 0.752 | |
| | 3.0 | 0.652 | 0.820 | 0.910 | 0.933 | 0.754 | |
| |
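| Up-weighting the semantic retriever never lowers a metric across the sweep: |
| Hit@3, Hit@5 and MRR rise at every step, while Hit@1 plateaus at 0.652 from |
| W_SEM=2.0 and Hit@10 at 0.933 from W_SEM=1.5. The gain is largest where the |
| diagnosis predicted, at the top of the ranking (Hit@1 +0.08, MRR +0.05). |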
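
For reference, the metrics in the table can be computed from the gold provision's rank per question. This is a sketch using the standard definitions of Hit@k and MRR. Whether `canlex.eval` computes them in exactly this way is an assumption, and the function names and toy ranks are illustrative.

```python
from typing import Optional, Sequence


def hit_at_k(gold_ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of questions whose gold provision ranks within the top k."""
    hits = sum(1 for r in gold_ranks if r is not None and r <= k)
    return hits / len(gold_ranks)


def mean_reciprocal_rank(gold_ranks: Sequence[Optional[int]]) -> float:
    """Mean of 1/rank; a gold that was never retrieved contributes 0."""
    return sum(1.0 / r for r in gold_ranks if r is not None) / len(gold_ranks)


# Toy 1-based gold ranks for four questions (None = not retrieved).
# These are illustrative values, not eval data.
ranks = [1, 6, None, 2]

print("Hit@1:", hit_at_k(ranks, 1))
print("Hit@5:", hit_at_k(ranks, 5))
print("MRR:  ", round(mean_reciprocal_rank(ranks), 4))
```

On an 89-question set, one question moves a Hit@k value by about 0.011, so the Hit@5 change from 0.876 to 0.899 corresponds to two additional questions landing in the top 5.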
| |
| ## Recommendation |
| |
| **Set `W_SEM = 2.0`** in `canlex/index.py`. It captures most of the gain |
| (Hit@1 0.57 -> 0.65, Hit@5 0.88 -> 0.90, MRR 0.70 -> 0.75) while keeping a |
| meaningful BM25 contribution. W_SEM=3.0 adds only slightly more (Hit@3 and |
| Hit@5 +0.011, MRR +0.002 over 2.0) but tilts the fusion heavily toward |
| semantic; 2.0 is the balanced choice. |
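
To illustrate what the weight does, here is a sketch of semantic-weighted RRF. Only the `W_SEM` constant is named in this note. The function `weighted_rrf`, the constant `K = 60`, and the rival's ranks are assumptions made for illustration, so the real fusion code may differ in detail.

```python
# Sketch of semantic-weighted RRF. K = 60 and the function name are
# assumptions; only the W_SEM constant appears in canlex/index.py per this note.
K = 60


def weighted_rrf(bm25_rank: int, sem_rank: int, w_sem: float, k: int = K) -> float:
    """BM25 keeps weight 1.0; the semantic reciprocal-rank term is scaled by w_sem."""
    return 1.0 / (k + bm25_rank) + w_sem / (k + sem_rank)


for w_sem in (1.0, 2.0, 3.0):
    # Gold from the diagnosis table: BM25 #51, semantic #1.
    gold = weighted_rrf(51, 1, w_sem)
    # Hypothetical keyword-favoured rival: BM25 #3, semantic #25.
    rival = weighted_rrf(3, 25, w_sem)
    print(f"W_SEM={w_sem}: gold={gold:.4f} rival={rival:.4f} gold ahead={gold > rival}")
```

In this toy case the gold trails the keyword-favoured rival at equal weight and overtakes it at `W_SEM = 2.0`. The BM25 term still contributes to every score, so a rival that ranks well in both lists can remain ahead of the gold.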
|
|
| To apply: change the one constant, run `py -m canlex.eval` to confirm, redeploy. |
|
|
| Caveat: measured on the 89-question eval. Semantic up-weighting is principled |
| (the diagnostic shows semantic genuinely ranks these golds well), but keep an |
| eye on exact-keyword and section-number lookups after adopting it. |
|
|
| ## Still hard after W_SEM=2.0 |
| |
| IRPA s.112 (PRRA): this is cause #2 above. W_SEM does not fix it, because the |
| semantic retriever itself ranks s.112 only #35. Later options: an |
| Act-over-its-own-regulation tie-break, or accepting that the IRPR PRRA |
| regulations are themselves a reasonable answer and broadening that gold. |
|
|