Beemer Claude Opus 4.7 committed
Commit d72272a · 1 Parent(s): 1e58371

Expand eval set to 89 questions; add semantic-fusion weight knob


Overnight evaluation and precision work (retrieval behaviour unchanged):

- data/eval/questions.json: 47 -> 89 gold questions, adding case-law,
D-memorandum, collective-agreement and NJC-directive coverage so
retrieval quality is measured across every source type.
- canlex/index.py: add the W_SEM fusion-weight constant (default 1.0 =
equal weight = unchanged behaviour). Diagnosis: for several eval
misses the semantic retriever ranks the gold #1-3 but BM25 ranks it
#51-58, and equal-weight RRF drags the fused rank down.
precision-findings.md has the measured sweep -- W_SEM=2.0 lifts the
eval Hit@1 0.57->0.65, Hit@5 0.88->0.90, MRR 0.70->0.75 with no regression.
- pending-cases.md: 18 leading SCC/FCA/FC cases curated for ingestion
once the Lexum 403 block clears (Phase 4 FPSLREB/CIRB is blocked by
the same intermittent block -- two ingest attempts both failed).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

canlex/index.py CHANGED
@@ -13,6 +13,7 @@ from .synonyms import expand_query
 K1 = 1.5
 B = 0.75
 RRF_K = 60 # reciprocal-rank-fusion damping constant
+W_SEM = 1.0 # weight on the semantic retriever in the fusion (1.0 = equal)
 CANDIDATES = 80 # hits each retriever contributes to the fusion
 RERANK_POOL = 50 # top fused candidates the cross-encoder rescores
 SOURCE_CAP = 2 # max chunks one case/memo/agreement/directive may contribute
@@ -289,7 +290,7 @@ class LegislationIndex:
         if self.semantic:
             sem_order, confidence = self._semantic_ranking(expanded)
             for rank, idx in enumerate(sem_order):
-                fused[idx] += 1.0 / (RRF_K + rank)
+                fused[idx] += W_SEM / (RRF_K + rank)
 
         # Ensure explicitly-referenced sections are retrieved even if recall missed them.
         refs = _section_refs(query)
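A minimal sketch of how the `W_SEM` change plays out, assuming the BM25 side of the fusion adds `1.0 / (RRF_K + rank)` with the same 0-based `rank` as the semantic loop in the hunk above (the BM25 loop is not part of this diff). The `fuse` helper and the toy document ids are hypothetical, not code from `canlex/index.py`.

```python
from collections import defaultdict

RRF_K = 60   # damping constant, same value as in the diff
W_SEM = 1.0  # semantic weight; 1.0 reproduces equal-weight fusion


def fuse(bm25_order, sem_order, w_sem=W_SEM, k=RRF_K):
    """Weighted reciprocal-rank fusion over two ranked lists of doc ids.

    Assumption: BM25 contributes 1.0 / (k + rank); only the semantic
    contribution is scaled by w_sem, as in the changed line.
    """
    fused = defaultdict(float)
    for rank, idx in enumerate(bm25_order):
        fused[idx] += 1.0 / (k + rank)
    for rank, idx in enumerate(sem_order):
        fused[idx] += w_sem / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


# Toy rankings (hypothetical ids): "gold" is semantic rank 0 but BM25 rank 50;
# "rival" is BM25 rank 0 and semantic rank 30.
BM25_ORDER = ["rival"] + [f"b{i}" for i in range(49)] + ["gold"]
SEM_ORDER = ["gold"] + [f"s{i}" for i in range(29)] + ["rival"]

equal_weight = fuse(BM25_ORDER, SEM_ORDER, w_sem=1.0)  # "rival" leads
up_weighted = fuse(BM25_ORDER, SEM_ORDER, w_sem=2.0)   # "gold" leads
```

In this toy case the gold document scores 1/60 + 1/110 at equal weight, which is below the rival's 1/60 + 1/90; doubling the semantic term gives 2/60 + 1/110 against 1/60 + 2/90, and the gold document moves to the top.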
data/eval/questions.json CHANGED
@@ -45,5 +45,47 @@
   {"query": "What are the standard hours of work for an employee?", "answers": [["Canada Labour Code", "169"]]},
   {"query": "What is the standard of review of an administrative decision on judicial review?", "answers": [["Vavilov", ""]]},
   {"query": "How does the Refugee Appeal Division review a decision of the Refugee Protection Division?", "answers": [["Huruglica", ""]]},
-  {"query": "To get back currency seized at the border, what must the claimant show about the money?", "answers": [["Sellathurai", ""]]}
+  {"query": "To get back currency seized at the border, what must the claimant show about the money?", "answers": [["Sellathurai", ""]]},
+  {"query": "What factors shape the content of the duty of procedural fairness in an administrative decision?", "answers": [["Baker", ""]]},
+  {"query": "Can Canada remove a person to a country where they face a substantial risk of torture?", "answers": [["Suresh", ""]]},
+  {"query": "Is the security certificate regime for detaining and removing non-citizens consistent with the Charter?", "answers": [["Charkaoui", ""]]},
+  {"query": "Do refugee claimants have a right to an oral hearing under the Charter?", "answers": [["Singh", ""]]},
+  {"query": "What is a particular social group in the definition of a Convention refugee?", "answers": [["Ward", ""]]},
+  {"query": "When is a person excluded from refugee protection for acts contrary to the purposes of the United Nations?", "answers": [["Pushpanathan", ""]]},
+  {"query": "What degree of involvement makes a person complicit in international crimes and excluded from refugee protection?", "answers": [["Ezokola", ""]]},
+  {"query": "Can a person be denied refugee protection for a serious crime committed abroad before claiming asylum?", "answers": [["Febles", ""]]},
+  {"query": "How must a decision-maker weigh the best interests of a child in a humanitarian and compassionate application?", "answers": [["Kanthasamy", ""]]},
+  {"query": "What does the national interest mean when the Minister grants relief from security inadmissibility?", "answers": [["Agraira", ""]]},
+  {"query": "Does helping fellow asylum seekers enter a country illegally amount to people smuggling for inadmissibility?", "answers": [["B010", ""]]},
+  {"query": "Is the offence of human smuggling unconstitutionally overbroad if it captures humanitarian aid?", "answers": [["Appulonappa", ""]]},
+  {"query": "Does a conditional sentence count as a term of imprisonment for serious criminality?", "answers": [["Tran", ""]]},
+  {"query": "Can an immigration detainee challenge their detention through habeas corpus?", "answers": [["Chhina", ""]]},
+  {"query": "What kinds of border search engage the Charter protection against unreasonable search and seizure?", "answers": [["Simmons", ""]]},
+  {"query": "Does inadmissibility for membership in a terrorist organization require a complicity analysis?", "answers": [["Kanagendren", ""]]},
+  {"query": "How broadly is a criminal organization interpreted for organized criminality inadmissibility?", "answers": [["Sittampalam", ""]]},
+  {"query": "What principles govern a finding of inadmissibility for misrepresentation?", "answers": [["Goburdhun", ""]]},
+  {"query": "At an immigration detention review, who bears the onus and how are earlier detention rulings treated?", "answers": [["Thanabalasingham", ""]]},
+  {"query": "What is the test for admitting new evidence in a pre-removal risk assessment?", "answers": [["Raza", ""]]},
+  {"query": "Are gold coins currency or monetary instruments that must be reported when imported?", "answers": [["Hociung", ""]]},
+  {"query": "How do the courts review a customs tariff classification decision?", "answers": [["Best Buy", ""]]},
+  {"query": "What does CBSA policy say about how the value for duty of imported goods is established?", "answers": [["D-Memo", "D13-1-1"]]},
+  {"query": "What are CBSA's requirements for marking imported goods with their country of origin?", "answers": [["D-Memo", "D11-3-1"]]},
+  {"query": "What is CBSA's guidance on importing or exporting cannabis and controlled substances?", "answers": [["D-Memo", "D19-9-2"]]},
+  {"query": "What personal exemptions can a resident claim when returning to Canada, per CBSA guidance?", "answers": [["D-Memo", "D2-3-1"]]},
+  {"query": "What is CBSA's guidance on cross-border currency and monetary instruments reporting?", "answers": [["D-Memo", "D19-14-1"]]},
+  {"query": "How does CBSA decide whether imported material is obscene?", "answers": [["D-Memo", "D9-1-1"]]},
+  {"query": "What proof of origin does CBSA require for imported goods?", "answers": [["D-Memo", "D11-4-2"]]},
+  {"query": "What is the Canadian Goods Abroad Program for goods sent outside Canada for repair?", "answers": [["D-Memo", "D8-2-1"]]},
+  {"query": "How does the FB Border Services collective agreement deal with discipline of an employee?", "answers": [["FB Agreement", "17"]]},
+  {"query": "What is the grievance procedure under the FB collective agreement?", "answers": [["FB Agreement", "18"]]},
+  {"query": "What are the hours of work under the FB Border Services collective agreement?", "answers": [["FB Agreement", "25"]]},
+  {"query": "How is overtime compensated under the FB collective agreement?", "answers": [["FB Agreement", "28"]]},
+  {"query": "How much vacation leave with pay do FB-group employees earn?", "answers": [["FB Agreement", "34"]]},
+  {"query": "What is the bilingualism bonus and who is eligible to receive it?", "answers": [["Bilingualism Bonus Directive", ""]]},
+  {"query": "What assistance is available to a federal employee who faces unusual daily commuting costs?", "answers": [["Commuting Assistance Directive", ""]]},
+  {"query": "When may the Immigration Division order the release of a detained person?", "answers": [["IRPA", "58"]]},
+  {"query": "When is there no right to appeal a removal order to the Immigration Appeal Division?", "answers": [["IRPA", "64"]]},
+  {"query": "Is distributing cannabis an offence?", "answers": [["Cannabis Act", "9"]]},
+  {"query": "What is the offence of smuggling goods into Canada under the Customs Act?", "answers": [["Customs Act", "159"]]},
+  {"query": "Is an employee entitled to medical leave under the Canada Labour Code?", "answers": [["Canada Labour Code", "239"]]}
 ]
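For reviewers unfamiliar with the metrics quoted in the commit message, the following is a rough sketch of how Hit@k and MRR could be computed over entries shaped like those in `questions.json`. The matching rule is an assumption: a gold `["source", ""]` pair is treated as matching any section of that source, and source names are compared by exact equality. The actual `canlex.eval` logic is not shown in this commit and may differ; all function names here are hypothetical.

```python
def first_hit_rank(retrieved, answers):
    """Return the 1-based rank of the first retrieved (source, section)
    pair that matches any gold answer, or None if nothing matches.

    Assumed rule: an empty gold section matches any section of the source.
    """
    for rank, (source, section) in enumerate(retrieved, start=1):
        for gold_source, gold_section in answers:
            if source == gold_source and gold_section in ("", section):
                return rank
    return None


def hit_at_k(ranks, k):
    """Fraction of questions whose gold answer appears in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks):
    """Mean reciprocal rank; a miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Toy retrieval result (hypothetical) scored against three gold entries.
RETRIEVED = [("IRPA", "58"), ("Vavilov", "12"), ("Canada Labour Code", "169")]
RANKS = [
    first_hit_rank(RETRIEVED, [["Vavilov", ""]]),
    first_hit_rank(RETRIEVED, [["Canada Labour Code", "169"]]),
    first_hit_rank(RETRIEVED, [["Customs Act", "159"]]),
]
```

Under this reading, Hit@1 only rewards a gold answer in first place, while MRR also gives partial credit for near misses, which is why the two can move by different amounts in a weight sweep.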
pending-cases.md ADDED
@@ -0,0 +1,59 @@
+# Leading cases to add when the Lexum block clears
+
+**Status (2026-05-21):** Phase 4 (FPSLREB/CIRB) and this case-law expansion are
+both blocked. `canlex.caselaw` fetches return **HTTP 403** from the Lexum
+decision hosts -- two full ingest attempts, every FPSLREB/CIRB decision failed.
+A single-shot probe of the same URLs returns 200, so the block is intermittent
+or specific to the fetch pattern (likely the `?iframe=true` decision endpoint).
+
+**To ingest when the block clears:** for each case below, open the decision on
+its court's `decisions.*.gc.ca` site, take the numeric **item id** from the
+`.../item/{id}/index.do` URL, and add an entry to `CASES` in
+`canlex/caselaw.py` (`{"court": ..., "id": ..., "short": ..., "topic": ...}`),
+then re-run `py -m canlex.caselaw` → `py -m canlex.embed` → redeploy. Verify the
+citation against the page (these are from memory). Cases already in the corpus
+are excluded.
+
+## Supreme Court of Canada (court: "scc")
+- **Chiarelli** -- Canada (MEI) v Chiarelli, [1992] 1 SCR 711 -- s. 7 fundamental
+  justice and the deportation of permanent residents; the constitutional
+  baseline for non-citizens.
+- **Chieu** -- Chieu v Canada (MCI), 2002 SCC 3 -- Immigration Appeal Division
+  removal-order appeals; foreign hardship is a relevant consideration.
+- **Medovarski** -- Medovarski v Canada (MCI), 2005 SCC 51 -- IRPA's objectives;
+  no Charter s. 7 right for a non-citizen to remain in Canada.
+- **Mugesera** -- Mugesera v Canada (MCI), 2005 SCC 40 -- inadmissibility for
+  crimes against humanity; incitement to genocide.
+- **Harkat** -- Canada (Citizenship and Immigration) v Harkat, 2014 SCC 37 --
+  constitutionality of the security-certificate regime and special advocates.
+- **Németh** -- Németh v Canada (Justice), 2010 SCC 56 -- extradition and the
+  refugee principle of non-refoulement.
+- **Mavi** -- Canada (AG) v Mavi, 2011 SCC 30 -- sponsorship-undertaking debt and
+  the duty of procedural fairness in enforcing it.
+- **Pham** -- R v Pham, 2013 SCC 15 -- collateral immigration consequences as a
+  factor in criminal sentencing.
+- **Chan** -- Chan v Canada (MEI), [1995] 3 SCR 593 -- refugee protection; a
+  particular social group and well-founded fear (verify the citation).
+- **Jacques** -- R v Jacques, [1996] 3 SCR 312 -- border searches; a vehicle stop
+  near the border and customs officers' powers.
+- **Martineau** -- Martineau v MNR, 2004 SCC 81 -- whether an ascertained-
+  forfeiture notice under the Customs Act is a penal proceeding.
+
+## Federal Court of Appeal (court: "fca")
+- **Poshteh** -- Poshteh v Canada (MCI), 2005 FCA 85 -- membership in a terrorist
+  organization for inadmissibility; relevance of the claimant's age.
+- **Hinzman** -- Hinzman v Canada (MCI), 2007 FCA 171 -- refugee and H&C claims by
+  US military deserters.
+- **Thamotharem** -- Canada (MCI) v Thamotharem, 2007 FCA 198 -- Refugee
+  Protection Division hearing procedure; order of questioning; fettering by
+  guidelines.
+- **Toussaint** -- Toussaint v Canada (AG), 2011 FCA 213 -- interim federal health
+  coverage and access for indigent applicants.
+- **Rahaman** -- Rahaman v Canada (MCI), 2002 FCA 89 -- refugee claims and the "no
+  credible basis" finding.
+
+## Federal Court (court: "fc")
+- **Almrei (Re)** -- Almrei (Re), 2009 FC 1263 -- reasonableness of a security
+  certificate after Charkaoui.
+- **Sahin** -- Sahin v Canada (MCI), [1995] 1 FC 214 -- the foundational factors
+  for the length of immigration detention.
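The ingestion steps in `pending-cases.md` (take the numeric item id from a `.../item/{id}/index.do` URL and build a `CASES` entry) could be scripted roughly as follows. This is a sketch under assumptions: the `case_entry` helper is hypothetical, storing `id` as an `int` is a guess about what `canlex/caselaw.py` expects, and the example URL and id are placeholders rather than a real decision.

```python
import re

# Matches the numeric item id in a ".../item/{id}/index.do" decision URL.
ITEM_ID = re.compile(r"/item/(\d+)/index\.do")


def case_entry(url, court, short, topic):
    """Build a dict in the documented CASES shape from a decision URL.

    Raises ValueError when the URL carries no item id, so a mistyped
    link fails loudly instead of producing a bad entry.
    """
    match = ITEM_ID.search(url)
    if match is None:
        raise ValueError(f"no item id found in {url!r}")
    return {
        "court": court,
        "id": int(match.group(1)),  # assumption: ids are stored as integers
        "short": short,
        "topic": topic,
    }


# Placeholder URL and id, for illustration only.
EXAMPLE = case_entry(
    "https://decisions.example.gc.ca/en/item/12345/index.do?iframe=true",
    court="scc",
    short="Chiarelli",
    topic="s. 7 and deportation of permanent residents",
)
```

The citation check described in the file still has to be done by hand against the decision page; the helper only removes the copy-paste step for the id.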
precision-findings.md ADDED
@@ -0,0 +1,79 @@
+# CanLex retrieval -- precision investigation (2026-05-21)
+
+Investigation of the persistent eval misses, with a tested, recommended fix.
+**No retrieval-algorithm change has been deployed** -- this is for review.
+
+## The question
+
+The eval had a handful of persistent misses where the correct provision ranked
+outside the top 5. Why, and what fixes it?
+
+## Diagnosis
+
+Stage-by-stage trace of each miss -- the gold provision's rank out of each
+retriever, and after fusion:
+
+| Query | Gold | BM25 rank | Semantic rank | Fused rank |
+|---|---|---|---|---|
+| pre-removal risk assessment | IRPA s.112 | 45 | 35 | 35 |
+| report to a customs officer on arrival | Customs Act s.11 | 51 | **1** | 6 |
+| duty to report imported goods | Customs Act s.12 | 58 | **1** | 6 |
+| report large amounts of currency | PCMLTFA s.12 | 82 | 32 | 63 |
+| seize unreported currency | PCMLTFA s.18 | 51 | **3** | 14 |
+
+Two distinct causes:
+
+**1. BM25 dilutes strong semantic hits.** For Customs Act s.11 and s.12 and
+PCMLTFA s.18 the *semantic* retriever ranks the gold #1, #1, #3 -- essentially
+perfect. But BM25 ranks the same provisions #51, #58, #51, because the query
+keywords ("report", "currency", "arriving") are common words with no
+distinctive term to latch onto. Reciprocal-rank fusion with equal weight
+sums the two reciprocal-rank scores, so a #1 semantic hit fused with a #51 BM25
+hit lands around #6. The strong signal is diluted by the weak one.
+
+**2. The enacting statute is out-competed by elaborating material.** IRPA s.112
+(PRRA) is ranked poorly by *both* retrievers (BM25 #45, semantic #35):
+the IRPR regulations (s.160 "Application for protection", s.161, s.165, s.232)
+elaborate the PRRA process across many focused sections, and the
+currency-forfeiture case law (Dokaj, Williams, Hociung) crowds PCMLTFA s.12. One
+enacting section cannot out-rank a dozen elaborating chunks on a topical query.
+The `_ensure_legislation` guarantee added this batch mitigates this at the
+production default `top_k=6` (PCMLTFA s.18 reaches #2 there, vs #11 at the
+eval's `top_k=20`), but does not fix cause #2 fully.
+
+## Tested fix -- up-weight the semantic retriever
+
+`canlex/index.py` now has a `W_SEM` constant: the weight on the semantic
+retriever's contribution to the RRF fusion (default **1.0** = equal weight =
+current, unchanged behaviour). Sweep on the 89-question eval set:
+
+| W_SEM | Hit@1 | Hit@3 | Hit@5 | Hit@10 | MRR |
+|---|---|---|---|---|---|
+| 1.0 (current) | 0.573 | 0.787 | 0.876 | 0.921 | 0.701 |
+| 1.5 | 0.629 | 0.798 | 0.888 | 0.933 | 0.737 |
+| 2.0 | 0.652 | 0.809 | 0.899 | 0.933 | 0.752 |
+| 3.0 | 0.652 | 0.820 | 0.910 | 0.933 | 0.754 |
+
+Up-weighting the semantic retriever improves or holds every metric at each step,
+with no regression -- the gain is largest exactly where the diagnosis predicted
+(Hit@1 +0.08, MRR +0.05 at W_SEM=2.0).
+
+## Recommendation
+
+**Set `W_SEM = 2.0`** in `canlex/index.py`. It captures most of the gain
+(Hit@1 0.57 -> 0.65, Hit@5 0.88 -> 0.90, MRR 0.70 -> 0.75) while keeping a
+meaningful BM25 contribution. W_SEM=3.0 squeezes slightly more but tilts the
+fusion heavily toward semantic; 2.0 is the balanced choice.
+
+To apply: change the one constant, run `py -m canlex.eval` to confirm, redeploy.
+
+Caveat: measured on the 89-question eval. Semantic up-weighting is principled
+(the diagnostic shows semantic genuinely ranks these golds well), but keep an
+eye on exact-keyword and section-number lookups after adopting it.
+
+## Still hard after W_SEM=2.0
+
+IRPA s.112 (PRRA) -- cause #2 above; W_SEM does not fix it, because semantic
+itself ranks s.112 only #35. A later option: an Act-over-its-own-regulation
+tie-break, or accepting that the IRPR PRRA regulations are themselves a
+reasonable answer and broadening that gold.
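As a quick arithmetic check on the sweep table in `precision-findings.md`, the sketch below recomputes the per-metric gain of a candidate weight over the `W_SEM = 1.0` baseline and confirms that no metric drops between consecutive weights. The numbers are copied from the table; the `gains` and `never_regresses` helpers are hypothetical and not part of `canlex`.

```python
# Sweep results copied from the table in precision-findings.md.
SWEEP = {
    1.0: {"Hit@1": 0.573, "Hit@3": 0.787, "Hit@5": 0.876, "Hit@10": 0.921, "MRR": 0.701},
    1.5: {"Hit@1": 0.629, "Hit@3": 0.798, "Hit@5": 0.888, "Hit@10": 0.933, "MRR": 0.737},
    2.0: {"Hit@1": 0.652, "Hit@3": 0.809, "Hit@5": 0.899, "Hit@10": 0.933, "MRR": 0.752},
    3.0: {"Hit@1": 0.652, "Hit@3": 0.820, "Hit@5": 0.910, "Hit@10": 0.933, "MRR": 0.754},
}


def gains(sweep, weight, baseline=1.0):
    """Per-metric difference between one weight and the baseline row."""
    return {
        metric: round(sweep[weight][metric] - sweep[baseline][metric], 3)
        for metric in sweep[baseline]
    }


def never_regresses(sweep):
    """True if every metric is non-decreasing as the weight increases."""
    weights = sorted(sweep)
    return all(
        sweep[higher][metric] >= sweep[lower][metric]
        for lower, higher in zip(weights, weights[1:])
        for metric in sweep[lower]
    )


GAINS_AT_2 = gains(SWEEP, 2.0)
NO_REGRESSION = never_regresses(SWEEP)
```

At `W_SEM = 2.0` the largest gains are on Hit@1 (0.079) and MRR (0.051), and Hit@1 and Hit@10 are flat from 2.0 to 3.0, which is consistent with the recommendation to stop at 2.0.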