Beemer Claude Opus 4.7 committed
Commit 666cd44 · 1 Parent(s): 4b46b2f

Use case-law topic, not paragraph range, as the retrieval title

A case-law chunk's marginal_note is the paragraph range ('paras 11-13'),
which is useless for topic matching; the proposition the case stands for
lives in the 'heading' field, constant across every chunk of a given case.
The BM25 title-position emphasis, the title-match boost, and the semantic
embed_text were all reading marginal_note for case-law and giving it no
useful signal, so leading cases stayed buried behind statutes and memos
that name the doctrine by literal wording.

Add a doc_type-aware topical_title helper used in all three places. For
case-law it returns 'heading' (the case topic); for memoranda it keeps the
existing 'part'-based selection; for everything else it returns marginal_note.
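
A minimal usage sketch of the helper described above. The function body follows the one added in canlex/index.py; the sample chunks and their field values are hypothetical and exist only to show which field is chosen, including the fallback to marginal_note when a case-law chunk has an empty heading.

```python
def topical_title(chunk):
    """Pick the field that carries the chunk's topic, by doc_type."""
    doc_type = chunk.get("doc_type")
    if doc_type == "memorandum":
        return chunk.get("part") or chunk["marginal_note"]
    if doc_type == "caselaw":
        return chunk.get("heading") or chunk["marginal_note"]
    return chunk["marginal_note"]


# Hypothetical chunks: the field values are invented for illustration.
case = {"doc_type": "caselaw", "marginal_note": "paras 11-13",
        "heading": "Innocent misrepresentation exception"}
memo = {"doc_type": "memorandum", "marginal_note": "Guidelines",
        "part": "Importation of vehicles"}
statute = {"doc_type": "legislation", "marginal_note": "Misrepresentation"}
case_without_heading = {"doc_type": "caselaw", "marginal_note": "paras 1-4",
                        "heading": ""}

for chunk in (case, memo, statute, case_without_heading):
    print(chunk["doc_type"], "->", topical_title(chunk))
```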

Also fix a related issue: a query like "IRPA s. 40 misrepresentation defence"
uses the section number topically, but the section-ref recall pulled every
Act's s. 40 into the pool and the section-ref pin pushed them above the
case law that interprets the section the user meant. Both paths now require
the candidate's act_short or act_code to appear as a substring in the query
-- a lookup-style "section 32 of the Customs Act" still pins cleanly, but
a topical "IRPA s. 40" no longer drags in AAAMPA, Cannabis Act, etc.
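
The gating rule above can be sketched in isolation. The substring check mirrors the _act_in_query helper in the diff; naive_tokens is a hypothetical stand-in for the project's tokenizer (not shown in this commit), and the chunk metadata is illustrative. The sketch shows why token overlap would be too permissive for short act codes.

```python
import re


def naive_tokens(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs. The real
    # tokenize() is not part of this commit and may differ.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def act_in_query(chunk, query):
    # Same rule as the diff: the Act's short name, or its code if at
    # least 3 characters long, must appear as a substring of the query.
    q_lc = query.lower()
    short = chunk["act_short"].lower()
    code = chunk["act_code"].lower()
    return bool((short and short in q_lc)
                or (code and len(code) >= 3 and code in q_lc))


# Illustrative chunk metadata (assumed values, not read from the corpus).
irpa = {"act_short": "IRPA", "act_code": "I-2.5", "section": "40"}
customs = {"act_short": "Customs Act", "act_code": "C-52.6", "section": "32"}
aaampa = {"act_short": "AAAMPA", "act_code": "A-8.8", "section": "40"}

topical = "IRPA s. 40 misrepresentation defence"
lookup = "section 32 of the Customs Act"
unrelated = "Is a permit needed under s. 8?"

print(act_in_query(irpa, topical), act_in_query(aaampa, topical))
print(act_in_query(customs, lookup))
# Token overlap would wrongly link "A-8.8" to the unrelated query:
print(naive_tokens("A-8.8") & naive_tokens(unrelated))
print(act_in_query(aaampa, unrelated))
```

With the substring rule, only the IRPA chunk passes for the topical query and the Customs Act chunk still passes for the lookup query, while the overlap on the tokens "a" and "8" does not count as naming the Act.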

141-question eval: Hit@1 0.79 / Hit@3 0.92 / Hit@5 0.95 / Hit@10 0.97
/ MRR 0.86 -- vs the pre-fix 139-question baseline of 0.75 / 0.88 / 0.94 /
0.97 / 0.83: Hit@1 +0.04, Hit@3 +0.04, Hit@5 +0.01, Hit@10 unchanged,
MRR +0.03; misses fell from 10 to 7. Two new
Wang/Bellido gold questions on the IRPA s. 40 misrepresentation doctrine
both surface in top 5 (previously absent from top 20). One small collateral
re-rank: Customs Tariff Sch-98.03 (non-resident conveyance) slipped from
#2 to #6, beaten by Sch-98.02 on opposite-paraphrase ("non-resident" vs
"resident"); the embedding space shifts under any re-embed.

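For readers unfamiliar with the metrics quoted above, here is a minimal sketch of the usual definitions of Hit@k and MRR. The repository's eval script is not shown in this commit, so whether it computes them exactly this way is an assumption; the ranks below are made-up data, not results from the 141-question run.

```python
def hit_at_k(ranks, k):
    """Fraction of questions whose first correct result is at rank <= k.

    ranks holds the 1-based rank of the first gold answer per question,
    or None when the gold answer was not retrieved at all.
    """
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all questions; a miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Made-up ranks for eight hypothetical questions.
ranks = [1, 1, 2, 4, None, 1, 3, 7]

for k in (1, 3, 5, 10):
    print(f"Hit@{k}: {hit_at_k(ranks, k):.3f}")
print(f"MRR: {mean_reciprocal_rank(ranks):.3f}")
```
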
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Files changed (3)
  1. canlex/embed.py +6 -7
  2. canlex/index.py +52 -21
  3. data/eval/questions.json +2 -0
canlex/embed.py CHANGED
@@ -33,13 +33,12 @@ def load_chunks():
 def embed_text(chunk):
     """Compact, retrieval-focused representation of one section."""
     # The section title is the strongest topical signal, so it is repeated to
-    # emphasise it. For D-memoranda the marginal note is only a generic section
-    # label ('Guidelines', 'Legislation'); the memo's actual subject lives in
-    # the 'part' field, so that is used as the title instead.
-    if chunk.get("doc_type") == "memorandum":
-        note = chunk.get("part") or chunk["marginal_note"]
-    else:
-        note = chunk["marginal_note"]
+    # emphasise it. Title selection is doc_type-aware (see index.topical_title):
+    # a D-memo's marginal_note is a generic banner so its actual subject in
+    # 'part' is used; a case-law chunk's marginal_note is just the paragraph
+    # range so the case proposition in 'heading' is used.
+    from .index import topical_title
+    note = topical_title(chunk)
     body = chunk["text"][:_MAX_BODY]
     parts = [chunk["act_short"], note, note, chunk["heading"], body]
     return " . ".join(p for p in parts if p)
canlex/index.py CHANGED
@@ -85,6 +85,23 @@ def _section_refs(query):
     return set(_SECTION_REF.findall(query.lower()))
 
 
+def topical_title(chunk):
+    """Return the chunk's topic-bearing string, used wherever a section's
+    'title' is weighted for retrieval -- BM25 indexing, the title-match boost,
+    and the semantic embedding. Differs by doc_type because the field that
+    carries the topic differs: legislation/agreement/directive/delegation use
+    the marginal_note (section heading); D-memoranda use 'part' because their
+    marginal_note is a generic banner; case-law uses 'heading' because its
+    marginal_note is just the paragraph range ('paras 11-13') and the case
+    proposition lives in heading."""
+    doc_type = chunk.get("doc_type")
+    if doc_type == "memorandum":
+        return chunk.get("part") or chunk["marginal_note"]
+    if doc_type == "caselaw":
+        return chunk.get("heading") or chunk["marginal_note"]
+    return chunk["marginal_note"]
+
+
 def _provision_units(text):
     """Citable parts of a provision, for pinpoint scoring -- a list of
     (citation_suffix, scoring_text, snippet). One entry per paragraph, with its
@@ -136,11 +153,15 @@ class LegislationIndex:
         self.postings = defaultdict(list) # term -> [(doc_idx, term_frequency), ...]
         df = defaultdict(int)
         for idx, c in enumerate(self.chunks):
-            # The marginal note (title) is repeated to weight it above body text;
+            # The topical title is repeated to weight it above body text;
             # the Act name, code and section are indexed too, so an Act's own
             # terminology (e.g. "controlled substance") and its codes/numbers
-            # are searchable even when a section's text omits them.
-            blob = " ".join((c["marginal_note"], c["marginal_note"], c["heading"],
+            # are searchable even when a section's text omits them. The title
+            # is doc_type-aware via topical_title -- for case-law it picks
+            # the case proposition (heading), not the paragraph range
+            # (marginal_note), so a leading case surfaces on a topical query.
+            title = topical_title(c)
+            blob = " ".join((title, title, c["heading"],
                              c["part"], c["division"], c["act_name"], c["act_code"],
                              c["section"], c["text"]))
             counts = Counter(tokenize(blob))
@@ -153,24 +174,16 @@ class LegislationIndex:
         self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}
 
     def _build_note_tokens(self):
-        """Pre-tokenise each chunk's topical title, for the title-match boost in
-        search(). For legislation, agreements and directives the title is the
-        marginal note (the section heading). A D-memorandum's marginal note is
-        generic ('Legislation', 'Guidelines and General Information', or a stray
-        page banner), so the memo's subject -- carried in its 'part' field -- is
-        used instead. Each chunk is also flagged as a regulation (act codes
-        beginning SOR/C.R.C.) for the Act-over-regulation preference, and as
-        collective-agreement back-matter (memoranda and letters with no article
-        number) for the back-matter penalty."""
+        """Pre-tokenise each chunk's topical title (see topical_title) for the
+        title-match boost in search(). Each chunk is also flagged as a
+        regulation (act codes beginning SOR/C.R.C.) for the Act-over-regulation
+        preference, and as collective-agreement back-matter (memoranda and
+        letters with no article number) for the back-matter penalty."""
         self._note_tokens = []
         self._is_regulation = []
         self._is_backmatter = []
         for c in self.chunks:
-            if c.get("doc_type") == "memorandum":
-                title = c.get("part") or c["marginal_note"]
-            else:
-                title = c["marginal_note"]
-            self._note_tokens.append(set(tokenize(title)))
+            self._note_tokens.append(set(tokenize(topical_title(c))))
             self._is_regulation.append(
                 c.get("doc_type", "legislation") == "legislation"
                 and c["act_code"].startswith(("SOR", "C.R.C")))
@@ -381,11 +394,24 @@ class LegislationIndex:
         for rank, idx in enumerate(sem_order):
             fused[idx] += W_SEM / (RRF_K + rank)
 
-        # Ensure explicitly-referenced sections are retrieved even if recall missed them.
+        # Ensure explicitly-referenced sections are retrieved even if recall
+        # missed them -- but only for Acts the query actually names. A query
+        # like "IRPA s. 40 misrepresentation defence" uses the section number
+        # topically; pulling every Act's s. 40 into the pool would drown out
+        # the case law that interprets the section the user meant. Substring
+        # check rather than token-overlap because act_codes split into trivial
+        # tokens ("A-8.8" -> {a, 8}) that spuriously match common query words.
         refs = _section_refs(query)
+        q_lc = query.lower()
+        def _act_in_query(c):
+            short = c["act_short"].lower()
+            code = c["act_code"].lower()
+            return ((short and short in q_lc)
+                    or (code and len(code) >= 3 and code in q_lc))
         if refs:
             for idx, c in enumerate(self.chunks):
-                if c["section"] in refs and idx not in fused:
+                if (c["section"] in refs and idx not in fused
+                        and _act_in_query(c)):
                     fused[idx] = 0.0
 
         # Title-match boost: the marginal note is a section's canonical subject.
@@ -448,9 +474,14 @@ class LegislationIndex:
                                     fusion_rank[i]))
         candidates = pool + candidates[RERANK_POOL:]
 
-        # Explicit section references are pinned to the very top.
+        # Explicit section references are pinned to the very top -- using the
+        # same Act-mentioned constraint as the recall step above, for the same
+        # reason: a bare "s. 40" without an Act name is usually topical
+        # (e.g. "the IRPA s. 40 misrepresentation defence"), not a lookup.
         if refs:
-            pinned = [i for i in candidates if self.chunks[i]["section"] in refs]
+            pinned = [i for i in candidates
+                      if self.chunks[i]["section"] in refs
+                      and _act_in_query(self.chunks[i])]
             if pinned:
                 pinned_set = set(pinned)
                 candidates = pinned + [i for i in candidates if i not in pinned_set]
data/eval/questions.json CHANGED
@@ -69,6 +69,8 @@
   {"query": "Does inadmissibility for membership in a terrorist organization require a complicity analysis?", "answers": [["Kanagendren", ""]]},
   {"query": "How broadly is a criminal organization interpreted for organized criminality inadmissibility?", "answers": [["Sittampalam", ""]]},
   {"query": "What principles govern a finding of inadmissibility for misrepresentation?", "answers": [["Goburdhun", ""]]},
+  {"query": "What is the innocent-misrepresentation defence to inadmissibility under IRPA s. 40?", "answers": [["Wang", ""]]},
+  {"query": "Must a misrepresentation under IRPA s. 40 be intentional, deliberate or negligent to be material?", "answers": [["Bellido", ""]]},
   {"query": "At an immigration detention review, who bears the onus and how are earlier detention rulings treated?", "answers": [["Thanabalasingham", ""]]},
   {"query": "What is the test for admitting new evidence in a pre-removal risk assessment?", "answers": [["Raza", ""]]},
   {"query": "Are gold coins currency or monetary instruments that must be reported when imported?", "answers": [["Hociung", ""]]},