Use case-law topic, not paragraph range, as the retrieval title

A case-law chunk's marginal_note is the paragraph range ('paras 11-13'),
which is useless for topic matching; the proposition the case stands for
lives in the 'heading' field, constant across every chunk of a given case.
The BM25 title-position emphasis, the title-match boost, and the semantic
embed_text all read marginal_note for case-law and so got no useful topical
signal from it; leading cases stayed buried behind statutes and memos
that name the doctrine by literal wording.

Add a doc_type-aware topical_title helper used in all three places. For
case-law it returns 'heading' (the case topic); for memoranda it keeps the
existing 'part'-based selection; for everything else it returns marginal_note.
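
As a rough illustration of the selection rule described above, the sketch below applies the helper to three sample chunks. The function body follows the diff in this commit; the chunk dictionaries and their field values are hypothetical and exist only to show which field wins for each doc_type.

```python
def topical_title(chunk):
    """Pick the field that carries a chunk's topic, depending on doc_type."""
    doc_type = chunk.get("doc_type")
    if doc_type == "memorandum":
        return chunk.get("part") or chunk["marginal_note"]
    if doc_type == "caselaw":
        return chunk.get("heading") or chunk["marginal_note"]
    return chunk["marginal_note"]


# Hypothetical chunks, for illustration only (not taken from the corpus).
case = {"doc_type": "caselaw", "marginal_note": "paras 11-13",
        "heading": "Innocent misrepresentation exception"}
memo = {"doc_type": "memorandum", "marginal_note": "Memorandum",
        "part": "Importation of vehicles"}
statute = {"doc_type": "legislation", "marginal_note": "Misrepresentation"}
# A case-law chunk with no heading falls back to its paragraph range.
bare_case = {"doc_type": "caselaw", "marginal_note": "paras 1-4"}

titles = [topical_title(c) for c in (case, memo, statute, bare_case)]
print(titles)
```

The fallback to marginal_note means a chunk with a missing or empty 'heading' or 'part' still yields some title rather than raising.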

Also fix a related issue: a query like "IRPA s. 40 misrepresentation defence"
uses the section number topically, but the section-ref recall pulled every
Act's s. 40 into the pool and the section-ref pin pushed them above the
case law that interprets the section the user meant. Both paths now require
the candidate's act_short or act_code to appear as a substring in the query
-- a lookup-style "section 32 of the Customs Act" still pins cleanly, but
a topical "IRPA s. 40" no longer drags in AAAMPA, Cannabis Act, etc.
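
The Act-mention gate can be sketched as a standalone function. In the commit it is a closure inside search(); the two-argument form here, the name act_in_query, and the sample chunks (including their act_short and act_code values) are assumptions made for illustration.

```python
def act_in_query(chunk, query):
    """True if the chunk's Act is named in the query, by short name or code.

    Substring match on the lower-cased query; codes shorter than three
    characters are ignored so they cannot match by accident.
    """
    q_lc = query.lower()
    short = chunk["act_short"].lower()
    code = chunk["act_code"].lower()
    return bool((short and short in q_lc)
                or (code and len(code) >= 3 and code in q_lc))


# Hypothetical chunks, for illustration only.
irpa_40 = {"act_short": "IRPA", "act_code": "I-2.5", "section": "40"}
cannabis_40 = {"act_short": "Cannabis Act", "act_code": "C-24.5", "section": "40"}
customs_32 = {"act_short": "Customs Act", "act_code": "C-52.6", "section": "32"}

topical = "IRPA s. 40 misrepresentation defence"
lookup = "section 32 of the Customs Act"

print(act_in_query(irpa_40, topical))      # the Act the query names
print(act_in_query(cannabis_40, topical))  # same section number, unnamed Act
print(act_in_query(customs_32, lookup))    # lookup-style query
```

Under this gate only the Act the query names has its s. 40 recalled and pinned; another Act's s. 40 is neither force-recalled nor pinned unless that Act is named too.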

141-question eval: Hit@1 0.79 / Hit@3 0.92 / Hit@5 0.95 / Hit@10 0.97
/ MRR 0.86 -- vs pre-fix 139-Q baseline 0.75 / 0.88 / 0.94 / 0.97 / 0.83,
Hit@1 +0.04, Hit@3 +0.04, Hit@5 +0.01, MRR +0.03; 10 misses -> 7. Two new
Wang/Bellido gold questions on the IRPA s. 40 misrepresentation doctrine
both surface in top 5 (previously absent from top 20). One small collateral
re-rank: Customs Tariff Sch-98.03 (non-resident conveyance) slipped from
#2 to #6, beaten by Sch-98.02 on opposite-paraphrase ("non-resident" vs
"resident"); the embedding space shifts under any re-embed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- canlex/embed.py +6 -7
- canlex/index.py +52 -21
- data/eval/questions.json +2 -0

--- a/canlex/embed.py
+++ b/canlex/embed.py
@@ -33,13 +33,12 @@ def load_chunks():
 def embed_text(chunk):
     """Compact, retrieval-focused representation of one section."""
     # The section title is the strongest topical signal, so it is repeated to
-    # emphasise it.
-    #
-    #
-
-
-
-    note = chunk["marginal_note"]
+    # emphasise it. Title selection is doc_type-aware (see index.topical_title):
+    # a D-memo's marginal_note is a generic banner so its actual subject in
+    # 'part' is used; a case-law chunk's marginal_note is just the paragraph
+    # range so the case proposition in 'heading' is used.
+    from .index import topical_title
+    note = topical_title(chunk)
     body = chunk["text"][:_MAX_BODY]
     parts = [chunk["act_short"], note, note, chunk["heading"], body]
     return " . ".join(p for p in parts if p)

--- a/canlex/index.py
+++ b/canlex/index.py
@@ -85,6 +85,23 @@ def _section_refs(query):
     return set(_SECTION_REF.findall(query.lower()))
 
 
+def topical_title(chunk):
+    """Return the chunk's topic-bearing string, used wherever a section's
+    'title' is weighted for retrieval -- BM25 indexing, the title-match boost,
+    and the semantic embedding. Differs by doc_type because the field that
+    carries the topic differs: legislation/agreement/directive/delegation use
+    the marginal_note (section heading); D-memoranda use 'part' because their
+    marginal_note is a generic banner; case-law uses 'heading' because its
+    marginal_note is just the paragraph range ('paras 11-13') and the case
+    proposition lives in heading."""
+    doc_type = chunk.get("doc_type")
+    if doc_type == "memorandum":
+        return chunk.get("part") or chunk["marginal_note"]
+    if doc_type == "caselaw":
+        return chunk.get("heading") or chunk["marginal_note"]
+    return chunk["marginal_note"]
+
+
 def _provision_units(text):
     """Citable parts of a provision, for pinpoint scoring -- a list of
     (citation_suffix, scoring_text, snippet). One entry per paragraph, with its
@@ -136,11 +153,15 @@ class LegislationIndex:
         self.postings = defaultdict(list) # term -> [(doc_idx, term_frequency), ...]
         df = defaultdict(int)
         for idx, c in enumerate(self.chunks):
-            # The
+            # The topical title is repeated to weight it above body text;
             # the Act name, code and section are indexed too, so an Act's own
             # terminology (e.g. "controlled substance") and its codes/numbers
-            # are searchable even when a section's text omits them.
-
+            # are searchable even when a section's text omits them. The title
+            # is doc_type-aware via topical_title -- for case-law it picks
+            # the case proposition (heading), not the paragraph range
+            # (marginal_note), so a leading case surfaces on a topical query.
+            title = topical_title(c)
+            blob = " ".join((title, title, c["heading"],
                              c["part"], c["division"], c["act_name"], c["act_code"],
                              c["section"], c["text"]))
             counts = Counter(tokenize(blob))
@@ -153,24 +174,16 @@ class LegislationIndex:
         self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}
 
     def _build_note_tokens(self):
-        """Pre-tokenise each chunk's topical title
-        search().
-
-
-
-        used instead. Each chunk is also flagged as a regulation (act codes
-        beginning SOR/C.R.C.) for the Act-over-regulation preference, and as
-        collective-agreement back-matter (memoranda and letters with no article
-        number) for the back-matter penalty."""
+        """Pre-tokenise each chunk's topical title (see topical_title) for the
+        title-match boost in search(). Each chunk is also flagged as a
+        regulation (act codes beginning SOR/C.R.C.) for the Act-over-regulation
+        preference, and as collective-agreement back-matter (memoranda and
+        letters with no article number) for the back-matter penalty."""
         self._note_tokens = []
         self._is_regulation = []
         self._is_backmatter = []
         for c in self.chunks:
-
-                title = c.get("part") or c["marginal_note"]
-            else:
-                title = c["marginal_note"]
-            self._note_tokens.append(set(tokenize(title)))
+            self._note_tokens.append(set(tokenize(topical_title(c))))
             self._is_regulation.append(
                 c.get("doc_type", "legislation") == "legislation"
                 and c["act_code"].startswith(("SOR", "C.R.C")))
@@ -381,11 +394,24 @@ class LegislationIndex:
         for rank, idx in enumerate(sem_order):
             fused[idx] += W_SEM / (RRF_K + rank)
 
-        # Ensure explicitly-referenced sections are retrieved even if recall
+        # Ensure explicitly-referenced sections are retrieved even if recall
+        # missed them -- but only for Acts the query actually names. A query
+        # like "IRPA s. 40 misrepresentation defence" uses the section number
+        # topically; pulling every Act's s. 40 into the pool would drown out
+        # the case law that interprets the section the user meant. Substring
+        # check rather than token-overlap because act_codes split into trivial
+        # tokens ("A-8.8" -> {a, 8}) that spuriously match common query words.
         refs = _section_refs(query)
+        q_lc = query.lower()
+        def _act_in_query(c):
+            short = c["act_short"].lower()
+            code = c["act_code"].lower()
+            return ((short and short in q_lc)
+                    or (code and len(code) >= 3 and code in q_lc))
         if refs:
             for idx, c in enumerate(self.chunks):
-                if c["section"] in refs and idx not in fused
+                if (c["section"] in refs and idx not in fused
+                        and _act_in_query(c)):
                     fused[idx] = 0.0
 
         # Title-match boost: the marginal note is a section's canonical subject.
@@ -448,9 +474,14 @@ class LegislationIndex:
                     fusion_rank[i]))
         candidates = pool + candidates[RERANK_POOL:]
 
-        # Explicit section references are pinned to the very top
+        # Explicit section references are pinned to the very top -- using the
+        # same Act-mentioned constraint as the recall step above, for the same
+        # reason: a bare "s. 40" without an Act name is usually topical
+        # (e.g. "the IRPA s. 40 misrepresentation defence"), not a lookup.
         if refs:
-            pinned = [i for i in candidates
+            pinned = [i for i in candidates
+                      if self.chunks[i]["section"] in refs
+                      and _act_in_query(self.chunks[i])]
             if pinned:
                 pinned_set = set(pinned)
                 candidates = pinned + [i for i in candidates if i not in pinned_set]

--- a/data/eval/questions.json
+++ b/data/eval/questions.json
@@ -69,6 +69,8 @@
   {"query": "Does inadmissibility for membership in a terrorist organization require a complicity analysis?", "answers": [["Kanagendren", ""]]},
   {"query": "How broadly is a criminal organization interpreted for organized criminality inadmissibility?", "answers": [["Sittampalam", ""]]},
   {"query": "What principles govern a finding of inadmissibility for misrepresentation?", "answers": [["Goburdhun", ""]]},
+  {"query": "What is the innocent-misrepresentation defence to inadmissibility under IRPA s. 40?", "answers": [["Wang", ""]]},
+  {"query": "Must a misrepresentation under IRPA s. 40 be intentional, deliberate or negligent to be material?", "answers": [["Bellido", ""]]},
   {"query": "At an immigration detention review, who bears the onus and how are earlier detention rulings treated?", "answers": [["Thanabalasingham", ""]]},
   {"query": "What is the test for admitting new evidence in a pre-removal risk assessment?", "answers": [["Raza", ""]]},
   {"query": "Are gold coins currency or monetary instruments that must be reported when imported?", "answers": [["Hociung", ""]]},