Spaces: Beemer0/CanLex

Beemer, Claude Opus 4.7 committed 4 days ago

Commit 666cd44 (1 parent: 4b46b2f)

Use case-law topic, not paragraph range, as the retrieval title

A case-law chunk's marginal_note is the paragraph range ('paras 11-13'),
which is useless for topic matching; the proposition the case stands for
lives in the 'heading' field, constant across every chunk of a given case.
The BM25 title-position emphasis, the title-match boost, and the semantic
embed_text all read marginal_note for case-law and so got no useful
signal from it, leaving leading cases buried behind statutes and memos
that name the doctrine by literal wording.

Add a doc_type-aware topical_title helper used in all three places. For
case-law it returns 'heading' (the case topic); for memoranda it keeps the
existing 'part'-based selection; for everything else it returns marginal_note.
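
A minimal sketch of that selection, restating the helper from the
canlex/index.py diff so it runs on its own. The three sample chunks and
their field values are invented for illustration and are not taken from
the corpus.

```python
def topical_title(chunk):
    """Pick the field that carries the chunk's topic, by doc_type."""
    doc_type = chunk.get("doc_type")
    if doc_type == "memorandum":
        # D-memo: marginal_note is a generic banner, the subject is in 'part'.
        return chunk.get("part") or chunk["marginal_note"]
    if doc_type == "caselaw":
        # Case law: marginal_note is only the paragraph range.
        return chunk.get("heading") or chunk["marginal_note"]
    return chunk["marginal_note"]


# Hypothetical chunks (illustrative values only).
samples = [
    {"doc_type": "caselaw", "marginal_note": "paras 11-13",
     "heading": "Innocent misrepresentation exception"},
    {"doc_type": "memorandum", "marginal_note": "Guidelines",
     "part": "Importation of vehicles"},
    {"doc_type": "legislation", "marginal_note": "Misrepresentation",
     "heading": "Inadmissibility"},
]

titles = [topical_title(c) for c in samples]
for chunk, title in zip(samples, titles):
    print(f"{chunk['doc_type']:<12} -> {title}")
```

The `or chunk["marginal_note"]` fallback means a case-law or memo chunk
with an empty topic field still gets some title rather than an empty one.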

Also fix a related issue: a query like "IRPA s. 40 misrepresentation defence"
uses the section number topically, but the section-ref recall pulled every
Act's s. 40 into the pool and the section-ref pin pushed them above the
case law that interprets the section the user meant. Both paths now require
the candidate's act_short or act_code to appear as a substring in the query
-- a lookup-style "section 32 of the Customs Act" still pins cleanly, but
a topical "IRPA s. 40" no longer drags in AAAMPA, Cannabis Act, etc.
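
A minimal sketch of the Act-mentioned gate, written as a standalone
function rather than the closure used in search(). The act_short and
act_code values below are assumptions about how the corpus labels these
Acts, and tokens() is a hypothetical stand-in for the project's tokenizer,
included only to show why token overlap would be too loose.

```python
import re


def act_in_query(query, act_short, act_code):
    """True if the query names the Act by short title or by code (substring)."""
    q_lc = query.lower()
    short = act_short.lower()
    code = act_code.lower()
    return bool((short and short in q_lc)
                or (code and len(code) >= 3 and code in q_lc))


def tokens(s):
    """Hypothetical tokenizer: lowercase alphanumeric runs."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


topical = "IRPA s. 40 misrepresentation defence"
lookup = "section 32 of the Customs Act"

# Assumed labels: ("IRPA", "I-2.5"), ("Cannabis Act", "C-24.5"),
# ("AAAMPA", "A-8.8"), ("Customs Act", "C-52.6").
print(act_in_query(topical, "IRPA", "I-2.5"))          # True
print(act_in_query(topical, "Cannabis Act", "C-24.5"))  # False
print(act_in_query(topical, "AAAMPA", "A-8.8"))         # False
print(act_in_query(lookup, "Customs Act", "C-52.6"))    # True

# Token overlap would match "A-8.8" against any query containing "a".
loose_query = "Is a s. 40 finding reviewable?"
print(bool(tokens("A-8.8") & tokens(loose_query)))      # True (spurious)
print(act_in_query(loose_query, "AAAMPA", "A-8.8"))     # False
```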

141-question eval: Hit@1 0.79 / Hit@3 0.92 / Hit@5 0.95 / Hit@10 0.97
/ MRR 0.86, vs the pre-fix 139-question baseline of 0.75 / 0.88 / 0.94 /
0.97 / 0.83 (Hit@1 +0.04, Hit@3 +0.04, Hit@5 +0.01, Hit@10 unchanged,
MRR +0.03); misses dropped from 10 to 7. The two new Wang/Bellido gold
questions on the IRPA s. 40 misrepresentation doctrine both surface in
the top 5 (previously absent from the top 20). One small collateral
re-rank: Customs Tariff Sch-98.03 (non-resident conveyance) slipped from
#2 to #6, beaten by Sch-98.02 on an opposite paraphrase ("non-resident"
vs "resident"); the embedding space shifts under any re-embed.

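For readers unfamiliar with the metrics, a minimal sketch of how Hit@k
and MRR are conventionally computed from the rank of the first correct
result per question. Whether the project's eval harness handles misses
and cutoffs exactly this way is an assumption, and the ranks below are
made up.

```python
def hit_at_k(ranks, k):
    """Fraction of questions whose first correct result is within the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks):
    """Mean reciprocal rank; a miss (None) contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Made-up 1-based ranks of the gold answer for five questions; None = miss.
ranks = [1, 1, 3, None, 2]

print(f"Hit@1 {hit_at_k(ranks, 1):.2f}")
print(f"Hit@3 {hit_at_k(ranks, 3):.2f}")
print(f"MRR   {mrr(ranks):.2f}")
```
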
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Files changed (3)

canlex/embed.py +6 -7
canlex/index.py +52 -21
data/eval/questions.json +2 -0

canlex/embed.py CHANGED

@@ -33,13 +33,12 @@ def load_chunks():
 def embed_text(chunk):
     """Compact, retrieval-focused representation of one section."""
     # The section title is the strongest topical signal, so it is repeated to
-    # emphasise it. For D-memoranda the marginal note is only a generic section
-    # label ('Guidelines', 'Legislation'); the memo's actual subject lives in
-    # the 'part' field, so that is used as the title instead.
-    if chunk.get("doc_type") == "memorandum":
-        note = chunk.get("part") or chunk["marginal_note"]
-    else:
-        note = chunk["marginal_note"]
     body = chunk["text"][:_MAX_BODY]
     parts = [chunk["act_short"], note, note, chunk["heading"], body]
     return " . ".join(p for p in parts if p)

 def embed_text(chunk):
     """Compact, retrieval-focused representation of one section."""
     # The section title is the strongest topical signal, so it is repeated to
+    # emphasise it. Title selection is doc_type-aware (see index.topical_title):
+    # a D-memo's marginal_note is a generic banner so its actual subject in
+    # 'part' is used; a case-law chunk's marginal_note is just the paragraph
+    # range so the case proposition in 'heading' is used.
+    from .index import topical_title
+    note = topical_title(chunk)
     body = chunk["text"][:_MAX_BODY]
     parts = [chunk["act_short"], note, note, chunk["heading"], body]
     return " . ".join(p for p in parts if p)

canlex/index.py CHANGED

@@ -85,6 +85,23 @@ def _section_refs(query):
     return set(_SECTION_REF.findall(query.lower()))
 def _provision_units(text):
     """Citable parts of a provision, for pinpoint scoring -- a list of
     (citation_suffix, scoring_text, snippet). One entry per paragraph, with its
@@ -136,11 +153,15 @@ class LegislationIndex:
         self.postings = defaultdict(list)  # term -> [(doc_idx, term_frequency), ...]
         df = defaultdict(int)
         for idx, c in enumerate(self.chunks):
-            # The marginal note (title) is repeated to weight it above body text;
             # the Act name, code and section are indexed too, so an Act's own
             # terminology (e.g. "controlled substance") and its codes/numbers
-            # are searchable even when a section's text omits them.
-            blob = " ".join((c["marginal_note"], c["marginal_note"], c["heading"],
                              c["part"], c["division"], c["act_name"], c["act_code"],
                              c["section"], c["text"]))
             counts = Counter(tokenize(blob))
@@ -153,24 +174,16 @@ class LegislationIndex:
         self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}
     def _build_note_tokens(self):
-        """Pre-tokenise each chunk's topical title, for the title-match boost in
-        search(). For legislation, agreements and directives the title is the
-        marginal note (the section heading). A D-memorandum's marginal note is
-        generic ('Legislation', 'Guidelines and General Information', or a stray
-        page banner), so the memo's subject -- carried in its 'part' field -- is
-        used instead. Each chunk is also flagged as a regulation (act codes
-        beginning SOR/C.R.C.) for the Act-over-regulation preference, and as
-        collective-agreement back-matter (memoranda and letters with no article
-        number) for the back-matter penalty."""
         self._note_tokens = []
         self._is_regulation = []
         self._is_backmatter = []
         for c in self.chunks:
-            if c.get("doc_type") == "memorandum":
-                title = c.get("part") or c["marginal_note"]
-            else:
-                title = c["marginal_note"]
-            self._note_tokens.append(set(tokenize(title)))
             self._is_regulation.append(
                 c.get("doc_type", "legislation") == "legislation"
                 and c["act_code"].startswith(("SOR", "C.R.C")))
@@ -381,11 +394,24 @@ class LegislationIndex:
             for rank, idx in enumerate(sem_order):
                 fused[idx] += W_SEM / (RRF_K + rank)
-        # Ensure explicitly-referenced sections are retrieved even if recall missed them.
         refs = _section_refs(query)
         if refs:
             for idx, c in enumerate(self.chunks):
-                if c["section"] in refs and idx not in fused:
                     fused[idx] = 0.0
         # Title-match boost: the marginal note is a section's canonical subject.
@@ -448,9 +474,14 @@ class LegislationIndex:
                                      fusion_rank[i]))
             candidates = pool + candidates[RERANK_POOL:]
-        # Explicit section references are pinned to the very top.
         if refs:
-            pinned = [i for i in candidates if self.chunks[i]["section"] in refs]
             if pinned:
                 pinned_set = set(pinned)
                 candidates = pinned + [i for i in candidates if i not in pinned_set]

     return set(_SECTION_REF.findall(query.lower()))
+def topical_title(chunk):
+    """Return the chunk's topic-bearing string, used wherever a section's
+    'title' is weighted for retrieval -- BM25 indexing, the title-match boost,
+    and the semantic embedding. Differs by doc_type because the field that
+    carries the topic differs: legislation/agreement/directive/delegation use
+    the marginal_note (section heading); D-memoranda use 'part' because their
+    marginal_note is a generic banner; case-law uses 'heading' because its
+    marginal_note is just the paragraph range ('paras 11-13') and the case
+    proposition lives in heading."""
+    doc_type = chunk.get("doc_type")
+    if doc_type == "memorandum":
+        return chunk.get("part") or chunk["marginal_note"]
+    if doc_type == "caselaw":
+        return chunk.get("heading") or chunk["marginal_note"]
+    return chunk["marginal_note"]
 def _provision_units(text):
     """Citable parts of a provision, for pinpoint scoring -- a list of
     (citation_suffix, scoring_text, snippet). One entry per paragraph, with its
         self.postings = defaultdict(list)  # term -> [(doc_idx, term_frequency), ...]
         df = defaultdict(int)
         for idx, c in enumerate(self.chunks):
+            # The topical title is repeated to weight it above body text;
             # the Act name, code and section are indexed too, so an Act's own
             # terminology (e.g. "controlled substance") and its codes/numbers
+            # are searchable even when a section's text omits them. The title
+            # is doc_type-aware via topical_title -- for case-law it picks
+            # the case proposition (heading), not the paragraph range
+            # (marginal_note), so a leading case surfaces on a topical query.
+            title = topical_title(c)
+            blob = " ".join((title, title, c["heading"],
                              c["part"], c["division"], c["act_name"], c["act_code"],
                              c["section"], c["text"]))
             counts = Counter(tokenize(blob))
         self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}
     def _build_note_tokens(self):
+        """Pre-tokenise each chunk's topical title (see topical_title) for the
+        title-match boost in search(). Each chunk is also flagged as a
+        regulation (act codes beginning SOR/C.R.C.) for the Act-over-regulation
+        preference, and as collective-agreement back-matter (memoranda and
+        letters with no article number) for the back-matter penalty."""
         self._note_tokens = []
         self._is_regulation = []
         self._is_backmatter = []
         for c in self.chunks:
+            self._note_tokens.append(set(tokenize(topical_title(c))))
             self._is_regulation.append(
                 c.get("doc_type", "legislation") == "legislation"
                 and c["act_code"].startswith(("SOR", "C.R.C")))
             for rank, idx in enumerate(sem_order):
                 fused[idx] += W_SEM / (RRF_K + rank)
+        # Ensure explicitly-referenced sections are retrieved even if recall
+        # missed them -- but only for Acts the query actually names. A query
+        # like "IRPA s. 40 misrepresentation defence" uses the section number
+        # topically; pulling every Act's s. 40 into the pool would drown out
+        # the case law that interprets the section the user meant. Substring
+        # check rather than token-overlap because act_codes split into trivial
+        # tokens ("A-8.8" -> {a, 8}) that spuriously match common query words.
         refs = _section_refs(query)
+        q_lc = query.lower()
+        def _act_in_query(c):
+            short = c["act_short"].lower()
+            code = c["act_code"].lower()
+            return ((short and short in q_lc)
+                    or (code and len(code) >= 3 and code in q_lc))
         if refs:
             for idx, c in enumerate(self.chunks):
+                if (c["section"] in refs and idx not in fused
+                        and _act_in_query(c)):
                     fused[idx] = 0.0
         # Title-match boost: the marginal note is a section's canonical subject.
                                      fusion_rank[i]))
             candidates = pool + candidates[RERANK_POOL:]
+        # Explicit section references are pinned to the very top -- using the
+        # same Act-mentioned constraint as the recall step above, for the same
+        # reason: a bare "s. 40" without an Act name is usually topical
+        # (e.g. "the IRPA s. 40 misrepresentation defence"), not a lookup.
         if refs:
+            pinned = [i for i in candidates
+                      if self.chunks[i]["section"] in refs
+                      and _act_in_query(self.chunks[i])]
             if pinned:
                 pinned_set = set(pinned)
                 candidates = pinned + [i for i in candidates if i not in pinned_set]

data/eval/questions.json CHANGED

@@ -69,6 +69,8 @@
   {"query": "Does inadmissibility for membership in a terrorist organization require a complicity analysis?", "answers": [["Kanagendren", ""]]},
   {"query": "How broadly is a criminal organization interpreted for organized criminality inadmissibility?", "answers": [["Sittampalam", ""]]},
   {"query": "What principles govern a finding of inadmissibility for misrepresentation?", "answers": [["Goburdhun", ""]]},
   {"query": "At an immigration detention review, who bears the onus and how are earlier detention rulings treated?", "answers": [["Thanabalasingham", ""]]},
   {"query": "What is the test for admitting new evidence in a pre-removal risk assessment?", "answers": [["Raza", ""]]},
   {"query": "Are gold coins currency or monetary instruments that must be reported when imported?", "answers": [["Hociung", ""]]},

   {"query": "Does inadmissibility for membership in a terrorist organization require a complicity analysis?", "answers": [["Kanagendren", ""]]},
   {"query": "How broadly is a criminal organization interpreted for organized criminality inadmissibility?", "answers": [["Sittampalam", ""]]},
   {"query": "What principles govern a finding of inadmissibility for misrepresentation?", "answers": [["Goburdhun", ""]]},
+  {"query": "What is the innocent-misrepresentation defence to inadmissibility under IRPA s. 40?", "answers": [["Wang", ""]]},
+  {"query": "Must a misrepresentation under IRPA s. 40 be intentional, deliberate or negligent to be material?", "answers": [["Bellido", ""]]},
   {"query": "At an immigration detention review, who bears the onus and how are earlier detention rulings treated?", "answers": [["Thanabalasingham", ""]]},
   {"query": "What is the test for admitting new evidence in a pre-removal risk assessment?", "answers": [["Raza", ""]]},
   {"query": "Are gold coins currency or monetary instruments that must be reported when imported?", "answers": [["Hociung", ""]]},