Beemer, Claude Opus 4.7 committed 4 days ago

Commit 547ec21
Parent: d33c8fb

Post-stem normalization for verb/noun pairs Snowball leaves apart

Snowball stems some legal verb/noun pairs to different roots ("seize" ->
"seiz", "seizure" -> "seizur"; same for forfeit/forfeiture, detain/
detention, exclude/exclusion, admit/admission, apply/application,
comply/compliance, grieve/grievance, appeal/appellate). A query naming
the verb therefore missed the title-match boost on a provision titled
with the noun, and the term contributed less to BM25 than it should.
A small post-stem table merges each pair to the verb's stem. It is
applied at both index time and query time so the merge is consistent.

141-question eval: Hit@1 0.79 / Hit@3 0.93 / Hit@5 0.96 / Hit@10 0.98
/ MRR 0.87 (vs pre-stemmer 0.79 / 0.92 / 0.95 / 0.97 / 0.86 -- every
metric beyond Hit@1 up 0.01); 7 misses -> 6 (Khosa #6 -> #5).
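
For readers unfamiliar with the metrics quoted above, here is a minimal sketch of how Hit@k and MRR are conventionally computed from per-question ranks. It assumes the standard definitions; the eval harness is not shown in this commit, so the function names and the sample ranks are hypothetical and are not the CanLex eval data.

```python
def hit_at_k(ranks, k):
    """Fraction of questions whose expected provision ranks in the top k.

    `ranks` holds the 1-based rank of the expected provision for each
    question, or None when it was not retrieved at all.
    """
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Mean of 1/rank; a question with no hit (None) contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical ranks for five questions (illustration only).
ranks = [1, 1, 3, 6, None]

hit1 = hit_at_k(ranks, 1)
hit5 = hit_at_k(ranks, 5)
mrr = mean_reciprocal_rank(ranks)
```

Under these definitions, a question whose expected provision moves from rank 6 to rank 5 raises Hit@5 and MRR but leaves Hit@1 and Hit@3 unchanged.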

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Files changed (1)

canlex/index.py +21 -1

@@ -69,11 +69,31 @@ _STEMMER = snowballstemmer.stemmer("english")
 _STEM_CACHE = {}
+# Stem pairs Snowball does not merge but that share a legal meaning, so a
+# query naming the verb still matches a provision titled with the noun (or
+# vice versa). Mapped to the verb form on both index and query sides, which
+# is consistent and arbitrary -- the merge is what matters.
+_STEM_NORMALIZE = {
+    "seizur": "seiz",            # seizure -> seize
+    "forfeitur": "forfeit",      # forfeiture -> forfeit
+    "appel": "appeal",           # appellate/appellant -> appeal
+    "detent": "detain",          # detention -> detain
+    "exclus": "exclud",          # exclusion -> exclude
+    "admiss": "admit",           # admission/admissibility -> admit
+    "applic": "appli",           # application -> apply
+    "complianc": "compli",       # compliance -> comply
+    "grievanc": "griev",         # grievance -> grieve
+}
 def _stem(word):
-    """Snowball-stem a word, memoised -- legal text repeats terms heavily."""
+    """Snowball-stem a word, memoised -- legal text repeats terms heavily.
+    A small post-stem normalization merges a few verb/noun pairs Snowball
+    leaves apart ('seize'/'seizure', 'forfeit'/'forfeiture')."""
     stemmed = _STEM_CACHE.get(word)
     if stemmed is None:
         stemmed = _STEMMER.stemWord(word)
+        stemmed = _STEM_NORMALIZE.get(stemmed, stemmed)
         _STEM_CACHE[word] = stemmed
     return stemmed
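
To illustrate why the merge has to run on both the index and the query side, here is a small self-contained sketch. It is not the CanLex code: `_RAW_STEMS` is a hypothetical stand-in for the Snowball stemmer (its entries are the stems quoted in the commit message and table comments, so the example runs without `snowballstemmer`), and `stem` and `title_matches` are illustrative helpers, not functions from `canlex/index.py`.

```python
# Stand-in for Snowball: raw stems as quoted in the commit, hard-coded so
# the sketch has no third-party dependency. Unknown words pass through.
_RAW_STEMS = {
    "seize": "seiz",
    "seizure": "seizur",
    "forfeit": "forfeit",
    "forfeiture": "forfeitur",
}

# Subset of the post-stem table from the diff.
_STEM_NORMALIZE = {
    "seizur": "seiz",
    "forfeitur": "forfeit",
}


def stem(word, normalize=True):
    """Stem a word, optionally applying the post-stem merge."""
    stemmed = _RAW_STEMS.get(word, word)
    if normalize:
        stemmed = _STEM_NORMALIZE.get(stemmed, stemmed)
    return stemmed


def title_matches(query, title, normalize=True):
    """True if any query stem also appears among the title stems.

    The same `stem` call is used for both sides, mirroring the commit's
    point that indexing and querying must normalize identically.
    """
    query_stems = {stem(w, normalize) for w in query.lower().split()}
    title_stems = {stem(w, normalize) for w in title.lower().split()}
    return bool(query_stems & title_stems)


without_merge = title_matches("seize goods", "Seizure of property", normalize=False)
with_merge = title_matches("seize goods", "Seizure of property", normalize=True)
```

With the merge disabled, the query stem "seiz" and the title stem "seizur" never meet, so the title match fails. With it enabled on both sides, both collapse to "seiz" and the match succeeds.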