Beemer Claude Opus 4.7 committed on
Commit 547ec21 · 1 Parent(s): d33c8fb

Post-stem normalization for verb/noun pairs Snowball leaves apart


Snowball stems some legal verb/noun pairs to different roots ("seize" ->
"seiz", "seizure" -> "seizur"; same for forfeit/forfeiture, detain/
detention, exclude/exclusion, admit/admission, apply/application,
comply/compliance, grieve/grievance, appeal/appellate). A query naming
the verb therefore missed the title-match boost for a provision titled
with the noun, and the term contributed less to BM25 than it should. A
small post-stem table now maps each noun stem to the verb's stem; it is
applied at both indexing and query time so the two sides stay consistent.
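
The effect of applying the merge on both sides can be sketched as follows. This is a minimal illustration rather than the repository's code: `FAKE_SNOWBALL` is a hypothetical stand-in for the Snowball stemmer that contains only the stems quoted above, and `title_match` is an assumed, simplified form of a title-match check.

```python
# Hypothetical stand-in for Snowball: only the stems quoted in this message.
FAKE_SNOWBALL = {"seize": "seiz", "seizure": "seizur"}

# Post-stem table: noun stem -> verb stem (a single entry, for illustration).
STEM_NORMALIZE = {"seizur": "seiz"}


def stem(word, normalize=True):
    """Stem a word, optionally applying the post-stem merge."""
    stemmed = FAKE_SNOWBALL.get(word, word)
    if normalize:
        stemmed = STEM_NORMALIZE.get(stemmed, stemmed)
    return stemmed


def title_match(query, title, normalize=True):
    """True if any stemmed query term also appears in the stemmed title."""
    query_stems = {stem(w, normalize) for w in query.lower().split()}
    title_stems = {stem(w, normalize) for w in title.lower().split()}
    return bool(query_stems & title_stems)


query, title = "can police seize property", "Search and Seizure"
print(title_match(query, title, normalize=False))  # False: "seiz" vs "seizur"
print(title_match(query, title, normalize=True))   # True: both become "seiz"
```

Because the same `stem` function runs on the query and on the title, the choice of target stem does not matter as long as both sides agree on it.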

141-question eval: Hit@1 0.79 / Hit@3 0.93 / Hit@5 0.96 / Hit@10 0.98
/ MRR 0.87 (vs pre-stemmer 0.79 / 0.92 / 0.95 / 0.97 / 0.86 -- every
metric beyond Hit@1 up 0.01); 7 misses -> 6 (Khosa #6 -> #5).
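
For readers unfamiliar with the metrics, the sketch below shows how Hit@k and MRR are conventionally computed from per-question ranks. It assumes the standard definitions; the eval harness itself is not part of this commit, and the ranks used here are hypothetical, not taken from the 141-question set.

```python
def hit_at_k(ranks, k):
    """Fraction of questions whose expected provision ranks in the top k.

    `ranks` holds 1-based ranks; None means the provision was not retrieved.
    """
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all questions; unretrieved questions count as 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical ranks for four questions, before and after a change that
# moves one expected provision from rank 6 to rank 5.
before = [1, 3, 6, None]
after = [1, 3, 5, None]

for name, ranks in (("before", before), ("after", after)):
    print(name, hit_at_k(ranks, 5), round(mean_reciprocal_rank(ranks), 3))
```

A move from rank 6 to rank 5 leaves Hit@1, Hit@3, and Hit@10 unchanged for that question but turns it from a miss into a hit at k = 5.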

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Files changed (1)
  1. canlex/index.py +21 -1
canlex/index.py CHANGED
@@ -69,11 +69,31 @@ _STEMMER = snowballstemmer.stemmer("english")
 _STEM_CACHE = {}
 
 
+# Stem pairs Snowball does not merge but that share a legal meaning, so a
+# query naming the verb still matches a provision titled with the noun (or
+# vice versa). Mapped to the verb form on both index and query sides, which
+# is consistent and arbitrary -- the merge is what matters.
+_STEM_NORMALIZE = {
+    "seizur": "seiz",  # seizure -> seize
+    "forfeitur": "forfeit",  # forfeiture -> forfeit
+    "appel": "appeal",  # appellate/appellant -> appeal
+    "detent": "detain",  # detention -> detain
+    "exclus": "exclud",  # exclusion -> exclude
+    "admiss": "admit",  # admission/admissibility -> admit
+    "applic": "appli",  # application -> apply
+    "complianc": "compli",  # compliance -> comply
+    "grievanc": "griev",  # grievance -> grieve
+}
+
+
 def _stem(word):
-    """Snowball-stem a word, memoised -- legal text repeats terms heavily."""
+    """Snowball-stem a word, memoised -- legal text repeats terms heavily.
+    A small post-stem normalization merges a few verb/noun pairs Snowball
+    leaves apart ('seize'/'seizure', 'forfeit'/'forfeiture')."""
     stemmed = _STEM_CACHE.get(word)
     if stemmed is None:
         stemmed = _STEMMER.stemWord(word)
+        stemmed = _STEM_NORMALIZE.get(stemmed, stemmed)
         _STEM_CACHE[word] = stemmed
     return stemmed
 