Post-stem normalization for verb/noun pairs Snowball leaves apart
Snowball stems some legal verb/noun pairs to different roots ("seize" ->
"seiz", "seizure" -> "seizur"; same for forfeit/forfeiture, detain/
detention, exclude/exclusion, admit/admission, apply/application,
comply/compliance, grieve/grievance, appeal/appellate). A query naming
the verb therefore got no title-match boost from a provision titled with
the noun, and the term contributed less to BM25 than it should. A small
post-stem table merges each pair to the verb's stem; it is applied at
both indexing and query time so the two sides stay consistent.
141-question eval: Hit@1 0.79 / Hit@3 0.93 / Hit@5 0.96 / Hit@10 0.98
/ MRR 0.87 (vs pre-stemmer 0.79 / 0.92 / 0.95 / 0.97 / 0.86 -- every
metric beyond Hit@1 up 0.01); 7 misses -> 6 (Khosa #6 -> #5).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
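
For readers unfamiliar with the figures quoted in the message: a minimal sketch of how Hit@k and MRR are conventionally computed. The commit does not show the eval harness, so the function names and the sample ranks below are hypothetical; only the standard definitions are assumed.

```python
def hit_at_k(ranks, k):
    """Fraction of questions whose first correct result is ranked <= k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks):
    """Mean reciprocal rank; a question with no correct result contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical ranks for five questions (None = correct provision not retrieved).
ranks = [1, 1, 3, None, 2]
print(hit_at_k(ranks, 1))  # 0.4
print(hit_at_k(ranks, 3))  # 0.8
print(mrr(ranks))
```

Under these definitions, a miss that moves up within the result list raises MRR and possibly Hit@k, which would be consistent with the small gains reported above.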
- canlex/index.py +21 -1
@@ -69,11 +69,31 @@ _STEMMER = snowballstemmer.stemmer("english")
 _STEM_CACHE = {}
 
 
+# Stem pairs Snowball does not merge but that share a legal meaning, so a
+# query naming the verb still matches a provision titled with the noun (or
+# vice versa). Mapped to the verb form on both index and query sides, which
+# is consistent and arbitrary -- the merge is what matters.
+_STEM_NORMALIZE = {
+    "seizur": "seiz",  # seizure -> seize
+    "forfeitur": "forfeit",  # forfeiture -> forfeit
+    "appel": "appeal",  # appellate/appellant -> appeal
+    "detent": "detain",  # detention -> detain
+    "exclus": "exclud",  # exclusion -> exclude
+    "admiss": "admit",  # admission/admissibility -> admit
+    "applic": "appli",  # application -> apply
+    "complianc": "compli",  # compliance -> comply
+    "grievanc": "griev",  # grievance -> grieve
+}
+
+
 def _stem(word):
-    """Snowball-stem a word, memoised -- legal text repeats terms heavily."""
+    """Snowball-stem a word, memoised -- legal text repeats terms heavily.
+    A small post-stem normalization merges a few verb/noun pairs Snowball
+    leaves apart ('seize'/'seizure', 'forfeit'/'forfeiture')."""
     stemmed = _STEM_CACHE.get(word)
     if stemmed is None:
         stemmed = _STEMMER.stemWord(word)
+        stemmed = _STEM_NORMALIZE.get(stemmed, stemmed)
         _STEM_CACHE[word] = stemmed
     return stemmed
 
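
To illustrate the effect of the change on title matching, here is a small self-contained sketch. It does not call snowballstemmer; TOY_STEMS hard-codes only the stems quoted in the commit message and table, and title_overlap is a hypothetical helper, not a function from canlex/index.py.

```python
# Toy stand-in for the Snowball stemmer (assumption for illustration):
# only stems quoted in the commit message and table are listed.
TOY_STEMS = {
    "seize": "seiz",
    "seizure": "seizur",
    "forfeit": "forfeit",
    "forfeiture": "forfeitur",
}

# Subset of the post-stem table from the diff.
STEM_NORMALIZE = {"seizur": "seiz", "forfeitur": "forfeit"}


def stem(word, normalize=True):
    """Stem a word, optionally applying the post-stem merge."""
    stemmed = TOY_STEMS.get(word, word)
    if normalize:
        stemmed = STEM_NORMALIZE.get(stemmed, stemmed)
    return stemmed


def title_overlap(query, title, normalize=True):
    """Sorted stemmed terms shared by a query and a provision title."""
    q = {stem(w, normalize) for w in query.lower().split()}
    t = {stem(w, normalize) for w in title.lower().split()}
    return sorted(q & t)


query = "can police seize goods"
title = "seizure of goods"
print(title_overlap(query, title, normalize=False))  # ['goods']
print(title_overlap(query, title, normalize=True))   # ['goods', 'seiz']
```

Because the same stem function is used for the query and the title, the merged token appears on both sides; applying the table on only one side would leave "seiz" and "seizur" unmatched just as before.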