"""
Normalization layer for LLM outputs before SymPy parsing.
This module provides a single, well-tested function to convert common LLM output
patterns (Unicode operators, currency symbols, implicit styles) into SymPy-friendly
ASCII Python-like expressions suitable for `sympy.parsing.sympy_parser.parse_expr`.
## Why normalize instead of controlling LLM output?
LLMs generate diverse textual math notation (^, ×, π, commas in numbers, etc.) that
cannot be reliably controlled at the token level. A deterministic preprocessing layer
is more robust than trying to force specific character-level outputs during training.
## SymPy parsing context
SymPy's `parse_expr` (docs: https://docs.sympy.org/latest/modules/parsing.html):
- Uses Python-like expression syntax as the base grammar.
- Applies **transformations** (token rewrites) before evaluation.
- Notable transformations:
  - `standard_transformations`: auto symbol/number conversion, factorial notation.
  - `convert_xor`: treats `^` as power (not bitwise XOR).
  - `implicit_multiplication_application`: relaxes syntax (implicit mult, split symbols).
- LaTeX is a **separate path** via `sympy.parsing.latex.parse_latex` (experimental).
**Security note:** `parse_expr` uses `eval` internally. Treat LLM outputs as untrusted;
this module helps but does not sandbox.
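
One way to act on the security note is to reject anything that is not plain
arithmetic before it ever reaches `parse_expr`. The following is a minimal sketch,
not part of this module: the helper name `looks_like_plain_arithmetic`, the
allowlist, and the length limit are all assumptions that a caller would tune.

```python
import re

# Hypothetical allowlist: digits, ASCII arithmetic operators, parentheses,
# decimal point, comparison characters, spaces, and the name "pi".
# Letters (other than "pi"), underscores, quotes, and commas are rejected.
_ALLOWED = re.compile(r"(?:[0-9+*/().<>=! -]|pi)+")


def looks_like_plain_arithmetic(expr: str, max_len: int = 200) -> bool:
    # Return True only for short, non-empty strings built from allowlisted tokens.
    if not expr or len(expr) > max_len:
        return False
    return _ALLOWED.fullmatch(expr) is not None


print(looks_like_plain_arithmetic("16 * 3 - 4"))
print(looks_like_plain_arithmetic("__import__('os')"))
```

A check like this narrows the input but is still not a sandbox. It would also
reject legitimate symbolic input such as `x + 1`, so it only fits purely numeric
expressions.
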
## Normalization mapping (categories)
| Category         | LLM output           | Normalized       | Notes                                  |
|------------------|----------------------|------------------|----------------------------------------|
| Power            | `^`                  | `**`             | Python power operator                  |
| Multiplication   | `×`, `·`, `•`        | `*`              | Unicode operators → ASCII              |
| Division         | `÷`                  | `/`              | Unicode division sign → ASCII          |
| Minus sign       | `−` (U+2212)         | `-`              | Typography minus → ASCII hyphen-minus  |
| Comparisons      | `≤`, `≥`, `≠`        | `<=`, `>=`, `!=` | Relational operators (if parsing them) |
| Constants        | `π`                  | `pi`             | Greek letter → SymPy symbol name       |
| Thousands sep    | `80,000`             | `80000`          | Remove commas in numeric literals      |
| Currency         | `$`, `€`, `£`        | (removed)        | Strip before parsing numeric tails     |
| Extra whitespace | multiple spaces/tabs | single space     | Collapse for cleaner parsing           |
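
The single-character rows of the table can also be read as a lookup table. The
sketch below is illustrative only: `_CHAR_MAP` and `map_characters` are
hypothetical names, and the module itself may apply the substitutions differently.
It covers the character-level rows; thousands separators and whitespace
collapsing need context and are left out.

```python
# Character-level rows of the mapping table, keyed by code point escapes.
_CHAR_MAP = {
    "^": "**",
    "\u00d7": "*",   # MULTIPLICATION SIGN
    "\u00b7": "*",   # MIDDLE DOT
    "\u22c5": "*",   # DOT OPERATOR
    "\u2022": "*",   # BULLET
    "\u00f7": "/",   # DIVISION SIGN
    "\u2212": "-",   # MINUS SIGN
    "\u2264": "<=",  # LESS-THAN OR EQUAL TO
    "\u2265": ">=",  # GREATER-THAN OR EQUAL TO
    "\u2260": "!=",  # NOT EQUAL TO
    "\u03c0": "pi",  # GREEK SMALL LETTER PI
    "$": "",         # currency symbols map to the empty string (removed)
    "\u20ac": "",    # EURO SIGN
    "\u00a3": "",    # POUND SIGN
}
_TABLE = str.maketrans(_CHAR_MAP)


def map_characters(text: str) -> str:
    # str.translate applies every single-character substitution in one pass.
    return text.translate(_TABLE)


print(map_characters("16 \u00d7 3 \u2212 4"))
```

Because `str.translate` works one character at a time, multi-character patterns
such as `80,000` still need a regular expression pass afterwards.
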
Not handled (by design):
- **LaTeX** (`\\frac`, `\\sqrt`, etc.): route to `parse_latex` separately if needed.
- **Natural language prefix** ("Janet sells 16-3-4=9 eggs"): caller extracts math tail first.
- **Grouping `[` `]`**: context-dependent; avoid substituting without semantic analysis.
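
A caller can decide up front which path an input should take. The sketch below
assumes a hypothetical dispatcher named `choose_parser` that only returns a label;
the labels and the rule "any backslash means LaTeX" are assumptions, and the
actual parsing is left to the caller.

```python
BACKSLASH = chr(92)  # built with chr() so the sketch contains no escape sequences


def choose_parser(text: str) -> str:
    # Hypothetical routing rule: return a label naming the path to take.
    if BACKSLASH in text:
        return "latex"         # candidate for sympy.parsing.latex.parse_latex
    if "[" in text or "]" in text:
        return "needs_review"  # brackets are context-dependent; do not rewrite
    return "parse_expr"        # normalize, then hand to parse_expr


print(choose_parser("16 - 3 - 4"))
```
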
Version lock: sympy==1.14.0 (line 84 in requirements.txt at time of writing).
"""
from __future__ import annotations
import re


def normalize_for_parse_expr(text: str) -> str:
    """
    Normalize LLM-generated math text for SymPy's `parse_expr`.

    Converts common Unicode operators, currency symbols, and formatting quirks
    into ASCII Python-like syntax. This is the single source of truth for
    string preprocessing before SymPy parsing in this project.

    Parameters
    ----------
    text : str
        Raw string (potentially mixed prose and math from LLM).

    Returns
    -------
    str
        Normalized ASCII expression.

    Examples
    --------
    >>> normalize_for_parse_expr("2^3")
    '2**3'
    >>> normalize_for_parse_expr("16 × 3 − 4")
    '16 * 3 - 4'
    >>> normalize_for_parse_expr("$2,500")
    '2500'
    >>> normalize_for_parse_expr("π/2")
    'pi/2'
    """
    s = text.strip()

    # Power: ^ → **
    s = s.replace("^", "**")

    # Multiplication: Unicode operators → *
    s = s.replace("\u00d7", "*")  # U+00D7 MULTIPLICATION SIGN (×)
    s = s.replace("\u00b7", "*")  # U+00B7 MIDDLE DOT (·)
    s = s.replace("\u22c5", "*")  # U+22C5 DOT OPERATOR (⋅)
    s = s.replace("\u2022", "*")  # U+2022 BULLET (•)

    # Division: Unicode ÷ → /
    s = s.replace("\u00f7", "/")  # U+00F7 DIVISION SIGN (÷)

    # Minus: typography minus (U+2212) → ASCII hyphen-minus
    s = s.replace("\u2212", "-")  # U+2212 MINUS SIGN (−)

    # Comparison operators (if ever parsing relations)
    s = s.replace("\u2264", "<=")  # U+2264 LESS-THAN OR EQUAL TO (≤)
    s = s.replace("\u2265", ">=")  # U+2265 GREATER-THAN OR EQUAL TO (≥)
    s = s.replace("\u2260", "!=")  # U+2260 NOT EQUAL TO (≠)

    # Greek constants: π → pi (SymPy symbol name)
    s = s.replace("\u03c0", "pi")  # U+03C0 GREEK SMALL LETTER PI (π)

    # Currency symbols: remove (caller typically strips or segments numeric tails)
    s = re.sub(r"[$€£¥₹]", "", s)

    # Thousands separators in numbers: 80,000 → 80000
    # Match a comma only when it sits between a digit and a group of exactly
    # three digits that ends at a word boundary.
    s = re.sub(r"(?<=\d),(?=\d{3}\b)", "", s)

    # Spoken "times" with ASCII letter x (grade-school / LLM): "4 x 90" must not
    # become 4*x*90 in SymPy (x parsed as a symbol → false failures on chains).
    # Only between digit and digit or digit and '('.
    s = re.sub(r"(?<=\d)\s+[xX]\s+(?=\d|\()", "*", s)

    # Collapse multiple spaces/tabs to single space
    s = re.sub(r"[ \t]+", " ", s)

    # Collapse excessive newlines (keep at most double)
    s = re.sub(r"\n{3,}", "\n\n", s)

    return s.strip()


def prefer_arithmetic_tail(text: str) -> str:
    """
    Return the normalized substring starting at the first digit, else the full text.

    The input is first passed through `normalize_for_parse_expr`, so the result
    is always normalized, whether or not a digit is found.

    Useful when LLM outputs mix natural language with equations, e.g.:
        "Janet sells 16 - 3 - 4 = 9 eggs every day"
    This heuristic extracts "16 - 3 - 4 = 9 eggs every day" (digit onward),
    reducing the risk of English words being parsed as symbols when implicit
    multiplication transformations are enabled.

    Parameters
    ----------
    text : str
        Potentially mixed prose + math.

    Returns
    -------
    str
        Normalized substring from the first digit onward, or the full
        normalized text if it contains no digit.

    Examples
    --------
    >>> prefer_arithmetic_tail("Janet sells 16-3-4=9")
    '16-3-4=9'
    >>> prefer_arithmetic_tail("no digits here")
    'no digits here'
    """
    s = normalize_for_parse_expr(text)
    m = re.search(r"\d", s)
    if m:
        return s[m.start() :].strip()
    return s.strip()