| """ | |
| Normalization layer for LLM outputs before SymPy parsing. | |
| This module provides a single, well-tested function to convert common LLM output | |
| patterns (Unicode operators, currency symbols, implicit styles) into SymPy-friendly | |
| ASCII Python-like expressions suitable for `sympy.parsing.sympy_parser.parse_expr`. | |
| ## Why normalize instead of controlling LLM output? | |
| LLMs generate diverse textual math notation (^, ×, π, commas in numbers, etc.) that | |
| cannot be reliably controlled at the token level. A deterministic preprocessing layer | |
| is more robust than trying to force specific character-level outputs during training. | |
| ## SymPy parsing context | |
| SymPy's `parse_expr` (docs: https://docs.sympy.org/latest/modules/parsing.html): | |
| - Uses Python-like expression syntax as the base grammar. | |
| - Applies **transformations** (token rewrites) before evaluation. | |
| - Notable transformations: | |
| - `standard_transformations`: auto symbol/number conversion, factorial notation. | |
| - `convert_xor`: treats `^` as power (not bitwise XOR). | |
| - `implicit_multiplication_application`: relaxes syntax (implicit mult, split symbols). | |
| - LaTeX is a **separate path** via `sympy.parsing.latex.parse_latex` (experimental). | |
| **Security note:** `parse_expr` uses `eval` internally. Treat LLM outputs as untrusted; | |
| this module helps but does not sandbox. | |
| ## Normalization mapping (categories) | |
| | Category | LLM output | Normalized | Notes | | |
| |--------------------|----------------------|-------------------|----------------------------------------| | |
| | Power | `^` | `**` | Python power operator | | |
| | Multiplication | `×`, `·`, `•` | `*` | Unicode operators → ASCII | | |
| | Division | `÷` | `/` | Unicode division sign → ASCII | | |
| | Minus sign | `−` (U+2212) | `-` | Typographic minus → ASCII hyphen-minus | | |
| | Comparisons | `≤`, `≥`, `≠` | `<=`, `>=`, `!=` | Relational operators (if parsing them) | | |
| | Constants | `π` | `pi` | Greek letter → SymPy symbol name | | |
| | Thousands sep | `80,000` | `80000` | Remove commas in numeric literals | | |
| | Currency | `$`, `€`, `£` | (removed) | Strip before parsing numeric tails | | |
| | Extra whitespace | multiple spaces/tabs | single space | Collapse for cleaner parsing | | |
| Not handled (by design): | |
| - **LaTeX** (`\\frac`, `\\sqrt`, etc.): route to `parse_latex` separately if needed. | |
| - **Natural language prefix** ("Janet sells 16-3-4=9 eggs"): not stripped by `normalize_for_parse_expr`; the caller extracts the math tail first (see `prefer_arithmetic_tail`). | |
| - **Grouping `[` `]`**: context-dependent; avoid substituting without semantic analysis. | |
| Version lock: sympy==1.14.0 (line 84 in requirements.txt at time of writing). | |
| """ | |
| from __future__ import annotations | |
| import re | |
| def normalize_for_parse_expr(text: str) -> str: | |
| """ | |
| Normalize LLM-generated math text for SymPy's `parse_expr`. | |
| Converts common Unicode operators, currency symbols, and formatting quirks | |
| into ASCII Python-like syntax. This is the single source of truth for | |
| string preprocessing before SymPy parsing in this project. | |
| Parameters | |
| ---------- | |
| text : str | |
| Raw string (potentially mixed prose and math from LLM). | |
| Returns | |
| ------- | |
| str | |
| Normalized ASCII expression. | |
| Examples | |
| -------- | |
| >>> normalize_for_parse_expr("2^3") | |
| '2**3' | |
| >>> normalize_for_parse_expr("16 × 3 − 4") | |
| '16 * 3 - 4' | |
| >>> normalize_for_parse_expr("$2,500") | |
| '2500' | |
| >>> normalize_for_parse_expr("π/2") | |
| 'pi/2' | |
| """ | |
| s = text.strip() | |
| # Power: ^ → ** | |
| s = s.replace("^", "**") | |
| # Multiplication: Unicode operators → * | |
| s = s.replace("×", "*") | |
| s = s.replace("·", "*") | |
| s = s.replace("•", "*") | |
| s = s.replace("\u00d7", "*") # U+00D7 MULTIPLICATION SIGN (×) | |
| s = s.replace("\u22c5", "*") # U+22C5 DOT OPERATOR (⋅) | |
| s = s.replace("\u2022", "*") # U+2022 BULLET (•) | |
| # Division: Unicode ÷ → / | |
| s = s.replace("÷", "/") | |
| s = s.replace("\u00f7", "/") # U+00F7 DIVISION SIGN | |
| # Minus: typographic minus (U+2212) → ASCII hyphen-minus | |
| s = s.replace("\u2212", "-") # U+2212 MINUS SIGN (−) | |
| # Comparison operators (if ever parsing relations) | |
| s = s.replace("≤", "<=") | |
| s = s.replace("≥", ">=") | |
| s = s.replace("≠", "!=") | |
| s = s.replace("\u2264", "<=") # U+2264 LESS-THAN OR EQUAL TO | |
| s = s.replace("\u2265", ">=") # U+2265 GREATER-THAN OR EQUAL TO | |
| s = s.replace("\u2260", "!=") # U+2260 NOT EQUAL TO | |
| # Greek constants: π → pi (SymPy symbol name) | |
| s = s.replace("π", "pi") | |
| s = s.replace("\u03c0", "pi") # U+03C0 GREEK SMALL LETTER PI | |
| # Currency symbols: remove (caller typically strips or segments numeric tails) | |
| s = re.sub(r"[$€£¥₹]", "", s) | |
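| # Thousands separators in numbers: 80,000 → 80000 | |
| # Remove a comma only when it follows a digit and precedes exactly three digits | |
| # ending at a word boundary. Caveat: this also merges comma-separated arguments | |
| # that happen to fit the pattern, e.g. "f(1,200)" becomes "f(1200)". | |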
| s = re.sub(r"(?<=\d),(?=\d{3}\b)", "", s) | |
| # Spoken "times" written as the ASCII letter x (grade-school / LLM style): "4 x 90" | |
| # must not become 4*x*90 in SymPy (x parsed as a symbol → false failures on chains). | |
| # Applied only between a digit and a following digit or '('. | |
| s = re.sub(r"(?<=\d)\s+[xX]\s+(?=\d|\()", "*", s) | |
| # Collapse multiple spaces/tabs to single space | |
| s = re.sub(r"[ \t]+", " ", s) | |
| # Collapse excessive newlines (keep at most double) | |
| s = re.sub(r"\n{3,}", "\n\n", s) | |
| return s.strip() | |
| def prefer_arithmetic_tail(text: str) -> str: | |
| """ | |
| Normalize the text, then return the substring starting at the first digit | |
| (or the whole normalized text if it contains no digit). | |
| Useful when LLM outputs mix natural language with equations, e.g.: | |
| "Janet sells 16 - 3 - 4 = 9 eggs every day" | |
| This heuristic extracts "16 - 3 - 4 = 9 eggs every day" (digit onward), | |
| reducing risk of English words being parsed as symbols when implicit | |
| multiplication transformations are enabled. | |
| Parameters | |
| ---------- | |
| text : str | |
| Potentially mixed prose + math. | |
| Returns | |
| ------- | |
| str | |
| Normalized substring from the first digit onward, or the normalized text if no digit. | |
| Examples | |
| -------- | |
| >>> prefer_arithmetic_tail("Janet sells 16-3-4=9") | |
| '16-3-4=9' | |
| >>> prefer_arithmetic_tail("no digits here") | |
| 'no digits here' | |
| """ | |
| s = normalize_for_parse_expr(text) | |
| m = re.search(r"\d", s) | |
| if m: | |
| return s[m.start() :].strip() | |
| return s.strip() | |
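
The character replacements in `normalize_for_parse_expr` can also be expressed as a single translation table, which keeps the mapping from the docstring table in one place. The following is only a minimal sketch under that assumption, not the module's implementation: `_CHAR_MAP` and `normalize_sketch` are hypothetical names introduced for illustration, the regular expressions are copied from the function above, and the newline-collapsing step is omitted for brevity.

```python
import re

# Hypothetical table-driven variant of the single-character replacements.
# Keys are single characters; values are replacement strings, or None to delete.
_CHAR_MAP = str.maketrans({
    "^": "**",        # power
    "\u00d7": "*",    # MULTIPLICATION SIGN
    "\u00b7": "*",    # MIDDLE DOT
    "\u22c5": "*",    # DOT OPERATOR
    "\u2022": "*",    # BULLET
    "\u00f7": "/",    # DIVISION SIGN
    "\u2212": "-",    # MINUS SIGN
    "\u2264": "<=",   # LESS-THAN OR EQUAL TO
    "\u2265": ">=",   # GREATER-THAN OR EQUAL TO
    "\u2260": "!=",   # NOT EQUAL TO
    "\u03c0": "pi",   # GREEK SMALL LETTER PI
    "$": None,        # currency symbols are dropped
    "\u20ac": None,   # EURO SIGN
    "\u00a3": None,   # POUND SIGN
    "\u00a5": None,   # YEN SIGN
    "\u20b9": None,   # INDIAN RUPEE SIGN
})


def normalize_sketch(text: str) -> str:
    """Illustrative re-statement of the normalization steps (assumed names)."""
    s = text.strip().translate(_CHAR_MAP)
    # Thousands separators: a comma after a digit and before exactly three digits.
    s = re.sub(r"(?<=\d),(?=\d{3}\b)", "", s)
    # Spoken "times": letter x between a digit and a following digit or '('.
    s = re.sub(r"(?<=\d)\s+[xX]\s+(?=\d|\()", "*", s)
    # Collapse runs of spaces and tabs.
    s = re.sub(r"[ \t]+", " ", s)
    return s.strip()


if __name__ == "__main__":
    for raw in ["16 \u00d7 3 \u2212 4", "$1,234,567", "4 x 90", "3 x y", "f(1,200)"]:
        print(repr(raw), "->", repr(normalize_sketch(raw)))
```

Two edge cases are worth noticing in the output. `"3 x y"` is left unchanged, because the `x` rule only fires when a digit or `(` follows. `"f(1,200)"` becomes `"f(1200)"`, because the thousands-separator pattern cannot tell a grouped number from two comma-separated arguments. If inputs may contain function calls with numeric arguments, that step likely needs a stricter pattern or should be applied only to text already known to be a single number.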