"""
Normalization layer for LLM outputs before SymPy parsing.

This module provides `normalize_for_parse_expr`, which converts common LLM output
patterns (Unicode operators, currency symbols, thousands separators) into SymPy-friendly
ASCII Python-like expressions suitable for `sympy.parsing.sympy_parser.parse_expr`,
plus the helper `prefer_arithmetic_tail` for dropping a natural-language prefix.

## Why normalize instead of controlling LLM output?

LLMs generate diverse textual math notation (^, ×, π, commas in numbers, etc.) that
cannot be reliably controlled at the token level. A deterministic preprocessing layer
is more robust than trying to force specific character-level outputs during training.

## SymPy parsing context

SymPy's `parse_expr` (docs: https://docs.sympy.org/latest/modules/parsing.html):
- Uses Python-like expression syntax as the base grammar.
- Applies **transformations** (token rewrites) before evaluation.
- Notable transformations:
  - `standard_transformations`: auto symbol/number conversion, factorial notation.
  - `convert_xor`: treats `^` as power (not bitwise XOR).
  - `implicit_multiplication_application`: relaxes syntax (implicit mult, split symbols).
  - LaTeX is a **separate path** via `sympy.parsing.latex.parse_latex` (experimental).
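
The transformations listed above can be combined as in the minimal sketch below.
`parse_relaxed` is a hypothetical helper, not part of this module, and the import guard
is an assumption added only so the snippet does nothing where SymPy is not installed.

```python
try:
    from sympy.parsing.sympy_parser import (
        convert_xor,
        implicit_multiplication_application,
        parse_expr,
        standard_transformations,
    )
    HAVE_SYMPY = True
except ImportError:  # SymPy missing: the sketch has nothing to run
    HAVE_SYMPY = False


def parse_relaxed(text):
    # Hypothetical helper: standard rewrites, plus ^ as power, plus implicit
    # multiplication. Transformations are applied in tuple order.
    transformations = standard_transformations + (
        convert_xor,
        implicit_multiplication_application,
    )
    return parse_expr(text, transformations=transformations)


if HAVE_SYMPY:
    print(parse_relaxed("2^3"))     # 8
    print(parse_relaxed("2x + 1"))  # 2*x + 1
```

Enabling `implicit_multiplication_application` is what makes stray English words risky:
with it on, adjacent names are multiplied together instead of raising a syntax error.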

**Security note:** `parse_expr` uses `eval` internally. Treat LLM outputs as untrusted;
this module helps but does not sandbox.
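
One way a caller might narrow what reaches `parse_expr` is an allow-list check run after
normalization. The sketch below is an assumption for illustration, not something this
module ships: `is_plain_arithmetic`, `ALLOWED_CHARS`, and `MAX_LENGTH` are hypothetical
names, and the character set deliberately rejects every identifier, including `pi`.

```python
# Hypothetical pre-check: accept only digits, spaces, a decimal point,
# parentheses, and the ASCII operators that normalization emits.
ALLOWED_CHARS = frozenset("0123456789 .()+-*/")
MAX_LENGTH = 200  # arbitrary cap, chosen for illustration only


def is_plain_arithmetic(text):
    # True only for non-empty, short strings made entirely of allowed characters.
    return 0 < len(text) <= MAX_LENGTH and all(ch in ALLOWED_CHARS for ch in text)


print(is_plain_arithmetic("16 * 3 - 4"))        # True
print(is_plain_arithmetic("__import__('os')"))  # False
```

A check like this limits which names can be evaluated but does not bound the cost of
evaluation: an input such as `9**9**9**9` passes it and is still expensive to compute.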

## Normalization mapping (categories)

| Category           | LLM output           | Normalized        | Notes                                  |
|--------------------|----------------------|-------------------|----------------------------------------|
| Power              | `^`                  | `**`              | Python power operator                  |
| Multiplication     | `×`, `·`, `⋅`, `•`   | `*`               | Unicode operators → ASCII              |
| Division           | `÷`                  | `/`               | Unicode division sign → ASCII          |
| Minus sign         | `−` (U+2212)         | `-`               | Typographic minus → ASCII hyphen-minus |
| Comparisons        | `≤`, `≥`, `≠`        | `<=`, `>=`, `!=`  | Relational operators (if parsing them) |
| Constants          | `π`                  | `pi`              | Greek letter → SymPy symbol name       |
| Thousands sep      | `80,000`             | `80000`           | Remove commas in numeric literals      |
| Currency           | `$`, `€`, `£`        | (removed)         | Also `¥`, `₹`; stripped before parsing |
| Spoken times       | `4 x 90`             | `4*90`            | Letter x between digits or before `(`  |
| Extra whitespace   | multiple spaces/tabs | single space      | Collapse for cleaner parsing           |
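
The single-character rows of the table can also be expressed as one translation table.
This is a sketch of an alternative layout, not how the function below is written;
`CHAR_MAP` is a hypothetical name, and the regex-based rows (thousands separators,
whitespace) are left out because they depend on neighbouring characters.

```python
# Table-driven sketch: each key is one character, each value its replacement
# (None deletes the character).
CHAR_MAP = str.maketrans({
    "^": "**",
    "×": "*",   # U+00D7 MULTIPLICATION SIGN
    "·": "*",   # U+00B7 MIDDLE DOT
    "⋅": "*",   # U+22C5 DOT OPERATOR
    "•": "*",   # U+2022 BULLET
    "÷": "/",   # U+00F7 DIVISION SIGN
    "−": "-",   # U+2212 MINUS SIGN
    "≤": "<=",
    "≥": ">=",
    "≠": "!=",
    "π": "pi",
    "$": None,
    "€": None,
    "£": None,
})

print("16 × 3 − 4".translate(CHAR_MAP))  # 16 * 3 - 4
print("π/2 ≤ 2^3".translate(CHAR_MAP))   # pi/2 <= 2**3
```

A single `str.translate` pass visits each character once, whereas a chain of
`str.replace` calls rescans the string per rule; for short LLM outputs either is fine.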

Not handled (by design):
- **LaTeX** (`\\frac`, `\\sqrt`, etc.): route to `parse_latex` separately if needed.
- **Natural language prefix** ("Janet sells 16-3-4=9 eggs"): caller extracts the math tail
  first (see `prefer_arithmetic_tail`).
- **Grouping `[` `]`**: context-dependent; avoid substituting without semantic analysis.

Version lock: sympy==1.14.0 (line 84 in requirements.txt at time of writing).
"""

from __future__ import annotations

import re


def normalize_for_parse_expr(text: str) -> str:
    """
    Normalize LLM-generated math text for SymPy's `parse_expr`.

    Converts common Unicode operators, currency symbols, and formatting quirks
    into ASCII Python-like syntax. This is the single source of truth for
    string preprocessing before SymPy parsing in this project.

    Parameters
    ----------
    text : str
        Raw string (potentially mixed prose and math from LLM).

    Returns
    -------
    str
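        Normalized expression string. Characters outside the mapping are passed
        through unchanged, so the result is ASCII only if the input had no others.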

    Examples
    --------
    >>> normalize_for_parse_expr("2^3")
    '2**3'
    >>> normalize_for_parse_expr("16 × 3 − 4")
    '16 * 3 - 4'
    >>> normalize_for_parse_expr("$2,500")
    '2500'
    >>> normalize_for_parse_expr("π/2")
    'pi/2'
    """
    s = text.strip()

    # Power: ^ → **
    s = s.replace("^", "**")

    # Multiplication: Unicode operators → *
    s = s.replace("\u00d7", "*")  # U+00D7 MULTIPLICATION SIGN (×)
    s = s.replace("\u00b7", "*")  # U+00B7 MIDDLE DOT (·)
    s = s.replace("\u22c5", "*")  # U+22C5 DOT OPERATOR (⋅)
    s = s.replace("\u2022", "*")  # U+2022 BULLET (•)

    # Division: Unicode ÷ → /
    s = s.replace("\u00f7", "/")  # U+00F7 DIVISION SIGN (÷)

    # Minus: typographic minus (U+2212) → ASCII hyphen-minus
    s = s.replace("\u2212", "-")  # U+2212 MINUS SIGN (−)

    # Comparison operators (if ever parsing relations)
    s = s.replace("\u2264", "<=")  # U+2264 LESS-THAN OR EQUAL TO (≤)
    s = s.replace("\u2265", ">=")  # U+2265 GREATER-THAN OR EQUAL TO (≥)
    s = s.replace("\u2260", "!=")  # U+2260 NOT EQUAL TO (≠)

    # Greek constants: π → pi (SymPy symbol name)
    s = s.replace("\u03c0", "pi")  # U+03C0 GREEK SMALL LETTER PI (π)

    # Currency symbols: remove (caller typically strips or segments numeric tails)
    s = re.sub(r"[$€£¥₹]", "", s)

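    # Thousands separators in numbers: 80,000 → 80000
    # Remove a comma only when it follows a digit and precedes exactly three
    # digits ending at a word boundary. Caveat: argument lists such as
    # "f(1,234)" match the same pattern and are merged too.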
    s = re.sub(r"(?<=\d),(?=\d{3}\b)", "", s)

    # Spoken "times" with ASCII letter x (grade-school / LLM): "4 x 90" must not
    # become 4*x*90 in SymPy (x parsed as a symbol → false failures on chains).
    # Only between digit and digit or digit and '('.
    s = re.sub(r"(?<=\d)\s+[xX]\s+(?=\d|\()", "*", s)

    # Collapse multiple spaces/tabs to single space
    s = re.sub(r"[ \t]+", " ", s)

    # Collapse excessive newlines (keep at most double)
    s = re.sub(r"\n{3,}", "\n\n", s)

    return s.strip()


def prefer_arithmetic_tail(text: str) -> str:
    """
    Normalize `text`, then return the substring starting at the first digit
    (or the whole normalized text if it contains no digit).

    Useful when LLM outputs mix natural language with equations, e.g.:
      "Janet sells 16 - 3 - 4 = 9 eggs every day"
    This heuristic extracts "16 - 3 - 4 = 9 eggs every day" (digit onward),
    reducing risk of English words being parsed as symbols when implicit
    multiplication transformations are enabled.

    Parameters
    ----------
    text : str
        Potentially mixed prose + math.

    Returns
    -------
    str
        Normalized substring from the first digit onward, or the full normalized
        text if there is no digit.

    Examples
    --------
    >>> prefer_arithmetic_tail("Janet sells 16-3-4=9")
    '16-3-4=9'
    >>> prefer_arithmetic_tail("no digits here")
    'no digits here'
    """
    s = normalize_for_parse_expr(text)
    m = re.search(r"\d", s)
    if m:
        return s[m.start() :].strip()
    return s.strip()