---
title: AutoMathReasoner (Calculus Environment)
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# ♾️ AutoMathReasoner: Autonomous Mathematical Intelligence Environment

**AutoMathReasoner** is an OpenEnv-compliant reinforcement learning environment designed for the **Recursive Policy Refinement** of language models. The system focuses on the domain of **Symbolic Calculus (Indefinite Integration)**, using a dense, multi-objective reward architecture to bridge complexity gaps in mathematical reasoning.

---

## ๐Ÿš€ Core Reasoning Technologies

The environment implements several advanced logic-steering protocols to ensure convergence on complex mathematical primitives.

### 1. Recursive Difficulty Ascent (LADDER)
The system employs a **Recursive Task Decomposition** mechanism where a failure on a parent task $\mathcal{T}_p$ triggers a search for a solvable basis $\{\mathcal{T}_1, \dots, \mathcal{T}_k\}$.

Given a complexity operator $\Phi$, we satisfy:

$$\Phi(\mathcal{T}_p) = \sum_{i=1}^{k} \omega_i \Phi(\mathcal{T}_i)$$

Where the variants $\mathcal{T}_i$ represent "stepping stones" that allow the policy to acquire base identities before reattempting the coupled root problem.
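
A minimal sketch of how this recursion could be driven, assuming hypothetical `attempt` (one rollout-plus-update on a task) and `generate_variants` (the symbolic variant generator) callables; the environment's actual orchestration may differ:

```python
# Recursive Difficulty Ascent (LADDER), sketched: on failure, descend into
# lower-complexity variants, then reattempt the parent task.
def ladder_solve(task, attempt, generate_variants, depth=0, max_depth=3):
    """Return True once `task` is solved, learning simpler variants first."""
    if attempt(task):            # the policy already solves the parent
        return True
    if depth >= max_depth:       # bound recursion on pathological tasks
        return False
    # Decompose into the solvable basis {T_1, ..., T_k} of stepping stones.
    for variant in generate_variants(task):
        ladder_solve(variant, attempt, generate_variants, depth + 1, max_depth)
    # Reattempt the parent once the base identities have been acquired.
    return attempt(task)
```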

### 2. Test-Time Adaptive Policy (TTRL)
For "truly difficult" integrals at the boundary of the model's current capability, the system supports **Inference-Time Group Optimization**. When presented with a novel hard task $\mathcal{G}$, the model:
1. Generates $m$ simpler variants on-the-fly.
2. Performs a brief micro-RL update on these variants.
3. Warm-starts the final inference on $\mathcal{G}$ from the adapted policy weights.

Mathematically, we solve for an optimal local parameter shift:

$$\theta^* = \arg \max_{\theta'} \mathbb{E}_{\mathcal{T} \sim \text{variants}(\mathcal{G})} \left[ R(\tau, \pi_{\theta'}) \right]$$
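
A sketch of this loop under stated assumptions: `make_variants`, `rl_step` (one policy-gradient update), and `policy.solve` are hypothetical interfaces standing in for the environment's client API.

```python
import copy

# Test-Time Adaptive Policy (TTRL), sketched: adapt a cloned policy on
# m on-the-fly variants of G, then run the final inference warm-started
# from the adapted weights theta*.
def ttrl_infer(policy, hard_task, make_variants, rl_step, m=8, micro_steps=4):
    adapted = copy.deepcopy(policy)            # theta' initialized at theta
    variants = make_variants(hard_task, m)     # simpler variants of G
    for _ in range(micro_steps):
        for task in variants:
            rl_step(adapted, task)             # maximize E[R(tau, pi_theta')]
    return adapted.solve(hard_task)            # warm-started inference on G
```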

### 3. Process-Aware Reward Shaping
Unlike binary "sparse" reward systems, we employ **Dense Process Supervision**. Every primitive transformation (e.g. $u$-substitution, integration by parts) is identified as a logical node.

The reward $R_{\text{shape}}$ is assigned as the line integral over the reasoning trajectory $\tau$:
$$R_{\text{shape}} = \int_{\tau} \Psi(\mathbf{z}) d\mathbf{z}$$
where $\Psi$ evaluates the structural validity of each state transition relative to the ground-truth simplification steps.
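
In practice the line integral reduces to a sum over discrete state transitions. A sketch, where `psi` is a hypothetical scorer for the structural validity of one transition:

```python
# Dense Process Supervision, sketched: approximate the line integral
# R_shape as a sum of per-transition validity scores along the trajectory.
def shaped_reward(trajectory, psi):
    """Discretized R_shape = sum of psi(z_t, z_{t+1}) over reasoning states."""
    return sum(psi(prev, curr) for prev, curr in zip(trajectory, trajectory[1:]))
```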

### 4. Hard Negative Mining (Problem Persistence)
Failed tasks $\mathcal{T}_{\text{fail}}$ are not discarded. They are prioritized in the sampling buffer with a weight $W$ proportional to their failure frequency:
$$W(\mathcal{T}) \propto e^{\lambda \cdot \text{failures}(\mathcal{T})}$$
This forces the policy to repeatedly encounter "bottleneck" logic until the primitive is solved.
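
A self-contained sketch of such a buffer using only the standard library; dropping a task from the buffer on success is an assumption:

```python
import math
import random

# Hard Negative Mining, sketched: failed tasks are resampled with weight
# W(T) proportional to exp(lambda * failures(T)).
class PersistenceBuffer:
    def __init__(self, lam=0.5):
        self.lam = lam
        self.failures = {}                    # task -> failure count

    def record_failure(self, task):
        self.failures[task] = self.failures.get(task, 0) + 1

    def record_success(self, task):
        self.failures.pop(task, None)         # solved bottlenecks drop out

    def sample(self):
        tasks = list(self.failures)
        if not tasks:
            return None
        weights = [math.exp(self.lam * self.failures[t]) for t in tasks]
        return random.choices(tasks, weights=weights, k=1)[0]
```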

---

## ๐Ÿ—๏ธ System Architecture

The environment architecture follows a strictly decoupled schema between task generation, solution validation, and policy refinement.

```mermaid
graph TD
    subgraph EnvCore [Mathematical Environment Server]
        GE["Symbolic Generator (Sympy)"] -->|"Sample T"| Server["OpenEnv API (FastAPI)"]
        Server -->|"Verify F(x)"| VR["Numerical Verifier"]
        VR -->|"Law: FTC Derivative Test"| Server
        Server -->|"Compute Sum(R)"| RW["Reward Logic Engine"]
        RW --> Server
    end
    
    subgraph PolicyNode [Reinforcement Learning Client]
        Policy["Policy pi(theta)"] -->|"Action Trace (tau)"| Server
        Server -->|"Reward Observation"| Policy
    end
    
    classDef space fill:transparent,stroke:#9370DB,stroke-width:2px;
    classDef client fill:transparent,stroke:#008B8B,stroke-width:2px;
    
    class EnvCore space
    class PolicyNode client
```

---

## ๐Ÿ” Systemic Logic: Recursive Difficulty Ascent

The environment operates via **Autonomous Difficulty Scaling**. Instead of fixed-difficulty benchmarks, a problem $\mathcal{T}$ is decomposed into a hierarchical tree of simpler primitives. For any parent problem $\mathcal{T}_{\text{p}}$ that fails to elicit a reward, the system generates a set of variants $\{\mathcal{T}_i\}$ such that the complexity metric $\mathcal{M}$ satisfies:

$$\mathcal{M}(\mathcal{T}_i) < \mathcal{M}(\mathcal{T}_{\text{p}})$$

This ensures a continuous gradient for the learner, moving from fundamental algebraic identities to nested transcendental integrals.
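
One plausible instantiation of $\mathcal{M}$, assuming expression-tree size as the measure (the generator's actual metric is not specified here):

```python
import sympy as sp

# Complexity metric M, sketched as the size of the integrand's expression
# tree, so nested transcendental integrands outrank base identities.
def complexity(expr):
    return 1 + sum(complexity(arg) for arg in expr.args)

x = sp.symbols("x")
assert complexity(sp.sin(x**2) * sp.exp(x)) > complexity(sp.sin(x))  # M(T_i) < M(T_p)
```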

---

## ๐ŸŽฏ The Reward Law

The terminal reward $R_{\Sigma}$ is a weighted composite of seven distinct mathematical and structural signals, designed to penalize reward hacking and reward rigorous, proof-like trajectories:

$$R_{\Sigma} = \alpha C + \beta Q + \gamma P + \delta R_{\text{ref}} + \eta D + \zeta E + \lambda X$$

Where the weights are calibrated as $\alpha=0.35, \beta=0.15, \gamma=0.1, \delta=0.1, \eta=0.15, \zeta=0.05, \lambda=0.1$.

### 1. Fundamental Correctness ($C$)
Derived from the **Numerical Multi-point Verification Protocol**. A predicted solution $F_{\theta}(x)$ is verified against the target integrand $f(x)$ through the derivative identity:

$$C = \begin{cases} 1.0 & \text{if } \forall x_i \in \mathbb{X}, \quad \left| \frac{d}{dx}F_{\theta}(x_i) - f(x_i) \right| < 10^{-2} \\ 0.0 & \text{otherwise} \end{cases}$$

Where $\mathbb{X} = \{x_1, \dots, x_5\}$ is a set of random points sampled from $\mathcal{U}(-5, 5)$.
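
A sketch of this check with `sympy`; the argument names are illustrative, and sample points landing near singularities of $f$ are not handled:

```python
import random
import sympy as sp

x = sp.symbols("x")

# Fundamental Correctness C, sketched: differentiate the predicted
# antiderivative and compare it to the integrand at 5 points from U(-5, 5).
def correctness(pred_F, target_f, n_points=5, tol=1e-2):
    dF = sp.diff(pred_F, x)
    for _ in range(n_points):
        xi = random.uniform(-5, 5)
        if abs(float(dF.subs(x, xi)) - float(target_f.subs(x, xi))) >= tol:
            return 0.0
    return 1.0

# e.g. correctness(x**2 / 2, x) == 1.0, since d/dx (x^2 / 2) = x exactly
```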

### 2. Reasoning Formatting ($Q$)
Calculates the structural density of the reasoning trace using a hyperbolic tangent squashing function to bound heuristic markers:

$$Q = \tanh(\omega \cdot \text{count}(\text{markers}))$$

Where $\omega$ is a fixed scaling coefficient.

### 3. Process Supervision ($P$)
Assigns a scalar reward for explicit step-wise transition logic. It algorithmically penalizes "Inferential Jumps" where the ratio of reasoning tokens to mathematical complexity falls below a critical threshold.

### 4. Reflection Logits ($R_{\text{ref}}$)
Rewards the presence of self-correction tokens when they lead to a terminal state correction. If the model reflects ($r=1$) but fails to correct the solution ($c=0$), it suffers a penalty of $-0.5$.

### 5. Trajectory Diversity ($D$)
Prevents the policy from converging on rote-memorized repetitive strings. If the current answer $A_t$ has been seen in history $\mathcal{H}$, an exponential penalty is applied:

$$D = \begin{cases} -\exp(1.0) & \text{if } A_t \in \mathcal{H} \\ 1.0 & \text{otherwise} \end{cases}$$

### 6. Information Density Efficiency ($E$)
Guides the model toward concise mathematical proofs using a Gaussian decay centered at an optimal token length $\phi=50$:

$$E = \exp\left(-\left(\frac{\text{len}(\tau)/4 - \phi}{\phi}\right)^2\right) - 1$$

### 7. Global Exploration Bonus ($X$)
Rewards token-level variance relative to the frequency of problem encounters $s$:

$$X = \frac{\log(1 + \nu)}{\sqrt{1 + s}}$$

Where $\nu$ is the ratio of unique to total tokens in the reasoning trace $\tau$.
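
Putting the closed-form components together, a sketch of the composite; `C` and `P` are assumed to arrive from the verifier and process supervisor, `trace` is the raw reasoning string (so the $\text{len}(\tau)/4$ term approximates token count from characters), and the $+1.0$ reflection success value and whitespace tokenization are assumptions:

```python
import math

# Calibrated weights from the reward law above.
WEIGHTS = dict(alpha=0.35, beta=0.15, gamma=0.1, delta=0.1,
               eta=0.15, zeta=0.05, lam=0.1)

def composite_reward(C, P, marker_count, reflected, corrected,
                     answer, history, trace, encounters,
                     omega=0.1, phi=50):
    Q = math.tanh(omega * marker_count)                      # formatting
    # Only the -0.5 penalty is specified; the +1.0 success value is assumed.
    R_ref = (1.0 if corrected else -0.5) if reflected else 0.0
    D = -math.e if answer in history else 1.0                # diversity
    E = math.exp(-((len(trace) / 4 - phi) / phi) ** 2) - 1   # efficiency
    tokens = trace.split()
    nu = len(set(tokens)) / max(1, len(tokens))              # unique-token ratio
    X = math.log(1 + nu) / math.sqrt(1 + encounters)         # exploration
    w = WEIGHTS
    return (w["alpha"] * C + w["beta"] * Q + w["gamma"] * P
            + w["delta"] * R_ref + w["eta"] * D + w["zeta"] * E
            + w["lam"] * X)
```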

---

## ๐Ÿ”„ The Interaction Loop

The environment manages the **Difficulty Gradient** to ensure the policy $\pi_{\theta}$ maintains exploration stability.

```mermaid
sequenceDiagram
    participant Model as Policy (pi)
    participant Engine as Recursion Engine
    participant Oracle as Calculus Verifier
    
    loop Optimization Batch
        Engine ->> Model: Sample Low-Complexity Variant
        Model ->> Oracle: Submit Solution F(x)
        Oracle -->> Model: Correctness Yield (C)
        
        Note over Model: Internal State Update
        
        Engine ->> Model: Sample Root Complexity Task
        Model ->> Oracle: Proof Trajectory (tau)
        Oracle -->> Model: Composite Reward (R)
    end
```
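
A minimal client-side sketch of one pass through this loop over HTTP. The endpoint names (`/reset`, `/step`) and payload fields follow a common OpenEnv convention but are assumptions, not confirmed against this server's routes:

```python
import requests

BASE = "http://localhost:7860"

def run_episode(solve):
    """One episode: `solve` maps a problem string to a candidate F(x)."""
    obs = requests.post(f"{BASE}/reset").json()   # environment samples a task
    done = False
    while not done:
        answer = solve(obs["problem"])            # policy produces a solution
        obs = requests.post(f"{BASE}/step", json={"action": answer}).json()
        done = obs.get("done", True)
    return obs.get("reward")                      # composite reward R_Sigma
```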

---

## ๐Ÿ’ป Running the Environment

### 1. Launch the Environment
```bash
# Install local calculus bindings
uv pip install -e .

# Start the environment server
uv run server
```

### 2. Training Initiation
```bash
# Executes the recursive training sequence
python train/train_grpo.py
```