---
title: AutoMathReasoner (Calculus Environment)
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# ♾️ AutoMathReasoner: Autonomous Mathematical Intelligence Environment
**AutoMathReasoner** is an OpenEnv-compliant reinforcement learning environment designed for the **Recursive Policy Refinement** of language models. The system focuses on **Symbolic Calculus (Indefinite Integration)** and uses a dense, multi-objective reward architecture to bridge complexity gaps in mathematical reasoning.
---
## Core Reasoning Technologies
The environment implements several advanced logic-steering protocols to ensure convergence on complex mathematical primitives.
### 1. Recursive Difficulty Ascent (LADDER)
The system employs a **Recursive Task Decomposition** mechanism where a failure on a parent task $\mathcal{T}_p$ triggers a search for a solvable basis $\{\mathcal{T}_1, \dots, \mathcal{T}_k\}$.
Given a complexity operator $\Phi$, the decomposition satisfies:
$$\Phi(\mathcal{T}_p) = \sum_{i=1}^k \omega_i \Phi(\mathcal{T}_i)$$
Where variants $\mathcal{T}_i$ represent "stepping stones" that allow the policy to acquire base identities before attempting the coupled root problem.
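How these stepping-stone variants are generated is not spelled out in this README; the following is a minimal sketch, assuming a Sympy-based generator that splits products and lowers integer exponents, with `sp.count_ops` standing in for $\Phi$.
```python
import sympy as sp

x = sp.symbols("x")

def simpler_variants(integrand):
    """Sketch: derive lower-complexity 'stepping stone' integrands from a
    parent by splitting products and lowering integer exponents."""
    variants = set()
    if integrand.is_Mul:                      # each factor on its own
        variants.update(integrand.args)
    for power in integrand.atoms(sp.Pow):     # reduce integer exponents by one
        base, exponent = power.as_base_exp()
        if exponent.is_Integer and exponent > 1:
            variants.add(integrand.subs(power, base ** (exponent - 1)))
    # keep only variants that are strictly simpler than the parent
    return [v for v in variants if sp.count_ops(v) < sp.count_ops(integrand)]

parent = x * sp.exp(x**2) * sp.cos(x)
print(simpler_variants(parent))               # e.g. [x, cos(x), exp(x**2), ...]
```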
### 2. Test-Time Adaptive Policy (TTRL)
For "truly difficult" integrals at the boundary of the model's current capability, the system supports **Inference-Time Group Optimization**. When presented with a novel hard task $\mathcal{G}$, the model:
1. Generates $m$ simpler variants on-the-fly.
2. Performs a short micro-RL update on these variants.
3. Warm-starts the final inference on $\mathcal{G}$ with the adapted policy weights.
Mathematically, we solve for an optimal local parameter shift:
$$\theta^* = \arg \max_{\theta'} \mathbb{E}_{\mathcal{T} \sim \text{variants}(\mathcal{G})} \left[ R(\tau, \pi_{\theta'}) \right]$$
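A structural sketch of this adaptation loop follows, assuming hypothetical `make_variants` and `rl_update` callables and a policy object with a `solve` method; the actual micro-RL step (e.g. a GRPO update) is not reproduced here.
```python
import copy

def test_time_adapt(policy, hard_task, make_variants, rl_update,
                    n_variants=8, n_steps=4):
    """Sketch of the TTRL loop: adapt a throwaway copy of the policy on
    self-generated easier variants, then answer the hard task G with the
    adapted weights. `make_variants` and `rl_update` are hypothetical."""
    local = copy.deepcopy(policy)            # leave the global weights untouched
    variants = make_variants(hard_task, n_variants)
    for _ in range(n_steps):                 # micro-RL update on the variants
        local = rl_update(local, variants)
    return local.solve(hard_task)            # warm-started inference on G
```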
### 3. Process-Aware Reward Shaping
Unlike binary "sparse" reward systems, we employ **Dense Process Supervision**. Every primitive transformation (e.g. $u$-substitution, integration by parts) is identified as a logical node.
The reward $R_{\text{shape}}$ is assigned as the line integral over the reasoning trajectory $\tau$:
$$R_{\text{shape}} = \int_{\tau} \Psi(\mathbf{z}) d\mathbf{z}$$
where $\Psi$ evaluates the structural validity of each state transition relative to the ground-truth simplification steps.
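In implementation terms the line integral reduces to a sum of per-transition validity scores over the parsed trace; a minimal sketch, with `step_validity` as a hypothetical scorer:
```python
def shaped_reward(trace_steps, reference_steps, step_validity):
    """Sketch: discrete analogue of R_shape, summing a validity score Ψ for
    each state transition in the reasoning trace. `step_validity` is a
    hypothetical scorer returning values in [0, 1]."""
    total = 0.0
    for prev_state, next_state in zip(trace_steps, trace_steps[1:]):
        total += step_validity(prev_state, next_state, reference_steps)
    return total / max(1, len(trace_steps) - 1)   # normalise by path length
```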
### 4. Hard Negative Mining (Problem Persistence)
Failed tasks $\mathcal{T}_{\text{fail}}$ are not discarded. They are prioritized in the sampling buffer with a weight $W$ proportional to their failure frequency:
$$W(\mathcal{T}) \propto e^{\lambda \cdot \text{failures}(\mathcal{T})}$$
This forces the policy to repeatedly encounter "bottleneck" logic until the primitive is solved.
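A minimal sketch of the persistence-weighted sampler implied by this rule; the value of $\lambda$ and the failure counts here are illustrative assumptions.
```python
import math
import random

def sample_task(tasks, failures, lam=0.5):
    """Sketch: sample tasks with weight W(T) ∝ exp(lam * failures(T)),
    so bottleneck problems are revisited until solved."""
    weights = [math.exp(lam * failures.get(t, 0)) for t in tasks]
    return random.choices(tasks, weights=weights, k=1)[0]

tasks = ["x*exp(x)", "sec(x)**3", "log(x)/x"]
failures = {"sec(x)**3": 3}          # hypothetical failure counts
print(sample_task(tasks, failures))  # the failed task dominates the draw
```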
---
## System Architecture
The environment architecture follows a strictly decoupled schema between task generation, solution validation, and policy refinement.
```mermaid
graph TD
subgraph EnvCore [Mathematical Environment Server]
GE["Symbolic Generator (Sympy)"] -->|"Sample T"| Server["OpenEnv API (FastAPI)"]
Server -->|"Verify F(x)"| VR["Numerical Verifier"]
VR -->|"Law: FTC Derivative Test"| Server
Server -->|"Compute Sum(R)"| RW["Reward Logic Engine"]
RW --> Server
end
subgraph PolicyNode [Reinforcement Learning Client]
Policy["Policy pi(theta)"] -->|"Action Trace (tau)"| Server
Server -->|"Reward Observation"| Policy
end
classDef space fill:transparent,stroke:#9370DB,stroke-width:2px;
classDef client fill:transparent,stroke:#008B8B,stroke-width:2px;
class EnvCore space
class PolicyNode client
```
---
## Systemic Logic: Recursive Difficulty Ascent
The environment operates via **Autonomous Difficulty Scaling**. Instead of fixed-difficulty benchmarks, a problem $\mathcal{T}$ is decomposed into a hierarchical tree of simpler primitives. For any parent problem $\mathcal{T}_{\text{p}}$ that fails to elicit a reward, the system generates a set of variants $\{\mathcal{T}_i\}$ such that the complexity metric $\mathcal{M}$ satisfies:
$$\mathcal{M}(\mathcal{T}_i) < \mathcal{M}(\mathcal{T}_{\text{p}})$$
This ensures a continuous gradient for the learner, moving from fundamental algebraic identities to nested transcendental integrals.
---
## 🎯 The Reward Law
The terminal reward $R_{\Sigma}$ is a weighted composite of seven distinct mathematical and structural signals, designed to penalize reward hacking and reward rigorous proof-like trajectories:
$$R_{\Sigma} = \alpha C + \beta Q + \gamma P + \delta R_{\text{ref}} + \eta D + \zeta E + \lambda X$$
Where the weights are calibrated as $\alpha=0.35, \beta=0.15, \gamma=0.1, \delta=0.1, \eta=0.15, \zeta=0.05, \lambda=0.1$.
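A minimal sketch of the composite, using the calibrated weights above; the component values in the usage example are hypothetical.
```python
# Weighted composite of the seven signals with the calibrated weights.
WEIGHTS = dict(C=0.35, Q=0.15, P=0.10, R_ref=0.10, D=0.15, E=0.05, X=0.10)

def terminal_reward(signals):
    """signals: dict mapping each component (C, Q, P, R_ref, D, E, X)
    to its scalar value; returns the weighted composite R_Σ."""
    return sum(WEIGHTS[k] * signals[k] for k in WEIGHTS)

# Hypothetical example: correct answer with a moderately formatted trace.
print(terminal_reward(dict(C=1.0, Q=0.6, P=0.5, R_ref=1.0, D=1.0, E=-0.2, X=0.3)))
```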
### 1. Fundamental Correctness ($C$)
Derived from the **Numerical Multi-point Quadrature Protocol**. A predicted solution $F_{\theta}(x)$ is verified against the target integrand $f(x)$ through the derivative identity:
$$C = \begin{cases} 1.0 & \text{if } \forall x_i \in \mathbb{X}, \quad \left| \frac{d}{dx}F_{\theta}(x_i) - f(x_i) \right| < 10^{-2} \\ 0.0 & \text{otherwise} \end{cases}$$
Where $\mathbb{X} = \{x_1, \dots, x_5\}$ is a set of random points sampled from $\mathcal{U}(-5, 5)$.
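A minimal sketch of this derivative test with Sympy; function names are illustrative, not the environment's actual verifier API.
```python
import random
import sympy as sp

x = sp.symbols("x")

def correctness(f_expr, F_expr, tol=1e-2, n_points=5):
    """Sketch of the FTC derivative test: C = 1 iff d/dx F(x) matches f(x)
    at n random points drawn from U(-5, 5), within the tolerance."""
    f  = sp.lambdify(x, sp.sympify(f_expr), "math")
    dF = sp.lambdify(x, sp.diff(sp.sympify(F_expr), x), "math")
    points = [random.uniform(-5, 5) for _ in range(n_points)]
    return 1.0 if all(abs(dF(p) - f(p)) < tol for p in points) else 0.0

print(correctness("x*cos(x)", "x*sin(x) + cos(x)"))   # -> 1.0
```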
### 2. Reasoning Formatting ($Q$)
Calculates the structural density of the reasoning trace using a hyperbolic tangent squashing function to bound heuristic markers:
$$Q = \tanh(\omega \cdot \text{count}(\text{markers}))$$
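A minimal sketch, assuming the markers are plain substrings and $\omega$ is a small illustrative constant.
```python
import math

REASONING_MARKERS = ("therefore", "let u =", "substituting", "by parts")  # illustrative

def formatting_reward(trace, omega=0.25):
    """Sketch: Q = tanh(ω · count(markers)) over hypothetical marker strings."""
    count = sum(trace.lower().count(m) for m in REASONING_MARKERS)
    return math.tanh(omega * count)
```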
### 3. Process Supervision ($P$)
Assigns a scalar reward for explicit step-wise transition logic. It algorithmically penalizes "Inferential Jumps" where the ratio of reasoning tokens to mathematical complexity falls below a critical threshold.
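The exact thresholding rule is not given here; the following sketch assumes a simple ratio test with an illustrative threshold and penalty.
```python
def process_reward(reasoning_tokens, problem_complexity, threshold=8.0):
    """Sketch: penalise 'inferential jumps' where the ratio of reasoning
    tokens to problem complexity falls below an assumed critical threshold."""
    ratio = reasoning_tokens / max(1.0, problem_complexity)
    return 1.0 if ratio >= threshold else -0.25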
### 4. Reflection Logits ($R_{\text{ref}}$)
Rewards the presence of self-correction tokens when they lead to a terminal state correction. If the model reflects ($r=1$) but fails to correct the solution ($c=0$), it suffers a penalty of $-0.5$.
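A minimal sketch; only the $-0.5$ penalty is stated above, so the magnitude of the positive reward is an assumption.
```python
def reflection_reward(reflected, corrected):
    """Sketch: reward self-correction only when it leads to a corrected
    terminal state; reflection without correction is penalised at -0.5."""
    if reflected and corrected:
        return 1.0          # assumed positive value
    if reflected and not corrected:
        return -0.5
    return 0.0
```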
### 5. Trajectory Diversity ($D$)
Prevents the policy from converging on rote-memorized repetitive strings. If the current answer $A_t$ has been seen in history $\mathcal{H}$, an exponential penalty is applied:
$$D = \begin{cases} -\exp(1.0) & \text{if } A_t \in \mathcal{H} \\ 1.0 & \text{otherwise} \end{cases}$$
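A minimal sketch of the repetition check:
```python
import math

def diversity_reward(answer, history):
    """Sketch of the repetition penalty: D = -e if the answer was seen before."""
    return -math.exp(1.0) if answer in history else 1.0
```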
### 6. Information Density Efficiency ($E$)
Guides the model toward concise mathematical proofs using a Gaussian decay centered at an optimal token length $\phi=50$:
$$E = \exp\left(-\left(\frac{\text{len}(\tau)/4 - \phi}{\phi}\right)^2\right) - 1$$
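A minimal sketch, taking `len(trace) / 4` as the rough character-to-token conversion implied by the formula.
```python
import math

def efficiency_reward(trace, phi=50):
    """Sketch: Gaussian decay around the optimal token length phi."""
    approx_tokens = len(trace) / 4
    return math.exp(-(((approx_tokens - phi) / phi) ** 2)) - 1
```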
### 7. Global Exploration Bonus ($X$)
Rewards token-level variance relative to the frequency of problem encounters $s$:
$$X = \frac{\log(1 + \nu)}{\sqrt{1 + s}}$$
Where $\nu$ is the ratio of unique tokens to total tokens in the reasoning trace $\tau$.
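A minimal sketch, computing $\nu$ as unique tokens over total tokens:
```python
import math

def exploration_bonus(trace_tokens, encounters):
    """Sketch: X = log(1 + ν) / sqrt(1 + s), where ν is the unique-token
    ratio of the trace and s counts prior encounters with this problem."""
    nu = len(set(trace_tokens)) / max(1, len(trace_tokens))
    return math.log(1 + nu) / math.sqrt(1 + encounters)
```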
---
## The Interaction Loop
The environment manages the **Difficulty Gradient** to ensure the policy $\pi_{\theta}$ maintains exploration stability.
```mermaid
sequenceDiagram
participant Model as Policy (pi)
participant Engine as Recursion Engine
participant Oracle as Calculus Verifier
loop Optimization Batch
Engine ->> Model: Sample Low-Complexity Variant
Model ->> Oracle: Submit Solution F(x)
Oracle -->> Model: Correctness Yield (C)
Note over Model: Internal State Update
Engine ->> Model: Sample Root Complexity Task
Model ->> Oracle: Proof Trajectory (tau)
Oracle -->> Model: Composite Reward (R)
end
```
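A compact sketch of one optimization batch, with `engine`, `policy`, and `oracle` as hypothetical stand-ins for the three participants in the diagram.
```python
def optimization_batch(engine, policy, oracle, n_rounds=4):
    """Sketch of one batch from the diagram above; `engine`, `policy`, and
    `oracle` are hypothetical stand-ins for the three participants."""
    for _ in range(n_rounds):
        easy = engine.sample_variant()                    # low-complexity variant
        policy.observe(oracle.check(policy.solve(easy)))  # correctness yield C
        hard = engine.sample_root_task()                  # root-complexity task
        trace = policy.solve(hard)
        policy.observe(oracle.composite_reward(trace))    # composite reward R
    return policy
```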
---
## 💻 Running the Environment
### 1. Launch the Environment
```bash
# Install local calculus bindings
uv pip install -e .
# Start the environment server
uv run server
```
### 2. Training Initiation
```bash
# Executes the recursive training sequence
python train/train_grpo.py
```