---
title: AutoMathReasoner (Calculus Environment)
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# ♾️ AutoMathReasoner: Autonomous Mathematical Intelligence Environment
AutoMathReasoner is an OpenEnv-compliant reinforcement learning environment built for the Recursive Policy Refinement of Language Models. The system focuses on the domain of Symbolic Calculus (Indefinite Integration), using a dense, multi-objective reward architecture to bridge complexity gaps in mathematical reasoning.
## Core Reasoning Technologies
The environment implements several advanced logic-steering protocols to ensure convergence on complex mathematical primitives.
### 1. Recursive Difficulty Ascent (LADDER)
The system employs a Recursive Task Decomposition mechanism where a failure on a parent task $\mathcal{T}_p$ triggers a search for a solvable basis $\{\mathcal{T}_1, \dots, \mathcal{T}_k\}$.
Given a complexity operator $\Phi$, the generated variants satisfy:

$$\Phi(\mathcal{T}_i) < \Phi(\mathcal{T}_p) \quad \forall\, i \in \{1, \dots, k\}$$

where variants $\mathcal{T}_i$ represent "stepping stones" that allow the policy to acquire base identities before attempting the coupled root problem.
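A minimal sketch of how such stepping-stone variants might be produced with SymPy; the complexity proxy (`count_ops`) and the sub-expression heuristic are illustrative assumptions, not the environment's actual generator.

```python
import sympy as sp

x = sp.Symbol("x")

def complexity(expr):
    # Stand-in for the complexity operator Phi: operation count of the expression tree.
    return sp.count_ops(expr)

def stepping_stones(integrand, k=3):
    # Collect sub-expressions of the failed integrand as candidate simpler variants.
    candidates = {sub for sub in sp.preorder_traversal(integrand)
                  if sub != integrand and sub.has(x)}
    ranked = sorted(candidates, key=complexity)
    return [c for c in ranked if complexity(c) < complexity(integrand)][:k]

parent = x * sp.exp(x) * sp.sin(x)            # a parent task T_p the policy failed on
for variant in stepping_stones(parent):
    print(variant, "->", sp.integrate(variant, x))
```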
### 2. Test-Time Adaptive Policy (TTRL)
For "truly difficult" integrals at the boundary of the model's current capability, the system supports Inference-Time Group Optimization. When presented with a novel hard task $\mathcal{G}$, the model:
- Generates $m$ simpler variants on-the-fly.
- Performs a high-step micro-RL update on these variants.
- Cold-starts the final inference on $\mathcal{G}$ with the adapted policy weights.
Mathematically, the system solves for an optimal local parameter shift $\Delta\theta^{*}$ that maximizes the expected reward of the adapted policy $\pi_{\theta + \Delta\theta}$ on the generated variants before the final inference on $\mathcal{G}$.
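A schematic of these three steps, with the variant generator, the micro-RL update, and the final decoding passed in as hypothetical callables; none of these names correspond to a documented API.

```python
import copy

def test_time_adapt(policy, hard_task, make_variants, micro_update, infer, m=8, steps=4):
    """Adapt a throwaway copy of the policy on simpler variants, then answer hard_task."""
    variants = make_variants(hard_task, m)   # 1. generate m simpler variants on-the-fly
    adapted = copy.deepcopy(policy)          # keep the base weights untouched
    for _ in range(steps):                   # 2. micro-RL update on the variants
        micro_update(adapted, variants)
    return infer(adapted, hard_task)         # 3. cold-start inference on G with adapted weights
```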
### 3. Process-Aware Reward Shaping
Unlike binary "sparse" reward systems, we employ Dense Process Supervision. Every primitive transformation (e.g. $u$-substitution, integration by parts) is identified as a logical node.
The reward $R_{\text{shape}}$ is assigned as a line integral over the reasoning trajectory $\tau$:

$$R_{\text{shape}} = \int_{\tau} \Psi(s_t \to s_{t+1}) \, \mathrm{d}t$$

where $\Psi$ evaluates the structural validity of each state transition relative to the ground-truth simplification steps.
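A compact sketch of the dense, per-transition accumulation; `is_valid_transition` stands in for $\Psi$ (comparison against ground-truth simplification steps), and the step bonus is an illustrative value.

```python
def shaped_reward(trajectory, is_valid_transition, step_bonus=0.2):
    """Accumulate dense reward over every (state -> next_state) transition in tau."""
    reward = 0.0
    for state, next_state in zip(trajectory, trajectory[1:]):
        if is_valid_transition(state, next_state):   # stand-in for Psi(s_t -> s_{t+1})
            reward += step_bonus
    return reward
```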
### 4. Hard Negative Mining (Problem Persistence)
Failed tasks $\mathcal{T}_{\text{fail}}$ are not discarded. They are prioritized in the sampling buffer with a weight $W$ proportional to their failure frequency:

$$W(\mathcal{T}_{\text{fail}}) \propto \mathrm{freq}_{\text{fail}}(\mathcal{T}_{\text{fail}})$$

This forces the policy to repeatedly encounter "bottleneck" logic until the primitive is solved.
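A sketch of failure-weighted replay sampling, assuming a simple counter-backed buffer; the exact prioritization rule used by the environment may differ.

```python
import random
from collections import Counter

failure_counts = Counter()                   # T_fail -> number of failed attempts

def record_failure(task_id):
    failure_counts[task_id] += 1

def sample_task(task_ids):
    # Weight W grows with failure frequency; the +1 keeps unseen tasks reachable.
    weights = [1 + failure_counts[t] for t in task_ids]
    return random.choices(task_ids, weights=weights, k=1)[0]
```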
## System Architecture
The environment architecture follows a strictly decoupled schema between task generation, solution validation, and policy refinement.
```mermaid
graph TD
    subgraph EnvCore [Mathematical Environment Server]
        GE["Symbolic Generator (Sympy)"] -->|"Sample T"| Server["OpenEnv API (FastAPI)"]
        Server -->|"Verify F(x)"| VR["Numerical Verifier"]
        VR -->|"Law: FTC Derivative Test"| Server
        Server -->|"Compute Sum(R)"| RW["Reward Logic Engine"]
        RW --> Server
    end
    subgraph PolicyNode [Reinforcement Learning Client]
        Policy["Policy pi(theta)"] -->|"Action Trace (tau)"| Server
        Server -->|"Reward Observation"| Policy
    end
    classDef space fill:transparent,stroke:#9370DB,stroke-width:2px;
    classDef client fill:transparent,stroke:#008B8B,stroke-width:2px;
    class EnvCore space
    class PolicyNode client
```
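A hypothetical client-side loop against the environment server. The `/reset` and `/step` routes and the payload fields are assumptions made for illustration, not the documented OpenEnv schema.

```python
import requests

BASE_URL = "http://localhost:7860"           # matches the Space's app_port

def run_episode(solve):
    """One episode: reset to sample a task, then act until the server reports done."""
    obs = requests.post(f"{BASE_URL}/reset").json()
    done, reward = False, 0.0
    while not done:
        action = solve(obs)                  # the policy produces a solution / trace tau
        obs = requests.post(f"{BASE_URL}/step", json={"action": action}).json()
        done = obs.get("done", True)
        reward = obs.get("reward", 0.0)
    return reward
```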
## Systemic Logic: Recursive Difficulty Ascent
The environment operates via Autonomous Difficulty Scaling. Instead of fixed-difficulty benchmarks, a problem $\mathcal{T}$ is decomposed into a hierarchical tree of simpler primitives. For any parent problem $\mathcal{T}_{\text{p}}$ that fails to elicit a reward, the system generates a set of variants $\{\mathcal{T}_i\}$ such that the complexity metric $\mathcal{M}$ satisfies:

$$\mathcal{M}(\mathcal{T}_i) < \mathcal{M}(\mathcal{T}_{\text{p}}) \quad \forall\, i$$
This ensures a continuous gradient for the learner, moving from fundamental algebraic identities to nested transcendental integrals.
## 🎯 The Reward Law
The terminal reward $R_{\Sigma}$ is a weighted composite of seven distinct mathematical and structural signals, designed to penalize reward hacking and reward rigorous, proof-like trajectories:

$$R_{\Sigma} = \alpha C + \beta Q + \gamma P + \delta R_{\text{ref}} + \eta D + \zeta E + \lambda X$$

where the weights are calibrated as $\alpha=0.35$, $\beta=0.15$, $\gamma=0.1$, $\delta=0.1$, $\eta=0.15$, $\zeta=0.05$, $\lambda=0.1$.
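The composite in code. Pairing each weight with its signal by listing order is an assumption; the component values are taken to be pre-computed scalars.

```python
WEIGHTS = {"C": 0.35, "Q": 0.15, "P": 0.10, "R_ref": 0.10,
           "D": 0.15, "E": 0.05, "X": 0.10}

def terminal_reward(components):
    """components maps each signal name (C, Q, P, R_ref, D, E, X) to its scalar value."""
    return sum(weight * components[name] for name, weight in WEIGHTS.items())
```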
### 1. Fundamental Correctness ($C$)
Derived from the Numerical Multi-point Quadrature Protocol. A predicted solution $F_{\theta}(x)$ is verified against the target integrand $f(x)$ through the derivative identity:

$$\frac{d}{dx} F_{\theta}(x_i) \approx f(x_i) \quad \forall\, x_i \in \mathbb{X}$$

where $\mathbb{X} = \{x_1, \dots, x_5\}$ is a set of random points sampled from $\mathcal{U}(-5, 5)$.
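A sketch of the multi-point derivative check with SymPy; the numeric tolerance is an illustrative choice.

```python
import random
import sympy as sp

x = sp.Symbol("x")

def verify(F_pred, f_target, n_points=5, tol=1e-6):
    """Check d/dx F_pred(x_i) ~= f_target(x_i) at random points drawn from U(-5, 5)."""
    dF = sp.diff(F_pred, x)
    for _ in range(n_points):
        xi = random.uniform(-5, 5)
        if abs(float(dF.evalf(subs={x: xi})) - float(f_target.evalf(subs={x: xi}))) > tol:
            return False
    return True

# The derivative identity ignores the constant of integration:
print(verify(sp.sin(x)**2 / 2, sp.sin(x) * sp.cos(x)))   # True
```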
### 2. Reasoning Formatting ($Q$)
Calculates the structural density of the reasoning trace, using a hyperbolic tangent squashing function to keep the contribution of heuristic markers bounded.
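A minimal sketch of a tanh-squashed formatting signal; the marker list and the scaling constant are assumptions for illustration.

```python
import math

MARKERS = ("let u =", "substitut", "by parts", "therefore", "thus")

def formatting_reward(trace, scale=4.0):
    hits = sum(trace.lower().count(m) for m in MARKERS)   # heuristic structure markers
    return math.tanh(hits / scale)                        # bounded in [0, 1) for hits >= 0
```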
### 3. Process Supervision ($P$)
Assigns a scalar reward for explicit step-wise transition logic. It algorithmically penalizes "Inferential Jumps" where the ratio of reasoning tokens to mathematical complexity falls below a critical threshold.
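One way the inferential-jump penalty could look; the threshold and penalty values are assumptions, not calibrated constants from the environment.

```python
def process_reward(reasoning_tokens, problem_complexity, threshold=8.0, penalty=-0.25):
    """Penalize traces that spend too few reasoning tokens per unit of problem complexity."""
    ratio = reasoning_tokens / max(problem_complexity, 1)
    return 1.0 if ratio >= threshold else penalty
```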
### 4. Reflection Logits ($R_{\text{ref}}$)
Rewards the presence of self-correction tokens when they lead to a terminal state correction. If the model reflects ($r=1$) but fails to correct the solution ($c=0$), it suffers a penalty of $-0.5$.
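A sketch of the reflection signal. Only the $-0.5$ case is stated above; the values for the other branches are assumptions.

```python
def reflection_reward(reflected, corrected):
    if reflected and corrected:
        return 1.0        # self-correction that fixes the terminal answer (assumed value)
    if reflected and not corrected:
        return -0.5       # penalty stated in the text: reflection without correction
    return 0.0            # no reflection tokens present (assumed value)
```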
### 5. Trajectory Diversity ($D$)
Prevents the policy from converging on rote-memorized repetitive strings. If the current answer $A_t$ has already been seen in the history $\mathcal{H}$, an exponential penalty is applied.
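A sketch of the diversity term, assuming the history $\mathcal{H}$ is a simple counter of previously produced answers and that the reward decays exponentially with each repetition.

```python
import math
from collections import Counter

history = Counter()                          # answer string -> times produced so far (H)

def diversity_reward(answer):
    repeats = history[answer]
    history[answer] += 1
    return math.exp(-repeats)                # 1.0 for a fresh answer, decays on repeats
```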
### 6. Information Density Efficiency ($E$)
Guides the model toward concise mathematical proofs using a Gaussian decay centered at an optimal token length $\phi = 50$.
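The Gaussian decay in code, centered at $\phi = 50$; the width $\sigma$ is an assumed parameter.

```python
import math

def efficiency_reward(num_tokens, phi=50, sigma=25):
    # Peak reward at the optimal length phi, decaying with a Gaussian profile.
    return math.exp(-((num_tokens - phi) ** 2) / (2 * sigma ** 2))
```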
### 7. Global Exploration Bonus ($X$)
Rewards token-level variance relative to the frequency of problem encounters $s$, where $\nu$ is the ratio of unique tokens in the reasoning trace $\tau$.
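A sketch of the exploration bonus; the exact discounting in the encounter count $s$ is an assumption.

```python
def exploration_bonus(trace_tokens, encounter_count):
    nu = len(set(trace_tokens)) / max(len(trace_tokens), 1)   # unique-token ratio nu
    return nu / (1 + encounter_count)                          # shrinks as encounters s grow
```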
## The Interaction Loop
The environment manages the Difficulty Gradient to ensure the policy $\pi_{\theta}$ maintains exploration stability.
```mermaid
sequenceDiagram
    participant Model as Policy (pi)
    participant Engine as Recursion Engine
    participant Oracle as Calculus Verifier
    loop Optimization Batch
        Engine ->> Model: Sample Low-Complexity Variant
        Model ->> Oracle: Submit Solution F(x)
        Oracle -->> Model: Correctness Yield (C)
        Note over Model: Internal State Update
        Engine ->> Model: Sample Root Complexity Task
        Model ->> Oracle: Proof Trajectory (tau)
        Oracle -->> Model: Composite Reward (R)
    end
```
## 💻 Running the Environment
### 1. Launch the Environment
```bash
# Install local calculus bindings
uv pip install -e .

# Start the environment server
uv run server
```
### 2. Training Initiation
```bash
# Executes the recursive training sequence
python train/train_grpo.py
```