Merge branch 'main' of https://github.com/pratap-nitjsr/AutoMathReasoner

README.md (CHANGED)

The environment implements several advanced logic-steering protocols to ensure convergence on complex mathematical primitives.

### 1. Recursive Difficulty Ascent (LADDER)

The system employs a **Recursive Task Decomposition** mechanism in which a failure on a parent task $\mathcal{T}_p$ triggers a search for a solvable basis $\{\mathcal{T}_1, \dots, \mathcal{T}_k\}$.

Given a complexity operator $\Phi$, we satisfy:

$$\Phi(\mathcal{T}_p) = \sum_{i=1}^{k} \omega_i \Phi(\mathcal{T}_i)$$

where the variants $\mathcal{T}_i$ act as "stepping stones" that allow the policy to acquire base identities before attempting the coupled root problem.
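
As a rough illustration of the descent loop, here is a minimal Python sketch; `generate_variants` and `attempt` are hypothetical placeholders for the environment's variant generator and a policy rollout, not the project's actual API:

```python
import random

def generate_variants(task: str, k: int = 3) -> list[str]:
    """Hypothetical stand-in: produce k simpler variants of a failed task."""
    return [f"{task}::variant_{i}" for i in range(k)]

def attempt(task: str) -> bool:
    """Hypothetical stand-in for a policy rollout; True means solved."""
    return random.random() < 0.5

def ladder_solve(task: str, depth: int = 0, max_depth: int = 3) -> bool:
    """Recursive Difficulty Ascent: descend to easier variants on failure."""
    if attempt(task):
        return True                      # reward elicited, stop descending
    if depth >= max_depth:
        return False                     # budget exhausted on this branch
    # A failed parent triggers a search for a solvable basis {T_1, ..., T_k}.
    basis_solved = all(ladder_solve(v, depth + 1, max_depth)
                       for v in generate_variants(task))
    # Re-attempt the parent once the stepping stones are acquired.
    return basis_solved and attempt(task)

print(ladder_solve("integrate(x * exp(x**2), x)"))
```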

### 2. Test-Time Adaptive Policy (TTRL)

For "truly difficult" integrals at the boundary of the model's current capability, the system supports **Inference-Time Group Optimization**. When presented with a novel hard task $\mathcal{G}$, the model:

3. Cold-starts the final inference on $\mathcal{G}$ with the adapted policy weights.

Mathematically, we solve for an optimal local parameter shift:

$$\theta^* = \arg \max_{\theta'} \mathbb{E}_{\mathcal{T} \sim \text{variants}(\mathcal{G})} \left[ R(\tau, \pi_{\theta'}) \right]$$
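
A schematic version of this adaptation step, sketched with PyTorch under the assumption of a gradient-based policy; the linear `policy`, `make_variants`, and `reward` below are illustrative placeholders rather than the project's actual components:

```python
import copy
import torch

policy = torch.nn.Linear(8, 8)  # stand-in for the reasoning policy

def make_variants(hard_task: torch.Tensor, n: int = 4) -> list[torch.Tensor]:
    """Placeholder: generate nearby variants of the hard task G."""
    return [hard_task + 0.1 * torch.randn_like(hard_task) for _ in range(n)]

def reward(output: torch.Tensor) -> torch.Tensor:
    """Placeholder for R(tau, pi); kept differentiable for simplicity."""
    return -output.pow(2).mean()

def ttrl_adapt(hard_task: torch.Tensor, steps: int = 10, lr: float = 1e-2):
    """theta* = argmax_{theta'} E_{T ~ variants(G)}[R], by local gradient ascent."""
    adapted = copy.deepcopy(policy)       # keep the base weights untouched
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        batch = torch.stack(make_variants(hard_task))
        loss = -reward(adapted(batch))    # minimize the negated reward
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adapted

hard_task = torch.randn(8)                # stand-in encoding of G
adapted_policy = ttrl_adapt(hard_task)
final_answer = adapted_policy(hard_task)  # cold-start inference on G
```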

### 3. Process-Aware Reward Shaping

Unlike binary "sparse" reward systems, we employ **Dense Process Supervision**. Every primitive transformation (e.g., $u$-substitution, integration by parts) is identified as a logical node.

The reward $R_{\text{shape}}$ is assigned as the line integral over the reasoning trajectory $\tau$:

$$R_{\text{shape}} = \int_{\tau} \Psi(\mathbf{z}) \, d\mathbf{z}$$

where $\Psi$ evaluates the structural validity of each state transition relative to the ground-truth simplification steps.
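
For a discrete reasoning chain, the line integral reduces to a sum of per-transition validity scores. A minimal sketch, with `psi` as a hypothetical stand-in for the structural-validity scorer:

```python
def psi(prev_state: str, next_state: str) -> float:
    """Hypothetical scorer for one transformation step; real scoring would
    compare the transition against ground-truth simplification steps."""
    return 1.0 if ("by_parts" in next_state or "u_sub" in next_state) else 0.1

def shaped_reward(trajectory: list[str]) -> float:
    """Approximate R_shape as a sum over consecutive state transitions of tau."""
    return sum(psi(a, b) for a, b in zip(trajectory, trajectory[1:]))

tau = ["integral(x*cos(x))", "by_parts:u=x,dv=cos(x)dx", "x*sin(x)+cos(x)+C"]
print(shaped_reward(tau))  # 1.0 for the valid primitive step + 0.1 -> 1.1
```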

### 4. Hard Negative Mining (Problem Persistence)

Failed tasks $\mathcal{T}_{\text{fail}}$ are not discarded. They are prioritized in the sampling buffer with a weight $W$ proportional to their failure frequency:

$$W(\mathcal{T}) \propto e^{\lambda \cdot \text{failures}(\mathcal{T})}$$

This forces the policy to repeatedly encounter "bottleneck" logic until the primitive is solved.
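
A small sketch of how such exponential prioritization can be realized in a sampling buffer; the tasks, failure counts, and `LAMBDA` value are invented for illustration:

```python
import math
import random
from collections import Counter

LAMBDA = 0.5  # illustrative temperature for the failure-count exponent
failures = Counter({"int(sec(x)**3)": 4, "int(x*exp(x))": 2, "int(x**2)": 0})

def sampling_weights(fail_counts: Counter) -> dict[str, float]:
    """W(T) proportional to exp(lambda * failures(T)), normalized to sum to 1."""
    raw = {t: math.exp(LAMBDA * n) for t, n in fail_counts.items()}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}

weights = sampling_weights(failures)
task = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
print(weights, "->", task)  # the 4-failure "bottleneck" task dominates
```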

---

## 🔁 Systemic Logic: Recursive Difficulty Ascent

The environment operates via **Autonomous Difficulty Scaling**. Instead of fixed-difficulty benchmarks, a problem $\mathcal{T}$ is decomposed into a hierarchical tree of simpler primitives. For any parent problem $\mathcal{T}_p$ that fails to elicit a reward, the system generates a set of variants $\{\mathcal{T}_i\}$ such that the complexity metric $\mathcal{M}$ satisfies:

The terminal reward $R_{\Sigma}$ is a weighted composite of seven distinct mathematical and structural signals, designed to penalize reward hacking and to favor rigorous, proof-like trajectories:

$$R_{\Sigma} = \alpha C + \beta Q + \gamma P + \delta R_{\text{ref}} + \eta D + \zeta E + \lambda X$$

where the weights are calibrated as $\alpha=0.35,\ \beta=0.15,\ \gamma=0.1,\ \delta=0.1,\ \eta=0.15,\ \zeta=0.05,\ \lambda=0.1$.
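
For concreteness, the composite collapses to a simple weighted sum; the component scores below are invented purely to show the arithmetic (the weights sum to 1.0, so $R_{\Sigma}$ stays in $[0, 1]$ for scores in that range):

```python
# Calibrated weights from the reward definition above.
WEIGHTS = {"C": 0.35, "Q": 0.15, "P": 0.10, "R_ref": 0.10,
           "D": 0.15, "E": 0.05, "X": 0.10}

# Hypothetical component scores for a single rollout.
signals = {"C": 1.0, "Q": 0.8, "P": 0.6, "R_ref": 1.0,
           "D": 0.7, "E": 1.0, "X": 0.5}

r_sigma = sum(WEIGHTS[k] * signals[k] for k in WEIGHTS)
print(f"R_sigma = {r_sigma:.3f}")  # -> R_sigma = 0.835
```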

### 1. Fundamental Correctness ($C$)

Derived from the **Numerical Multi-point Quadrature Protocol**: a predicted solution $F_{\theta}(x)$ is verified against the target integrand $f(x)$ through the derivative identity:

$$C = \begin{cases} 1.0 & \text{if } \forall x_i \in \mathbb{X}, \quad \left| \frac{d}{dx}F_{\theta}(x_i) - f(x_i) \right| < 10^{-2} \\ 0.0 & \text{otherwise} \end{cases}$$

where $\mathbb{X} = \{x_1, \dots, x_5\}$ is a set of random points sampled from $\mathcal{U}(-5, 5)$.
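
A minimal sketch of this check using SymPy (an assumption; the Space's actual verifier may differ): differentiate the predicted antiderivative symbolically, then compare it to the integrand at five random points:

```python
import random
import sympy as sp

x = sp.Symbol("x")
f = x * sp.cos(x)                   # target integrand f(x)
F_pred = x * sp.sin(x) + sp.cos(x)  # model's predicted antiderivative

def correctness(F, f, tol=1e-2, n_points=5) -> float:
    """C = 1.0 iff |F'(x_i) - f(x_i)| < tol at every sampled point x_i."""
    dF = sp.diff(F, x)
    for _ in range(n_points):
        xi = random.uniform(-5.0, 5.0)  # x_i ~ U(-5, 5)
        if abs(float(dF.subs(x, xi)) - float(f.subs(x, xi))) >= tol:
            return 0.0
    return 1.0

print(correctness(F_pred, f))  # -> 1.0; d/dx[x*sin(x)+cos(x)] == x*cos(x)
```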