Harshit Shrivastav committed

Commit fbabfde · unverified · 1 parent: a553bf0

Fix formatting issues in README


Corrected formatting issues and improved clarity in mathematical expressions throughout the README.

Files changed (1)

README.md +11 -8
README.md CHANGED
@@ -19,11 +19,13 @@ pinned: false
The environment implements several advanced logic-steering protocols to ensure convergence on complex mathematical primitives.

### 1. Recursive Difficulty Ascent (LADDER)
- The system employs a **Recursive Task Decompositon** mechanism where a failure on a parent task $\mathcal{T}_p$ triggers a search for a solvable basis $\{\mathcal{T}_1, \dots, \mathcal{T}_k\}$.
+ The system employs a **Recursive Task Decomposition** mechanism where a failure on a parent task $\mathcal{T}_p$ triggers a search for a solvable basis $\{\mathcal{T}_1, \dots, \mathcal{T}_k\}$.

Given a complexity operator $\Phi$, we satisfy:
+
$$\Phi(\mathcal{T}_p) = \sum_{i=1}^n \omega_i \Phi(\mathcal{T}_i)$$
+
+ Where variants $\mathcal{T}_i$ represent "stepping stones" that allow the policy to acquire base identities before attempting the coupled root problem.
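
The failure-triggered search above reads naturally as a bounded recursion. A minimal Python sketch, where `is_solvable` (the policy's success check) and `make_variants` (the variant generator) are hypothetical helpers, not names from this repo:

```python
# Minimal sketch only: `is_solvable` and `make_variants` are hypothetical
# stand-ins for the policy success check and the variant generator.
def solvable_basis(task, is_solvable, make_variants, depth=3):
    """Recursively split a failed parent task T_p into a basis {T_1, ..., T_k}."""
    if is_solvable(task) or depth == 0:
        return [task]              # usable primitive, or recursion budget spent
    basis = []
    for variant in make_variants(task):     # candidate "stepping stones" T_i
        basis += solvable_basis(variant, is_solvable, make_variants, depth - 1)
    return basis                   # trained on first; the parent T_p is retried after
```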

### 2. Test-Time Adaptive Policy (TTRL)
For "truly difficult" integrals at the boundary of the model's current capability, the system supports **Inference-Time Group Optimization**. When presented with a novel hard task $\mathcal{G}$, the model:

@@ -32,16 +34,17 @@ For "truly difficult" integrals at the boundary of the model's current capabilit
3. Cold-starts the final inference on $\mathcal{G}$ with the adapted policy weights.

Mathematically, we solve for an optimal local parameter shift:
- $$\theta^* = \arg \max_{\theta'} \mathbb{E}_{\mathcal{T} \sim \text{variants}(\mathcal{G})} [ R(\tau, \pi_{\theta'}) ]$$
+
+ $$\theta^* = \arg \max_{\theta'} \mathbb{E}_{\mathcal{T} \sim \text{variants}(\mathcal{G})} \left[ R(\tau, \pi_{\theta'}) \right]$$
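
One way to read this objective: take a handful of policy-gradient steps on variant rollouts, then attempt the real task. A rough PyTorch-flavored sketch; `make_variants`, `rollout`, and `reward` are hypothetical callables, and the single-sample REINFORCE update is only the simplest stand-in for a group optimizer:

```python
import copy
import random

import torch

def adapt_then_solve(policy, hard_task, make_variants, rollout, reward,
                     steps=8, lr=1e-5):
    """Approximate theta* by reward-weighted ascent on self-generated variants."""
    adapted = copy.deepcopy(policy)          # leave the base weights untouched
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        task = random.choice(make_variants(hard_task))   # T ~ variants(G)
        trajectory, log_prob = rollout(adapted, task)    # tau and its log-probability
        loss = -reward(trajectory) * log_prob            # ascend E[R(tau, pi_theta')]
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adapted       # cold-start the final inference on G with these weights
```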

### 3. Process-Aware Reward Shaping
- Unlike binary "sparse" reward systems, we employ **Dense Process Supervision**. Every primitive transformation (e.g. $u$-substitution, integration by parts) is identified as a logical node.
+ Unlike binary "sparse" reward systems, we employ **Dense Process Supervision**. Every primitive transformation (e.g. $u$-substitution, integration by parts) is identified as a logical node.

The reward $R_{\text{shape}}$ is assigned as the line integral over the reasoning trajectory $\tau$:
$$R_{\text{shape}} = \int_{\tau} \Psi(\mathbf{z}) d\mathbf{z}$$
where $\Psi$ evaluates the structural validity of each state transition relative to the ground-truth simplification steps.
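
Over a discrete reasoning trace, the line integral reduces to a sum over transitions. A minimal sketch, with `psi` a hypothetical per-transition validity scorer returning values in $[0, 1]$:

```python
# Sketch: approximate R_shape = ∫_tau Ψ(z) dz as a sum over state transitions,
# with `psi(prev, curr)` a hypothetical scorer of each simplification step.
def shaped_reward(states, psi):
    return sum(psi(prev, curr) for prev, curr in zip(states, states[1:]))
```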

- ### 4. Hard Negative mining (Problem Persistence)
+ ### 4. Hard Negative Mining (Problem Persistence)
Failed tasks $\mathcal{T}_{fail}$ are not discarded. They are prioritized in the sampling buffer with a weight $W$ proportional to their failure frequency:
$$W(\mathcal{T}) \propto e^{\lambda \cdot \text{failures}(\mathcal{T})}$$
This forces the policy to repeatedly encounter "bottleneck" logic until the primitive is solved.
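
The exponential weight is straightforward to implement; a sketch in which `failures` is an assumed per-task counter and `lam` plays the role of $\lambda$:

```python
import math
import random

def sample_bottleneck(tasks, failures, lam=0.5):
    """Draw a task with probability proportional to exp(lam * failures(T))."""
    weights = [math.exp(lam * failures.get(task, 0)) for task in tasks]
    return random.choices(tasks, weights=weights, k=1)[0]
```

Because the weight grows exponentially, a repeatedly failing task quickly dominates the draw, which is exactly the "persistence" behavior described above.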

@@ -76,7 +79,7 @@ graph TD

---

- ## 🏗️ Systemic Logic: Recursive Difficulty Ascent
+ ## 🔁 Systemic Logic: Recursive Difficulty Ascent

The environment operates via **Autonomous Difficulty Scaling**. Instead of fixed-difficulty benchmarks, a problem $\mathcal{T}$ is decomposed into a hierarchical tree of simpler primitives. For any parent problem $\mathcal{T}_{\text{p}}$ that fails to elicit a reward, the system generates a set of variants $\{\mathcal{T}_i\}$ such that the complexity metric $\mathcal{M}$ satisfies:

@@ -90,14 +93,14 @@ This ensures a continuous gradient for the learner, moving from fundamental alge

The terminal reward $R_{\Sigma}$ is a weighted composite of seven distinct mathematical and structural signals, designed to penalize hacking and reward rigorous proof-like trajectories:

- $$R_{\Sigma} = \alpha C + \beta Q + \gamma P + \delta R_{\text{ref}} + \eta D + \zeta E + \lambda X + \epsilon$$
+ $$R_{\Sigma} = \alpha C + \beta Q + \gamma P + \delta R_{\text{ref}} + \eta D + \zeta E + \lambda X$$

Where the weights are calibrated as $\alpha=0.35, \beta=0.15, \gamma=0.1, \delta=0.1, \eta=0.15, \zeta=0.05, \lambda=0.1$.
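
The composite is a plain weighted sum (the calibrated weights total 1.0); transcribed directly, with the seven signal values assumed to arrive as precomputed floats:

```python
# Weights from the formula above, keyed by signal symbol; they sum to 1.0.
WEIGHTS = {"C": 0.35, "Q": 0.15, "P": 0.10, "R_ref": 0.10,
           "D": 0.15, "E": 0.05, "X": 0.10}

def terminal_reward(signals):
    """R_Sigma = alpha*C + beta*Q + gamma*P + delta*R_ref + eta*D + zeta*E + lambda*X."""
    return sum(weight * signals[name] for name, weight in WEIGHTS.items())
```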

### 1. Fundamental Correctness ($C$)
Derived from the **Numerical Multi-point Quadrature Protocol**. A predicted solution $F_{\theta}(x)$ is verified against the target integrand $f(x)$ through the derivative identity:

- $$C = \begin{cases} 1.0 & \text{if } \forall x_i \in \mathbb{X}, \quad | \frac{d}{dx}F_{\theta}(x_i) - f(x_i) | < 10^{-2} \\ 0.0 & \text{otherwise} \end{cases}$$
+ $$C = \begin{cases} 1.0 & \text{if } \forall x_i \in \mathbb{X}, \quad \left| \frac{d}{dx}F_{\theta}(x_i) - f(x_i) \right| < 10^{-2} \\ 0.0 & \text{otherwise} \end{cases}$$

Where $\mathbb{X} = \{x_1, \dots, x_5\}$ is a set of random points sampled from $\mathcal{U}(-5, 5)$.
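
The check is easy to reproduce numerically; a sketch assuming the candidate antiderivative and the integrand are plain Python callables, with a central difference standing in for the exact derivative:

```python
import random

def fundamental_correctness(F_theta, f, n=5, h=1e-4, tol=1e-2):
    """C = 1.0 iff |d/dx F_theta(x_i) - f(x_i)| < tol at n points from U(-5, 5)."""
    for _ in range(n):
        x = random.uniform(-5.0, 5.0)
        dF = (F_theta(x + h) - F_theta(x - h)) / (2 * h)   # central difference
        if abs(dF - f(x)) >= tol:
            return 0.0
    return 1.0
```

For example, `fundamental_correctness(lambda x: x**2 / 2, lambda x: x)` returns `1.0`, since the candidate's derivative matches the integrand everywhere.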