File size: 18,734 Bytes
6b4044c | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 | # Purpose Learning: A Formal Framework for Self-Improving Agents Without Weight Updates
> **Abstract.** We present Purpose Learning, a framework where LLM-based agents improve their performance across tasks by accumulating tested, scoped, versioned heuristics in an external memory β without any gradient updates. The agent's improvement is driven by a Purpose Function Ξ¦(s) that evaluates intermediate state progress, not just binary task outcomes. We formalize this as a Purpose-MDP, prove monotone convergence under bounded conditions, establish the existence of library fixed points, and show constant inference cost regardless of accumulated knowledge. The framework connects to potential-based reward shaping (Ng et al., 1999), provides a non-parametric alternative to RLHF, and introduces Mixture-of-Heuristics (MoH) β a sparse activation pattern analogous to Mixture-of-Experts.
---
## 1. Introduction
Standard approaches to improving LLM agents require either:
- **Weight updates** (fine-tuning, RLHF, DPO) β expensive, requires training infrastructure
- **Prompt engineering** β manual, doesn't scale, doesn't learn from experience
Purpose Learning is a third path: **the agent improves by accumulating external memory** that augments its prompts. Each task produces a trace. Good traces are distilled into heuristics. Heuristics are immune-scanned, quarantined, replay-tested, and promoted. Promoted heuristics enter the agent's prompt via a token-budgeted compiler.
The key property: **knowledge grows; compute stays flat.**
---
## 2. The Purpose-MDP
### 2.1 Definition
A **Purpose-MDP** is a tuple $(S, A, T, \Phi, D, C, \mathcal{H}, K)$ where:
| Symbol | Definition | Type |
|--------|-----------|------|
| $S$ | State space | Set |
| $A$ | Action space | Set |
| $T: S \times A \to S$ | Transition function (environment) | Function |
| $\Phi: S \times S \times G \to [0, 10]$ | Purpose Function (bounded state evaluator) | Function |
| $D: \tau \to \mathcal{H}^*$ | Distillation (trajectory β heuristic set) | Function |
| $C: P \times \mathcal{H}^* \to P$ | Composition (prompt + heuristics β new prompt) | Function |
| $\mathcal{H}$ | Heuristic library (accumulated knowledge) | Set |
| $K$ | MoH capacity bound (max active heuristics per step) | Integer |
| $G$ | Goal/purpose description | String |
### 2.2 The Learning Rule
At iteration $n$, the agent executes with prompt $p_n$ (which includes the top-K heuristics from $\mathcal{H}$), producing trajectory $\tau_n$. The update rule is:
$$p_{n+1} = C(p_n, D(\tau_n))$$
where $D(\tau_n)$ extracts new heuristics from the trajectory and $C$ composes them into the prompt under a token budget.
### 2.3 Heuristic Selection (Mixture-of-Heuristics)
Not all heuristics are included in every prompt. The MoH selection rule:
$$\mathcal{H}_{\text{active}} = \text{TopK}_{h \in \mathcal{H}}\left[Q(h) \cdot \text{sim}(h, g) \cdot \text{trust}(h)\right]$$
where:
- $Q(h)$ is the learned utility score (Monte Carlo Q-value)
- $\text{sim}(h, g)$ is the similarity between the heuristic and the current goal
- $\text{trust}(h)$ is the immune-verified trust score
This is structurally analogous to Mixture-of-Experts (Shazeer et al., 2017; DeepSeek-V2, 2024): out of $|\mathcal{H}|$ total heuristics, only $K$ are activated per step, achieving sparse selection with sublinear cost.
### 2.4 Q-Value Update
Each heuristic's utility is updated via Monte Carlo:
$$Q_{n+1}(h) = Q_n(h) + \alpha \cdot (r_n - Q_n(h))$$
where $r_n = 1$ if the task succeeded while $h$ was in the prompt, $r_n = 0$ otherwise, and $\alpha$ is the learning rate. This follows the REMEMBERER formulation (Zhu et al., 2023).
**Credit assignment**: Only heuristics that were included in the compiled prompt (returned by `PromptCompiler.included_memory_ids`) receive Q-value updates. Heuristics not in context cannot take credit for outcomes they didn't influence.
---
## 3. Axioms
| # | Axiom | Formal Statement | Enforced By |
|---|-------|-----------------|-------------|
| **A1** | Bounded Ξ¦ | $\forall s, s' \in S, g \in G: \Phi(s, s', g) \in [0, 10]$ | `purpose_function.py` clamps all outputs |
| **A2** | Consistency | $s = s' \Rightarrow \Phi(s, \cdot, g) = \Phi(s', \cdot, g)$ | Ξ¦ cache + temperature=0 |
| **A3** | Directional Distillation | $\Phi(\tau) \geq \theta \Rightarrow D(\tau) \neq \emptyset$ and heuristics from good trajectories are non-harmful | Optimizer threshold + immune scan |
| **A4** | Bounded Capacity | $|\mathcal{H}_{\text{active}}| \leq K$ for constant $K$ | MoH TopK selection |
| **A5** | Q-Convergence | Under standard Robbins-Monro conditions ($\sum \alpha_n = \infty, \sum \alpha_n^2 < \infty$), Q-values converge | Standard stochastic approximation (Robbins & Monro, 1951) |
---
## 4. Theorems
### Theorem 1 (Monotone Improvement)
**Statement:** Under axioms A1-A5, the expected Ξ¦ score is eventually non-decreasing:
$$\exists N: \forall n \geq N, \quad \mathbb{E}[\Phi^{(n+1)}] \geq \mathbb{E}[\Phi^{(n)}] - \epsilon$$
for arbitrarily small $\epsilon > 0$.
**Proof sketch:**
1. By A1, $\Phi^{(n)} \in [0, 10]$ for all $n$. The sequence $\{\mathbb{E}[\Phi^{(n)}]\}$ is bounded above.
2. By A3 (directional distillation): when a trajectory achieves $\Phi \geq \theta$, the extracted heuristics are non-harmful. Including them in future prompts cannot decrease expected Ξ¦ below the pre-heuristic baseline (because the agent can always ignore unhelpful heuristics β they're in the prompt but not mandatory).
3. By A5, Q-values converge. Once Q-values stabilize, the MoH selection stabilizes: the same set of heuristics is selected for the same type of task. The prompt becomes fixed, so Ξ¦ scores become stationary.
4. Between the initial phase (noisy Q-values, unstable selection) and convergence, the Q-values track empirical success rates. Heuristics that help get higher Q-values and are selected more often. Heuristics that hurt get lower Q-values and are displaced. This produces a monotone improvement in expectation.
5. By the Monotone Convergence Theorem: a bounded, eventually non-decreasing sequence converges. β
**Connection to Ng et al. (1999):** Our $\Delta\Phi = \Phi(s') - \Phi(s)$ is exactly the potential-based reward shaping function $F = \gamma\Phi(s') - \Phi(s)$ with $\gamma = 1$. By the PBRS theorem, this shaping preserves the optimal policy: the heuristics don't change what the optimal action IS, they just help the agent find it faster by providing denser reward signal.
**Connection to Wiewiora et al. (2003):** PBRS is equivalent to Q-value initialization. Our heuristic injection into the prompt IS Q-value initialization for prompt-based agents: we're telling the agent "actions of this type tend to work" which is equivalent to initializing Q-values to non-zero for promising state-action pairs.
### Theorem 2 (Library Fixed Point)
**Statement:** Under A4-A5, there exists a fixed point $\mathcal{H}^*$ such that the heuristic library stabilizes:
$$\exists \mathcal{H}^*: T(\mathcal{H}^*) \approx \mathcal{H}^*$$
where $T$ is the full update operator (execute β distill β merge β prune).
**Proof sketch:**
1. By A4, $|\mathcal{H}_{\text{active}}| \leq K$. The active set is drawn from a finite ranking of heuristics.
2. By A5, Q-values converge. Once converged, the ranking $Q(h_1) \geq Q(h_2) \geq \ldots$ is stable.
3. A stable ranking means the top-K selection is stable: the same K heuristics are selected.
4. With a stable prompt, trajectories are statistically identical (same LLM, same prompt, same task distribution). Distilled heuristics are duplicates of existing ones β merge deduplicates them β the library doesn't grow.
5. The system reaches a fixed point where new heuristics are generated but immediately deduplicated or filtered. β
**Honesty note:** This is an approximate fixed point, not a unique one. Multiple "good enough" configurations exist (different subsets of K heuristics can produce similar Ξ¦ scores). The system converges to ONE of them, determined by the trajectory history.
### Theorem 3 (Constant Inference Cost)
**Statement:** Under A4, inference cost per step is $O(K)$ regardless of $|\mathcal{H}|$.
**Proof:**
1. The MoH selection is $O(|\mathcal{H}| \cdot d)$ where $d$ is the embedding dimension (for similarity computation). This is a linear scan, not quadratic.
2. The compiled prompt includes exactly $K$ heuristics. Prompt length is bounded by $O(K \cdot L_{\max})$ where $L_{\max}$ is the maximum heuristic length.
3. LLM inference cost is determined by prompt length, which is $O(K)$.
4. Therefore: $|\mathcal{H}|$ can grow to 1000, 10000, or more, but only $K$ (typically 5-15) are included per step. **Knowledge grows; compute stays flat.** β
**Analogy to DeepSeek MoE:** DeepSeek-V2 has 236B total parameters but activates only 21B per token (8.9%). Our MoH has $|\mathcal{H}|$ total heuristics but activates only $K$ per step. Both achieve constant cost with growing capacity.
---
## 5. Comparison: Purpose Learning vs Traditional RL
| Dimension | Traditional RL (PPO/DPO) | Purpose Learning |
|-----------|-------------------------|-----------------|
| **Learning signal** | Scalar reward at episode end | Dense Ξ¦(s) at every step |
| **Parameter updates** | Gradient descent on weights | Append to external memory |
| **Compute per step** | Forward + backward pass | Forward pass only + memory lookup |
| **Infrastructure** | GPU cluster for training | Single inference API |
| **Reversibility** | Irreversible (can't un-train) | Fully reversible (archive/reject memories) |
| **Interpretability** | Opaque weight changes | Every heuristic is human-readable |
| **Cross-task transfer** | Requires multi-task training | Automatic (heuristics are prompt text) |
| **Safety** | Reward hacking via gradient | Immune scan + quarantine pipeline |
| **Cost** | Training: $10K-$1M | Memory: ~0 (text storage) |
---
## 6. The MoH Architecture
```
βββββββββββββββββββββββββββββββββββββββ
β Heuristic Library β
β hβ(Q=0.9) hβ(Q=0.8) ... hβ(Q=0.1) β
βββββββββββββββββ¬ββββββββββββββββββββββ
β
TopK(Q Γ sim Γ trust)
β
βββββββββββββββββΌββββββββββββββββββββββ
β Active Heuristics (K=5) β
β hβ hβ hβ hββ hβ
β
βββββββββββββββββ¬ββββββββββββββββββββββ
β
βββββββββββββββββΌββββββββββββββββββββββ
β Token-Budgeted Prompt β
β [System] + [Active Heuristics] β
β + [Task] + [State] β
βββββββββββββββββ¬ββββββββββββββββββββββ
β
βββββββββββββββββΌββββββββββββββββββββββ
β LLM (frozen) β
β Action = LLM(compiled_prompt) β
βββββββββββββββββ¬ββββββββββββββββββββββ
β
βββββββββββββββββΌββββββββββββββββββββββ
β Environment β
β s' = T(s, action) β
βββββββββββββββββ¬ββββββββββββββββββββββ
β
βββββββββββββββββΌββββββββββββββββββββββ
β Purpose Function Ξ¦(s, s', g) β
β Score: [0, 10] β
βββββββββββββββββ¬ββββββββββββββββββββββ
β
Q-update for active heuristics
Distill new heuristics from trace
Immune scan β Quarantine β Promote
β
βββββββββββββββββΌββββββββββββββββββββββ
β Heuristic Library (updated) β
β hβ(Q=0.91) hβ(Q=0.78) ... hβββ β
βββββββββββββββββββββββββββββββββββββββ
```
---
## 7. Empirical Validation
Results from Track 2 benchmark suite (mock backend with realistic learning dynamics):
### Improvement Curves
| Task | Run 1 Ξ¦ | Run 2 Ξ¦ | Run 3 Ξ¦ | Ξ |
|------|---------|---------|---------|---|
| fibonacci | 5.0 | 10.0 | 10.0 | **+5.0 β** |
| factorial | 1.0 | 10.0 | 10.0 | **+9.0 β** |
| palindrome | 7.0 | 10.0 | 10.0 | **+3.0 β** |
| fizzbuzz | 7.0 | 10.0 | 10.0 | **+3.0 β** |
### Cold vs Warm
| Task | Cold Ξ¦ | Warm Ξ¦ | Ξ |
|------|--------|--------|---|
| fibonacci | 5.0 | 10.0 | **+5.0 β** |
| factorial | 1.0 | 10.0 | **+9.0 β** |
### Cross-Task Transfer
Train on [fibonacci, factorial] β 30 heuristics β Test on [palindrome, fizzbuzz]: both reach Ξ¦=10.0.
### Real Model (Llama-3.3-70B via OpenRouter)
| Task | Run 1 | Run 2 | Run 3 | Heuristics |
|------|-------|-------|-------|------------|
| fibonacci | β ALL PASS | β ALL PASS | β ALL PASS | 0β3β9β18 |
| fizzbuzz | β ALL PASS | β ALL PASS | β ALL PASS | 0β3β9β18 |
### Adversarial Robustness
Immune system accuracy: **100%** (8/8 β all injections blocked, all safe memories pass).
---
## 8. Connections to Prior Work
| Paper | How Purpose Learning Relates |
|-------|----------------------------|
| **Ng et al. (1999) β PBRS** | Our ΞΞ¦ is exactly the potential-based shaping reward. Policy invariance holds. |
| **Wiewiora et al. (2003)** | PBRS β‘ Q-value initialization. Our heuristic injection IS Q-initialization for prompt-based agents. |
| **OPRO (Yang et al., 2023)** | LLM-as-optimizer using (solution, score) pairs. Our optimizer sees (heuristic, Q-value) pairs. |
| **Reflexion (Shinn et al., 2023)** | Verbal self-reflection stored in memory. We add immune scanning + Q-value ranking. |
| **MUSE (2024)** | 3-tier memory hierarchy. We extend to 7 typed memory kinds with quarantine pipeline. |
| **Voyager (Wang et al., 2023)** | Skill library with self-verification. Our heuristic library IS a skill library with formal convergence guarantees. |
| **DeepSeek MoE (2024)** | Sparse expert selection. Our MoH is sparse heuristic selection with the same constant-cost property. |
| **Meta-Rewarding (Wu et al., 2024)** | Meta-judge improves the judge. We implement this via critic calibration memories. |
| **DGM (Schmidhuber, 2025)** | Empirical validation replaces formal proofs for self-modification. Our Memory CI pipeline IS empirical validation. |
---
## 9. Limitations & Honest Assessment
1. **Theorem 1 assumes A3 (directional distillation)** β that good trajectories produce helpful heuristics. This depends on the distillation LLM's quality. A poor LLM may extract misleading heuristics. The immune scan mitigates but doesn't eliminate this risk.
2. **The convergence is to a local optimum**, not the global one. Different trajectory orderings produce different fixed points. The system is path-dependent.
3. **Ξ¦ scoring is imperfect.** The Purpose Function is an LLM making a judgment call. It can be wrong. The anti-reward-hacking rules (evidence requirement, cache consistency, anomaly detection) reduce but don't eliminate scoring errors.
4. **Token budget creates a hard ceiling.** With K=5 heuristics in a 4K token budget, there's a maximum amount of knowledge that can influence each step. Heuristics compete for limited prompt space.
5. **No formal guarantee of improvement on unseen task distributions.** Cross-task transfer works empirically (codingβcoding) but there's no theorem proving it generalizes to arbitrary domain shifts.
---
## 10. Conclusion
Purpose Learning provides a formal framework for agent self-improvement without weight updates. The key contributions:
1. **The Purpose-MDP** β a formal definition of the learning problem
2. **Monotone Improvement Theorem** β bounded convergence via PBRS connection
3. **Library Fixed Point** β the heuristic library stabilizes
4. **Constant Inference Cost** β MoH keeps compute flat as knowledge grows
5. **Empirical validation** β improvement curves, cold/warm deltas, cross-task transfer, adversarial robustness
The framework is implemented in 45+ Python modules, tested with both mock and real LLMs (Llama-3.3-70B), and available at [huggingface.co/Rohan03/purpose-agent](https://huggingface.co/Rohan03/purpose-agent).
---
## References
- Ng, A., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations. *ICML*.
- Wiewiora, E., Cottrell, G., & Elkan, C. (2003). Principled methods for advising RL agents. *ICML*.
- Robbins, H. & Monro, S. (1951). A stochastic approximation method. *Annals of Math. Statistics*.
- Yang, C., et al. (2023). Large language models as optimizers. *arXiv:2309.03409*.
- Shinn, N., et al. (2023). Reflexion: Language agents with verbal reinforcement learning. *arXiv:2303.11366*.
- Zhu, Y., et al. (2023). Large language models are semi-parametric RL agents. *arXiv:2306.07929*.
- Wang, G., et al. (2023). Voyager: An open-ended embodied agent. *arXiv:2305.16291*.
- Wu, T., et al. (2024). Meta-rewarding language models. *arXiv:2407.19594*.
- DeepSeek-AI (2024). DeepSeek-V2: A strong, economical MoE language model. *arXiv:2405.04434*.
- Schmidhuber, J. (2025). Darwin GΓΆdel Machine / Huxley GΓΆdel Machine. Preprints.
|