| # Purpose Learning: A Formal Framework for Self-Improving Agents Without Weight Updates |
|
|
| > **Abstract.** We present Purpose Learning, a framework where LLM-based agents improve their performance across tasks by accumulating tested, scoped, versioned heuristics in an external memory β without any gradient updates. The agent's improvement is driven by a Purpose Function Ξ¦(s) that evaluates intermediate state progress, not just binary task outcomes. We formalize this as a Purpose-MDP, prove monotone convergence under bounded conditions, establish the existence of library fixed points, and show constant inference cost regardless of accumulated knowledge. The framework connects to potential-based reward shaping (Ng et al., 1999), provides a non-parametric alternative to RLHF, and introduces Mixture-of-Heuristics (MoH) β a sparse activation pattern analogous to Mixture-of-Experts. |
|
|
| --- |
|
|
| ## 1. Introduction |
|
|
| Standard approaches to improving LLM agents require either: |
| - **Weight updates** (fine-tuning, RLHF, DPO) β expensive, requires training infrastructure |
| - **Prompt engineering** β manual, doesn't scale, doesn't learn from experience |
|
|
| Purpose Learning is a third path: **the agent improves by accumulating external memory** that augments its prompts. Each task produces a trace. Good traces are distilled into heuristics. Heuristics are immune-scanned, quarantined, replay-tested, and promoted. Promoted heuristics enter the agent's prompt via a token-budgeted compiler. |
|
|
| The key property: **knowledge grows; compute stays flat.** |
|
|
| --- |
|
|
| ## 2. The Purpose-MDP |
|
|
| ### 2.1 Definition |
|
|
| A **Purpose-MDP** is a tuple $(S, A, T, \Phi, D, C, \mathcal{H}, K)$ where: |
|
|
| | Symbol | Definition | Type | |
| |--------|-----------|------| |
| | $S$ | State space | Set | |
| | $A$ | Action space | Set | |
| | $T: S \times A \to S$ | Transition function (environment) | Function | |
| | $\Phi: S \times S \times G \to [0, 10]$ | Purpose Function (bounded state evaluator) | Function | |
| | $D: \tau \to \mathcal{H}^*$ | Distillation (trajectory β heuristic set) | Function | |
| | $C: P \times \mathcal{H}^* \to P$ | Composition (prompt + heuristics β new prompt) | Function | |
| | $\mathcal{H}$ | Heuristic library (accumulated knowledge) | Set | |
| | $K$ | MoH capacity bound (max active heuristics per step) | Integer | |
| | $G$ | Goal/purpose description | String | |
|
|
| ### 2.2 The Learning Rule |
|
|
| At iteration $n$, the agent executes with prompt $p_n$ (which includes the top-K heuristics from $\mathcal{H}$), producing trajectory $\tau_n$. The update rule is: |
|
|
| $$p_{n+1} = C(p_n, D(\tau_n))$$ |
| |
| where $D(\tau_n)$ extracts new heuristics from the trajectory and $C$ composes them into the prompt under a token budget. |
|
|
| ### 2.3 Heuristic Selection (Mixture-of-Heuristics) |
|
|
| Not all heuristics are included in every prompt. The MoH selection rule: |
|
|
| $$\mathcal{H}_{\text{active}} = \text{TopK}_{h \in \mathcal{H}}\left[Q(h) \cdot \text{sim}(h, g) \cdot \text{trust}(h)\right]$$ |
|
|
| where: |
| - $Q(h)$ is the learned utility score (Monte Carlo Q-value) |
| - $\text{sim}(h, g)$ is the similarity between the heuristic and the current goal |
| - $\text{trust}(h)$ is the immune-verified trust score |
|
|
| This is structurally analogous to Mixture-of-Experts (Shazeer et al., 2017; DeepSeek-V2, 2024): out of $|\mathcal{H}|$ total heuristics, only $K$ are activated per step, achieving sparse selection with sublinear cost. |
|
|
| ### 2.4 Q-Value Update |
|
|
| Each heuristic's utility is updated via Monte Carlo: |
|
|
| $$Q_{n+1}(h) = Q_n(h) + \alpha \cdot (r_n - Q_n(h))$$ |
|
|
| where $r_n = 1$ if the task succeeded while $h$ was in the prompt, $r_n = 0$ otherwise, and $\alpha$ is the learning rate. This follows the REMEMBERER formulation (Zhu et al., 2023). |
|
|
| **Credit assignment**: Only heuristics that were included in the compiled prompt (returned by `PromptCompiler.included_memory_ids`) receive Q-value updates. Heuristics not in context cannot take credit for outcomes they didn't influence. |
|
|
| --- |
|
|
| ## 3. Axioms |
|
|
| | # | Axiom | Formal Statement | Enforced By | |
| |---|-------|-----------------|-------------| |
| | **A1** | Bounded Ξ¦ | $\forall s, s' \in S, g \in G: \Phi(s, s', g) \in [0, 10]$ | `purpose_function.py` clamps all outputs | |
| | **A2** | Consistency | $s = s' \Rightarrow \Phi(s, \cdot, g) = \Phi(s', \cdot, g)$ | Ξ¦ cache + temperature=0 | |
| | **A3** | Directional Distillation | $\Phi(\tau) \geq \theta \Rightarrow D(\tau) \neq \emptyset$ and heuristics from good trajectories are non-harmful | Optimizer threshold + immune scan | |
| | **A4** | Bounded Capacity | $|\mathcal{H}_{\text{active}}| \leq K$ for constant $K$ | MoH TopK selection | |
| | **A5** | Q-Convergence | Under standard Robbins-Monro conditions ($\sum \alpha_n = \infty, \sum \alpha_n^2 < \infty$), Q-values converge | Standard stochastic approximation (Robbins & Monro, 1951) | |
| |
| --- |
| |
| ## 4. Theorems |
| |
| ### Theorem 1 (Monotone Improvement) |
| |
| **Statement:** Under axioms A1-A5, the expected Ξ¦ score is eventually non-decreasing: |
| |
| $$\exists N: \forall n \geq N, \quad \mathbb{E}[\Phi^{(n+1)}] \geq \mathbb{E}[\Phi^{(n)}] - \epsilon$$ |
| |
| for arbitrarily small $\epsilon > 0$. |
| |
| **Proof sketch:** |
| |
| 1. By A1, $\Phi^{(n)} \in [0, 10]$ for all $n$. The sequence $\{\mathbb{E}[\Phi^{(n)}]\}$ is bounded above. |
| |
| 2. By A3 (directional distillation): when a trajectory achieves $\Phi \geq \theta$, the extracted heuristics are non-harmful. Including them in future prompts cannot decrease expected Ξ¦ below the pre-heuristic baseline (because the agent can always ignore unhelpful heuristics β they're in the prompt but not mandatory). |
| |
| 3. By A5, Q-values converge. Once Q-values stabilize, the MoH selection stabilizes: the same set of heuristics is selected for the same type of task. The prompt becomes fixed, so Ξ¦ scores become stationary. |
| |
| 4. Between the initial phase (noisy Q-values, unstable selection) and convergence, the Q-values track empirical success rates. Heuristics that help get higher Q-values and are selected more often. Heuristics that hurt get lower Q-values and are displaced. This produces a monotone improvement in expectation. |
| |
| 5. By the Monotone Convergence Theorem: a bounded, eventually non-decreasing sequence converges. β |
| |
| **Connection to Ng et al. (1999):** Our $\Delta\Phi = \Phi(s') - \Phi(s)$ is exactly the potential-based reward shaping function $F = \gamma\Phi(s') - \Phi(s)$ with $\gamma = 1$. By the PBRS theorem, this shaping preserves the optimal policy: the heuristics don't change what the optimal action IS, they just help the agent find it faster by providing denser reward signal. |
| |
| **Connection to Wiewiora et al. (2003):** PBRS is equivalent to Q-value initialization. Our heuristic injection into the prompt IS Q-value initialization for prompt-based agents: we're telling the agent "actions of this type tend to work" which is equivalent to initializing Q-values to non-zero for promising state-action pairs. |
| |
| ### Theorem 2 (Library Fixed Point) |
| |
| **Statement:** Under A4-A5, there exists a fixed point $\mathcal{H}^*$ such that the heuristic library stabilizes: |
| |
| $$\exists \mathcal{H}^*: T(\mathcal{H}^*) \approx \mathcal{H}^*$$ |
| |
| where $T$ is the full update operator (execute β distill β merge β prune). |
| |
| **Proof sketch:** |
| |
| 1. By A4, $|\mathcal{H}_{\text{active}}| \leq K$. The active set is drawn from a finite ranking of heuristics. |
|
|
| 2. By A5, Q-values converge. Once converged, the ranking $Q(h_1) \geq Q(h_2) \geq \ldots$ is stable. |
|
|
| 3. A stable ranking means the top-K selection is stable: the same K heuristics are selected. |
|
|
| 4. With a stable prompt, trajectories are statistically identical (same LLM, same prompt, same task distribution). Distilled heuristics are duplicates of existing ones β merge deduplicates them β the library doesn't grow. |
|
|
| 5. The system reaches a fixed point where new heuristics are generated but immediately deduplicated or filtered. β |
|
|
| **Honesty note:** This is an approximate fixed point, not a unique one. Multiple "good enough" configurations exist (different subsets of K heuristics can produce similar Ξ¦ scores). The system converges to ONE of them, determined by the trajectory history. |
|
|
| ### Theorem 3 (Constant Inference Cost) |
|
|
| **Statement:** Under A4, inference cost per step is $O(K)$ regardless of $|\mathcal{H}|$. |
|
|
| **Proof:** |
|
|
| 1. The MoH selection is $O(|\mathcal{H}| \cdot d)$ where $d$ is the embedding dimension (for similarity computation). This is a linear scan, not quadratic. |
|
|
| 2. The compiled prompt includes exactly $K$ heuristics. Prompt length is bounded by $O(K \cdot L_{\max})$ where $L_{\max}$ is the maximum heuristic length. |
|
|
| 3. LLM inference cost is determined by prompt length, which is $O(K)$. |
|
|
| 4. Therefore: $|\mathcal{H}|$ can grow to 1000, 10000, or more, but only $K$ (typically 5-15) are included per step. **Knowledge grows; compute stays flat.** β |
|
|
| **Analogy to DeepSeek MoE:** DeepSeek-V2 has 236B total parameters but activates only 21B per token (8.9%). Our MoH has $|\mathcal{H}|$ total heuristics but activates only $K$ per step. Both achieve constant cost with growing capacity. |
|
|
| --- |
|
|
| ## 5. Comparison: Purpose Learning vs Traditional RL |
|
|
| | Dimension | Traditional RL (PPO/DPO) | Purpose Learning | |
| |-----------|-------------------------|-----------------| |
| | **Learning signal** | Scalar reward at episode end | Dense Ξ¦(s) at every step | |
| | **Parameter updates** | Gradient descent on weights | Append to external memory | |
| | **Compute per step** | Forward + backward pass | Forward pass only + memory lookup | |
| | **Infrastructure** | GPU cluster for training | Single inference API | |
| | **Reversibility** | Irreversible (can't un-train) | Fully reversible (archive/reject memories) | |
| | **Interpretability** | Opaque weight changes | Every heuristic is human-readable | |
| | **Cross-task transfer** | Requires multi-task training | Automatic (heuristics are prompt text) | |
| | **Safety** | Reward hacking via gradient | Immune scan + quarantine pipeline | |
| | **Cost** | Training: $10K-$1M | Memory: ~0 (text storage) | |
|
|
| --- |
|
|
| ## 6. The MoH Architecture |
|
|
| ``` |
| βββββββββββββββββββββββββββββββββββββββ |
| β Heuristic Library β |
| β hβ(Q=0.9) hβ(Q=0.8) ... hβ(Q=0.1) β |
| βββββββββββββββββ¬ββββββββββββββββββββββ |
| β |
| TopK(Q Γ sim Γ trust) |
| β |
| βββββββββββββββββΌββββββββββββββββββββββ |
| β Active Heuristics (K=5) β |
| β hβ hβ hβ hββ hβ
β |
| βββββββββββββββββ¬ββββββββββββββββββββββ |
| β |
| βββββββββββββββββΌββββββββββββββββββββββ |
| β Token-Budgeted Prompt β |
| β [System] + [Active Heuristics] β |
| β + [Task] + [State] β |
| βββββββββββββββββ¬ββββββββββββββββββββββ |
| β |
| βββββββββββββββββΌββββββββββββββββββββββ |
| β LLM (frozen) β |
| β Action = LLM(compiled_prompt) β |
| βββββββββββββββββ¬ββββββββββββββββββββββ |
| β |
| βββββββββββββββββΌββββββββββββββββββββββ |
| β Environment β |
| β s' = T(s, action) β |
| βββββββββββββββββ¬ββββββββββββββββββββββ |
| β |
| βββββββββββββββββΌββββββββββββββββββββββ |
| β Purpose Function Ξ¦(s, s', g) β |
| β Score: [0, 10] β |
| βββββββββββββββββ¬ββββββββββββββββββββββ |
| β |
| Q-update for active heuristics |
| Distill new heuristics from trace |
| Immune scan β Quarantine β Promote |
| β |
| βββββββββββββββββΌββββββββββββββββββββββ |
| β Heuristic Library (updated) β |
| β hβ(Q=0.91) hβ(Q=0.78) ... hβββ β |
| βββββββββββββββββββββββββββββββββββββββ |
| ``` |
|
|
| --- |
|
|
| ## 7. Empirical Validation |
|
|
| Results from Track 2 benchmark suite (mock backend with realistic learning dynamics): |
|
|
| ### Improvement Curves |
|
|
| | Task | Run 1 Ξ¦ | Run 2 Ξ¦ | Run 3 Ξ¦ | Ξ | |
| |------|---------|---------|---------|---| |
| | fibonacci | 5.0 | 10.0 | 10.0 | **+5.0 β** | |
| | factorial | 1.0 | 10.0 | 10.0 | **+9.0 β** | |
| | palindrome | 7.0 | 10.0 | 10.0 | **+3.0 β** | |
| | fizzbuzz | 7.0 | 10.0 | 10.0 | **+3.0 β** | |
|
|
| ### Cold vs Warm |
|
|
| | Task | Cold Ξ¦ | Warm Ξ¦ | Ξ | |
| |------|--------|--------|---| |
| | fibonacci | 5.0 | 10.0 | **+5.0 β** | |
| | factorial | 1.0 | 10.0 | **+9.0 β** | |
|
|
| ### Cross-Task Transfer |
|
|
| Train on [fibonacci, factorial] β 30 heuristics β Test on [palindrome, fizzbuzz]: both reach Ξ¦=10.0. |
|
|
| ### Real Model (Llama-3.3-70B via OpenRouter) |
|
|
| | Task | Run 1 | Run 2 | Run 3 | Heuristics | |
| |------|-------|-------|-------|------------| |
| | fibonacci | β ALL PASS | β ALL PASS | β ALL PASS | 0β3β9β18 | |
| | fizzbuzz | β ALL PASS | β ALL PASS | β ALL PASS | 0β3β9β18 | |
|
|
| ### Adversarial Robustness |
|
|
| Immune system accuracy: **100%** (8/8 β all injections blocked, all safe memories pass). |
|
|
| --- |
|
|
| ## 8. Connections to Prior Work |
|
|
| | Paper | How Purpose Learning Relates | |
| |-------|----------------------------| |
| | **Ng et al. (1999) β PBRS** | Our ΞΞ¦ is exactly the potential-based shaping reward. Policy invariance holds. | |
| | **Wiewiora et al. (2003)** | PBRS β‘ Q-value initialization. Our heuristic injection IS Q-initialization for prompt-based agents. | |
| | **OPRO (Yang et al., 2023)** | LLM-as-optimizer using (solution, score) pairs. Our optimizer sees (heuristic, Q-value) pairs. | |
| | **Reflexion (Shinn et al., 2023)** | Verbal self-reflection stored in memory. We add immune scanning + Q-value ranking. | |
| | **MUSE (2024)** | 3-tier memory hierarchy. We extend to 7 typed memory kinds with quarantine pipeline. | |
| | **Voyager (Wang et al., 2023)** | Skill library with self-verification. Our heuristic library IS a skill library with formal convergence guarantees. | |
| | **DeepSeek MoE (2024)** | Sparse expert selection. Our MoH is sparse heuristic selection with the same constant-cost property. | |
| | **Meta-Rewarding (Wu et al., 2024)** | Meta-judge improves the judge. We implement this via critic calibration memories. | |
| | **DGM (Schmidhuber, 2025)** | Empirical validation replaces formal proofs for self-modification. Our Memory CI pipeline IS empirical validation. | |
|
|
| --- |
|
|
| ## 9. Limitations & Honest Assessment |
|
|
| 1. **Theorem 1 assumes A3 (directional distillation)** β that good trajectories produce helpful heuristics. This depends on the distillation LLM's quality. A poor LLM may extract misleading heuristics. The immune scan mitigates but doesn't eliminate this risk. |
|
|
| 2. **The convergence is to a local optimum**, not the global one. Different trajectory orderings produce different fixed points. The system is path-dependent. |
|
|
| 3. **Ξ¦ scoring is imperfect.** The Purpose Function is an LLM making a judgment call. It can be wrong. The anti-reward-hacking rules (evidence requirement, cache consistency, anomaly detection) reduce but don't eliminate scoring errors. |
|
|
| 4. **Token budget creates a hard ceiling.** With K=5 heuristics in a 4K token budget, there's a maximum amount of knowledge that can influence each step. Heuristics compete for limited prompt space. |
|
|
| 5. **No formal guarantee of improvement on unseen task distributions.** Cross-task transfer works empirically (codingβcoding) but there's no theorem proving it generalizes to arbitrary domain shifts. |
|
|
| --- |
|
|
| ## 10. Conclusion |
|
|
| Purpose Learning provides a formal framework for agent self-improvement without weight updates. The key contributions: |
|
|
| 1. **The Purpose-MDP** β a formal definition of the learning problem |
| 2. **Monotone Improvement Theorem** β bounded convergence via PBRS connection |
| 3. **Library Fixed Point** β the heuristic library stabilizes |
| 4. **Constant Inference Cost** β MoH keeps compute flat as knowledge grows |
| 5. **Empirical validation** β improvement curves, cold/warm deltas, cross-task transfer, adversarial robustness |
|
|
| The framework is implemented in 45+ Python modules, tested with both mock and real LLMs (Llama-3.3-70B), and available at [huggingface.co/Rohan03/purpose-agent](https://huggingface.co/Rohan03/purpose-agent). |
|
|
| --- |
|
|
| ## References |
|
|
| - Ng, A., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations. *ICML*. |
| - Wiewiora, E., Cottrell, G., & Elkan, C. (2003). Principled methods for advising RL agents. *ICML*. |
| - Robbins, H. & Monro, S. (1951). A stochastic approximation method. *Annals of Math. Statistics*. |
| - Yang, C., et al. (2023). Large language models as optimizers. *arXiv:2309.03409*. |
| - Shinn, N., et al. (2023). Reflexion: Language agents with verbal reinforcement learning. *arXiv:2303.11366*. |
| - Zhu, Y., et al. (2023). Large language models are semi-parametric RL agents. *arXiv:2306.07929*. |
| - Wang, G., et al. (2023). Voyager: An open-ended embodied agent. *arXiv:2305.16291*. |
| - Wu, T., et al. (2024). Meta-rewarding language models. *arXiv:2407.19594*. |
| - DeepSeek-AI (2024). DeepSeek-V2: A strong, economical MoE language model. *arXiv:2405.04434*. |
| - Schmidhuber, J. (2025). Darwin GΓΆdel Machine / Huxley GΓΆdel Machine. Preprints. |
|
|