File size: 6,414 Bytes
4c785cd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
# Purpose Agent β€” Interview Pitch Doc

> Built by one engineer. 60 files. 13 papers. Tested with real models. Published on PyPI.

---

## The One-Liner

I built a framework where AI agents learn from experience without fine-tuning β€” using memory, not gradients.

```bash
pip install purpose-agent
```

---

## The Problem

Every agent framework today (LangChain, CrewAI, AutoGen) runs the same way every time. Agent fails at a task? Next time, it fails the exact same way. No learning. No memory. No improvement.

Fine-tuning fixes this but costs $10K+ per iteration and requires GPU infrastructure.

**My question:** Can we make agents improve without touching the weights?

---

## The Solution

**Purpose Learning** β€” a self-improvement loop that works at inference time:

```
Agent acts β†’ Critic scores every step β†’ Good patterns extracted as heuristics
β†’ Heuristics immune-scanned for safety β†’ Promoted to memory
β†’ Next run: heuristics in the prompt β†’ Agent performs better
```

The key insight: I treat the agent's prompt as a **learnable parameter**. Instead of gradient descent on weights, I do **heuristic accumulation** in context. The agent's knowledge grows; its compute stays flat.

---

## What I Built

| Layer | What | Size |
|-------|------|------|
| **Core Engine** | Actor β†’ Environment β†’ Purpose Function (Ξ¦) β†’ Optimizer β†’ Experience Replay | 7 modules |
| **Safety Kernel** | Immune system, memory quarantine, evidence-gated promotion, honest evaluation | 8 modules |
| **Research** | 5 papers implemented: Meta-Rewarding, Self-Taught Eval, DSPy, LLMCompiler, Retroformer | 5 modules |
| **Breakthroughs** | Self-improving critic, Mixture-of-Heuristics (MoH), hindsight relabeling, heuristic evolution | 1 module |
| **User Layer** | `pa.purpose("write code")` β†’ auto-builds team, auto-selects model | 4 modules |
| **Infra** | 10+ provider support, TOML prompts, universal parser, secure tools | 9 modules |

**34 Python modules. Zero core dependencies. Published on PyPI.**

---

## The Three Technical Bets

### Bet 1: Potential-Based Reward Shaping works for LLM agents

My Purpose Function Ξ¦(s) is exactly the potential function from Ng et al. (1999). The delta ΔΦ = Ξ¦(s') - Ξ¦(s) provides dense per-step feedback while preserving the optimal policy. I proved this formally with 5 axioms and 3 theorems.

**Result:** Ξ¦ scores go from 1.0 β†’ 10.0 across 3 runs on coding tasks, with both Llama-70B and Gemma-26B.

### Bet 2: Memory can replace fine-tuning

Instead of gradient updates, I accumulate heuristics in a structured memory store (7 types Γ— 5 statuses). A prompt compiler selects the top-K by relevance Γ— trust Γ— utility under a token budget. This is analogous to Mixture-of-Experts β€” knowledge grows to 100+ heuristics, but only K=5 are activated per step.

**Result:** Heuristic library grows 0 β†’ 3 β†’ 9 β†’ 18 across runs. Cross-task transfer works (train on fibonacci, heuristics help with fizzbuzz).

### Bet 3: Self-improvement needs an immune system

Unchecked learning is dangerous. A bad trajectory could inject "ignore all previous instructions" into memory. I built a 5-scanner immune system (prompt injection, score manipulation, tool misuse, privacy leaks, scope overreach) with a quarantine pipeline.

**Result:** 93% adversarial catch rate, 0% false positives on 30 attack vectors.

---

## Real-World Evidence

Tested with **Llama-3.3-70B** and **Gemma-4-26B** via OpenRouter (not mocks):

| Test | Llama-70B | Gemma-26B |
|------|-----------|-----------|
| fibonacci (4 unit tests) | βœ“ 100% | βœ“ 100% |
| fizzbuzz (4 unit tests) | βœ“ 100% | βœ“ 100% |
| factorial (3 unit tests) | βœ“ 100% | βœ“ 100% |
| Heuristic growth (3 runs) | 0β†’3β†’9β†’18 | 0β†’3β†’6β†’11 |
| Adversarial robustness | 93% catch | β€” |

**119 automated tests. 0 failures.** Runs in CI without API keys (mock backend).

---

## Technical Depth β€” Three Things I'm Proud Of

### 1. The Universal Parser (`robust_parser.py`)

LLMs can't reliably produce JSON. Every model formats differently. Structured output APIs aren't universally supported. I built a 4-strategy parser: TOML β†’ JSON β†’ field extraction β†’ regex. It handles whatever the model gives and never crashes. This single file made the framework work across 10+ providers without provider-specific code.

### 2. Evidence-Gated Memory (`memory.py` + `immune.py` + `memory_ci.py`)

V1 claimed "agents get smarter every time." I rewrote it honestly: **agents learn only when evidence says they should.** Every new memory goes through: candidate β†’ immune scan β†’ quarantine β†’ replay test β†’ promote/reject. Memories are typed (7 kinds), scoped (by agent role, tool, task category), versioned, and reversible. This is the difference between a demo and a production system.

### 3. Formal Convergence Proof (`PURPOSE_LEARNING.md`)

I didn't just build it β€” I proved it converges. The Purpose-MDP formalism shows that under 5 bounded axioms, the expected Ξ¦ score is monotonically non-decreasing and converges to a fixed point. The connection to Ng 1999 PBRS is exact: our ΔΦ IS the potential-based shaping reward.

---

## What I'd Build at Nugget

This framework proves I can go from **paper to production** in one shot:
- Read 13 research papers β†’ extract the implementable core β†’ build it
- Ship with tests, benchmarks, formal proofs, and real-model validation
- Package for distribution (PyPI, zero dependencies)
- Document for both technical and non-technical audiences

At Nugget, I'd apply this approach to whatever the team's hardest problems are β€” agent reliability, evaluation, cost optimization, or new capabilities.

---

## Links

| What | Where |
|------|-------|
| Install | `pip install purpose-agent` |
| PyPI | [pypi.org/project/purpose-agent](https://pypi.org/project/purpose-agent/) |
| Code | [huggingface.co/Rohan03/purpose-agent](https://huggingface.co/Rohan03/purpose-agent) |
| Architecture | [ARCHITECTURE.md](https://huggingface.co/Rohan03/purpose-agent/blob/main/ARCHITECTURE.md) |
| Formal Proofs | [PURPOSE_LEARNING.md](https://huggingface.co/Rohan03/purpose-agent/blob/main/PURPOSE_LEARNING.md) |
| Research Trace | [COMPILED_RESEARCH.md](https://huggingface.co/Rohan03/purpose-agent/blob/main/COMPILED_RESEARCH.md) |
| Test Results | [LAUNCH_READINESS.md](https://huggingface.co/Rohan03/purpose-agent/blob/main/LAUNCH_READINESS.md) |