---
library_name: purpose-agent
license: mit
language:
  - en
tags:
  - reinforcement-learning
  - agents
  - self-improving
  - experience-replay
  - llm-as-judge
  - state-value-evaluation
  - memory-augmented
  - react
  - orchestration
  - modular
pipeline_tag: text-generation
---

# Purpose Agent: Self-Improving Agentic Framework via State-Value Evaluation

A lightweight, modular framework where an LLM agent improves across tasks **without weight updates**, using an RL-inspired self-reflection loop with a "Purpose Function" that evaluates intermediate state improvements.

## Core Philosophy

The agent improves via a **Purpose Function Φ(s)** that scores how close the current state is to the goal at every step. It rewards the agent **only if Φ(s_new) > Φ(s_current)**. High-reward trajectories are distilled into reusable heuristics stored in a 3-tier memory system, so each later task can draw on what earlier tasks learned.
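
The reward rule can be sketched in a few lines. This is a minimal illustration, not the framework's implementation: `score_step` and its return shape are hypothetical, and the Φ values are assumed to have already been produced by the critic.

```python
def score_step(phi_current: float, phi_new: float) -> tuple[float, bool]:
    """Return (delta, rewarded) for one transition.

    delta is phi_new - phi_current; the step counts as rewarded only when
    the new state scores strictly higher than the current one.
    """
    delta = phi_new - phi_current
    return delta, delta > 0


# Hypothetical scores on the 0.0 to 10.0 scale
print(score_step(2.0, 3.5))  # (1.5, True)   progress
print(score_step(3.5, 3.5))  # (0.0, False)  lateral move
print(score_step(3.5, 3.0))  # (-0.5, False) regression
```

Keeping the signed delta separate from the boolean lets a caller penalize regressions while still crediting only strict improvements.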

**No real-time backprop. No PPO/DPO. Minimal infrastructure costs.**

## Architecture

```
┌───────────────────────────────────────────────────────────────┐
│                       ORCHESTRATOR LOOP                       │
│                                                               │
│  ┌──────────┐   action   ┌─────────────┐   s_new              │
│  │  ACTOR   │ ────────►  │ ENVIRONMENT │ ──────────┐          │
│  │(+memory) │            │ (your code) │           │          │
│  └────▲─────┘            └─────────────┘           │          │
│       │                                            ▼          │
│       │  heuristics    ┌────────────────┐   (s, a, s')        │
│       │◄───────────────│   OPTIMIZER    │◄─────────┐          │
│       │                │ (distillation) │          │          │
│       │                └────────────────┘          │          │
│       │                ┌────────────────┐   Φ(s)→Φ(s')        │
│       │                │   PURPOSE FN   │──────────┤          │
│       │                │ (state critic) │          │          │
│       │                └────────────────┘          │          │
│       │                ┌────────────────┐          │          │
│       └────────────────│ EXPERIENCE     │◄─────────┘          │
│                        │ REPLAY BUFFER  │                     │
│                        └────────────────┘                     │
└───────────────────────────────────────────────────────────────┘
```

## Modules

| Module | File | Role |
|--------|------|------|
| **Actor** | `actor.py` | ReAct-style agent with 3-tier memory-augmented prompts |
| **Purpose Function** | `purpose_function.py` | Strict LLM critic, hardened against reward hacking, that scores Φ(s) transitions |
| **Experience Replay** | `experience_replay.py` | Trajectory storage with two-phase retrieval (similarity + Q-value) |
| **Optimizer** | `optimizer.py` | Distills winning trajectories into reusable heuristics |
| **Orchestrator** | `orchestrator.py` | Main loop tying everything together |
| **LLM Backend** | `llm_backend.py` | Swappable inference layer (HF, OpenAI, Ollama, custom) |
| **Types** | `types.py` | Shared data structures (State, Action, Trajectory, Heuristic, etc.) |
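
The two-phase retrieval named in the Experience Replay row can be sketched as follows. This is an illustration under stated assumptions, not the contents of `experience_replay.py`: `StoredTrajectory`, `cosine`, `retrieve`, the toy 2-dimensional embeddings, and the Q-values are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class StoredTrajectory:
    task: str
    embedding: list[float]  # hypothetical task embedding
    q_value: float          # hypothetical usefulness estimate


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, buffer, recall_k=3, final_k=1):
    # Phase 1: recall the recall_k entries most similar to the query
    recalled = sorted(
        buffer, key=lambda t: cosine(query, t.embedding), reverse=True
    )[:recall_k]
    # Phase 2: re-rank the recalled entries by Q-value
    return sorted(recalled, key=lambda t: t.q_value, reverse=True)[:final_k]


buffer = [
    StoredTrajectory("find key in maze", [1.0, 0.0], q_value=0.2),
    StoredTrajectory("find treasure in maze", [0.9, 0.1], q_value=0.8),
    StoredTrajectory("book a flight", [0.0, 1.0], q_value=0.9),
]
best = retrieve([1.0, 0.0], buffer, recall_k=2, final_k=1)
print(best[0].task)  # find treasure in maze
```

The unrelated "book a flight" entry has the highest Q-value but never reaches the re-ranking phase, which is the point of recalling by similarity first.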

## Literature Foundation

| Paper | Contribution to this framework |
|-------|-------------------------------|
| [MUSE](https://arxiv.org/abs/2510.08002) | 3-tier memory hierarchy (strategic/procedural/tool) |
| [LATS](https://arxiv.org/abs/2310.04406) | LLM-as-value-function V(s) pattern |
| [REMEMBERER](https://arxiv.org/abs/2306.07929) | Q-value experience replay with Bellman updates |
| [Reflexion](https://arxiv.org/abs/2303.11366) | Verbal reinforcement via episodic self-reflection |
| [SPC](https://arxiv.org/abs/2504.19162) | Anti-reward-hacking via adversarial critic patterns |
| [CER](https://arxiv.org/abs/2506.06698) | Contextual experience distillation (Dynamics + Skills) |
| [MemRL](https://arxiv.org/abs/2601.03192) | Two-phase retrieval (semantic recall → Q-value re-rank) |
| [Voyager](https://arxiv.org/abs/2305.16291) | Skill library as long-term memory |

## Quick Start

```python
from purpose_agent import Orchestrator, State
from purpose_agent.llm_backend import HFInferenceBackend
from purpose_agent.orchestrator import Environment, Action

# 1. Define your environment
class MyEnv(Environment):
    def execute(self, action, current_state):
        # Your environment logic
        return State(data={...})

# 2. Create orchestrator with any LLM backend
orch = Orchestrator(
    llm=HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras"),
    environment=MyEnv(),
    available_actions={"search": "Search for items", "navigate": "Go somewhere"},
    persistence_dir="./agent_memory",
)

# 3. Run tasks; the agent self-improves across runs
result = orch.run_task(purpose="Find the answer to X", max_steps=20)
print(result.summary())
print(orch.get_heuristic_report())  # See what it learned
```

## Swapping LLM Backends

```python
# HuggingFace Inference Providers (cheapest)
from purpose_agent.llm_backend import HFInferenceBackend
llm = HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras")

# OpenAI
from purpose_agent.llm_backend import OpenAICompatibleBackend
llm = OpenAICompatibleBackend(model="gpt-4o")

# Local Ollama
llm = OpenAICompatibleBackend(
    model="llama3.2",
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# Use DIFFERENT models for Actor vs Critic (recommended for production).
# cheap_fast_model, strong_model, and my_env are placeholders for objects you create.
orch = Orchestrator(
    llm=cheap_fast_model,            # Actor: needs throughput
    critic_llm=strong_model,         # Purpose Function: needs accuracy
    optimizer_llm=cheap_fast_model,  # Optimizer: runs infrequently
    environment=my_env,
)
```

## Purpose Function: Anti-Reward-Hacking Design

The Purpose Function system prompt enforces 7 strict rules:

1. **EVIDENCE REQUIRED**: Every score must cite specific observable state changes
2. **NO CREDIT FOR INTENTIONS**: Scores are based on the actual state, not the agent's predictions
3. **NO SYCOPHANCY**: Lateral moves get Δ=0.0, regressions get negative Δ
4. **MONOTONIC SCALE**: Φ ranges from 0.0 to 10.0, proportional to progress
5. **ANTI-GAMING**: Superficial state manipulation is flagged and penalized
6. **CONSISTENCY**: Identical states must receive identical Φ scores (cache-enforced)
7. **CONFIDENCE**: Ambiguous evaluations get a reduced delta magnitude

Additional programmatic safeguards:
- Score caching prevents inconsistent evaluations
- Anomaly detection flags suspiciously large single-step jumps
- Confidence threshold filters uncertain scores
- Z-score normalization prevents score inflation over long trajectories
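
The four safeguards above could be combined roughly as follows. This is a sketch under assumptions, not the code in `purpose_function.py`: `GuardedScorer`, `z_normalize`, the `max_jump` and `min_confidence` thresholds, and the choice to zero out low-confidence deltas are all hypothetical.

```python
import statistics


class GuardedScorer:
    """Wraps a raw critic with caching, anomaly, and confidence checks."""

    def __init__(self, raw_score, max_jump=3.0, min_confidence=0.6):
        self.raw_score = raw_score  # callable: state -> (phi, confidence)
        self.max_jump = max_jump
        self.min_confidence = min_confidence
        self._cache = {}

    def phi(self, state):
        # Score caching: an identical (hashable) state always gets the same score
        if state not in self._cache:
            self._cache[state] = self.raw_score(state)
        return self._cache[state]

    def delta(self, s_current, s_new):
        phi_cur, _ = self.phi(s_current)
        phi_new, confidence = self.phi(s_new)
        d = phi_new - phi_cur
        if confidence < self.min_confidence:
            return 0.0, "low_confidence"  # assumption: uncertain scores are dropped
        if abs(d) > self.max_jump:
            return d, "anomaly"           # suspiciously large single-step jump
        return d, "ok"


def z_normalize(deltas):
    """Rescale deltas to zero mean and unit variance (all zeros if constant)."""
    if not deltas:
        return []
    mean = statistics.fmean(deltas)
    std = statistics.pstdev(deltas)
    return [(d - mean) / std if std else 0.0 for d in deltas]


# Hypothetical critic output: state -> (phi, confidence)
table = {"start": (1.0, 0.9), "door_open": (2.5, 0.9),
         "teleported": (9.0, 0.9), "foggy": (3.0, 0.3)}
scorer = GuardedScorer(lambda s: table[s])
print(scorer.delta("start", "door_open"))   # (1.5, 'ok')
print(scorer.delta("start", "teleported"))  # (8.0, 'anomaly')
print(scorer.delta("start", "foggy"))       # (0.0, 'low_confidence')
```

A real critic call would be an LLM request, so the cache also saves inference cost in addition to enforcing consistency.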

## 3-Tier Memory System

Based on MUSE (arxiv:2510.08002):

| Tier | Content | Loading | Update Trigger |
|------|---------|---------|----------------|
| **Strategic** | `<Dilemma, Strategy>` pairs | Always in system prompt | After each task |
| **Procedural** | Step-by-step SOPs | Index in prompt, details on demand | After high-reward trajectory |
| **Tool** | Per-action tips | Returned per step | When new patterns prove effective |
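
One way to picture the three tiers and their different loading behavior is the sketch below. The class `ThreeTierMemory` and its method names are hypothetical and are not taken from the framework's `types.py` or `actor.py`; only the tier contents and loading rules follow the table.

```python
from dataclasses import dataclass, field


@dataclass
class ThreeTierMemory:
    strategic: list[tuple[str, str]] = field(default_factory=list)  # (dilemma, strategy)
    procedural: dict[str, list[str]] = field(default_factory=dict)  # SOP name -> steps
    tool: dict[str, list[str]] = field(default_factory=dict)        # action -> tips

    def system_prompt(self) -> str:
        """Strategic pairs are always included; SOPs appear only as an index."""
        lines = ["Strategic memory:"]
        lines += [f"- Dilemma: {d} | Strategy: {s}" for d, s in self.strategic]
        lines.append("Procedural index: " + ", ".join(sorted(self.procedural)))
        return "\n".join(lines)

    def sop_details(self, name: str) -> list[str]:
        """Full SOP steps, loaded on demand."""
        return self.procedural.get(name, [])

    def tips_for(self, action: str) -> list[str]:
        """Per-action tips, returned at the step where the action is used."""
        return self.tool.get(action, [])


memory = ThreeTierMemory()
memory.strategic.append(("two doors look identical", "try the one nearest the last clue"))
memory.procedural["open_locked_chest"] = ["find key", "navigate to chest", "use key"]
memory.tool["search"] = ["search small rooms first"]

print(memory.system_prompt())
print(memory.sop_details("open_locked_chest"))
print(memory.tips_for("search"))
```

Listing only SOP names in the prompt keeps the always-loaded context small, while the step-by-step details stay available when the agent asks for them.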

## Running the Demo

```bash
python demo.py
```

Runs 17 unit tests plus a full end-to-end demo with a simulated TreasureMaze environment. No API keys are needed because the demo uses `MockLLMBackend`.

## Dependencies

- **Core framework**: Python 3.10+ (stdlib only)
- **HF backend**: `huggingface_hub`
- **OpenAI backend**: `openai`
- **Production embeddings**: `sentence-transformers` (optional, for better retrieval)

## License

MIT