---
library_name: purpose-agent
license: mit
language:
- en
tags:
- reinforcement-learning
- agents
- self-improving
- experience-replay
- llm-as-judge
- state-value-evaluation
- memory-augmented
- react
- orchestration
- modular
pipeline_tag: text-generation
---
# Purpose Agent: Self-Improving Agentic Framework via State-Value Evaluation
A lightweight, modular framework where an LLM agent improves across tasks **without weight updates**, using an RL-inspired self-reflection loop with a "Purpose Function" that evaluates intermediate state improvements.
## Core Philosophy
The agent improves via a **Purpose Function Φ(s)** that scores progress toward the goal at every step. It rewards the agent **only if Φ(s_new) > Φ(s_current)**. High-reward trajectories are distilled into reusable heuristics stored in a 3-tier memory system, so each subsequent task can draw on what earlier tasks learned.
**No real-time backprop. No PPO/DPO. Minimal infrastructure costs.**
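The reward rule can be illustrated with a few lines of plain Python. This is a minimal sketch, not the framework's implementation: `StepSignal` and `score_step` are hypothetical names, and the Φ values are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepSignal:
    delta: float    # Φ(s_new) - Φ(s_current); may be zero or negative
    rewarded: bool  # True only for a strict improvement


def score_step(phi_current: float, phi_new: float) -> StepSignal:
    """Hypothetical helper: reward a step only when Φ strictly increases."""
    delta = phi_new - phi_current
    return StepSignal(delta=delta, rewarded=delta > 0)


# An improvement, a lateral move, and a regression (illustrative values).
signals = [score_step(a, b) for a, b in [(2.0, 3.5), (3.5, 3.5), (3.5, 3.0)]]
print([s.rewarded for s in signals])
```

Only the first transition is rewarded; the lateral move and the regression are not.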
## Architecture
```
┌────────────────────────────────────────────────────────────────┐
│                       ORCHESTRATOR LOOP                        │
│                                                                │
│  ┌──────────┐   action   ┌─────────────┐   s_new               │
│  │  ACTOR   │ ─────────► │ ENVIRONMENT │ ──────────┐           │
│  │(+memory) │            │ (your code) │           │           │
│  └────▲─────┘            └─────────────┘           │           │
│       │                                            ▼           │
│       │  heuristics     ┌────────────────┐   (s, a, s')        │
│       ├──────────────── │   OPTIMIZER    │◄────────┤           │
│       │                 │ (distillation) │         │           │
│       │                 └────────────────┘         │           │
│       │                 ┌────────────────┐   Φ(s)→Φ(s')        │
│       │                 │   PURPOSE FN   │◄────────┤           │
│       │                 │ (state critic) │         │           │
│       │                 └────────────────┘         │           │
│       │                 ┌────────────────┐         │           │
│       └──────────────── │   EXPERIENCE   │◄────────┘           │
│                         │ REPLAY BUFFER  │                     │
│                         └────────────────┘                     │
└────────────────────────────────────────────────────────────────┘
```
## Modules
| Module | File | Role |
|--------|------|------|
| **Actor** | `actor.py` | ReAct-style agent with 3-tier memory-augmented prompts |
| **Purpose Function** | `purpose_function.py` | Strict LLM critic, hardened against reward hacking, that scores Φ(s) transitions |
| **Experience Replay** | `experience_replay.py` | Trajectory storage with two-phase retrieval (similarity + Q-value) |
| **Optimizer** | `optimizer.py` | Distills winning trajectories into reusable heuristics |
| **Orchestrator** | `orchestrator.py` | Main loop tying everything together |
| **LLM Backend** | `llm_backend.py` | Swappable inference layer (HF, OpenAI, Ollama, custom) |
| **Types** | `types.py` | Shared data structures (State, Action, Trajectory, Heuristic, etc.) |
## Literature Foundation
| Paper | Contribution to this framework |
|-------|-------------------------------|
| [MUSE](https://arxiv.org/abs/2510.08002) | 3-tier memory hierarchy (strategic/procedural/tool) |
| [LATS](https://arxiv.org/abs/2310.04406) | LLM-as-value-function V(s) pattern |
| [REMEMBERER](https://arxiv.org/abs/2306.07929) | Q-value experience replay with Bellman updates |
| [Reflexion](https://arxiv.org/abs/2303.11366) | Verbal reinforcement via episodic self-reflection |
| [SPC](https://arxiv.org/abs/2504.19162) | Anti-reward-hacking via adversarial critic patterns |
| [CER](https://arxiv.org/abs/2506.06698) | Contextual experience distillation (Dynamics + Skills) |
| [MemRL](https://arxiv.org/abs/2601.03192) | Two-phase retrieval (semantic recall → Q-value re-rank) |
| [Voyager](https://arxiv.org/abs/2305.16291) | Skill library as long-term memory |
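
To make the two-phase retrieval idea concrete, here is a minimal sketch under stated assumptions: token-overlap (Jaccard) similarity stands in for real embeddings, and `Memory`, `jaccard`, `retrieve`, `recall_k`, and `top_k` are hypothetical names, not the API of `experience_replay.py`.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    task: str
    q_value: float  # learned usefulness estimate (illustrative values below)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query, memories, recall_k=3, top_k=2):
    # Phase 1: recall the recall_k memories most similar to the query.
    recalled = sorted(memories, key=lambda m: jaccard(query, m.task), reverse=True)[:recall_k]
    # Phase 2: re-rank only the recalled memories by Q-value.
    return sorted(recalled, key=lambda m: m.q_value, reverse=True)[:top_k]


memories = [
    Memory("find the red key in the maze", 0.9),
    Memory("find the blue key in the maze", 0.4),
    Memory("open the treasure chest", 0.95),
    Memory("find the exit of the maze", 0.7),
]
best = retrieve("find the green key in the maze", memories)
print([m.task for m in best])
```

The memory with the highest Q-value ("open the treasure chest") is never returned here because it does not survive the similarity recall, which is the point of ranking in two phases.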
## Quick Start
```python
from purpose_agent import Orchestrator, State
from purpose_agent.llm_backend import HFInferenceBackend
from purpose_agent.orchestrator import Environment, Action

# 1. Define your environment
class MyEnv(Environment):
    def execute(self, action, current_state):
        # Your environment logic: apply the action and return the resulting state
        return State(data={...})

# 2. Create orchestrator with any LLM backend
orch = Orchestrator(
    llm=HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras"),
    environment=MyEnv(),
    available_actions={"search": "Search for items", "navigate": "Go somewhere"},
    persistence_dir="./agent_memory",
)

# 3. Run tasks; the agent self-improves across runs
result = orch.run_task(purpose="Find the answer to X", max_steps=20)
print(result.summary())
print(orch.get_heuristic_report())  # See what it learned
```
## Swapping LLM Backends
```python
from purpose_agent import Orchestrator

# HuggingFace Inference Providers (cheapest)
from purpose_agent.llm_backend import HFInferenceBackend
llm = HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras")

# OpenAI
from purpose_agent.llm_backend import OpenAICompatibleBackend
llm = OpenAICompatibleBackend(model="gpt-4o")

# Local Ollama
llm = OpenAICompatibleBackend(
    model="llama3.2",
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# Use DIFFERENT models for Actor vs Critic (recommended for production).
# cheap_fast_model, strong_model, and my_env are placeholders for backends
# and an environment you have already created.
orch = Orchestrator(
    llm=cheap_fast_model,            # Actor: needs throughput
    critic_llm=strong_model,         # Purpose Function: needs accuracy
    optimizer_llm=cheap_fast_model,  # Runs infrequently
    environment=my_env,
)
```
## Purpose Function: Anti-Reward-Hacking Design
The Purpose Function system prompt enforces 7 strict rules:
1. **EVIDENCE REQUIRED**: Every score must cite specific observable state changes
2. **NO CREDIT FOR INTENTIONS**: Scores are based on actual state, not the agent's predictions
3. **NO SYCOPHANCY**: Lateral moves get Δ=0.0, regressions get negative Δ
4. **MONOTONIC SCALE**: Φ 0.0–10.0, proportional to progress
5. **ANTI-GAMING**: Superficial state manipulation is flagged and penalized
6. **CONSISTENCY**: Identical states must receive identical Φ scores (cache-enforced)
7. **CONFIDENCE**: Ambiguous evaluations get reduced delta magnitude

Additional programmatic safeguards:
- Score caching prevents inconsistent evaluations
- Anomaly detection flags suspiciously large single-step jumps
- Confidence threshold filters uncertain scores
- Z-score normalization prevents score inflation over long trajectories
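
The safeguards listed above could be layered around a raw scorer roughly as follows. This is a sketch under assumptions, not the code in `purpose_function.py`: `SafeguardedCritic`, `z_normalize`, the `(phi, confidence)` return shape, and the `max_jump` and `min_confidence` thresholds are all hypothetical.

```python
import statistics


class SafeguardedCritic:
    """Wraps a raw scorer with a cache, a jump detector, and a confidence gate."""

    def __init__(self, score_fn, max_jump=3.0, min_confidence=0.5):
        self.score_fn = score_fn  # assumed shape: state_key -> (phi, confidence)
        self.max_jump = max_jump
        self.min_confidence = min_confidence
        self._cache = {}

    def phi(self, state_key):
        # Score caching: an identical state always gets the identical score.
        if state_key not in self._cache:
            self._cache[state_key] = self.score_fn(state_key)
        return self._cache[state_key]

    def evaluate(self, state_key, new_state_key):
        phi_old, _ = self.phi(state_key)
        phi_new, confidence = self.phi(new_state_key)
        delta = phi_new - phi_old
        return {
            "delta": delta,
            "anomalous": abs(delta) > self.max_jump,       # suspiciously large jump
            "trusted": confidence >= self.min_confidence,  # confidence threshold
        }


def z_normalize(deltas):
    """Rescale a trajectory's deltas to zero mean and unit variance."""
    if len(deltas) < 2:
        return [0.0 for _ in deltas]
    mean = statistics.fmean(deltas)
    std = statistics.pstdev(deltas)
    return [0.0 for _ in deltas] if std == 0 else [(d - mean) / std for d in deltas]


calls = []

def fake_scorer(state_key):
    # Stand-in for an LLM critic call; records how often it is invoked.
    calls.append(state_key)
    table = {"start": (1.0, 0.9), "door_open": (2.5, 0.8), "teleported": (9.0, 0.3)}
    return table[state_key]


critic = SafeguardedCritic(fake_scorer)
step = critic.evaluate("start", "door_open")
jump = critic.evaluate("start", "teleported")
print(step, jump, len(calls))
```

In this toy run the `start` state is scored once even though it is evaluated twice, and the jump to `teleported` is both flagged as anomalous and untrusted because of its low confidence.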
## 3-Tier Memory System
Based on MUSE (arXiv:2510.08002):
| Tier | Content | Loading | Update Trigger |
|------|---------|---------|----------------|
| **Strategic** | `<Dilemma, Strategy>` pairs | Always in system prompt | After each task |
| **Procedural** | Step-by-step SOPs | Index in prompt, details on demand | After high-reward trajectory |
| **Tool** | Per-action tips | Returned per step | When new patterns prove effective |
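
The loading behavior in the table can be sketched as a small data structure. This is an illustrative assumption about how the tiers might be held in memory, not the layout used by `actor.py`; `TieredMemory` and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TieredMemory:
    strategic: list[tuple[str, str]] = field(default_factory=list)  # (dilemma, strategy)
    procedural: dict[str, str] = field(default_factory=dict)        # SOP title -> full steps
    tool: dict[str, list[str]] = field(default_factory=dict)        # action name -> tips

    def system_prompt(self) -> str:
        """Strategic pairs are always included; SOPs appear as titles only."""
        lines = ["Strategies:"]
        lines += [f"- When {dilemma}: {strategy}" for dilemma, strategy in self.strategic]
        lines.append("Available SOPs (request details by title):")
        lines += [f"- {title}" for title in self.procedural]
        return "\n".join(lines)

    def sop_details(self, title: str) -> str:
        """Procedural details are loaded on demand."""
        return self.procedural.get(title, "")

    def tips_for(self, action: str) -> list[str]:
        """Tool tips are returned per step, for the action being taken."""
        return self.tool.get(action, [])


memory = TieredMemory(
    strategic=[("two doors look identical", "check for markings before opening")],
    procedural={"Recover key": "1. Search room. 2. Open chest. 3. Take key."},
    tool={"search": ["Search corners first"]},
)
prompt = memory.system_prompt()
print(prompt)
```

Keeping only SOP titles in the prompt keeps the always-loaded context small, while the full steps stay available through `sop_details`.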
## Running the Demo
```bash
python demo.py
```
Runs 17 unit tests plus a full end-to-end demo in a simulated TreasureMaze environment. No API keys are needed because the demo uses `MockLLMBackend`.
## Dependencies
- **Core framework**: Python 3.10+ (stdlib only)
- **HF backend**: `huggingface_hub`
- **OpenAI backend**: `openai`
- **Production embeddings**: `sentence-transformers` (optional, for better retrieval)
## License
MIT