Add comprehensive README
README.md

# Purpose Agent – Self-Improving Agentic Framework via State-Value Evaluation

A lightweight, modular framework in which an LLM agent improves across tasks **without weight updates**, using an RL-inspired self-reflection loop with a "Purpose Function" that evaluates intermediate state improvements.

## Core Philosophy

The agent improves via a **Purpose Function Φ(s)** that measures distance-to-goal at every step. The agent is rewarded **only if Φ(s_new) > Φ(s_current)**. High-reward trajectories are distilled into reusable heuristics stored in a 3-tier memory system, so the agent gets smarter on each subsequent task.

**No real-time backprop. No PPO/DPO. Minimal infrastructure costs.**

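For intuition, here is a minimal sketch of that gating rule (illustrative only; the real critic is an LLM call in `purpose_function.py`, not a float-valued function):

```python
# Illustrative sketch of the Φ-gating rule; `phi_old`/`phi_new` stand in for
# critic scores that purpose_function.py actually produces via LLM calls.

def gated_reward(phi_old: float, phi_new: float) -> float:
    delta = phi_new - phi_old
    if delta > 0:
        return delta   # genuine progress: rewarded in proportion to ΔΦ
    if delta == 0:
        return 0.0     # lateral move: no credit
    return delta       # regression: negative reward


assert gated_reward(3.0, 4.5) == 1.5   # closer to the goal
assert gated_reward(4.5, 4.5) == 0.0   # no measurable progress
assert gated_reward(4.5, 3.0) == -1.5  # moved away from the goal
```
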
## Architecture

```
┌────────────────────────────────────────────────────────────────────┐
│                         ORCHESTRATOR LOOP                          │
│                                                                    │
│   ┌──────────┐   action    ┌─────────────┐    s_new                │
│   │  ACTOR   │ ──────────► │ ENVIRONMENT │ ──────────┐             │
│   │ (+memory)│             │ (your code) │           │             │
│   └────▲─────┘             └─────────────┘           │             │
│        │                                             ▼             │
│        │ heuristics    ┌────────────────┐      (s, a, s')          │
│        └───────────────│   OPTIMIZER    │ ◄──────────┤             │
│                        │ (distillation) │            │             │
│                        └────────────────┘            │             │
│                        ┌────────────────┐ Φ(s)→Φ(s') │             │
│                        │   PURPOSE FN   │ ◄──────────┤             │
│                        │ (state critic) │            │             │
│                        └────────────────┘            │             │
│                        ┌────────────────┐            │             │
│                        │   EXPERIENCE   │ ◄──────────┘             │
│                        │ REPLAY BUFFER  │                          │
│                        └────────────────┘                          │
└────────────────────────────────────────────────────────────────────┘
```

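In code, one task run through this loop reads roughly as follows. This is a sketch: apart from `Environment.execute()` (shown in the Quick Start), the method names `decide`, `score`, `store`, and `distill` are assumptions for illustration, not the framework's actual API.

```python
# Rough shape of one task run through the orchestrator loop; see
# orchestrator.py and the modules below for the real interfaces.

def run_loop(actor, env, purpose_fn, replay, optimizer, state, max_steps):
    trajectory = []
    phi = purpose_fn.score(state)                # Φ(s) for the current state
    for _ in range(max_steps):
        action = actor.decide(state)             # memory-augmented ReAct step
        new_state = env.execute(action, state)   # your environment logic
        new_phi = purpose_fn.score(new_state)    # Φ(s') from the state critic
        trajectory.append((state, action, new_state, new_phi - phi))
        state, phi = new_state, new_phi
    replay.store(trajectory)                     # (s, a, s') tuples with ΔΦ rewards
    optimizer.distill(replay)                    # winning runs become heuristics
```
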
## Modules

| Module | File | Role |
|--------|------|------|
| **Actor** | `actor.py` | ReAct-style agent with 3-tier memory-augmented prompts |
| **Purpose Function** | `purpose_function.py` | Strict, non-hackable LLM critic that scores Φ(s) transitions |
| **Experience Replay** | `experience_replay.py` | Trajectory storage with two-phase retrieval (similarity + Q-value; sketched below) |
| **Optimizer** | `optimizer.py` | Distills winning trajectories into reusable heuristics |
| **Orchestrator** | `orchestrator.py` | Main loop tying everything together |
| **LLM Backend** | `llm_backend.py` | Swappable inference layer (HF, OpenAI, Ollama, custom) |
| **Types** | `types.py` | Shared data structures (State, Action, Trajectory, Heuristic, etc.) |

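The Experience Replay module's two-phase retrieval can be pictured roughly like this (a sketch: the `Experience` fields and helpers below are illustrative assumptions, not the module's real API):

```python
import math
from dataclasses import dataclass

@dataclass
class Experience:
    embedding: list[float]   # semantic fingerprint of the situation
    q_value: float           # learned estimate of how well this memory paid off

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.dist(u, [0.0] * len(u)) * math.dist(v, [0.0] * len(v))
    return dot / norm if norm else 0.0

def retrieve(query: list[float], buffer: list[Experience],
             k_recall: int = 50, k_final: int = 5) -> list[Experience]:
    # Phase 1: semantic recall – nearest neighbours by embedding similarity.
    recalled = sorted(buffer, key=lambda e: cosine(query, e.embedding),
                      reverse=True)[:k_recall]
    # Phase 2: Q-value re-rank – keep the experiences that actually paid off.
    return sorted(recalled, key=lambda e: e.q_value, reverse=True)[:k_final]
```
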
## Literature Foundation

| Paper | Contribution to this framework |
|-------|--------------------------------|
| [MUSE](https://arxiv.org/abs/2510.08002) | 3-tier memory hierarchy (strategic/procedural/tool) |
| [LATS](https://arxiv.org/abs/2310.04406) | LLM-as-value-function V(s) pattern |
| [REMEMBERER](https://arxiv.org/abs/2306.07929) | Q-value experience replay with Bellman updates |
| [Reflexion](https://arxiv.org/abs/2303.11366) | Verbal reinforcement via episodic self-reflection |
| [SPC](https://arxiv.org/abs/2504.19162) | Anti-reward-hacking via adversarial critic patterns |
| [CER](https://arxiv.org/abs/2506.06698) | Contextual experience distillation (Dynamics + Skills) |
| [MemRL](https://arxiv.org/abs/2601.03192) | Two-phase retrieval (semantic recall → Q-value re-rank) |
| [Voyager](https://arxiv.org/abs/2305.16291) | Skill library as long-term memory |

## Quick Start

```python
from purpose_agent import Orchestrator, State
from purpose_agent.llm_backend import HFInferenceBackend
from purpose_agent.orchestrator import Environment, Action

# 1. Define your environment
class MyEnv(Environment):
    def execute(self, action: Action, current_state: State) -> State:
        # Your environment logic
        return State(data={...})

# 2. Create the orchestrator with any LLM backend
orch = Orchestrator(
    llm=HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras"),
    environment=MyEnv(),
    available_actions={"search": "Search for items", "navigate": "Go somewhere"},
    persistence_dir="./agent_memory",
)

# 3. Run tasks – the agent self-improves across runs
result = orch.run_task(purpose="Find the answer to X", max_steps=20)
print(result.summary())
print(orch.get_heuristic_report())  # See what it learned
```

## Swapping LLM Backends

```python
# Hugging Face Inference Providers (cheapest)
from purpose_agent.llm_backend import HFInferenceBackend
llm = HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras")

# OpenAI
from purpose_agent.llm_backend import OpenAICompatibleBackend
llm = OpenAICompatibleBackend(model="gpt-4o")

# Local Ollama
llm = OpenAICompatibleBackend(
    model="llama3.2",
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# Use DIFFERENT models for Actor vs Critic (recommended for production)
orch = Orchestrator(
    llm=cheap_fast_model,            # Actor – needs throughput
    critic_llm=strong_model,         # Purpose Function – needs accuracy
    optimizer_llm=cheap_fast_model,  # Runs infrequently
    environment=my_env,
)
```

## Purpose Function – Anti-Reward-Hacking Design

The Purpose Function's system prompt enforces 7 strict rules:

1. **EVIDENCE REQUIRED** – Every score must cite specific observable state changes
2. **NO CREDIT FOR INTENTIONS** – Scores are based on the actual state, not the agent's predictions
3. **NO SYCOPHANCY** – Lateral moves get Δ = 0.0; regressions get a negative Δ
4. **MONOTONIC SCALE** – Φ runs from 0.0 to 10.0, proportional to progress
5. **ANTI-GAMING** – Superficial state manipulation is flagged and penalized
6. **CONSISTENCY** – Identical states must receive identical Φ scores (cache-enforced)
7. **CONFIDENCE** – Ambiguous evaluations get a reduced delta magnitude

Additional programmatic safeguards (two of them sketched below):

- Score caching prevents inconsistent evaluations
- Anomaly detection flags suspiciously large single-step jumps
- A confidence threshold filters out uncertain scores
- Z-score normalization prevents score inflation over long trajectories

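The caching and anomaly-detection safeguards are easy to picture in code. The class and parameter names below are assumptions for illustration; `purpose_function.py` is the authoritative implementation.

```python
# Sketch of the score-cache and anomaly-flag safeguards (illustrative names).

class GuardedCritic:
    def __init__(self, critic, max_jump: float = 3.0):
        self.critic = critic               # underlying LLM critic returning Φ in [0, 10]
        self.cache: dict[str, float] = {}  # rule 6: identical states, identical Φ
        self.max_jump = max_jump           # largest plausible single-step improvement

    def phi(self, state_key: str) -> float:
        if state_key not in self.cache:
            self.cache[state_key] = self.critic(state_key)
        return self.cache[state_key]

    def delta(self, old_key: str, new_key: str) -> tuple[float, bool]:
        d = self.phi(new_key) - self.phi(old_key)
        return d, abs(d) > self.max_jump   # flag suspiciously large jumps
```
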
## 3-Tier Memory System

Based on MUSE (arXiv:2510.08002):

| Tier | Content | Loading | Update Trigger |
|------|---------|---------|----------------|
| **Strategic** | `<Dilemma, Strategy>` pairs | Always in the system prompt | After each task |
| **Procedural** | Step-by-step SOPs | Index in prompt, details on demand | After a high-reward trajectory |
| **Tool** | Per-action tips | Returned per step | When new patterns prove effective |

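A rough picture of how the three tiers might be shaped (illustrative dataclasses; the real structures are defined in `types.py`, and these field names are assumptions, not the actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class StrategicEntry:        # always injected into the system prompt
    dilemma: str
    strategy: str

@dataclass
class ProceduralSOP:         # index in prompt, full steps loaded on demand
    title: str
    steps: list[str] = field(default_factory=list)

@dataclass
class ToolTip:               # surfaced per step, keyed by action name
    action: str
    tip: str
```
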
## Running the Demo

```bash
python demo.py
```

This runs 17 unit tests plus a full end-to-end demo with a simulated TreasureMaze environment. No API keys are needed – it uses MockLLMBackend.

## Dependencies

- **Core framework**: Python 3.10+ (stdlib only)
- **HF backend**: `huggingface_hub`
- **OpenAI backend**: `openai`
- **Production embeddings**: `sentence-transformers` (optional, for better retrieval)

## License

MIT