---
library_name: purpose-agent
license: mit
language:
- en
tags:
- reinforcement-learning
- agents
- self-improving
- experience-replay
- llm-as-judge
- state-value-evaluation
- memory-augmented
- react
- orchestration
- modular
pipeline_tag: text-generation
---
# Purpose Agent: Self-Improving Agentic Framework via State-Value Evaluation
A lightweight, modular framework where an LLM agent improves across tasks **without weight updates**, using an RL-inspired self-reflection loop with a "Purpose Function" that evaluates intermediate state improvements.
## Core Philosophy
The agent improves via a **Purpose Function Φ(s)** that measures distance-to-goal at every step. It rewards the agent **only if Φ(s_new) > Φ(s_current)**. High-reward trajectories are distilled into reusable heuristics stored in a 3-tier memory system, so the agent gets smarter on each subsequent task.
**No real-time backprop. No PPO/DPO. Minimal infrastructure costs.**
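To make the gating rule concrete, here is a minimal sketch of it in isolation (illustrative only, not the framework's internal scoring code):
```python
def step_reward(phi_current: float, phi_new: float, epsilon: float = 1e-6) -> float:
    """Reward a transition only when the Purpose Function improves.

    Progress earns the positive delta, lateral moves earn nothing, and
    regressions earn a negative signal that keeps the trajectory out of
    the heuristic distillation pool.
    """
    delta = phi_new - phi_current
    if abs(delta) <= epsilon:  # lateral move: no credit
        return 0.0
    return delta               # positive for progress, negative for regression

# Φ(s_current)=4.0 -> Φ(s_new)=5.5 rewards +1.5; 5.5 -> 5.5 rewards 0.0; 5.5 -> 4.5 rewards -1.0
print(step_reward(4.0, 5.5), step_reward(5.5, 5.5), step_reward(5.5, 4.5))
```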
## Architecture
```
┌─────────────────────────────────────────────────────────┐
│                    ORCHESTRATOR LOOP                     │
│                                                          │
│   ┌──────────┐  action   ┌─────────────┐   s_new        │
│   │  ACTOR   │ ────────► │ ENVIRONMENT │ ──────────┐    │
│   │(+memory) │           │ (your code) │           │    │
│   └────▲─────┘           └─────────────┘           │    │
│        │                                           ▼    │
│        │  heuristics ┌────────────────┐ (s, a, s') │    │
│        │◄────────────│   OPTIMIZER    │◄───────────┤    │
│        │             │ (distillation) │            │    │
│        │             └────────────────┘            │    │
│        │             ┌────────────────┐ Φ(s)→Φ(s') │    │
│        │             │   PURPOSE FN   │────────────┤    │
│        │             │ (state critic) │            │    │
│        │             └────────────────┘            │    │
│        │             ┌────────────────┐            │    │
│        └─────────────│   EXPERIENCE   │◄───────────┘    │
│                      │ REPLAY BUFFER  │                 │
│                      └────────────────┘                 │
└─────────────────────────────────────────────────────────┘
```
## Modules
| Module | File | Role |
|--------|------|------|
| **Actor** | `actor.py` | ReAct-style agent with 3-tier memory-augmented prompts |
| **Purpose Function** | `purpose_function.py` | Strict, non-hackable LLM critic that scores Φ(s) transitions |
| **Experience Replay** | `experience_replay.py` | Trajectory storage with two-phase retrieval (similarity + Q-value) |
| **Optimizer** | `optimizer.py` | Distills winning trajectories into reusable heuristics |
| **Orchestrator** | `orchestrator.py` | Main loop tying everything together |
| **LLM Backend** | `llm_backend.py` | Swappable inference layer (HF, OpenAI, Ollama, custom) |
| **Types** | `types.py` | Shared data structures (State, Action, Trajectory, Heuristic, etc.) |
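As a rough orientation, the shared data structures in `types.py` can be pictured like the dataclasses below. The field names and the extra `Transition` type are illustrative assumptions, not the module's actual definitions:
```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class State:
    """A snapshot of the environment that the critic can score."""
    data: dict[str, Any] = field(default_factory=dict)

@dataclass
class Action:
    """One tool call chosen by the Actor."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)

@dataclass
class Transition:
    """(s, a, s') plus the Φ delta assigned by the Purpose Function."""
    state: State
    action: Action
    next_state: State
    phi_delta: float = 0.0

@dataclass
class Trajectory:
    """A full task attempt; high-reward ones feed the Optimizer."""
    purpose: str
    transitions: list[Transition] = field(default_factory=list)
```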
## Literature Foundation
| Paper | Contribution to this framework |
|-------|-------------------------------|
| [MUSE](https://arxiv.org/abs/2510.08002) | 3-tier memory hierarchy (strategic/procedural/tool) |
| [LATS](https://arxiv.org/abs/2310.04406) | LLM-as-value-function V(s) pattern |
| [REMEMBERER](https://arxiv.org/abs/2306.07929) | Q-value experience replay with Bellman updates |
| [Reflexion](https://arxiv.org/abs/2303.11366) | Verbal reinforcement via episodic self-reflection |
| [SPC](https://arxiv.org/abs/2504.19162) | Anti-reward-hacking via adversarial critic patterns |
| [CER](https://arxiv.org/abs/2506.06698) | Contextual experience distillation (Dynamics + Skills) |
| [MemRL](https://arxiv.org/abs/2601.03192) | Two-phase retrieval (semantic recall → Q-value re-rank) |
| [Voyager](https://arxiv.org/abs/2305.16291) | Skill library as long-term memory |
## Quick Start
```python
from purpose_agent import Orchestrator, State
from purpose_agent.llm_backend import HFInferenceBackend
from purpose_agent.orchestrator import Environment, Action

# 1. Define your environment
class MyEnv(Environment):
    def execute(self, action, current_state):
        # Your environment logic
        return State(data={...})

# 2. Create orchestrator with any LLM backend
orch = Orchestrator(
    llm=HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras"),
    environment=MyEnv(),
    available_actions={"search": "Search for items", "navigate": "Go somewhere"},
    persistence_dir="./agent_memory",
)

# 3. Run tasks: the agent self-improves across runs
result = orch.run_task(purpose="Find the answer to X", max_steps=20)
print(result.summary())
print(orch.get_heuristic_report())  # See what it learned
```
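Because heuristics and replayed experience live in `persistence_dir`, a later session that points at the same directory starts warmer than the first. A brief usage sketch under that assumption (the task string is a placeholder):
```python
# Later session: reuse the same memory directory so prior heuristics load.
orch = Orchestrator(
    llm=HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras"),
    environment=MyEnv(),
    available_actions={"search": "Search for items", "navigate": "Go somewhere"},
    persistence_dir="./agent_memory",  # same directory as the first run
)
result = orch.run_task(purpose="Find the answer to Y", max_steps=20)
print(orch.get_heuristic_report())     # should now list the distilled heuristics
```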
## Swapping LLM Backends
```python
# HuggingFace Inference Providers (cheapest)
from purpose_agent.llm_backend import HFInferenceBackend
llm = HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras")

# OpenAI
from purpose_agent.llm_backend import OpenAICompatibleBackend
llm = OpenAICompatibleBackend(model="gpt-4o")

# Local Ollama
llm = OpenAICompatibleBackend(
    model="llama3.2",
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# Use DIFFERENT models for Actor vs Critic (recommended for production)
orch = Orchestrator(
    llm=cheap_fast_model,            # Actor: needs throughput
    critic_llm=strong_model,         # Purpose Function: needs accuracy
    optimizer_llm=cheap_fast_model,  # Runs infrequently
    environment=my_env,
)
```
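If none of the bundled backends fit, a custom one can wrap any OpenAI-compatible server. The class below is a hypothetical example: the `generate()` method name and signature are assumptions for illustration, so check `llm_backend.py` for the exact interface the Orchestrator expects.
```python
# Hypothetical custom backend; adapt the method name/signature to llm_backend.py.
import requests

class MyHTTPBackend:
    """Minimal chat-completion client for an OpenAI-compatible server."""

    def __init__(self, model: str, base_url: str, api_key: str = "none"):
        self.model = model
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key

    def generate(self, prompt: str, system: str = "", temperature: float = 0.2) -> str:
        # Standard OpenAI-style /chat/completions request.
        resp = requests.post(
            f"{self.base_url}/chat/completions",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={
                "model": self.model,
                "temperature": temperature,
                "messages": [
                    {"role": "system", "content": system},
                    {"role": "user", "content": prompt},
                ],
            },
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```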
## Purpose Function: Anti-Reward-Hacking Design
The Purpose Function system prompt enforces 7 strict rules:
1. **EVIDENCE REQUIRED**: Every score must cite specific observable state changes
2. **NO CREDIT FOR INTENTIONS**: Scores based on actual state, not the agent's predictions
3. **NO SYCOPHANCY**: Lateral moves get Δ=0.0, regressions get negative Δ
4. **MONOTONIC SCALE**: Φ ranges 0.0–10.0, proportional to progress
5. **ANTI-GAMING**: Superficial state manipulation is flagged and penalized
6. **CONSISTENCY**: Identical states must receive identical Φ scores (cache-enforced)
7. **CONFIDENCE**: Ambiguous evaluations get reduced delta magnitude
Additional programmatic safeguards (two of them are sketched after this list):
- Score caching prevents inconsistent evaluations
- Anomaly detection flags suspiciously large single-step jumps
- Confidence threshold filters uncertain scores
- Z-score normalization prevents score inflation over long trajectories
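The sketch below shows how the first two safeguards, score caching and single-step jump detection, could be implemented; it is illustrative only, and the jump threshold is an arbitrary placeholder rather than the framework's default.
```python
import hashlib
import json

class ScoreGuard:
    """Caches Φ by state fingerprint and flags implausible single-step jumps."""

    def __init__(self, max_step_jump: float = 3.0):
        self.cache: dict[str, float] = {}
        self.max_step_jump = max_step_jump

    @staticmethod
    def fingerprint(state_data: dict) -> str:
        # Identical states hash identically, which makes rule 6 (consistency) enforceable.
        return hashlib.sha256(json.dumps(state_data, sort_keys=True).encode()).hexdigest()

    def score(self, state_data: dict, fresh_phi: float) -> float:
        key = self.fingerprint(state_data)
        if key in self.cache:
            return self.cache[key]  # reuse the earlier verdict instead of re-judging
        self.cache[key] = fresh_phi
        return fresh_phi

    def is_anomalous(self, phi_prev: float, phi_new: float) -> bool:
        # Rule 5: a huge single-step jump is suspicious and worth re-checking.
        return abs(phi_new - phi_prev) > self.max_step_jump
```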
## 3-Tier Memory System
Based on MUSE (arxiv:2510.08002):
| Tier | Content | Loading | Update Trigger |
|------|---------|---------|----------------|
| **Strategic** | `<Dilemma, Strategy>` pairs | Always in system prompt | After each task |
| **Procedural** | Step-by-step SOPs | Index in prompt, details on demand | After high-reward trajectory |
| **Tool** | Per-action tips | Returned per step | When new patterns prove effective |
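To show how the three tiers in the table could come together at prompt time, here is an illustrative sketch; the storage format and loading policy in `actor.py` may differ, and the example memory entries are made up.
```python
# Illustrative memory contents (not real distilled heuristics).
strategic = [("Stuck in a loop", "Re-read the purpose and pick an unexplored action")]
procedural_index = ["SOP-1: systematic room search", "SOP-2: verify before reporting"]
tool_tips = {"search": "Prefer narrow queries; broad ones waste steps"}

def build_system_prompt(action_name: str) -> str:
    lines = ["## Strategic memory (always loaded)"]
    lines += [f"- Dilemma: {d} -> Strategy: {s}" for d, s in strategic]
    lines.append("## Procedural memory (index only; details fetched on demand)")
    lines += [f"- {sop}" for sop in procedural_index]
    lines.append("## Tool tip for the current action")
    lines.append(tool_tips.get(action_name, "(none)"))
    return "\n".join(lines)

print(build_system_prompt("search"))
```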
## Running the Demo
```bash
python demo.py
```
Runs 17 unit tests + a full end-to-end demo with a simulated TreasureMaze environment. No API keys needed: it uses MockLLMBackend.
## Dependencies
- **Core framework**: Python 3.10+ (stdlib only)
- **HF backend**: `huggingface_hub`
- **OpenAI backend**: `openai`
- **Production embeddings**: `sentence-transformers` (optional, for better retrieval)
## License
MIT