---
library_name: purpose-agent
license: mit
language:
- en
tags:
- reinforcement-learning
- agents
- self-improving
- experience-replay
- llm-as-judge
- state-value-evaluation
- memory-augmented
- react
- orchestration
- modular
pipeline_tag: text-generation
---

# Purpose Agent — Self-Improving Agentic Framework via State-Value Evaluation

A lightweight, modular framework where an LLM agent improves across tasks **without weight updates** — using an RL-inspired self-reflection loop with a "Purpose Function" that evaluates intermediate state improvements.

## Core Philosophy

The agent improves via a **Purpose Function Φ(s)** that measures distance-to-goal at every step. It rewards the agent **only if Φ(s_new) > Φ(s_current)**. High-reward trajectories are distilled into reusable heuristics stored in a 3-tier memory system, so the agent gets smarter on each subsequent task.

**No real-time backprop. No PPO/DPO. Minimal infrastructure costs.**

## Architecture

```
┌──────────────────────────────────────────────────────────┐
│                    ORCHESTRATOR LOOP                     │
│                                                          │
│  ┌──────────┐  action   ┌─────────────┐   s_new          │
│  │  ACTOR   │ ────────► │ ENVIRONMENT │ ──────────┐      │
│  │(+memory) │           │ (your code) │           │      │
│  └────▲─────┘           └─────────────┘           │      │
│       │                                           ▼      │
│       │  heuristics ┌────────────────┐       (s, a, s')  │
│       │◄────────────│   OPTIMIZER    │◄──────────┐       │
│       │             │ (distillation) │           │       │
│       │             └────────────────┘           │       │
│       │             ┌────────────────┐ Φ(s)→Φ(s')│       │
│       │             │   PURPOSE FN   │───────────┤       │
│       │             │ (state critic) │           │       │
│       │             └────────────────┘           │       │
│       │             ┌────────────────┐           │       │
│       └─────────────│   EXPERIENCE   │◄──────────┘       │
│                     │ REPLAY BUFFER  │                   │
│                     └────────────────┘                   │
└──────────────────────────────────────────────────────────┘
```

## Modules

| Module | File | Role |
|--------|------|------|
| **Actor** | `actor.py` | ReAct-style agent with 3-tier memory-augmented prompts |
| **Purpose Function** | `purpose_function.py` | Strict, non-hackable LLM critic that scores Φ(s) transitions |
| **Experience Replay** | `experience_replay.py` | Trajectory storage with two-phase retrieval (similarity + Q-value) |
| **Optimizer** | `optimizer.py` | Distills winning trajectories into reusable heuristics |
| **Orchestrator** | `orchestrator.py` | Main loop tying everything together |
| **LLM Backend** | `llm_backend.py` | Swappable inference layer (HF, OpenAI, Ollama, custom) |
| **Types** | `types.py` | Shared data structures (State, Action, Trajectory, Heuristic, etc.) |

## Literature Foundation

| Paper | Contribution to this framework |
|-------|-------------------------------|
| [MUSE](https://arxiv.org/abs/2510.08002) | 3-tier memory hierarchy (strategic/procedural/tool) |
| [LATS](https://arxiv.org/abs/2310.04406) | LLM-as-value-function V(s) pattern |
| [REMEMBERER](https://arxiv.org/abs/2306.07929) | Q-value experience replay with Bellman updates |
| [Reflexion](https://arxiv.org/abs/2303.11366) | Verbal reinforcement via episodic self-reflection |
| [SPC](https://arxiv.org/abs/2504.19162) | Anti-reward-hacking via adversarial critic patterns |
| [CER](https://arxiv.org/abs/2506.06698) | Contextual experience distillation (Dynamics + Skills) |
| [MemRL](https://arxiv.org/abs/2601.03192) | Two-phase retrieval (semantic recall → Q-value re-rank) |
| [Voyager](https://arxiv.org/abs/2305.16291) | Skill library as long-term memory |

## Quick Start

```python
from purpose_agent import Orchestrator, State
from purpose_agent.llm_backend import HFInferenceBackend
from purpose_agent.orchestrator import Environment, Action

# 1. Define your environment
class MyEnv(Environment):
    def execute(self, action, current_state):
        # Your environment logic
        return State(data={...})

# 2. Create orchestrator with any LLM backend
orch = Orchestrator(
    llm=HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras"),
    environment=MyEnv(),
    available_actions={"search": "Search for items", "navigate": "Go somewhere"},
    persistence_dir="./agent_memory",
)

# 3. Run tasks — the agent self-improves across runs
result = orch.run_task(purpose="Find the answer to X", max_steps=20)
print(result.summary())
print(orch.get_heuristic_report())  # See what it learned
```
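To make the loop concrete, here is a minimal sketch of a single orchestrator iteration applying the Φ-gated reward rule from Core Philosophy. Everything in it is illustrative: `Transition`, `TrajectoryBuffer`, `run_step`, and the `phi`, `actor_choose`, and `env_execute` callables are hypothetical names, not the actual API in `orchestrator.py`.

```python
# Hypothetical sketch of one orchestrator iteration (not the library's API).
from dataclasses import dataclass, field

@dataclass
class Transition:
    state: dict
    action: str
    next_state: dict
    reward: float  # Δ = Φ(s') - Φ(s); positive only for real progress

@dataclass
class TrajectoryBuffer:
    transitions: list = field(default_factory=list)

    def total_reward(self) -> float:
        return sum(t.reward for t in self.transitions)

def run_step(phi, actor_choose, env_execute, s: dict, buffer: TrajectoryBuffer) -> dict:
    """Act, observe, score Φ before and after, and record (s, a, s').

    The gate mirrors the critic's rules: Δ > 0 only when Φ improved,
    Δ = 0 for lateral moves, Δ < 0 for regressions.
    """
    action = actor_choose(s)          # ACTOR proposes an action
    s_new = env_execute(action, s)    # ENVIRONMENT executes it
    delta = phi(s_new) - phi(s)       # PURPOSE FN scores the transition
    buffer.transitions.append(Transition(s, action, s_new, delta))
    return s_new
```

High-Δ trajectories collected this way are what the Optimizer later distills into the heuristics held in the 3-tier memory described below.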
## Swapping LLM Backends

```python
# HuggingFace Inference Providers (cheapest)
from purpose_agent.llm_backend import HFInferenceBackend
llm = HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras")

# OpenAI
from purpose_agent.llm_backend import OpenAICompatibleBackend
llm = OpenAICompatibleBackend(model="gpt-4o")

# Local Ollama
llm = OpenAICompatibleBackend(
    model="llama3.2",
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# Use DIFFERENT models for Actor vs Critic (recommended for production)
orch = Orchestrator(
    llm=cheap_fast_model,            # Actor — needs throughput
    critic_llm=strong_model,         # Purpose Function — needs accuracy
    optimizer_llm=cheap_fast_model,  # Runs infrequently
    environment=my_env,
)
```

## Purpose Function — Anti-Reward-Hacking Design

The Purpose Function system prompt enforces 7 strict rules:

1. **EVIDENCE REQUIRED** — Every score must cite specific observable state changes
2. **NO CREDIT FOR INTENTIONS** — Scores are based on the actual state, not the agent's predictions
3. **NO SYCOPHANCY** — Lateral moves get Δ=0.0, regressions get negative Δ
4. **MONOTONIC SCALE** — Φ runs 0.0–10.0, proportional to progress
5. **ANTI-GAMING** — Superficial state manipulation is flagged and penalized
6. **CONSISTENCY** — Identical states must receive identical Φ scores (cache-enforced)
7. **CONFIDENCE** — Ambiguous evaluations get reduced delta magnitude

Additional programmatic safeguards:

- Score caching prevents inconsistent evaluations
- Anomaly detection flags suspiciously large single-step jumps
- A confidence threshold filters out uncertain scores
- Z-score normalization prevents score inflation over long trajectories

## 3-Tier Memory System

Based on MUSE (arxiv:2510.08002):

| Tier | Content | Loading | Update Trigger |
|------|---------|---------|----------------|
| **Strategic** | `` pairs | Always in system prompt | After each task |
| **Procedural** | Step-by-step SOPs | Index in prompt, details on demand | After high-reward trajectory |
| **Tool** | Per-action tips | Returned per step | When new patterns prove effective |

## Running the Demo

```bash
python demo.py
```

Runs 17 unit tests plus a full end-to-end demo with a simulated TreasureMaze environment. No API keys needed — it uses `MockLLMBackend`.

## Dependencies

- **Core framework**: Python 3.10+ (stdlib only)
- **HF backend**: `huggingface_hub`
- **OpenAI backend**: `openai`
- **Production embeddings**: `sentence-transformers` (optional, for better retrieval)

## License

MIT
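## Appendix: Safeguard Sketch (Illustrative)

To give the programmatic safeguards listed above some texture, here is a compact, self-contained sketch of score caching, single-step anomaly flagging, and z-score normalization. It is an assumption-laden illustration: `SafeguardedCritic` and its methods are hypothetical names, not the actual API of `purpose_function.py`.

```python
# Illustrative sketch of the Purpose Function's programmatic safeguards
# (score caching, anomaly detection, z-score normalization).
# Hypothetical names; not the real purpose_function.py API.
import hashlib
import json
import statistics

class SafeguardedCritic:
    def __init__(self, raw_phi, max_jump: float = 3.0):
        self.raw_phi = raw_phi    # underlying LLM critic: dict -> float
        self.cache = {}           # CONSISTENCY: identical state, identical Φ
        self.history = []         # per-trajectory scores for normalization
        self.max_jump = max_jump  # assumed threshold for anomaly flagging

    def _key(self, state: dict) -> str:
        # Canonical JSON so logically identical states hash identically
        payload = json.dumps(state, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def score(self, state: dict) -> float:
        key = self._key(state)
        if key not in self.cache:  # score caching
            self.cache[key] = self.raw_phi(state)
        phi = self.cache[key]
        # Anomaly detection: flag suspiciously large single-step jumps
        if self.history and abs(phi - self.history[-1]) > self.max_jump:
            print("WARN: large single-step Φ jump; possible reward gaming")
        self.history.append(phi)
        return phi

    def normalized(self, phi: float) -> float:
        # Z-score against the trajectory so score inflation stays visible
        if len(self.history) < 2:
            return 0.0
        mu = statistics.mean(self.history)
        sigma = statistics.stdev(self.history) or 1.0
        return (phi - mu) / sigma
```

A real deployment would also apply the confidence threshold from the safeguard list before accepting a Δ.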