---
library_name: purpose-agent
license: mit
language:
- en
tags:
- reinforcement-learning
- agents
- self-improving
- experience-replay
- llm-as-judge
- state-value-evaluation
- memory-augmented
- react
- orchestration
- modular
pipeline_tag: text-generation
---

# Purpose Agent: Self-Improving Agentic Framework via State-Value Evaluation

A lightweight, modular framework where an LLM agent improves across tasks **without weight updates**, using an RL-inspired self-reflection loop with a "Purpose Function" that evaluates intermediate state improvements.

## Core Philosophy

The agent improves via a **Purpose Function Φ(s)** that measures distance-to-goal at every step. A step is rewarded **only if Φ(s_new) > Φ(s_current)**. High-reward trajectories are distilled into reusable heuristics stored in a 3-tier memory system, so the agent gets smarter on each subsequent task.

**No real-time backprop. No PPO/DPO. Minimal infrastructure costs.**
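A minimal runnable sketch of this gating rule; in the framework the Φ scores come from an LLM critic, and `step_reward` here is a hypothetical helper, not part of the API:

```python
def step_reward(phi_current: float, phi_new: float) -> float:
    """Positive reward only when Phi strictly improves; otherwise zero."""
    delta = phi_new - phi_current
    return delta if delta > 0 else 0.0

print(step_reward(3.0, 5.5))  # 2.5 -> real progress is rewarded
print(step_reward(5.5, 5.5))  # 0.0 -> a lateral move earns nothing
print(step_reward(5.5, 4.0))  # 0.0 -> a regression earns nothing
```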
## Architecture

```
┌───────────────────────────────────────────────────────────────┐
│                       ORCHESTRATOR LOOP                       │
│                                                               │
│   ┌──────────┐  action   ┌─────────────┐   s_new              │
│   │  ACTOR   │ ────────► │ ENVIRONMENT │ ─────────────┐       │
│   │ (+memory)│           │ (your code) │              │       │
│   └────▲─────┘           └─────────────┘              │       │
│        │                                              ▼       │
│        │ heuristics   ┌──────────────────┐       (s, a, s')   │
│        ├───────────── │    OPTIMIZER     │ ◄──────────┤       │
│        │              │  (distillation)  │            │       │
│        │              └──────────────────┘            │       │
│        │              ┌──────────────────┐ Φ(s)→Φ(s') │       │
│        │              │    PURPOSE FN    │ ◄──────────┤       │
│        │              │  (state critic)  │            │       │
│        │              └──────────────────┘            │       │
│        │              ┌──────────────────┐            │       │
│        └───────────── │    EXPERIENCE    │ ◄──────────┘       │
│                       │  REPLAY BUFFER   │                    │
│                       └──────────────────┘                    │
└───────────────────────────────────────────────────────────────┘
```
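Below is a compact, self-contained sketch of that loop; every name and the toy Φ heuristic are illustrative stand-ins, not the real module APIs:

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: str
    action: str
    new_state: str
    delta: float  # Phi(s') - Phi(s)

def actor(state: str, heuristics: list[str]) -> str:
    """ACTOR (+memory): pick an action, preferring learned heuristics."""
    return heuristics[-1] if heuristics else "explore"

def environment(action: str, state: str) -> str:
    """ENVIRONMENT (your code): apply the action, return the new state."""
    return f"{state}|{action}"

def purpose_fn(s: str, s_new: str) -> float:
    """PURPOSE FN (state critic): toy Phi where a longer state means progress."""
    return float(len(s_new) - len(s))

replay: list[Transition] = []   # EXPERIENCE REPLAY BUFFER
heuristics: list[str] = []
state = "start"
for _ in range(3):
    a = actor(state, heuristics)
    s_new = environment(a, state)
    replay.append(Transition(state, a, s_new, purpose_fn(state, s_new)))
    state = s_new

# OPTIMIZER (distillation): keep actions from positive-delta transitions
heuristics += [t.action for t in replay if t.delta > 0]
print(heuristics)
```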
## Modules

| Module | File | Role |
|--------|------|------|
| **Actor** | `actor.py` | ReAct-style agent with 3-tier memory-augmented prompts |
| **Purpose Function** | `purpose_function.py` | Strict, non-hackable LLM critic that scores Φ(s) transitions |
| **Experience Replay** | `experience_replay.py` | Trajectory storage with two-phase retrieval (similarity + Q-value; sketched below) |
| **Optimizer** | `optimizer.py` | Distills winning trajectories into reusable heuristics |
| **Orchestrator** | `orchestrator.py` | Main loop tying everything together |
| **LLM Backend** | `llm_backend.py` | Swappable inference layer (HF, OpenAI, Ollama, custom) |
| **Types** | `types.py` | Shared data structures (State, Action, Trajectory, Heuristic, etc.) |
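The two-phase retrieval noted in the **Experience Replay** row might look roughly like this sketch, with word overlap standing in for embedding similarity and all names hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Experience:
    task: str
    trajectory: list[str]
    q_value: float

def retrieve(query: str, buffer: list[Experience],
             k_recall: int = 20, k_final: int = 3) -> list[Experience]:
    # Phase 1: semantic recall (toy similarity: word overlap instead of embeddings)
    def sim(e: Experience) -> float:
        q, t = set(query.lower().split()), set(e.task.lower().split())
        return len(q & t) / max(len(q | t), 1)
    recalled = sorted(buffer, key=sim, reverse=True)[:k_recall]
    # Phase 2: re-rank the recalled candidates by their learned Q-value
    return sorted(recalled, key=lambda e: e.q_value, reverse=True)[:k_final]
```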
## Literature Foundation

| Paper | Contribution to this framework |
|-------|-------------------------------|
| [MUSE](https://arxiv.org/abs/2510.08002) | 3-tier memory hierarchy (strategic/procedural/tool) |
| [LATS](https://arxiv.org/abs/2310.04406) | LLM-as-value-function V(s) pattern |
| [REMEMBERER](https://arxiv.org/abs/2306.07929) | Q-value experience replay with Bellman updates |
| [Reflexion](https://arxiv.org/abs/2303.11366) | Verbal reinforcement via episodic self-reflection |
| [SPC](https://arxiv.org/abs/2504.19162) | Anti-reward-hacking via adversarial critic patterns |
| [CER](https://arxiv.org/abs/2506.06698) | Contextual experience distillation (Dynamics + Skills) |
| [MemRL](https://arxiv.org/abs/2601.03192) | Two-phase retrieval (semantic recall → Q-value re-rank) |
| [Voyager](https://arxiv.org/abs/2305.16291) | Skill library as long-term memory |
## Quick Start

```python
from purpose_agent import Orchestrator, State
from purpose_agent.llm_backend import HFInferenceBackend
from purpose_agent.orchestrator import Environment, Action

# 1. Define your environment
class MyEnv(Environment):
    def execute(self, action, current_state):
        # Your environment logic
        return State(data={...})

# 2. Create orchestrator with any LLM backend
orch = Orchestrator(
    llm=HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras"),
    environment=MyEnv(),
    available_actions={"search": "Search for items", "navigate": "Go somewhere"},
    persistence_dir="./agent_memory",
)

# 3. Run tasks; the agent self-improves across runs
result = orch.run_task(purpose="Find the answer to X", max_steps=20)
print(result.summary())
print(orch.get_heuristic_report())  # See what it learned
```
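For a slightly more concrete `execute`, here is a toy environment. It assumes `State` wraps a dict payload and that the passed-in action exposes a `name` attribute; both are assumptions about the API, not documented behavior:

```python
class CounterEnv(Environment):
    """Toy environment whose goal is to reach a target count."""
    def execute(self, action, current_state):
        count = current_state.data.get("count", 0)
        if action.name == "increment":  # assumed Action attribute
            count += 1
        return State(data={"count": count})
```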
## Swapping LLM Backends

```python
# HuggingFace Inference Providers (cheapest)
from purpose_agent.llm_backend import HFInferenceBackend
llm = HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras")

# OpenAI
from purpose_agent.llm_backend import OpenAICompatibleBackend
llm = OpenAICompatibleBackend(model="gpt-4o")

# Local Ollama
llm = OpenAICompatibleBackend(
    model="llama3.2",
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# Use DIFFERENT models for Actor vs Critic (recommended for production)
orch = Orchestrator(
    llm=cheap_fast_model,            # Actor: needs throughput
    critic_llm=strong_model,         # Purpose Function: needs accuracy
    optimizer_llm=cheap_fast_model,  # Runs infrequently
    environment=my_env,
)
```
## Purpose Function: Anti-Reward-Hacking Design

The Purpose Function system prompt enforces 7 strict rules (a condensed illustration follows the list):

1. **EVIDENCE REQUIRED**: Every score must cite specific observable state changes
2. **NO CREDIT FOR INTENTIONS**: Scores are based on the actual state, not the agent's predictions
3. **NO SYCOPHANCY**: Lateral moves get Δ=0.0, regressions get negative Δ
4. **MONOTONIC SCALE**: Φ runs from 0.0 to 10.0, proportional to progress
5. **ANTI-GAMING**: Superficial state manipulation is flagged and penalized
6. **CONSISTENCY**: Identical states must receive identical Φ scores (cache-enforced)
7. **CONFIDENCE**: Ambiguous evaluations get reduced delta magnitude
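A condensed illustration of what such a prompt can look like, paraphrased from the rules above; this is not the verbatim prompt shipped in `purpose_function.py`:

```python
CRITIC_SYSTEM_PROMPT = """\
You are a strict state critic. Score Phi(state) on a 0.0-10.0 scale.
1. Cite specific, observable state changes as evidence for every score.
2. Score the actual state, never the agent's stated intentions.
3. Lateral moves get delta = 0.0; regressions get negative delta.
4. Keep Phi proportional to real progress toward the stated purpose.
5. Penalize superficial state manipulation aimed at inflating Phi.
6. Identical states must receive identical scores.
7. When uncertain, report low confidence and shrink the delta magnitude.
"""
```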
Additional programmatic safeguards (sketched after the list):

- Score caching prevents inconsistent evaluations
- Anomaly detection flags suspiciously large single-step jumps
- A confidence threshold filters uncertain scores
- Z-score normalization prevents score inflation over long trajectories
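A minimal sketch of how these checks can compose, assuming the critic returns a (Φ, confidence) pair; the class and method names here are hypothetical, and the real checks live in `purpose_function.py`:

```python
import statistics

class SafeguardedCritic:
    """Wraps a raw critic callable with the safeguards listed above."""

    def __init__(self, critic, max_jump: float = 4.0, min_confidence: float = 0.6):
        self.critic = critic                # callable: state -> (phi, confidence)
        self.cache: dict[str, float] = {}   # score caching: same state, same Phi
        self.max_jump = max_jump
        self.min_confidence = min_confidence

    def score(self, state: str, prev_phi: float) -> float:
        if state in self.cache:
            return self.cache[state]
        phi, confidence = self.critic(state)
        if confidence < self.min_confidence:  # confidence threshold: shrink the delta
            phi = prev_phi + 0.5 * (phi - prev_phi)
        jump = phi - prev_phi
        if abs(jump) > self.max_jump:         # anomaly detection: clamp huge jumps
            phi = prev_phi + self.max_jump * (1 if jump > 0 else -1)
        self.cache[state] = phi
        return phi

def z_normalize(phis: list[float]) -> list[float]:
    """Z-score normalization against inflation over long trajectories."""
    mu = statistics.mean(phis)
    sigma = statistics.pstdev(phis) or 1.0
    return [(p - mu) / sigma for p in phis]
```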
## 3-Tier Memory System

Based on MUSE (arXiv:2510.08002):

| Tier | Content | Loading | Update Trigger |
|------|---------|---------|----------------|
| **Strategic** | `<Dilemma, Strategy>` pairs | Always in system prompt | After each task |
| **Procedural** | Step-by-step SOPs | Index in prompt, details on demand | After high-reward trajectory |
| **Tool** | Per-action tips | Returned per step | When new patterns prove effective |
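A rough data-model sketch of these tiers; the field and class names are illustrative, while the real structures live in `types.py` and `actor.py`:

```python
from dataclasses import dataclass, field

@dataclass
class StrategicEntry:
    dilemma: str      # <Dilemma, Strategy> pair, always in the system prompt
    strategy: str

@dataclass
class ProceduralSOP:
    title: str        # only the title (index) is kept in-prompt
    steps: list[str]  # full steps are loaded on demand

@dataclass
class ToolTip:
    action: str       # tip surfaced when this action is taken
    tip: str

@dataclass
class ThreeTierMemory:
    strategic: list[StrategicEntry] = field(default_factory=list)
    procedural: list[ProceduralSOP] = field(default_factory=list)
    tool: list[ToolTip] = field(default_factory=list)

    def system_prompt_block(self) -> str:
        """Strategic pairs in full, procedural SOPs as an index only."""
        lines = [f"- If {e.dilemma}: {e.strategy}" for e in self.strategic]
        lines += [f"- SOP available: {s.title}" for s in self.procedural]
        return "\n".join(lines)

    def tips_for(self, action: str) -> list[str]:
        """Tool tips are returned per step, for the action being taken."""
        return [t.tip for t in self.tool if t.action == action]
```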
## Running the Demo

```bash
python demo.py
```

This runs 17 unit tests plus a full end-to-end demo with a simulated TreasureMaze environment. No API keys are needed; it uses MockLLMBackend.
## Dependencies

- **Core framework**: Python 3.10+ (stdlib only)
- **HF backend**: `huggingface_hub`
- **OpenAI backend**: `openai`
- **Production embeddings**: `sentence-transformers` (optional, for better retrieval)

## License

MIT