Add comprehensive README
README.md

# Purpose Agent – Self-Improving Agentic Framework via State-Value Evaluation

A lightweight, modular framework in which an LLM agent improves across tasks **without weight updates**, using an RL-inspired self-reflection loop with a "Purpose Function" that evaluates intermediate state improvements.

## Core Philosophy

The agent improves via a **Purpose Function Φ(s)** that measures distance-to-goal at every step. The agent is rewarded **only if Φ(s_new) > Φ(s_current)**. High-reward trajectories are distilled into reusable heuristics stored in a 3-tier memory system, so the agent gets smarter on each subsequent task.

**No real-time backprop. No PPO/DPO. Minimal infrastructure costs.**

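For intuition, here is a minimal sketch of that gating rule (illustrative only; the real critic is an LLM call in `purpose_function.py`, not a float-valued function):

```python
# Illustrative sketch of the Φ-gating rule; `phi_old`/`phi_new` stand in for
# critic scores that purpose_function.py actually produces via LLM calls.

def gated_reward(phi_old: float, phi_new: float) -> float:
    delta = phi_new - phi_old
    if delta > 0:
        return delta   # genuine progress: rewarded in proportion to ΔΦ
    if delta == 0:
        return 0.0     # lateral move: no credit
    return delta       # regression: negative reward


assert gated_reward(3.0, 4.5) == 1.5   # closer to the goal
assert gated_reward(4.5, 4.5) == 0.0   # no measurable progress
assert gated_reward(4.5, 3.0) == -1.5  # moved away from the goal
```
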
## Architecture

```
┌────────────────────────────────────────────────────────────────────┐
│                         ORCHESTRATOR LOOP                          │
│                                                                    │
│   ┌──────────┐   action    ┌─────────────┐    s_new                │
│   │  ACTOR   │ ──────────► │ ENVIRONMENT │ ──────────┐             │
│   │ (+memory)│             │ (your code) │           │             │
│   └────▲─────┘             └─────────────┘           │             │
│        │                                             ▼             │
│        │ heuristics    ┌────────────────┐      (s, a, s')          │
│        └───────────────│   OPTIMIZER    │ ◄──────────┤             │
│                        │ (distillation) │            │             │
│                        └────────────────┘            │             │
│                        ┌────────────────┐ Φ(s)→Φ(s') │             │
│                        │   PURPOSE FN   │ ◄──────────┤             │
│                        │ (state critic) │            │             │
│                        └────────────────┘            │             │
│                        ┌────────────────┐            │             │
│                        │   EXPERIENCE   │ ◄──────────┘             │
│                        │ REPLAY BUFFER  │                          │
│                        └────────────────┘                          │
└────────────────────────────────────────────────────────────────────┘
```

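In code, one task run through this loop reads roughly as follows. This is a sketch: apart from `Environment.execute()` (shown in the Quick Start), the method names `decide`, `score`, `store`, and `distill` are assumptions for illustration, not the framework's actual API.

```python
# Rough shape of one task run through the orchestrator loop; see
# orchestrator.py and the modules below for the real interfaces.

def run_loop(actor, env, purpose_fn, replay, optimizer, state, max_steps):
    trajectory = []
    phi = purpose_fn.score(state)                # Φ(s) for the current state
    for _ in range(max_steps):
        action = actor.decide(state)             # memory-augmented ReAct step
        new_state = env.execute(action, state)   # your environment logic
        new_phi = purpose_fn.score(new_state)    # Φ(s') from the state critic
        trajectory.append((state, action, new_state, new_phi - phi))
        state, phi = new_state, new_phi
    replay.store(trajectory)                     # (s, a, s') tuples with ΔΦ rewards
    optimizer.distill(replay)                    # winning runs become heuristics
```
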
## Modules

| Module | File | Role |
|--------|------|------|
| **Actor** | `actor.py` | ReAct-style agent with 3-tier memory-augmented prompts |
| **Purpose Function** | `purpose_function.py` | Strict, non-hackable LLM critic that scores Φ(s) transitions |
| **Experience Replay** | `experience_replay.py` | Trajectory storage with two-phase retrieval (similarity + Q-value; sketched below) |
| **Optimizer** | `optimizer.py` | Distills winning trajectories into reusable heuristics |
| **Orchestrator** | `orchestrator.py` | Main loop tying everything together |
| **LLM Backend** | `llm_backend.py` | Swappable inference layer (HF, OpenAI, Ollama, custom) |
| **Types** | `types.py` | Shared data structures (State, Action, Trajectory, Heuristic, etc.) |

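The Experience Replay module's two-phase retrieval can be pictured roughly like this (a sketch: the `Experience` fields and helpers below are illustrative assumptions, not the module's real API):

```python
import math
from dataclasses import dataclass

@dataclass
class Experience:
    embedding: list[float]   # semantic fingerprint of the situation
    q_value: float           # learned estimate of how well this memory paid off

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.dist(u, [0.0] * len(u)) * math.dist(v, [0.0] * len(v))
    return dot / norm if norm else 0.0

def retrieve(query: list[float], buffer: list[Experience],
             k_recall: int = 50, k_final: int = 5) -> list[Experience]:
    # Phase 1: semantic recall – nearest neighbours by embedding similarity.
    recalled = sorted(buffer, key=lambda e: cosine(query, e.embedding),
                      reverse=True)[:k_recall]
    # Phase 2: Q-value re-rank – keep the experiences that actually paid off.
    return sorted(recalled, key=lambda e: e.q_value, reverse=True)[:k_final]
```
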
## Literature Foundation

| Paper | Contribution to this framework |
|-------|--------------------------------|
| [MUSE](https://arxiv.org/abs/2510.08002) | 3-tier memory hierarchy (strategic/procedural/tool) |
| [LATS](https://arxiv.org/abs/2310.04406) | LLM-as-value-function V(s) pattern |
| [REMEMBERER](https://arxiv.org/abs/2306.07929) | Q-value experience replay with Bellman updates |
| [Reflexion](https://arxiv.org/abs/2303.11366) | Verbal reinforcement via episodic self-reflection |
| [SPC](https://arxiv.org/abs/2504.19162) | Anti-reward-hacking via adversarial critic patterns |
| [CER](https://arxiv.org/abs/2506.06698) | Contextual experience distillation (Dynamics + Skills) |
| [MemRL](https://arxiv.org/abs/2601.03192) | Two-phase retrieval (semantic recall → Q-value re-rank) |
| [Voyager](https://arxiv.org/abs/2305.16291) | Skill library as long-term memory |

## Quick Start

```python
from purpose_agent import Orchestrator, State
from purpose_agent.llm_backend import HFInferenceBackend
from purpose_agent.orchestrator import Environment, Action

# 1. Define your environment
class MyEnv(Environment):
    def execute(self, action: Action, current_state: State) -> State:
        # Your environment logic
        return State(data={...})

# 2. Create the orchestrator with any LLM backend
orch = Orchestrator(
    llm=HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras"),
    environment=MyEnv(),
    available_actions={"search": "Search for items", "navigate": "Go somewhere"},
    persistence_dir="./agent_memory",
)

# 3. Run tasks – the agent self-improves across runs
result = orch.run_task(purpose="Find the answer to X", max_steps=20)
print(result.summary())
print(orch.get_heuristic_report())  # See what it learned
```

## Swapping LLM Backends

```python
# Hugging Face Inference Providers (cheapest)
from purpose_agent.llm_backend import HFInferenceBackend
llm = HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras")

# OpenAI
from purpose_agent.llm_backend import OpenAICompatibleBackend
llm = OpenAICompatibleBackend(model="gpt-4o")

# Local Ollama
llm = OpenAICompatibleBackend(
    model="llama3.2",
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# Use DIFFERENT models for Actor vs Critic (recommended for production)
orch = Orchestrator(
    llm=cheap_fast_model,            # Actor – needs throughput
    critic_llm=strong_model,         # Purpose Function – needs accuracy
    optimizer_llm=cheap_fast_model,  # Runs infrequently
    environment=my_env,
)
```

## Purpose Function – Anti-Reward-Hacking Design

The Purpose Function's system prompt enforces 7 strict rules:

1. **EVIDENCE REQUIRED** – Every score must cite specific observable state changes
2. **NO CREDIT FOR INTENTIONS** – Scores are based on the actual state, not the agent's predictions
3. **NO SYCOPHANCY** – Lateral moves get Δ = 0.0; regressions get a negative Δ
4. **MONOTONIC SCALE** – Φ runs from 0.0 to 10.0, proportional to progress
5. **ANTI-GAMING** – Superficial state manipulation is flagged and penalized
6. **CONSISTENCY** – Identical states must receive identical Φ scores (cache-enforced)
7. **CONFIDENCE** – Ambiguous evaluations get a reduced delta magnitude

Additional programmatic safeguards (two of them sketched below):

- Score caching prevents inconsistent evaluations
- Anomaly detection flags suspiciously large single-step jumps
- A confidence threshold filters out uncertain scores
- Z-score normalization prevents score inflation over long trajectories

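The caching and anomaly-detection safeguards are easy to picture in code. The class and parameter names below are assumptions for illustration; `purpose_function.py` is the authoritative implementation.

```python
# Sketch of the score-cache and anomaly-flag safeguards (illustrative names).

class GuardedCritic:
    def __init__(self, critic, max_jump: float = 3.0):
        self.critic = critic               # underlying LLM critic returning Φ in [0, 10]
        self.cache: dict[str, float] = {}  # rule 6: identical states, identical Φ
        self.max_jump = max_jump           # largest plausible single-step improvement

    def phi(self, state_key: str) -> float:
        if state_key not in self.cache:
            self.cache[state_key] = self.critic(state_key)
        return self.cache[state_key]

    def delta(self, old_key: str, new_key: str) -> tuple[float, bool]:
        d = self.phi(new_key) - self.phi(old_key)
        return d, abs(d) > self.max_jump   # flag suspiciously large jumps
```
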
## 3-Tier Memory System

Based on MUSE (arXiv:2510.08002):

| Tier | Content | Loading | Update Trigger |
|------|---------|---------|----------------|
| **Strategic** | `<Dilemma, Strategy>` pairs | Always in the system prompt | After each task |
| **Procedural** | Step-by-step SOPs | Index in prompt, details on demand | After a high-reward trajectory |
| **Tool** | Per-action tips | Returned per step | When new patterns prove effective |

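A rough picture of how the three tiers might be shaped (illustrative dataclasses; the real structures are defined in `types.py`, and these field names are assumptions, not the actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class StrategicEntry:        # always injected into the system prompt
    dilemma: str
    strategy: str

@dataclass
class ProceduralSOP:         # index in prompt, full steps loaded on demand
    title: str
    steps: list[str] = field(default_factory=list)

@dataclass
class ToolTip:               # surfaced per step, keyed by action name
    action: str
    tip: str
```
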
## Running the Demo

```bash
python demo.py
```

This runs 17 unit tests plus a full end-to-end demo with a simulated TreasureMaze environment. No API keys are needed – it uses MockLLMBackend.

## Dependencies

- **Core framework**: Python 3.10+ (stdlib only)
- **HF backend**: `huggingface_hub`
- **OpenAI backend**: `openai`
- **Production embeddings**: `sentence-transformers` (optional, for better retrieval)

## License

MIT