---
library_name: purpose-agent
license: mit
language:
  - en
tags:
  - reinforcement-learning
  - agents
  - self-improving
  - experience-replay
  - llm-as-judge
  - state-value-evaluation
  - memory-augmented
  - react
  - orchestration
  - modular
pipeline_tag: text-generation
---

Purpose Agent: Self-Improving Agentic Framework via State-Value Evaluation

A lightweight, modular framework where an LLM agent improves across tasks without weight updates, using an RL-inspired self-reflection loop with a "Purpose Function" that evaluates intermediate state improvements.

Core Philosophy

The agent improves via a Purpose Function Φ(s) that measures distance-to-goal at every step. A reward is granted only when Φ(s_new) > Φ(s_current). High-reward trajectories are distilled into reusable heuristics stored in a 3-tier memory system, so the agent gets smarter on each subsequent task.

No real-time backprop. No PPO/DPO. Minimal infrastructure costs.
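
A minimal sketch of that reward rule (illustrative only; the step_reward name is not part of the library):

def step_reward(phi_current: float, phi_new: float) -> float:
    # Reward only genuine progress: a strictly positive increase in Phi.
    # Lateral moves and regressions earn no reward under this rule.
    delta = phi_new - phi_current
    return delta if delta > 0 else 0.0

# e.g. step_reward(4.0, 6.5) -> 2.5, while step_reward(6.5, 6.5) -> 0.0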

Architecture

┌────────────────────────────────────────────────────────────────┐
│                       ORCHESTRATOR LOOP                        │
│                                                                │
│  ┌──────────┐   action    ┌─────────────┐   s_new              │
│  │  ACTOR   │ ──────────► │ ENVIRONMENT │ ─────────┐           │
│  │(+memory) │             │ (your code) │          │           │
│  └────▲─────┘             └─────────────┘          │           │
│       │                                            ▼           │
│       │  heuristics    ┌────────────────┐   (s, a, s')         │
│       │◄───────────────│   OPTIMIZER    │◄─────────┐           │
│       │                │ (distillation) │          │           │
│       │                └────────────────┘          │           │
│       │                ┌────────────────┐  Φ(s)→Φ(s')          │
│       │                │   PURPOSE FN   │──────────┤           │
│       │                │ (state critic) │          │           │
│       │                └────────────────┘          │           │
│       │                ┌────────────────┐          │           │
│       └────────────────│  EXPERIENCE    │◄─────────┘           │
│                        │ REPLAY BUFFER  │                      │
│                        └────────────────┘                      │
└────────────────────────────────────────────────────────────────┘

Modules

| Module | File | Role |
|---|---|---|
| Actor | actor.py | ReAct-style agent with 3-tier memory-augmented prompts |
| Purpose Function | purpose_function.py | Strict, non-hackable LLM critic that scores Φ(s) transitions |
| Experience Replay | experience_replay.py | Trajectory storage with two-phase retrieval (similarity + Q-value) |
| Optimizer | optimizer.py | Distills winning trajectories into reusable heuristics |
| Orchestrator | orchestrator.py | Main loop tying everything together |
| LLM Backend | llm_backend.py | Swappable inference layer (HF, OpenAI, Ollama, custom) |
| Types | types.py | Shared data structures (State, Action, Trajectory, Heuristic, etc.) |
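
For orientation, here is a rough sketch of the kinds of records these modules exchange. Apart from State.data, which appears in the Quick Start below, every field name here is an illustrative guess, not the actual definition in types.py.

from dataclasses import dataclass, field
from typing import Any

@dataclass
class State:
    data: dict[str, Any]      # State(data={...}) is the form used in the Quick Start

@dataclass
class Action:
    name: str                 # hypothetical fields, for illustration only
    arguments: dict[str, Any] = field(default_factory=dict)

@dataclass
class Heuristic:
    tier: str                 # e.g. "strategic", "procedural", or "tool"
    text: str                 # the distilled, reusable advice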

Literature Foundation

| Paper | Contribution to this framework |
|---|---|
| MUSE | 3-tier memory hierarchy (strategic/procedural/tool) |
| LATS | LLM-as-value-function V(s) pattern |
| REMEMBERER | Q-value experience replay with Bellman updates |
| Reflexion | Verbal reinforcement via episodic self-reflection |
| SPC | Anti-reward-hacking via adversarial critic patterns |
| CER | Contextual experience distillation (Dynamics + Skills) |
| MemRL | Two-phase retrieval (semantic recall → Q-value re-rank) |
| Voyager | Skill library as long-term memory |
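
The MemRL-style two-phase retrieval used by the Experience Replay module can be pictured with a small toy sketch; the names below are hypothetical and this is not the code in experience_replay.py.

from dataclasses import dataclass

@dataclass
class Experience:                 # hypothetical record, for illustration only
    embedding: list[float]        # semantic embedding of the stored situation
    q_value: float                # learned value of the stored trajectory

def cosine(a: list[float], b: list[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def retrieve(buffer: list[Experience], query: list[float], k_recall: int = 20, k_final: int = 5) -> list[Experience]:
    # Phase 1: semantic recall - keep the k_recall most similar past experiences.
    recalled = sorted(buffer, key=lambda e: cosine(query, e.embedding), reverse=True)[:k_recall]
    # Phase 2: Q-value re-rank - return the k_final highest-value ones among them.
    return sorted(recalled, key=lambda e: e.q_value, reverse=True)[:k_final]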

Quick Start

from purpose_agent import Orchestrator, State
from purpose_agent.llm_backend import HFInferenceBackend
from purpose_agent.orchestrator import Environment, Action

# 1. Define your environment
class MyEnv(Environment):
    def execute(self, action, current_state):
        # Your environment logic
        return State(data={...})

# 2. Create orchestrator with any LLM backend
orch = Orchestrator(
    llm=HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras"),
    environment=MyEnv(),
    available_actions={"search": "Search for items", "navigate": "Go somewhere"},
    persistence_dir="./agent_memory",
)

# 3. Run tasks: the agent self-improves across runs
result = orch.run_task(purpose="Find the answer to X", max_steps=20)
print(result.summary())
print(orch.get_heuristic_report())  # See what it learned
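
Because learned heuristics are persisted to persistence_dir, a fresh Orchestrator pointed at the same directory should pick up what earlier runs learned (same API as above; an illustrative continuation of the Quick Start):

# A later session: reuse the memory accumulated under ./agent_memory
orch2 = Orchestrator(
    llm=HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras"),
    environment=MyEnv(),
    available_actions={"search": "Search for items", "navigate": "Go somewhere"},
    persistence_dir="./agent_memory",   # same directory -> same 3-tier memory
)
result = orch2.run_task(purpose="Find the answer to Y", max_steps=20)
print(orch2.get_heuristic_report())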

Swapping LLM Backends

# HuggingFace Inference Providers (cheapest)
from purpose_agent.llm_backend import HFInferenceBackend
llm = HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras")

# OpenAI
from purpose_agent.llm_backend import OpenAICompatibleBackend
llm = OpenAICompatibleBackend(model="gpt-4o")

# Local Ollama
llm = OpenAICompatibleBackend(
    model="llama3.2",
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# Use DIFFERENT models for Actor vs Critic (recommended for production)
orch = Orchestrator(
    llm=cheap_fast_model,         # Actor: needs throughput
    critic_llm=strong_model,      # Purpose Function: needs accuracy
    optimizer_llm=cheap_fast_model,  # Runs infrequently
    environment=my_env,
)

Purpose Function: Anti-Reward-Hacking Design

The Purpose Function system prompt enforces 7 strict rules:

  1. EVIDENCE REQUIRED: every score must cite specific, observable state changes
  2. NO CREDIT FOR INTENTIONS: scores are based on the actual state, not the agent's predictions
  3. NO SYCOPHANCY: lateral moves get Δ = 0.0, regressions get negative Δ
  4. MONOTONIC SCALE: Φ runs 0.0–10.0, proportional to progress
  5. ANTI-GAMING: superficial state manipulation is flagged and penalized
  6. CONSISTENCY: identical states must receive identical Φ scores (cache-enforced)
  7. CONFIDENCE: ambiguous evaluations get a reduced delta magnitude

Additional programmatic safeguards (see the sketch after this list):

  • Score caching prevents inconsistent evaluations
  • Anomaly detection flags suspiciously large single-step jumps
  • Confidence threshold filters uncertain scores
  • Z-score normalization prevents score inflation over long trajectories
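
A rough sketch of how these safeguards can fit together (illustrative only; the class name, thresholds, and raw_phi callable are assumptions, not the code in purpose_function.py):

class SafeguardedCritic:
    def __init__(self, raw_phi, max_jump=3.0, min_confidence=0.6):
        self.raw_phi = raw_phi              # callable: state_key -> (phi, confidence)
        self.cache = {}                     # identical states -> identical scores
        self.max_jump = max_jump            # largest believable single-step jump
        self.min_confidence = min_confidence

    def score(self, state_key, prev_phi=None):
        if state_key in self.cache:                          # consistency via caching
            return self.cache[state_key]
        phi, confidence = self.raw_phi(state_key)
        if prev_phi is not None:
            if confidence < self.min_confidence:             # shrink uncertain deltas
                phi = prev_phi + 0.5 * (phi - prev_phi)
            if abs(phi - prev_phi) > self.max_jump:          # flag and clamp anomalous jumps
                phi = prev_phi
        self.cache[state_key] = phi
        return phi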

3-Tier Memory System

Based on MUSE (arxiv:2510.08002):

| Tier | Content | Loading | Update Trigger |
|---|---|---|---|
| Strategic | <Dilemma, Strategy> pairs | Always in system prompt | After each task |
| Procedural | Step-by-step SOPs | Index in prompt, details on demand | After high-reward trajectory |
| Tool | Per-action tips | Returned per step | When new patterns prove effective |
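
A sketch of the loading policy from the table above (function and argument names are hypothetical, not those in actor.py):

def build_system_prompt(base_prompt, strategic_pairs, procedural_index):
    # Strategic tier: <dilemma, strategy> pairs are always included in full.
    lines = [base_prompt, "", "Strategic memory:"]
    lines += [f"- When facing '{dilemma}': {strategy}" for dilemma, strategy in strategic_pairs]
    # Procedural tier: only an index of SOP names; details are fetched on demand.
    lines += ["", "Available procedures (ask for details by name):"]
    lines += [f"- {name}" for name in procedural_index]
    return "\n".join(lines)

def tool_tips_for(action_name, tool_memory):
    # Tool tier: per-action tips are returned step by step, for the chosen action only.
    return tool_memory.get(action_name, [])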

Running the Demo

python demo.py

Runs 17 unit tests + a full end-to-end demo with a simulated TreasureMaze environment. No API keys needed; uses MockLLMBackend.

Dependencies

  • Core framework: Python 3.10+ (stdlib only)
  • HF backend: huggingface_hub
  • OpenAI backend: openai
  • Production embeddings: sentence-transformers (optional, for better retrieval)

License

MIT