---
library_name: purpose-agent
license: mit
language:
- en
tags:
- reinforcement-learning
- agents
- self-improving
- experience-replay
- llm-as-judge
- state-value-evaluation
- memory-augmented
- react
- orchestration
- modular
pipeline_tag: text-generation
---

# Purpose Agent: Self-Improving Agentic Framework via State-Value Evaluation

A lightweight, modular framework where an LLM agent improves across tasks **without weight updates**, using an RL-inspired self-reflection loop with a "Purpose Function" that evaluates intermediate state improvements.

## Core Philosophy

The agent improves via a **Purpose Function Φ(s)** that measures distance-to-goal at every step. A step is rewarded **only if Φ(s_new) > Φ(s_current)**. High-reward trajectories are distilled into reusable heuristics stored in a 3-tier memory system, so the agent gets smarter on each subsequent task.

**No real-time backprop. No PPO/DPO. Minimal infrastructure costs.**
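A minimal runnable sketch of this gating rule; in the framework the Φ scores come from an LLM critic, and `step_reward` here is a hypothetical helper, not part of the API:

```python
def step_reward(phi_current: float, phi_new: float) -> float:
    """Positive reward only when Phi strictly improves; otherwise zero."""
    delta = phi_new - phi_current
    return delta if delta > 0 else 0.0

print(step_reward(3.0, 5.5))  # 2.5 -> real progress is rewarded
print(step_reward(5.5, 5.5))  # 0.0 -> a lateral move earns nothing
print(step_reward(5.5, 4.0))  # 0.0 -> a regression earns nothing
```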
## Architecture

```
┌───────────────────────────────────────────────────────────────┐
│                       ORCHESTRATOR LOOP                       │
│                                                               │
│   ┌──────────┐  action   ┌─────────────┐   s_new              │
│   │  ACTOR   │ ────────► │ ENVIRONMENT │ ─────────────┐       │
│   │ (+memory)│           │ (your code) │              │       │
│   └────▲─────┘           └─────────────┘              │       │
│        │                                              ▼       │
│        │ heuristics   ┌──────────────────┐       (s, a, s')   │
│        ├───────────── │    OPTIMIZER     │ ◄──────────┤       │
│        │              │  (distillation)  │            │       │
│        │              └──────────────────┘            │       │
│        │              ┌──────────────────┐ Φ(s)→Φ(s') │       │
│        │              │    PURPOSE FN    │ ◄──────────┤       │
│        │              │  (state critic)  │            │       │
│        │              └──────────────────┘            │       │
│        │              ┌──────────────────┐            │       │
│        └───────────── │    EXPERIENCE    │ ◄──────────┘       │
│                       │  REPLAY BUFFER   │                    │
│                       └──────────────────┘                    │
└───────────────────────────────────────────────────────────────┘
```
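Below is a compact, self-contained sketch of that loop; every name and the toy Φ heuristic are illustrative stand-ins, not the real module APIs:

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: str
    action: str
    new_state: str
    delta: float  # Phi(s') - Phi(s)

def actor(state: str, heuristics: list[str]) -> str:
    """ACTOR (+memory): pick an action, preferring learned heuristics."""
    return heuristics[-1] if heuristics else "explore"

def environment(action: str, state: str) -> str:
    """ENVIRONMENT (your code): apply the action, return the new state."""
    return f"{state}|{action}"

def purpose_fn(s: str, s_new: str) -> float:
    """PURPOSE FN (state critic): toy Phi where a longer state means progress."""
    return float(len(s_new) - len(s))

replay: list[Transition] = []   # EXPERIENCE REPLAY BUFFER
heuristics: list[str] = []
state = "start"
for _ in range(3):
    a = actor(state, heuristics)
    s_new = environment(a, state)
    replay.append(Transition(state, a, s_new, purpose_fn(state, s_new)))
    state = s_new

# OPTIMIZER (distillation): keep actions from positive-delta transitions
heuristics += [t.action for t in replay if t.delta > 0]
print(heuristics)
```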
## Modules

| Module | File | Role |
|--------|------|------|
| **Actor** | `actor.py` | ReAct-style agent with 3-tier memory-augmented prompts |
| **Purpose Function** | `purpose_function.py` | Strict, non-hackable LLM critic that scores Φ(s) transitions |
| **Experience Replay** | `experience_replay.py` | Trajectory storage with two-phase retrieval (similarity + Q-value; sketched below) |
| **Optimizer** | `optimizer.py` | Distills winning trajectories into reusable heuristics |
| **Orchestrator** | `orchestrator.py` | Main loop tying everything together |
| **LLM Backend** | `llm_backend.py` | Swappable inference layer (HF, OpenAI, Ollama, custom) |
| **Types** | `types.py` | Shared data structures (State, Action, Trajectory, Heuristic, etc.) |
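The two-phase retrieval noted in the **Experience Replay** row might look roughly like this sketch, with word overlap standing in for embedding similarity and all names hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Experience:
    task: str
    trajectory: list[str]
    q_value: float

def retrieve(query: str, buffer: list[Experience],
             k_recall: int = 20, k_final: int = 3) -> list[Experience]:
    # Phase 1: semantic recall (toy similarity: word overlap instead of embeddings)
    def sim(e: Experience) -> float:
        q, t = set(query.lower().split()), set(e.task.lower().split())
        return len(q & t) / max(len(q | t), 1)
    recalled = sorted(buffer, key=sim, reverse=True)[:k_recall]
    # Phase 2: re-rank the recalled candidates by their learned Q-value
    return sorted(recalled, key=lambda e: e.q_value, reverse=True)[:k_final]
```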
## Literature Foundation

| Paper | Contribution to this framework |
|-------|-------------------------------|
| [MUSE](https://arxiv.org/abs/2510.08002) | 3-tier memory hierarchy (strategic/procedural/tool) |
| [LATS](https://arxiv.org/abs/2310.04406) | LLM-as-value-function V(s) pattern |
| [REMEMBERER](https://arxiv.org/abs/2306.07929) | Q-value experience replay with Bellman updates |
| [Reflexion](https://arxiv.org/abs/2303.11366) | Verbal reinforcement via episodic self-reflection |
| [SPC](https://arxiv.org/abs/2504.19162) | Anti-reward-hacking via adversarial critic patterns |
| [CER](https://arxiv.org/abs/2506.06698) | Contextual experience distillation (Dynamics + Skills) |
| [MemRL](https://arxiv.org/abs/2601.03192) | Two-phase retrieval (semantic recall → Q-value re-rank) |
| [Voyager](https://arxiv.org/abs/2305.16291) | Skill library as long-term memory |
## Quick Start

```python
from purpose_agent import Orchestrator, State
from purpose_agent.llm_backend import HFInferenceBackend
from purpose_agent.orchestrator import Environment, Action

# 1. Define your environment
class MyEnv(Environment):
    def execute(self, action, current_state):
        # Your environment logic
        return State(data={...})

# 2. Create orchestrator with any LLM backend
orch = Orchestrator(
    llm=HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras"),
    environment=MyEnv(),
    available_actions={"search": "Search for items", "navigate": "Go somewhere"},
    persistence_dir="./agent_memory",
)

# 3. Run tasks; the agent self-improves across runs
result = orch.run_task(purpose="Find the answer to X", max_steps=20)
print(result.summary())
print(orch.get_heuristic_report())  # See what it learned
```
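For a slightly more concrete `execute`, here is a toy environment. It assumes `State` wraps a dict payload and that the passed-in action exposes a `name` attribute; both are assumptions about the API, not documented behavior:

```python
class CounterEnv(Environment):
    """Toy environment whose goal is to reach a target count."""
    def execute(self, action, current_state):
        count = current_state.data.get("count", 0)
        if action.name == "increment":  # assumed Action attribute
            count += 1
        return State(data={"count": count})
```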
## Swapping LLM Backends

```python
# HuggingFace Inference Providers (cheapest)
from purpose_agent.llm_backend import HFInferenceBackend
llm = HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras")

# OpenAI
from purpose_agent.llm_backend import OpenAICompatibleBackend
llm = OpenAICompatibleBackend(model="gpt-4o")

# Local Ollama
llm = OpenAICompatibleBackend(
    model="llama3.2",
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# Use DIFFERENT models for Actor vs Critic (recommended for production)
orch = Orchestrator(
    llm=cheap_fast_model,            # Actor: needs throughput
    critic_llm=strong_model,         # Purpose Function: needs accuracy
    optimizer_llm=cheap_fast_model,  # Runs infrequently
    environment=my_env,
)
```
## Purpose Function: Anti-Reward-Hacking Design

The Purpose Function system prompt enforces 7 strict rules (a condensed illustration follows the list):

1. **EVIDENCE REQUIRED**: Every score must cite specific observable state changes
2. **NO CREDIT FOR INTENTIONS**: Scores are based on the actual state, not the agent's predictions
3. **NO SYCOPHANCY**: Lateral moves get Δ=0.0, regressions get negative Δ
4. **MONOTONIC SCALE**: Φ runs from 0.0 to 10.0, proportional to progress
5. **ANTI-GAMING**: Superficial state manipulation is flagged and penalized
6. **CONSISTENCY**: Identical states must receive identical Φ scores (cache-enforced)
7. **CONFIDENCE**: Ambiguous evaluations get reduced delta magnitude
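A condensed illustration of what such a prompt can look like, paraphrased from the rules above; this is not the verbatim prompt shipped in `purpose_function.py`:

```python
CRITIC_SYSTEM_PROMPT = """\
You are a strict state critic. Score Phi(state) on a 0.0-10.0 scale.
1. Cite specific, observable state changes as evidence for every score.
2. Score the actual state, never the agent's stated intentions.
3. Lateral moves get delta = 0.0; regressions get negative delta.
4. Keep Phi proportional to real progress toward the stated purpose.
5. Penalize superficial state manipulation aimed at inflating Phi.
6. Identical states must receive identical scores.
7. When uncertain, report low confidence and shrink the delta magnitude.
"""
```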
Additional programmatic safeguards (sketched after the list):

- Score caching prevents inconsistent evaluations
- Anomaly detection flags suspiciously large single-step jumps
- A confidence threshold filters uncertain scores
- Z-score normalization prevents score inflation over long trajectories
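A minimal sketch of how these checks can compose, assuming the critic returns a (Φ, confidence) pair; the class and method names here are hypothetical, and the real checks live in `purpose_function.py`:

```python
import statistics

class SafeguardedCritic:
    """Wraps a raw critic callable with the safeguards listed above."""

    def __init__(self, critic, max_jump: float = 4.0, min_confidence: float = 0.6):
        self.critic = critic                # callable: state -> (phi, confidence)
        self.cache: dict[str, float] = {}   # score caching: same state, same Phi
        self.max_jump = max_jump
        self.min_confidence = min_confidence

    def score(self, state: str, prev_phi: float) -> float:
        if state in self.cache:
            return self.cache[state]
        phi, confidence = self.critic(state)
        if confidence < self.min_confidence:  # confidence threshold: shrink the delta
            phi = prev_phi + 0.5 * (phi - prev_phi)
        jump = phi - prev_phi
        if abs(jump) > self.max_jump:         # anomaly detection: clamp huge jumps
            phi = prev_phi + self.max_jump * (1 if jump > 0 else -1)
        self.cache[state] = phi
        return phi

def z_normalize(phis: list[float]) -> list[float]:
    """Z-score normalization against inflation over long trajectories."""
    mu = statistics.mean(phis)
    sigma = statistics.pstdev(phis) or 1.0
    return [(p - mu) / sigma for p in phis]
```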
## 3-Tier Memory System

Based on MUSE (arXiv:2510.08002):

| Tier | Content | Loading | Update Trigger |
|------|---------|---------|----------------|
| **Strategic** | `<Dilemma, Strategy>` pairs | Always in system prompt | After each task |
| **Procedural** | Step-by-step SOPs | Index in prompt, details on demand | After high-reward trajectory |
| **Tool** | Per-action tips | Returned per step | When new patterns prove effective |
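A rough data-model sketch of these tiers; the field and class names are illustrative, while the real structures live in `types.py` and `actor.py`:

```python
from dataclasses import dataclass, field

@dataclass
class StrategicEntry:
    dilemma: str      # <Dilemma, Strategy> pair, always in the system prompt
    strategy: str

@dataclass
class ProceduralSOP:
    title: str        # only the title (index) is kept in-prompt
    steps: list[str]  # full steps are loaded on demand

@dataclass
class ToolTip:
    action: str       # tip surfaced when this action is taken
    tip: str

@dataclass
class ThreeTierMemory:
    strategic: list[StrategicEntry] = field(default_factory=list)
    procedural: list[ProceduralSOP] = field(default_factory=list)
    tool: list[ToolTip] = field(default_factory=list)

    def system_prompt_block(self) -> str:
        """Strategic pairs in full, procedural SOPs as an index only."""
        lines = [f"- If {e.dilemma}: {e.strategy}" for e in self.strategic]
        lines += [f"- SOP available: {s.title}" for s in self.procedural]
        return "\n".join(lines)

    def tips_for(self, action: str) -> list[str]:
        """Tool tips are returned per step, for the action being taken."""
        return [t.tip for t in self.tool if t.action == action]
```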
## Running the Demo

```bash
python demo.py
```

This runs 17 unit tests plus a full end-to-end demo with a simulated TreasureMaze environment. No API keys are needed; it uses MockLLMBackend.
## Dependencies

- **Core framework**: Python 3.10+ (stdlib only)
- **HF backend**: `huggingface_hub`
- **OpenAI backend**: `openai`
- **Production embeddings**: `sentence-transformers` (optional, for better retrieval)

## License

MIT