# Purpose Agent — Architecture Documentation

> For developers building on the framework, researchers understanding the theory, and anyone curious about how self-improving agents work.

---

## Table of Contents

1. [What Is Purpose Agent?](#1-what-is-purpose-agent)
2. [The Big Idea (No Jargon)](#2-the-big-idea)
3. [How It Works — Step by Step](#3-how-it-works)
4. [Architecture Map](#4-architecture-map)
5. [The Core Engine](#5-the-core-engine)
6. [The V2 Safety Kernel](#6-the-v2-safety-kernel)
7. [Research Implementations](#7-research-implementations)
8. [Breakthroughs](#8-breakthroughs)
9. [User-Facing Layers](#9-user-facing-layers)
10. [How Models Are Handled](#10-how-models-are-handled)
11. [The Research Behind It](#11-the-research)
12. [For Contributors](#12-for-contributors)

---

## 1. What Is Purpose Agent?

Purpose Agent is a Python framework that builds AI agents that **get better with experience** — without retraining the underlying AI model.

Traditional AI agents run the same way every time. Purpose Agent is different: after each task, it extracts lessons from what worked and what didn't, tests those lessons for safety, and uses them to perform better next time.

**Think of it like this:** A new employee follows the company handbook. After their first week, they have personal notes — shortcuts they discovered, mistakes they won't repeat, tips from colleagues. Those notes make them better at their job without changing who they are. Purpose Agent does this for AI.

---

## 2. The Big Idea

### For Non-Technical Readers

```
You give it a purpose → It builds a team → It does the work → It learns → Next time is better
```

**You say:** "Help me write Python code."

**It builds:** An architect (plans), a coder (writes), and a tester (reviews).

**It runs:** The coder writes fibonacci. The tester checks it. A critic scores the work.

**It learns:** "When writing recursive functions, check base cases first." This lesson is saved.

**Next time:** The coder starts by checking base cases. It's faster and more reliable.

### For Technical Readers

The framework implements a **Purpose-MDP** — a Markov Decision Process where:

- A **Purpose Function Φ(s)** evaluates every state transition on a 0-10 scale
- An **Optimizer** distills successful trajectories into reusable heuristics
- Heuristics are ranked by **Q-values** (how often they helped) and selected via **Mixture-of-Heuristics** (sparse activation, like MoE)
- An **immune system** scans every new heuristic for prompt injection, score manipulation, and other threats
- A **Memory CI pipeline** quarantines, tests, and promotes heuristics before they affect agent behavior

This is **Potential-Based Reward Shaping** (Ng et al., 1999) applied to LLM agents, with formal convergence guarantees. See [PURPOSE_LEARNING.md](PURPOSE_LEARNING.md).
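To make the shaping term concrete, here is a minimal sketch of the ΔΦ computation. The function name `shaped_reward` is illustrative, not the framework's actual API; `purpose_function.py` adds the anti-gaming rules on top of this core idea:

```python
# Minimal sketch of potential-based reward shaping (Ng et al., 1999).
# `shaped_reward` is an illustrative name, not the real purpose_function.py API.

def shaped_reward(phi_before: float, phi_after: float, gamma: float = 1.0) -> float:
    """Dense per-step signal: how much closer did this step move us to the purpose?

    Because the signal is a difference of potentials, shaping with it
    provably leaves the optimal policy unchanged (Ng et al., 1999).
    """
    return gamma * phi_after - phi_before

# The fibonacci example from Section 3: tests flip from 0/4 to 4/4,
# so Φ jumps from 0.0 to 10.0 and the step earns a +10.0 delta.
delta = shaped_reward(phi_before=0.0, phi_after=10.0)  # -> 10.0
```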
---

## 3. How It Works — Step by Step

Here's what happens when you run `team.run("Write a fibonacci function")`:

### Step 1: The Actor Decides

The Actor module receives:

- The **purpose** ("Write a fibonacci function")
- The **current state** (empty — no code written yet)
- Any **learned heuristics** from past runs

It generates a thought process and an action:

> "I should write a function that handles base cases fib(0)=0 and fib(1)=1, then use iteration for the general case."
> → Action: `submit_code` with the Python implementation.

### Step 2: The Environment Executes

The code is run against test cases. The environment returns a new state:

> "Tests: 4/4 ALL PASSED"

### Step 3: The Purpose Function Scores

A **separate LLM call** (not the same as the actor) evaluates the transition:

- Φ(state_before) = 0.0 (nothing done)
- Φ(state_after) = 10.0 (all tests pass)
- Delta = +10.0 (huge improvement)
- Evidence: "Tests changed from 0/4 to 4/4"

The Purpose Function has **7 anti-gaming rules** that prevent the agent from tricking itself into thinking it's doing well when it isn't.

### Step 4: The Optimizer Extracts Heuristics

After the task, the Optimizer looks at the trajectory and extracts reusable patterns:

- **Strategic:** "When writing {function_type} functions, handle edge cases first, then iterate."
- **Procedural:** "1. Read test cases. 2. Handle base cases. 3. Implement general case. 4. Submit."
- **Tool tip:** "When submitting code, check boundary conditions: 0, 1, empty, negative."

### Step 5: Safety Checks

Every new heuristic goes through the **immune system**:

- Is it a prompt injection? ("Ignore all previous instructions") → **REJECTED**
- Does it try to manipulate scores? ("Always score 10") → **REJECTED**
- Does it contain secrets? (API keys, passwords) → **REJECTED**
- Is it safe? ("Check base cases first") → **QUARANTINED** (pending replay test)

After passing replay testing → **PROMOTED** (active in future runs).

### Step 6: Next Run Benefits

When the agent runs again, the **Prompt Compiler** selects the top-K heuristics by:

- **Relevance** to the current task (embedding similarity)
- **Trust** (immune-scanned and verified)
- **Utility** (Q-value — how often it helped before)

These are injected into the prompt. The agent is now better without any model retraining (a selection sketch follows below).
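A minimal sketch of that top-K selection, using the blended weighting that Section 6's Prompt Compiler describes. The `Heuristic` dataclass and its field names are illustrative stand-ins, not the real `compiler.py` types:

```python
# Sketch of top-K heuristic selection. Weights mirror the compiler
# formula in Section 6; the class shape is illustrative.
from dataclasses import dataclass

@dataclass
class Heuristic:
    text: str
    relevance: float  # embedding similarity to the current task, 0..1
    trust: float      # immune-scan / promotion status, 0..1
    utility: float    # normalized Q-value from past retrievals, 0..1

def select_top_k(heuristics: list[Heuristic], k: int = 5) -> list[Heuristic]:
    """Rank by the blended score and keep the K best for the prompt."""
    def score(h: Heuristic) -> float:
        return 0.4 * h.relevance + 0.3 * h.trust + 0.3 * h.utility
    return sorted(heuristics, key=score, reverse=True)[:k]
```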
---

## 4. Architecture Map

```
                         PURPOSE AGENT

┌── USER LAYER ────────────────────────────────────────────────┐
│  pa.purpose("...") → Team → team.run("...")                  │
│  pa.Agent()  pa.Graph()  pa.parallel()  pa.Conversation()    │
└──────────────────────────────┬───────────────────────────────┘
                               ▼
┌── CORE ENGINE ───────────────────────────────────────────────┐
│                                                              │
│   Actor ──→ Environment ──→ Purpose Function (Φ)             │
│     ↑            │                  │                        │
│     │            ▼                  ▼                        │
│     │        State s'          Φ(s) → Φ(s')                  │
│     │            │                  │                        │
│     │            ▼                  ▼                        │
│     │    Experience Replay      Optimizer                    │
│     │                               │                        │
│     └──── heuristics ◄──────────────┘                        │
│                                                              │
└──────────────────────────────┬───────────────────────────────┘
                               ▼
┌── V2 SAFETY KERNEL ──────────────────────────────────────────┐
│                                                              │
│  Immune System ──→ Memory CI ──→ Memory Store                │
│  (scan threats)    (quarantine)  (7 types × 5 statuses)      │
│                                                              │
│  Prompt Compiler ──→ Token Budget ──→ Credit Assignment      │
│  Trace System ──→ JSONL logs ──→ Offline analysis            │
│  RunMode ──→ EVAL_TEST blocks all writes                     │
│                                                              │
└──────────────────────────────┬───────────────────────────────┘
                               ▼
┌── INFRASTRUCTURE ────────────────────────────────────────────┐
│                                                              │
│  LLM Backends: OpenRouter │ Groq │ OpenAI │ Ollama │ HF │ ...│
│  Robust Parser: TOML → JSON → field extraction → regex       │
│  Tools: Calculator │ PythonExec │ ReadFile │ WriteFile       │
│  Streaming │ Observability │ Cost Tracking │ Registry        │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

---

## 5. The Core Engine

### Actor (`actor.py`)

The decision-maker. Given the current state and purpose, it decides what action to take.

**Key design:** The Actor doesn't evaluate itself. That's the Purpose Function's job. This separation prevents self-confirmation bias (you wouldn't let a student grade their own exam).

The Actor's prompt is **dynamically composed** from three tiers of memory:

- **Strategic:** High-level rules ("When coding, handle edge cases first")
- **Procedural:** Step-by-step procedures ("1. Read tests. 2. Handle bases. 3. Implement.")
- **Tool tips:** Action-specific advice ("When using submit_code, check boundaries")

### Purpose Function (`purpose_function.py`)

The critic. A separate LLM call that scores every state transition on a 0-10 scale.

**Seven anti-gaming rules:**

1. Evidence required — cite specific state changes
2. No credit for intentions — score actual results, not plans
3. No sycophancy — don't inflate scores to be encouraging
4. Monotonic scale — 0 = nothing done, 10 = task complete
5. Anti-gaming — flag superficial state manipulation
6. Consistency — same state gets same score (enforced by cache)
7. Confidence — uncertain evaluations get reduced weight

### Experience Replay (`experience_replay.py`)

Stores completed trajectories and retrieves relevant ones for future tasks.

**Two-phase retrieval** (from MemRL, arxiv:2601.03192):

1. **Recall:** Find trajectories similar to the current task (embedding similarity)
2. **Re-rank:** Order by Q-value utility (how useful was this memory when retrieved before?); see the sketch below
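A minimal sketch of the two-phase pass, assuming each stored trajectory is a dict carrying a precomputed `embedding` and a `q_value`. The names are illustrative, not the real `experience_replay.py` API:

```python
# Sketch of two-phase retrieval: recall by similarity, re-rank by utility.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(task_embedding: np.ndarray, trajectories: list[dict],
             recall_n: int = 20, top_k: int = 3) -> list[dict]:
    # Phase 1 (recall): keep the trajectories most similar to the task.
    recalled = sorted(
        trajectories,
        key=lambda t: cosine(task_embedding, t["embedding"]),
        reverse=True,
    )[:recall_n]
    # Phase 2 (re-rank): among those, prefer memories that actually
    # helped when they were retrieved in the past.
    return sorted(recalled, key=lambda t: t["q_value"], reverse=True)[:top_k]
```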
### Optimizer (`optimizer.py`)

Extracts reusable heuristics from successful trajectories. Uses the **CER distillation pattern** (arxiv:2506.06698): abstract away specific details with `{variable}` placeholders so heuristics generalize across tasks.

### Orchestrator (`orchestrator.py`)

The main loop that ties everything together. For each step:

1. Actor decides
2. Environment executes
3. Critic scores
4. Step recorded
5. Check termination

After each task: store trajectory → optimize → sync heuristics to Actor memory.

---

## 6. The V2 Safety Kernel

V1 let the agent learn freely. V2 adds guardrails.

### Memory System (`memory.py`)

Seven memory types, each with different trust priors:

| Type | Example | Trust |
|------|---------|-------|
| `purpose_contract` | "Build a web scraper" | High (user-defined) |
| `user_preference` | "Always cite sources" | High (human-taught) |
| `skill_card` | "When coding, test edges first" | Medium (learned) |
| `episodic_case` | "fib(0)=0 was a tricky case" | Medium (observed) |
| `failure_pattern` | "Don't use recursion for large n" | Medium (learned from failure) |
| `critic_calibration` | "Score 7 for 3/4 tests passing" | Low (meta-learned) |
| `tool_policy` | "search: only use at target location" | Medium (learned) |

Five statuses: `candidate` → `quarantined` → `promoted` (or `rejected`) → `archived`.

### Immune System (`immune.py`)

Scans every candidate memory for 5 threat categories:

- **Prompt injection** — "Ignore previous instructions..."
- **Score manipulation** — "Always score 10..."
- **Tool misuse** — "subprocess.call('rm -rf /')..."
- **Privacy leaks** — API keys, emails, file paths
- **Scope overreach** — memory tries to affect all agents when it should be scoped

### Memory CI (`memory_ci.py`)

The promotion pipeline:

```
candidate → immune_scan() → quarantined → replay_test → promote/reject
```

No memory reaches the agent's prompt without passing every gate.

### Prompt Compiler (`compiler.py`)

Selects which memories to include under a token budget. Ranked by:

`score = 0.4 × relevance + 0.3 × trust + 0.3 × utility`

Returns `included_memory_ids` for credit assignment — only memories that were in the prompt get Q-value updates after the step.

### Trace System (`trace.py`)

Every run produces a JSONL trace — the raw material for debugging, evaluation, and memory extraction. Traces are append-only and immutable.

### RunMode (`v2_types.py`)

Three modes with strict enforcement:

- `LEARNING_TRAIN` — full read/write
- `LEARNING_VALIDATION` — read + staging writes
- `EVAL_TEST` — **no writes of any kind** (the only mode whose numbers you can report; sketched below)
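A minimal sketch of how that enforcement can look. `MemoryStore` is a hypothetical stand-in here; `v2_types.py` defines the real `RunMode`:

```python
# Sketch of RunMode write-gating. MemoryStore is illustrative only.
from enum import Enum

class RunMode(Enum):
    LEARNING_TRAIN = "learning_train"
    LEARNING_VALIDATION = "learning_validation"
    EVAL_TEST = "eval_test"

class MemoryStore:
    def __init__(self, mode: RunMode):
        self.mode = mode
        self._items: list[dict] = []

    def write(self, item: dict, staging: bool = False) -> None:
        # EVAL_TEST must never mutate memory, so eval numbers stay honest.
        if self.mode is RunMode.EVAL_TEST:
            raise PermissionError("EVAL_TEST blocks all memory writes")
        # Validation runs may only stage writes for later review.
        if self.mode is RunMode.LEARNING_VALIDATION and not staging:
            raise PermissionError("LEARNING_VALIDATION allows staging writes only")
        self._items.append(item)
```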
---

## 7. Research Implementations

Five papers implemented as standalone modules:

### Meta-Rewarding (`meta_rewarding.py`)

*From: arxiv:2407.19594 — Llama-3-8B: 22.9% → 39.4% on AlpacaEval*

A meta-judge evaluates the Purpose Function's own judgments. Good judgments become calibration examples in memory. The critic improves through in-context learning.

### Self-Taught Evaluators (`self_taught.py`)

*From: arxiv:2408.02666*

Generates synthetic contrast pairs (correct vs. wrong evaluation) from traces. Creates an automatic curriculum: as the critic improves, the contrast pairs get harder.

### Prompt Optimizer (`prompt_optimizer.py`)

*From DSPy: arxiv:2310.03714 — +8% on GSM8K, +50% on BBH*

Instead of hand-crafting prompts, define signatures (`state, action → score, reasoning`) and let the optimizer bootstrap effective few-shot demonstrations by trial and error.

### LLM Compiler (`llm_compiler.py`)

*From: arxiv:2312.04511 — up to 3.7× latency speedup*

Instead of sequential tool calls (ReAct), plan ALL calls upfront as a DAG and execute independent ones in parallel.

### Retroformer (`retroformer.py`)

*From: arxiv:2308.02151*

Structured reflection on completed traces → extracts four types of memories (skills, failures, policies, observations). Replaces raw heuristic distillation with typed, safety-scanned memory extraction.

---

## 8. Breakthroughs

Six features that go beyond existing frameworks:

### B1: Self-Improving Critic

The Purpose Function's own quality improves over time. Meta-judging after each task generates calibration examples that make future scoring more accurate.

### B2: Mixture-of-Heuristics (MoH)

Like DeepSeek's Mixture-of-Experts: out of 100+ heuristics, only K=5 are activated per step. **Shared heuristics** (always active, like "check edge cases") + **routed heuristics** (task-specific, selected by Q×similarity). Knowledge grows; compute stays flat.

### B3: Hindsight Heuristic Relabeling

From HER (arxiv:1707.01495): when a task fails, instead of discarding the trajectory, ask "what DID this accomplish?" and extract heuristics for what was achieved. Learn from failures, not just successes.

### B4: Heuristic Evolution

Periodically generalize specific heuristics into abstract patterns:

- Before: "When fibonacci fails on 0, return 0"
- After: "When {function} fails on {boundary_value}, add an explicit base case"

Creates an automatic curriculum: specific → general → abstract.

### B5: Cross-Domain Transfer

Heuristics learned on one set of coding tasks can help with different, unseen coding tasks. The `test_cross_domain_transfer()` function measures this: train on [fibonacci, factorial], test on [palindrome, fizzbuzz].

### B6: Adversarial Robustness

The `AdversarialHardener` generates 30 adversarial inputs (prompt injections, score hacks, API key leaks) and 10 benign inputs, then tests the immune system against all of them. Current results: **93% catch rate, 0% false positives.**

---

## 9. User-Facing Layers

### Easy API (`easy.py`)

The `purpose()` function analyzes your description and builds the right team:

| You say | It builds |
|---------|-----------|
| "Write Python code" | architect + coder + tester |
| "Research papers" | researcher + analyst |
| "Write blog posts" | writer + editor |
| "Analyze data" | analyst + reporter |
| "Help me" | general assistant |

### Unified Capabilities (`unified.py`)

Five competing framework philosophies in one composable layer:

| Capability | Inspired By | Usage |
|-----------|-------------|-------|
| `Agent()` | OpenAI Agents SDK | One-liner agent creation |
| `Graph()` | LangGraph | Conditional branching, cycles, fan-out |
| `parallel()` | CrewAI | Concurrent task execution |
| `Conversation()` | AutoGen | Agent-to-agent message passing |
| `KnowledgeStore` | LlamaIndex | RAG as a tool |

### Robust Parser (`robust_parser.py`)

The universal solution to "LLMs can't reliably produce JSON":

- Tries TOML first (fewer tokens than JSON)
- Falls back to JSON
- Falls back to field extraction by regex
- Never crashes. Always returns something usable (see the sketch below).
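A minimal sketch of that fallback chain, assuming Python 3.11+ for the stdlib `tomllib`. The function and regex are illustrative, not the real `robust_parser.py`:

```python
# Sketch of the tiered parsing strategy: TOML → JSON → regex extraction.
import json
import re
import tomllib  # stdlib in Python 3.11+

def robust_parse(raw: str) -> dict:
    """Try the cheapest format first; never raise, always return a dict."""
    try:
        return tomllib.loads(raw)
    except tomllib.TOMLDecodeError:
        pass
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Last resort: pull key/value pairs out of whatever the model
    # produced, e.g. `score = "8"` or `"action": "submit_code"`.
    pairs = re.findall(r'["\']?(\w+)["\']?\s*[:=]\s*["\']([^"\']*)["\']', raw)
    return dict(pairs)
```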
---

## 10. How Models Are Handled

### resolve_backend()

One function routes to any provider:

```python
resolve_backend("openrouter:meta-llama/llama-3.3-70b-instruct")
resolve_backend("groq:llama-3.3-70b-versatile")
resolve_backend("openai:gpt-4o")
resolve_backend("ollama:qwen3:1.7b")  # Local, free
resolve_backend("hf:Qwen/Qwen3-32B")
resolve_backend("together:meta-llama/Llama-3.3-70B-Instruct-Turbo")
```

### SLM-Native Design

The framework was designed for small models (0.6B-3B params):

- **Grammar-constrained output** via Ollama (forces valid structure from any model)
- **Prompt compression** for small context windows (8K-32K)
- **Tool RAG** — only load relevant tools into the prompt (saves tokens)
- **TOML format** — fewer tokens than JSON

### _strip_thinking()

Handles reasoning models (Qwen3, DeepSeek-R1) that wrap output in `<think>` tags. Automatically strips the thinking and returns only the answer.

---

## 11. The Research

Every design decision traces to a published paper. The full list with citations, methodology sections, and implementation mapping is in [COMPILED_RESEARCH.md](COMPILED_RESEARCH.md).

The formal framework — **Purpose-MDP** with 5 axioms, 3 theorems, and convergence proofs — is in [PURPOSE_LEARNING.md](PURPOSE_LEARNING.md).

**Key theoretical result:** The self-improvement is a form of Potential-Based Reward Shaping (Ng et al., 1999). Our ΔΦ = Φ(s') - Φ(s) preserves the optimal policy while providing dense per-step feedback. The heuristic library converges to a fixed point under bounded capacity.

---

## 12. For Contributors

### File Structure

```
purpose_agent/
├── types.py             # State, Action, Trajectory, Heuristic, PurposeScore
├── llm_backend.py       # LLMBackend ABC + HF, OpenAI, Mock + resolve_backend
├── slm_backends.py      # Ollama, llama-cpp, prompt compression, SLM registry
├── robust_parser.py     # Universal parser: TOML → JSON → regex (never crashes)
├── actor.py             # ReAct agent with 3-tier memory prompts
├── purpose_function.py  # Φ(s) critic with 7 anti-gaming rules
├── experience_replay.py # Two-phase retrieval (similarity → Q-value)
├── optimizer.py         # Trajectory → heuristic distillation
├── orchestrator.py      # Main step loop
├── v2_types.py          # RunMode, MemoryScope, PurposeScoreV2
├── trace.py             # JSONL execution traces
├── memory.py            # 7 MemoryKinds × 5 MemoryStatuses
├── compiler.py          # Token-budgeted prompt compilation
├── immune.py            # 5 threat scanners
├── memory_ci.py         # Quarantine → scan → test → promote/reject
├── evalport.py          # Pluggable evaluation protocol
├── benchmark_v2.py      # Train/val/test splits with ablation
├── meta_rewarding.py    # Self-improving critic (arxiv:2407.19594)
├── self_taught.py       # Synthetic critic training (arxiv:2408.02666)
├── prompt_optimizer.py  # DSPy-style bootstrap (arxiv:2310.03714)
├── llm_compiler.py      # Parallel tool DAG (arxiv:2312.04511)
├── retroformer.py       # Structured reflection (arxiv:2308.02151)
├── breakthroughs.py     # MoH, hindsight relabeling, heuristic evolution, etc.
├── unified.py           # Agent, Graph, parallel, Conversation, KnowledgeStore
├── easy.py              # purpose(), Team, quickstart wizard
├── tools.py             # Secure built-in tools
├── streaming.py         # Async + event streaming
├── observability.py     # Cost tracking, callbacks
├── multi_agent.py       # Agent teams with shared learning
├── hitl.py              # Human-in-the-loop + checkpointing
├── evaluation.py        # V1 benchmark runner
├── registry.py          # Plugin system
├── __init__.py          # 103 exports
└── __main__.py          # CLI entry point
```

### Adding a New LLM Provider

```python
# In your code (no core edits needed):
from purpose_agent import backend_registry, OpenAICompatibleBackend

backend_registry.register("my_provider", lambda model, api_key: OpenAICompatibleBackend(
    model=model,
    base_url="https://api.myprovider.com/v1",
    api_key=api_key
))
```

### Adding a New Tool

```python
from purpose_agent import FunctionTool

def my_search(query: str) -> str:
    """Search my database."""
    return db.search(query)

tool = FunctionTool.from_function(my_search)
```

### Running Tests

```bash
python tests/test_core.py              # 21 unit tests
python tests/launch_readiness.py       # 119 comprehensive tests
python benchmarks/validate.py          # Mock benchmark suite
python benchmarks/validate.py --quick  # Fast smoke test
```