Purpose Agent: Architecture Documentation
For developers building on the framework, researchers understanding the theory, and anyone curious about how self-improving agents work.
Table of Contents
- What Is Purpose Agent?
- The Big Idea (No Jargon)
- How It Works: Step by Step
- Architecture Map
- The Core Engine
- The V2 Safety Kernel
- Research Implementations
- Breakthroughs
- User-Facing Layers
- How Models Are Handled
- The Research Behind It
- For Contributors
1. What Is Purpose Agent?
Purpose Agent is a Python framework that builds AI agents that get better with experience, without retraining the underlying AI model.
Traditional AI agents run the same way every time. Purpose Agent is different: after each task, it extracts lessons from what worked and what didn't, tests those lessons for safety, and uses them to perform better next time.
Think of it like this: A new employee follows the company handbook. After their first week, they have personal notes: shortcuts they discovered, mistakes they won't repeat, tips from colleagues. Those notes make them better at their job without changing who they are. Purpose Agent does this for AI.
2. The Big Idea
For Non-Technical Readers
You give it a purpose → It builds a team → It does the work → It learns → Next time is better
You say: "Help me write Python code." It builds: An architect (plans), a coder (writes), and a tester (reviews). It runs: The coder writes fibonacci. The tester checks it. A critic scores the work. It learns: "When writing recursive functions, check base cases first." This lesson is saved. Next time: The coder starts by checking base cases. It's faster and more reliable.
For Technical Readers
The framework implements a Purpose-MDP, a Markov Decision Process where:
- A Purpose Function Φ(s) evaluates every state transition on a 0-10 scale
- An Optimizer distills successful trajectories into reusable heuristics
- Heuristics are ranked by Q-values (how often they helped) and selected via Mixture-of-Heuristics (sparse activation, like MoE)
- An immune system scans every new heuristic for prompt injection, score manipulation, and other threats
- A Memory CI pipeline quarantines, tests, and promotes heuristics before they affect agent behavior
This is Potential-Based Reward Shaping (Ng et al., 1999) applied to LLM agents, with formal convergence guarantees. See PURPOSE_LEARNING.md.
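A minimal sketch of that shaping signal, using hypothetical names (phi, Transition) rather than the framework's actual API:

```python
# Minimal sketch of the Purpose-MDP shaping signal (hypothetical names,
# not the framework's API). The per-step signal is the potential
# difference Phi(s') - Phi(s), as in potential-based reward shaping.
from dataclasses import dataclass

@dataclass
class Transition:
    state_before: str
    state_after: str
    action: str

def phi(state: str) -> float:
    """Stand-in for the Purpose Function: 0 = nothing done, 10 = task complete."""
    return 10.0 if "ALL PASSED" in state else 0.0

def shaped_reward(t: Transition, gamma: float = 1.0) -> float:
    # Potential-based shaping leaves the optimal policy unchanged (Ng et al., 1999).
    return gamma * phi(t.state_after) - phi(t.state_before)

print(shaped_reward(Transition("Tests: 0/4", "Tests: 4/4 ALL PASSED", "submit_code")))  # 10.0
```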
3. How It Works: Step by Step
Here's what happens when you run team.run("Write a fibonacci function"):
Step 1: The Actor Decides
The Actor module receives:
- The purpose ("Write a fibonacci function")
- The current state (empty: no code written yet)
- Any learned heuristics from past runs
It generates a thought process and an action:
"I should write a function that handles base cases fib(0)=0 and fib(1)=1, then use iteration for the general case." β Action:
submit_codewith the Python implementation.
Step 2: The Environment Executes
The code is run against test cases. The environment returns a new state:
"Tests: 4/4 ALL PASSED"
Step 3: The Purpose Function Scores
A separate LLM call (not the same as the actor) evaluates the transition:
- Φ(state_before) = 0.0 (nothing done)
- Φ(state_after) = 10.0 (all tests pass)
- Delta = +10.0 (huge improvement)
- Evidence: "Tests changed from 0/4 to 4/4"
The Purpose Function has 7 anti-gaming rules that prevent the agent from tricking itself into thinking it's doing well when it isn't.
Step 4: The Optimizer Extracts Heuristics
After the task, the Optimizer looks at the trajectory and extracts reusable patterns:
- Strategic: "When writing {function_type} functions, handle edge cases first, then iterate."
- Procedural: "1. Read test cases. 2. Handle base cases. 3. Implement general case. 4. Submit."
- Tool tip: "When submitting code, check boundary conditions: 0, 1, empty, negative."
Step 5: Safety Checks
Every new heuristic goes through the immune system:
- Is it a prompt injection? ("Ignore all previous instructions") → REJECTED
- Does it try to manipulate scores? ("Always score 10") → REJECTED
- Does it contain secrets? (API keys, passwords) → REJECTED
- Is it safe? ("Check base cases first") → QUARANTINED (pending replay test)
After passing replay testing → PROMOTED (active in future runs).
Step 6: Next Run Benefits
When the agent runs again, the Prompt Compiler selects the top-K heuristics by:
- Relevance to the current task (embedding similarity)
- Trust (immune-scanned and verified)
- Utility (Q-value: how often it helped before)
These are injected into the prompt. The agent is now better without any model retraining.
4. Architecture Map
The system is organized in four layers. Control flows top-down; within the core engine, distilled heuristics flow from the Optimizer back to the Actor.

- USER LAYER: pa.purpose("...") → Team → team.run("..."); plus pa.Agent(), pa.Graph(), pa.parallel(), pa.Conversation()
- CORE ENGINE: Actor → Environment → Purpose Function (Φ). The critic compares Φ(s) → Φ(s'); transitions feed Experience Replay and the Optimizer, and distilled heuristics flow back to the Actor.
- V2 SAFETY KERNEL: Immune System (scan threats) → Memory CI (quarantine) → Memory Store (7 types × 5 statuses); Prompt Compiler → Token Budget → Credit Assignment; Trace System → JSONL logs → Offline analysis; RunMode (EVAL_TEST blocks all writes)
- INFRASTRUCTURE: LLM Backends (OpenRouter, Groq, OpenAI, Ollama, HF, ...); Robust Parser (TOML → JSON → field extraction → regex); Tools (Calculator, PythonExec, ReadFile, WriteFile); Streaming, Observability, Cost Tracking, Registry
5. The Core Engine
Actor (actor.py)
The decision-maker. Given the current state and purpose, it decides what action to take.
Key design: The Actor doesn't evaluate itself. That's the Purpose Function's job. This separation prevents self-confirmation bias (you wouldn't let a student grade their own exam).
The Actor's prompt is dynamically composed from three tiers of memory (sketched after this list):
- Strategic: High-level rules ("When coding, handle edge cases first")
- Procedural: Step-by-step procedures ("1. Read tests. 2. Handle bases. 3. Implement.")
- Tool tips: Action-specific advice ("When using submit_code, check boundaries")
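A rough sketch of how such a three-tier prompt could be assembled; the section labels and function name are illustrative, not actor.py's actual layout:

```python
# Illustrative only: composing an Actor prompt from the three memory tiers.
# Section labels and formatting are assumptions, not actor.py's exact layout.
def compose_actor_prompt(purpose: str, state: str,
                         strategic: list[str],
                         procedural: list[str],
                         tool_tips: list[str]) -> str:
    sections = [
        f"PURPOSE: {purpose}",
        f"CURRENT STATE: {state}",
        "STRATEGIC HEURISTICS:\n" + "\n".join(f"- {h}" for h in strategic),
        "PROCEDURES:\n" + "\n".join(f"- {h}" for h in procedural),
        "TOOL TIPS:\n" + "\n".join(f"- {h}" for h in tool_tips),
        "Decide the next action. Reply with your reasoning and the action.",
    ]
    return "\n\n".join(sections)
```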
Purpose Function (purpose_function.py)
The critic. A separate LLM call that scores every state transition on a 0-10 scale.
Seven anti-gaming rules:
- Evidence required: cite specific state changes
- No credit for intentions: score actual results, not plans
- No sycophancy: don't inflate scores to be encouraging
- Monotonic scale: 0 = nothing done, 10 = task complete
- Anti-gaming: flag superficial state manipulation
- Consistency: the same state gets the same score (enforced by a cache)
- Confidence: uncertain evaluations get reduced weight
Experience Replay (experience_replay.py)
Stores completed trajectories and retrieves relevant ones for future tasks.
Two-phase retrieval (from MemRL, arxiv:2601.03192), sketched after this list:
- Recall: Find trajectories similar to the current task (embedding similarity)
- Re-rank: Order by Q-value utility (how useful was this memory when retrieved before?)
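A sketch of the two-phase idea, assuming a plain-Python record with embedding and q_value fields (the real experience_replay.py differs in detail):

```python
# Sketch of two-phase retrieval: recall by similarity, then re-rank by Q-value.
# The record shape ({"embedding": ..., "q_value": ...}) is an assumption.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

def retrieve(task_embedding: list[float], trajectories: list[dict],
             k_recall: int = 20, k_final: int = 5) -> list[dict]:
    # Phase 1 (recall): nearest trajectories by embedding similarity.
    recalled = sorted(trajectories,
                      key=lambda t: cosine(task_embedding, t["embedding"]),
                      reverse=True)[:k_recall]
    # Phase 2 (re-rank): order the recalled set by learned utility.
    return sorted(recalled, key=lambda t: t["q_value"], reverse=True)[:k_final]
```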
Optimizer (optimizer.py)
Extracts reusable heuristics from successful trajectories.
Uses the CER distillation pattern (arxiv:2506.06698): abstract away specific details with {variable} placeholders so heuristics generalize across tasks.
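For illustration, a distilled heuristic might look like this; the field names and placeholder tokens are assumptions, not the optimizer's exact schema:

```python
# Illustration of CER-style distillation: task-specific tokens become
# {placeholders} so the heuristic transfers. Field names here are made up.
concrete = "When writing fibonacci, handle fib(0) and fib(1) before the loop."
distilled = {
    "text": "When writing {function_type} functions, handle {base_cases} before the general case.",
    "kind": "strategic",
    "q_value": 0.0,                    # updated by credit assignment after each use
    "source_trajectory": "run-0001",   # hypothetical trajectory id
}
```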
Orchestrator (orchestrator.py)
The main loop that ties everything together. For each step:
1. Actor decides → 2. Environment executes → 3. Critic scores → 4. Step recorded → 5. Check termination
After each task: store trajectory → optimize → sync heuristics to Actor memory.
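A skeleton of that loop; actor, env, critic, replay, and optimizer are hypothetical stand-ins for the real modules, shown only to make the control flow concrete:

```python
# Skeleton of the loop above; all interfaces are illustrative stand-ins.
def run_task(actor, env, critic, replay, optimizer, purpose, max_steps=10):
    state, steps = env.reset(purpose), []
    for _ in range(max_steps):
        action = actor.decide(purpose, state)             # 1. Actor decides
        next_state, done = env.execute(action)            # 2. Environment executes
        score = critic.score(state, next_state, action)   # 3. Critic scores
        steps.append((state, action, next_state, score))  # 4. Step recorded
        state = next_state
        if done:                                          # 5. Check termination
            break
    replay.store(purpose, steps)           # store trajectory
    heuristics = optimizer.distill(steps)  # optimize
    actor.memory.add(heuristics)           # sync heuristics to Actor memory
    return state
```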
6. The V2 Safety Kernel
V1 let the agent learn freely. V2 adds guardrails.
Memory System (memory.py)
Seven memory types, each with different trust priors:
| Type | Example | Trust |
|---|---|---|
| `purpose_contract` | "Build a web scraper" | High (user-defined) |
| `user_preference` | "Always cite sources" | High (human-taught) |
| `skill_card` | "When coding, test edges first" | Medium (learned) |
| `episodic_case` | "fib(0)=0 was a tricky case" | Medium (observed) |
| `failure_pattern` | "Don't use recursion for large n" | Medium (learned from failure) |
| `critic_calibration` | "Score 7 for 3/4 tests passing" | Low (meta-learned) |
| `tool_policy` | "search: only use at target location" | Medium (learned) |
Five statuses: candidate → quarantined → promoted (or rejected) → archived.
Immune System (immune.py)
Scans every candidate memory for 5 threat categories (a scanning sketch follows the list):
- Prompt injection: "Ignore previous instructions..."
- Score manipulation: "Always score 10..."
- Tool misuse: "subprocess.call('rm -rf /')..."
- Privacy leaks: API keys, emails, file paths
- Scope overreach: a memory tries to affect all agents when it should be scoped
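A pattern-based scan over these five categories might look like this; the regexes and return shape are illustrative, not immune.py's actual rules:

```python
# Sketch of a pattern scan over the five threat categories (illustrative rules).
import re

THREAT_PATTERNS = {
    "prompt_injection": r"ignore (all )?previous instructions",
    "score_manipulation": r"always score \d+",
    "tool_misuse": r"rm -rf|subprocess\.call",
    "privacy_leak": r"sk-[A-Za-z0-9]{20,}|[\w.+-]+@[\w-]+\.\w+",
    "scope_overreach": r"\ball agents\b",
}

def scan(candidate_text: str) -> list[str]:
    """Return the names of any threat categories the candidate memory trips."""
    return [name for name, pattern in THREAT_PATTERNS.items()
            if re.search(pattern, candidate_text, re.IGNORECASE)]
```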
Memory CI (memory_ci.py)
The promotion pipeline:
candidate → immune_scan() → quarantined → replay_test → promote/reject
No memory reaches the agent's prompt without passing every gate.
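A minimal sketch of that pipeline; `immune_scan` and `replay_test` are injected callables, and the status strings follow the memory.py vocabulary:

```python
# Minimal sketch of the promotion pipeline (logic illustrative).
def run_memory_ci(candidate: dict, immune_scan, replay_test) -> dict:
    if immune_scan(candidate["text"]):        # any threat -> reject immediately
        candidate["status"] = "rejected"
        return candidate
    candidate["status"] = "quarantined"       # clean scan -> quarantine
    passed = replay_test(candidate)           # replay against held-out traces
    candidate["status"] = "promoted" if passed else "rejected"
    return candidate
```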
Prompt Compiler (compiler.py)
Selects which memories to include under a token budget. Ranked by:
score = 0.4 × relevance + 0.3 × trust + 0.3 × utility
Returns included_memory_ids for credit assignment: only memories that were in the prompt get Q-value updates after the step.
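A sketch of token-budgeted selection using the stated weights; the memory record fields (trust, utility, tokens, id) and the greedy packing are assumptions:

```python
# Sketch of token-budgeted prompt compilation with the stated ranking weights.
def compile_prompt(memories: list[dict], relevance, token_budget: int):
    def rank(m: dict) -> float:
        return 0.4 * relevance(m) + 0.3 * m["trust"] + 0.3 * m["utility"]
    included, used = [], 0
    for m in sorted(memories, key=rank, reverse=True):
        if used + m["tokens"] <= token_budget:
            included.append(m)
            used += m["tokens"]
    # Only these ids receive Q-value updates after the step (credit assignment).
    return included, [m["id"] for m in included]
```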
Trace System (trace.py)
Every run produces a JSONL trace β the raw material for debugging, evaluation, and memory extraction. Traces are append-only and immutable.
RunMode (v2_types.py)
Three modes with strict enforcement (a guard sketch follows the list):
- LEARNING_TRAIN: full read/write
- LEARNING_VALIDATION: read + staging writes
- EVAL_TEST: no writes of any kind (the only mode whose numbers you can report)
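A sketch of how the enforcement could look; the enum members mirror the doc, while the guard logic is illustrative:

```python
# Sketch of RunMode enforcement (guard logic is an assumption).
from enum import Enum

class RunMode(Enum):
    LEARNING_TRAIN = "learning_train"
    LEARNING_VALIDATION = "learning_validation"
    EVAL_TEST = "eval_test"

def guard_write(mode: RunMode, staging: bool = False) -> None:
    if mode is RunMode.EVAL_TEST:
        raise PermissionError("EVAL_TEST blocks all memory writes")
    if mode is RunMode.LEARNING_VALIDATION and not staging:
        raise PermissionError("LEARNING_VALIDATION allows staging writes only")
```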
7. Research Implementations
Five papers implemented as standalone modules:
Meta-Rewarding (meta_rewarding.py)
From: arxiv:2407.19594 (Llama-3-8B: 22.9% → 39.4% on AlpacaEval)
A meta-judge evaluates the Purpose Function's own judgments. Good judgments become calibration examples in memory. The critic improves through in-context learning.
Self-Taught Evaluators (self_taught.py)
From: arxiv:2408.02666
Generates synthetic contrast pairs (correct vs wrong evaluation) from traces. Creates an automatic curriculum: as the critic improves, the contrast pairs get harder.
Prompt Optimizer (prompt_optimizer.py)
From DSPy: arxiv:2310.03714 (+8% on GSM8K, +50% on BBH)
Instead of hand-crafting prompts, define signatures (state, action → score, reasoning) and let the optimizer bootstrap effective few-shot demonstrations by trial and error.
LLM Compiler (llm_compiler.py)
From: arxiv:2312.04511 (up to 3.7× latency speedup)
Instead of sequential tool calls (ReAct), plan ALL calls upfront as a DAG and execute independent ones in parallel.
Retroformer (retroformer.py)
From: arxiv:2308.02151
Structured reflection on completed traces: extracts four types of memories (skills, failures, policies, observations). Replaces raw heuristic distillation with typed, safety-scanned memory extraction.
8. Breakthroughs
Six features that go beyond existing frameworks:
B1: Self-Improving Critic
The Purpose Function's own quality improves over time. Meta-judging after each task generates calibration examples that make future scoring more accurate.
B2: Mixture-of-Heuristics (MoH)
Like DeepSeek's Mixture-of-Experts: out of 100+ heuristics, only K=5 are activated per step. Shared heuristics (always active, like "check edge cases") + routed heuristics (task-specific, selected by Q × similarity). Knowledge grows; compute stays flat.
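A sketch of the sparse selection, assuming each routed heuristic carries an embedding and a Q-value (record fields are assumptions):

```python
# Sketch of sparse activation: shared heuristics are always included, routed
# heuristics are picked by Q-value x similarity.
def select_heuristics(task_embedding, shared: list[dict], routed: list[dict],
                      similarity, k: int = 5) -> list[dict]:
    ranked = sorted(routed,
                    key=lambda h: h["q_value"] * similarity(task_embedding, h["embedding"]),
                    reverse=True)
    return shared + ranked[:k]   # compute stays flat as the library grows
```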
B3: Hindsight Heuristic Relabeling
From HER (arxiv:1707.01495): when a task fails, instead of discarding the trajectory, ask "what DID this accomplish?" and extract heuristics for what was achieved. Learn from failures, not just successes.
B4: Heuristic Evolution
Periodically generalize specific heuristics into abstract patterns:
- Before: "When fibonacci fails on 0, return 0"
- After: "When {function} fails on {boundary_value}, add an explicit base case"
Creates an automatic curriculum: specific → general → abstract.
B5: Cross-Domain Transfer
Heuristics learned on one set of coding tasks can help with different coding tasks. The test_cross_domain_transfer() function measures this: train on [fibonacci, factorial], test on [palindrome, fizzbuzz].
B6: Adversarial Robustness
The AdversarialHardener generates 30 adversarial inputs (prompt injections, score hacks, API key leaks) and 10 benign inputs, then tests the immune system against all of them. Current results: 93% catch rate, 0% false positives.
9. User-Facing Layers
Easy API (easy.py)
The purpose() function analyzes your description and builds the right team:
| You say | It builds |
|---|---|
| "Write Python code" | architect + coder + tester |
| "Research papers" | researcher + analyst |
| "Write blog posts" | writer + editor |
| "Analyze data" | analyst + reporter |
| "Help me" | general assistant |
Unified Capabilities (unified.py)
Five competing framework philosophies in one composable layer:
| Capability | Inspired By | Usage |
|---|---|---|
| `Agent()` | OpenAI Agents SDK | One-liner agent creation |
| `Graph()` | LangGraph | Conditional branching, cycles, fan-out |
| `parallel()` | CrewAI | Concurrent task execution |
| `Conversation()` | AutoGen | Agent-to-agent message passing |
| `KnowledgeStore` | LlamaIndex | RAG as a tool |
Robust Parser (robust_parser.py)
The universal solution to "LLMs can't reliably produce JSON" (a fallback sketch follows the list):
- Tries TOML first (fewer tokens than JSON)
- Falls back to JSON
- Falls back to field extraction by regex
- Never crashes. Always returns something usable.
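A sketch of that fallback chain using only the standard library (tomllib requires Python 3.11+); the field list and regex are assumptions, not robust_parser.py's code:

```python
# Sketch of the fallback chain (illustrative, not the real parser).
import json
import re
import tomllib

def parse_llm_output(text: str, fields=("action", "reasoning")) -> dict:
    try:
        return tomllib.loads(text)      # 1. TOML first (fewer tokens than JSON)
    except Exception:
        pass
    try:
        return json.loads(text)         # 2. Fall back to JSON
    except Exception:
        pass
    extracted = {}                      # 3. Fall back to per-field regex extraction
    for field in fields:
        m = re.search(rf'{field}\s*[:=]\s*"?(.+?)"?\s*$', text,
                      re.IGNORECASE | re.MULTILINE)
        if m:
            extracted[field] = m.group(1)
    return extracted or {"raw": text}   # 4. Never crash; always return something
```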
10. How Models Are Handled
resolve_backend()
One function routes to any provider:
resolve_backend("openrouter:meta-llama/llama-3.3-70b-instruct")
resolve_backend("groq:llama-3.3-70b-versatile")
resolve_backend("openai:gpt-4o")
resolve_backend("ollama:qwen3:1.7b") # Local, free
resolve_backend("hf:Qwen/Qwen3-32B")
resolve_backend("together:meta-llama/Llama-3.3-70B-Instruct-Turbo")
SLM-Native Design
The framework was designed for small models (0.6B-3B params):
- Grammar-constrained output via Ollama (forces valid structure from any model)
- Prompt compression for small context windows (8K-32K)
- Tool RAG: only load relevant tools into the prompt (saves tokens)
- TOML format: fewer tokens than JSON
_strip_thinking()
Handles reasoning models (Qwen3, DeepSeek-R1) that wrap output in <think> tags. Automatically strips the thinking and returns only the answer.
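One possible implementation of the idea (the real _strip_thinking may differ):

```python
# Possible sketch of stripping <think>...</think> blocks from model output.
import re

def strip_thinking(text: str) -> str:
    """Drop <think>...</think> blocks emitted by reasoning models, keep the answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```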
11. The Research Behind It
Every design decision traces to a published paper. The full list with citations, methodology sections, and implementation mapping is in COMPILED_RESEARCH.md.
The formal framework β Purpose-MDP with 5 axioms, 3 theorems, and convergence proofs β is in PURPOSE_LEARNING.md.
Key theoretical result: The self-improvement is a form of Potential-Based Reward Shaping (Ng et al., 1999). Our ΔΦ = Φ(s') - Φ(s) preserves the optimal policy while providing dense per-step feedback. The heuristic library converges to a fixed point under bounded capacity.
12. For Contributors
File Structure
purpose_agent/
├── types.py              # State, Action, Trajectory, Heuristic, PurposeScore
├── llm_backend.py        # LLMBackend ABC + HF, OpenAI, Mock + resolve_backend
├── slm_backends.py       # Ollama, llama-cpp, prompt compression, SLM registry
├── robust_parser.py      # Universal parser: TOML → JSON → regex (never crashes)
├── actor.py              # ReAct agent with 3-tier memory prompts
├── purpose_function.py   # Φ(s) critic with 7 anti-gaming rules
├── experience_replay.py  # Two-phase retrieval (similarity → Q-value)
├── optimizer.py          # Trajectory → heuristic distillation
├── orchestrator.py       # Main step loop
├── v2_types.py           # RunMode, MemoryScope, PurposeScoreV2
├── trace.py              # JSONL execution traces
├── memory.py             # 7 MemoryKinds × 5 MemoryStatuses
├── compiler.py           # Token-budgeted prompt compilation
├── immune.py             # 5 threat scanners
├── memory_ci.py          # Quarantine → scan → test → promote/reject
├── evalport.py           # Pluggable evaluation protocol
├── benchmark_v2.py       # Train/val/test splits with ablation
├── meta_rewarding.py     # Self-improving critic (arxiv:2407.19594)
├── self_taught.py        # Synthetic critic training (arxiv:2408.02666)
├── prompt_optimizer.py   # DSPy-style bootstrap (arxiv:2310.03714)
├── llm_compiler.py       # Parallel tool DAG (arxiv:2312.04511)
├── retroformer.py        # Structured reflection (arxiv:2308.02151)
├── breakthroughs.py      # MoH, hindsight relabeling, heuristic evolution, etc.
├── unified.py            # Agent, Graph, parallel, Conversation, KnowledgeStore
├── easy.py               # purpose(), Team, quickstart wizard
├── tools.py              # Secure built-in tools
├── streaming.py          # Async + event streaming
├── observability.py      # Cost tracking, callbacks
├── multi_agent.py        # Agent teams with shared learning
├── hitl.py               # Human-in-the-loop + checkpointing
├── evaluation.py         # V1 benchmark runner
├── registry.py           # Plugin system
├── __init__.py           # 103 exports
└── __main__.py           # CLI entry point
Adding a New LLM Provider
# In your code (no core edits needed):
from purpose_agent import backend_registry, OpenAICompatibleBackend

backend_registry.register(
    "my_provider",
    lambda model, api_key: OpenAICompatibleBackend(
        model=model, base_url="https://api.myprovider.com/v1", api_key=api_key
    ),
)
Adding a New Tool
from purpose_agent import FunctionTool

def my_search(query: str) -> str:
    """Search my database."""
    return db.search(query)  # `db` stands in for your own data-access object

tool = FunctionTool.from_function(my_search)
Running Tests
python tests/test_core.py # 21 unit tests
python tests/launch_readiness.py # 119 comprehensive tests
python benchmarks/validate.py # Mock benchmark suite
python benchmarks/validate.py --quick # Fast smoke test