# COMPILED RESEARCH — Purpose Agent

> Living document. Every implementation decision traces back to a paper, benchmark, or empirical finding listed here. Updated with each feature addition.

---

## feat: Core Architecture — Self-Improving Agent Loop via Φ(s) State-Value Evaluation

**Date:** 2025-04-28 | **Modules:** `types.py`, `actor.py`, `purpose_function.py`, `experience_replay.py`, `optimizer.py`, `orchestrator.py`

### Papers Implemented

| Paper | ArXiv | Key Contribution | Where Used |
|-------|-------|------------------|------------|
| MUSE | [2510.08002](https://arxiv.org/abs/2510.08002) | 3-tier memory (strategic/procedural/tool), Plan-Execute-Reflect-Memorize loop, independent Reflect Agent | `actor.py` (memory tiers), `optimizer.py` (post-task distillation), `orchestrator.py` (reflect cycle) |
| LATS | [2310.04406](https://arxiv.org/abs/2310.04406) | LLM-as-value-function V(s) = λ·LM_score + (1-λ)·SC_score, score AFTER env feedback | `purpose_function.py` (Φ scoring, anti-inflation normalization) |
| REMEMBERER | [2306.07929](https://arxiv.org/abs/2306.07929) | Q-value experience replay with tabular Q-Learning updates: Q(g,o,a) ← (1-α)Q + α[r + γ·max Q] | `experience_replay.py` (Q-value storage + MC update), `types.py` (Heuristic.update_q_value) |
| Reflexion | [2303.11366](https://arxiv.org/abs/2303.11366) | Verbal reinforcement via episodic memory, Actor/Evaluator/Self-Reflection triad | `orchestrator.py` (actor-critic separation), `actor.py` (ReAct format) |
| SPC | [2504.19162](https://arxiv.org/abs/2504.19162) | Adversarial self-play critic: Sneaky Generator vs Step Critic | `purpose_function.py` (7 anti-reward-hacking rules, evidence requirement) |
| CER | [2506.06698](https://arxiv.org/abs/2506.06698) | Contextual experience distillation: Dynamics (url→summary) + Skills (abstract SOPs with {variables}) | `optimizer.py` (DISTILL_TRAJECTORY_PROMPT pattern, {variable} placeholders) |
| MemRL | [2601.03192](https://arxiv.org/abs/2601.03192) | Memory-Augmented MDP: decouple "which memory to retrieve" (learned Q) from "how to act given memory" (LLM) | `experience_replay.py` (two-phase retrieval: semantic recall → Q-value re-rank) |
| Voyager | [2305.16291](https://arxiv.org/abs/2305.16291) | Skill library as long-term memory, self-verification critic prompt | `optimizer.py` (heuristic library concept), `experience_replay.py` (persistent skill storage) |

### Key Design Decisions

**Why Φ(s) potential-based shaping instead of binary reward:**

- LATS showed V(s) with LLM scoring outperforms binary success/fail on HotPotQA, WebShop, HumanEval
- Potential-based shaping (Φ(s_new) - Φ(s_current)) satisfies the necessary and sufficient condition for policy invariance under reward shaping (Ng et al., 1999)
- Enables learning from partial successes — binary reward discards all information from failed tasks
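A minimal sketch of how these pieces compose. Function names and the default constants are illustrative placeholders, not the actual `purpose_function.py` / `types.py` API: the LATS-style blend yields Φ(s), the potential difference becomes the per-step shaped reward, and the REMEMBERER tabular update folds that reward into a stored heuristic's Q-value.

```python
# Illustrative sketch: names and constants are assumptions, not the purpose_function.py / types.py API.

def blended_phi(lm_score: float, sc_score: float, lam: float = 0.5) -> float:
    """LATS-style state value: V(s) = λ·LM_score + (1-λ)·SC_score, clamped to [0, 1]."""
    return max(0.0, min(1.0, lam * lm_score + (1 - lam) * sc_score))

def shaped_reward(phi_new: float, phi_current: float) -> float:
    """Potential-based shaping term Φ(s_new) - Φ(s_current) (Ng et al., 1999):
    preserves the optimal policy while still rewarding partial progress."""
    return phi_new - phi_current

def q_update(q: float, reward: float, max_next_q: float,
             alpha: float = 0.1, gamma: float = 0.9) -> float:
    """REMEMBERER tabular update: Q(g,o,a) <- (1-α)·Q + α·[r + γ·max Q]."""
    return (1 - alpha) * q + alpha * (reward + gamma * max_next_q)

# A step that lifts Φ from 0.40 to 0.70 yields a +0.30 shaped reward,
# which nudges the matching heuristic's Q-value upward.
phi_before = blended_phi(lm_score=0.5, sc_score=0.3)   # 0.40
phi_after = blended_phi(lm_score=0.8, sc_score=0.6)    # 0.70
r = shaped_reward(phi_after, phi_before)               # +0.30
print(q_update(q=0.5, reward=r, max_next_q=0.6))       # ≈ 0.534
```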
**Why 3-tier memory instead of flat:**

- MUSE achieved SOTA 51.78% on TheAgentCompany with 3-tier memory; the flat-memory baseline was 23.65%
- Strategic tier prevents context bloat (loaded once at task start, not per step)
- Procedural tier uses lazy loading (only the index in the prompt, full SOP on demand) — critical for SLM context limits

**Why separate critic LLM from actor:**

- MUSE's independent Reflect Agent removed self-confirmation bias
- SPC's adversarial approach showed LLMs are sycophantic self-evaluators — separate prompts are essential

**Why 7 anti-reward-hacking rules:**

- JSONSchemaBench (arxiv:2501.10868) showed SLMs produce invalid outputs 35-87% of the time without constraints
- SPC showed adversarial critics detect ~2x more reasoning errors than self-evaluation
- Evidence requirement, cache consistency, anomaly detection, and confidence thresholds are novel programmatic safeguards not found in any paper — they close the gap between theoretical SPC and practical deployment

---

## feat: SLM-Native Backends — Ollama, llama-cpp, Prompt Compression

**Date:** 2025-04-28 | **Modules:** `slm_backends.py`, `registry.py`

### Papers & Benchmarks

| Paper | ArXiv | Key Finding | Where Used |
|-------|-------|-------------|------------|
| TinyAgent | [2409.00608](https://arxiv.org/abs/2409.00608) | 1.1B model matches GPT-4-Turbo on a 16-function Mac agent task via synthetic SFT + Tool RAG (DeBERTa classifier, 34% prompt reduction) + INT4 quantization | `slm_backends.py` (prompt compression), `tools.py` (ToolRegistry.get_relevant_tools = Tool RAG) |
| JSONSchemaBench | [2501.10868](https://arxiv.org/abs/2501.10868) | Guidance: 96% compliance on simple schemas; Outlines: severe timeouts on complex ones; XGrammar: fastest (100x) but lower coverage; llama.cpp/Ollama: 74-97% | `slm_backends.py` (OllamaBackend uses grammar-constrained output via the format= parameter) |
| XGrammar | [2411.15100](https://arxiv.org/abs/2411.15100) | Grammar-constrained decoding engine, up to 100x speedup vs naïve CFG, default in vLLM v0.6+ | Referenced for vLLM production deployment |
| LLMLingua-2 | [2403.12968](https://arxiv.org/abs/2403.12968) | Token classification (keep/drop) trained via GPT-4 distillation, 10x compression with minimal quality loss | `slm_backends.py` (SLMPromptCompressor design, extensibility note for llmlingua integration) |
| SLM Agent Survey | [2510.03847](https://arxiv.org/abs/2510.03847) | Guided decoding + strict JSON Schema + validator-first tool execution closes most of the SLM-vs-LLM capability gap at 10-100x lower cost | Architecture validation — grammar-constrained output is the correct default for SLMs |

### SLM Model Selection Rationale

| Model | Params | Context | Why Included |
|-------|--------|---------|--------------|
| Phi-4-mini | 3.8B | 16K | Top schema compliance on BFCL v3/v4 (Microsoft benchmark) |
| Qwen3-1.7B | 1.7B | 32K | Best balance: strong function calling, large context for agent traces |
| Qwen3-0.6B | 0.6B | 32K | Ultra-light proof point: can an agent work at 600M params? |
| Llama-3.2-3B | 3B | 128K | Largest context in class, Meta's open weights |
| Llama-3.2-1B | 1B | 128K | Smallest Llama; 128K context enables long agent traces |
| SmolLM2-1.7B | 1.7B | 8K | HF native, tests a tight context constraint |
| Gemma-3-1B | 1B | 32K | Google's multimodal-capable SLM |

### Key Design Decisions

**Why grammar-constrained output is mandatory for SLMs:**

- JSONSchemaBench showed prompt-only JSON generation fails 35-87% of the time on even medium schemas for SLMs
- Ollama's grammar engine (via llama.cpp) forces valid output from ANY model regardless of training
- This is the fundamental enabler for SLM-native agents
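As a concrete illustration of the `format=` mechanism, here is a sketch against Ollama's local HTTP chat endpoint, which accepts a JSON Schema in the `format` field in recent versions. The model tag, schema, and endpoint assume a default local Ollama install; this is not the `OllamaBackend` code itself.

```python
import json
import requests  # assumes a local Ollama server at the default port

# The decoder is constrained to this JSON Schema, so the model cannot emit an invalid action.
ACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "tool": {"type": "string"},
        "arguments": {"type": "object"},
        "reasoning": {"type": "string"},
    },
    "required": ["tool", "arguments"],
}

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:1.7b",   # placeholder model tag
        "messages": [{"role": "user", "content": "Pick a tool to list the files in /tmp."}],
        "format": ACTION_SCHEMA, # grammar-constrained decoding (llama.cpp under the hood)
        "stream": False,
    },
    timeout=120,
)
action = json.loads(resp.json()["message"]["content"])
print(action["tool"], action["arguments"])
```

Because decoding is constrained by the grammar compiled from the schema, the reply is schema-valid JSON by construction, which is the property the SLM-native design relies on.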
**Why prompt compression matters:**

- SmolLM2 has an 8K context; the agent system prompt + tool descriptions + history can easily exceed 4K tokens
- TinyAgent showed a 34% prompt reduction via Tool RAG alone
- Our 3-stage compressor (whitespace → verbose phrases → middle truncation) is a no-dependency fallback; LLMLingua-2 is the production upgrade path
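A minimal sketch of that three-stage fallback, assuming an illustrative phrase table and character budget; the real `SLMPromptCompressor` internals may differ.

```python
import re

# Stage-2 substitutions are illustrative; a real table would be larger.
VERBOSE_PHRASES = {
    "in order to": "to",
    "it is important to note that": "",
    "please make sure that you": "",
}

def compress(prompt: str, max_chars: int = 8000) -> str:
    # Stage 1: collapse runs of whitespace.
    text = re.sub(r"\s+", " ", prompt).strip()
    # Stage 2: drop or shorten verbose boilerplate phrases.
    for phrase, replacement in VERBOSE_PHRASES.items():
        text = text.replace(phrase, replacement)
    # Stage 3: middle truncation, keeping the head (instructions) and tail (recent history).
    if len(text) > max_chars:
        half = max_chars // 2
        text = text[:half] + " ...[truncated]... " + text[-half:]
    return text
```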
---

## feat: Streaming & Async Engine

**Date:** 2025-04-28 | **Module:** `streaming.py`

### Patterns from Framework Analysis

- **smolagents**: Agents are synchronous internally; `anyio.to_thread.run_sync` for async contexts (official pattern from HF docs)
- **LangGraph**: `graph.astream_events(input, version="v2")` is genuinely async — the gold standard for streaming
- **CrewAI**: `kickoff_async()` is NOT truly async — it is a `loop.run_in_executor()` wrapper (documented caveat)

### Design Decision

Adopted the smolagents pattern: sync core + `asyncio.to_thread` wrappers. Rationale:

1. Most LLM backends (Ollama, llama-cpp) are synchronous
2. Thread-based async avoids the complexity of native async for I/O-bound LLM calls
3. `AsyncOrchestrator.run_task_stream()` yields `StreamEvent` objects — matches LangGraph's event-streaming UX

---

## feat: Tool Framework with Tool RAG

**Date:** 2025-04-28 | **Module:** `tools.py`

### Research Applied

- **TinyAgent (arxiv:2409.00608)**: Tool RAG via a DeBERTa-v3-small multi-label classifier selects relevant tools (avg 3.97 of 6 total = 34% prompt reduction). We implement a lightweight trigram-embedding version; the production path is a fine-tuned classifier.
- **smolagents CodeAgent pattern**: For SLMs, code-based actions (Python generation) are more reliable than JSON tool calls. Our `FunctionTool.from_function()` bridges both — tools have JSON schemas for structured-output-capable models, and `to_prompt(compact=True)` for an SLM-friendly text format.
- **OpenAI function calling schema**: All tools export `to_schema()` in OpenAI-compatible format for backends that support native tool_calls.

---

## feat: Observability — Cost Tracking & Callbacks

**Date:** 2025-04-28 | **Module:** `observability.py`

### Competitive Analysis

| Framework | Observability Approach |
|-----------|------------------------|
| LangChain/LangGraph | LangSmith (proprietary SaaS) + OpenTelemetry export |
| CrewAI | AgentOps integration (proprietary) |
| smolagents | Basic step logging |
| **Purpose Agent** | Pluggable callback system (no vendor lock-in) + built-in cost tracking |

### Design Decision

No vendor lock-in. `AgentCallback` protocol + `CallbackManager` dispatcher. Users plug in whatever they want:

- `LoggingCallback` → structured logs
- `JSONFileCallback` → JSONL event stream (ingestible by any analytics tool)
- `MetricsCollector` → in-memory aggregate metrics
- Custom: implement `on_event(AgentEvent)` → integrate with Arize, LangSmith, Weights & Biases, etc.

Cost tracking uses per-model pricing tables. Local models get electricity-cost estimates (~$0.005/1M tokens on CPU).

---

## feat: Multi-Agent with Shared Self-Improvement

**Date:** 2025-04-28 | **Module:** `multi_agent.py`

### Research Applied

| Paper | Contribution |
|-------|--------------|
| MUSE (2510.08002) | Independent Reflect Agent → our critic_model is separate from agent models |
| AgentFly (2508.16153) | Case bank with soft Q-learning for retrieval utility → our shared_replay with Q-value ranking |
| DynaSaur (2411.01747) | Dynamic action accumulation into a vector-indexed library → ToolRegistry with semantic retrieval |

### Key Innovation: Shared Experience Replay

No other multi-agent framework does this. When Agent A completes a task:

1. The trajectory goes to the shared ExperienceReplay
2. The Optimizer distills heuristics from it
3. When Agent B starts a task, it retrieves relevant heuristics from the shared pool
4. Agent B benefits from Agent A's experience without any retraining

This is the MemRL (2601.03192) M-MDP formulation applied to multi-agent: the retrieval policy Q(s,m) operates over a shared memory bank M.
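A usage sketch of that flow. The constructor keyword and model tags below are assumptions for illustration, not the exact `multi_agent.py` API.

```python
# Usage sketch only: the constructor keyword and model tags are assumptions,
# not the exact multi_agent.py API.
from purpose_agent import Agent, ExperienceReplay  # hypothetical import path

shared_replay = ExperienceReplay()                  # one memory bank for the whole team

researcher = Agent("qwen3:1.7b", experience_replay=shared_replay)
writer = Agent("phi4-mini", experience_replay=shared_replay)

# Agent A finishes a task: its trajectory is stored and distilled into heuristics in the shared pool.
researcher.run("Collect pricing data for the top 5 GPU cloud providers")

# Agent B retrieves those heuristics (semantic recall, then Q-value re-rank) before planning,
# so it benefits from A's experience with no retraining.
writer.run("Draft a cost comparison of GPU cloud providers")
```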
### Task Delegation

Two-phase: keyword matching (zero cost, instant) → LLM routing (1 API call, accurate). Falls back gracefully: if the LLM is unavailable, keyword matching still works.

---

## feat: Human-in-the-Loop with Φ Score Overrides

**Date:** 2025-04-28 | **Module:** `hitl.py`

### Competitive Analysis

| Framework | HITL Approach |
|-----------|---------------|
| LangGraph | **Best**: Full state checkpointing, interrupt nodes, time-travel debug |
| CrewAI | Basic approval callbacks |
| AutoGen | Chat-based human interaction |
| **Purpose Agent** | Checkpoint/resume + **Φ override** (unique — humans teach the critic) |

### Key Innovation: Φ Score Override → Permanent Learning

When a human overrides a Φ score:

1. The corrected score is recorded in the TrajectoryStep
2. The trajectory (with human-corrected scores) goes into Experience Replay
3. The Optimizer distills heuristics from it — now informed by human judgment
4. Future tasks use these human-informed heuristics

This is effectively RLHF without fine-tuning — the human preference signal flows through the memory system instead of through gradient updates. No other framework has this.

### Checkpoint Design

Serializable state snapshot (JSON) at each step. Enables:

- Resume from any point after human review
- Time-travel: load any checkpoint and re-run from there
- Offline review: save checkpoints, review later, resume
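A sketch of what one such snapshot might carry, with the override recorded in place. The field names are assumptions for illustration, not the actual `hitl.py` schema.

```python
# Illustrative checkpoint payload: field names are assumptions, not the hitl.py schema.
checkpoint = {
    "task": "Summarize the Q3 incident report",
    "step_index": 4,
    "trajectory": [
        {
            "action": "read_file('incidents_q3.md')",
            "observation": "...file contents...",
            "phi_score": 0.82,            # critic's original score
            "phi_score_human": 0.30,      # human override: this is the value that enters replay
            "override_note": "claims success without citing evidence",
        },
    ],
    "memory_snapshot": {"strategic": [], "procedural": [], "tool": []},
}
# Because the override is stored inside the trajectory, the Optimizer later distills heuristics
# from the human-corrected signal: RLHF-style preference flow without any gradient update.
```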
---

## feat: Evaluation Harness — Improvement Curve Tracking

**Date:** 2025-04-28 | **Module:** `evaluation.py`

### Benchmarks Referenced

| Benchmark | Domain | Used By |
|-----------|--------|---------|
| GAIA | General assistant tasks | LATS, Reflexion |
| AlfWorld | Text-based game environments | Reflexion (91% pass@1) |
| WebShop | E-commerce navigation | REMEMBERER (+4% over SOTA) |
| WebArena | Web navigation | CER (51% relative improvement) |
| TheAgentCompany | Corporate productivity | MUSE (51.78% SOTA) |
| SWE-bench | Code generation/repair | Multiple agent papers |
| HumanEval | Code generation | Reflexion (91% pass@1) |

### Design Decision

The improvement curve is the key differentiator chart:

```
Iteration    Success Rate
1            40%    ← Cold start (no experience)
5            70%    ← Learning from past tasks
10           90%    ← Mature agent with full heuristic library
```

No other framework can produce this chart because none of them learn from experience. `BenchmarkRunner.run()` + `BenchmarkResult.get_improvement_curve()` makes this a one-liner. `compare_cold_vs_warm()` is the simplest proof: run once with empty memory, run again with learned memory. The delta IS the self-improvement signal.

---

## refactor: Plugin Registry & Modularity Fixes

**Date:** 2025-04-28 | **Module:** `registry.py`

### Issues Fixed

1. **Duplicated embedding logic**: `ExperienceReplay._compute_embedding` (dim=128) and `ToolRegistry._embed` (dim=64) were copy-pasted. Created `EmbeddingBackend` as a shared utility in the registry.
2. **Private methods used as public API**: `Orchestrator._post_task` and `_sync_memory` were called by `HITLOrchestrator`, `AsyncOrchestrator`, and `AgentTeam`. Made public: `post_task()`, `sync_memory()`.
3. **Hardcoded SLM registry**: The `SLM_REGISTRY` dict was not extensible. Added `model_registry.register()` in the plugin system.
4. **No plugin system**: Adding new backends/tools/callbacks required editing `__init__.py`. Created `PluginRegistry` with `backend_registry`, `callback_registry`, `model_registry` — a new component is one `register()` call.

### Extension Pattern

Adding a new component to Purpose Agent:

```python
# my_custom_backend.py
from purpose_agent import LLMBackend, backend_registry

class MyBackend(LLMBackend):
    def generate(self, messages, **kwargs):
        return "response"

backend_registry.register("my_backend", MyBackend)
# Done — now: backend_registry.create("my_backend")
```

No core files edited. No `__init__.py` changes. Drop the file, import it, register.

---

## Competitive Framework Analysis

**Date:** 2025-04-28

### Why Developers Leave LangChain (sources: Medium, LinkedIn, Reddit, Analytics India Magazine)

1. **Over-abstraction**: Too many layers between user code and the LLM call. Simple tasks require understanding the Chain → LLMChain → PromptTemplate → OutputParser hierarchy.
2. **Massive dependency tree**: Pulls in dozens of packages. Version conflicts are common.
3. **Frequent breaking changes**: The API surface changed significantly between v0.1 → v0.2 → v0.3.
4. **Debugging opacity**: Errors propagate through abstraction layers, making the root cause hard to find.
5. **Performance overhead**: Abstraction layers add latency to every LLM call.

### Purpose Agent's Response to Each Criticism

| LangChain Problem | Purpose Agent Approach |
|-------------------|------------------------|
| Over-abstraction | Flat module structure. Orchestrator → Actor → LLMBackend. 3 hops max. |
| Massive dependencies | stdlib only (core). External deps are optional, per-backend. |
| Breaking changes | Stable `types.py` contract. All modules exchange the same 7 types. |
| Debugging opacity | Structured logging at every step. Observability callbacks. JSON event stream. |
| Performance overhead | Direct LLM calls. No chain/pipeline abstraction layer. |

---

## feat: Unified Capabilities — 5 Framework Philosophies in One Composable Layer

**Date:** 2025-04-28 | **Module:** `unified.py`

### The Five Competing Philosophies

| Framework | Philosophy | Their Core Mechanic | Our Implementation | Zero core changes? |
|-----------|------------|---------------------|--------------------|--------------------|
| **LangGraph** | "I want control" | StateGraph with conditional edges, cycles, fan-out/fan-in | `Graph` class: `add_node()`, `add_edge()`, `add_conditional_edge()`, cyclic execution with visit counting | ✅ Calls `Agent.run()` at each node |
| **CrewAI** | "I want speed" | `Process.sequential` / `Process.hierarchical` / `kickoff_for_each_async` | `parallel()` function: `ThreadPoolExecutor` over `Agent.run()` calls | ✅ Wraps existing Agent |
| **AutoGen** | "I want agents talking" | `GroupChat` with speaker selection, message history | `Conversation` class: round-robin/auto speaker order, shared message history | ✅ Each turn is an `Agent.run()` |
| **OpenAI Agents SDK** | "I want plug-and-play" | `Agent(name, instructions, tools)` → `Runner.run(task)` | `Agent` factory: auto-resolves model strings, auto-creates environment, one-liner | ✅ Wraps Orchestrator |
| **LlamaIndex** | "I want knowledge" | `QueryEngineTool` — RAG as an agent tool | `KnowledgeStore.as_tool()` — chunk/embed/retrieve as a Tool | ✅ Plugs into ToolRegistry |

### Research Behind Each

**Graph Execution (LangGraph pattern)**

- LangGraph uses a `StateGraph` where nodes are functions that transform state and edges are routing rules
- Conditional edges enable cycles (retry loops) and branching (if/else in workflows)
- Our implementation: nodes are either `Agent` instances or `Callable[[State], State]` — when a node is an Agent, its entire Φ improvement loop runs automatically inside the graph node
- Key difference: LangGraph graphs are static compute graphs. Ours are self-improving — each node execution feeds experience replay

**Parallel Execution (CrewAI pattern)**

- CrewAI's `kickoff_for_each_async` is actually `loop.run_in_executor()` — not true async (documented caveat from CrewAI source)
- Our `parallel()` uses `ThreadPoolExecutor` directly — honest concurrency, no fake async wrapper
- All parallel tasks share the same experience replay via the Agent's Orchestrator — learning happens even during concurrent execution

**Agent Conversation (AutoGen GroupChat pattern)**

- AutoGen's `GroupChat` maintains a message list and uses an LLM or round-robin for speaker selection
- Our `Conversation` feeds each agent the full conversation history as its State, then the agent responds via its normal Φ-scored run loop
- Key innovation: conversation turns ARE Φ-scored task executions. The agent learns what good conversation contributions look like across runs.

**Plug-and-Play Factory (OpenAI Agents SDK pattern)**

- OpenAI's `Agent(name, instructions, tools)` → `Runner.run(agent, task)` is the gold standard for simplicity
- Our `Agent` class auto-resolves model strings: `"qwen3:1.7b"` → OllamaBackend, `"gpt-4o"` → OpenAICompatibleBackend, `"Qwen/Qwen3-32B"` → HFInferenceBackend
- `handoff_from=other_agent` transfers experience replay — the OpenAI SDK handoff pattern, but with learning transfer

**Knowledge-Aware Agents (LlamaIndex QueryEngineTool pattern)**

- LlamaIndex's key insight: RAG works better as a TOOL the agent chooses to use (agentic RAG) than as a fixed pipeline (traditional RAG)
- Ref: HyDE (arxiv:2212.10496) — the agent formulates retrieval-optimized queries instead of using the user query directly
- Our `KnowledgeStore.as_tool()` converts any document collection into a Tool — the agent decides WHEN to retrieve
- Uses the same trigram embedding as ExperienceReplay (swappable via EmbeddingBackend for production sentence-transformers)
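A composition sketch showing how these layers stack. Constructor arguments and method signatures below are assumptions for illustration, not the exact `unified.py` API.

```python
# Composition sketch: constructor arguments and method signatures are assumptions,
# not the exact unified.py API.
from purpose_agent import Agent, Graph, KnowledgeStore  # hypothetical import path

docs = KnowledgeStore.from_directory("./runbooks")        # chunk + embed local documents
triager = Agent("qwen3:1.7b", tools=[docs.as_tool()])     # agentic RAG: retrieval is a tool the agent may call
fixer = Agent("phi4-mini")

graph = Graph()
graph.add_node("triage", triager)                         # each node wraps Agent.run()
graph.add_node("fix", fixer)
graph.add_conditional_edge(
    "triage",
    lambda state: "fix" if state.get("root_cause") else "triage",  # retry loop until a cause is found
)

result = graph.run({"task": "Diagnose the failing nightly build"})
# Every node execution is a Φ-scored Agent.run(), so each hop feeds the shared experience replay.
```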
### Architecture Decision: Why One File

All 5 capabilities live in `unified.py` (~30KB) because:

1. **Zero coupling to core**: None of these modify Orchestrator, Actor, PurposeFunction, or ExperienceReplay
2. **Composable**: You can use Graph + KnowledgeStore + Conversation together — they're independent layers
3. **The Φ loop runs everywhere**: `Agent.run()` is the primitive. Graph nodes call it. Parallel tasks call it. Conversation turns call it. Every execution feeds the self-improvement loop.
4. **Removable**: Delete `unified.py` and everything else still works. It's a pure extension layer.

---

## Future Research Directions

### Papers to Implement Next

| Paper | ArXiv | What It Would Add |
|-------|-------|-------------------|
| Meta-Rewarding | [2407.19594](https://arxiv.org/abs/2407.19594) | Self-improving critic via meta-judge loop (DPO on judge preference pairs) |
| Self-Taught Evaluators | [2408.02666](https://arxiv.org/abs/2408.02666) | Synthetic training data for the Purpose Function to improve without human labels |
| DSPy | [2310.03714](https://arxiv.org/abs/2310.03714) | Automatic prompt optimization for system prompts (Actor, Purpose Function) |
| LLMCompiler | [2312.04511](https://arxiv.org/abs/2312.04511) | Parallel function calling plan → faster multi-tool execution |
| Retroformer | [2308.02151](https://arxiv.org/abs/2308.02151) | Policy gradient for retrospective model → trainable reflection |