# COMPILED RESEARCH – Purpose Agent
> Living document. Every implementation decision traces back to a paper, benchmark, or empirical finding listed here. Updated with each feature addition.
---
## feat: Core Architecture – Self-Improving Agent Loop via Φ(s) State-Value Evaluation
**Date:** 2025-04-28 | **Modules:** `types.py`, `actor.py`, `purpose_function.py`, `experience_replay.py`, `optimizer.py`, `orchestrator.py`
### Papers Implemented
| Paper | ArXiv | Key Contribution | Where Used |
|-------|-------|-----------------|------------|
| MUSE | [2510.08002](https://arxiv.org/abs/2510.08002) | 3-tier memory (strategic/procedural/tool), Plan-Execute-Reflect-Memorize loop, independent Reflect Agent | `actor.py` (memory tiers), `optimizer.py` (post-task distillation), `orchestrator.py` (reflect cycle) |
| LATS | [2310.04406](https://arxiv.org/abs/2310.04406) | LLM-as-value-function V(s) = λ·LM_score + (1-λ)·SC_score, score AFTER env feedback | `purpose_function.py` (Φ scoring, anti-inflation normalization) |
| REMEMBERER | [2306.07929](https://arxiv.org/abs/2306.07929) | Q-value experience replay with tabular Q-learning updates: Q(g,o,a) ← (1-α)Q + α[r + γ·max Q] | `experience_replay.py` (Q-value storage + MC update), `types.py` (Heuristic.update_q_value) |
| Reflexion | [2303.11366](https://arxiv.org/abs/2303.11366) | Verbal reinforcement via episodic memory, Actor/Evaluator/Self-Reflection triad | `orchestrator.py` (actor-critic separation), `actor.py` (ReAct format) |
| SPC | [2504.19162](https://arxiv.org/abs/2504.19162) | Adversarial self-play critic: Sneaky Generator vs Step Critic | `purpose_function.py` (7 anti-reward-hacking rules, evidence requirement) |
| CER | [2506.06698](https://arxiv.org/abs/2506.06698) | Contextual experience distillation: Dynamics (url→summary) + Skills (abstract SOPs with {variables}) | `optimizer.py` (DISTILL_TRAJECTORY_PROMPT pattern, {variable} placeholders) |
| MemRL | [2601.03192](https://arxiv.org/abs/2601.03192) | Memory-Augmented MDP: decouple "which memory to retrieve" (learned Q) from "how to act given memory" (LLM) | `experience_replay.py` (two-phase retrieval: semantic recall → Q-value re-rank) |
| Voyager | [2305.16291](https://arxiv.org/abs/2305.16291) | Skill library as long-term memory, self-verification critic prompt | `optimizer.py` (heuristic library concept), `experience_replay.py` (persistent skill storage) |
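The REMEMBERER row's update rule can be sketched as plain tabular Q-learning. A minimal illustration, assuming a dict keyed by (goal, observation, action); `update_q` and the constants are hypothetical names, not the actual `Heuristic.update_q_value` API:

```python
# Illustrative REMEMBERER-style tabular Q update for ranking replayed experience.
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9           # learning rate, discount factor (assumed values)
q_table = defaultdict(float)      # keyed by (goal, observation, action)

def update_q(goal, obs, action, reward, next_obs, next_actions):
    """Q(g,o,a) <- (1-alpha)*Q(g,o,a) + alpha*[r + gamma * max_a' Q(g,o',a')]."""
    best_next = max((q_table[(goal, next_obs, a)] for a in next_actions), default=0.0)
    key = (goal, obs, action)
    q_table[key] = (1 - ALPHA) * q_table[key] + ALPHA * (reward + GAMMA * best_next)
    return q_table[key]
```

Repeated calls move the stored Q-value toward the bootstrapped target, which is exactly what makes high-utility experiences float to the top of retrieval.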
### Key Design Decisions
**Why Φ(s) potential-based shaping instead of binary reward:**
- LATS showed V(s) with LLM scoring outperforms binary success/fail on HotPotQA, WebShop, HumanEval
- Potential-based shaping (F = γ·Φ(s_new) - Φ(s_current)) satisfies the necessary and sufficient condition for policy invariance under reward shaping (Ng et al., 1999)
- Enables learning from partial successes – binary reward discards all information from failed tasks
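A minimal sketch of the shaping term, assuming a scalar Φ already produced by the critic; `shaped_reward` and `GAMMA` are illustrative names, not the real `purpose_function.py` API:

```python
# Potential-based reward shaping (Ng et al., 1999): the shaping term is the
# discounted difference of state potentials, added on top of any env reward.
GAMMA = 0.9  # assumed discount factor

def shaped_reward(phi_current: float, phi_new: float, env_reward: float = 0.0) -> float:
    """F(s, s') = gamma * phi(s') - phi(s); preserves the optimal policy."""
    return env_reward + GAMMA * phi_new - phi_current
```

Because the term telescopes over a trajectory, partial progress (Φ rising from 0.5 to 0.8) yields a positive signal even when the episode ultimately fails.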
**Why 3-tier memory instead of flat:**
- MUSE achieved SOTA 51.78% on TheAgentCompany with 3-tier; flat memory baseline was 23.65%
- Strategic tier prevents context bloat (loaded once at task start, not per-step)
- Procedural tier uses lazy loading (only index in prompt, full SOP on demand) – critical for SLM context limits
**Why separate critic LLM from actor:**
- MUSE's independent Reflect Agent removed self-confirmation bias
- SPC's adversarial approach showed LLMs are sycophantic self-evaluators – separate prompts are essential
**Why 7 anti-reward-hacking rules:**
- JSONSchemaBench (arxiv:2501.10868) showed SLMs produce invalid outputs 35-87% of the time without constraints
- SPC showed adversarial critics detect ~2x more reasoning errors than self-evaluation
- Evidence requirement, cache consistency, anomaly detection, and confidence thresholds are novel programmatic safeguards not found in any paper – they close the gap between theoretical SPC and practical deployment
---
## feat: SLM-Native Backends – Ollama, llama-cpp, Prompt Compression
**Date:** 2025-04-28 | **Modules:** `slm_backends.py`, `registry.py`
### Papers & Benchmarks
| Paper | ArXiv | Key Finding | Where Used |
|-------|-------|-------------|------------|
| TinyAgent | [2409.00608](https://arxiv.org/abs/2409.00608) | 1.1B model matches GPT-4-Turbo on 16-function Mac agent task via: synthetic SFT + Tool RAG (DeBERTa classifier, 34% prompt reduction) + INT4 quantization | `slm_backends.py` (prompt compression), `tools.py` (ToolRegistry.get_relevant_tools = Tool RAG) |
| JSONSchemaBench | [2501.10868](https://arxiv.org/abs/2501.10868) | Guidance: 96% compliance on simple schemas; Outlines: severe timeouts on complex; XGrammar: fastest (100x) but lower coverage; llama.cpp/Ollama: 74-97% | `slm_backends.py` (OllamaBackend uses grammar-constrained output via format= parameter) |
| XGrammar | [2411.15100](https://arxiv.org/abs/2411.15100) | Grammar-constrained decoding engine, up to 100x speedup vs naïve CFG, default in vLLM v0.6+ | Referenced for vLLM production deployment |
| LLMLingua-2 | [2403.12968](https://arxiv.org/abs/2403.12968) | Token classification (keep/drop) trained via GPT-4 distillation, 10x compression with minimal quality loss | `slm_backends.py` (SLMPromptCompressor design, extensibility note for llmlingua integration) |
| SLM Agent Survey | [2510.03847](https://arxiv.org/abs/2510.03847) | Guided decoding + strict JSON Schema + validator-first tool execution closes most SLM-vs-LLM capability gap at 10-100x lower cost | Architecture validation β grammar-constrained output is the correct default for SLMs |
### SLM Model Selection Rationale
| Model | Params | Context | Why Included |
|-------|--------|---------|-------------|
| Phi-4-mini | 3.8B | 16K | Top schema compliance on BFCL v3/v4 (Microsoft benchmark) |
| Qwen3-1.7B | 1.7B | 32K | Best balance: strong function calling, large context for agent traces |
| Qwen3-0.6B | 0.6B | 32K | Ultra-light proof point: can an agent work at 600M params? |
| Llama-3.2-3B | 3B | 128K | Largest context in class, Meta's open weights |
| Llama-3.2-1B | 1B | 128K | Smallest Llama, 128K context enables long agent traces |
| SmolLM2-1.7B | 1.7B | 8K | HF native, tests tight context constraint |
| Gemma-3-1B | 1B | 32K | Google's multimodal-capable SLM |
### Key Design Decisions
**Why grammar-constrained output is mandatory for SLMs:**
- JSONSchemaBench showed prompt-only JSON generation fails 35-87% of the time for SLMs, even on medium-complexity schemas
- Ollama's grammar engine (via llama.cpp) forces valid output from ANY model regardless of training
- This is the fundamental enabler for SLM-native agents
**Why prompt compression matters:**
- SmolLM2 has 8K context; agent system prompt + tool descriptions + history can exceed 4K tokens easily
- TinyAgent showed 34% prompt reduction via Tool RAG alone
- Our 3-stage compressor (whitespace → verbose phrases → middle truncation) is a no-dependency fallback; LLMLingua-2 is the production upgrade path
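The three stages can be sketched as follows; the phrase table and character budget are made-up examples, not the actual `SLMPromptCompressor` internals:

```python
# Illustrative no-dependency 3-stage compressor:
# whitespace collapse -> verbose-phrase substitution -> middle truncation.
import re

VERBOSE = {"in order to": "to", "it is important to note that": "note:"}  # toy table

def compress(text: str, max_chars: int = 200) -> str:
    text = re.sub(r"\s+", " ", text).strip()        # stage 1: collapse whitespace
    for phrase, short in VERBOSE.items():           # stage 2: shorten verbose phrases
        text = text.replace(phrase, short)
    if len(text) > max_chars:                       # stage 3: truncate the middle,
        keep = (max_chars - 5) // 2                 # keeping head and tail intact
        text = text[:keep] + " ... " + text[-keep:]
    return text
```

Middle truncation is chosen deliberately: system instructions (head) and recent history (tail) matter most in agent prompts.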
---
## feat: Streaming & Async Engine
**Date:** 2025-04-28 | **Module:** `streaming.py`
### Patterns from Framework Analysis
- **smolagents**: Agents are synchronous internally; `anyio.to_thread.run_sync` for async contexts (official pattern from HF docs)
- **LangGraph**: `graph.astream_events(input, version="v2")` is genuinely async – gold standard for streaming
- **CrewAI**: `kickoff_async()` is NOT truly async – it's a `loop.run_in_executor()` wrapper (documented caveat)
### Design Decision
Adopted smolagents pattern: sync core + `asyncio.to_thread` wrappers. Rationale:
1. Most LLM backends (Ollama, llama-cpp) are synchronous
2. Thread-based async avoids the complexity of native async for I/O-bound LLM calls
3. `AsyncOrchestrator.run_task_stream()` yields `StreamEvent` objects – matches LangGraph's event streaming UX
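The pattern can be sketched like this, with `StreamEvent` and the generator body as simplified stand-ins for the real `streaming.py` types:

```python
# Sync core + asyncio.to_thread wrapper: the blocking agent loop runs on a
# worker thread while the event loop consumes StreamEvent objects.
import asyncio
from dataclasses import dataclass

@dataclass
class StreamEvent:
    kind: str
    payload: str

def run_task_sync(task: str):
    # Stand-in for the synchronous agent loop; each step emits an event.
    for step in ("plan", "act", "reflect"):
        yield StreamEvent(kind=step, payload=task)

async def run_task_stream(task: str):
    # Drain the sync generator off-loop so LLM calls never block asyncio.
    events = await asyncio.to_thread(lambda: list(run_task_sync(task)))
    for event in events:
        yield event
```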
---
## feat: Tool Framework with Tool RAG
**Date:** 2025-04-28 | **Module:** `tools.py`
### Research Applied
- **TinyAgent (arxiv:2409.00608)**: Tool RAG via DeBERTa-v3-small multi-label classifier selects relevant tools (avg 3.97 vs 6 total = 34% prompt reduction). We implement a lightweight trigram-embedding version; production path is fine-tuned classifier.
- **smolagents CodeAgent pattern**: For SLMs, code-based actions (Python generation) are more reliable than JSON tool calls. Our `FunctionTool.from_function()` bridges both – tools have JSON schemas for structured-output capable models, and `to_prompt(compact=True)` for an SLM-friendly text format.
- **OpenAI function calling schema**: All tools export `to_schema()` in OpenAI-compatible format for backends that support native tool_calls.
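A minimal version of the trigram-overlap retrieval might look like this; a toy stand-in for `ToolRegistry.get_relevant_tools`, not the DeBERTa classifier:

```python
# Lightweight Tool RAG: rank tools by character-trigram overlap with the query,
# so only the top-k tool descriptions enter the prompt.
def trigrams(text: str) -> set:
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

def get_relevant_tools(query: str, tools: dict, top_k: int = 2) -> list:
    """tools maps name -> description; return top_k names by trigram overlap."""
    q = trigrams(query)
    ranked = sorted(tools, key=lambda name: -len(q & trigrams(tools[name])))
    return ranked[:top_k]
```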
---
## feat: Observability – Cost Tracking & Callbacks
**Date:** 2025-04-28 | **Module:** `observability.py`
### Competitive Analysis
| Framework | Observability Approach |
|-----------|----------------------|
| LangChain/LangGraph | LangSmith (proprietary SaaS) + OpenTelemetry export |
| CrewAI | AgentOps integration (proprietary) |
| smolagents | Basic step logging |
| **Purpose Agent** | Pluggable callback system (no vendor lock-in) + built-in cost tracking |
### Design Decision
No vendor lock-in. `AgentCallback` protocol + `CallbackManager` dispatcher. Users plug in whatever they want:
- `LoggingCallback` → structured logs
- `JSONFileCallback` → JSONL event stream (ingestible by any analytics tool)
- `MetricsCollector` → in-memory aggregate metrics
- Custom: implement `on_event(AgentEvent)` → integrate with Arize, LangSmith, Weights & Biases, etc.
Cost tracking uses per-model pricing tables. Local models get electricity-cost estimates (~$0.005/1M tokens on CPU).
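The callback wiring can be sketched as follows; the `AgentEvent` fields and dispatch method here are assumptions for illustration, not the exact `observability.py` surface:

```python
# Pluggable callbacks: any object with on_event(AgentEvent) can be registered;
# CallbackManager fans each event out to every sink, so no vendor is baked in.
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class AgentEvent:
    name: str
    tokens: int = 0

class AgentCallback(Protocol):
    def on_event(self, event: AgentEvent) -> None: ...

class MetricsCollector:
    """In-memory aggregation, the simplest built-in sink."""
    def __init__(self):
        self.total_tokens = 0
    def on_event(self, event: AgentEvent) -> None:
        self.total_tokens += event.tokens

@dataclass
class CallbackManager:
    callbacks: list = field(default_factory=list)
    def dispatch(self, event: AgentEvent) -> None:
        for cb in self.callbacks:      # fan out to every registered sink
            cb.on_event(event)
```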
---
## feat: Multi-Agent with Shared Self-Improvement
**Date:** 2025-04-28 | **Module:** `multi_agent.py`
### Research Applied
| Paper | Contribution |
|-------|-------------|
| MUSE (2510.08002) | Independent Reflect Agent → our critic_model is separate from agent models |
| AgentFly (2508.16153) | Case bank with soft Q-learning for retrieval utility → our shared_replay with Q-value ranking |
| DynaSaur (2411.01747) | Dynamic action accumulation into vector-indexed library → ToolRegistry with semantic retrieval |
### Key Innovation: Shared Experience Replay
No other multi-agent framework does this. When Agent A completes a task:
1. Trajectory goes to shared ExperienceReplay
2. Optimizer distills heuristics from it
3. When Agent B starts a task, it retrieves relevant heuristics from the shared pool
4. Agent B benefits from Agent A's experience without any retraining
This is the MemRL (2601.03192) M-MDP formulation applied to multi-agent: the retrieval policy Q(s,m) operates over a shared memory bank M.
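The four-step flow above can be sketched with a toy shared pool; class names and fields are illustrative stand-ins, not the real `ExperienceReplay` and `Optimizer`:

```python
# Cross-agent learning through one shared replay: Agent A's distilled
# heuristics (with Q-value utility) are retrievable by Agent B, no retraining.
class SharedReplay:
    def __init__(self):
        self.heuristics = []                       # (q_value, heuristic text)
    def add(self, text: str, q_value: float):
        self.heuristics.append((q_value, text))
    def retrieve(self, top_k: int = 1):
        return [t for _, t in sorted(self.heuristics, reverse=True)[:top_k]]

shared = SharedReplay()
# Agent A finishes a task; the optimizer distills heuristics into the pool.
shared.add("verify login before scraping", q_value=0.9)
shared.add("retry flaky endpoints once", q_value=0.4)
# Agent B, starting cold, pulls A's highest-utility heuristic.
best = shared.retrieve(top_k=1)
```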
### Task Delegation
Two-phase: keyword matching (zero cost, instant) → LLM routing (1 API call, accurate). Falls back gracefully: if the LLM is unavailable, keyword matching still works.
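A minimal sketch of the two-phase router, where `llm_route` is a stub standing in for the single LLM call:

```python
# Two-phase task delegation: free keyword match first, LLM routing only if
# keywords are inconclusive, first registered agent as the last resort.
def keyword_route(task: str, agents: dict):
    """agents maps agent name -> list of keywords it handles."""
    for name, keywords in agents.items():
        if any(kw in task.lower() for kw in keywords):
            return name
    return None

def route(task: str, agents: dict, llm_route=None):
    match = keyword_route(task, agents)    # phase 1: zero cost, instant
    if match is not None:
        return match
    if llm_route is not None:              # phase 2: one API call, accurate
        return llm_route(task)
    return next(iter(agents))              # graceful fallback when no LLM
```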
---
## feat: Human-in-the-Loop with Φ Score Overrides
**Date:** 2025-04-28 | **Module:** `hitl.py`
### Competitive Analysis
| Framework | HITL Approach |
|-----------|--------------|
| LangGraph | **Best**: Full state checkpointing, interrupt nodes, time-travel debug |
| CrewAI | Basic approval callbacks |
| AutoGen | Chat-based human interaction |
| **Purpose Agent** | Checkpoint/resume + **Φ override** (unique – humans teach the critic) |
### Key Innovation: Φ Score Override → Permanent Learning
When a human overrides a Φ score:
1. The corrected score is recorded in the TrajectoryStep
2. The trajectory (with human-corrected scores) goes into Experience Replay
3. The Optimizer distills heuristics from it – now informed by human judgment
4. Future tasks use these human-informed heuristics
This is effectively RLHF without fine-tuning – the human preference signal flows through the memory system instead of through gradient updates. No other framework has this.
### Checkpoint Design
Serializable state snapshot (JSON) at each step. Enables:
- Resume from any point after human review
- Time-travel: load any checkpoint and re-run from there
- Offline review: save checkpoints, review later, resume
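A minimal checkpoint sketch along these lines; the fields are illustrative, not the real snapshot schema:

```python
# JSON-serializable per-step snapshot: enough to resume, time-travel, or
# review offline. asdict/json round-trip keeps it storage-agnostic.
import json
from dataclasses import dataclass, asdict

@dataclass
class Checkpoint:
    step: int
    state: dict
    phi_score: float

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "Checkpoint":
        return cls(**json.loads(raw))

# Save at step 3, reload later, resume from that exact point.
saved = Checkpoint(step=3, state={"page": "cart"}, phi_score=0.7).to_json()
restored = Checkpoint.from_json(saved)
```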
---
## feat: Evaluation Harness – Improvement Curve Tracking
**Date:** 2025-04-28 | **Module:** `evaluation.py`
### Benchmarks Referenced
| Benchmark | Domain | Used By |
|-----------|--------|---------|
| GAIA | General assistant tasks | LATS, Reflexion |
| AlfWorld | Text-based game environments | Reflexion (91% pass@1) |
| WebShop | E-commerce navigation | REMEMBERER (+4% over SOTA) |
| WebArena | Web navigation | CER (51% relative improvement) |
| TheAgentCompany | Corporate productivity | MUSE (51.78% SOTA) |
| SWE-bench | Code generation/repair | Multiple agent papers |
| HumanEval | Code generation | Reflexion (91% pass@1) |
### Design Decision
The improvement curve is the key differentiator chart:
```
Iteration    Success Rate
1            40%    ← Cold start (no experience)
5            70%    ← Learning from past tasks
10           90%    ← Mature agent with full heuristic library
```
No other framework can produce this chart because none of them learn from experience. BenchmarkRunner.run() + BenchmarkResult.get_improvement_curve() makes this a one-liner.
`compare_cold_vs_warm()` is the simplest proof: run once with empty memory, run again with learned memory. The delta IS the self-improvement signal.
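The comparison can be sketched in a few lines; `run_fn` here stands in for the real benchmark runner and returns a per-task success score:

```python
# Cold-vs-warm proof of self-improvement: identical tasks, empty memory vs
# learned memory; the delta is the learning signal.
def compare_cold_vs_warm(tasks, run_fn):
    cold = sum(run_fn(t, memory=None) for t in tasks) / len(tasks)
    warm = sum(run_fn(t, memory="learned") for t in tasks) / len(tasks)
    return {"cold": cold, "warm": warm, "delta": warm - cold}
```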
---
## refactor: Plugin Registry & Modularity Fixes
**Date:** 2025-04-28 | **Module:** `registry.py`
### Issues Fixed
1. **Duplicated embedding logic**: `ExperienceReplay._compute_embedding` (dim=128) and `ToolRegistry._embed` (dim=64) were copy-pasted. Created `EmbeddingBackend` as shared utility in registry.
2. **Private methods used as public API**: `Orchestrator._post_task` and `_sync_memory` were called by `HITLOrchestrator`, `AsyncOrchestrator`, `AgentTeam`. Made public: `post_task()`, `sync_memory()`.
3. **Hardcoded SLM registry**: `SLM_REGISTRY` dict was not extensible. Added `model_registry.register()` in plugin system.
4. **No plugin system**: Adding new backends/tools/callbacks required editing `__init__.py`. Created `PluginRegistry` with `backend_registry`, `callback_registry`, `model_registry` – a new component takes one `register()` call.
### Extension Pattern
Adding a new component to Purpose Agent:
```python
# my_custom_backend.py
from purpose_agent import LLMBackend, backend_registry

class MyBackend(LLMBackend):
    def generate(self, messages, **kwargs):
        return "response"

backend_registry.register("my_backend", MyBackend)
# Done – now: backend_registry.create("my_backend")
```
No core files edited. No `__init__.py` changes. Drop the file, import it, register.
---
## Competitive Framework Analysis
**Date:** 2025-04-28
### Why Developers Leave LangChain (sources: Medium, LinkedIn, Reddit, Analytics India Magazine)
1. **Over-abstraction**: Too many layers between user code and the LLM call. Simple tasks require understanding the Chain → LLMChain → PromptTemplate → OutputParser hierarchy.
2. **Massive dependency tree**: Pulls in dozens of packages. Version conflicts common.
3. **Frequent breaking changes**: API surface changed significantly between v0.1 β v0.2 β v0.3.
4. **Debugging opacity**: Errors propagate through abstraction layers, making root cause hard to find.
5. **Performance overhead**: Abstraction layers add latency to every LLM call.
### Purpose Agent's Response to Each Criticism
| LangChain Problem | Purpose Agent Approach |
|-------------------|----------------------|
| Over-abstraction | Flat module structure. Orchestrator → Actor → LLMBackend. 3 hops max. |
| Massive dependencies | stdlib only (core). External deps are optional, per-backend. |
| Breaking changes | Stable `types.py` contract. All modules exchange the same 7 types. |
| Debugging opacity | Structured logging at every step. Observability callbacks. JSON event stream. |
| Performance overhead | Direct LLM calls. No chain/pipeline abstraction layer. |
---
## feat: Unified Capabilities – 5 Framework Philosophies in One Composable Layer
**Date:** 2025-04-28 | **Module:** `unified.py`
### The Five Competing Philosophies
| Framework | Philosophy | Their Core Mechanic | Our Implementation | Zero core changes? |
|-----------|-----------|--------------------|--------------------|-------------------|
| **LangGraph** | "I want control" | StateGraph with conditional edges, cycles, fan-out/fan-in | `Graph` class: `add_node()`, `add_edge()`, `add_conditional_edge()`, cyclic execution with visit counting | ✅ Calls `Agent.run()` at each node |
| **CrewAI** | "I want speed" | `Process.sequential` / `Process.hierarchical` / `kickoff_for_each_async` | `parallel()` function: `ThreadPoolExecutor` over `Agent.run()` calls | ✅ Wraps existing Agent |
| **AutoGen** | "I want agents talking" | `GroupChat` with speaker selection, message history | `Conversation` class: round-robin/auto speaker order, shared message history | ✅ Each turn is an `Agent.run()` |
| **OpenAI Agents SDK** | "I want plug-and-play" | `Agent(name, instructions, tools)` → `Runner.run(task)` | `Agent` factory: auto-resolves model strings, auto-creates environment, one-liner | ✅ Wraps Orchestrator |
| **LlamaIndex** | "I want knowledge" | `QueryEngineTool` → RAG as an agent tool | `KnowledgeStore.as_tool()` → chunk/embed/retrieve as a Tool | ✅ Plugs into ToolRegistry |
### Research Behind Each
**Graph Execution (LangGraph pattern)**
- LangGraph uses a `StateGraph` where nodes are functions that transform state, edges are routing rules
- Conditional edges enable cycles (retry loops) and branching (if/else in workflows)
- Our implementation: nodes are either `Agent` instances or `Callable[[State], State]` – when a node is an Agent, its entire Φ improvement loop runs automatically inside the graph node
- Key difference: LangGraph graphs are static compute graphs. Ours are self-improving – each node execution feeds experience replay
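A toy executor in the spirit of the `Graph` class described above; method names follow the text, but the run loop and visit-count cutoff are assumptions:

```python
# Minimal state graph: nodes transform a state dict, conditional edges route,
# and per-node visit counting breaks runaway retry cycles.
class Graph:
    def __init__(self, max_visits: int = 3):
        self.nodes, self.edges = {}, {}
        self.max_visits = max_visits
    def add_node(self, name, fn):
        self.nodes[name] = fn
    def add_conditional_edge(self, src, router):
        self.edges[src] = router              # router(state) -> next node or None
    def run(self, start, state):
        visits, node = {}, start
        while node is not None:
            visits[node] = visits.get(node, 0) + 1
            if visits[node] > self.max_visits:  # cycle guard
                break
            state = self.nodes[node](state)
            router = self.edges.get(node)
            node = router(state) if router else None
        return state
```

A retry loop is just a conditional edge that routes back to the same node until the state satisfies a predicate.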
**Parallel Execution (CrewAI pattern)**
- CrewAI's `kickoff_for_each_async` is actually `loop.run_in_executor()` – not true async (documented caveat from CrewAI source)
- Our `parallel()` uses `ThreadPoolExecutor` directly – honest concurrency, no fake async wrapper
- All parallel tasks share the same experience replay via the Agent's Orchestrator β learning happens even during concurrent execution
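A minimal `parallel()` in that spirit, where `run` stands in for the synchronous `Agent.run`:

```python
# Honest thread-pool concurrency over blocking agent runs; results come back
# in task order because pool.map preserves input ordering.
from concurrent.futures import ThreadPoolExecutor

def parallel(run, tasks, max_workers: int = 4):
    """Run one blocking call per task; return results in task order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, tasks))
```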
**Agent Conversation (AutoGen GroupChat pattern)**
- AutoGen's `GroupChat` maintains a message list, uses LLM or round-robin for speaker selection
- Our `Conversation` feeds each agent the full conversation history as its State, then the agent responds via its normal Φ-scored run loop
- Key innovation: conversation turns ARE Φ-scored task executions. The agent learns what good conversation contributions look like across runs.
**Plug-and-Play Factory (OpenAI Agents SDK pattern)**
- OpenAI's `Agent(name, instructions, tools)` → `Runner.run(agent, task)` is the gold standard for simplicity
- Our `Agent` class auto-resolves model strings: `"qwen3:1.7b"` → OllamaBackend, `"gpt-4o"` → OpenAICompatibleBackend, `"Qwen/Qwen3-32B"` → HFInferenceBackend
- `handoff_from=other_agent` transfers experience replay – the OpenAI SDK handoff pattern, but with learning transfer
**Knowledge-Aware Agents (LlamaIndex QueryEngineTool pattern)**
- LlamaIndex's key insight: RAG works better as a TOOL the agent chooses to use (agentic RAG) than as a fixed pipeline (traditional RAG)
- Ref: HyDE (arxiv:2212.10496) – the agent formulates retrieval-optimized queries instead of using the user query directly
- Our `KnowledgeStore.as_tool()` converts any document collection into a Tool – the agent decides WHEN to retrieve
- Uses the same trigram embedding as ExperienceReplay (swappable via EmbeddingBackend for production sentence-transformers)
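A toy sketch of the pattern, reusing trigram overlap as the document similarity; names and fields are illustrative, not the real `KnowledgeStore`:

```python
# Agentic RAG: the store exposes retrieval as a tool descriptor the agent may
# choose to invoke, rather than a fixed always-on pipeline.
def _trigrams(text: str) -> set:
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

class KnowledgeStore:
    def __init__(self, docs):
        self.docs = docs
    def retrieve(self, query: str) -> str:
        # Return the best-matching document by trigram overlap.
        return max(self.docs, key=lambda d: len(_trigrams(query) & _trigrams(d)))
    def as_tool(self):
        # The agent decides WHEN to call this; retrieval is just another tool.
        return {"name": "search_knowledge", "fn": self.retrieve}

store = KnowledgeStore(["refunds take 5 days", "shipping is free over $50"])
tool = store.as_tool()
```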
### Architecture Decision: Why One File
All 5 capabilities live in `unified.py` (~30KB) because:
1. **Zero coupling to core**: None of these modify Orchestrator, Actor, PurposeFunction, or ExperienceReplay
2. **Composable**: You can use Graph + KnowledgeStore + Conversation together β they're independent layers
3. **The Φ loop runs everywhere**: Agent.run() is the primitive. Graph nodes call it. Parallel tasks call it. Conversation turns call it. Every execution feeds the self-improvement loop.
4. **Removable**: Delete `unified.py` and everything else still works. It's a pure extension layer.
---
## Future Research Directions
### Papers to Implement Next
| Paper | ArXiv | What It Would Add |
|-------|-------|------------------|
| Meta-Rewarding | [2407.19594](https://arxiv.org/abs/2407.19594) | Self-improving critic via meta-judge loop (DPO on judge preference pairs) |
| Self-Taught Evaluators | [2408.02666](https://arxiv.org/abs/2408.02666) | Synthetic training data for the Purpose Function to improve without human labels |
| DSPy | [2310.03714](https://arxiv.org/abs/2310.03714) | Automatic prompt optimization for system prompts (Actor, Purpose Function) |
| LLMCompiler | [2312.04511](https://arxiv.org/abs/2312.04511) | Parallel function calling plan → faster multi-tool execution |
| Retroformer | [2308.02151](https://arxiv.org/abs/2308.02151) | Policy gradient for retrospective model → trainable reflection |