
COMPILED RESEARCH — Purpose Agent

Living document. Every implementation decision traces back to a paper, benchmark, or empirical finding listed here. Updated with each feature addition.


feat: Core Architecture — Self-Improving Agent Loop via Φ(s) State-Value Evaluation

Date: 2025-04-28 | Modules: types.py, actor.py, purpose_function.py, experience_replay.py, optimizer.py, orchestrator.py

Papers Implemented

| Paper | ArXiv | Key Contribution | Where Used |
|---|---|---|---|
| MUSE | 2510.08002 | 3-tier memory (strategic/procedural/tool), Plan-Execute-Reflect-Memorize loop, independent Reflect Agent | actor.py (memory tiers), optimizer.py (post-task distillation), orchestrator.py (reflect cycle) |
| LATS | 2310.04406 | LLM-as-value-function V(s) = λ·LM_score + (1-λ)·SC_score, scored AFTER env feedback | purpose_function.py (Φ scoring, anti-inflation normalization) |
| REMEMBERER | 2306.07929 | Q-value experience replay with tabular Q-learning updates: Q(g,o,a) ← (1-α)Q + α[r + γ·max Q] | experience_replay.py (Q-value storage + MC update), types.py (Heuristic.update_q_value) |
| Reflexion | 2303.11366 | Verbal reinforcement via episodic memory, Actor/Evaluator/Self-Reflection triad | orchestrator.py (actor-critic separation), actor.py (ReAct format) |
| SPC | 2504.19162 | Adversarial self-play critic: Sneaky Generator vs Step Critic | purpose_function.py (7 anti-reward-hacking rules, evidence requirement) |
| CER | 2506.06698 | Contextual experience distillation: Dynamics (url→summary) + Skills (abstract SOPs with {variables}) | optimizer.py (DISTILL_TRAJECTORY_PROMPT pattern, {variable} placeholders) |
| MemRL | 2601.03192 | Memory-Augmented MDP: decouple "which memory to retrieve" (learned Q) from "how to act given memory" (LLM) | experience_replay.py (two-phase retrieval: semantic recall → Q-value re-rank) |
| Voyager | 2305.16291 | Skill library as long-term memory, self-verification critic prompt | optimizer.py (heuristic library concept), experience_replay.py (persistent skill storage) |
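The REMEMBERER-style update above can be sketched in a few lines. This is a minimal illustration: the `(goal, observation, action)` key and the default α/γ values are illustrative, not the library's actual constants.

```python
def update_q_value(q: float, reward: float, max_next_q: float,
                   alpha: float = 0.1, gamma: float = 0.9) -> float:
    """One tabular step: Q(g,o,a) <- (1-alpha)*Q + alpha*[r + gamma*max Q']."""
    return (1 - alpha) * q + alpha * (reward + gamma * max_next_q)

# Q-table keyed by (goal, observation, action), as in REMEMBERER
q_table: dict = {}
key = ("buy shoes", "on product page", "click_buy")
q_table[key] = update_q_value(q_table.get(key, 0.0), reward=1.0,
                              max_next_q=0.0, alpha=0.5)  # -> 0.5
```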

Key Design Decisions

Why Φ(s) potential-based shaping instead of binary reward:

  • LATS showed V(s) with LLM scoring outperforms binary success/fail on HotPotQA, WebShop, HumanEval
  • Potential-based shaping (Φ(s_new) - Φ(s_current)) satisfies the necessary and sufficient condition for policy invariance under reward shaping (Ng et al., 1999)
  • Enables learning from partial successes — binary reward discards all information from failed tasks
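The shaping term from Ng et al. (1999) is F(s, s') = γ·Φ(s') − Φ(s); the bullet above is the γ = 1 case. A minimal sketch (function name is illustrative):

```python
def shaped_reward(base_reward: float, phi_new: float, phi_current: float,
                  gamma: float = 1.0) -> float:
    # Ng et al. (1999): F(s, s') = gamma*Phi(s') - Phi(s) preserves the optimal policy
    return base_reward + gamma * phi_new - phi_current

# A partial success still yields a learning signal even when base_reward == 0:
signal = shaped_reward(base_reward=0.0, phi_new=0.8, phi_current=0.5)  # ~0.3
```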

Why 3-tier memory instead of flat:

  • MUSE achieved SOTA 51.78% on TheAgentCompany with 3-tier; flat memory baseline was 23.65%
  • Strategic tier prevents context bloat (loaded once at task start, not per-step)
  • Procedural tier uses lazy loading (only index in prompt, full SOP on demand) — critical for SLM context limits
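The lazy-loading idea is simple: only a cheap index of SOP names enters the prompt, and the full procedure is fetched on demand. A minimal sketch, with hypothetical class and method names:

```python
class ProceduralMemory:
    """Lazy-loading tier: only the index enters the prompt; full SOPs load on demand."""
    def __init__(self) -> None:
        self._sops: dict = {}

    def add(self, name: str, sop: str) -> None:
        self._sops[name] = sop

    def index(self) -> str:
        # one line per SOP: this is all the system prompt ever sees
        return "\n".join(f"- {name}" for name in self._sops)

    def load(self, name: str) -> str:
        # full SOP text, fetched only when the agent asks for it
        return self._sops[name]

mem = ProceduralMemory()
mem.add("checkout_flow", "1. open cart\n2. verify items\n3. pay\n4. confirm email")
```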

Why separate critic LLM from actor:

  • MUSE's independent Reflect Agent removed self-confirmation bias
  • SPC's adversarial approach showed LLMs are sycophantic self-evaluators — separate prompts are essential

Why 7 anti-reward-hacking rules:

  • JSONSchemaBench (arxiv:2501.10868) showed SLMs produce invalid outputs 35-87% of the time without constraints
  • SPC showed adversarial critics detect ~2x more reasoning errors than self-evaluation
  • Evidence requirement, cache consistency, anomaly detection, and confidence thresholds are novel programmatic safeguards not found in any paper — they close the gap between theoretical SPC and practical deployment

feat: SLM-Native Backends — Ollama, llama-cpp, Prompt Compression

Date: 2025-04-28 | Modules: slm_backends.py, registry.py

Papers & Benchmarks

| Paper | ArXiv | Key Finding | Where Used |
|---|---|---|---|
| TinyAgent | 2409.00608 | 1.1B model matches GPT-4-Turbo on a 16-function Mac agent task via synthetic SFT + Tool RAG (DeBERTa classifier, 34% prompt reduction) + INT4 quantization | slm_backends.py (prompt compression), tools.py (ToolRegistry.get_relevant_tools = Tool RAG) |
| JSONSchemaBench | 2501.10868 | Guidance: 96% compliance on simple schemas; Outlines: severe timeouts on complex schemas; XGrammar: fastest (100x) but lower coverage; llama.cpp/Ollama: 74-97% | slm_backends.py (OllamaBackend uses grammar-constrained output via the format= parameter) |
| XGrammar | 2411.15100 | Grammar-constrained decoding engine, up to 100x speedup vs naïve CFG, default in vLLM v0.6+ | Referenced for vLLM production deployment |
| LLMLingua-2 | 2403.12968 | Token classification (keep/drop) trained via GPT-4 distillation, 10x compression with minimal quality loss | slm_backends.py (SLMPromptCompressor design, extensibility note for llmlingua integration) |
| SLM Agent Survey | 2510.03847 | Guided decoding + strict JSON Schema + validator-first tool execution closes most of the SLM-vs-LLM capability gap at 10-100x lower cost | Architecture validation — grammar-constrained output is the correct default for SLMs |

SLM Model Selection Rationale

| Model | Params | Context | Why Included |
|---|---|---|---|
| Phi-4-mini | 3.8B | 16K | Top schema compliance on BFCL v3/v4 (Microsoft benchmark) |
| Qwen3-1.7B | 1.7B | 32K | Best balance: strong function calling, large context for agent traces |
| Qwen3-0.6B | 0.6B | 32K | Ultra-light proof point: can an agent work at 600M params? |
| Llama-3.2-3B | 3B | 128K | Largest context in class, Meta's open weights |
| Llama-3.2-1B | 1B | 128K | Smallest Llama; 128K context enables long agent traces |
| SmolLM2-1.7B | 1.7B | 8K | HF native, tests tight context constraint |
| Gemma-3-1B | 1B | 32K | Google's open SLM from the Gemma 3 family |

Key Design Decisions

Why grammar-constrained output is mandatory for SLMs:

  • JSONSchemaBench showed prompt-only JSON generation fails 35-87% on even medium schemas for SLMs
  • Ollama's grammar engine (via llama.cpp) forces valid output from ANY model regardless of training
  • This is the fundamental enabler for SLM-native agents
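Recent Ollama versions accept a JSON Schema object in the `format` field of `/api/chat` (older versions only accept `"json"`). A sketch of the request payload; the schema and field names for the tool call are illustrative, and nothing is actually sent over the wire here:

```python
# Hypothetical schema for a single tool call; field names are illustrative
TOOL_CALL_SCHEMA = {
    "type": "object",
    "properties": {
        "tool": {"type": "string"},
        "arguments": {"type": "object"},
    },
    "required": ["tool", "arguments"],
}

def build_constrained_request(model: str, prompt: str,
                              schema: dict = TOOL_CALL_SCHEMA) -> dict:
    """Payload for Ollama's /api/chat; a JSON Schema in `format` makes the
    llama.cpp grammar engine reject any token that would break the schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "format": schema,
        "stream": False,
    }

payload = build_constrained_request("qwen3:1.7b", "Pick a tool to check the weather.")
```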

Why prompt compression matters:

  • SmolLM2 has 8K context; agent system prompt + tool descriptions + history can exceed 4K tokens easily
  • TinyAgent showed 34% prompt reduction via Tool RAG alone
  • Our 3-stage compressor (whitespace → verbose phrases → middle truncation) is a no-dependency fallback; LLMLingua-2 is the production upgrade path
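The three stages can be sketched as follows (a minimal sketch: the phrase table and the function name are illustrative, not the actual contents of slm_backends.py):

```python
import re

# Stage-2 phrase table is illustrative; the real table lives in slm_backends.py
VERBOSE_PHRASES = {
    "in order to": "to",
    "it is important to note that": "note:",
}

def compress_prompt(prompt: str, max_chars: int = 4000) -> str:
    out = re.sub(r"\s+", " ", prompt).strip()       # stage 1: collapse whitespace
    for phrase, short in VERBOSE_PHRASES.items():   # stage 2: shorten verbosity
        out = out.replace(phrase, short)
    if len(out) > max_chars:                        # stage 3: middle truncation
        half = (max_chars - 5) // 2                 # keep head + tail, drop middle
        out = out[:half] + " ... " + out[-half:]
    return out
```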

feat: Streaming & Async Engine

Date: 2025-04-28 | Module: streaming.py

Patterns from Framework Analysis

  • smolagents: Agents are synchronous internally; anyio.to_thread.run_sync for async contexts (official pattern from HF docs)
  • LangGraph: graph.astream_events(input, version="v2") is genuinely async — the gold standard for streaming
  • CrewAI: kickoff_async() is NOT truly async — it's a loop.run_in_executor() wrapper (documented caveat)

Design Decision

Adopted smolagents pattern: sync core + asyncio.to_thread wrappers. Rationale:

  1. Most LLM backends (Ollama, llama-cpp) are synchronous
  2. Thread-based async avoids the complexity of native async for I/O-bound LLM calls
  3. AsyncOrchestrator.run_task_stream() yields StreamEvent objects — matches LangGraph's event streaming UX

feat: Tool Framework with Tool RAG

Date: 2025-04-28 | Module: tools.py

Research Applied

  • TinyAgent (arxiv:2409.00608): Tool RAG via DeBERTa-v3-small multi-label classifier selects relevant tools (avg 3.97 vs 6 total = 34% prompt reduction). We implement a lightweight trigram-embedding version; production path is fine-tuned classifier.
  • smolagents CodeAgent pattern: For SLMs, code-based actions (Python generation) are more reliable than JSON tool calls. Our FunctionTool.from_function() bridges both — tools have JSON schemas for structured-output capable models, and to_prompt(compact=True) for SLM-friendly text format.
  • OpenAI function calling schema: All tools export to_schema() in OpenAI-compatible format for backends that support native tool_calls.
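A sketch of what a `to_schema()`-style export looks like. This is a naive illustration, not the library's actual implementation: every parameter is typed as a string, whereas real code would map Python annotations to JSON types.

```python
import inspect

def to_schema(fn) -> dict:
    """Export a plain function as an OpenAI-style tool schema (naive sketch)."""
    params = {name: {"type": "string"}
              for name in inspect.signature(fn).parameters}
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": (fn.__doc__ or "").strip(),
            "parameters": {
                "type": "object",
                "properties": params,
                "required": list(params),
            },
        },
    }

def web_search(query):
    """Search the web and return the top results."""

schema = to_schema(web_search)
```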

feat: Observability β€” Cost Tracking & Callbacks

Date: 2025-04-28 | Module: observability.py

Competitive Analysis

| Framework | Observability Approach |
|---|---|
| LangChain/LangGraph | LangSmith (proprietary SaaS) + OpenTelemetry export |
| CrewAI | AgentOps integration (proprietary) |
| smolagents | Basic step logging |
| Purpose Agent | Pluggable callback system (no vendor lock-in) + built-in cost tracking |

Design Decision

No vendor lock-in. AgentCallback protocol + CallbackManager dispatcher. Users plug in whatever they want:

  • LoggingCallback → structured logs
  • JSONFileCallback → JSONL event stream (ingestible by any analytics tool)
  • MetricsCollector → in-memory aggregate metrics
  • Custom: implement on_event(AgentEvent) → integrate with Arize, LangSmith, Weights & Biases, etc.
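The protocol-plus-dispatcher shape can be sketched like this. Field names on `AgentEvent` are illustrative; the point is that a callback is anything with `on_event`, so no vendor SDK is required:

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class AgentEvent:
    kind: str                                 # e.g. "llm_call", "tool_call", "task_end"
    payload: dict = field(default_factory=dict)

class AgentCallback(Protocol):
    def on_event(self, event: AgentEvent) -> None: ...

class CallbackManager:
    """Dispatches every event to all registered callbacks."""
    def __init__(self) -> None:
        self._callbacks: list = []

    def add(self, cb: AgentCallback) -> None:
        self._callbacks.append(cb)

    def dispatch(self, event: AgentEvent) -> None:
        for cb in self._callbacks:
            cb.on_event(event)

class MetricsCollector:
    """In-memory aggregates: counts events by kind."""
    def __init__(self) -> None:
        self.counts: dict = {}

    def on_event(self, event: AgentEvent) -> None:
        self.counts[event.kind] = self.counts.get(event.kind, 0) + 1

manager = CallbackManager()
metrics = MetricsCollector()
manager.add(metrics)
manager.dispatch(AgentEvent("llm_call", {"tokens": 120}))
manager.dispatch(AgentEvent("llm_call", {"tokens": 80}))
```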

Cost tracking uses per-model pricing tables. Local models get electricity-cost estimates (~$0.005/1M tokens on CPU).


feat: Multi-Agent with Shared Self-Improvement

Date: 2025-04-28 | Module: multi_agent.py

Research Applied

| Paper | Contribution |
|---|---|
| MUSE (2510.08002) | Independent Reflect Agent → our critic_model is separate from agent models |
| AgentFly (2508.16153) | Case bank with soft Q-learning for retrieval utility → our shared_replay with Q-value ranking |
| DynaSaur (2411.01747) | Dynamic action accumulation into a vector-indexed library → ToolRegistry with semantic retrieval |

Key Innovation: Shared Experience Replay

No other multi-agent framework does this. When Agent A completes a task:

  1. Trajectory goes to shared ExperienceReplay
  2. Optimizer distills heuristics from it
  3. When Agent B starts a task, it retrieves relevant heuristics from the shared pool
  4. Agent B benefits from Agent A's experience without any retraining

This is the MemRL (2601.03192) M-MDP formulation applied to multi-agent: the retrieval policy Q(s,m) operates over a shared memory bank M.

Task Delegation

Two-phase: keyword matching (zero cost, instant) → LLM routing (1 API call, accurate). Falls back gracefully: if the LLM is unavailable, keyword matching still works.
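The two-phase routing with graceful fallback can be sketched as follows. Names and the keyword format are illustrative: `agents` maps an agent name to trigger keywords, and `llm_route` stands in for the one-API-call router.

```python
def route_task(task: str, agents: dict, llm_route=None) -> str:
    """Phase 1: keyword match (free, instant). Phase 2: optional LLM router."""
    text = task.lower()
    score, best = max((sum(kw in text for kw in kws), name)
                      for name, kws in agents.items())
    if score > 0:
        return best                              # keyword hit: no API call needed
    if llm_route is not None:
        try:
            return llm_route(task, list(agents)) # one API call
        except Exception:
            pass                                 # LLM down: degrade gracefully
    return best                                  # fall back to best keyword guess

agents = {"coder": ["python", "bug"], "writer": ["blog", "essay"]}
picked = route_task("fix the python bug in auth.py", agents)  # -> "coder"
```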


feat: Human-in-the-Loop with Φ Score Overrides

Date: 2025-04-28 | Module: hitl.py

Competitive Analysis

| Framework | HITL Approach |
|---|---|
| LangGraph | Best: full state checkpointing, interrupt nodes, time-travel debug |
| CrewAI | Basic approval callbacks |
| AutoGen | Chat-based human interaction |
| Purpose Agent | Checkpoint/resume + Φ override (unique — humans teach the critic) |

Key Innovation: Φ Score Override → Permanent Learning

When a human overrides a Φ score:

  1. The corrected score is recorded in the TrajectoryStep
  2. The trajectory (with human-corrected scores) goes into Experience Replay
  3. The Optimizer distills heuristics from it — now informed by human judgment
  4. Future tasks use these human-informed heuristics

This is effectively RLHF without fine-tuning — the human preference signal flows through the memory system instead of through gradient updates. No other framework has this.
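Step 1 of the flow can be sketched as a field on the trajectory record. The field names here are hypothetical; only the `TrajectoryStep` type name comes from the design above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrajectoryStep:
    action: str
    phi_score: float                        # critic's Phi estimate
    human_override: Optional[float] = None  # set when a reviewer corrects the critic

    @property
    def effective_score(self) -> float:
        # the human-corrected score is what flows into experience replay
        return self.phi_score if self.human_override is None else self.human_override

step = TrajectoryStep(action="submit_form", phi_score=0.9)
step.human_override = 0.2  # reviewer judged the step harmful despite the high Phi
```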

Checkpoint Design

Serializable state snapshot (JSON) at each step. Enables:

  • Resume from any point after human review
  • Time-travel: load any checkpoint and re-run from there
  • Offline review: save checkpoints, review later, resume
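Because the snapshot is plain JSON, save/load is trivially portable. A minimal sketch with hypothetical function names and a throwaway state dict:

```python
import json
import os
import tempfile

def save_checkpoint(path: str, step: int, state: dict) -> None:
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)

def load_checkpoint(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.gettempdir(), "purpose_agent_ckpt_step3.json")
save_checkpoint(path, step=3, state={"task": "draft report", "history": ["outline"]})
ckpt = load_checkpoint(path)  # resume or time-travel from step 3
```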

feat: Evaluation Harness — Improvement Curve Tracking

Date: 2025-04-28 | Module: evaluation.py

Benchmarks Referenced

| Benchmark | Domain | Used By |
|---|---|---|
| GAIA | General assistant tasks | LATS, Reflexion |
| AlfWorld | Text-based game environments | Reflexion (91% pass@1) |
| WebShop | E-commerce navigation | REMEMBERER (+4% over SOTA) |
| WebArena | Web navigation | CER (51% relative improvement) |
| TheAgentCompany | Corporate productivity | MUSE (51.78% SOTA) |
| SWE-bench | Code generation/repair | Multiple agent papers |
| HumanEval | Code generation | Reflexion (91% pass@1) |

Design Decision

The improvement curve is the key differentiator chart:

Iteration    Success Rate
    1           40%      ← Cold start (no experience)
    5           70%      ← Learning from past tasks
   10           90%      ← Mature agent with full heuristic library

No other framework can produce this chart because none of them learn from experience. BenchmarkRunner.run() + BenchmarkResult.get_improvement_curve() makes this a one-liner.

compare_cold_vs_warm() is the simplest proof: run once with empty memory, run again with learned memory. The delta IS the self-improvement signal.
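The cold-vs-warm comparison can be sketched end to end. Everything here is illustrative: `toy_agent` stands in for a real agent, and a plain list stands in for ExperienceReplay; the only point is that the delta isolates the memory effect.

```python
def compare_cold_vs_warm(run_task, tasks):
    """Run the suite twice against the same memory; the delta is the signal."""
    memory: list = []  # stands in for ExperienceReplay
    cold = sum(run_task(t, memory) for t in tasks) / len(tasks)
    warm = sum(run_task(t, memory) for t in tasks) / len(tasks)
    return {"cold": cold, "warm": warm, "delta": warm - cold}

def toy_agent(task, memory):
    """Succeeds only when the task was seen before: a pure memory effect."""
    if task in memory:
        return 1
    memory.append(task)
    return 0

curve = compare_cold_vs_warm(toy_agent, ["book flight", "file expense"])
# curve["delta"] == 1.0: the entire gain comes from memory
```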


refactor: Plugin Registry & Modularity Fixes

Date: 2025-04-28 | Module: registry.py

Issues Fixed

  1. Duplicated embedding logic: ExperienceReplay._compute_embedding (dim=128) and ToolRegistry._embed (dim=64) were copy-pasted. Created EmbeddingBackend as shared utility in registry.
  2. Private methods used as public API: Orchestrator._post_task and _sync_memory were called by HITLOrchestrator, AsyncOrchestrator, AgentTeam. Made public: post_task(), sync_memory().
  3. Hardcoded SLM registry: SLM_REGISTRY dict was not extensible. Added model_registry.register() in plugin system.
  4. No plugin system: Adding new backends/tools/callbacks required editing __init__.py. Created PluginRegistry with backend_registry, callback_registry, model_registry — new components are 1 register() call.

Extension Pattern

Adding a new component to Purpose Agent:

```python
# my_custom_backend.py
from purpose_agent import LLMBackend, backend_registry

class MyBackend(LLMBackend):
    def generate(self, messages, **kwargs):
        return "response"

backend_registry.register("my_backend", MyBackend)
# Done — now: backend_registry.create("my_backend")
```

No core files edited. No __init__.py changes. Drop the file, import it, register.


Competitive Framework Analysis

Date: 2025-04-28

Why Developers Leave LangChain (sources: Medium, LinkedIn, Reddit, Analytics India Magazine)

  1. Over-abstraction: Too many layers between user code and the LLM call. Simple tasks require understanding the Chain → LLMChain → PromptTemplate → OutputParser hierarchy.
  2. Massive dependency tree: Pulls in dozens of packages. Version conflicts common.
  3. Frequent breaking changes: API surface changed significantly between v0.1 → v0.2 → v0.3.
  4. Debugging opacity: Errors propagate through abstraction layers, making root cause hard to find.
  5. Performance overhead: Abstraction layers add latency to every LLM call.

Purpose Agent's Response to Each Criticism

| LangChain Problem | Purpose Agent Approach |
|---|---|
| Over-abstraction | Flat module structure. Orchestrator → Actor → LLMBackend. 3 hops max. |
| Massive dependencies | stdlib only (core). External deps are optional, per-backend. |
| Breaking changes | Stable types.py contract. All modules exchange the same 7 types. |
| Debugging opacity | Structured logging at every step. Observability callbacks. JSON event stream. |
| Performance overhead | Direct LLM calls. No chain/pipeline abstraction layer. |

feat: Unified Capabilities — 5 Framework Philosophies in One Composable Layer

Date: 2025-04-28 | Module: unified.py

The Five Competing Philosophies

| Framework | Philosophy | Their Core Mechanic | Our Implementation | Zero core changes? |
|---|---|---|---|---|
| LangGraph | "I want control" | StateGraph with conditional edges, cycles, fan-out/fan-in | Graph class: add_node(), add_edge(), add_conditional_edge(), cyclic execution with visit counting | ✅ Calls Agent.run() at each node |
| CrewAI | "I want speed" | Process.sequential / Process.hierarchical / kickoff_for_each_async | parallel() function: ThreadPoolExecutor over Agent.run() calls | ✅ Wraps existing Agent |
| AutoGen | "I want agents talking" | GroupChat with speaker selection, message history | Conversation class: round-robin/auto speaker order, shared message history | ✅ Each turn is an Agent.run() |
| OpenAI Agents SDK | "I want plug-and-play" | Agent(name, instructions, tools) → Runner.run(task) | Agent factory: auto-resolves model strings, auto-creates environment, one-liner | ✅ Wraps Orchestrator |
| LlamaIndex | "I want knowledge" | QueryEngineTool — RAG as an agent tool | KnowledgeStore.as_tool() — chunk/embed/retrieve as a Tool | ✅ Plugs into ToolRegistry |

Research Behind Each

Graph Execution (LangGraph pattern)

  • LangGraph uses a StateGraph where nodes are functions that transform state, edges are routing rules
  • Conditional edges enable cycles (retry loops) and branching (if/else in workflows)
  • Our implementation: nodes are either Agent instances or Callable[[State], State] — when a node is an Agent, its entire Φ improvement loop runs automatically inside the graph node
  • Key difference: LangGraph graphs are static compute graphs. Ours are self-improving — each node execution feeds experience replay
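A minimal sketch of the node/edge/visit-counting mechanics, with dict states and plain callables standing in for Agent nodes (the `max_visits` cap and default value are illustrative):

```python
from typing import Callable, Optional

class Graph:
    """Minimal StateGraph-style executor: dict states, conditional edges, cycles."""
    def __init__(self, max_visits: int = 5):
        self.nodes: dict = {}
        self.edges: dict = {}
        self.max_visits = max_visits  # visit cap breaks runaway cycles

    def add_node(self, name: str, fn: Callable) -> None:
        self.nodes[name] = fn

    def add_edge(self, src: str, dst: Optional[str]) -> None:
        self.edges[src] = lambda state: dst

    def add_conditional_edge(self, src: str, router: Callable) -> None:
        self.edges[src] = router  # router returns the next node name or None

    def run(self, start: str, state: dict) -> dict:
        node, visits = start, {}
        while node is not None:
            visits[node] = visits.get(node, 0) + 1
            if visits[node] > self.max_visits:
                break
            state = self.nodes[node](state)
            router = self.edges.get(node)
            node = router(state) if router else None
        return state

g = Graph()
g.add_node("attempt", lambda s: {**s, "tries": s.get("tries", 0) + 1})
g.add_conditional_edge("attempt", lambda s: "attempt" if s["tries"] < 3 else None)
final = g.run("attempt", {})  # retry loop terminates after 3 tries
```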

Parallel Execution (CrewAI pattern)

  • CrewAI's kickoff_for_each_async is actually loop.run_in_executor() — not true async (documented caveat from CrewAI source)
  • Our parallel() uses ThreadPoolExecutor directly — honest concurrency, no fake async wrapper
  • All parallel tasks share the same experience replay via the Agent's Orchestrator — learning happens even during concurrent execution
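The thread-based fan-out is a few lines of stdlib. A minimal sketch; the lambda stands in for a blocking `Agent.run()`:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel(run, tasks, max_workers: int = 4) -> list:
    """Honest thread-based fan-out over a blocking run() callable.
    pool.map preserves the input order of tasks in its results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, tasks))

results = parallel(lambda task: f"done: {task}", ["draft intro", "draft outro"])
```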

Agent Conversation (AutoGen GroupChat pattern)

  • AutoGen's GroupChat maintains a message list, uses LLM or round-robin for speaker selection
  • Our Conversation feeds each agent the full conversation history as its State, then the agent responds via its normal Φ-scored run loop
  • Key innovation: conversation turns ARE Φ-scored task executions. The agent learns what good conversation contributions look like across runs.

Plug-and-Play Factory (OpenAI Agents SDK pattern)

  • OpenAI's Agent(name, instructions, tools) → Runner.run(agent, task) is the gold standard for simplicity
  • Our Agent class auto-resolves model strings: "qwen3:1.7b" → OllamaBackend, "gpt-4o" → OpenAICompatibleBackend, "Qwen/Qwen3-32B" → HFInferenceBackend
  • handoff_from=other_agent transfers experience replay — the OpenAI SDK handoff pattern, but with learning transfer

Knowledge-Aware Agents (LlamaIndex QueryEngineTool pattern)

  • LlamaIndex's key insight: RAG works better as a TOOL the agent chooses to use (agentic RAG) than as a fixed pipeline (traditional RAG)
  • Ref: HyDE (arxiv:2212.10496) — the agent formulates retrieval-optimized queries instead of using the user query directly
  • Our KnowledgeStore.as_tool() converts any document collection into a Tool — the agent decides WHEN to retrieve
  • Uses the same trigram embedding as ExperienceReplay (swappable via EmbeddingBackend for production sentence-transformers)
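A dependency-free trigram embedding can be sketched as hashed character trigrams. This is an illustrative version, not the library's exact code; `zlib.crc32` is used here so the hashing stays deterministic across processes (Python's built-in `hash` for strings is salted per process).

```python
import zlib

def trigram_embed(text: str, dim: int = 128) -> list:
    """Hash character trigrams into a fixed-size unit vector."""
    vec = [0.0] * dim
    padded = f"  {text.lower()}  "  # padding so edge characters form trigrams
    for i in range(len(padded) - 2):
        vec[zlib.crc32(padded[i:i + 3].encode()) % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def cosine(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))  # vectors are already unit-length
```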

Architecture Decision: Why One File

All 5 capabilities live in unified.py (~30KB) because:

  1. Zero coupling to core: None of these modify Orchestrator, Actor, PurposeFunction, or ExperienceReplay
  2. Composable: You can use Graph + KnowledgeStore + Conversation together — they're independent layers
  3. The Φ loop runs everywhere: Agent.run() is the primitive. Graph nodes call it. Parallel tasks call it. Conversation turns call it. Every execution feeds the self-improvement loop.
  4. Removable: Delete unified.py and everything else still works. It's a pure extension layer.

Future Research Directions

Papers to Implement Next

| Paper | ArXiv | What It Would Add |
|---|---|---|
| Meta-Rewarding | 2407.19594 | Self-improving critic via a meta-judge loop (DPO on judge preference pairs) |
| Self-Taught Evaluators | 2408.02666 | Synthetic training data for the Purpose Function to improve without human labels |
| DSPy | 2310.03714 | Automatic prompt optimization for system prompts (Actor, Purpose Function) |
| LLMCompiler | 2312.04511 | Parallel function-calling plans → faster multi-tool execution |
| Retroformer | 2308.02151 | Policy gradient for a retrospective model → trainable reflection |