refactor: modularity fixes + plugin registry + compiled research
COMPILED_RESEARCH.md (added, +299 -0)

# COMPILED RESEARCH – Purpose Agent

> Living document. Every implementation decision traces back to a paper, benchmark, or empirical finding listed here. Updated with each feature addition.

---

## feat: Core Architecture – Self-Improving Agent Loop via Φ(s) State-Value Evaluation

**Date:** 2025-04-28 | **Modules:** `types.py`, `actor.py`, `purpose_function.py`, `experience_replay.py`, `optimizer.py`, `orchestrator.py`

### Papers Implemented

| Paper | ArXiv | Key Contribution | Where Used |
|-------|-------|------------------|------------|
| MUSE | [2510.08002](https://arxiv.org/abs/2510.08002) | 3-tier memory (strategic/procedural/tool), Plan-Execute-Reflect-Memorize loop, independent Reflect Agent | `actor.py` (memory tiers), `optimizer.py` (post-task distillation), `orchestrator.py` (reflect cycle) |
| LATS | [2310.04406](https://arxiv.org/abs/2310.04406) | LLM-as-value-function V(s) = λ·LM_score + (1-λ)·SC_score, score AFTER env feedback | `purpose_function.py` (Φ scoring, anti-inflation normalization) |
| REMEMBERER | [2306.07929](https://arxiv.org/abs/2306.07929) | Q-value experience replay with tabular Q-learning updates: Q(g,o,a) ← (1-α)·Q + α·[r + γ·max Q] (sketched below the table) | `experience_replay.py` (Q-value storage + MC update), `types.py` (`Heuristic.update_q_value`) |
| Reflexion | [2303.11366](https://arxiv.org/abs/2303.11366) | Verbal reinforcement via episodic memory, Actor/Evaluator/Self-Reflection triad | `orchestrator.py` (actor-critic separation), `actor.py` (ReAct format) |
| SPC | [2504.19162](https://arxiv.org/abs/2504.19162) | Adversarial self-play critic: Sneaky Generator vs. Step Critic | `purpose_function.py` (7 anti-reward-hacking rules, evidence requirement) |
| CER | [2506.06698](https://arxiv.org/abs/2506.06698) | Contextual experience distillation: Dynamics (url → summary) + Skills (abstract SOPs with {variables}) | `optimizer.py` (DISTILL_TRAJECTORY_PROMPT pattern, {variable} placeholders) |
| MemRL | [2601.03192](https://arxiv.org/abs/2601.03192) | Memory-Augmented MDP: decouple "which memory to retrieve" (learned Q) from "how to act given memory" (LLM) | `experience_replay.py` (two-phase retrieval: semantic recall → Q-value re-rank) |
| Voyager | [2305.16291](https://arxiv.org/abs/2305.16291) | Skill library as long-term memory, self-verification critic prompt | `optimizer.py` (heuristic library concept), `experience_replay.py` (persistent skill storage) |

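A minimal sketch of the REMEMBERER-style tabular update behind `Heuristic.update_q_value`; the dataclass fields here are illustrative stand-ins, not the actual `types.py` definitions:

```python
from dataclasses import dataclass

@dataclass
class Heuristic:
    """Illustrative stand-in for the Heuristic type in types.py."""
    text: str
    q_value: float = 0.0

    def update_q_value(self, reward: float, next_max_q: float,
                       alpha: float = 0.1, gamma: float = 0.9) -> None:
        # REMEMBERER tabular update: Q <- (1 - alpha) * Q + alpha * [r + gamma * max Q]
        target = reward + gamma * next_max_q
        self.q_value = (1 - alpha) * self.q_value + alpha * target

h = Heuristic("Prefer cached results when a URL was already fetched this session")
h.update_q_value(reward=1.0, next_max_q=0.5)
print(h.q_value)  # ~0.145
```
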
### Key Design Decisions

**Why Φ(s) potential-based shaping instead of binary reward:**
- LATS showed V(s) with LLM scoring outperforms binary success/fail on HotPotQA, WebShop, and HumanEval
- Potential-based shaping (γ·Φ(s_new) - Φ(s_current)) is exactly the form Ng et al. (1999) proved necessary and sufficient for policy invariance under reward shaping (see the sketch after this list)
- Enables learning from partial successes; binary reward discards all information from failed tasks

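A minimal sketch of the shaped step reward; the function name and example scores are illustrative, not `purpose_function.py` verbatim:

```python
def shaped_reward(phi_current: float, phi_new: float, gamma: float = 1.0) -> float:
    """Potential-based shaping term F = gamma * Phi(s_new) - Phi(s_current).

    Because F is a telescoping difference of potentials, adding it to the
    environment reward cannot change the optimal policy (Ng et al., 1999),
    yet it gives dense per-step credit even when the final outcome is failure.
    """
    return gamma * phi_new - phi_current

# The critic scores the state 0.3 before a step and 0.7 after it, so the
# step earns +0.4 shaped reward even if the overall task later fails.
print(shaped_reward(0.3, 0.7))  # ~0.4
```
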
**Why 3-tier memory instead of flat:**
- MUSE achieved a SOTA 51.78% on TheAgentCompany with 3-tier memory; the flat-memory baseline scored 23.65%
- The strategic tier prevents context bloat (loaded once at task start, not per step)
- The procedural tier uses lazy loading (only the index goes in the prompt; the full SOP is fetched on demand, as sketched below), which is critical for SLM context limits

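A sketch of what lazy loading of the procedural tier looks like in practice; the `ProceduralMemory` class and its methods are hypothetical stand-ins for the `actor.py` implementation:

```python
class ProceduralMemory:
    """Lazy-loading sketch of the procedural tier (hypothetical API, not actor.py verbatim)."""

    def __init__(self, sops: dict[str, str]) -> None:
        self._sops = sops  # SOP name -> full SOP text

    def index(self) -> str:
        # Only one line per SOP ever enters the system prompt.
        return "\n".join(f"- {name}" for name in self._sops)

    def load(self, name: str) -> str:
        # The full SOP text is injected only when the actor asks for it.
        return self._sops[name]

mem = ProceduralMemory({"submit_expense_report": "1. Open the portal\n2. ..."})
prompt_header = "Available procedures:\n" + mem.index()  # cheap: names only
```
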
**Why a separate critic LLM from the actor:**
- MUSE's independent Reflect Agent removed self-confirmation bias
- SPC's adversarial approach showed LLMs are sycophantic self-evaluators; separate prompts are essential

**Why 7 anti-reward-hacking rules:**
- JSONSchemaBench (arxiv:2501.10868) showed SLMs produce invalid outputs 35-87% of the time without constraints
- SPC showed adversarial critics detect ~2x more reasoning errors than self-evaluation
- The evidence requirement, cache consistency, anomaly detection, and confidence thresholds are novel programmatic safeguards not found in any paper; they close the gap between theoretical SPC and practical deployment

---

## feat: SLM-Native Backends – Ollama, llama-cpp, Prompt Compression

**Date:** 2025-04-28 | **Modules:** `slm_backends.py`, `registry.py`

### Papers & Benchmarks

| Paper | ArXiv | Key Finding | Where Used |
|-------|-------|-------------|------------|
| TinyAgent | [2409.00608](https://arxiv.org/abs/2409.00608) | A 1.1B model matches GPT-4-Turbo on a 16-function Mac agent task via synthetic SFT + Tool RAG (DeBERTa classifier, 34% prompt reduction) + INT4 quantization | `slm_backends.py` (prompt compression), `tools.py` (`ToolRegistry.get_relevant_tools` = Tool RAG) |
| JSONSchemaBench | [2501.10868](https://arxiv.org/abs/2501.10868) | Guidance: 96% compliance on simple schemas; Outlines: severe timeouts on complex schemas; XGrammar: fastest (100x) but lower coverage; llama.cpp/Ollama: 74-97% | `slm_backends.py` (OllamaBackend uses grammar-constrained output via the format= parameter) |
| XGrammar | [2411.15100](https://arxiv.org/abs/2411.15100) | Grammar-constrained decoding engine, up to 100x speedup vs. naive CFG, default in vLLM v0.6+ | Referenced for vLLM production deployment |
| LLMLingua-2 | [2403.12968](https://arxiv.org/abs/2403.12968) | Token classification (keep/drop) trained via GPT-4 distillation, 10x compression with minimal quality loss | `slm_backends.py` (SLMPromptCompressor design, extensibility note for llmlingua integration) |
| SLM Agent Survey | [2510.03847](https://arxiv.org/abs/2510.03847) | Guided decoding + strict JSON Schema + validator-first tool execution closes most of the SLM-vs-LLM capability gap at 10-100x lower cost | Architecture validation: grammar-constrained output is the correct default for SLMs |

### SLM Model Selection Rationale

| Model | Params | Context | Why Included |
|-------|--------|---------|--------------|
| Phi-4-mini | 3.8B | 16K | Top schema compliance on BFCL v3/v4 (Microsoft benchmark) |
| Qwen3-1.7B | 1.7B | 32K | Best balance: strong function calling, large context for agent traces |
| Qwen3-0.6B | 0.6B | 32K | Ultra-light proof point: can an agent work at 600M params? |
| Llama-3.2-3B | 3B | 128K | Largest context in class, Meta's open weights |
| Llama-3.2-1B | 1B | 128K | Smallest Llama; 128K context enables long agent traces |
| SmolLM2-1.7B | 1.7B | 8K | HF native, tests a tight context constraint |
| Gemma-3-1B | 1B | 32K | Google's multimodal-capable SLM |

### Key Design Decisions

**Why grammar-constrained output is mandatory for SLMs:**
- JSONSchemaBench showed prompt-only JSON generation fails 35-87% of the time on even medium schemas for SLMs
- Ollama's grammar engine (via llama.cpp) forces valid output from ANY model regardless of training (see the sketch after this list)
- This is the fundamental enabler for SLM-native agents

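A minimal sketch of grammar-constrained output against a local Ollama server; the schema and model tag are illustrative, and passing a JSON Schema in `format` assumes Ollama's structured-output support (v0.5+):

```python
import json
import urllib.request

# Hypothetical tool-call schema; Ollama constrains decoding so the model
# cannot emit JSON that violates it.
schema = {
    "type": "object",
    "properties": {
        "tool": {"type": "string"},
        "arguments": {"type": "object"},
    },
    "required": ["tool", "arguments"],
}

payload = {
    "model": "qwen3:1.7b",  # illustrative model tag
    "messages": [{"role": "user", "content": "Search the web for LATS."}],
    "format": schema,       # grammar-constrained decoding via llama.cpp
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
reply = json.loads(urllib.request.urlopen(req).read())
action = json.loads(reply["message"]["content"])  # guaranteed to parse
```
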
**Why prompt compression matters:**
- SmolLM2 has an 8K context; an agent system prompt + tool descriptions + history can easily exceed 4K tokens
- TinyAgent showed a 34% prompt reduction via Tool RAG alone
- Our 3-stage compressor (whitespace → verbose phrases → middle truncation, sketched below) is a no-dependency fallback; LLMLingua-2 is the production upgrade path

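A sketch of the three stages; the phrase table and function signature are illustrative, not the `SLMPromptCompressor` API:

```python
import re

VERBOSE = {
    "in order to": "to",
    "please note that": "note:",
    "it is important to": "",
}

def compress(prompt: str, max_chars: int = 8000) -> str:
    """Sketch of the 3-stage no-dependency fallback compressor."""
    # Stage 1: collapse runs of whitespace.
    text = re.sub(r"[ \t]+", " ", prompt).strip()
    # Stage 2: replace verbose phrases with terse equivalents.
    for phrase, short in VERBOSE.items():
        text = re.sub(re.escape(phrase), short, text, flags=re.IGNORECASE)
    # Stage 3: truncate the middle, keeping head and tail (instructions and
    # recent history matter most; the middle of a long trace matters least).
    if len(text) > max_chars:
        half = max_chars // 2
        text = text[:half] + "\n[...truncated...]\n" + text[-half:]
    return text
```
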
---

## feat: Streaming & Async Engine

**Date:** 2025-04-28 | **Module:** `streaming.py`

### Patterns from Framework Analysis

- **smolagents**: Agents are synchronous internally; use `anyio.to_thread.run_sync` in async contexts (the official pattern from the HF docs)
- **LangGraph**: `graph.astream_events(input, version="v2")` is genuinely async; it is the gold standard for streaming
- **CrewAI**: `kickoff_async()` is NOT truly async; it is a `loop.run_in_executor()` wrapper (a documented caveat)

### Design Decision

Adopted the smolagents pattern: a sync core + `asyncio.to_thread` wrappers (see the sketch after this list). Rationale:
1. Most LLM backends (Ollama, llama-cpp) are synchronous
2. Thread-based async avoids the complexity of native async for I/O-bound LLM calls
3. `AsyncOrchestrator.run_task_stream()` yields `StreamEvent` objects, matching LangGraph's event-streaming UX

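A sketch of the wrapper pattern, assuming a synchronous core with a hypothetical `run_task_events()` generator; this is not `streaming.py` verbatim:

```python
import asyncio
import queue
import threading

class AsyncOrchestrator:
    """Thread-wrapped async facade over a synchronous orchestrator (sketch)."""

    def __init__(self, orchestrator):
        self._orch = orchestrator  # the synchronous core

    async def run_task(self, task: str):
        # One-shot call: push the sync core onto a worker thread.
        return await asyncio.to_thread(self._orch.run_task, task)

    async def run_task_stream(self, task: str):
        # Streaming: the sync core pushes events into a queue from a worker
        # thread; the async side drains it without blocking the event loop.
        q: queue.Queue = queue.Queue()
        sentinel = object()

        def worker():
            try:
                for event in self._orch.run_task_events(task):  # assumed sync generator
                    q.put(event)
            finally:
                q.put(sentinel)

        threading.Thread(target=worker, daemon=True).start()
        while True:
            event = await asyncio.to_thread(q.get)
            if event is sentinel:
                break
            yield event
```
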
---

## feat: Tool Framework with Tool RAG

**Date:** 2025-04-28 | **Module:** `tools.py`

### Research Applied

- **TinyAgent (arxiv:2409.00608)**: Tool RAG via a DeBERTa-v3-small multi-label classifier selects the relevant tools (3.97 on average vs. 6 total = 34% prompt reduction). We implement a lightweight trigram-embedding version (sketched below); the production path is a fine-tuned classifier.
- **smolagents CodeAgent pattern**: For SLMs, code-based actions (Python generation) are more reliable than JSON tool calls. Our `FunctionTool.from_function()` bridges both: tools have JSON schemas for models with structured-output support, and `to_prompt(compact=True)` for an SLM-friendly text format.
- **OpenAI function calling schema**: All tools export `to_schema()` in the OpenAI-compatible format for backends that support native tool_calls.

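A sketch of the lightweight trigram variant; the helper names are illustrative and stand in for `ToolRegistry.get_relevant_tools`:

```python
import math
from collections import Counter

def trigrams(text: str) -> Counter:
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def get_relevant_tools(query: str, tools: dict[str, str], k: int = 4) -> list[str]:
    """Rank tool descriptions by trigram cosine similarity to the task."""
    q = trigrams(query)
    ranked = sorted(tools, key=lambda name: cosine(q, trigrams(tools[name])), reverse=True)
    return ranked[:k]

tools = {"web_search": "search the web for pages", "calculator": "evaluate arithmetic expressions"}
print(get_relevant_tools("find recent papers on agents", tools, k=1))  # ['web_search']
```

Only the top-k tool descriptions enter the prompt, which is where the TinyAgent-style reduction comes from.
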
---

## feat: Observability – Cost Tracking & Callbacks

**Date:** 2025-04-28 | **Module:** `observability.py`

### Competitive Analysis

| Framework | Observability Approach |
|-----------|------------------------|
| LangChain/LangGraph | LangSmith (proprietary SaaS) + OpenTelemetry export |
| CrewAI | AgentOps integration (proprietary) |
| smolagents | Basic step logging |
| **Purpose Agent** | Pluggable callback system (no vendor lock-in) + built-in cost tracking |

### Design Decision

No vendor lock-in: an `AgentCallback` protocol + a `CallbackManager` dispatcher (see the sketch after this list). Users plug in whatever they want:
- `LoggingCallback` → structured logs
- `JSONFileCallback` → a JSONL event stream (ingestible by any analytics tool)
- `MetricsCollector` → in-memory aggregate metrics
- Custom: implement `on_event(AgentEvent)` to integrate with Arize, LangSmith, Weights & Biases, etc.

Cost tracking uses per-model pricing tables. Local models get electricity-cost estimates (~$0.005/1M tokens on CPU).

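A sketch of the callback plumbing; the `AgentEvent` fields and class bodies are illustrative, not the `observability.py` definitions:

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class AgentEvent:
    kind: str                      # e.g. "llm_call", "tool_call", "phi_score"
    payload: dict = field(default_factory=dict)

class AgentCallback(Protocol):
    def on_event(self, event: AgentEvent) -> None: ...

class CallbackManager:
    """Fan-out dispatcher: one event, every registered sink."""

    def __init__(self) -> None:
        self._callbacks: list[AgentCallback] = []

    def add(self, cb: AgentCallback) -> None:
        self._callbacks.append(cb)

    def emit(self, event: AgentEvent) -> None:
        for cb in self._callbacks:
            try:
                cb.on_event(event)
            except Exception:
                pass  # a failing sink must not kill the agent loop

class LoggingCallback:
    def on_event(self, event: AgentEvent) -> None:
        print(f"[{event.kind}] {event.payload}")

manager = CallbackManager()
manager.add(LoggingCallback())
manager.emit(AgentEvent("llm_call", {"model": "qwen3:1.7b", "tokens": 412}))
```
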
---

## feat: Multi-Agent with Shared Self-Improvement

**Date:** 2025-04-28 | **Module:** `multi_agent.py`

### Research Applied

| Paper | Contribution |
|-------|--------------|
| MUSE (2510.08002) | Independent Reflect Agent → our critic_model is separate from the agent models |
| AgentFly (2508.16153) | Case bank with soft Q-learning for retrieval utility → our shared_replay with Q-value ranking |
| DynaSaur (2411.01747) | Dynamic action accumulation into a vector-indexed library → ToolRegistry with semantic retrieval |

### Key Innovation: Shared Experience Replay

No other multi-agent framework does this. When Agent A completes a task:
1. The trajectory goes to the shared ExperienceReplay
2. The Optimizer distills heuristics from it
3. When Agent B starts a task, it retrieves relevant heuristics from the shared pool
4. Agent B benefits from Agent A's experience without any retraining

This is the MemRL (2601.03192) M-MDP formulation applied to multi-agent: the retrieval policy Q(s,m) operates over a shared memory bank M (sketched below).

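A sketch of the shared pool; the `SharedExperienceReplay` API is hypothetical, and only the Q-value ranking phase of retrieval is shown:

```python
class SharedExperienceReplay:
    """Cross-agent heuristic sharing (hypothetical API, not multi_agent.py verbatim)."""

    def __init__(self) -> None:
        self._heuristics: list[tuple[str, float]] = []  # (text, q_value)

    def add(self, heuristic: str, q_value: float = 0.0) -> None:
        self._heuristics.append((heuristic, q_value))

    def retrieve(self, k: int = 3) -> list[str]:
        # Phase 2 of MemRL-style retrieval: rank by learned Q-value.
        # (Phase 1, semantic recall, is omitted here for brevity.)
        ranked = sorted(self._heuristics, key=lambda h: h[1], reverse=True)
        return [text for text, _ in ranked[:k]]

shared = SharedExperienceReplay()
# Agent A finishes a task; the Optimizer distills a heuristic into the shared pool.
shared.add("Check the cache before re-fetching a URL", q_value=0.8)
# Agent B starts a different task and immediately benefits, with no retraining.
print(shared.retrieve(k=1))
```
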
### Task Delegation

Two-phase: keyword matching (zero cost, instant) → LLM routing (1 API call, accurate). It degrades gracefully: if the LLM is unavailable, keyword matching still works.

---

## feat: Human-in-the-Loop with Φ Score Overrides

**Date:** 2025-04-28 | **Module:** `hitl.py`

### Competitive Analysis

| Framework | HITL Approach |
|-----------|---------------|
| LangGraph | **Best**: full state checkpointing, interrupt nodes, time-travel debugging |
| CrewAI | Basic approval callbacks |
| AutoGen | Chat-based human interaction |
| **Purpose Agent** | Checkpoint/resume + **Φ override** (unique: humans teach the critic) |

### Key Innovation: Φ Score Override → Permanent Learning

When a human overrides a Φ score:
1. The corrected score is recorded in the TrajectoryStep
2. The trajectory (with the human-corrected scores) goes into Experience Replay
3. The Optimizer distills heuristics from it, now informed by human judgment
4. Future tasks use these human-informed heuristics

This is effectively RLHF without fine-tuning: the human preference signal flows through the memory system instead of through gradient updates (see the sketch below). No other framework has this.

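A sketch of how an override becomes permanent; the `TrajectoryStep` fields here are illustrative stand-ins for the real type:

```python
from dataclasses import dataclass

@dataclass
class TrajectoryStep:
    """Illustrative stand-in for the TrajectoryStep type in types.py."""
    action: str
    phi_score: float
    human_override: float | None = None

    @property
    def effective_score(self) -> float:
        # Human judgment always wins; this corrected value is what the
        # Optimizer later distills heuristics from.
        return self.human_override if self.human_override is not None else self.phi_score

step = TrajectoryStep(action="submitted form without validating email", phi_score=0.9)
step.human_override = 0.2   # reviewer: the critic over-scored this step
assert step.effective_score == 0.2  # this value flows into Experience Replay
```
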
### Checkpoint Design

A serializable state snapshot (JSON) at each step (sketched below) enables:
- Resume from any point after human review
- Time travel: load any checkpoint and re-run from there
- Offline review: save checkpoints, review later, resume

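A minimal sketch of the snapshot round-trip, with illustrative field names:

```python
import json
from pathlib import Path

def save_checkpoint(path: Path, step_index: int, state: dict) -> None:
    """Write one JSON snapshot per step (sketch; field names are illustrative)."""
    snapshot = {"step": step_index, "state": state}
    path.write_text(json.dumps(snapshot, indent=2))

def load_checkpoint(path: Path) -> dict:
    return json.loads(path.read_text())

# Pause at step 7, review offline, then resume (or time-travel back to step 3).
save_checkpoint(Path("ckpt_007.json"), 7, {"history": ["..."], "phi": 0.6})
resumed = load_checkpoint(Path("ckpt_007.json"))
```
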
---

## feat: Evaluation Harness – Improvement Curve Tracking

**Date:** 2025-04-28 | **Module:** `evaluation.py`

### Benchmarks Referenced

| Benchmark | Domain | Used By |
|-----------|--------|---------|
| GAIA | General assistant tasks | LATS, Reflexion |
| AlfWorld | Text-based game environments | Reflexion (97% success rate) |
| WebShop | E-commerce navigation | REMEMBERER (+4% over SOTA) |
| WebArena | Web navigation | CER (51% relative improvement) |
| TheAgentCompany | Corporate productivity | MUSE (51.78% SOTA) |
| SWE-bench | Code generation/repair | Multiple agent papers |
| HumanEval | Code generation | Reflexion (91% pass@1) |

### Design Decision

The improvement curve is the key differentiator chart:
```
Iteration   Success Rate
 1          40%   ← Cold start (no experience)
 5          70%   ← Learning from past tasks
10          90%   ← Mature agent with a full heuristic library
```

No other framework can produce this chart, because none of them learn from experience. `BenchmarkRunner.run()` + `BenchmarkResult.get_improvement_curve()` makes it a one-liner.

`compare_cold_vs_warm()` is the simplest proof: run once with empty memory, run again with learned memory. The delta IS the self-improvement signal (see the usage sketch below).

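A hypothetical usage sketch of the harness API named above; the import path, signatures, and numbers are assumptions, not the real `evaluation.py` surface:

```python
from evaluation import BenchmarkRunner, compare_cold_vs_warm  # assumed import path

agent = ...  # an Orchestrator instance (construction omitted)
tasks = ["book a meeting room", "file an expense report", "triage inbox"]

runner = BenchmarkRunner(agent, tasks)         # assumed constructor
result = runner.run()
print(result.get_improvement_curve())          # e.g. [(1, 0.4), (5, 0.7), (10, 0.9)]

cold, warm = compare_cold_vs_warm(agent, tasks)  # assumed return: (cold_rate, warm_rate)
print(f"self-improvement delta: {warm - cold:+.0%}")
```
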
---

## refactor: Plugin Registry & Modularity Fixes

**Date:** 2025-04-28 | **Module:** `registry.py`

### Issues Fixed

1. **Duplicated embedding logic**: `ExperienceReplay._compute_embedding` (dim=128) and `ToolRegistry._embed` (dim=64) were copy-pasted. Created `EmbeddingBackend` as a shared utility in the registry.
2. **Private methods used as public API**: `Orchestrator._post_task` and `_sync_memory` were called by `HITLOrchestrator`, `AsyncOrchestrator`, and `AgentTeam`. Made them public: `post_task()`, `sync_memory()`.
3. **Hardcoded SLM registry**: the `SLM_REGISTRY` dict was not extensible. Added `model_registry.register()` in the plugin system.
4. **No plugin system**: adding new backends/tools/callbacks required editing `__init__.py`. Created `PluginRegistry` (sketched below) with `backend_registry`, `callback_registry`, and `model_registry`; a new component is one `register()` call.

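A minimal sketch of what such a name-to-factory registry can look like; this is illustrative, not `registry.py` verbatim:

```python
class PluginRegistry:
    """Maps a string name to a component factory (illustrative sketch)."""

    def __init__(self, kind: str) -> None:
        self.kind = kind
        self._factories: dict[str, type] = {}

    def register(self, name: str, factory: type) -> None:
        if name in self._factories:
            raise ValueError(f"{self.kind} '{name}' already registered")
        self._factories[name] = factory

    def create(self, name: str, **kwargs):
        if name not in self._factories:
            raise KeyError(f"unknown {self.kind}: {name!r}")
        return self._factories[name](**kwargs)

backend_registry = PluginRegistry("backend")
callback_registry = PluginRegistry("callback")
model_registry = PluginRegistry("model")
```
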
### Extension Pattern

Adding a new component to Purpose Agent:
```python
# my_custom_backend.py
from purpose_agent import LLMBackend, backend_registry

class MyBackend(LLMBackend):
    def generate(self, messages, **kwargs):
        return "response"

backend_registry.register("my_backend", MyBackend)
# Done - now: backend_registry.create("my_backend")
```

No core files edited. No `__init__.py` changes. Drop the file in, import it, register.

---

## Competitive Framework Analysis

**Date:** 2025-04-28

### Why Developers Leave LangChain (sources: Medium, LinkedIn, Reddit, Analytics India Magazine)

1. **Over-abstraction**: Too many layers between user code and the LLM call. Simple tasks require understanding the Chain → LLMChain → PromptTemplate → OutputParser hierarchy.
2. **Massive dependency tree**: Pulls in dozens of packages; version conflicts are common.
3. **Frequent breaking changes**: The API surface changed significantly between v0.1 → v0.2 → v0.3.
4. **Debugging opacity**: Errors propagate through abstraction layers, making the root cause hard to find.
5. **Performance overhead**: Abstraction layers add latency to every LLM call.

### Purpose Agent's Response to Each Criticism

| LangChain Problem | Purpose Agent Approach |
|-------------------|------------------------|
| Over-abstraction | Flat module structure. Orchestrator → Actor → LLMBackend. 3 hops max. |
| Massive dependencies | stdlib only (core). External deps are optional, per backend. |
| Breaking changes | Stable `types.py` contract. All modules exchange the same 7 types. |
| Debugging opacity | Structured logging at every step. Observability callbacks. JSON event stream. |
| Performance overhead | Direct LLM calls. No chain/pipeline abstraction layer. |

---

## Future Research Directions

### Papers to Implement Next

| Paper | ArXiv | What It Would Add |
|-------|-------|-------------------|
| Meta-Rewarding | [2407.19594](https://arxiv.org/abs/2407.19594) | Self-improving critic via a meta-judge loop (DPO on judge preference pairs) |
| Self-Taught Evaluators | [2408.02666](https://arxiv.org/abs/2408.02666) | Synthetic training data so the Purpose Function improves without human labels |
| DSPy | [2310.03714](https://arxiv.org/abs/2310.03714) | Automatic prompt optimization for system prompts (Actor, Purpose Function) |
| LLMCompiler | [2312.04511](https://arxiv.org/abs/2312.04511) | Parallel function-calling plans → faster multi-tool execution |
| Retroformer | [2308.02151](https://arxiv.org/abs/2308.02151) | Policy gradient for a retrospective model → trainable reflection |