Rohan03 committed
Commit 7958a2f · verified · 1 Parent(s): 592f7ef

refactor: modularity fixes + plugin registry + compiled research

Files changed (1):
  1. COMPILED_RESEARCH.md +299 -0
COMPILED_RESEARCH.md ADDED
@@ -0,0 +1,299 @@
# COMPILED RESEARCH - Purpose Agent

> Living document. Every implementation decision traces back to a paper, benchmark, or empirical finding listed here. Updated with each feature addition.

---

## feat: Core Architecture - Self-Improving Agent Loop via Φ(s) State-Value Evaluation

**Date:** 2025-04-28 | **Modules:** `types.py`, `actor.py`, `purpose_function.py`, `experience_replay.py`, `optimizer.py`, `orchestrator.py`

### Papers Implemented

| Paper | ArXiv | Key Contribution | Where Used |
|-------|-------|------------------|------------|
| MUSE | [2510.08002](https://arxiv.org/abs/2510.08002) | 3-tier memory (strategic/procedural/tool), Plan-Execute-Reflect-Memorize loop, independent Reflect Agent | `actor.py` (memory tiers), `optimizer.py` (post-task distillation), `orchestrator.py` (reflect cycle) |
| LATS | [2310.04406](https://arxiv.org/abs/2310.04406) | LLM-as-value-function V(s) = λ·LM_score + (1-λ)·SC_score, scored AFTER environment feedback | `purpose_function.py` (Φ scoring, anti-inflation normalization) |
| REMEMBERER | [2306.07929](https://arxiv.org/abs/2306.07929) | Q-value experience replay with tabular Q-learning updates: Q(g,o,a) ← (1-α)Q + α[r + γ·max Q] | `experience_replay.py` (Q-value storage + MC update), `types.py` (`Heuristic.update_q_value`) |
| Reflexion | [2303.11366](https://arxiv.org/abs/2303.11366) | Verbal reinforcement via episodic memory, Actor/Evaluator/Self-Reflection triad | `orchestrator.py` (actor-critic separation), `actor.py` (ReAct format) |
| SPC | [2504.19162](https://arxiv.org/abs/2504.19162) | Adversarial self-play critic: Sneaky Generator vs. Step Critic | `purpose_function.py` (7 anti-reward-hacking rules, evidence requirement) |
| CER | [2506.06698](https://arxiv.org/abs/2506.06698) | Contextual experience distillation: Dynamics (url → summary) + Skills (abstract SOPs with {variables}) | `optimizer.py` (DISTILL_TRAJECTORY_PROMPT pattern, {variable} placeholders) |
| MemRL | [2601.03192](https://arxiv.org/abs/2601.03192) | Memory-Augmented MDP: decouple "which memory to retrieve" (learned Q) from "how to act given memory" (LLM) | `experience_replay.py` (two-phase retrieval: semantic recall → Q-value re-rank) |
| Voyager | [2305.16291](https://arxiv.org/abs/2305.16291) | Skill library as long-term memory, self-verification critic prompt | `optimizer.py` (heuristic library concept), `experience_replay.py` (persistent skill storage) |

### Key Design Decisions

**Why Φ(s) potential-based shaping instead of binary reward** (see the sketch after this list):
- LATS showed that an LLM-scored V(s) outperforms binary success/fail on HotPotQA, WebShop, and HumanEval
- Potential-based shaping (Φ(s_new) - Φ(s_current)) satisfies the necessary and sufficient condition for policy invariance under reward shaping (Ng et al., 1999)
- Enables learning from partial successes; binary reward discards all information from failed tasks
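A minimal sketch of the shaping computation under these assumptions; `phi` and `shaping_reward` are illustrative names, not the actual `purpose_function.py` API:

```python
def phi(lm_score: float, sc_score: float, lam: float = 0.5) -> float:
    """LATS-style state value: V(s) = lam * LM_score + (1 - lam) * SC_score."""
    return lam * lm_score + (1 - lam) * sc_score

def shaping_reward(phi_current: float, phi_new: float, gamma: float = 1.0) -> float:
    """Ng et al. (1999) shaping term F = gamma * Phi(s_new) - Phi(s_current).

    gamma = 1.0 gives the plain difference used above; any potential-based F
    leaves the optimal policy unchanged.
    """
    return gamma * phi_new - phi_current

# A step that improves the state yields signal even if the task later fails:
assert shaping_reward(phi(0.3, 0.4), phi(0.6, 0.7)) > 0
```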

**Why 3-tier memory instead of flat** (sketch below):
- MUSE achieved SOTA 51.78% on TheAgentCompany with 3-tier memory; the flat-memory baseline was 23.65%
- Strategic tier prevents context bloat (loaded once at task start, not per step)
- Procedural tier uses lazy loading (only the index goes in the prompt; the full SOP is fetched on demand), which is critical for SLM context limits
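A minimal sketch of the lazy-loading idea; `SOP` and `MemoryTiers` are illustrative names, not the actual `actor.py` classes:

```python
from dataclasses import dataclass, field

@dataclass
class SOP:
    name: str
    summary: str  # one line, shown in the procedural index
    body: str     # full procedure, fetched only on demand

@dataclass
class MemoryTiers:
    strategic: list[str] = field(default_factory=list)  # loaded once at task start
    procedural: dict[str, SOP] = field(default_factory=dict)
    tool_notes: dict[str, str] = field(default_factory=dict)

    def procedural_index(self) -> str:
        """Only this index enters the prompt; full SOP bodies stay out of context."""
        return "\n".join(f"- {s.name}: {s.summary}" for s in self.procedural.values())

    def load_sop(self, name: str) -> str:
        """Lazy load: the full SOP is retrieved only when the actor asks for it."""
        return self.procedural[name].body
```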

**Why separate critic LLM from actor:**
- MUSE's independent Reflect Agent removed self-confirmation bias
- SPC's adversarial approach showed that LLMs are sycophantic self-evaluators; separate prompts are essential

**Why 7 anti-reward-hacking rules** (sketch below):
- JSONSchemaBench (arxiv:2501.10868) showed SLMs produce invalid outputs 35-87% of the time without constraints
- SPC showed adversarial critics detect roughly 2x more reasoning errors than self-evaluation
- Evidence requirement, cache consistency, anomaly detection, and confidence thresholds are novel programmatic safeguards not found in the papers above; they close the gap between theoretical SPC and practical deployment
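A sketch of how such programmatic safeguards can gate a critic score; the rules and thresholds here are illustrative stand-ins for the seven rules in `purpose_function.py`:

```python
def apply_guardrails(score: float, evidence: list[str], confidence: float,
                     prior_scores: list[float]) -> float:
    """Gate an LLM critic score with programmatic checks (illustrative rules)."""
    if not evidence:                      # evidence requirement: no citation, no credit
        return 0.0
    if confidence < 0.5:                  # confidence threshold: discount shaky judgments
        score *= confidence
    if prior_scores and score > max(prior_scores) + 0.4:
        score = max(prior_scores) + 0.4   # anomaly detection: cap suspicious score jumps
    return max(0.0, min(1.0, score))      # keep Phi in [0, 1]
```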

---

## feat: SLM-Native Backends - Ollama, llama-cpp, Prompt Compression

**Date:** 2025-04-28 | **Modules:** `slm_backends.py`, `registry.py`

### Papers & Benchmarks

| Paper | ArXiv | Key Finding | Where Used |
|-------|-------|-------------|------------|
| TinyAgent | [2409.00608](https://arxiv.org/abs/2409.00608) | 1.1B model matches GPT-4-Turbo on a 16-function Mac agent task via synthetic SFT + Tool RAG (DeBERTa classifier, 34% prompt reduction) + INT4 quantization | `slm_backends.py` (prompt compression), `tools.py` (`ToolRegistry.get_relevant_tools` = Tool RAG) |
| JSONSchemaBench | [2501.10868](https://arxiv.org/abs/2501.10868) | Guidance: 96% compliance on simple schemas; Outlines: severe timeouts on complex ones; XGrammar: fastest (100x) but lower coverage; llama.cpp/Ollama: 74-97% | `slm_backends.py` (OllamaBackend uses grammar-constrained output via the `format=` parameter) |
| XGrammar | [2411.15100](https://arxiv.org/abs/2411.15100) | Grammar-constrained decoding engine, up to 100x speedup vs. naïve CFG, default in vLLM v0.6+ | Referenced for vLLM production deployment |
| LLMLingua-2 | [2403.12968](https://arxiv.org/abs/2403.12968) | Token classification (keep/drop) trained via GPT-4 distillation, 10x compression with minimal quality loss | `slm_backends.py` (SLMPromptCompressor design, extensibility note for llmlingua integration) |
| SLM Agent Survey | [2510.03847](https://arxiv.org/abs/2510.03847) | Guided decoding + strict JSON Schema + validator-first tool execution closes most of the SLM-vs-LLM capability gap at 10-100x lower cost | Architecture validation: grammar-constrained output is the correct default for SLMs |

### SLM Model Selection Rationale

| Model | Params | Context | Why Included |
|-------|--------|---------|--------------|
| Phi-4-mini | 3.8B | 16K | Top schema compliance on BFCL v3/v4 (Microsoft benchmark) |
| Qwen3-1.7B | 1.7B | 32K | Best balance: strong function calling, large context for agent traces |
| Qwen3-0.6B | 0.6B | 32K | Ultra-light proof point: can an agent work at 600M params? |
| Llama-3.2-3B | 3B | 128K | Largest context in class, Meta's open weights |
| Llama-3.2-1B | 1B | 128K | Smallest Llama; 128K context enables long agent traces |
| SmolLM2-1.7B | 1.7B | 8K | HF native, tests a tight context constraint |
| Gemma-3-1B | 1B | 32K | Google's multimodal-capable SLM |

### Key Design Decisions

**Why grammar-constrained output is mandatory for SLMs** (sketch below):
- JSONSchemaBench showed prompt-only JSON generation fails 35-87% of the time on even medium schemas for SLMs
- Ollama's grammar engine (via llama.cpp) forces valid output from ANY model regardless of training
- This is the fundamental enabler for SLM-native agents
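A sketch of grammar-constrained decoding via Ollama's `format` parameter, assuming a recent `ollama` Python client and a local Ollama server; the model tag and schema are examples:

```python
import json
import ollama  # pip install ollama; assumes a local Ollama server is running

# JSON Schema for a tool call. Ollama compiles it into a llama.cpp grammar,
# so even a small model cannot emit structurally invalid JSON.
TOOL_CALL_SCHEMA = {
    "type": "object",
    "properties": {
        "tool": {"type": "string"},
        "arguments": {"type": "object"},
    },
    "required": ["tool", "arguments"],
}

response = ollama.chat(
    model="qwen3:1.7b",  # example tag from the model table above
    messages=[{"role": "user", "content": "Look up the weather in Paris."}],
    format=TOOL_CALL_SCHEMA,  # constrained decoding: output must match the schema
)
tool_call = json.loads(response["message"]["content"])
```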

**Why prompt compression matters** (sketch below):
- SmolLM2 has an 8K context; the agent system prompt + tool descriptions + history can easily exceed 4K tokens
- TinyAgent showed a 34% prompt reduction via Tool RAG alone
- Our 3-stage compressor (whitespace → verbose phrases → middle truncation) is a no-dependency fallback; LLMLingua-2 is the production upgrade path
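A minimal sketch of the three-stage fallback; the phrase table and character budget are illustrative, not the actual `SLMPromptCompressor` values:

```python
import re

VERBOSE_PHRASES = {  # illustrative substitution table
    "in order to": "to",
    "due to the fact that": "because",
    "it is important to note that": "note:",
}

def compress(prompt: str, max_chars: int = 8000) -> str:
    # Stage 1: collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", prompt).strip()
    # Stage 2: rewrite verbose phrases.
    for verbose, short in VERBOSE_PHRASES.items():
        text = text.replace(verbose, short)
    # Stage 3: middle truncation. Keep the head (instructions) and the tail
    # (latest observations); the middle of a long trace carries the least signal.
    if len(text) > max_chars:
        half = max_chars // 2
        text = text[:half] + "\n...[truncated]...\n" + text[-half:]
    return text
```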

---

## feat: Streaming & Async Engine

**Date:** 2025-04-28 | **Module:** `streaming.py`

### Patterns from Framework Analysis

- **smolagents**: Agents are synchronous internally; `anyio.to_thread.run_sync` for async contexts (official pattern from the HF docs)
- **LangGraph**: `graph.astream_events(input, version="v2")` is genuinely async; the gold standard for streaming
- **CrewAI**: `kickoff_async()` is NOT truly async; it is a `loop.run_in_executor()` wrapper (documented caveat)

### Design Decision

Adopted the smolagents pattern: sync core + `asyncio.to_thread` wrappers (sketch below). Rationale:
1. Most LLM backends (Ollama, llama-cpp) are synchronous
2. Thread-based async avoids the complexity of native async for I/O-bound LLM calls
3. `AsyncOrchestrator.run_task_stream()` yields `StreamEvent` objects, matching LangGraph's event-streaming UX
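A sketch of the sync-core-plus-thread bridge; `StreamEvent` and `run_task_stream` follow the names above, while the queue plumbing and event fields are illustrative:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class StreamEvent:
    kind: str     # e.g. "step", "tool_call", "final"
    payload: str

def run_task_sync(task: str, emit) -> None:
    """Stand-in for the synchronous orchestrator core."""
    emit(StreamEvent("step", f"planning: {task}"))
    emit(StreamEvent("final", "done"))

async def run_task_stream(task: str):
    """Bridge the sync core into an async stream of StreamEvents."""
    queue: asyncio.Queue = asyncio.Queue()
    loop = asyncio.get_running_loop()

    def emit(event: StreamEvent) -> None:
        # Called from the worker thread; hop back onto the event loop.
        loop.call_soon_threadsafe(queue.put_nowait, event)

    def worker() -> None:
        run_task_sync(task, emit)
        loop.call_soon_threadsafe(queue.put_nowait, None)  # sentinel: end of stream

    pending = asyncio.create_task(asyncio.to_thread(worker))
    while (event := await queue.get()) is not None:
        yield event
    await pending

async def main() -> None:
    async for event in run_task_stream("demo"):
        print(event.kind, event.payload)

asyncio.run(main())
```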

---

## feat: Tool Framework with Tool RAG

**Date:** 2025-04-28 | **Module:** `tools.py`

### Research Applied

- **TinyAgent (arxiv:2409.00608)**: Tool RAG via a DeBERTa-v3-small multi-label classifier selects relevant tools (avg 3.97 of 6 total = 34% prompt reduction). We implement a lightweight trigram-embedding version (sketch below); the production path is a fine-tuned classifier.
- **smolagents CodeAgent pattern**: For SLMs, code-based actions (Python generation) are more reliable than JSON tool calls. Our `FunctionTool.from_function()` bridges both: tools have JSON schemas for structured-output-capable models, and `to_prompt(compact=True)` for an SLM-friendly text format.
- **OpenAI function calling schema**: All tools export `to_schema()` in OpenAI-compatible format for backends that support native tool_calls.
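A minimal sketch of the trigram-embedding retrieval; the hashing dimension and top-k are illustrative, not the `ToolRegistry` defaults:

```python
import math

def trigram_vector(text: str, dim: int = 64) -> list[float]:
    """Hash character trigrams into a fixed-size vector. Note: Python's str
    hash is per-process, so persisted vectors would need a stable hash."""
    vec = [0.0] * dim
    t = text.lower()
    for i in range(len(t) - 2):
        vec[hash(t[i:i + 3]) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def relevant_tools(query: str, tools: dict[str, str], k: int = 4) -> list[str]:
    """Rank tool descriptions by similarity to the query and keep the top-k.
    Only those k descriptions enter the prompt (TinyAgent's Tool RAG idea)."""
    q = trigram_vector(query)
    ranked = sorted(
        tools,
        key=lambda name: -sum(a * b for a, b in zip(q, trigram_vector(tools[name]))),
    )
    return ranked[:k]
```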

---

## feat: Observability - Cost Tracking & Callbacks

**Date:** 2025-04-28 | **Module:** `observability.py`

### Competitive Analysis

| Framework | Observability Approach |
|-----------|------------------------|
| LangChain/LangGraph | LangSmith (proprietary SaaS) + OpenTelemetry export |
| CrewAI | AgentOps integration (proprietary) |
| smolagents | Basic step logging |
| **Purpose Agent** | Pluggable callback system (no vendor lock-in) + built-in cost tracking |

### Design Decision

No vendor lock-in: an `AgentCallback` protocol plus a `CallbackManager` dispatcher (sketch below). Users plug in whatever they want:
- `LoggingCallback` → structured logs
- `JSONFileCallback` → JSONL event stream (ingestible by any analytics tool)
- `MetricsCollector` → in-memory aggregate metrics
- Custom: implement `on_event(AgentEvent)` → integrate with Arize, LangSmith, Weights & Biases, etc.
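A sketch of the callback protocol and dispatcher; the event fields and `PrintCallback` are illustrative, not the `observability.py` definitions:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class AgentEvent:
    name: str   # e.g. "step_end", "llm_call"
    data: dict

class AgentCallback(Protocol):
    def on_event(self, event: AgentEvent) -> None: ...

class CallbackManager:
    """Fan every event out to all registered callbacks; no vendor assumptions."""
    def __init__(self) -> None:
        self._callbacks: list[AgentCallback] = []

    def register(self, callback: AgentCallback) -> None:
        self._callbacks.append(callback)

    def dispatch(self, event: AgentEvent) -> None:
        for cb in self._callbacks:
            cb.on_event(event)

class PrintCallback:
    def on_event(self, event: AgentEvent) -> None:
        print(event.name, event.data)

manager = CallbackManager()
manager.register(PrintCallback())
manager.dispatch(AgentEvent("llm_call", {"tokens": 812}))
```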

Cost tracking uses per-model pricing tables. Local models get electricity-cost estimates (~$0.005/1M tokens on CPU).

---

## feat: Multi-Agent with Shared Self-Improvement

**Date:** 2025-04-28 | **Module:** `multi_agent.py`

### Research Applied

| Paper | Contribution |
|-------|--------------|
| MUSE (2510.08002) | Independent Reflect Agent → our critic_model is separate from the agent models |
| AgentFly (2508.16153) | Case bank with soft Q-learning for retrieval utility → our shared_replay with Q-value ranking |
| DynaSaur (2411.01747) | Dynamic action accumulation into a vector-indexed library → ToolRegistry with semantic retrieval |

### Key Innovation: Shared Experience Replay

No other multi-agent framework does this. When Agent A completes a task:
1. The trajectory goes to the shared ExperienceReplay
2. The Optimizer distills heuristics from it
3. When Agent B starts a task, it retrieves relevant heuristics from the shared pool
4. Agent B benefits from Agent A's experience without any retraining

This is the MemRL (2601.03192) M-MDP formulation applied to multi-agent: the retrieval policy Q(s,m) operates over a shared memory bank M (sketch below).
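A minimal sketch of the two-phase retrieval over a shared bank; the record shape and dot-product similarity are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Heuristic:
    text: str
    embedding: list[float]
    q_value: float  # learned retrieval utility, updated after each task

def retrieve(query_emb: list[float], bank: list[Heuristic],
             recall_k: int = 20, final_k: int = 5) -> list[Heuristic]:
    """Phase 1: semantic recall. Phase 2: re-rank the candidates by Q-value."""
    def sim(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    candidates = sorted(bank, key=lambda h: -sim(query_emb, h.embedding))[:recall_k]
    return sorted(candidates, key=lambda h: -h.q_value)[:final_k]

# Agents A and B share the same `bank`, so B re-ranks over experience
# distilled from A's trajectories without any retraining.
```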

### Task Delegation

Two-phase: keyword matching (zero cost, instant) → LLM routing (1 API call, accurate). Falls back gracefully: if the LLM is unavailable, keyword matching still works (sketch below).
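A minimal sketch of the routing logic; the agent keyword table and the LLM call signature are illustrative:

```python
def delegate(task: str, agents: dict[str, list[str]], llm=None) -> str:
    """Phase 1: keyword match (free, instant). Phase 2: LLM routing if ambiguous."""
    scores = {
        name: sum(kw in task.lower() for kw in keywords)
        for name, keywords in agents.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] > 0 or llm is None:  # keyword hit, or no LLM available
        return best
    try:
        # One API call; expected to return an agent name from the list.
        return llm(f"Route this task to one of {list(agents)}: {task}").strip()
    except Exception:
        return best  # graceful fallback to the keyword guess

agents = {"researcher": ["search", "paper"], "coder": ["bug", "refactor"]}
print(delegate("fix the bug in retrieval", agents))  # -> "coder"
```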

---

## feat: Human-in-the-Loop with Φ Score Overrides

**Date:** 2025-04-28 | **Module:** `hitl.py`

### Competitive Analysis

| Framework | HITL Approach |
|-----------|---------------|
| LangGraph | **Best**: full state checkpointing, interrupt nodes, time-travel debug |
| CrewAI | Basic approval callbacks |
| AutoGen | Chat-based human interaction |
| **Purpose Agent** | Checkpoint/resume + **Φ override** (unique: humans teach the critic) |

### Key Innovation: Φ Score Override → Permanent Learning

When a human overrides a Φ score:
1. The corrected score is recorded in the TrajectoryStep
2. The trajectory (with human-corrected scores) goes into Experience Replay
3. The Optimizer distills heuristics from it, now informed by human judgment
4. Future tasks use these human-informed heuristics

This is effectively RLHF without fine-tuning: the human preference signal flows through the memory system instead of through gradient updates (sketch below). No other framework has this.
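A minimal sketch of the override flow; the field names are illustrative, not the actual TrajectoryStep definition:

```python
from dataclasses import dataclass

@dataclass
class TrajectoryStep:
    action: str
    phi_score: float
    phi_overridden: bool = False

def apply_override(step: TrajectoryStep, human_score: float) -> TrajectoryStep:
    """Record the human-corrected Phi score; downstream distillation sees it."""
    step.phi_score = human_score
    step.phi_overridden = True
    return step

# The corrected step is stored in Experience Replay as-is, so the optimizer
# distills heuristics from human judgment rather than the critic's raw score.
```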

### Checkpoint Design

A serializable state snapshot (JSON) at each step (sketch below). Enables:
- Resume from any point after human review
- Time travel: load any checkpoint and re-run from there
- Offline review: save checkpoints, review later, resume
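A minimal sketch of the snapshot round-trip; the state fields are illustrative:

```python
import json
from pathlib import Path

def save_checkpoint(path: str, step: int, state: dict) -> None:
    """Serialize the agent state after each step; JSON keeps it portable."""
    Path(path).write_text(json.dumps({"step": step, "state": state}, indent=2))

def load_checkpoint(path: str) -> tuple[int, dict]:
    """Resume (or time-travel) from any saved step."""
    data = json.loads(Path(path).read_text())
    return data["step"], data["state"]

save_checkpoint("ckpt_step3.json", 3, {"task": "audit report", "memory": []})
step, state = load_checkpoint("ckpt_step3.json")  # re-run from step 3
```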

---

## feat: Evaluation Harness - Improvement Curve Tracking

**Date:** 2025-04-28 | **Module:** `evaluation.py`

### Benchmarks Referenced

| Benchmark | Domain | Used By |
|-----------|--------|---------|
| GAIA | General assistant tasks | LATS, Reflexion |
| AlfWorld | Text-based game environments | Reflexion (130/134 tasks) |
| WebShop | E-commerce navigation | REMEMBERER (+4% over SOTA) |
| WebArena | Web navigation | CER (51% relative improvement) |
| TheAgentCompany | Corporate productivity | MUSE (51.78% SOTA) |
| SWE-bench | Code generation/repair | Multiple agent papers |
| HumanEval | Code generation | Reflexion (91% pass@1) |

### Design Decision

The improvement curve is the key differentiator chart:
```
Iteration    Success Rate
1            40%   ← Cold start (no experience)
5            70%   ← Learning from past tasks
10           90%   ← Mature agent with full heuristic library
```

No other framework can produce this chart because none of them learn from experience. `BenchmarkRunner.run()` + `BenchmarkResult.get_improvement_curve()` makes this a one-liner (see the sketch below).

`compare_cold_vs_warm()` is the simplest proof: run once with empty memory, run again with learned memory. The delta IS the self-improvement signal.
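A minimal, runnable sketch of what such a curve computation can look like; the actual `BenchmarkResult.get_improvement_curve()` signature may differ:

```python
def improvement_curve(results: list[bool], window: int = 5) -> list[tuple[int, float]]:
    """Rolling success rate per iteration; produces the chart sketched above."""
    curve = []
    for i in range(len(results)):
        recent = results[max(0, i - window + 1): i + 1]
        curve.append((i + 1, sum(recent) / len(recent)))
    return curve

# Success/failure per task attempt, oldest first; the rate climbs as memory accrues.
print(improvement_curve([False, True, True, True, True]))
```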

---

## refactor: Plugin Registry & Modularity Fixes

**Date:** 2025-04-28 | **Module:** `registry.py`

### Issues Fixed

1. **Duplicated embedding logic**: `ExperienceReplay._compute_embedding` (dim=128) and `ToolRegistry._embed` (dim=64) were copy-pasted. Created `EmbeddingBackend` as a shared utility in the registry.
2. **Private methods used as public API**: `Orchestrator._post_task` and `_sync_memory` were called by `HITLOrchestrator`, `AsyncOrchestrator`, and `AgentTeam`. Made public: `post_task()`, `sync_memory()`.
3. **Hardcoded SLM registry**: the `SLM_REGISTRY` dict was not extensible. Added `model_registry.register()` in the plugin system.
4. **No plugin system**: adding new backends/tools/callbacks required editing `__init__.py`. Created `PluginRegistry` with `backend_registry`, `callback_registry`, and `model_registry`; a new component is one `register()` call.

### Extension Pattern

Adding a new component to Purpose Agent:
```python
# my_custom_backend.py
from purpose_agent import LLMBackend, backend_registry

class MyBackend(LLMBackend):
    def generate(self, messages, **kwargs):
        return "response"

backend_registry.register("my_backend", MyBackend)
# Done; now: backend_registry.create("my_backend")
```

No core files edited. No `__init__.py` changes. Drop the file, import it, register.

---

## Competitive Framework Analysis

**Date:** 2025-04-28

### Why Developers Leave LangChain (sources: Medium, LinkedIn, Reddit, Analytics India Magazine)

1. **Over-abstraction**: Too many layers between user code and the LLM call. Simple tasks require understanding the Chain → LLMChain → PromptTemplate → OutputParser hierarchy.
2. **Massive dependency tree**: Pulls in dozens of packages; version conflicts are common.
3. **Frequent breaking changes**: The API surface changed significantly between v0.1 → v0.2 → v0.3.
4. **Debugging opacity**: Errors propagate through abstraction layers, making the root cause hard to find.
5. **Performance overhead**: Abstraction layers add latency to every LLM call.

### Purpose Agent's Response to Each Criticism

| LangChain Problem | Purpose Agent Approach |
|-------------------|------------------------|
| Over-abstraction | Flat module structure. Orchestrator → Actor → LLMBackend. 3 hops max. |
| Massive dependencies | stdlib only (core). External deps are optional, per backend. |
| Breaking changes | Stable `types.py` contract. All modules exchange the same 7 types. |
| Debugging opacity | Structured logging at every step. Observability callbacks. JSON event stream. |
| Performance overhead | Direct LLM calls. No chain/pipeline abstraction layer. |

---

## Future Research Directions

### Papers to Implement Next

| Paper | ArXiv | What It Would Add |
|-------|-------|-------------------|
| Meta-Rewarding | [2407.19594](https://arxiv.org/abs/2407.19594) | Self-improving critic via a meta-judge loop (DPO on judge preference pairs) |
| Self-Taught Evaluators | [2408.02666](https://arxiv.org/abs/2408.02666) | Synthetic training data for the Purpose Function to improve without human labels |
| DSPy | [2310.03714](https://arxiv.org/abs/2310.03714) | Automatic prompt optimization for system prompts (Actor, Purpose Function) |
| LLMCompiler | [2312.04511](https://arxiv.org/abs/2312.04511) | Parallel function-calling plans → faster multi-tool execution |
| Retroformer | [2308.02151](https://arxiv.org/abs/2308.02151) | Policy gradient for a retrospective model → trainable reflection |