v0.2.0: Complete README with SLM-native, multi-agent, HITL, eval, streaming, tools, observability
README.md CHANGED

tags:
- react
- orchestration
- modular
- slm
- small-language-models
- multi-agent
- human-in-the-loop
- streaming
- tools
- evaluation
- ollama
- local-models
pipeline_tag: text-generation
---

# Purpose Agent v0.2.0

**The world's first SLM-native self-improving agentic framework.**

Works with both **Small Language Models** (0.6B–3B params, local, $0 cost) and **Large Language Models** (cloud APIs) with equal efficiency. Agents learn from experience via a Purpose Function Φ(s), so no fine-tuning is needed.

## What Makes This Different

| Feature | Purpose Agent | LangChain | LangGraph | CrewAI | AutoGen | smolagents |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| **Self-Improvement** | ✅ Φ(s) + experience replay + heuristic distillation | ❌ | ❌ | ❌ | ❌ | ❌ |
| **SLM-Native** | ✅ Grammar-constrained JSON, prompt compression, Tool RAG | ❌ | ❌ | ❌ | ❌ | ⚠️ |
| **Anti-Reward-Hacking** | ✅ 7 strict rules + cache consistency + anomaly detection | ❌ | ❌ | ❌ | ❌ | ❌ |
| **3-Tier Memory** | ✅ Strategic/Procedural/Tool with Q-value retrieval | ❌ | ⚠️ | ⚠️ | ❌ | ❌ |
| **Multi-Agent with Shared Learning** | ✅ Agents learn from each other | ❌ | ⚠️ | ✅ | ✅ | ⚠️ |
| **Human Φ Override** | ✅ Humans teach the critic → permanent learning | ❌ | ⚠️ | ❌ | ❌ | ❌ |
| **Streaming** | ✅ Event + token streaming | ✅ | ✅ | ⚠️ | ⚠️ | ✅ |
| **Tool Framework** | ✅ Schema, validation, retry, Tool RAG | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Cost Tracking** | ✅ Per-call token + USD tracking | ⚠️ | ⚠️ | ❌ | ❌ | ❌ |
| **Benchmark Harness** | ✅ Improvement curve tracking | ❌ | ❌ | ❌ | ❌ | ❌ |
| **Lightweight** | ✅ ~150KB, stdlib only | ❌ | ❌ | ⚠️ | ⚠️ | ✅ |
| **Literature-Grounded** | ✅ 8 papers implemented | ❌ | ❌ | ❌ | ❌ | ❌ |

## Architecture

```
purpose_agent/
├── types.py              # Core data types
├── llm_backend.py        # Cloud LLM backends (HF, OpenAI, Mock)
├── slm_backends.py       # 🆕 SLM backends (Ollama, llama-cpp, prompt compression)
├── actor.py              # ReAct agent with 3-tier memory
├── purpose_function.py   # Non-hackable Φ(s) critic
├── experience_replay.py  # Two-phase retrieval (similarity + Q-value)
├── optimizer.py          # Trajectory → heuristic distillation
├── orchestrator.py       # Main loop
├── streaming.py          # 🆕 Async engine + event streaming
├── tools.py              # 🆕 Tool framework + built-in tools + Tool RAG
├── observability.py      # 🆕 Cost tracking, callbacks, metrics
├── multi_agent.py        # 🆕 Agent teams with shared learning
├── hitl.py               # 🆕 Human-in-the-loop + checkpointing
└── evaluation.py         # 🆕 Benchmark runner + improvement curves
```
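
Conceptually, these modules compose into a single loop. The sketch below is a deliberately simplified illustration of the control flow that `orchestrator.py` drives; every name in it is a stand-in for exposition, not the module's actual API:

```python
# Illustrative sketch only; all names are stand-ins, not the real API.
def run_task_sketch(purpose, actor, critic, buffer, optimizer, env, max_steps=10):
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        # Actor (ReAct + 3-tier memory) proposes an action, informed by
        # experiences recalled from the replay buffer.
        action = actor.decide(purpose, state, buffer.retrieve(state))
        next_state = env.execute(action, state)
        # The strict critic scores the transition: did Φ(s) improve?
        phi_delta = critic.score(purpose, state, next_state)
        trajectory.append((state, action, phi_delta))
        state = next_state
    buffer.store(trajectory)    # experience replay: keep what happened
    optimizer.distill(buffer)   # winning trajectories become reusable heuristics
    return state
```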

## Quick Start: Local SLM (Zero Cost)

```bash
# 1. Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# 2. Pull a small model (1.7B params, runs on any laptop)
ollama pull qwen3:1.7b

# 3. Run your agent
python my_agent.py
```

```python
from purpose_agent import (
    Orchestrator, OllamaBackend, State, Environment, Action,
    CalculatorTool, ToolRegistry,
)

# SLM backend: runs locally, zero cost
llm = OllamaBackend(model="qwen3:1.7b")  # 1.7B params

# Or use a cloud LLM
# from purpose_agent import HFInferenceBackend
# llm = HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras")

class MyEnv(Environment):
    def execute(self, action, state):
        return State(data={"result": "done"})

orch = Orchestrator(llm=llm, environment=MyEnv())
result = orch.run_task(purpose="Solve the problem", max_steps=10)
print(result.summary())
```

## SLM Model Registry

Pre-configured models optimized for agent tasks:

```python
from purpose_agent import create_slm_backend

backend = create_slm_backend("phi-4-mini")    # 3.8B – best tool-use accuracy
backend = create_slm_backend("qwen3-1.7b")    # 1.7B – best balance
backend = create_slm_backend("qwen3-0.6b")    # 0.6B – ultra-light
backend = create_slm_backend("llama-3.2-1b")  # 1B – 128K context
backend = create_slm_backend("smollm2-1.7b")  # 1.7B – HF native
```
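
The feature table above lists grammar-constrained JSON as part of being SLM-native: small models drift out of valid JSON far more often than large ones, so decoding is constrained to it. As a rough illustration of the underlying mechanism, here is the raw trick using the `ollama` client's `format` parameter; `slm_backends.py` presumably wraps something more involved, and the prompt below is purely illustrative:

```python
import json
import ollama  # pip install ollama

# Constrain decoding to valid JSON so a small model can't drift mid-structure.
response = ollama.chat(
    model="qwen3:1.7b",
    messages=[{
        "role": "user",
        "content": 'Reply as JSON {"action": "...", "args": {}}. Task: compute 2 + 2.',
    }],
    format="json",
)
call = json.loads(response["message"]["content"])  # guaranteed to parse
print(call)
```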

## Multi-Agent with Shared Learning

Agents learn from each other: when one agent solves a problem, all benefit.

```python
from purpose_agent import AgentSpec, AgentTeam, OllamaBackend

researcher = AgentSpec(
    name="researcher", role="Find information",
    model=OllamaBackend(model="qwen3:1.7b"),  # Cheap SLM
    expertise_keywords=["search", "find", "research"],
)
coder = AgentSpec(
    name="coder", role="Write and debug code",
    model=OllamaBackend(model="phi4-mini"),  # Better SLM for code
    expertise_keywords=["code", "program", "debug"],
)

team = AgentTeam(
    agents=[researcher, coder],
    default_model=OllamaBackend(model="qwen3:1.7b"),
    environment=my_env,
)

# Auto-delegates to the best agent
result = team.run_task(purpose="Search for Python sorting algorithms")
print(team.get_learning_report())  # See shared knowledge
```
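
For intuition on the auto-delegation step, here is a minimal sketch of keyword-overlap routing. This is an assumption suggested by the `expertise_keywords` field, not the actual `multi_agent.py` implementation:

```python
def pick_agent(purpose, specs, default_spec):
    """Hypothetical router: pick the spec whose expertise_keywords overlap the task most."""
    words = set(purpose.lower().split())

    def overlap(spec):
        return len(words & set(spec.expertise_keywords))

    best = max(specs, key=overlap)
    return best if overlap(best) > 0 else default_spec
```

With the specs above, "Search for Python sorting algorithms" overlaps only the researcher's "search" keyword, so the task would route there.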

## Human-in-the-Loop

Humans can override Φ scores, and the agent permanently learns those preferences:

```python
from purpose_agent import HITLOrchestrator, CLIInputHandler

hitl = HITLOrchestrator(
    orchestrator=orch,
    input_handler=CLIInputHandler(),
    approve_actions=True,   # Approve each action
    review_scores=True,     # Override Φ scores
    checkpoint_dir="./checkpoints",
)
result = hitl.run_task(purpose="Important task")

# Inject knowledge directly
hitl.inject_heuristic(
    pattern="When facing {problem_type}",
    strategy="Always try the simplest approach first",
)
```

## Streaming

Real-time event streaming for UIs:

```python
import asyncio
from purpose_agent import AsyncOrchestrator

async def main():
    async_orch = AsyncOrchestrator(orch)
    async for event in async_orch.run_task_stream(purpose="..."):
        if event.event_type == "action":
            print(f"🤖 {event.data['name']}: {event.data['thought'][:100]}")
        elif event.event_type == "score":
            print(f"📊 Φ: {event.data['phi_before']:.1f} → {event.data['phi_after']:.1f}")

asyncio.run(main())
```

## Tool Framework

```python
import requests  # needed by the example search tool below

from purpose_agent import FunctionTool, ToolRegistry, CalculatorTool, PythonExecTool

# Create a tool from any function
@FunctionTool.from_function
def search(query: str) -> str:
    """Search the web for information."""
    return requests.get(f"https://api.search.com?q={query}").text

# Tool RAG for SLMs (only load relevant tools into the prompt)
registry = ToolRegistry()
registry.register(CalculatorTool())
registry.register(PythonExecTool())
registry.register(search)

relevant = registry.get_relevant_tools("compute 2+2", top_k=2)
# → [CalculatorTool, PythonExecTool] (search excluded – saves tokens)
```

## Cost Tracking

```python
from purpose_agent import CostTracker

tracker = CostTracker(model_name="qwen3:1.7b", cost_per_1m_input=0.005)
tracker.record(prompt_tokens=500, completion_tokens=200)
print(tracker.summary())
# → {'model': 'qwen3:1.7b', 'total_tokens': 700, 'estimated_cost_usd': 0.000005}
```

## Benchmark & Prove Self-Improvement

```python
from purpose_agent import BenchmarkRunner, BenchmarkTask

runner = BenchmarkRunner(orchestrator=orch)
tasks = [
    BenchmarkTask(id="t1", purpose="Find treasure", initial_state=...),
    BenchmarkTask(id="t2", purpose="Solve puzzle", initial_state=...),
]

result = runner.run(tasks, iterations=10, name="MazeTest")
print(result.summary())
# Iteration   Success Rate   Avg Φ   Avg Steps   Avg Reward
# ----------------------------------------------------------
# 1           40.0%          4.20    12.0        3.20
# 5           70.0%          6.80    8.0         6.50
# 10          90.0%          8.50    6.0         8.90
# Improvement: 40.0% → 90.0% (+50.0%)

result.save("results/benchmark.json")
```

## Literature Foundation

| Paper | What it contributes |
|-------|---------------------|
| [MUSE](https://arxiv.org/abs/2510.08002) | 3-tier memory (strategic/procedural/tool) |
| [LATS](https://arxiv.org/abs/2310.04406) | LLM-as-value-function V(s) |
| [REMEMBERER](https://arxiv.org/abs/2306.07929) | Q-value experience replay |
| [Reflexion](https://arxiv.org/abs/2303.11366) | Verbal reinforcement |
| [SPC](https://arxiv.org/abs/2504.19162) | Anti-reward-hacking |
| [CER](https://arxiv.org/abs/2506.06698) | Contextual experience distillation |
| [MemRL](https://arxiv.org/abs/2601.03192) | Two-phase retrieval |
| [TinyAgent](https://arxiv.org/abs/2409.00608) | SLM-native agent patterns |
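
Several of these feed directly into `experience_replay.py`'s two-phase retrieval: phase 1 recalls experiences by semantic similarity to the current state, phase 2 re-ranks the recalled set by learned Q-value. Below is a self-contained sketch of that pattern, using a hypothetical experience record rather than the module's actual types:

```python
from dataclasses import dataclass

@dataclass
class Experience:           # hypothetical record, not the real type
    embedding: list[float]  # embedding of the state it was recorded in
    q_value: float          # learned estimate of how well it paid off
    trajectory: object

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def retrieve(experiences, query_embedding, recall_k=20, top_k=4):
    # Phase 1: semantic recall – nearest neighbours of the current state.
    recalled = sorted(experiences,
                      key=lambda e: cosine(e.embedding, query_embedding),
                      reverse=True)[:recall_k]
    # Phase 2: Q-value re-rank – prefer experiences that actually worked.
    return sorted(recalled, key=lambda e: e.q_value, reverse=True)[:top_k]
```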

## Installation

```bash
# Core (no dependencies beyond stdlib)
git clone https://huggingface.co/Rohan03/purpose-agent
cd purpose-agent

# For local SLMs
pip install ollama

# For cloud LLMs
pip install huggingface_hub  # or: pip install openai

# Run demo (no API keys needed)
python demo.py
```

## License