# Purpose Agent: Self-Improving Agentic Framework via State-Value Evaluation

A lightweight, modular framework where an LLM agent improves across tasks **without weight updates**, using an RL-inspired self-reflection loop with a "Purpose Function" that evaluates intermediate state improvements.

## Core Philosophy

The agent improves via a **Purpose Function Φ(s)** that measures distance-to-goal at every step. It rewards the agent **only if Φ(s_new) > Φ(s_current)**. High-reward trajectories are distilled into reusable heuristics stored in a 3-tier memory system, so the agent gets smarter on each subsequent task.

**No real-time backprop. No PPO/DPO. Minimal infrastructure costs.**
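
Concretely, the gating rule is just a comparison of critic scores before and after each action. A minimal sketch, assuming a `purpose_fn.score(state)` method (the real interface lives in `purpose_function.py` and may differ):

```python
def step_delta(purpose_fn, s_current, s_new) -> float:
    """Δ = Φ(s_new) - Φ(s_current) on the critic's 0.0-10.0 scale."""
    phi_current = purpose_fn.score(s_current)  # Φ(s_current)
    phi_new = purpose_fn.score(s_new)          # Φ(s_new)
    # Positive Δ: progress (rewarded). Zero: lateral move. Negative: regression.
    return phi_new - phi_current
```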

## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                      ORCHESTRATOR LOOP                       │
│                                                              │
│  ┌──────────┐  action   ┌─────────────┐   s_new              │
│  │  ACTOR   │ ────────► │ ENVIRONMENT │ ──────────┐          │
│  │(+memory) │           │ (your code) │           │          │
│  └────▲─────┘           └─────────────┘           │          │
│       │                                           ▼          │
│       │ heuristics   ┌────────────────┐      (s, a, s')      │
│       │◄─────────────│   OPTIMIZER    │◄──────────┤          │
│       │              │ (distillation) │           │          │
│       │              └────────────────┘           │          │
│       │              ┌────────────────┐  Φ(s)→Φ(s')          │
│       │              │   PURPOSE FN   │───────────┤          │
│       │              │ (state critic) │           │          │
│       │              └────────────────┘           │          │
│       │              ┌────────────────┐           │          │
│       └──────────────│   EXPERIENCE   │◄──────────┘          │
│                      │ REPLAY BUFFER  │                      │
│                      └────────────────┘                      │
└──────────────────────────────────────────────────────────────┘
```
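
Read as control flow, the diagram is roughly the loop below. This is an illustrative sketch, not `orchestrator.py` verbatim; method names such as `choose`, `add`, and `distill` are assumptions:

```python
def run_loop(actor, environment, purpose_fn, replay_buffer, optimizer,
             initial_state, max_steps: int):
    """One task pass through the orchestrator loop (sketch only)."""
    state = initial_state
    heuristics = optimizer.current_heuristics()
    for _ in range(max_steps):
        action = actor.choose(state, heuristics)            # memory-augmented ReAct step
        new_state = environment.execute(action, state)      # your environment code
        delta = purpose_fn.score(new_state) - purpose_fn.score(state)
        replay_buffer.add(state, action, new_state, delta)  # store (s, a, s') with its Δ
        state = new_state
    # After the task: distill high-Δ trajectories into reusable heuristics.
    return optimizer.distill(replay_buffer)
```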

## Modules

| Module | File | Role |
|--------|------|------|
| **Actor** | `actor.py` | ReAct-style agent with 3-tier memory-augmented prompts |
| **Purpose Function** | `purpose_function.py` | Strict LLM critic, hardened against reward hacking, that scores Φ(s) transitions |
| **Experience Replay** | `experience_replay.py` | Trajectory storage with two-phase retrieval (similarity + Q-value; sketched below) |
| **Optimizer** | `optimizer.py` | Distills winning trajectories into reusable heuristics |
| **Orchestrator** | `orchestrator.py` | Main loop tying everything together |
| **LLM Backend** | `llm_backend.py` | Swappable inference layer (HF, OpenAI, Ollama, custom) |
| **Types** | `types.py` | Shared data structures (State, Action, Trajectory, Heuristic, etc.) |
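
The two-phase retrieval in `experience_replay.py` can be sketched as follows; the `embedding` and `q_value` attributes and the function names here are illustrative assumptions, not the module's exact API:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, buffer, k_semantic: int = 20, k_final: int = 5):
    """Phase 1: semantic recall by similarity. Phase 2: re-rank by Q-value."""
    recalled = sorted(buffer, key=lambda t: cosine(query_vec, t.embedding), reverse=True)
    candidates = recalled[:k_semantic]
    return sorted(candidates, key=lambda t: t.q_value, reverse=True)[:k_final]
```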

## Literature Foundation

| Paper | Contribution to this framework |
|-------|--------------------------------|
| [MUSE](https://arxiv.org/abs/2510.08002) | 3-tier memory hierarchy (strategic/procedural/tool) |
| [LATS](https://arxiv.org/abs/2310.04406) | LLM-as-value-function V(s) pattern |
| [REMEMBERER](https://arxiv.org/abs/2306.07929) | Q-value experience replay with Bellman updates |
| [Reflexion](https://arxiv.org/abs/2303.11366) | Verbal reinforcement via episodic self-reflection |
| [SPC](https://arxiv.org/abs/2504.19162) | Anti-reward-hacking via adversarial critic patterns |
| [CER](https://arxiv.org/abs/2506.06698) | Contextual experience distillation (Dynamics + Skills) |
| [MemRL](https://arxiv.org/abs/2601.03192) | Two-phase retrieval (semantic recall → Q-value re-rank) |
| [Voyager](https://arxiv.org/abs/2305.16291) | Skill library as long-term memory |

## Quick Start

```python
from purpose_agent import Orchestrator, State
from purpose_agent.llm_backend import HFInferenceBackend
from purpose_agent.orchestrator import Environment, Action

# 1. Define your environment
class MyEnv(Environment):
    def execute(self, action: Action, current_state: State) -> State:
        # Your environment logic
        return State(data={...})

# 2. Create an orchestrator with any LLM backend
orch = Orchestrator(
    llm=HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras"),
    environment=MyEnv(),
    available_actions={"search": "Search for items", "navigate": "Go somewhere"},
    persistence_dir="./agent_memory",
)

# 3. Run tasks: the agent self-improves across runs
result = orch.run_task(purpose="Find the answer to X", max_steps=20)
print(result.summary())
print(orch.get_heuristic_report())  # See what it learned
```

## Swapping LLM Backends

```python
# HuggingFace Inference Providers (cheapest)
from purpose_agent.llm_backend import HFInferenceBackend
llm = HFInferenceBackend(model_id="Qwen/Qwen3-32B", provider="cerebras")

# OpenAI
from purpose_agent.llm_backend import OpenAICompatibleBackend
llm = OpenAICompatibleBackend(model="gpt-4o")

# Local Ollama
llm = OpenAICompatibleBackend(
    model="llama3.2",
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# Use DIFFERENT models for Actor vs. Critic (recommended for production)
orch = Orchestrator(
    llm=cheap_fast_model,            # Actor: needs throughput
    critic_llm=strong_model,         # Purpose Function: needs accuracy
    optimizer_llm=cheap_fast_model,  # Runs infrequently
    environment=my_env,
)
```

## Purpose Function: Anti-Reward-Hacking Design

The Purpose Function system prompt enforces 7 strict rules:

1. **EVIDENCE REQUIRED**: every score must cite specific observable state changes
2. **NO CREDIT FOR INTENTIONS**: scores are based on the actual state, not the agent's predictions
3. **NO SYCOPHANCY**: lateral moves get Δ = 0.0; regressions get negative Δ
4. **MONOTONIC SCALE**: Φ runs 0.0–10.0, proportional to progress
5. **ANTI-GAMING**: superficial state manipulation is flagged and penalized
6. **CONSISTENCY**: identical states must receive identical Φ scores (cache-enforced)
7. **CONFIDENCE**: ambiguous evaluations get reduced delta magnitude

Additional programmatic safeguards (sketched below):

- Score caching prevents inconsistent evaluations
- Anomaly detection flags suspiciously large single-step jumps
- A confidence threshold filters uncertain scores
- Z-score normalization prevents score inflation over long trajectories
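
As a minimal sketch of the first two safeguards, score caching and jump detection (class and method names here are illustrative, not the shipped `purpose_function.py` API):

```python
import hashlib
import json

class GuardedScorer:
    """Caches Φ per state and flags suspicious single-step jumps (sketch)."""

    def __init__(self, critic, max_step_jump: float = 3.0):
        self.critic = critic                # callable: state dict -> Φ in [0.0, 10.0]
        self.cache: dict[str, float] = {}   # identical state => identical Φ (rule 6)
        self.max_step_jump = max_step_jump
        self.flagged: list[tuple[float, float]] = []

    def score(self, state_data: dict, prev_phi: float | None = None) -> float:
        key = hashlib.sha256(json.dumps(state_data, sort_keys=True).encode()).hexdigest()
        if key not in self.cache:
            self.cache[key] = self.critic(state_data)
        phi = self.cache[key]
        if prev_phi is not None and phi - prev_phi > self.max_step_jump:
            self.flagged.append((prev_phi, phi))  # anomaly: review before rewarding (rule 5)
        return phi
```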
134
+ ## 3-Tier Memory System
135
+
136
+ Based on MUSE (arxiv:2510.08002):
137
+
138
+ | Tier | Content | Loading | Update Trigger |
139
+ |------|---------|---------|----------------|
140
+ | **Strategic** | `<Dilemma, Strategy>` pairs | Always in system prompt | After each task |
141
+ | **Procedural** | Step-by-step SOPs | Index in prompt, details on demand | After high-reward trajectory |
142
+ | **Tool** | Per-action tips | Returned per step | When new patterns prove effective |
143
+
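
For illustration only (field names are assumptions, not the actual `types.py` definitions):

```python
from dataclasses import dataclass, field

@dataclass
class Heuristic:
    tier: str      # "strategic" | "procedural" | "tool"
    content: str   # e.g. a <Dilemma, Strategy> pair, an SOP, or a per-action tip
    uses: int = 0  # how often it has been retrieved

@dataclass
class MemoryStore:
    strategic: list[Heuristic] = field(default_factory=list)   # always in the system prompt
    procedural: list[Heuristic] = field(default_factory=list)  # index in prompt, body on demand
    tool: dict[str, list[Heuristic]] = field(default_factory=dict)  # keyed by action name
```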

## Running the Demo

```bash
python demo.py
```

This runs 17 unit tests plus a full end-to-end demo in a simulated TreasureMaze environment. No API keys are needed; the demo uses MockLLMBackend.

## Dependencies

- **Core framework**: Python 3.10+ (stdlib only)
- **HF backend**: `huggingface_hub`
- **OpenAI backend**: `openai`
- **Production embeddings**: `sentence-transformers` (optional, for better retrieval)

## License

MIT