# Purpose Agent — Architecture Documentation
> For developers building on the framework, researchers understanding the theory, and anyone curious about how self-improving agents work.
---
## Table of Contents
1. [What Is Purpose Agent?](#1-what-is-purpose-agent)
2. [The Big Idea (No Jargon)](#2-the-big-idea)
3. [How It Works — Step by Step](#3-how-it-works--step-by-step)
4. [Architecture Map](#4-architecture-map)
5. [The Core Engine](#5-the-core-engine)
6. [The V2 Safety Kernel](#6-the-v2-safety-kernel)
7. [Research Implementations](#7-research-implementations)
8. [Breakthroughs](#8-breakthroughs)
9. [User-Facing Layers](#9-user-facing-layers)
10. [How Models Are Handled](#10-how-models-are-handled)
11. [The Research Behind It](#11-the-research)
12. [For Contributors](#12-for-contributors)
---
## 1. What Is Purpose Agent?
Purpose Agent is a Python framework that builds AI agents that **get better with experience** — without retraining the underlying AI model.
Traditional AI agents run the same way every time. Purpose Agent is different: after each task, it extracts lessons from what worked and what didn't, tests those lessons for safety, and uses them to perform better next time.
**Think of it like this:** A new employee follows the company handbook. After their first week, they have personal notes — shortcuts they discovered, mistakes they won't repeat, tips from colleagues. Those notes make them better at their job without changing who they are. Purpose Agent does this for AI.
---
## 2. The Big Idea
### For Non-Technical Readers
```
You give it a purpose → It builds a team → It does the work → It learns → Next time is better
```
**You say:** "Help me write Python code."
**It builds:** An architect (plans), a coder (writes), and a tester (reviews).
**It runs:** The coder writes fibonacci. The tester checks it. A critic scores the work.
**It learns:** "When writing recursive functions, check base cases first." This lesson is saved.
**Next time:** The coder starts by checking base cases. It's faster and more reliable.
### For Technical Readers
The framework implements a **Purpose-MDP** — a Markov Decision Process where:
- A **Purpose Function Φ(s)** evaluates every state transition on a 0-10 scale
- An **Optimizer** distills successful trajectories into reusable heuristics
- Heuristics are ranked by **Q-values** (how often they helped) and selected via **Mixture-of-Heuristics** (sparse activation, like MoE)
- An **immune system** scans every new heuristic for prompt injection, score manipulation, and other threats
- **Memory CI pipeline** quarantines, tests, and promotes heuristics before they affect agent behavior
This is **Potential-Based Reward Shaping** (Ng et al., 1999) applied to LLM agents, with formal convergence guarantees. See [PURPOSE_LEARNING.md](PURPOSE_LEARNING.md).
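A minimal sketch of the shaping idea (the toy potential function and names below are illustrative, not the framework's API):
```python
# Sketch: potential-based reward shaping over a 0-10 potential Phi.
# The per-step signal is delta_phi = gamma * phi(s') - phi(s); Ng et al.
# (1999) show this shaping leaves the optimal policy unchanged.

def shaped_reward(phi, state_before: str, state_after: str, gamma: float = 1.0) -> float:
    """Dense per-step reward derived from the potential function."""
    return gamma * phi(state_after) - phi(state_before)

# Toy potential: fraction of tests passing, scaled to 0-10.
phi = lambda state: 10.0 * state.count("PASS") / 4
print(shaped_reward(phi, "Tests: 0/4", "PASS PASS PASS PASS"))  # 10.0
```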
---
## 3. How It Works — Step by Step
Here's what happens when you run `team.run("Write a fibonacci function")`:
### Step 1: The Actor Decides
The Actor module receives:
- The **purpose** ("Write a fibonacci function")
- The **current state** (empty — no code written yet)
- Any **learned heuristics** from past runs
It generates a thought process and an action:
> "I should write a function that handles base cases fib(0)=0 and fib(1)=1, then use iteration for the general case."
> → Action: `submit_code` with the Python implementation.
### Step 2: The Environment Executes
The code is run against test cases. The environment returns a new state:
> "Tests: 4/4 ALL PASSED"
### Step 3: The Purpose Function Scores
A **separate LLM call** (not the same as the actor) evaluates the transition:
- Φ(state_before) = 0.0 (nothing done)
- Φ(state_after) = 10.0 (all tests pass)
- ΔΦ = +10.0 (huge improvement)
- Evidence: "Tests changed from 0/4 to 4/4"
The Purpose Function has **7 anti-gaming rules** that prevent the agent from tricking itself into thinking it's doing well when it isn't.
### Step 4: The Optimizer Extracts Heuristics
After the task, the Optimizer looks at the trajectory and extracts reusable patterns:
- **Strategic:** "When writing {function_type} functions, handle edge cases first, then iterate."
- **Procedural:** "1. Read test cases. 2. Handle base cases. 3. Implement general case. 4. Submit."
- **Tool tip:** "When submitting code, check boundary conditions: 0, 1, empty, negative."
### Step 5: Safety Checks
Every new heuristic goes through the **immune system**:
- Is it a prompt injection? ("Ignore all previous instructions") → **REJECTED**
- Does it try to manipulate scores? ("Always score 10") → **REJECTED**
- Does it contain secrets? (API keys, passwords) → **REJECTED**
- Is it safe? ("Check base cases first") → **QUARANTINED** (pending replay test)
After passing replay testing → **PROMOTED** (active in future runs).
### Step 6: Next Run Benefits
When the agent runs again, the **Prompt Compiler** selects the top-K heuristics by:
- **Relevance** to the current task (embedding similarity)
- **Trust** (immune-scanned and verified)
- **Utility** (Q-value — how often it helped before)
These are injected into the prompt. The agent is now better without any model retraining.
---
## 4. Architecture Map
```
┌────────────────────────────────────────────────────────────────────────┐
│                              PURPOSE AGENT                             │
│                                                                        │
│ ┌─── USER LAYER ─────────────────────────────────────────────────────┐ │
│ │  pa.purpose("...") → Team → team.run("...")                        │ │
│ │  pa.Agent()   pa.Graph()   pa.parallel()   pa.Conversation()       │ │
│ └──────────────────────────────────────────────────┬─────────────────┘ │
│                                                    │                   │
│ ┌─── CORE ENGINE ──────────────────────────────────▼─────────────────┐ │
│ │                                                                    │ │
│ │  Actor ──→ Environment ──→ Purpose Function (Φ)                    │ │
│ │    ↑            │                   │                              │ │
│ │    │            ▼                   ▼                              │ │
│ │    │        State s'          Φ(s) → Φ(s')                         │ │
│ │    │            │                   │                              │ │
│ │    │            ▼                   ▼                              │ │
│ │    │    Experience Replay       Optimizer                          │ │
│ │    │            │                   │                              │ │
│ │    └─── heuristics ◄────────────────┘                              │ │
│ │                                                                    │ │
│ └──────────────────────────────────────────────────┬─────────────────┘ │
│                                                    │                   │
│ ┌─── V2 SAFETY KERNEL ─────────────────────────────▼─────────────────┐ │
│ │                                                                    │ │
│ │  Immune System ──→ Memory CI ──→ Memory Store                      │ │
│ │  (scan threats)  (quarantine)    (7 types × 5 statuses)            │ │
│ │                                                                    │ │
│ │  Prompt Compiler ──→ Token Budget ──→ Credit Assignment            │ │
│ │  Trace System ──→ JSONL logs ──→ Offline analysis                  │ │
│ │  RunMode ──→ EVAL_TEST blocks all writes                           │ │
│ │                                                                    │ │
│ └──────────────────────────────────────────────────┬─────────────────┘ │
│                                                    │                   │
│ ┌─── INFRASTRUCTURE ───────────────────────────────▼─────────────────┐ │
│ │                                                                    │ │
│ │  LLM Backends: OpenRouter │ Groq │ OpenAI │ Ollama │ HF │ ...      │ │
│ │  Robust Parser: TOML → JSON → field extraction → regex             │ │
│ │  Tools: Calculator │ PythonExec │ ReadFile │ WriteFile             │ │
│ │  Streaming │ Observability │ Cost Tracking │ Registry              │ │
│ │                                                                    │ │
│ └────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────┘
```
---
## 5. The Core Engine
### Actor (`actor.py`)
The decision-maker. Given the current state and purpose, it decides what action to take.
**Key design:** The Actor doesn't evaluate itself. That's the Purpose Function's job. This separation prevents self-confirmation bias (you wouldn't let a student grade their own exam).
The Actor's prompt is **dynamically composed** from three tiers of memory:
- **Strategic:** High-level rules ("When coding, handle edge cases first")
- **Procedural:** Step-by-step procedures ("1. Read tests. 2. Handle bases. 3. Implement.")
- **Tool tips:** Action-specific advice ("When using submit_code, check boundaries")
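A hedged sketch of what that composition might look like (the function and tier names are illustrative, not the actual `actor.py` interface):
```python
# Illustrative sketch of three-tier prompt composition; names here are
# assumptions for the example, not the shipped actor.py API.
def compose_actor_prompt(purpose: str, state: str, memory: dict) -> str:
    sections = [f"PURPOSE: {purpose}", f"CURRENT STATE: {state}"]
    for tier in ("strategic", "procedural", "tool_tips"):
        items = memory.get(tier, [])
        if items:  # only include tiers that have promoted heuristics
            bullets = "\n".join(f"- {h}" for h in items)
            sections.append(f"{tier.upper()} HEURISTICS:\n{bullets}")
    sections.append("Decide the next action. Reply with a thought and an action.")
    return "\n\n".join(sections)
```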
### Purpose Function (`purpose_function.py`)
The critic. A separate LLM call that scores every state transition on a 0-10 scale.
**Seven anti-gaming rules:**
1. Evidence required — cite specific state changes
2. No credit for intentions — score actual results, not plans
3. No sycophancy — don't inflate scores to be encouraging
4. Monotonic scale — 0 = nothing done, 10 = task complete
5. Anti-gaming — flag superficial state manipulation
6. Consistency — same state gets same score (enforced by cache)
7. Confidence — uncertain evaluations get reduced weight
### Experience Replay (`experience_replay.py`)
Stores completed trajectories and retrieves relevant ones for future tasks.
**Two-phase retrieval** (from MemRL, arxiv:2601.03192):
1. **Recall:** Find trajectories similar to the current task (embedding similarity)
2. **Re-rank:** Order by Q-value utility (how useful was this memory when retrieved before?)
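A sketch of the two phases (assumes trajectories carry an `embedding` and a `q_value`; the helper names are illustrative):
```python
# Sketch of two-phase retrieval: recall by similarity, re-rank by utility.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(task_emb, trajectories, recall_k=20, final_k=5):
    # Phase 1 (recall): top-k trajectories by embedding similarity.
    candidates = sorted(trajectories,
                        key=lambda t: cosine_sim(task_emb, t.embedding),
                        reverse=True)[:recall_k]
    # Phase 2 (re-rank): order survivors by learned Q-value utility.
    return sorted(candidates, key=lambda t: t.q_value, reverse=True)[:final_k]
```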
### Optimizer (`optimizer.py`)
Extracts reusable heuristics from successful trajectories.
Uses the **CER distillation pattern** (arxiv:2506.06698): abstract away specific details with `{variable}` placeholders so heuristics generalize across tasks.
### Orchestrator (`orchestrator.py`)
The main loop that ties everything together. For each step:
1. Actor decides → 2. Environment executes → 3. Critic scores → 4. Step recorded → 5. Check termination
After each task: store trajectory → optimize → sync heuristics to Actor memory.
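In pseudocode terms, the loop looks roughly like this (method names are illustrative; see `orchestrator.py` for the real interfaces):
```python
# Rough shape of the orchestrator loop (illustrative method names).
def run_episode(actor, env, critic, replay, optimizer, purpose, max_steps=10):
    state, steps = env.reset(), []
    for _ in range(max_steps):
        action = actor.decide(purpose, state)             # 1. Actor decides
        next_state, done = env.execute(action)            # 2. Environment executes
        score = critic.score(state, next_state)           # 3. Critic scores the transition
        steps.append((state, action, next_state, score))  # 4. Step recorded
        state = next_state
        if done:                                          # 5. Check termination
            break
    replay.store(steps)                    # after the task: store trajectory,
    actor.sync(optimizer.distill(steps))   # optimize, sync heuristics to memory
```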
---
## 6. The V2 Safety Kernel
V1 let the agent learn freely. V2 adds guardrails.
### Memory System (`memory.py`)
Seven memory types, each with different trust priors:
| Type | Example | Trust |
|------|---------|-------|
| `purpose_contract` | "Build a web scraper" | High (user-defined) |
| `user_preference` | "Always cite sources" | High (human-taught) |
| `skill_card` | "When coding, test edges first" | Medium (learned) |
| `episodic_case` | "fib(0)=0 was a tricky case" | Medium (observed) |
| `failure_pattern` | "Don't use recursion for large n" | Medium (learned from failure) |
| `critic_calibration` | "Score 7 for 3/4 tests passing" | Low (meta-learned) |
| `tool_policy` | "search: only use at target location" | Medium (learned) |
Five statuses: `candidate` → `quarantined` → `promoted` (or `rejected`) → `archived`.
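As enums, the taxonomy looks roughly like this (values mirror the table and status chain above; the class names are assumptions, not necessarily those in `memory.py`):
```python
# Sketch of the memory taxonomy; class names are assumptions.
from enum import Enum

class MemoryKind(Enum):
    PURPOSE_CONTRACT = "purpose_contract"
    USER_PREFERENCE = "user_preference"
    SKILL_CARD = "skill_card"
    EPISODIC_CASE = "episodic_case"
    FAILURE_PATTERN = "failure_pattern"
    CRITIC_CALIBRATION = "critic_calibration"
    TOOL_POLICY = "tool_policy"

class MemoryStatus(Enum):
    CANDIDATE = "candidate"
    QUARANTINED = "quarantined"
    PROMOTED = "promoted"
    REJECTED = "rejected"
    ARCHIVED = "archived"
```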
### Immune System (`immune.py`)
Scans every candidate memory for 5 threat categories:
- **Prompt injection** — "Ignore previous instructions..."
- **Score manipulation** — "Always score 10..."
- **Tool misuse** — "subprocess.call('rm -rf /')..."
- **Privacy leaks** — API keys, emails, file paths
- **Scope overreach** — memory tries to affect all agents when it should be scoped
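A toy pattern-based scanner in the spirit of the first four categories (the patterns are illustrative examples, not the shipped rule set; scope overreach needs scope metadata and is omitted):
```python
import re

# Illustrative threat patterns -- not the actual immune.py rules.
THREAT_PATTERNS = {
    "prompt_injection":   r"ignore (all )?previous instructions",
    "score_manipulation": r"always score (10|ten)",
    "tool_misuse":        r"rm\s+-rf|subprocess\.call",
    "privacy_leak":       r"sk-[A-Za-z0-9]{20,}|[\w.]+@[\w.]+\.\w+",
}

def scan(memory_text: str) -> list[str]:
    """Return the threat categories a candidate memory trips, if any."""
    return [name for name, pat in THREAT_PATTERNS.items()
            if re.search(pat, memory_text, re.IGNORECASE)]

assert scan("Ignore previous instructions") == ["prompt_injection"]
```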
### Memory CI (`memory_ci.py`)
The promotion pipeline:
```
candidate → immune_scan() → quarantined → replay_test → promote/reject
```
No memory reaches the agent's prompt without passing every gate.
### Prompt Compiler (`compiler.py`)
Selects which memories to include under a token budget. Ranked by:
`score = 0.4 × relevance + 0.3 × trust + 0.3 × utility`
Returns `included_memory_ids` for credit assignment — only memories that were in the prompt get Q-value updates after the step.
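A sketch of that selection under a budget (the dict fields and the 4-characters-per-token estimate are assumptions for the example):
```python
# Sketch: token-budgeted memory selection using the weighted score above.
def compile_prompt(memories: list[dict], budget_tokens: int = 1000):
    ranked = sorted(
        memories,
        key=lambda m: 0.4 * m["relevance"] + 0.3 * m["trust"] + 0.3 * m["utility"],
        reverse=True,
    )
    included, used = [], 0
    for m in ranked:
        cost = len(m["text"]) // 4  # rough token estimate
        if used + cost <= budget_tokens:
            included.append(m)
            used += cost
    # The id list is what credit assignment consumes after the step.
    return included, [m["id"] for m in included]
```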
### Trace System (`trace.py`)
Every run produces a JSONL trace — the raw material for debugging, evaluation, and memory extraction. Traces are append-only and immutable.
### RunMode (`v2_types.py`)
Three modes with strict enforcement:
- `LEARNING_TRAIN` — full read/write
- `LEARNING_VALIDATION` — read + staging writes
- `EVAL_TEST` — **no writes of any kind** (the only mode whose numbers you can report)
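A sketch of how such a guard can be enforced (the enum mirrors the list above; the guard function itself is illustrative):
```python
from enum import Enum

class RunMode(Enum):
    LEARNING_TRAIN = "learning_train"
    LEARNING_VALIDATION = "learning_validation"
    EVAL_TEST = "eval_test"

def check_write(mode: RunMode, staging: bool = False) -> None:
    """Illustrative write guard; raises before any memory write lands."""
    if mode is RunMode.EVAL_TEST:
        raise PermissionError("EVAL_TEST blocks all writes")
    if mode is RunMode.LEARNING_VALIDATION and not staging:
        raise PermissionError("LEARNING_VALIDATION allows staging writes only")
```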
---
## 7. Research Implementations
Five papers implemented as standalone modules:
### Meta-Rewarding (`meta_rewarding.py`)
*From: arxiv:2407.19594 — Llama-3-8B: 22.9% → 39.4% on AlpacaEval*
A meta-judge evaluates the Purpose Function's own judgments. Good judgments become calibration examples in memory. The critic improves through in-context learning.
### Self-Taught Evaluators (`self_taught.py`)
*From: arxiv:2408.02666*
Generates synthetic contrast pairs (correct vs wrong evaluation) from traces. Creates an automatic curriculum: as the critic improves, the contrast pairs get harder.
### Prompt Optimizer (`prompt_optimizer.py`)
*From DSPy: arxiv:2310.03714 — +8% on GSM8K, +50% on BBH*
Instead of hand-crafting prompts, define signatures (`state, action → score, reasoning`) and let the optimizer bootstrap effective few-shot demonstrations by trial-and-error.
### LLM Compiler (`llm_compiler.py`)
*From: arxiv:2312.04511 — up to 3.7× latency speedup*
Instead of sequential tool calls (ReAct), plan ALL calls upfront as a DAG and execute independent ones in parallel.
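A minimal sketch of the execution half (asyncio-based; the task/dependency format the planner would emit is an assumption here):
```python
import asyncio

async def execute_dag(tasks: dict, deps: dict) -> dict:
    """tasks: id -> async callable; deps: id -> list of prerequisite ids."""
    results, pending = {}, set(tasks)
    while pending:
        # Every task whose dependencies are all satisfied runs concurrently.
        ready = [t for t in pending if all(d in results for d in deps.get(t, []))]
        if not ready:
            raise ValueError("dependency cycle in the plan")
        done = await asyncio.gather(*(tasks[t]() for t in ready))
        results.update(zip(ready, done))
        pending -= set(ready)
    return results
```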
### Retroformer (`retroformer.py`)
*From: arxiv:2308.02151*
Structured reflection on completed traces → extracts four types of memories (skills, failures, policies, observations). Replaces raw heuristic distillation with typed, safety-scanned memory extraction.
---
## 8. Breakthroughs
Six features that go beyond existing frameworks:
### B1: Self-Improving Critic
The Purpose Function's own quality improves over time. Meta-judging after each task generates calibration examples that make future scoring more accurate.
### B2: Mixture-of-Heuristics (MoH)
Like DeepSeek's Mixture-of-Experts: out of 100+ heuristics, only K=5 are activated per step. **Shared heuristics** (always active, like "check edge cases") + **routed heuristics** (task-specific, selected by Q×similarity). Knowledge grows; compute stays flat.
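A sketch of the routing step (attribute and helper names are illustrative):
```python
# Sketch of MoH routing: shared heuristics always fire; routed ones are
# top-K by q_value x task similarity. Names are illustrative.
def select_heuristics(shared, routed, task_sim, k=5):
    """task_sim: heuristic -> embedding similarity to the current task."""
    scored = sorted(routed, key=lambda h: h.q_value * task_sim(h), reverse=True)
    return list(shared) + scored[:k]  # compute stays flat as `routed` grows
```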
### B3: Hindsight Heuristic Relabeling
From HER (arxiv:1707.01495): when a task fails, instead of discarding the trajectory, ask "what DID this accomplish?" and extract heuristics for what was achieved. Learn from failures, not just successes.
### B4: Heuristic Evolution
Periodically generalize specific heuristics into abstract patterns:
- Before: "When fibonacci fails on 0, return 0"
- After: "When {function} fails on {boundary_value}, add an explicit base case"
Creates an automatic curriculum: specific → general → abstract.
### B5: Cross-Domain Transfer
Heuristics learned on one set of coding tasks carry over to unseen coding tasks. The `test_cross_domain_transfer()` function measures this: train on [fibonacci, factorial], test on [palindrome, fizzbuzz].
### B6: Adversarial Robustness
The `AdversarialHardener` generates 30 adversarial inputs (prompt injections, score hacks, API key leaks) and 10 benign inputs, then tests the immune system against all of them. Current results: **93% catch rate, 0% false positives.**
---
## 9. User-Facing Layers
### Easy API (`easy.py`)
The `purpose()` function analyzes your description and builds the right team:
| You say | It builds |
|---------|-----------|
| "Write Python code" | architect + coder + tester |
| "Research papers" | researcher + analyst |
| "Write blog posts" | writer + editor |
| "Analyze data" | analyst + reporter |
| "Help me" | general assistant |
### Unified Capabilities (`unified.py`)
Five competing framework philosophies in one composable layer:
| Capability | Inspired By | Usage |
|-----------|-------------|-------|
| `Agent()` | OpenAI Agents SDK | One-liner agent creation |
| `Graph()` | LangGraph | Conditional branching, cycles, fan-out |
| `parallel()` | CrewAI | Concurrent task execution |
| `Conversation()` | AutoGen | Agent-to-agent message passing |
| `KnowledgeStore` | LlamaIndex | RAG as a tool |
### Robust Parser (`robust_parser.py`)
The universal solution to "LLMs can't reliably produce JSON":
- Tries TOML first (fewer tokens than JSON)
- Falls back to JSON
- Falls back to field extraction by regex
- Never crashes. Always returns something usable.
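A condensed sketch of the chain (`tomllib` is stdlib since Python 3.11; the regex fallback here is a simplified stand-in for the real field extraction):
```python
import json
import re
import tomllib  # stdlib since Python 3.11

def robust_parse(raw: str) -> dict:
    # Try the strict formats first: TOML, then JSON.
    for loader in (tomllib.loads, json.loads):
        try:
            return loader(raw)
        except Exception:
            continue
    # Last resort: scrape `key: value` / `key = value` pairs with a regex.
    pairs = re.findall(r'(\w+)\s*[:=]\s*"?([^"\n]+)"?', raw)
    return dict(pairs)  # possibly empty, but never an exception
```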
---
## 10. How Models Are Handled
### resolve_backend()
One function routes to any provider:
```python
resolve_backend("openrouter:meta-llama/llama-3.3-70b-instruct")
resolve_backend("groq:llama-3.3-70b-versatile")
resolve_backend("openai:gpt-4o")
resolve_backend("ollama:qwen3:1.7b") # Local, free
resolve_backend("hf:Qwen/Qwen3-32B")
resolve_backend("together:meta-llama/Llama-3.3-70B-Instruct-Turbo")
```
### SLM-Native Design
The framework was designed for small models (0.6B-3B params):
- **Grammar-constrained output** via Ollama (forces valid structure from any model)
- **Prompt compression** for small context windows (8K-32K)
- **Tool RAG** — only load relevant tools into the prompt (saves tokens)
- **TOML format** — fewer tokens than JSON
### _strip_thinking()
Handles reasoning models (Qwen3, DeepSeek-R1) that wrap output in `<think>` tags. Automatically strips the thinking and returns only the answer.
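A regex-based sketch of the idea (the real `_strip_thinking` may differ in detail):
```python
import re

def strip_thinking(text: str) -> str:
    """Drop <think>...</think> blocks and return only the answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

assert strip_thinking("<think>check base cases...</think>Here is the code.") == "Here is the code."
```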
---
## 11. The Research
Every design decision traces to a published paper. The full list with citations, methodology sections, and implementation mapping is in [COMPILED_RESEARCH.md](COMPILED_RESEARCH.md).
The formal framework — **Purpose-MDP** with 5 axioms, 3 theorems, and convergence proofs — is in [PURPOSE_LEARNING.md](PURPOSE_LEARNING.md).
**Key theoretical result:** The self-improvement is a form of Potential-Based Reward Shaping (Ng et al., 1999). Our ΔΦ = Φ(s') - Φ(s) preserves the optimal policy while providing dense per-step feedback. The heuristic library converges to a fixed point under bounded capacity.
---
## 12. For Contributors
### File Structure
```
purpose_agent/
├── types.py              # State, Action, Trajectory, Heuristic, PurposeScore
├── llm_backend.py        # LLMBackend ABC + HF, OpenAI, Mock + resolve_backend
├── slm_backends.py       # Ollama, llama-cpp, prompt compression, SLM registry
├── robust_parser.py      # Universal parser: TOML → JSON → regex (never crashes)
├── actor.py              # ReAct agent with 3-tier memory prompts
├── purpose_function.py   # Φ(s) critic with 7 anti-gaming rules
├── experience_replay.py  # Two-phase retrieval (similarity → Q-value)
├── optimizer.py          # Trajectory → heuristic distillation
├── orchestrator.py       # Main step loop
├── v2_types.py           # RunMode, MemoryScope, PurposeScoreV2
├── trace.py              # JSONL execution traces
├── memory.py             # 7 MemoryKinds × 5 MemoryStatuses
├── compiler.py           # Token-budgeted prompt compilation
├── immune.py             # 5 threat scanners
├── memory_ci.py          # Quarantine → scan → test → promote/reject
├── evalport.py           # Pluggable evaluation protocol
├── benchmark_v2.py       # Train/val/test splits with ablation
├── meta_rewarding.py     # Self-improving critic (arxiv:2407.19594)
├── self_taught.py        # Synthetic critic training (arxiv:2408.02666)
├── prompt_optimizer.py   # DSPy-style bootstrap (arxiv:2310.03714)
├── llm_compiler.py       # Parallel tool DAG (arxiv:2312.04511)
├── retroformer.py        # Structured reflection (arxiv:2308.02151)
├── breakthroughs.py      # MoH, hindsight relabeling, heuristic evolution, etc.
├── unified.py            # Agent, Graph, parallel, Conversation, KnowledgeStore
├── easy.py               # purpose(), Team, quickstart wizard
├── tools.py              # Secure built-in tools
├── streaming.py          # Async + event streaming
├── observability.py      # Cost tracking, callbacks
├── multi_agent.py        # Agent teams with shared learning
├── hitl.py               # Human-in-the-loop + checkpointing
├── evaluation.py         # V1 benchmark runner
├── registry.py           # Plugin system
├── __init__.py           # 103 exports
└── __main__.py           # CLI entry point
```
### Adding a New LLM Provider
```python
# In your code (no core edits needed):
from purpose_agent import backend_registry, OpenAICompatibleBackend

backend_registry.register(
    "my_provider",
    lambda model, api_key: OpenAICompatibleBackend(
        model=model, base_url="https://api.myprovider.com/v1", api_key=api_key
    ),
)
```
### Adding a New Tool
```python
from purpose_agent import FunctionTool

def my_search(query: str) -> str:
    """Search my database."""
    return db.search(query)

tool = FunctionTool.from_function(my_search)
```
### Running Tests
```bash
python tests/test_core.py # 21 unit tests
python tests/launch_readiness.py # 119 comprehensive tests
python benchmarks/validate.py # Mock benchmark suite
python benchmarks/validate.py --quick # Fast smoke test
```