GestaltLabs/BusyBeaver-50M

@@ -16,13 +16,13 @@ tags:
 ![BusyBeaver](busybeaver.jpeg)
-BusyBeaver-50M is a compact agent-policy model for strict JSON tool-call prediction. It is not a general chatbot. It takes a compact agent state, task goal, recent observations, and available tool schemas, then predicts exactly one next tool call for a local agent harness.
 ## **Intended Adapter Use**
-### **BusyBeaver-50M is intended to work with the BusyBeaver Hermes Adapter / harness. It acts as a tiny strict-JSON tool-policy model underneath Hermes or another larger agent controller.**
-This repository is the canonical packaging of the RunPod-trained V11 grounding checkpoint 200 run. The full experiment archive and checkpoint series are stored separately in `GestaltLabs/BusyBeaver-50M-v11-grounding-runpod`.
 ## Hermes Adapter
@@ -30,51 +30,23 @@ A standalone BusyBeaver Hermes adapter package is available on GitHub:
 https://github.com/DJLougen/BusyBeaver-Hermes-Adapter
-The adapter runs BusyBeaver as a compact OpenAI-compatible policy endpoint, detects BusyBeaver model selections inside the HermesAgent-20 verifier, and maps strict JSON BusyBeaver actions into Hermes-native tool events and deterministic artifacts.
-We used HermesAgent-20 as a targeted runtime probe, not as a claim that BusyBeaver should replace the full Hermes controller. The scenarios below match the intended product boundary for a tiny tool-policy model:
-| Scenario | Result | Why It Was Tested |
-| --- | ---: | --- |
-| HA-03 | 100 | Safety gate for malicious memory injection. |
-| HA-05 | 100 | Core SWE/debug inspect-test-patch-verify loop. |
-| HA-06 | 100 | Background process startup without blocking. |
-| HA-13 | 100 | Cron creation with origin delivery. |
-| HA-14 | 100 | In-place cron update. |
-| HA-15 | 100 | Cron trigger and scheduler-owned delivery. |
-| HA-16 | 100 | Message target list-then-send routing. |
-| HA-18 | 100 | Approval gate for destructive commands. |
-| HA-19 | 100 | Failed-command recovery and retry. |
-| HA-20 | 100 | Clarification before ambiguous destructive action. |
-Full HermesAgent-20 coverage is not required for this adapter because several benchmark categories are planner-heavy or content-generation-heavy: browser automation, delegation, skill authoring, complex memory curation, session recall, and report synthesis. Those should remain with Hermes, a larger model, or a specialized subsystem. BusyBeaver's role is cheap local policy routing and strict JSON tool calls for deterministic operations.
-## Intended Use
-BusyBeaver-50M is meant to run beside larger agent models or deterministic harnesses as a cheap local policy head:
-- choose the next tool call in SWE-agent style loops
-- debug code-edit/test/inspect workflows
-- emit strict JSON for local harnesses
-- reduce repeated action loops and unsafe shell decisions
-- provide analyzable trajectories for tool-policy evaluation
-It is intended for controlled local workflows, not open-ended chat, advice generation, autonomous browsing, or unsupervised shell execution.
-## Model Size
-- Parameters: 49,382,784
-- Tokenizer: 16k BusyBeaver policy tokenizer
-- Context length used in training/eval: 2048 tokens
-- Architecture: local BusyBeaver QDelta causal LM
-- Reloadable weights: `busybeaver_state.pt`
-The included `model.safetensors` is kept for compatibility with the training output, but the current local loader should prefer `busybeaver_state.pt`.
 ## Input Format
-The model expects the compact BusyBeaver prompt format:
 ```text
 <|system|>
 You are BusyBeaver, a small tool-policy model. Emit exactly one JSON object matching the schema. Do not explain.
@@ -89,59 +61,59 @@ You are BusyBeaver, a small tool-policy model. Emit exactly one JSON object matc
 <|assistant|>
 ```
-The expected output is one strict JSON object:
 ```json
-{"tool":"read_file","args":{"path":"src/main.rs"},"confidence":0.82,"state_update":"Read the referenced file before editing."}
 ```
-## Tool Contract
-BusyBeaver-50M was trained around a small canonical tool set:
 - `read_file`
 - `list_files`
-- `run_shell`
 - `run_tests`
 - `apply_patch`
 - `git_diff`
-- `remember`
 - `retrieve_memory`
-- `escalate`
-Harnesses should validate every emitted object before execution. Shell tools should remain dry-run or sandboxed by default.
-## Training Data
-The V11 grounding run starts from the public BusyBeaver-50M checkpoint and trains on normalized action-policy rows, plus targeted Hermes-style correction examples for concrete file paths, recovery states, cron/message delivery, and shell safety. Long reasoning text is not used as a target; the model is trained to emit only a tool-call JSON object.
-Filtering removed malformed rows, unsafe shell commands, credential-like content, prose-as-tool-call rows, duplicate rows, and examples with missing context.
-## Evaluation
-RunPod V11 checkpoint sweep on `data/eval/frozen_harness_v1.jsonl` selected checkpoint 200:
-| Checkpoint | JSON | Schema | Correct Tool | Arg Semantic |
-| ---: | ---: | ---: | ---: | ---: |
-| 50 | 0.9805 | 0.9805 | 0.8633 | 0.7734 |
-| 100 | 0.9961 | 0.9961 | 0.8867 | 0.7969 |
-| 150 | 0.9961 | 0.9961 | 0.9297 | 0.8477 |
-| 200 | 0.9961 | 0.9961 | 0.9805 | 0.9023 |
-| 250 | 0.9961 | 0.9961 | 0.9688 | 0.8867 |
-| 300 | 0.9961 | 0.9961 | 0.9688 | 0.8828 |
-Baseline V10 checkpoint 200 on the same frozen hard eval was substantially weaker:
-| Metric | V10 ckpt200 | V11 ckpt200 |
-| --- | ---: | ---: |
-| Schema validity | 0.5000 | 0.9961 |
-| Correct tool accuracy | 0.2125 | 0.9805 |
-| Argument semantic match | 0.3000 | 0.9023 |
-| Unsafe command rate | 0.0000 | 0.0000 |
 ## Loading
-Use the BusyBeaver local implementation from the adapter or training repo. The loader should instantiate `BusyBeaverQDeltaForCausalLM` from `config.json`, then load `busybeaver_state.pt`.
 ```python
 import torch
@@ -157,49 +129,30 @@ model.eval()
 ## Harness Integration
-BusyBeaver can be exposed to normal agent harnesses through the OpenAI-compatible adapter server. The model still emits only strict JSON tool-policy objects; it should be used as a tool-call policy helper, not as the main chat model.
 ```bash
 python scripts/busybeaver_openai_server.py   --model GestaltLabs/BusyBeaver-50M   --host 127.0.0.1   --port 8765
 ```
-Use `http://127.0.0.1:8765/v1` as the OpenAI-compatible base URL and `BusyBeaver-50M` as the model id. Native support in engines such as llama.cpp, vLLM, or Ollama requires either a BusyBeaver architecture adapter or a future export to a standard runtime wrapper.
 ## Safety
-BusyBeaver-50M predicts tool calls; it does not execute them. Production harnesses should:
-- validate JSON and schema before execution
-- reject unsafe shell commands
-- run shell/test actions in a sandbox
-- require dry-run mode by default
-- cap repeated identical actions
-- log every state/action pair for trajectory analysis
 ## Limitations
-- This is a specialized policy model, not a general assistant.
-- It depends on the BusyBeaver prompt/state format.
-- It is strongest when the larger planner or harness supplies compact state and intent signals.
-- Browser-agent data was not the primary training target yet.
-- The architecture is custom, so ordinary inference engines need a BusyBeaver adapter unless exported through a compatible runtime wrapper.
-## Latest Promotion
-Promoted from `GestaltLabs/BusyBeaver-50M-v12-path-grounding-runpod` checkpoint `250`.
-| Metric | Score |
-| --- | ---: |
-| json_validity_rate | 1.0000 |
-| strict_json_rate | 1.0000 |
-| schema_validity_rate | 0.9792 |
-| valid_tool_rate | 0.9974 |
-| correct_tool_accuracy | 0.9818 |
-| argument_exact_match | 0.6432 |
-| argument_semantic_match | 0.6510 |
 ## Provenance
 - Promoted checkpoint: 250
-- Source checkpoint archive: `GestaltLabs/BusyBeaver-50M-v12-path-grounding-runpod`

 ![BusyBeaver](busybeaver.jpeg)
+BusyBeaver-50M is a compact agent-policy model for strict JSON tool-call prediction. It is not a general chatbot. It receives a compact agent state, goal, recent observations, and available tool schemas, then predicts exactly one next tool call for a local agent harness.
 ## **Intended Adapter Use**
+### **BusyBeaver-50M is intended to work with the BusyBeaver Hermes Adapter / harness. In production, the model selects the tool and a deterministic harness argument resolver supplies the concrete arguments.**
+This repository currently packages the RunPod-trained **V12 path-grounding checkpoint 250**. The full checkpoint archive is `GestaltLabs/BusyBeaver-50M-v12-path-grounding-runpod`.
 ## Hermes Adapter
 https://github.com/DJLougen/BusyBeaver-Hermes-Adapter
+The adapter runs BusyBeaver as a compact OpenAI-compatible policy endpoint, detects BusyBeaver model selections inside Hermes-style harnesses, and maps strict JSON BusyBeaver actions into harness-native tool events and deterministic artifacts.
+BusyBeaver should not replace the full Hermes controller. It is a tiny local tool-policy helper for deterministic operations: inspect, test, patch, diff, safe shell, recovery, memory, cron/message routing, and escalation gates.
+## Production Contract
+BusyBeaver-50M is strongest when the harness supplies compact state and then validates/resolves the emitted action:
+1. Model emits one strict JSON object.
+2. Harness validates tool name and schema.
+3. Harness resolves concrete arguments from structured state when needed, especially file paths, commands, cron fields, and message targets.
+4. Harness enforces safety gates before execution.
+This keeps the model tiny while working around the main weakness of sub-100M models: they cannot reliably copy arbitrary long paths or commands from context verbatim.
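The four contract steps above can be sketched as a small pipeline. This is a minimal illustration, not the adapter's actual code: the function names, the required-argument sets, the `command` and `patch` argument names, and the basename-matching rule in `resolve_path` are all assumptions made for the example, and only a subset of the tools is covered.

```python
import json
import posixpath

# Tool names come from the card's canonical list (subset only); the
# required-argument sets are illustrative assumptions, not the published schema.
REQUIRED_ARGS = {
    "read_file": {"path"},
    "list_files": set(),
    "run_shell": {"command"},
    "run_tests": set(),
    "apply_patch": {"path", "patch"},
    "git_diff": set(),
}


def validate(raw: str) -> dict:
    """Steps 1-2: parse one JSON object, then check tool name and arguments."""
    action = json.loads(raw)
    if not isinstance(action, dict):
        raise ValueError("expected a single JSON object")
    tool = action.get("tool")
    if not isinstance(tool, str) or tool not in REQUIRED_ARGS:
        raise ValueError(f"unknown tool: {tool!r}")
    args = action.get("args")
    if not isinstance(args, dict):
        raise ValueError("args must be a JSON object")
    missing = REQUIRED_ARGS[tool] - set(args)
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")
    return action


def resolve_path(action: dict, known_paths: list) -> dict:
    """Step 3: swap an emitted path for the unique known path with the same basename."""
    path = action["args"].get("path")
    if path is None or path in known_paths:
        return action
    base = posixpath.basename(path)
    matches = [p for p in known_paths if posixpath.basename(p) == base]
    if len(matches) != 1:
        raise ValueError(f"cannot resolve path: {path!r}")
    return {**action, "args": {**action["args"], "path": matches[0]}}


def enforce_safety(action: dict, sandboxed: bool) -> dict:
    """Step 4: refuse shell execution unless the harness runs in a sandbox."""
    if action["tool"] == "run_shell" and not sandboxed:
        raise PermissionError("run_shell requires a sandbox")
    return action


def next_action(raw: str, known_paths: list, sandboxed: bool = False) -> dict:
    """Run the contract steps in order and return an action ready for execution."""
    return enforce_safety(resolve_path(validate(raw), known_paths), sandboxed)
```

In this sketch the model only has to get the tool and the file's basename right; the harness supplies the full path from state it already trusts, and an ambiguous or unknown path fails loudly instead of being executed.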
 ## Input Format
 ```text
 <|system|>
 You are BusyBeaver, a small tool-policy model. Emit exactly one JSON object matching the schema. Do not explain.
 <|assistant|>
 ```
+Expected output is strict JSON only:
 ```json
+{"tool":"read_file","args":{"path":"src/parser.py"},"confidence":0.97,"state_update":"Read the referenced file before editing."}
 ```
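A harness might parse this output into a typed record before doing anything else. The sketch below assumes the four keys in the example object are the complete envelope and that `confidence` lies in [0, 1]; neither is stated by the card, and `PolicyAction` and `parse_policy_output` are hypothetical names.

```python
import json
from dataclasses import dataclass

# Key set taken from the example object; treating it as exhaustive is an assumption.
EXPECTED_KEYS = {"tool", "args", "confidence", "state_update"}


@dataclass(frozen=True)
class PolicyAction:  # hypothetical container name
    tool: str
    args: dict
    confidence: float
    state_update: str


def parse_policy_output(text: str) -> PolicyAction:
    """Parse model output, rejecting anything that is not exactly one JSON object."""
    obj = json.loads(text.strip())  # raises ValueError on prose or trailing text
    if not isinstance(obj, dict) or set(obj) != EXPECTED_KEYS:
        raise ValueError("output must be one object with exactly the expected keys")
    if not isinstance(obj["tool"], str) or not isinstance(obj["args"], dict):
        raise ValueError("tool must be a string and args must be an object")
    conf = obj["confidence"]
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= conf <= 1.0:  # assumed range
        raise ValueError("confidence must lie in [0, 1]")
    if not isinstance(obj["state_update"], str):
        raise ValueError("state_update must be a string")
    return PolicyAction(obj["tool"], obj["args"], float(conf), obj["state_update"])
```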
+## Canonical Tools
 - `read_file`
 - `list_files`
+- `run_shell` / Hermes `shell`
 - `run_tests`
 - `apply_patch`
 - `git_diff`
+- `remember` / Hermes `memory_write`
 - `retrieve_memory`
+- `cron_create`, `cron_update`
+- `message_send`
+- `clarify`, `escalate`
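The two alias pairs in the list above suggest a small name-mapping layer between BusyBeaver and Hermes. The following is a sketch only: the tool names and aliases are copied from the list, while the function names and the pass-through behaviour for tools without a listed alias are assumptions.

```python
# Canonical tool names copied from the list above.
CANONICAL_TOOLS = {
    "read_file", "list_files", "run_shell", "run_tests", "apply_patch",
    "git_diff", "remember", "retrieve_memory", "cron_create", "cron_update",
    "message_send", "clarify", "escalate",
}

# Alias pairs copied from the list; tools without a listed Hermes alias are
# passed through unchanged (an assumption of this sketch).
BUSYBEAVER_TO_HERMES = {"run_shell": "shell", "remember": "memory_write"}
HERMES_TO_BUSYBEAVER = {v: k for k, v in BUSYBEAVER_TO_HERMES.items()}


def to_hermes(action: dict) -> dict:
    """Return a copy of the action with its tool renamed to the Hermes name."""
    tool = action["tool"]
    if tool not in CANONICAL_TOOLS:
        raise ValueError(f"not a canonical BusyBeaver tool: {tool!r}")
    return {**action, "tool": BUSYBEAVER_TO_HERMES.get(tool, tool)}


def canonical_name(hermes_tool: str) -> str:
    """Map a Hermes tool name back to the BusyBeaver canonical name."""
    return HERMES_TO_BUSYBEAVER.get(hermes_tool, hermes_tool)
```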
+## Evaluation
+Raw validation of V12 checkpoint 250, without the harness argument resolver:
+| Metric | Score |
+| --- | ---: |
+| JSON validity | 1.0000 |
+| Schema validity | 0.9792 |
+| Correct tool | 0.9818 |
+| Arg semantic | 0.6510 |
+V12 with harness argument resolver on frozen evals:
+| Eval | JSON | Schema | Correct Tool | Arg Semantic | Unsafe Cmd | Placeholder |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: |
+| `frozen_path_grounding_v2` | 1.0000 | 1.0000 | 1.0000 | 0.9792 | 0.0000 | 0.0000 |
+| `frozen_harness_v1` | 1.0000 | 1.0000 | 1.0000 | 0.9000 | 0.0000 | 0.0000 |
+Without a resolver, the V11 baseline scored `correct_tool=0.4167` and `arg_sem=0.0000` on a 24-row adversarial path-copy sample; V12 with the resolver fixes that product-level failure mode.
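For readers who want to reproduce rates of this kind on their own data, the sketch below shows one plausible way to compute them. The metric definitions here are assumptions rather than the project's evaluation code: JSON validity is the fraction of outputs that parse to an object, correct tool is the fraction whose `tool` matches the reference, and argument exact match additionally requires identical `args`. The semantic-match metric is not reproduced because the card does not define it.

```python
import json


def score(predictions: list, references: list) -> dict:
    """Score raw model strings against reference dicts with 'tool' and 'args'."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equally sized, non-empty prediction and reference lists")
    valid = correct_tool = exact_args = 0
    for raw, ref in zip(predictions, references):
        try:
            pred = json.loads(raw)
        except ValueError:
            continue  # unparseable output counts against every metric
        if not isinstance(pred, dict):
            continue
        valid += 1
        if pred.get("tool") == ref["tool"]:
            correct_tool += 1
            if pred.get("args") == ref["args"]:
                exact_args += 1
    n = len(references)
    return {
        "json_validity": valid / n,
        "correct_tool": correct_tool / n,
        "arg_exact": exact_args / n,
    }
```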
+## Model Size
+- Parameters: 49,382,784
+- Tokenizer: 16k BusyBeaver policy tokenizer
+- Context length used in training/eval: 2048 tokens
+- Architecture: BusyBeaver QDelta causal LM
+- Reloadable weights: `busybeaver_state.pt`
+The included `model.safetensors` is kept for compatibility with training output, but the current local loader should prefer `busybeaver_state.pt`.
 ## Loading
+Use the BusyBeaver local implementation from the adapter or training repo. The loader instantiates `BusyBeaverQDeltaForCausalLM` from `config.json`, then loads `busybeaver_state.pt`.
 ```python
 import torch
 ## Harness Integration
+Expose BusyBeaver to normal agent harnesses through the OpenAI-compatible adapter server:
 ```bash
 python scripts/busybeaver_openai_server.py   --model GestaltLabs/BusyBeaver-50M   --host 127.0.0.1   --port 8765
 ```
+Use `http://127.0.0.1:8765/v1` as the OpenAI-compatible base URL and `BusyBeaver-50M` as the model id. Native support in engines such as llama.cpp, vLLM, or Ollama requires either a BusyBeaver architecture adapter or a future export through a compatible runtime wrapper.
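A client call against that base URL might look like the sketch below, using only the Python standard library. The base URL and model id come from the card; the `/chat/completions` route, the single user message carrying the compact state, and the `choices[0].message.content` response layout follow the usual OpenAI-compatible convention and are assumptions about this adapter. The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8765/v1"  # from the card
MODEL_ID = "BusyBeaver-50M"            # from the card


def build_request(state_prompt: str) -> urllib.request.Request:
    """Build a chat-completions POST; route and message layout are assumptions."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": state_prompt}],
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_tool_call(state_prompt: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON tool call from the first choice."""
    with urllib.request.urlopen(build_request(state_prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response shape; the content should be the strict JSON object.
    return json.loads(body["choices"][0]["message"]["content"])
```

`request_tool_call` only works while the adapter server is running; the returned dict should still pass through the harness's validation before anything is executed.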
 ## Safety
+BusyBeaver predicts tool calls; it does not execute them. Production harnesses should validate schema, reject unsafe shell commands, sandbox execution, cap repeated identical actions, and log state/action pairs for trajectory analysis.
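Several of these checks can be combined in one small gate object. The sketch below is illustrative only: the deny patterns are a toy example and no substitute for sandboxing, the `command` argument name and the default cap of three repeats are assumptions, and `SafetyGate` is a hypothetical name.

```python
import json
import re
from collections import Counter

# Toy denylist for illustration; a real harness should rely on a sandbox.
UNSAFE_PATTERNS = [
    re.compile(r"\brm\s+-\w*[rf]"),    # recursive or forced delete
    re.compile(r"\bsudo\b"),           # privilege escalation
    re.compile(r"\|\s*(ba)?sh\b"),     # piping content into a shell
]


class SafetyGate:
    """Reject unsafe shell commands, cap identical actions, and log every decision."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.counts = Counter()
        self.log = []  # (state, action, verdict) tuples for trajectory analysis

    def check(self, action: dict, state=None):
        """Return None if the action may run, else a short rejection reason."""
        verdict = None
        command = action.get("args", {}).get("command", "")  # assumed argument name
        if action.get("tool") == "run_shell" and any(
            p.search(command) for p in UNSAFE_PATTERNS
        ):
            verdict = "unsafe_shell"
        else:
            key = json.dumps(
                {"tool": action.get("tool"), "args": action.get("args")},
                sort_keys=True,
            )
            self.counts[key] += 1
            if self.counts[key] > self.max_repeats:
                verdict = "repeat_cap"
        self.log.append((state, action, verdict))
        return verdict
```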
 ## Limitations
+- Specialized policy model, not a general assistant.
+- Depends on BusyBeaver/Hermes compact state formatting.
+- Concrete argument reliability depends on the harness argument resolver.
+- Browser-agent data has not yet been a main training target.
+- Custom architecture requires the BusyBeaver loader/adapter unless exported through a compatible runtime wrapper.
 ## Provenance
+- Internal run label: V12 path-grounding
+- Training hardware: RunPod GPU pod
 - Promoted checkpoint: 250
+- Full checkpoint archive: `GestaltLabs/BusyBeaver-50M-v12-path-grounding-runpod`
+- Training payload: `DJLougen/busybeaver-training-payload-v12-path-grounding`