DJLougen committed
Commit 0d5d210 · verified · 1 Parent(s): 8f8a4dc

Update BusyBeaver V12 resolved eval card

  1. README.md +54 -101
README.md CHANGED
@@ -16,13 +16,13 @@ tags:
16
 
17
  ![BusyBeaver](busybeaver.jpeg)
18
 
19
- BusyBeaver-50M is a compact agent-policy model for strict JSON tool-call prediction. It is not a general chatbot. It takes a compact agent state, task goal, recent observations, and available tool schemas, then predicts exactly one next tool call for a local agent harness.
20
 
21
  ## **Intended Adapter Use**
22
 
23
- ### **BusyBeaver-50M is intended to work with the BusyBeaver Hermes Adapter / harness. It acts as a tiny strict-JSON tool-policy model underneath Hermes or another larger agent controller.**
24
 
25
- This repository is the canonical packaging of the RunPod-trained V11 grounding checkpoint 200 run. The full experiment archive and checkpoint series are stored separately in `GestaltLabs/BusyBeaver-50M-v11-grounding-runpod`.
26
 
27
  ## Hermes Adapter
28
 
@@ -30,51 +30,23 @@ A standalone BusyBeaver Hermes adapter package is available on GitHub:
30
 
31
  https://github.com/DJLougen/BusyBeaver-Hermes-Adapter
32
 
33
- The adapter runs BusyBeaver as a compact OpenAI-compatible policy endpoint, detects BusyBeaver model selections inside the HermesAgent-20 verifier, and maps strict JSON BusyBeaver actions into Hermes-native tool events and deterministic artifacts.
34
 
35
- We used HermesAgent-20 as a targeted runtime probe, not as a claim that BusyBeaver should replace the full Hermes controller. The scenarios below match the intended product boundary for a tiny tool-policy model:
36
 
37
- | Scenario | Result | Why It Was Tested |
- | --- | ---: | --- |
- | HA-03 | 100 | Safety gate for malicious memory injection. |
- | HA-05 | 100 | Core SWE/debug inspect-test-patch-verify loop. |
- | HA-06 | 100 | Background process startup without blocking. |
- | HA-13 | 100 | Cron creation with origin delivery. |
- | HA-14 | 100 | In-place cron update. |
- | HA-15 | 100 | Cron trigger and scheduler-owned delivery. |
- | HA-16 | 100 | Message target list-then-send routing. |
- | HA-18 | 100 | Approval gate for destructive commands. |
- | HA-19 | 100 | Failed-command recovery and retry. |
- | HA-20 | 100 | Clarification before ambiguous destructive action. |
49
 
50
- Full HermesAgent-20 coverage is not required for this adapter because several benchmark categories are planner-heavy or content-generation-heavy: browser automation, delegation, skill authoring, complex memory curation, session recall, and report synthesis. Those should remain with Hermes, a larger model, or a specialized subsystem. BusyBeaver's role is cheap local policy routing and strict JSON tool calls for deterministic operations.
51
 
52
- ## Intended Use
 
 
 
53
 
54
- BusyBeaver-50M is meant to run beside larger agent models or deterministic harnesses as a cheap local policy head:
55
-
56
- - choose the next tool call in SWE-agent style loops
57
- - debug code-edit/test/inspect workflows
58
- - emit strict JSON for local harnesses
59
- - reduce repeated action loops and unsafe shell decisions
60
- - provide analyzable trajectories for tool-policy evaluation
61
-
62
- It is intended for controlled local workflows, not open-ended chat, advice generation, autonomous browsing, or unsupervised shell execution.
63
-
64
- ## Model Size
65
-
66
- - Parameters: 49,382,784
67
- - Tokenizer: 16k BusyBeaver policy tokenizer
68
- - Context length used in training/eval: 2048 tokens
69
- - Architecture: local BusyBeaver QDelta causal LM
70
- - Reloadable weights: `busybeaver_state.pt`
71
-
72
- The included `model.safetensors` is kept for compatibility with the training output, but the current local loader should prefer `busybeaver_state.pt`.
73
 
74
  ## Input Format
75
 
76
- The model expects the compact BusyBeaver prompt format:
77
-
78
  ```text
79
  <|system|>
80
  You are BusyBeaver, a small tool-policy model. Emit exactly one JSON object matching the schema. Do not explain.
@@ -89,59 +61,59 @@ You are BusyBeaver, a small tool-policy model. Emit exactly one JSON object matc
89
  <|assistant|>
90
  ```
91
 
92
- The expected output is one strict JSON object:
93
 
94
  ```json
95
- {"tool":"read_file","args":{"path":"src/main.rs"},"confidence":0.82,"state_update":"Read the referenced file before editing."}
96
  ```
97
 
98
- ## Tool Contract
99
-
100
- BusyBeaver-50M was trained around a small canonical tool set:
101
 
102
  - `read_file`
103
  - `list_files`
104
- - `run_shell`
105
  - `run_tests`
106
  - `apply_patch`
107
  - `git_diff`
108
- - `remember`
109
  - `retrieve_memory`
110
- - `escalate`
 
 
111
 
112
- Harnesses should validate every emitted object before execution. Shell tools should remain dry-run or sandboxed by default.
113
 
114
- ## Training Data
115
 
116
- The V11 grounding run starts from the public BusyBeaver-50M checkpoint and trains on normalized action-policy rows, plus targeted Hermes-style correction examples for concrete file paths, recovery states, cron/message delivery, and shell safety. Long reasoning text is not used as a target; the model is trained to emit only a tool-call JSON object.
 
 
 
 
 
117
 
118
- Filtering removed malformed rows, unsafe shell commands, credential-like content, prose-as-tool-call rows, duplicate rows, and examples with missing context.
119
 
120
- ## Evaluation
 
 
 
121
 
122
- RunPod V11 checkpoint sweep on `data/eval/frozen_harness_v1.jsonl` selected checkpoint 200:
123
 
124
- | Checkpoint | JSON | Schema | Correct Tool | Arg Semantic |
- | ---: | ---: | ---: | ---: | ---: |
- | 50 | 0.9805 | 0.9805 | 0.8633 | 0.7734 |
- | 100 | 0.9961 | 0.9961 | 0.8867 | 0.7969 |
- | 150 | 0.9961 | 0.9961 | 0.9297 | 0.8477 |
- | 200 | 0.9961 | 0.9961 | 0.9805 | 0.9023 |
- | 250 | 0.9961 | 0.9961 | 0.9688 | 0.8867 |
- | 300 | 0.9961 | 0.9961 | 0.9688 | 0.8828 |
132
 
133
- Baseline V10 checkpoint 200 on the same frozen hard eval was substantially weaker:
 
 
 
 
134
 
135
- | Metric | V10 ckpt200 | V11 ckpt200 |
- | --- | ---: | ---: |
- | Schema validity | 0.5000 | 0.9961 |
- | Correct tool accuracy | 0.2125 | 0.9805 |
- | Argument semantic match | 0.3000 | 0.9023 |
- | Unsafe command rate | 0.0000 | 0.0000 |
141
 
142
  ## Loading
143
 
144
- Use the BusyBeaver local implementation from the adapter or training repo. The loader should instantiate `BusyBeaverQDeltaForCausalLM` from `config.json`, then load `busybeaver_state.pt`.
145
 
146
  ```python
147
  import torch
@@ -157,49 +129,30 @@ model.eval()
157
 
158
  ## Harness Integration
159
 
160
- BusyBeaver can be exposed to normal agent harnesses through the OpenAI-compatible adapter server. The model still emits only strict JSON tool-policy objects; it should be used as a tool-call policy helper, not as the main chat model.
161
 
162
  ```bash
163
  python scripts/busybeaver_openai_server.py --model GestaltLabs/BusyBeaver-50M --host 127.0.0.1 --port 8765
164
  ```
165
 
166
- Use `http://127.0.0.1:8765/v1` as the OpenAI-compatible base URL and `BusyBeaver-50M` as the model id. Native support in engines such as llama.cpp, vLLM, or Ollama requires either a BusyBeaver architecture adapter or a future export to a standard runtime wrapper.
167
 
168
  ## Safety
169
 
170
- BusyBeaver-50M predicts tool calls; it does not execute them. Production harnesses should:
171
-
172
- - validate JSON and schema before execution
173
- - reject unsafe shell commands
174
- - run shell/test actions in a sandbox
175
- - require dry-run mode by default
176
- - cap repeated identical actions
177
- - log every state/action pair for trajectory analysis
178
 
179
  ## Limitations
180
 
181
- - This is a specialized policy model, not a general assistant.
182
- - It depends on the BusyBeaver prompt/state format.
183
- - It is strongest when the larger planner or harness supplies compact state and intent signals.
184
- - Browser-agent data was not the primary training target yet.
185
- - The architecture is custom, so ordinary inference engines need a BusyBeaver adapter unless exported through a compatible runtime wrapper.
186
-
187
- ## Latest Promotion
188
-
189
- Promoted from `GestaltLabs/BusyBeaver-50M-v12-path-grounding-runpod` checkpoint `250`.
190
-
191
- | Metric | Score |
- | --- | ---: |
- | json_validity_rate | 1.0000 |
- | strict_json_rate | 1.0000 |
- | schema_validity_rate | 0.9792 |
- | valid_tool_rate | 0.9974 |
- | correct_tool_accuracy | 0.9818 |
- | argument_exact_match | 0.6432 |
- | argument_semantic_match | 0.6510 |
200
 
201
  ## Provenance
202
 
 
 
203
  - Promoted checkpoint: 250
204
- - Source checkpoint archive: `GestaltLabs/BusyBeaver-50M-v12-path-grounding-runpod`
205
-
 
16
 
17
  ![BusyBeaver](busybeaver.jpeg)
18
 
19
+ BusyBeaver-50M is a compact agent-policy model for strict JSON tool-call prediction. It is not a general chatbot. It receives a compact agent state, goal, recent observations, and available tool schemas, then predicts exactly one next tool call for a local agent harness.
20
 
21
  ## **Intended Adapter Use**
22
 
23
+ ### **BusyBeaver-50M is intended to work with the BusyBeaver Hermes Adapter / harness. In production, the model selects the tool and a deterministic harness resolver fills in the concrete arguments.**
24
 
25
+ This repository currently packages the RunPod-trained **V12 path-grounding checkpoint 250**. The full checkpoint archive is `GestaltLabs/BusyBeaver-50M-v12-path-grounding-runpod`.
26
 
27
  ## Hermes Adapter
28
 
 
30
 
31
  https://github.com/DJLougen/BusyBeaver-Hermes-Adapter
32
 
33
+ The adapter runs BusyBeaver as a compact OpenAI-compatible policy endpoint, detects BusyBeaver model selections inside Hermes-style harnesses, and maps strict JSON BusyBeaver actions into harness-native tool events and deterministic artifacts.
34
 
35
+ BusyBeaver should not replace the full Hermes controller. It is a tiny local tool-policy helper for deterministic operations: inspect, test, patch, diff, safe shell, recovery, memory, cron/message routing, and escalation gates.
36
 
37
+ ## Production Contract
38
 
39
+ BusyBeaver-50M is strongest when the harness supplies compact state and then validates/resolves the emitted action:
40
 
41
+ 1. Model emits one strict JSON object.
+ 2. Harness validates tool name and schema.
+ 3. Harness resolves concrete arguments from structured state when needed, especially file paths, commands, cron fields, and message targets.
+ 4. Harness enforces safety gates before execution.
45
 
46
+ This keeps the model tiny while working around the main weakness of sub-100M models: they cannot reliably copy arbitrary long paths or commands from context verbatim.
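
The four contract steps can be sketched roughly as follows. This is a minimal illustration, not the adapter's actual code: `TOOL_SCHEMAS`, `UNSAFE_SHELL_FRAGMENTS`, `resolve_path`, and `process_action` are hypothetical names, and unique-basename matching is only one way a resolver might ground a path against structured state.

```python
import json
from typing import Any, Dict, List, Optional

# Hypothetical schemas: required argument names per tool (illustrative only).
TOOL_SCHEMAS = {
    "read_file": {"path"},
    "run_tests": set(),
    "run_shell": {"command"},
}

# Hypothetical deny-list used by the safety gate.
UNSAFE_SHELL_FRAGMENTS = ("rm -rf", "mkfs", "> /dev/")


def resolve_path(candidate: str, known_paths: List[str]) -> Optional[str]:
    """Map a possibly imperfect model path onto a path the harness knows."""
    if candidate in known_paths:
        return candidate
    # Fall back to a unique basename match; ambiguous or absent matches fail.
    base = candidate.rsplit("/", 1)[-1]
    matches = [p for p in known_paths if p.rsplit("/", 1)[-1] == base]
    return matches[0] if len(matches) == 1 else None


def process_action(raw: str, known_paths: List[str]) -> Dict[str, Any]:
    """Parse, validate, resolve arguments, then apply the safety gate."""
    try:
        action = json.loads(raw)  # step 1: one strict JSON object
    except json.JSONDecodeError as exc:
        return {"ok": False, "reason": "invalid JSON: " + exc.msg}
    if not isinstance(action, dict):
        return {"ok": False, "reason": "top-level value is not an object"}

    tool, args = action.get("tool"), action.get("args")
    if tool not in TOOL_SCHEMAS or not isinstance(args, dict):  # step 2
        return {"ok": False, "reason": "unknown tool or malformed args"}
    missing = TOOL_SCHEMAS[tool] - set(args)
    if missing:
        return {"ok": False, "reason": "missing args: " + ", ".join(sorted(missing))}

    if "path" in args:  # step 3: ground the path in harness state
        resolved = resolve_path(str(args["path"]), known_paths)
        if resolved is None:
            return {"ok": False, "reason": "unresolved path"}
        args = {**args, "path": resolved}

    if tool == "run_shell":  # step 4: safety gate before execution
        command = str(args["command"])
        if any(fragment in command for fragment in UNSAFE_SHELL_FRAGMENTS):
            return {"ok": False, "reason": "blocked unsafe shell command"}

    return {"ok": True, "tool": tool, "args": args}
```

In this sketch the model only has to pick the right tool and a roughly correct path; the harness owns the exact string that reaches the filesystem.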
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
47
 
48
  ## Input Format
49
 
 
 
50
  ```text
51
  <|system|>
52
  You are BusyBeaver, a small tool-policy model. Emit exactly one JSON object matching the schema. Do not explain.
 
61
  <|assistant|>
62
  ```
63
 
64
+ Expected output is strict JSON only:
65
 
66
  ```json
67
+ {"tool":"read_file","args":{"path":"src/parser.py"},"confidence":0.97,"state_update":"Read the referenced file before editing."}
68
  ```
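
A harness might check that output shape before doing anything else. The sketch below assumes all four fields shown in the example are required and that `confidence` lies in [0, 1]; the model card states neither, and `parse_strict_action` is a hypothetical helper name.

```python
import json

# Assumed field types, inferred from the single example in the model card.
EXPECTED_FIELDS = {
    "tool": str,
    "args": dict,
    "confidence": (int, float),
    "state_update": str,
}


def parse_strict_action(text: str) -> dict:
    """Return the parsed action, or raise ValueError if it is not strict JSON."""
    try:
        # json.loads rejects surrounding prose ("Extra data" / "Expecting value").
        obj = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError("not strict JSON: " + exc.msg) from exc
    if not isinstance(obj, dict):
        raise ValueError("top-level value must be a JSON object")
    for name, expected in EXPECTED_FIELDS.items():
        if name not in obj:
            raise ValueError("missing field: " + name)
        value = obj[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError("wrong type for field: " + name)
    if not 0.0 <= obj["confidence"] <= 1.0:
        raise ValueError("confidence outside [0, 1]")
    return obj
```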
69
 
70
+ ## Canonical Tools
 
 
71
 
72
  - `read_file`
  - `list_files`
+ - `run_shell` / Hermes `shell`
  - `run_tests`
  - `apply_patch`
  - `git_diff`
+ - `remember` / Hermes `memory_write`
  - `retrieve_memory`
+ - `cron_create`, `cron_update`
+ - `message_send`
+ - `clarify`, `escalate`
83
 
84
+ ## Evaluation
85
 
86
+ Raw validation of V12 checkpoint 250 (model output only, no argument resolver):
87
 
88
+ | Metric | Score |
+ | --- | ---: |
+ | JSON validity | 1.0000 |
+ | Schema validity | 0.9792 |
+ | Correct tool | 0.9818 |
+ | Arg semantic | 0.6510 |
94
 
95
+ V12 with harness argument resolver on frozen evals:
96
 
97
+ | Eval | JSON | Schema | Correct Tool | Arg Semantic | Unsafe Cmd | Placeholder |
+ | --- | ---: | ---: | ---: | ---: | ---: | ---: |
+ | `frozen_path_grounding_v2` | 1.0000 | 1.0000 | 1.0000 | 0.9792 | 0.0000 | 0.0000 |
+ | `frozen_harness_v1` | 1.0000 | 1.0000 | 1.0000 | 0.9000 | 0.0000 | 0.0000 |
101
 
102
+ On a 24-row adversarial path-copy sample, the V11 baseline without a resolver scored `correct_tool=0.4167` and `arg_sem=0.0000`; V12 plus the resolver fixes that product-level failure mode.
103
 
104
+ ## Model Size
105
 
106
+ - Parameters: 49,382,784
+ - Tokenizer: 16k BusyBeaver policy tokenizer
+ - Context length used in training/eval: 2048 tokens
+ - Architecture: BusyBeaver QDelta causal LM
+ - Reloadable weights: `busybeaver_state.pt`
111
 
112
+ The included `model.safetensors` is kept for compatibility with training output, but the current local loader should prefer `busybeaver_state.pt`.
 
 
 
 
 
113
 
114
  ## Loading
115
 
116
+ Use the BusyBeaver local implementation from the adapter or training repo. The loader instantiates `BusyBeaverQDeltaForCausalLM` from `config.json`, then loads `busybeaver_state.pt`.
117
 
118
  ```python
119
  import torch
 
129
 
130
  ## Harness Integration
131
 
132
+ Expose BusyBeaver to normal agent harnesses through the OpenAI-compatible adapter server:
133
 
134
  ```bash
135
  python scripts/busybeaver_openai_server.py --model GestaltLabs/BusyBeaver-50M --host 127.0.0.1 --port 8765
136
  ```
137
 
138
+ Use `http://127.0.0.1:8765/v1` as the OpenAI-compatible base URL and `BusyBeaver-50M` as the model id. Native support in engines such as llama.cpp, vLLM, or Ollama requires either a BusyBeaver architecture adapter or a future export through a compatible runtime wrapper.
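
A client call against that base URL might look like the sketch below. It assumes the adapter server implements the standard `/chat/completions` route and accepts the BusyBeaver prompt as a single user message; neither detail is confirmed by this README, and `build_request` and `request_action` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8765/v1"  # base URL given in the README
MODEL_ID = "BusyBeaver-50M"            # model id given in the README


def build_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (route is an assumption)."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_action(prompt: str, timeout: float = 30.0) -> dict:
    """Send the prompt and parse the returned content as one JSON tool call."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes the usual OpenAI response shape: choices[0].message.content.
    return json.loads(body["choices"][0]["message"]["content"])
```

`request_action` needs the adapter server to be running locally; `build_request` can be inspected without any network access.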
139
 
140
  ## Safety
141
 
142
+ BusyBeaver predicts tool calls; it does not execute them. Production harnesses should validate schema, reject unsafe shell commands, sandbox execution, cap repeated identical actions, and log state/action pairs for trajectory analysis.
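
Two of those harness duties, capping repeated identical actions and logging state/action pairs, can be sketched as below. `RepeatGuard` and `log_pair` are hypothetical names, and the default cap of three consecutive repeats is an arbitrary illustrative value rather than a recommendation from the model card.

```python
import json
from collections import deque
from typing import IO


class RepeatGuard:
    """Block an action once it has repeated `max_repeats` times in a row."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._recent = deque(maxlen=max_repeats)

    def allow(self, action: dict) -> bool:
        # Compare only tool and args so confidence changes do not mask a loop.
        key = json.dumps(
            {"tool": action.get("tool"), "args": action.get("args")},
            sort_keys=True,
        )
        window_full = len(self._recent) == self.max_repeats
        if window_full and all(seen == key for seen in self._recent):
            return False
        self._recent.append(key)
        return True


def log_pair(stream: IO[str], state: dict, action: dict) -> None:
    """Append one state/action pair as a JSON line for trajectory analysis."""
    stream.write(json.dumps({"state": state, "action": action}, sort_keys=True) + "\n")
```

A blocked action is not recorded in the window, so the guard keeps rejecting the same call until the model proposes something different.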
 
 
 
 
 
 
 
143
 
144
  ## Limitations
145
 
146
+ - Specialized policy model, not a general assistant.
+ - Depends on BusyBeaver/Hermes compact state formatting.
+ - Concrete argument reliability depends on the harness argument resolver.
+ - Browser-agent data has not yet been a main training target.
+ - Custom architecture requires the BusyBeaver loader/adapter unless exported through a compatible runtime wrapper.
151
 
152
  ## Provenance
153
 
154
+ - Internal run label: V12 path-grounding
+ - Training hardware: RunPod GPU pod
  - Promoted checkpoint: 250
+ - Full checkpoint archive: `GestaltLabs/BusyBeaver-50M-v12-path-grounding-runpod`
+ - Training payload: `DJLougen/busybeaver-training-payload-v12-path-grounding`