Update BusyBeaver V12 resolved eval card

0d5d210 verified 5 days ago

6.07 kB

	---
	license: other
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- busybeaver
	- tool-calling
	- agent-policy
	- json
	- local-agents
	- qdelta
	- 50m
	---

	# BusyBeaver-50M

	![BusyBeaver](busybeaver.jpeg)

	BusyBeaver-50M is a compact agent-policy model for strict JSON tool-call prediction. It is not a general chatbot. It receives a compact agent state, goal, recent observations, and available tool schemas, then predicts exactly one next tool call for a local agent harness.

	## Intended Adapter Use

	### BusyBeaver-50M is intended to work with the BusyBeaver Hermes Adapter / harness. In production it should be used as: model-selected tool + deterministic harness argument resolver.

	This repository currently packages the RunPod-trained V12 path-grounding checkpoint 250. The full checkpoint archive is `GestaltLabs/BusyBeaver-50M-v12-path-grounding-runpod`.

	## Hermes Adapter

	A standalone BusyBeaver Hermes adapter package is available on GitHub:

	https://github.com/DJLougen/BusyBeaver-Hermes-Adapter

	The adapter runs BusyBeaver as a compact OpenAI-compatible policy endpoint, detects BusyBeaver model selections inside Hermes-style harnesses, and maps strict JSON BusyBeaver actions into harness-native tool events and deterministic artifacts.

	BusyBeaver should not replace the full Hermes controller. It is a tiny local tool-policy helper for deterministic operations: inspect, test, patch, diff, safe shell, recovery, memory, cron/message routing, and escalation gates.

	## Production Contract

	BusyBeaver-50M is strongest when the harness supplies compact state and then validates/resolves the emitted action:

	1. Model emits one strict JSON object.
	2. Harness validates tool name and schema.
	3. Harness resolves concrete arguments from structured state when needed, especially file paths, commands, cron fields, and message targets.
	4. Harness enforces safety gates before execution.

	This keeps the model tiny while avoiding the main weakness of sub-100M models: copying arbitrary long paths or commands from context perfectly.

	## Input Format

	```text
	<\|system\|>
	You are BusyBeaver, a small tool-policy model. Emit exactly one JSON object matching the schema. Do not explain.
	<\|goal\|>
	...
	<\|state\|>
	...
	<\|tools\|>
	...
	<\|output_schema\|>
	{"tool":"string","args":"object","confidence":"number","state_update":"string"}
	<\|assistant\|>
	```

	Expected output is strict JSON only:

	```json
	{"tool":"read_file","args":{"path":"src/parser.py"},"confidence":0.97,"state_update":"Read the referenced file before editing."}
	```

	## Canonical Tools

	- `read_file`
	- `list_files`
	- `run_shell` / Hermes `shell`
	- `run_tests`
	- `apply_patch`
	- `git_diff`
	- `remember` / Hermes `memory_write`
	- `retrieve_memory`
	- `cron_create`, `cron_update`
	- `message_send`
	- `clarify`, `escalate`

	## Evaluation

	V12 checkpoint 250 raw checkpoint validation:

	\| Metric \| Score \|
	\| --- \| ---: \|
	\| JSON validity \| 1.0000 \|
	\| Schema validity \| 0.9792 \|
	\| Correct tool \| 0.9818 \|
	\| Arg semantic \| 0.6510 \|

	V12 with harness argument resolver on frozen evals:

	\| Eval \| JSON \| Schema \| Correct Tool \| Arg Semantic \| Unsafe Cmd \| Placeholder \|
	\| --- \| ---: \| ---: \| ---: \| ---: \| ---: \| ---: \|
	\| `frozen_path_grounding_v2` \| 1.0000 \| 1.0000 \| 1.0000 \| 0.9792 \| 0.0000 \| 0.0000 \|
	\| `frozen_harness_v1` \| 1.0000 \| 1.0000 \| 1.0000 \| 0.9000 \| 0.0000 \| 0.0000 \|

	The unresolved V11 baseline on a 24-row adversarial path-copy sample was `correct_tool=0.4167` and `arg_sem=0.0000`; V12 plus resolver fixes that product-level failure mode.

	## Model Size

	- Parameters: 49,382,784
	- Tokenizer: 16k BusyBeaver policy tokenizer
	- Context length used in training/eval: 2048 tokens
	- Architecture: BusyBeaver QDelta causal LM
	- Reloadable weights: `busybeaver_state.pt`

	The included `model.safetensors` is kept for compatibility with training output, but the current local loader should prefer `busybeaver_state.pt`.

	## Loading

	Use the BusyBeaver local implementation from the adapter or training repo. The loader instantiates `BusyBeaverQDeltaForCausalLM` from `config.json`, then loads `busybeaver_state.pt`.

	```python
	import torch
	from busybeaver.modeling import BusyBeaverQDeltaConfig, BusyBeaverQDeltaForCausalLM

	model_dir = "path/to/BusyBeaver-50M"
	cfg = BusyBeaverQDeltaConfig.from_pretrained(model_dir)
	model = BusyBeaverQDeltaForCausalLM(cfg)
	state = torch.load(f"{model_dir}/busybeaver_state.pt", map_location="cpu")
	model.load_state_dict(state, strict=True)
	model.eval()
	```

	## Harness Integration

	Expose BusyBeaver to normal agent harnesses through the OpenAI-compatible adapter server:

	```bash
	python scripts/busybeaver_openai_server.py --model GestaltLabs/BusyBeaver-50M --host 127.0.0.1 --port 8765
	```

	Use `http://127.0.0.1:8765/v1` as the OpenAI-compatible base URL and `BusyBeaver-50M` as the model id. Native support in engines such as llama.cpp, vLLM, or Ollama requires either a BusyBeaver architecture adapter or a future export through a compatible runtime wrapper.

	## Safety

	BusyBeaver predicts tool calls; it does not execute them. Production harnesses should validate schema, reject unsafe shell commands, sandbox execution, cap repeated identical actions, and log state/action pairs for trajectory analysis.

	## Limitations

	- Specialized policy model, not a general assistant.
	- Depends on BusyBeaver/Hermes compact state formatting.
	- Concrete argument reliability depends on the harness argument resolver.
	- Browser-agent data was not the main training target yet.
	- Custom architecture requires the BusyBeaver loader/adapter unless exported through a compatible runtime wrapper.

	## Provenance

	- Internal run label: V12 path-grounding
	- Training hardware: RunPod GPU pod
	- Promoted checkpoint: 250
	- Full checkpoint archive: `GestaltLabs/BusyBeaver-50M-v12-path-grounding-runpod`
	- Training payload: `DJLougen/busybeaver-training-payload-v12-path-grounding`