Upload folder using huggingface_hub

Files changed (14)

README.md +164 -0
busybeaver_eval/metrics.json +51 -0
busybeaver_eval/report.md +46 -0
busybeaver_eval/traces.jsonl +0 -0
busybeaver_state.pt +3 -0
config.json +29 -0
model.safetensors +3 -0
rng_state.pth +3 -0
scheduler.pt +3 -0
special_tokens_map.json +18 -0
tokenizer.json +0 -0
tokenizer_config.json +20 -0
trainer_state.json +190 -0
training_args.bin +3 -0

README.md ADDED

	@@ -0,0 +1,164 @@

+---
+license: other
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+  - busybeaver
+  - tool-calling
+  - agent-policy
+  - json
+  - local-agents
+  - qdelta
+  - 50m
+private: true
+---
+# BusyBeaver-50M
+BusyBeaver-50M is a compact agent-policy model for strict JSON tool-call prediction. It is not a general chatbot. It takes a compact agent state, task goal, recent observations, and available tool schemas, then predicts exactly one next tool call for a local agent harness.
+This repository is the canonical packaging of the internally tracked V10 checkpoint 200 run.
+## Intended Use
+BusyBeaver-50M is meant to run beside larger agent models or deterministic harnesses as a cheap local policy head:
+- choose the next tool call in SWE-agent style loops
+- debug code-edit/test/inspect workflows
+- emit strict JSON for local harnesses
+- reduce repeated action loops and unsafe shell decisions
+- provide analyzable trajectories for tool-policy evaluation
+It is intended for controlled local workflows, not open-ended chat, advice generation, autonomous browsing, or unsupervised shell execution.
+## Model Size
+- Parameters: 49,382,784
+- Tokenizer: 16k BusyBeaver policy tokenizer
+- Context length used in training/eval: 2048 tokens
+- Architecture: local BusyBeaver QDelta causal LM
+- Reloadable weights: `busybeaver_state.pt`
+The included `model.safetensors` is kept for compatibility with the training output, but the current local loader should prefer `busybeaver_state.pt`.
+## Input Format
+The model expects the compact BusyBeaver prompt format:
+```text
+<|system|>
+You are BusyBeaver, a small tool-policy model. Emit exactly one JSON object matching the schema. Do not explain.
+<|goal|>
+...
+<|state|>
+...
+<|tools|>
+...
+<|output_schema|>
+{"tool":"string","args":"object","confidence":"number","state_update":"string"}
+<|assistant|>
+```
+The expected output is one strict JSON object:
+```json
+{"tool":"read_file","args":{"path":"<PATH_FROM_STATE>"},"confidence":0.82,"state_update":"Read the referenced file before editing."}
+```
+## Tool Contract
+BusyBeaver-50M was trained around a small canonical tool set:
+- `read_file`
+- `list_files`
+- `run_shell`
+- `run_tests`
+- `apply_patch`
+- `git_diff`
+- `remember`
+- `retrieve_memory`
+- `escalate`
+Harnesses should validate every emitted object before execution. Shell tools should remain dry-run or sandboxed by default.
+## Training Data
+The training pipeline normalized public Hugging Face agent/function-call trajectories into state/action rows, then filtered them through the local Crucible pipeline. Sources included SWE/debug trajectory datasets and tool/function-calling datasets. The shipped V10 dataset uses intent/family state signals such as:
+- `needs_source_lookup`
+- `needs_code_change`
+- `needs_validation`
+- `needs_environment_check`
+It does not use an exact `recommended_tool` field.
+Filtering removed malformed rows, unsafe shell commands, credential-like content, prose-as-tool-call rows, duplicate rows, and examples with missing context. Long reasoning text was not used as a target; the model is trained to emit only a tool-call JSON object.
+## Evaluation
+Held-out evaluation on `data/train_v10/test.jsonl`:
+| Metric | Score |
+| --- | ---: |
+| JSON validity | 1.0000 |
+| Strict JSON | 1.0000 |
+| Schema validity | 1.0000 |
+| Valid tool rate | 1.0000 |
+| Correct tool accuracy | 0.9790 |
+| Argument exact match | 0.9790 |
+| Argument semantic match | 0.9802 |
+| Unnecessary escalation rate | 0.0000 |
+| Unsafe command rate | 0.0000 |
+Grouped correct-tool accuracy:
+| Group | Rows | Correct Tool |
+| --- | ---: | ---: |
+| edit | 138 | 1.0000 |
+| execute | 98 | 1.0000 |
+| inspect | 578 | 0.9689 |
+| test | 43 | 1.0000 |
+## Loading
+Use the BusyBeaver local implementation in this repository. The loader should instantiate `BusyBeaverQDeltaForCausalLM` from `config.json`, then load `busybeaver_state.pt`.
+Example:
+```python
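+import torch
+from busybeaver.modeling import BusyBeaverQDeltaConfig, BusyBeaverQDeltaForCausalLM
+model_dir = "path/to/BusyBeaver-50M"
+# Build the architecture from config.json, then load the reloadable weights.
+cfg = BusyBeaverQDeltaConfig.from_pretrained(model_dir)
+model = BusyBeaverQDeltaForCausalLM(cfg)
+# weights_only=True limits unpickling to tensors and plain containers;
+# this assumes busybeaver_state.pt holds a plain state dict.
+state = torch.load(f"{model_dir}/busybeaver_state.pt", map_location="cpu", weights_only=True)
+model.load_state_dict(state, strict=True)
+model.eval()  # switch off training-only behavior such as dropout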
+```
+## Safety
+BusyBeaver-50M predicts tool calls; it does not execute them. Production harnesses should:
+- validate JSON and schema before execution
+- reject unsafe shell commands
+- run shell/test actions in a sandbox
+- require dry-run mode by default
+- cap repeated identical actions
+- log every state/action pair for trajectory analysis
+## Limitations
+- This is a specialized policy model, not a general assistant.
+- It depends on the BusyBeaver prompt/state format.
+- It is strongest when the larger planner or harness supplies compact state and intent signals.
+- Browser-agent data is not yet a primary training target.
+- The architecture is custom, so ordinary inference engines need a BusyBeaver adapter unless the model is exported through a compatible runtime wrapper.
+## Provenance
+- Internal run label: V10 intent-fast
+- Promoted checkpoint: 200
+- Local report: `reports/policy_training_run_v10_ckpt200_candidate.md`
+- Previous best honest baseline: V9 checkpoint 1200 at 0.8959 correct-tool accuracy
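
The README's Tool Contract and Safety sections ask harnesses to validate every emitted object before execution. The following is a minimal sketch of such a check, not the project's own validator: the function name `validate_tool_call` is hypothetical, and the rule that the object must contain exactly the four schema keys is an assumption drawn from the README's output schema.

```python
import json

# Canonical tool set listed in the README's Tool Contract section.
ALLOWED_TOOLS = {
    "read_file", "list_files", "run_shell", "run_tests", "apply_patch",
    "git_diff", "remember", "retrieve_memory", "escalate",
}
# Keys from the README's <|output_schema|> line.
EXPECTED_KEYS = {"tool", "args", "confidence", "state_update"}


def validate_tool_call(raw: str) -> dict:
    """Parse one model output; raise ValueError if it breaks the contract.

    Hypothetical helper: requiring exactly the four schema keys is an
    assumption, not something the source states.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(obj, dict):
        raise ValueError("top-level value must be a JSON object")
    if set(obj) != EXPECTED_KEYS:
        raise ValueError(f"unexpected key set: {sorted(obj)}")
    tool = obj["tool"]
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    if not isinstance(obj["args"], dict):
        raise ValueError("args must be an object")
    conf = obj["confidence"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        raise ValueError("confidence must be a number")
    if not isinstance(obj["state_update"], str):
        raise ValueError("state_update must be a string")
    return obj


sample = ('{"tool":"read_file","args":{"path":"src/app.py"},'
          '"confidence":0.82,"state_update":"Read the file before editing."}')
print(validate_tool_call(sample)["tool"])
```

A check like this covers only the shape of the call. Shell-command safety, sandboxing, and repeated-action caps from the Safety list would still need separate harness logic.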

busybeaver_eval/metrics.json ADDED

	@@ -0,0 +1,51 @@

+{
+  "json_validity_rate": 1.0,
+  "strict_json_rate": 1.0,
+  "schema_validity_rate": 1.0,
+  "valid_tool_rate": 1.0,
+  "correct_tool_accuracy": 0.9765625,
+  "argument_exact_match": 0.9765625,
+  "argument_semantic_match": 0.9765625,
+  "groups": {
+    "edit": {
+      "n": 40,
+      "json_validity_rate": 1.0,
+      "strict_json_rate": 1.0,
+      "schema_validity_rate": 1.0,
+      "valid_tool_rate": 1.0,
+      "correct_tool_accuracy": 1.0,
+      "argument_exact_match": 1.0,
+      "argument_semantic_match": 1.0
+    },
+    "execute": {
+      "n": 27,
+      "json_validity_rate": 1.0,
+      "strict_json_rate": 1.0,
+      "schema_validity_rate": 1.0,
+      "valid_tool_rate": 1.0,
+      "correct_tool_accuracy": 1.0,
+      "argument_exact_match": 1.0,
+      "argument_semantic_match": 1.0
+    },
+    "inspect": {
+      "n": 172,
+      "json_validity_rate": 1.0,
+      "strict_json_rate": 1.0,
+      "schema_validity_rate": 1.0,
+      "valid_tool_rate": 1.0,
+      "correct_tool_accuracy": 0.9651162790697675,
+      "argument_exact_match": 0.9651162790697675,
+      "argument_semantic_match": 0.9651162790697675
+    },
+    "test": {
+      "n": 17,
+      "json_validity_rate": 1.0,
+      "strict_json_rate": 1.0,
+      "schema_validity_rate": 1.0,
+      "valid_tool_rate": 1.0,
+      "correct_tool_accuracy": 1.0,
+      "argument_exact_match": 1.0,
+      "argument_semantic_match": 1.0
+    }
+  }
+}
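
As a consistency check on the file above, the top-level `correct_tool_accuracy` can be recomputed as the row-weighted mean of the per-group values. The sketch below copies the group sizes and accuracies from `metrics.json`; the helper name `weighted_accuracy` is illustrative only.

```python
# (rows, correct_tool_accuracy) per group, copied from metrics.json above.
groups = {
    "edit": (40, 1.0),
    "execute": (27, 1.0),
    "inspect": (172, 0.9651162790697675),
    "test": (17, 1.0),
}


def weighted_accuracy(groups):
    """Return (row-weighted accuracy, total rows) across all groups."""
    total = sum(n for n, _ in groups.values())
    correct = sum(n * acc for n, acc in groups.values())
    return correct / total, total


overall, total_rows = weighted_accuracy(groups)
print(total_rows, round(overall, 7))
```

The groups sum to 256 rows and the weighted mean agrees with the reported 0.9765625, which corresponds to 250 correct rows with all 6 misses in the `inspect` group. The README table reports different row counts (857 in total), so it appears to describe a different evaluation set than this file.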

busybeaver_eval/report.md ADDED

	@@ -0,0 +1,46 @@

+# BusyBeaver Checkpoint Evaluation
+- Step: 200
+- json_validity_rate: 1.0000
+- strict_json_rate: 1.0000
+- schema_validity_rate: 1.0000
+- valid_tool_rate: 1.0000
+- correct_tool_accuracy: 0.9766
+- argument_exact_match: 0.9766
+- argument_semantic_match: 0.9766
+## Grouped Metrics
+### edit (n=40)
+- json_validity_rate: 1.0000
+- strict_json_rate: 1.0000
+- schema_validity_rate: 1.0000
+- valid_tool_rate: 1.0000
+- correct_tool_accuracy: 1.0000
+- argument_exact_match: 1.0000
+- argument_semantic_match: 1.0000
+### execute (n=27)
+- json_validity_rate: 1.0000
+- strict_json_rate: 1.0000
+- schema_validity_rate: 1.0000
+- valid_tool_rate: 1.0000
+- correct_tool_accuracy: 1.0000
+- argument_exact_match: 1.0000
+- argument_semantic_match: 1.0000
+### inspect (n=172)
+- json_validity_rate: 1.0000
+- strict_json_rate: 1.0000
+- schema_validity_rate: 1.0000
+- valid_tool_rate: 1.0000
+- correct_tool_accuracy: 0.9651
+- argument_exact_match: 0.9651
+- argument_semantic_match: 0.9651
+### test (n=17)
+- json_validity_rate: 1.0000
+- strict_json_rate: 1.0000
+- schema_validity_rate: 1.0000
+- valid_tool_rate: 1.0000
+- correct_tool_accuracy: 1.0000
+- argument_exact_match: 1.0000
+- argument_semantic_match: 1.0000

busybeaver_eval/traces.jsonl ADDED

The diff for this file is too large to render. See raw diff

busybeaver_state.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2ebb34b27c60da61c2122f1891c42dac497cad9df546c02061245d63da106080
+size 222742359

config.json ADDED

	@@ -0,0 +1,29 @@

+{
+  "architectures": [
+    "BusyBeaverQDeltaForCausalLM"
+  ],
+  "conv_kernel_size": 4,
+  "dtype": "float32",
+  "hidden_size": 384,
+  "initializer_range": 0.02,
+  "intermediate_size": 1152,
+  "layer_pattern": [
+    "delta",
+    "delta",
+    "delta",
+    "attention"
+  ],
+  "max_position_embeddings": 2048,
+  "model_type": "busybeaver_qdelta",
+  "mtp_steps": 2,
+  "num_attention_heads": 6,
+  "num_hidden_layers": 16,
+  "num_key_value_heads": 2,
+  "num_tool_families": 8,
+  "rms_norm_eps": 1e-06,
+  "rope_theta": 1000000.0,
+  "transformers_version": "4.57.6",
+  "use_mtp": true,
+  "use_router_aux": true,
+  "vocab_size": 16384
+}
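
The config gives a four-entry `layer_pattern` but 16 hidden layers. One plausible reading, which the source does not confirm, is that the pattern tiles across the stack. The sketch below derives the layer mix under that assumption, along with the per-head width and query-to-KV head ratio under the common convention `head_dim = hidden_size / num_attention_heads`, which a custom architecture is free to override.

```python
from collections import Counter

# Values copied from config.json above.
config = {
    "hidden_size": 384,
    "num_attention_heads": 6,
    "num_key_value_heads": 2,
    "num_hidden_layers": 16,
    "layer_pattern": ["delta", "delta", "delta", "attention"],
}


def expand_layers(pattern, num_layers):
    """Tile the pattern over the stack (assumption: the pattern repeats)."""
    return [pattern[i % len(pattern)] for i in range(num_layers)]


layers = expand_layers(config["layer_pattern"], config["num_hidden_layers"])
counts = Counter(layers)

# Conventional derivations; the real model code may define these differently.
head_dim = config["hidden_size"] // config["num_attention_heads"]
q_per_kv = config["num_attention_heads"] // config["num_key_value_heads"]

print(dict(counts), head_dim, q_per_kv)
```

Under these assumptions the stack would hold 12 `delta` layers and 4 `attention` layers, with 64-dimensional heads and three query heads sharing each key/value head.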

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a1443fd3505aa19fa1c6d3ffb7b0e2e6aa82b3941be41d540192d204ea460efb
+size 197545296

rng_state.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f4a9f217e852f439efa6bd32fde98d6867f11aa6ea13ddc021ba10af6a0b0934
+size 14645

scheduler.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d8f6ffb2e60dfea393aa338a0a59969df109c78532f2f01b3606dee6d328f3e4
+size 1465

special_tokens_map.json ADDED

	@@ -0,0 +1,18 @@

+{
+  "bos_token": "<s>",
+  "eos_token": "</s>",
+  "unk_token": "<unk>",
+  "pad_token": "<pad>",
+  "additional_special_tokens": [
+    "<busybeaver_task>",
+    "</busybeaver_task>",
+    "<tool_schema>",
+    "</tool_schema>",
+    "<|system|>",
+    "<|goal|>",
+    "<|state|>",
+    "<|tools|>",
+    "<|output_schema|>",
+    "<|assistant|>"
+  ]
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,20 @@

+{
+  "bos_token": "<s>",
+  "eos_token": "</s>",
+  "unk_token": "<unk>",
+  "pad_token": "<pad>",
+  "additional_special_tokens": [
+    "<busybeaver_task>",
+    "</busybeaver_task>",
+    "<tool_schema>",
+    "</tool_schema>",
+    "<|system|>",
+    "<|goal|>",
+    "<|state|>",
+    "<|tools|>",
+    "<|output_schema|>",
+    "<|assistant|>"
+  ],
+  "model_max_length": 2048,
+  "clean_up_tokenization_spaces": false
+}
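
The six `<|...|>` special tokens registered above are the section markers of the prompt format shown in the README. A minimal sketch of assembling such a prompt follows. `build_prompt` is a hypothetical helper, the goal, state, and tools sections are passed in as already-rendered strings because the README elides their exact layout, and ending the prompt without a trailing newline after `<|assistant|>` is an assumption.

```python
# System line and output schema copied verbatim from the README prompt format.
SYSTEM_TEXT = (
    "You are BusyBeaver, a small tool-policy model. Emit exactly one JSON "
    "object matching the schema. Do not explain."
)
OUTPUT_SCHEMA = (
    '{"tool":"string","args":"object","confidence":"number",'
    '"state_update":"string"}'
)


def build_prompt(goal: str, state: str, tools: str) -> str:
    """Join the sections in README order, one marker per line.

    Hypothetical helper: callers supply pre-rendered section bodies, since
    the source does not specify how state or tool schemas are serialized.
    """
    parts = [
        "<|system|>", SYSTEM_TEXT,
        "<|goal|>", goal,
        "<|state|>", state,
        "<|tools|>", tools,
        "<|output_schema|>", OUTPUT_SCHEMA,
        "<|assistant|>",
    ]
    return "\n".join(parts)


# Placeholder section bodies, for illustration only.
prompt = build_prompt(
    goal="Fix the failing unit test.",
    state="needs_source_lookup",
    tools="read_file, run_tests",
)
print(prompt)
```

Since `model_max_length` is 2048, a harness would also need to keep the rendered prompt plus the expected JSON output within that token budget.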

trainer_state.json ADDED

	@@ -0,0 +1,190 @@

+{
+  "best_global_step": null,
+  "best_metric": null,
+  "best_model_checkpoint": null,
+  "epoch": 0.15047493651838614,
+  "eval_steps": 100,
+  "global_step": 200,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "epoch": 0.007523746825919308,
+      "grad_norm": 60.44245910644531,
+      "learning_rate": 1.7999999999999997e-05,
+      "loss": 86.4235,
+      "step": 10
+    },
+    {
+      "epoch": 0.015047493651838616,
+      "grad_norm": 40.04138946533203,
+      "learning_rate": 3.8e-05,
+      "loss": 65.01,
+      "step": 20
+    },
+    {
+      "epoch": 0.022571240477757923,
+      "grad_norm": 30.586576461791992,
+      "learning_rate": 5.7999999999999994e-05,
+      "loss": 45.3936,
+      "step": 30
+    },
+    {
+      "epoch": 0.030094987303677233,
+      "grad_norm": 27.570419311523438,
+      "learning_rate": 7.8e-05,
+      "loss": 34.0648,
+      "step": 40
+    },
+    {
+      "epoch": 0.037618734129596536,
+      "grad_norm": 26.078763961791992,
+      "learning_rate": 9.799999999999998e-05,
+      "loss": 28.1261,
+      "step": 50
+    },
+    {
+      "epoch": 0.045142480955515846,
+      "grad_norm": 24.367155075073242,
+      "learning_rate": 0.00011799999999999998,
+      "loss": 23.8896,
+      "step": 60
+    },
+    {
+      "epoch": 0.052666227781435156,
+      "grad_norm": 18.385601043701172,
+      "learning_rate": 0.000138,
+      "loss": 19.575,
+      "step": 70
+    },
+    {
+      "epoch": 0.060189974607354466,
+      "grad_norm": 13.478289604187012,
+      "learning_rate": 0.00015799999999999996,
+      "loss": 14.9189,
+      "step": 80
+    },
+    {
+      "epoch": 0.06771372143327377,
+      "grad_norm": 8.762843132019043,
+      "learning_rate": 0.000178,
+      "loss": 10.0506,
+      "step": 90
+    },
+    {
+      "epoch": 0.07523746825919307,
+      "grad_norm": 8.72549057006836,
+      "learning_rate": 0.000198,
+      "loss": 5.6379,
+      "step": 100
+    },
+    {
+      "epoch": 0.07523746825919307,
+      "eval_loss": 0.49864813685417175,
+      "eval_runtime": 38.1955,
+      "eval_samples_per_second": 32.91,
+      "eval_steps_per_second": 8.247,
+      "step": 100
+    },
+    {
+      "epoch": 0.08276121508511239,
+      "grad_norm": 6.59492826461792,
+      "learning_rate": 0.00021799999999999999,
+      "loss": 2.9898,
+      "step": 110
+    },
+    {
+      "epoch": 0.09028496191103169,
+      "grad_norm": 2.056182384490967,
+      "learning_rate": 0.00023799999999999998,
+      "loss": 1.6078,
+      "step": 120
+    },
+    {
+      "epoch": 0.09780870873695101,
+      "grad_norm": 1.4380172491073608,
+      "learning_rate": 0.000258,
+      "loss": 0.8847,
+      "step": 130
+    },
+    {
+      "epoch": 0.10533245556287031,
+      "grad_norm": 1.7172917127609253,
+      "learning_rate": 0.000278,
+      "loss": 0.6103,
+      "step": 140
+    },
+    {
+      "epoch": 0.11285620238878961,
+      "grad_norm": 0.5045933723449707,
+      "learning_rate": 0.000298,
+      "loss": 0.3398,
+      "step": 150
+    },
+    {
+      "epoch": 0.12037994921470893,
+      "grad_norm": 0.30618351697921753,
+      "learning_rate": 0.0002964,
+      "loss": 0.1732,
+      "step": 160
+    },
+    {
+      "epoch": 0.12790369604062823,
+      "grad_norm": 0.7540925145149231,
+      "learning_rate": 0.0002924,
+      "loss": 0.1196,
+      "step": 170
+    },
+    {
+      "epoch": 0.13542744286654754,
+      "grad_norm": 0.25162777304649353,
+      "learning_rate": 0.00028839999999999996,
+      "loss": 0.1363,
+      "step": 180
+    },
+    {
+      "epoch": 0.14295118969246684,
+      "grad_norm": 0.12225139141082764,
+      "learning_rate": 0.0002844,
+      "loss": 0.0717,
+      "step": 190
+    },
+    {
+      "epoch": 0.15047493651838614,
+      "grad_norm": 1.7007333040237427,
+      "learning_rate": 0.0002804,
+      "loss": 0.1011,
+      "step": 200
+    },
+    {
+      "epoch": 0.15047493651838614,
+      "eval_loss": 0.012117554433643818,
+      "eval_runtime": 38.2632,
+      "eval_samples_per_second": 32.851,
+      "eval_steps_per_second": 8.232,
+      "step": 200
+    }
+  ],
+  "logging_steps": 10,
+  "max_steps": 900,
+  "num_input_tokens_seen": 0,
+  "num_train_epochs": 1,
+  "save_steps": 100,
+  "stateful_callbacks": {
+    "TrainerControl": {
+      "args": {
+        "should_epoch_stop": false,
+        "should_evaluate": false,
+        "should_log": false,
+        "should_save": true,
+        "should_training_stop": false
+      },
+      "attributes": {}
+    }
+  },
+  "total_flos": 3388839926169600.0,
+  "train_batch_size": 4,
+  "trial_name": null,
+  "trial_params": null
+}
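
The logged `learning_rate` values above are consistent with a linear warmup to 3e-4 over 150 steps followed by a linear decay to zero at `max_steps` = 900. That schedule is inferred from the log, not stated in the source; the actual settings live in `training_args.bin`. The sketch below reproduces the logged values under that assumption; `inferred_lr` and its defaults are hypothetical.

```python
def inferred_lr(step, peak_lr=3e-4, warmup_steps=150, max_steps=900):
    """Learning rate logged at `step` under the inferred linear schedule.

    The log reports the rate in effect during the step, i.e. after
    `step - 1` scheduler updates (an inference from the logged numbers).
    """
    updates = step - 1
    if updates < warmup_steps:
        return peak_lr * updates / warmup_steps
    remaining = max(0, max_steps - updates)
    return peak_lr * remaining / (max_steps - warmup_steps)


# A few (step, learning_rate) pairs copied from log_history above.
logged = {10: 1.8e-5, 100: 1.98e-4, 150: 2.98e-4, 160: 2.964e-4, 200: 2.804e-4}
for step, lr in logged.items():
    print(step, lr, inferred_lr(step))
```

If the inference holds, checkpoint 200 was saved shortly after the warmup peak, with 700 of the 900 scheduled steps still to run.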

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0d2535d2d034cc23032036ce4ac74d7680256890d3513f7979a162ddb0b04ced
+size 5841