
Llama-Claw-8B Q4_K_M

A fine-tune of Meta Llama 3.1 8B Instruct for OpenClaw tool calling, trained with QLoRA + DPO + multi-level SFT to use all 24 OpenClaw tools correctly.

Model Details

| Field | Value |
|---|---|
| Base model | meta-llama/Llama-3.1-8B-Instruct |
| Quantization | Q4_K_M (4.6GB GGUF) |
| Training method | QLoRA (4-bit) + DPO + SFT |
| Hardware | NVIDIA RTX PRO 6000 (97GB VRAM) |
| Framework | Unsloth 2026.2.1 + TRL |
| Best version | v4 (combined) |
| Benchmark | 89% core (12 tools) / 86% all (24 tools) |

Files

  • llama-claw-8b-Q4_K_M.gguf - Ready for Ollama/llama.cpp (4.6GB)
  • lora-adapter/ - LoRA weights (671MB), merge with base model for full precision
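To run the GGUF with Ollama, a minimal Modelfile is enough. This is a sketch: the model name and the temperature value are arbitrary choices, not part of this release.

```
FROM ./llama-claw-8b-Q4_K_M.gguf
PARAMETER temperature 0.2
```

Then `ollama create llama-claw -f Modelfile` followed by `ollama run llama-claw`. For llama.cpp, the GGUF can be loaded directly, e.g. `llama-cli -m llama-claw-8b-Q4_K_M.gguf`.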

Version History

v1 - DPO Baseline (67%)

  • Training: SFT on 9,566 data-farm examples (broad tool coverage), then DPO on 2,974 preference pairs
  • Data source: Automated data farm - 5 GPUs across 3 machines generating conversations. A base qwen2.5:7b model was given tasks and scored by a quality scorer; failures became the DPO rejected examples
  • Method: QLoRA 4-bit, 3 epochs SFT then 1 epoch DPO (beta=0.1)
  • Result: 67% on the 9-test core benchmark (36 runs over 4 rounds)
  • Strengths: basic tool calling works - read, exec, and write are functional
  • Weaknesses: Multi-step chains unreliable, edit hallucinations, debug failures

v2 - Level 2 SFT (78%)

  • Training: Continued from the v1 DPO checkpoint with 999 Claude-crafted gold-standard examples
  • Data source: Claude Opus generated high-quality training conversations targeting weak areas: tc_edit (25%), debug (28%), process (14%), advanced tools (9%)
  • Method: SFT from DPO checkpoint, 5 epochs, lr=5e-5
  • Result: 78% (+11% from v1)
  • Improvements: edit 4/4, multi 4/4, chain 3/4, debug 1/4
  • Weaknesses: tc_edit still 0/4, debug inconsistent

v3 - Level 3 Targeted SFT (75% - REGRESSION)

  • Training: Continued from v2 + 994 narrowly targeted examples focusing on tc_edit, debug, chain
  • Data source: Claude-crafted examples heavily focused on remaining failures
  • Method: SFT from v2 checkpoint, 3 epochs
  • Result: 75% (-3% from v2) - catastrophic forgetting
  • Problem: Narrow training on tc_edit/debug/chain caused edit to drop from 4/4 to 2/4 and multi from 4/4 to 3/4
  • Lesson: Training on top of a previous SFT checkpoint with narrow data causes forgetting

v4 - Combined SFT (89% - BEST)

  • Training: Combined Level 2 (999) + Level 3 (994) = 1,993 examples in ONE pass from the DPO checkpoint
  • Data source: Merged L2 + L3 datasets, trained from DPO checkpoint (not from v2/v3)
  • Method: SFT from DPO checkpoint, 3 epochs, 675 steps, lr=5e-5
  • Result: 89% core / 86% all-tools (+22% from v1)
  • Key insight: Combined training from DPO base prevents catastrophic forgetting
  • Core scores (12 tools): read 4/4, exec 4/4, write 4/4, edit 4/4, multi 4/4, process 4/4, chain 4/4, debug 4/4, tc_edit 0/4
  • Extended scores (24 tools): webfetch 2/4, cronlist 3/4, agentlst 4/4, websrch 4/4, toolsel 4/4
  • Tool selection: a perfect 4/4 - correctly picks browser, message, sessions_spawn, and memory_search without tool-specific training
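The v4 recipe above amounts to concatenating the two SFT datasets and shuffling them for a single pass from the DPO checkpoint, rather than stacking SFT runs. A minimal sketch (the stand-in records and the seed are illustrative, not the actual training data or config):

```python
import random

def combine_sft_datasets(level2, level3, seed=42):
    """Merge two SFT datasets into one shuffled training pass.

    v4's key move: train ONCE on the combined data from the DPO
    checkpoint, instead of continuing from v2/v3 (which caused
    v3's catastrophic forgetting).
    """
    combined = list(level2) + list(level3)
    random.Random(seed).shuffle(combined)  # deterministic shuffle
    return combined

# Stand-in records; the real data is 999 + 994 conversations.
level2 = [{"source": "L2", "id": i} for i in range(999)]
level3 = [{"source": "L3", "id": i} for i in range(994)]
combined = combine_sft_datasets(level2, level3)
print(len(combined))  # 1993
```

The single combined pass keeps the broad Level 2 distribution present throughout training, so the narrow Level 3 examples cannot dominate any stretch of optimization.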

v5 - tc_edit Fix Attempt (77% - REGRESSION)

  • Training: v4 base (1,993) + 400 tc_edit fix examples = 2,393 total
  • Data source: Generated by generate_tc_edit_fix.py, which teaches the model to include surrounding context in oldText so that edit matches are unique (the target file has 8 identical solution None lines)
  • Method: SFT from DPO checkpoint, 3 epochs, 810 steps
  • Result: 77% (-12% from v4)
  • Problem: 400 tc_edit examples with full challenge.py content (~5KB each) dominated the training tokens; edit dropped from 4/4 to 3/4, multi from 4/4 to 2/4, and debug from 4/4 to 2/4
  • Silver lining: tc_edit improved from 0/4 to 1/4 (tc1 passed for the first time)
  • Lesson: training examples with large embedded file content carry a disproportionate share of the token count
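The tc_edit failure mode itself is easy to reproduce: when oldText occurs more than once, a literal find-and-replace is ambiguous, and including surrounding lines makes the match unique. A self-contained sketch (the file content is invented for illustration, not the real challenge.py):

```python
content = (
    "def part_a():\n"
    "    solution = None\n"
    "    return solution\n"
    "\n"
    "def part_b():\n"
    "    solution = None\n"
    "    return solution\n"
)

# Naive oldText: occurs twice, so the edit target is ambiguous.
assert content.count("    solution = None\n") == 2

# With surrounding context the match becomes unique, and the
# replacement lands only inside part_b.
old = "def part_b():\n    solution = None\n"
new = "def part_b():\n    solution = 42\n"
assert content.count(old) == 1
patched = content.replace(old, new)
print(patched.count("solution = 42"))  # 1
```

This is exactly the behavior the fix examples try to teach: emit oldText with enough context that it matches once.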

v6 - Reduced tc_edit Fix (83% - REGRESSION)

  • Training: v4 base (1,993) + 50 tc_edit fix examples (sampled from 400) = 2,043 total
  • Data source: Random sample of 50 from v5's 400 tc_edit examples
  • Method: SFT from DPO checkpoint, 3 epochs, ~690 steps
  • Result: 83% (-6% from v4)
  • Problem: Even 50 tc_edit examples with full challenge.py content still poison the general edit skill; edit dropped from 4/4 to 1/4
  • tc_edit: 1/4 (same as v5)
  • Conclusion: the tc_edit SFT approach is fundamentally flawed; it needs DPO pairs instead
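A DPO preference pair for this failure would contrast the context-qualified edit against the ambiguous one on the same prompt. This sketch uses the prompt/chosen/rejected field convention TRL's DPOTrainer consumes; the tool-call payloads are illustrative, not the actual OpenClaw schema:

```python
import json

# One preference pair: identical prompt, preferred vs. dispreferred
# completion. Field names follow TRL's DPO convention; the edit-call
# structure is an invented stand-in for the real tool format.
pair = {
    "prompt": "Change part_b's solution to 42 in challenge.py "
              "(the file contains several identical solution lines).",
    "chosen": json.dumps({
        "tool": "edit",
        "path": "challenge.py",
        "oldText": "def part_b():\n    solution = None",
        "newText": "def part_b():\n    solution = 42",
    }),
    "rejected": json.dumps({
        "tool": "edit",
        "path": "challenge.py",
        "oldText": "    solution = None",  # ambiguous: matches many places
        "newText": "    solution = 42",
    }),
}

print(sorted(pair))  # ['chosen', 'prompt', 'rejected']
```

Unlike SFT, the pair carries the full file content at most once per example in the prompt, and the gradient signal targets the contrast between the two edits rather than reproducing the file verbatim.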

Benchmark Details

Nine core tests are run 4 rounds each (36 total), plus 5 extended tests exercising all 24 tools (20 more, for 56 grand total):

| Test | Description | v1 | v2 | v3 | v4 | v5 | v6 |
|---|---|---|---|---|---|---|---|
| read | Read /etc/os-release | 3/4 | 4/4 | 4/4 | 4/4 | 4/4 | 4/4 |
| exec | Run hostname + uname | 3/4 | 4/4 | 4/4 | 4/4 | 4/4 | 4/4 |
| write | Write file to disk | 3/4 | 4/4 | 4/4 | 4/4 | 4/4 | 4/4 |
| edit | Replace text in file | 2/4 | 4/4 | 2/4 | 4/4 | 3/4 | 1/4 |
| multi | Multi-step (varies) | 2/4 | 4/4 | 3/4 | 4/4 | 2/4 | 4/4 |
| tc_edit | Edit non-unique text | 0/4 | 0/4 | 0/4 | 0/4 | 1/4 | 1/4 |
| process | Run Python one-liner | 3/4 | 4/4 | 4/4 | 4/4 | 4/4 | 4/4 |
| chain | Read then write | 2/4 | 3/4 | 4/4 | 4/4 | 4/4 | 4/4 |
| debug | Fix buggy Python | 1/4 | 1/4 | 2/4 | 4/4 | 2/4 | 4/4 |
| Total | | 67% | 78% | 75% | 89% | 77% | 83% |
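The headline percentages are pass counts over the 36 core runs; for example, v4's 89% can be reproduced from its per-test scores:

```python
# v4 per-test pass counts from the benchmark table (4 rounds each).
v4 = {"read": 4, "exec": 4, "write": 4, "edit": 4, "multi": 4,
      "tc_edit": 0, "process": 4, "chain": 4, "debug": 4}
passed = sum(v4.values())           # 32 of 36 runs
total = 4 * len(v4)                 # 9 tests x 4 rounds
print(round(100 * passed / total))  # 89
```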

OpenClaw Tools (24 total)

Core (12): read, write, edit, exec, process, web_search, web_fetch, memory_search, memory_get, image, cron, session_status

Extended (12): browser, message, sessions_spawn, sessions_send, sessions_list, sessions_history, agents_list, tts, gateway, nodes, canvas, subagents

Training Pipeline

Data-farm SFT (9,566 examples) → DPO (2,974 preference pairs) → combined SFT (1,993 Claude-crafted examples) → Q4_K_M quantization

License

This model is based on Meta Llama 3.1 8B Instruct and subject to the Llama 3.1 Community License.
