# Llama-Claw-8B Q4_K_M

Fine-tuned Meta Llama 3.1 8B Instruct for OpenClaw tool calling, trained via QLoRA + DPO + multi-level SFT to use all 24 OpenClaw tools correctly.
## Model Details
| Field | Value |
|---|---|
| Base model | meta-llama/Llama-3.1-8B-Instruct |
| Quantization | Q4_K_M (4.6GB GGUF) |
| Training method | QLoRA (4-bit) + DPO + SFT |
| Hardware | NVIDIA RTX PRO 6000 (97GB VRAM) |
| Framework | Unsloth 2026.2.1 + TRL |
| Best version | v4 (combined) |
| Benchmark | 89% core (12 tools) / 86% all (24 tools) |
## Files

- `llama-claw-8b-Q4_K_M.gguf` - ready for Ollama/llama.cpp (4.6 GB)
- `lora-adapter/` - LoRA adapter weights (671 MB); merge with the base model for full precision
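To serve the GGUF locally with Ollama, a minimal Modelfile is enough (a sketch; the file name comes from the list above, and the chat template/parameters fall back to defaults, which may need tuning for Llama 3.1's format):

```
# Modelfile - minimal sketch, no template/parameter overrides
FROM ./llama-claw-8b-Q4_K_M.gguf
```

Then `ollama create llama-claw -f Modelfile` followed by `ollama run llama-claw`.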
## Version History
### v1 - DPO Baseline (67%)
- Training: SFT on 9,566 data-farm examples (broad tool coverage), then DPO on 2,974 preference pairs
- Data source: Automated data farm - 5 GPUs across 3 machines generating conversations. A base qwen2.5:7b model is given tasks, outputs are scored by a quality scorer, and failures become DPO rejected examples
- Method: QLoRA 4-bit, 3 epochs SFT then 1 epoch DPO (beta=0.1)
- Result: 67% on 9-test core benchmark (36 tests over 4 rounds)
- Strengths: Basic tool calling works - read, exec, write functional
- Weaknesses: Multi-step chains unreliable, edit hallucinations, debug failures
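The farm-to-DPO flow above can be sketched as follows (a minimal illustration; function and field names are assumptions, not the actual data-farm code):

```python
# Sketch: turn quality-scored generations into DPO preference pairs.
# High-scoring responses become "chosen", failed ones become "rejected".

def build_dpo_pairs(conversations, threshold=0.7):
    """Group scored generations by prompt and pair passing vs. failing responses."""
    by_prompt = {}
    for conv in conversations:
        by_prompt.setdefault(conv["prompt"], []).append(conv)

    pairs = []
    for prompt, convs in by_prompt.items():
        passed = [c for c in convs if c["score"] >= threshold]
        failed = [c for c in convs if c["score"] < threshold]
        for good in passed:
            for bad in failed:
                pairs.append({
                    "prompt": prompt,
                    "chosen": good["response"],
                    "rejected": bad["response"],
                })
    return pairs
```

Each dict in the output matches the `{prompt, chosen, rejected}` shape that TRL's DPO trainer consumes.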
### v2 - Level 2 SFT (78%)
- Training: Continued from v1 DPO checkpoint + 999 Claude-crafted gold-standard examples
- Data source: Claude Opus generated high-quality training conversations targeting weak areas: tc_edit (25%), debug (28%), process (14%), advanced tools (9%)
- Method: SFT from DPO checkpoint, 5 epochs, lr=5e-5
- Result: 78% (+11% from v1)
- Improvements: edit 4/4, multi 4/4, chain 3/4, debug 1/4
- Weaknesses: tc_edit still 0/4, debug inconsistent
### v3 - Level 3 Targeted SFT (75% - REGRESSION)
- Training: Continued from v2 + 994 narrowly targeted examples focusing on tc_edit, debug, chain
- Data source: Claude-crafted examples heavily focused on remaining failures
- Method: SFT from v2 checkpoint, 3 epochs
- Result: 75% (-3% from v2) - catastrophic forgetting
- Problem: Narrow training on tc_edit/debug/chain caused edit to drop from 4/4 to 2/4 and multi from 4/4 to 3/4
- Lesson: Training on top of previous SFT with narrow data causes forgetting
### v4 - Combined SFT (89% - BEST)
- Training: Combined Level 2 (999) + Level 3 (994) = 1,993 examples in ONE pass from DPO checkpoint
- Data source: Merged L2 + L3 datasets, trained from DPO checkpoint (not from v2/v3)
- Method: SFT from DPO checkpoint, 3 epochs, 675 steps, lr=5e-5
- Result: 89% core / 86% all-tools (+22% from v1)
- Key insight: Combined training from DPO base prevents catastrophic forgetting
- Core scores (9 tests over the 12-tool core set): read 4/4, exec 4/4, write 4/4, edit 4/4, multi 4/4, process 4/4, chain 4/4, debug 4/4, tc_edit 0/4
- Extended scores (5 tests over all 24 tools): webfetch 2/4, cronlist 3/4, agentlst 4/4, websrch 4/4, toolsel 4/4
- Tool selection: 4/4 perfect - correctly picks browser, message, sessions_spawn, memory_search without specific training
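The v4 recipe (merge both datasets, restart from the DPO checkpoint, train in one pass) can be sketched as follows (paths and the helper name are assumptions, not the actual training script):

```python
import json
import random

def merge_sft_datasets(paths, seed=42):
    """Merge multiple JSONL SFT datasets and shuffle them into one pass,
    so no subset is seen last and overwrites earlier skills (the v3 failure)."""
    examples = []
    for path in paths:
        with open(path) as f:
            examples.extend(json.loads(line) for line in f if line.strip())
    random.Random(seed).shuffle(examples)
    return examples

# merged = merge_sft_datasets(["level2.jsonl", "level3.jsonl"])
# -> 999 + 994 = 1,993 examples, then SFT from the DPO checkpoint
```

The key design choice is the restart point: training resumes from the DPO checkpoint, not from v2 or v3, so the narrow L3 data never overwrites a prior SFT pass.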
### v5 - tc_edit Fix Attempt (77% - REGRESSION)
- Training: v4 base (1,993) + 400 tc_edit fix examples = 2,393 total
- Data source: Generated by `generate_tc_edit_fix.py`, which teaches the model to include surrounding context in `oldText` so that edit matches are unique (the target file has 8 identical `solution None` lines)
- Method: SFT from DPO checkpoint, 3 epochs, 810 steps
- Result: 77% (-12% from v4)
- Problem: 400 tc_edit examples each embedding the full challenge.py content (~5 KB) dominated the training tokens; edit dropped from 4/4 to 3/4, multi from 4/4 to 2/4, and debug from 4/4 to 2/4
- Silver lining: tc_edit improved from 0/4 to 1/4 (tc1 passed for the first time)
- Lesson: Training data with large embedded file content is disproportionately heavy in token count
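The behavior these examples try to teach can be illustrated with a minimal edit helper (a sketch, not the actual OpenClaw `edit` tool): an ambiguous `oldText` must be rejected, and adding surrounding context makes the match unique.

```python
def apply_edit(text, old_text, new_text):
    """Replace old_text with new_text, refusing ambiguous matches:
    if old_text occurs more than once, the caller must widen it with
    surrounding context lines until exactly one match remains."""
    count = text.count(old_text)
    if count == 0:
        raise ValueError("oldText not found")
    if count > 1:
        raise ValueError(f"oldText matches {count} times; add context to disambiguate")
    return text.replace(old_text, new_text)
```

With several identical lines in a file, a bare one-line `oldText` raises, while an `oldText` that includes the enclosing function header succeeds.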
### v6 - Reduced tc_edit Fix (83% - REGRESSION)
- Training: v4 base (1,993) + 50 tc_edit fix examples (sampled from 400) = 2,043 total
- Data source: A random sample of 50 of the 400 v5 tc_edit examples
- Method: SFT from DPO checkpoint, 3 epochs, ~690 steps
- Result: 83% (-6% from v4)
- Problem: Even 50 tc_edit examples with full challenge.py content still poison general edit skills; edit dropped from 4/4 to 1/4
- tc_edit: 1/4 (same as v5)
- Conclusion: The tc_edit SFT approach is fundamentally flawed; it needs DPO pairs instead
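The v5/v6 failure mode (a handful of examples with large embedded files dominating the token budget) can be caught before training with a rough share check (whitespace splitting as a cheap proxy; real counts need the model tokenizer):

```python
def token_share(datasets):
    """Approximate each dataset's share of total training tokens.
    datasets: {name: [example_text, ...]}; uses str.split() as a proxy."""
    counts = {name: sum(len(t.split()) for t in texts)
              for name, texts in datasets.items()}
    total = sum(counts.values()) or 1
    return {name: c / total for name, c in counts.items()}
```

With 1,993 short general examples against 400 examples that each embed a multi-kilobyte file, the tc_edit slice can easily account for the vast majority of tokens even though it is a fifth of the examples.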
## Benchmark Details
9 core tests, run 4 rounds each (36 runs), plus 5 extended tests exercising all 24 tools (20 runs; 56 runs in total):
| Test | Description | v1 | v2 | v3 | v4 | v5 | v6 |
|---|---|---|---|---|---|---|---|
| read | Read /etc/os-release | 3/4 | 4/4 | 4/4 | 4/4 | 4/4 | 4/4 |
| exec | Run hostname + uname | 3/4 | 4/4 | 4/4 | 4/4 | 4/4 | 4/4 |
| write | Write file to disk | 3/4 | 4/4 | 4/4 | 4/4 | 4/4 | 4/4 |
| edit | Replace text in file | 2/4 | 4/4 | 2/4 | 4/4 | 3/4 | 1/4 |
| multi | Multi-step (varies) | 2/4 | 4/4 | 3/4 | 4/4 | 2/4 | 4/4 |
| tc_edit | Edit non-unique text | 0/4 | 0/4 | 0/4 | 0/4 | 1/4 | 1/4 |
| process | Run Python one-liner | 3/4 | 4/4 | 4/4 | 4/4 | 4/4 | 4/4 |
| chain | Read then write | 2/4 | 3/4 | 4/4 | 4/4 | 4/4 | 4/4 |
| debug | Fix buggy Python | 1/4 | 1/4 | 2/4 | 4/4 | 2/4 | 4/4 |
| Total | | 67% | 78% | 75% | 89% | 77% | 83% |
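The percentages follow directly from the per-test scores; for example v4's core figure (scores taken from the table above):

```python
# v4 per-test passes out of 4 rounds each, from the benchmark table
v4 = {"read": 4, "exec": 4, "write": 4, "edit": 4, "multi": 4,
      "tc_edit": 0, "process": 4, "chain": 4, "debug": 4}

passed = sum(v4.values())                      # 32 of 36 runs
core_pct = round(100 * passed / (len(v4) * 4))  # -> 89
```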
## OpenClaw Tools (24 total)

Core (12): read, write, edit, exec, process, web_search, web_fetch, memory_search, memory_get, image, cron, session_status

Extended (12): browser, message, sessions_spawn, sessions_send, sessions_list, sessions_history, agents_list, tts, gateway, nodes, canvas, subagents
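The split above as a lookup table (a convenience sketch; tool names are taken verbatim from the lists, but the dispatch logic is not OpenClaw's):

```python
CORE = {"read", "write", "edit", "exec", "process", "web_search",
        "web_fetch", "memory_search", "memory_get", "image", "cron",
        "session_status"}
EXTENDED = {"browser", "message", "sessions_spawn", "sessions_send",
            "sessions_list", "sessions_history", "agents_list", "tts",
            "gateway", "nodes", "canvas", "subagents"}

def tool_tier(name):
    """Return which tier a tool call targets, or raise on unknown tools."""
    if name in CORE:
        return "core"
    if name in EXTENDED:
        return "extended"
    raise ValueError(f"unknown tool: {name}")
```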
## Training Pipeline
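Summarized from the version history above, the best (v4) pipeline is roughly:

```
data farm (qwen2.5:7b + quality scorer, 5 GPUs / 3 machines)
      │ 9,566 SFT examples        │ 2,974 preference pairs
      ▼                           ▼
SFT (QLoRA 4-bit, 3 epochs)  →  DPO (1 epoch, beta=0.1)
      │
      ▼
combined SFT from the DPO checkpoint (999 L2 + 994 L3 = 1,993 examples, 3 epochs)
      │
      ▼
merge LoRA adapter → quantize → Q4_K_M GGUF
```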
## License
This model is based on Meta Llama 3.1 8B Instruct and subject to the Llama 3.1 Community License.