
Llama-Claw-8B Q4_K_M

A fine-tune of Meta Llama 3.1 8B Instruct for OpenClaw tool calling, trained with QLoRA + DPO + multi-level SFT to use all 24 OpenClaw tools correctly.

Model Details

| Field | Value |
|---|---|
| Base model | meta-llama/Llama-3.1-8B-Instruct |
| Quantization | Q4_K_M (4.6GB GGUF) |
| Training method | QLoRA (4-bit) + DPO + SFT |
| Hardware | NVIDIA RTX PRO 6000 (97GB VRAM) |
| Framework | Unsloth 2026.2.1 + TRL |
| Best version | v4 (combined) |
| Benchmark | 89% core (12 tools) / 86% all (24 tools) |

Files

  • llama-claw-8b-Q4_K_M.gguf - Ready for Ollama/llama.cpp (4.6GB)
  • lora-adapter/ - LoRA weights (671MB), merge with base model for full precision
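To run the GGUF with Ollama, a minimal Modelfile is enough. This is a sketch: the model name and the temperature value are arbitrary choices, not part of this release.

```
FROM ./llama-claw-8b-Q4_K_M.gguf
PARAMETER temperature 0.2
```

Then `ollama create llama-claw -f Modelfile` followed by `ollama run llama-claw`. For llama.cpp, the GGUF can be loaded directly, e.g. `llama-cli -m llama-claw-8b-Q4_K_M.gguf`.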

Version History

v1 - DPO Baseline (67%)

  • Training: SFT on 9,566 data-farm examples (broad tool coverage), then DPO on 2,974 preference pairs
  • Data source: Automated data farm - 5 GPUs across 3 machines generating conversations. A base qwen2.5:7b model was given tasks and scored by a quality scorer; failures became the DPO rejected examples
  • Method: QLoRA 4-bit, 3 epochs SFT then 1 epoch DPO (beta=0.1)
  • Result: 67% on the 9-test core benchmark (36 runs over 4 rounds)
  • Strengths: basic tool calling works - read, exec, and write are functional
  • Weaknesses: Multi-step chains unreliable, edit hallucinations, debug failures

v2 - Level 2 SFT (78%)

  • Training: Continued from the v1 DPO checkpoint with 999 Claude-crafted gold-standard examples
  • Data source: Claude Opus generated high-quality training conversations targeting weak areas: tc_edit (25%), debug (28%), process (14%), advanced tools (9%)
  • Method: SFT from DPO checkpoint, 5 epochs, lr=5e-5
  • Result: 78% (+11% from v1)
  • Improvements: edit 4/4, multi 4/4, chain 3/4, debug 1/4
  • Weaknesses: tc_edit still 0/4, debug inconsistent

v3 - Level 3 Targeted SFT (75% - REGRESSION)

  • Training: Continued from v2 + 994 narrowly targeted examples focusing on tc_edit, debug, chain
  • Data source: Claude-crafted examples heavily focused on remaining failures
  • Method: SFT from v2 checkpoint, 3 epochs
  • Result: 75% (-3% from v2) - catastrophic forgetting
  • Problem: Narrow training on tc_edit/debug/chain caused edit to drop from 4/4 to 2/4 and multi from 4/4 to 3/4
  • Lesson: Training on top of a previous SFT checkpoint with narrow data causes forgetting

v4 - Combined SFT (89% - BEST)

  • Training: Combined Level 2 (999) + Level 3 (994) = 1,993 examples in ONE pass from the DPO checkpoint
  • Data source: Merged L2 + L3 datasets, trained from DPO checkpoint (not from v2/v3)
  • Method: SFT from DPO checkpoint, 3 epochs, 675 steps, lr=5e-5
  • Result: 89% core / 86% all-tools (+22% from v1)
  • Key insight: Combined training from DPO base prevents catastrophic forgetting
  • Core scores (12 tools): read 4/4, exec 4/4, write 4/4, edit 4/4, multi 4/4, process 4/4, chain 4/4, debug 4/4, tc_edit 0/4
  • Extended scores (24 tools): webfetch 2/4, cronlist 3/4, agentlst 4/4, websrch 4/4, toolsel 4/4
  • Tool selection: a perfect 4/4 - correctly picks browser, message, sessions_spawn, and memory_search without tool-specific training
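The v4 recipe above amounts to concatenating the two SFT datasets and shuffling them for a single pass from the DPO checkpoint, rather than stacking SFT runs. A minimal sketch (the stand-in records and the seed are illustrative, not the actual training data or config):

```python
import random

def combine_sft_datasets(level2, level3, seed=42):
    """Merge two SFT datasets into one shuffled training pass.

    v4's key move: train ONCE on the combined data from the DPO
    checkpoint, instead of continuing from v2/v3 (which caused
    v3's catastrophic forgetting).
    """
    combined = list(level2) + list(level3)
    random.Random(seed).shuffle(combined)  # deterministic shuffle
    return combined

# Stand-in records; the real data is 999 + 994 conversations.
level2 = [{"source": "L2", "id": i} for i in range(999)]
level3 = [{"source": "L3", "id": i} for i in range(994)]
combined = combine_sft_datasets(level2, level3)
print(len(combined))  # 1993
```

The single combined pass keeps the broad Level 2 distribution present throughout training, so the narrow Level 3 examples cannot dominate any stretch of optimization.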

v5 - tc_edit Fix Attempt (77% - REGRESSION)

  • Training: v4 base (1,993) + 400 tc_edit fix examples = 2,393 total
  • Data source: Generated by generate_tc_edit_fix.py, which teaches the model to include surrounding context in oldText so that edit matches are unique (the target file has 8 identical solution None lines)
  • Method: SFT from DPO checkpoint, 3 epochs, 810 steps
  • Result: 77% (-12% from v4)
  • Problem: 400 tc_edit examples with full challenge.py content (~5KB each) dominated the training tokens; edit dropped from 4/4 to 3/4, multi from 4/4 to 2/4, and debug from 4/4 to 2/4
  • Silver lining: tc_edit improved from 0/4 to 1/4 (tc1 passed for the first time)
  • Lesson: training examples with large embedded file content carry a disproportionate share of the token count
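The tc_edit failure mode itself is easy to reproduce: when oldText occurs more than once, a literal find-and-replace is ambiguous, and including surrounding lines makes the match unique. A self-contained sketch (the file content is invented for illustration, not the real challenge.py):

```python
content = (
    "def part_a():\n"
    "    solution = None\n"
    "    return solution\n"
    "\n"
    "def part_b():\n"
    "    solution = None\n"
    "    return solution\n"
)

# Naive oldText: occurs twice, so the edit target is ambiguous.
assert content.count("    solution = None\n") == 2

# With surrounding context the match becomes unique, and the
# replacement lands only inside part_b.
old = "def part_b():\n    solution = None\n"
new = "def part_b():\n    solution = 42\n"
assert content.count(old) == 1
patched = content.replace(old, new)
print(patched.count("solution = 42"))  # 1
```

This is exactly the behavior the fix examples try to teach: emit oldText with enough context that it matches once.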

v6 - Reduced tc_edit Fix (83% - REGRESSION)

  • Training: v4 base (1,993) + 50 tc_edit fix examples (sampled from 400) = 2,043 total
  • Data source: Random sample of 50 from v5's 400 tc_edit examples
  • Method: SFT from DPO checkpoint, 3 epochs, ~690 steps
  • Result: 83% (-6% from v4)
  • Problem: Even 50 tc_edit examples with full challenge.py content still poison the general edit skill; edit dropped from 4/4 to 1/4
  • tc_edit: 1/4 (same as v5)
  • Conclusion: the tc_edit SFT approach is fundamentally flawed; it needs DPO pairs instead
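A DPO preference pair for this failure would contrast the context-qualified edit against the ambiguous one on the same prompt. This sketch uses the prompt/chosen/rejected field convention TRL's DPOTrainer consumes; the tool-call payloads are illustrative, not the actual OpenClaw schema:

```python
import json

# One preference pair: identical prompt, preferred vs. dispreferred
# completion. Field names follow TRL's DPO convention; the edit-call
# structure is an invented stand-in for the real tool format.
pair = {
    "prompt": "Change part_b's solution to 42 in challenge.py "
              "(the file contains several identical solution lines).",
    "chosen": json.dumps({
        "tool": "edit",
        "path": "challenge.py",
        "oldText": "def part_b():\n    solution = None",
        "newText": "def part_b():\n    solution = 42",
    }),
    "rejected": json.dumps({
        "tool": "edit",
        "path": "challenge.py",
        "oldText": "    solution = None",  # ambiguous: matches many places
        "newText": "    solution = 42",
    }),
}

print(sorted(pair))  # ['chosen', 'prompt', 'rejected']
```

Unlike SFT, the pair carries the full file content at most once per example in the prompt, and the gradient signal targets the contrast between the two edits rather than reproducing the file verbatim.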

Benchmark Details

Nine core tests are run 4 rounds each (36 total), plus 5 extended tests exercising all 24 tools (20 more, for 56 grand total):

| Test | Description | v1 | v2 | v3 | v4 | v5 | v6 |
|---|---|---|---|---|---|---|---|
| read | Read /etc/os-release | 3/4 | 4/4 | 4/4 | 4/4 | 4/4 | 4/4 |
| exec | Run hostname + uname | 3/4 | 4/4 | 4/4 | 4/4 | 4/4 | 4/4 |
| write | Write file to disk | 3/4 | 4/4 | 4/4 | 4/4 | 4/4 | 4/4 |
| edit | Replace text in file | 2/4 | 4/4 | 2/4 | 4/4 | 3/4 | 1/4 |
| multi | Multi-step (varies) | 2/4 | 4/4 | 3/4 | 4/4 | 2/4 | 4/4 |
| tc_edit | Edit non-unique text | 0/4 | 0/4 | 0/4 | 0/4 | 1/4 | 1/4 |
| process | Run Python one-liner | 3/4 | 4/4 | 4/4 | 4/4 | 4/4 | 4/4 |
| chain | Read then write | 2/4 | 3/4 | 4/4 | 4/4 | 4/4 | 4/4 |
| debug | Fix buggy Python | 1/4 | 1/4 | 2/4 | 4/4 | 2/4 | 4/4 |
| Total | | 67% | 78% | 75% | 89% | 77% | 83% |
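The headline percentages are pass counts over the 36 core runs; for example, v4's 89% can be reproduced from its per-test scores:

```python
# v4 per-test pass counts from the benchmark table (4 rounds each).
v4 = {"read": 4, "exec": 4, "write": 4, "edit": 4, "multi": 4,
      "tc_edit": 0, "process": 4, "chain": 4, "debug": 4}
passed = sum(v4.values())           # 32 of 36 runs
total = 4 * len(v4)                 # 9 tests x 4 rounds
print(round(100 * passed / total))  # 89
```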

OpenClaw Tools (24 total)

Core (12): read, write, edit, exec, process, web_search, web_fetch, memory_search, memory_get, image, cron, session_status

Extended (12): browser, message, sessions_spawn, sessions_send, sessions_list, sessions_history, agents_list, tts, gateway, nodes, canvas, subagents

Training Pipeline

Data-farm SFT (9,566 examples) → DPO (2,974 preference pairs) → combined SFT (1,993 Claude-crafted examples) → Q4_K_M quantization

License

This model is based on Meta Llama 3.1 8B Instruct and subject to the Llama 3.1 Community License.
