Tool Calling LoRA Adapters

LoRA adapters for improving LLM tool calling, trained as part of the research paper "What Actually Improves LLM Tool Calling?"

Model Description

These are LoRA adapters (rank 8, alpha 16) trained on top of Qwen2.5-1.5B-Instruct for function calling tasks using the Berkeley Function Calling Leaderboard (BFCL) dataset.

Key Findings

Our ablation study found that:

  • SFT alone provides a +47-point accuracy improvement (9.7% → 57%)
  • DPO and RL add <1 point when applied post-SFT
  • Training data diversity matters more than quantity: 500 diverse examples outperform 500 homogeneous examples by 26 points
  • Tool generalization works well (79% on unseen tools) but pattern generalization is harder (42% on unseen patterns)

Available Adapters

| Adapter | Description | Accuracy |
|---|---|---|
| `sft/` | SFT on diverse BFCL data | 57.0% |
| `sft_dpo/` | SFT + DPO preference tuning | 57.7% |
| `sft_rl/` | SFT + reward-filtered RL | 58.0% |
| `tool_generalization/sft/` | SFT for unseen-tools experiment | 79% on held-out tools |
| `category_holdout/sft/` | SFT for pattern-generalization experiment | 42% on held-out patterns |
| `diversity/high_diversity/` | Diverse training (125 examples × 4 categories) | 53% |
| `diversity/low_diversity/` | Homogeneous training (500 simple examples) | 27% |

Usage

```python
from mlx_lm import load, generate

# Load base model with SFT adapter
model, tokenizer = load(
    "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
    adapter_path="path/to/sft"
)

# Format your prompt with function definitions
messages = [
    {"role": "system", "content": "You are a helpful assistant with access to functions..."},
    {"role": "user", "content": "What's the weather in Paris?"}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
output = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(output)
# Output: {"name": "get_weather", "arguments": {"city": "Paris"}}
```
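The model emits tool calls as JSON objects of the form shown above. A minimal sketch of how an application might parse and dispatch such a call, assuming a single-call output and using `get_weather` as a stand-in (not a real API):

```python
import json

# Hypothetical tool registry; get_weather is an illustrative stand-in.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def dispatch(raw: str):
    """Parse a tool call like {"name": ..., "arguments": {...}} and invoke it."""
    call = json.loads(raw)
    fn = TOOLS.get(call["name"])
    if fn is None:
        raise KeyError(f"Unknown tool: {call['name']}")
    return fn(**call["arguments"])

print(dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
# → Sunny in Paris
```

In production you would validate the arguments against the function's schema before invoking it, since the model can emit malformed or unexpected fields.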

Training Details

  • Base model: Qwen2.5-1.5B-Instruct (4-bit quantization)
  • LoRA config: rank=8, alpha=16
  • Training: 300 iterations, learning rate 1e-5
  • Framework: MLX on Apple Silicon
  • Data: Berkeley Function Calling Leaderboard (BFCL)

Call Pattern Categories

The BFCL benchmark includes four call pattern categories:

  1. Simple: Single function call (e.g., get_weather(city="Paris"))
  2. Multiple: Sequential calls where later calls depend on earlier results
  3. Parallel: Independent concurrent calls
  4. Parallel-multiple: Combinations requiring both parallel and sequential structure
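To make the categories concrete, here are illustrative tool-call payloads for the first three patterns. The exact serialization is an assumption for illustration (BFCL's format may differ), and `get_user_city` is a hypothetical function:

```python
import json

# Simple: a single self-contained call.
simple = {"name": "get_weather", "arguments": {"city": "Paris"}}

# Multiple: sequential calls; the second consumes the first call's result.
multiple = [
    {"name": "get_user_city", "arguments": {"user_id": 42}},
    {"name": "get_weather", "arguments": {"city": "<result of get_user_city>"}},
]

# Parallel: independent calls that can run concurrently.
parallel = [
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "get_weather", "arguments": {"city": "Tokyo"}},
]

print(json.dumps(simple))
```

Parallel-multiple combines the last two: some calls run concurrently while others must wait on earlier results.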

Recommendations

Based on our research:

  1. Use SFT with diverse training data: it provides nearly all of the achievable gains
  2. Prioritize pattern diversity over tool coverage: models generalize well to new tools but struggle with new call patterns
  3. Skip complex pipelines: DPO, RL, and scaffolding provided minimal benefit in our setting

Citation

```bibtex
@article{ramakrishnan2024toolcalling,
  title={What Actually Improves LLM Tool Calling?},
  author={Ramakrishnan, Siddharth},
  year={2024}
}
```
