Tool Calling LoRA Adapters

LoRA adapters for improving LLM tool calling, trained as part of the research paper "What Actually Improves LLM Tool Calling?"

Model Description

These are LoRA adapters (rank 8, alpha 16) trained on top of Qwen2.5-1.5B-Instruct for function calling tasks using the Berkeley Function Calling Leaderboard (BFCL) dataset.

Key Findings

Our ablation study found that:

  • SFT alone provides a +47-point accuracy improvement (9.7% → 57%)
  • DPO and RL add <1 point when applied post-SFT
  • Training data diversity matters more than quantity: 500 diverse examples outperform 500 homogeneous examples by 26 points
  • Tool generalization works well (79% on unseen tools) but pattern generalization is harder (42% on unseen patterns)

Available Adapters

| Adapter | Description | Accuracy |
|---|---|---|
| `sft/` | SFT on diverse BFCL data | 57.0% |
| `sft_dpo/` | SFT + DPO preference tuning | 57.7% |
| `sft_rl/` | SFT + reward-filtered RL | 58.0% |
| `tool_generalization/sft/` | SFT for unseen-tools experiment | 79% on held-out tools |
| `category_holdout/sft/` | SFT for pattern-generalization experiment | 42% on held-out patterns |
| `diversity/high_diversity/` | Diverse training (125 examples × 4 categories) | 53% |
| `diversity/low_diversity/` | Homogeneous training (500 simple examples) | 27% |

Usage

```python
from mlx_lm import load, generate

# Load base model with SFT adapter
model, tokenizer = load(
    "mlx-community/Qwen2.5-1.5B-Instruct-4bit",
    adapter_path="path/to/sft"
)

# Format your prompt with function definitions
messages = [
    {"role": "system", "content": "You are a helpful assistant with access to functions..."},
    {"role": "user", "content": "What's the weather in Paris?"}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
output = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(output)
# Output: {"name": "get_weather", "arguments": {"city": "Paris"}}
```
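The model emits tool calls as JSON objects of the form shown above. A minimal sketch of how an application might parse and dispatch such a call, assuming a single-call output and using `get_weather` as a stand-in (not a real API):

```python
import json

# Hypothetical tool registry; get_weather is an illustrative stand-in.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

def dispatch(raw: str):
    """Parse a tool call like {"name": ..., "arguments": {...}} and invoke it."""
    call = json.loads(raw)
    fn = TOOLS.get(call["name"])
    if fn is None:
        raise KeyError(f"Unknown tool: {call['name']}")
    return fn(**call["arguments"])

print(dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
# → Sunny in Paris
```

In production you would validate the arguments against the function's schema before invoking it, since the model can emit malformed or unexpected fields.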

Training Details

  • Base model: Qwen2.5-1.5B-Instruct (4-bit quantization)
  • LoRA config: rank=8, alpha=16
  • Training: 300 iterations, learning rate 1e-5
  • Framework: MLX on Apple Silicon
  • Data: Berkeley Function Calling Leaderboard (BFCL)

Call Pattern Categories

The BFCL benchmark includes four call pattern categories:

  1. Simple: Single function call (e.g., get_weather(city="Paris"))
  2. Multiple: Sequential calls where later calls depend on earlier results
  3. Parallel: Independent concurrent calls
  4. Parallel-multiple: Combinations requiring both parallel and sequential structure
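To make the categories concrete, here are illustrative tool-call payloads for the first three patterns. The exact serialization is an assumption for illustration (BFCL's format may differ), and `get_user_city` is a hypothetical function:

```python
import json

# Simple: a single self-contained call.
simple = {"name": "get_weather", "arguments": {"city": "Paris"}}

# Multiple: sequential calls; the second consumes the first call's result.
multiple = [
    {"name": "get_user_city", "arguments": {"user_id": 42}},
    {"name": "get_weather", "arguments": {"city": "<result of get_user_city>"}},
]

# Parallel: independent calls that can run concurrently.
parallel = [
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "get_weather", "arguments": {"city": "Tokyo"}},
]

print(json.dumps(simple))
```

Parallel-multiple combines the last two: some calls run concurrently while others must wait on earlier results.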

Recommendations

Based on our research:

  1. Use SFT with diverse training data: it provides nearly all of the achievable gains
  2. Prioritize pattern diversity over tool coverage: models generalize well to new tools but struggle with new call patterns
  3. Skip complex pipelines: DPO, RL, and scaffolding provided minimal benefit in our setting

Citation

```bibtex
@article{ramakrishnan2024toolcalling,
  title={What Actually Improves LLM Tool Calling?},
  author={Ramakrishnan, Siddharth},
  year={2024}
}
```
