# Qwen-2.5-1.5B-GRPO-Tool-Calling
This model is a fine-tuned version of Qwen/Qwen2.5-1.5B-Instruct designed to excel at tool-calling and functional reasoning. It was trained using GRPO (Group Relative Policy Optimization).
The model emits a `<think>` block for internal reasoning before producing the final `<tool>` call, enabling it to plan complex function parameters more reliably.
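Completions in this format can be split into their reasoning and tool-call parts with a small parser. A minimal sketch using only the standard library; the `parse_tool_call` helper and the sample completion are illustrative, not part of the model card:

```python
import json
import re


def parse_tool_call(completion: str):
    """Split a completion into its <think> reasoning and the parsed <tool> JSON."""
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    tool = re.search(r"<tool>(.*?)</tool>", completion, re.DOTALL)
    if tool is None:
        return None, None  # malformed completion: no tool call emitted
    reasoning = think.group(1).strip() if think else ""
    call = json.loads(tool.group(1))  # expected shape: {"name": ..., "parameters": {...}}
    return reasoning, call


completion = (
    "<think>The user wants the weather, so I need get_weather with the city.</think> "
    '<tool> {"name": "get_weather", "parameters": {"city": "Paris"}} </tool>'
)
reasoning, call = parse_tool_call(completion)
print(call["name"])        # get_weather
print(call["parameters"])  # {'city': 'Paris'}
```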
## Model Details
- Base Model: Qwen/Qwen2.5-1.5B-Instruct
- Training Method: GRPO (Group Relative Policy Optimization)
- Adaptation: QLoRA (4-bit quantization)
- Primary Task: Verifiable Tool-Calling and Argument Generation
## Training Hyperparameters
The model was trained on a single NVIDIA T4 GPU with the following configuration:
| Parameter | Value |
|---|---|
| Steps | 25 |
| Effective Batch Size | 16 |
| Rollouts per Query | 4 |
| Max Prompt Length | 512 |
| Max Completion Length | 1024 |
| LoRA Rank (r) | 32 |
| LoRA Alpha | 32 |
| Target Modules | All Linear (q, k, v, o, gate, up, down) |
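The table above maps onto a TRL + PEFT training setup roughly as follows. This is a minimal sketch, assuming the `trl` and `peft` libraries; the field names follow their current APIs, the batch-size split (4 per device × 4 accumulation steps = 16 effective) is an assumption, and the dataset/tokenizer wiring is omitted:

```python
from peft import LoraConfig
from trl import GRPOConfig

# QLoRA adapter over all linear projections (rank and alpha from the table above)
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

# GRPO settings matching the reported run
training_args = GRPOConfig(
    max_steps=25,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # assumed split: 4 x 4 = effective batch size 16
    num_generations=4,              # rollouts per query
    max_prompt_length=512,
    max_completion_length=1024,
)
```

These two configs would then be passed to `GRPOTrainer` along with the base model, dataset, and reward functions.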
## Reward Functions
The model's behavior is shaped by a multi-objective reward system:
- Format Reward [Weight: 0.3]: Incentivizes the model to follow the specific XML-style structure: `<think> ... </think> <tool> {"name": "...", "parameters": {...}} </tool>`
- Correctness Reward [Weight: 0.7]: Validates the output against the ground truth. It awards points based on:
  - Exact match of the tool name.
  - Syntactic validity of the JSON parameters.
  - Semantic match of the arguments provided.
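The two rewards above could be scored along these lines. A hedged sketch: the function names, the partial-credit split inside the correctness reward, and the sample completion are all illustrative assumptions, since the card does not publish the actual reward code:

```python
import json
import re

# A completion is well-formed if it is exactly <think>...</think> followed by <tool>...</tool>
TOOL_PATTERN = re.compile(r"^<think>.*?</think>\s*<tool>(.*?)</tool>\s*$", re.DOTALL)


def format_reward(completion: str) -> float:
    """Weight 0.3: 1.0 if the completion follows the <think>/<tool> structure, else 0.0."""
    return 1.0 if TOOL_PATTERN.match(completion.strip()) else 0.0


def correctness_reward(completion: str, expected: dict) -> float:
    """Weight 0.7: partial credit for name match, valid JSON, and matching arguments."""
    m = TOOL_PATTERN.match(completion.strip())
    if not m:
        return 0.0
    try:
        call = json.loads(m.group(1))  # syntactic validity of the JSON parameters
    except json.JSONDecodeError:
        return 0.0
    score = 0.2  # JSON parsed successfully (illustrative split)
    if call.get("name") == expected["name"]:  # exact tool-name match
        score += 0.4
    if call.get("parameters") == expected["parameters"]:  # argument match
        score += 0.4
    return score


completion = (
    "<think>Need the forecast.</think>"
    '<tool>{"name": "get_weather", "parameters": {"city": "Paris"}}</tool>'
)
expected = {"name": "get_weather", "parameters": {"city": "Paris"}}
total = 0.3 * format_reward(completion) + 0.7 * correctness_reward(completion, expected)
print(total)  # 1.0 when both format and content match exactly
```

Note the comparison of `parameters` here is an exact dict equality; a semantic match (e.g. tolerating equivalent argument phrasings) would need a looser comparison.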