# Ministral-3-14B Agent Diff SFT + GRPO
LoRA fine-tune of Ministral-3-14B-Instruct-2512-BF16 for API tool-calling tasks across Slack, Linear, Box, and Google Calendar. Two-stage training: SFT warm-start on filtered expert trajectories, then GRPO reinforcement learning on agent-diff-bench via prime-rl infrastructure.
## Results
Evaluated on agent-diff-bench (45 tasks, test split). Pass@1 task success rate:
| Stage | Pass@1 reward |
|---|---|
| Base (Ministral 14B) | 0.282 |
| + SFT (epoch 5) | 0.356 |
| + GRPO (step 11) | 0.449 |
SFT + GRPO lifts pass@1 from 0.282 to 0.449, a ~59% relative improvement over the base model.
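The relative gains can be checked directly from the table above:

```python
# Relative improvement of each training stage over the base model,
# computed from the pass@1 rewards in the results table.
rewards = {"base": 0.282, "sft": 0.356, "grpo": 0.449}

def relative_gain(stage: str, baseline: str = "base") -> float:
    """Percentage gain of `stage` over `baseline`."""
    return (rewards[stage] - rewards[baseline]) / rewards[baseline] * 100

print(f"SFT vs base:  {relative_gain('sft'):.1f}%")   # ~26.2%
print(f"GRPO vs base: {relative_gain('grpo'):.1f}%")  # ~59.2%
```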
## Training Stages
### Stage 1: SFT

The starting SFT adapter was trained on filtered agent-diff trajectories with the following configuration:

- base model: `mistralai/Ministral-3-14B-Instruct-2512-BF16`
- adapter repo: `hubertmarek/Ministral-3-14B-Agent-Diff-SFT-LoRA`
- SFT source dataset: `hubertmarek/mistral-large-agent-diff-sft-mixed-old-plus-devstral-r0p8-64k`
- max length: 64000
- epochs: 8
- learning rate: 5e-5
- per-device batch size: 1
- gradient accumulation steps: 6
- lr scheduler: cosine
- warmup ratio: 0.08
- LoRA rank: 64
- LoRA alpha: 128
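The hyperparameters above can be assembled into a TRL + PEFT configuration. This is a hypothetical sketch, not the actual training script (which is not included here); the target modules are an assumption carried over from the GRPO stage, and the max-length field name varies across TRL versions.

```python
# Hypothetical SFT setup mirroring the hyperparameters listed above.
# The real training script is not part of this repo.
from peft import LoraConfig
from trl import SFTConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    # Assumption: same target modules as the GRPO stage below.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

sft_config = SFTConfig(
    output_dir="outputs/ministral-agent-diff-sft",  # illustrative path
    num_train_epochs=8,
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.08,
    max_length=64000,  # called `max_seq_length` in older TRL releases
    bf16=True,
)
```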
### Stage 2: GRPO

Current GRPO run configuration:

- output dir: `outputs/agent_diff_mistral_rl`
- initialization model: `outputs/Ministral-3-14B-Agent-Diff-SFT-merged`
- hardware: 4x NVIDIA A100 80GB total (2 trainer GPUs + 2 inference GPUs)
- max steps: 50
- max async level: 2
- global sequence length: 15360
- checkpoint interval: 5
- train GPUs: 2
- inference GPUs: 2
- trainer optimizer: adamw, lr 1e-5, weight decay 0.0
- trainer dtypes: bfloat16
- context parallelism: 2
- activation checkpointing: freq=1
- LoRA rank: 64
- LoRA alpha: 128
- LoRA dropout: 0.05
- target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- rollout batch size: 256
- rollouts per example: 8
- oversampling factor: 1.2
- sampling max tokens: 2096
- sampling temperature: 0.7
- max retries per env rollout: 25
- eval interval: 5
- eval rollouts per example: 3
- env services: slack, linear, calendar, box
- task horizon: 6
- max turns: 20
- tool efficiency weight: 0.0
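As a sanity check on the rollout settings above: a rollout batch of 256 with 8 rollouts per example corresponds to 32 unique tasks per step, and the 1.2 oversampling factor pads that to buffer against failed environment rollouts. A small sketch of the arithmetic, assuming oversampling applies at the example level (the exact scheduling semantics live in prime-rl):

```python
import math

def rollout_budget(batch_size: int, rollouts_per_example: int,
                   oversampling: float) -> tuple[int, int]:
    """Return (unique examples per step, examples scheduled with
    oversampling). Assumes oversampling is an example-level buffer
    against dropped/failed environment rollouts."""
    unique = batch_size // rollouts_per_example
    scheduled = math.ceil(unique * oversampling)
    return unique, scheduled

print(rollout_budget(256, 8, 1.2))  # (32, 39)
```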
## Inference
Serve the merged SFT model and load the RL adapter checkpoint you want to evaluate.
Example vLLM settings used during training:

```shell
vllm serve hubertmarek/Ministral-3-14B-agent-diff-sft \
  --dtype bfloat16 \
  --max-model-len 65536 \
  --enable-lora \
  --lora-modules agent-diff=hubertmarek/Ministral-3-14B-Agent-Diff-SFT-GRPO-LoRA/step_11 \
  --max-lora-rank 64 \
  --enable-auto-tool-choice \
  --tool-call-parser mistral \
  --tokenizer mistralai/Ministral-3-14B-Instruct-2512-BF16 \
  --tokenizer-mode mistral \
  --enforce-eager
```
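Once the server is up, requests select the adapter by the name registered under `--lora-modules`, not by the base model path. A minimal sketch of an OpenAI-compatible request body (the prompt and tool definition are illustrative):

```python
import json

def build_chat_request(prompt: str, model: str = "agent-diff") -> str:
    """Build an OpenAI-compatible /v1/chat/completions body that targets
    the LoRA adapter registered via --lora-modules."""
    payload = {
        "model": model,  # adapter name, not the base model path
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "calendar_list_events",  # illustrative tool
                "parameters": {"type": "object", "properties": {}},
            },
        }],
        "temperature": 0.7,
        "max_tokens": 2096,
    }
    return json.dumps(payload)

print(build_chat_request("List my calendar events for today"))
```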
If you are loading from disk instead of HF, use:

```shell
--lora-modules agent-diff=outputs/agent_diff_mistral_rl/run_default/broadcasts/step_11
```
## Evaluation
```shell
prime eval hubert-marek/agent-diff-bench \
  -m agent-diff \
  --api-base-url http://localhost:8000/v1 \
  -n -1 -r 3 -c 15 \
  --max-retries 20 \
  --env-args '{"agentdiff_api_key": "YOUR_KEY"}' \
  --save-results \
  --temperature 0.5
```
## Training Code

This run uses a fork of the PRIME-RL training code:
- GitHub repo: hubert-marek/prime-rl
Changes for Ministral-3 compatibility (hubert-marek/prime-rl):
- `37d7e2e3` — Trainer Mistral3 support: text config normalization, automatic state-dict key remapping for merged multimodal checkpoints, FSDP meta-device loading
- `f29374f2` — vLLM inference: patched Mistral3 multimodal pipeline for text-only serving from merged checkpoints
- `5c6de77f` — Scheduler resilience: capped group reschedules with graceful degradation (zero-reward conversion for partial rollouts, clean drops for empty ones), full test coverage