# Ministral-3-14B Agent Diff SFT + GRPO
LoRA fine-tune of Ministral-3-14B-Instruct-2512-BF16 for API tool-calling tasks across Slack, Linear, Box, and Google Calendar. Two-stage training: SFT warm-start on filtered expert trajectories, then GRPO reinforcement learning on agent-diff-bench via prime-rl infrastructure.
## Results
Evaluated on agent-diff-bench (45 tasks, test split). Pass@1 task success rate:
| Stage | Pass@1 reward |
|---|---|
| Base (Ministral 14B) | 0.282 |
| + SFT (epoch 5) | 0.356 |
| + GRPO (step 11) | 0.449 |
SFT + GRPO lifts pass@1 from 0.282 to 0.449, a ~59% relative improvement over the base model.
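The relative gains can be checked directly from the table above:

```python
# Relative improvement of each training stage over the base model,
# computed from the pass@1 rewards in the results table.
rewards = {"base": 0.282, "sft": 0.356, "grpo": 0.449}

def relative_gain(stage: str, baseline: str = "base") -> float:
    """Percentage gain of `stage` over `baseline`."""
    return (rewards[stage] - rewards[baseline]) / rewards[baseline] * 100

print(f"SFT vs base:  {relative_gain('sft'):.1f}%")   # ~26.2%
print(f"GRPO vs base: {relative_gain('grpo'):.1f}%")  # ~59.2%
```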
## Training Stages
### Stage 1: SFT

The starting SFT adapter was trained on filtered agent-diff trajectories with the following configuration:

- base model: `mistralai/Ministral-3-14B-Instruct-2512-BF16`
- adapter repo: `hubertmarek/Ministral-3-14B-Agent-Diff-SFT-LoRA`
- SFT source dataset: `hubertmarek/mistral-large-agent-diff-sft-mixed-old-plus-devstral-r0p8-64k`
- max length: 64000
- epochs: 8
- learning rate: 5e-5
- per-device batch size: 1
- gradient accumulation steps: 6
- lr scheduler: cosine
- warmup ratio: 0.08
- LoRA rank: 64
- LoRA alpha: 128
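The hyperparameters above can be assembled into a TRL + PEFT configuration. This is a hypothetical sketch, not the actual training script (which is not included here); the target modules are an assumption carried over from the GRPO stage, and the max-length field name varies across TRL versions.

```python
# Hypothetical SFT setup mirroring the hyperparameters listed above.
# The real training script is not part of this repo.
from peft import LoraConfig
from trl import SFTConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    # Assumption: same target modules as the GRPO stage below.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

sft_config = SFTConfig(
    output_dir="outputs/ministral-agent-diff-sft",  # illustrative path
    num_train_epochs=8,
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.08,
    max_length=64000,  # called `max_seq_length` in older TRL releases
    bf16=True,
)
```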
### Stage 2: GRPO

Current GRPO run configuration:

- output dir: `outputs/agent_diff_mistral_rl`
- initialization model: `outputs/Ministral-3-14B-Agent-Diff-SFT-merged`
- hardware: 4x NVIDIA A100 80GB total (2 trainer GPUs + 2 inference GPUs)
- max steps: 50
- max async level: 2
- global sequence length: 15360
- checkpoint interval: 5
- train GPUs: 2
- inference GPUs: 2
- trainer optimizer: adamw, lr 1e-5, weight decay 0.0
- trainer dtypes: bfloat16
- context parallelism: 2
- activation checkpointing: freq=1
- LoRA rank: 64
- LoRA alpha: 128
- LoRA dropout: 0.05
- target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- rollout batch size: 256
- rollouts per example: 8
- oversampling factor: 1.2
- sampling max tokens: 2096
- sampling temperature: 0.7
- max retries per env rollout: 25
- eval interval: 5
- eval rollouts per example: 3
- env services: slack, linear, calendar, box
- task horizon: 6
- max turns: 20
- tool efficiency weight: 0.0
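As a sanity check on the rollout settings above: a rollout batch of 256 with 8 rollouts per example corresponds to 32 unique tasks per step, and the 1.2 oversampling factor pads that to buffer against failed environment rollouts. A small sketch of the arithmetic, assuming oversampling applies at the example level (the exact scheduling semantics live in prime-rl):

```python
import math

def rollout_budget(batch_size: int, rollouts_per_example: int,
                   oversampling: float) -> tuple[int, int]:
    """Return (unique examples per step, examples scheduled with
    oversampling). Assumes oversampling is an example-level buffer
    against dropped/failed environment rollouts."""
    unique = batch_size // rollouts_per_example
    scheduled = math.ceil(unique * oversampling)
    return unique, scheduled

print(rollout_budget(256, 8, 1.2))  # (32, 39)
```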
## Inference
Serve the merged SFT model and load the RL adapter checkpoint you want to evaluate.
Example vLLM settings used during training:

```shell
vllm serve hubertmarek/Ministral-3-14B-agent-diff-sft \
  --dtype bfloat16 \
  --max-model-len 65536 \
  --enable-lora \
  --lora-modules agent-diff=hubertmarek/Ministral-3-14B-Agent-Diff-SFT-GRPO-LoRA/step_11 \
  --max-lora-rank 64 \
  --enable-auto-tool-choice \
  --tool-call-parser mistral \
  --tokenizer mistralai/Ministral-3-14B-Instruct-2512-BF16 \
  --tokenizer-mode mistral \
  --enforce-eager
```
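Once the server is up, requests select the adapter by the name registered under `--lora-modules`, not by the base model path. A minimal sketch of an OpenAI-compatible request body (the prompt and tool definition are illustrative):

```python
import json

def build_chat_request(prompt: str, model: str = "agent-diff") -> str:
    """Build an OpenAI-compatible /v1/chat/completions body that targets
    the LoRA adapter registered via --lora-modules."""
    payload = {
        "model": model,  # adapter name, not the base model path
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "calendar_list_events",  # illustrative tool
                "parameters": {"type": "object", "properties": {}},
            },
        }],
        "temperature": 0.7,
        "max_tokens": 2096,
    }
    return json.dumps(payload)

print(build_chat_request("List my calendar events for today"))
```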
If you are loading from disk instead of HF, use:

```shell
--lora-modules agent-diff=outputs/agent_diff_mistral_rl/run_default/broadcasts/step_11
```
## Evaluation
```shell
prime eval hubert-marek/agent-diff-bench \
  -m agent-diff \
  --api-base-url http://localhost:8000/v1 \
  -n -1 -r 3 -c 15 \
  --max-retries 20 \
  --env-args '{"agentdiff_api_key": "YOUR_KEY"}' \
  --save-results \
  --temperature 0.5
```
## Training Code

This run uses a fork of the PRIME-RL training code:
- GitHub repo: hubert-marek/prime-rl
Changes for Ministral-3 compatibility (hubert-marek/prime-rl):
- `37d7e2e3` — Trainer Mistral3 support: text config normalization, automatic state-dict key remapping for merged multimodal checkpoints, FSDP meta-device loading
- `f29374f2` — vLLM inference: patched Mistral3 multimodal pipeline for text-only serving from merged checkpoints
- `5c6de77f` — Scheduler resilience: capped group reschedules with graceful degradation (zero-reward conversion for partial rollouts, clean drops for empty ones), full test coverage