Ministral-3-14B Agent Diff SFT + GRPO

LoRA fine-tune of Ministral-3-14B-Instruct-2512-BF16 for API tool-calling tasks across Slack, Linear, Box, and Google Calendar. Two-stage training: SFT warm-start on filtered expert trajectories, then GRPO reinforcement learning on agent-diff-bench via prime-rl infrastructure.

Results

Evaluated on agent-diff-bench (45 tasks, test split). Pass@1 task success rate:

Stage                  Reward
Base (Ministral 14B)   0.282
+ SFT (epoch 5)        0.356
+ GRPO (step 11)       0.449

SFT + GRPO together yield a ~59% relative improvement in task success over the base model (0.282 → 0.449).

Training Stages

Stage 1: SFT

The starting SFT adapter was trained on filtered agent-diff expert trajectories with the following configuration:

  • base model: mistralai/Ministral-3-14B-Instruct-2512-BF16
  • adapter repo: hubertmarek/Ministral-3-14B-Agent-Diff-SFT-LoRA
  • SFT source dataset: hubertmarek/mistral-large-agent-diff-sft-mixed-old-plus-devstral-r0p8-64k
  • max length: 64000
  • epochs: 8
  • learning rate: 5e-5
  • per-device batch size: 1
  • gradient accumulation steps: 6
  • lr scheduler: cosine
  • warmup ratio: 0.08
  • LoRA rank: 64
  • LoRA alpha: 128
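As a reading aid for the schedule settings above (lr 5e-5, cosine scheduler, warmup ratio 0.08), here is a generic sketch of a cosine-with-warmup learning-rate curve. This is an illustration of the schedule shape, not code from the training run:

```python
import math

def cosine_lr(step, total_steps, peak_lr=5e-5, warmup_ratio=0.08):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # linear warmup: 0 -> peak_lr over the first 8% of steps
        return peak_lr * step / warmup_steps
    # cosine decay: peak_lr -> 0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Note that with per-device batch size 1 and gradient accumulation 6, each optimizer step sees 6 trajectories per device.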

Stage 2: GRPO

GRPO run configuration:

  • output dir: outputs/agent_diff_mistral_rl
  • initialization model: outputs/Ministral-3-14B-Agent-Diff-SFT-merged
  • hardware: 4x NVIDIA A100 80GB total (2 trainer GPUs + 2 inference GPUs)
  • max steps: 50
  • max async level: 2
  • global sequence length: 15360
  • checkpoint interval: 5
  • train GPUs: 2
  • inference GPUs: 2
  • trainer optimizer: adamw, lr 1e-5, weight decay 0.0
  • trainer dtypes: bfloat16
  • context parallelism: 2
  • activation checkpointing: freq=1
  • LoRA rank: 64
  • LoRA alpha: 128
  • LoRA dropout: 0.05
  • target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • rollout batch size: 256
  • rollouts per example: 8
  • oversampling factor: 1.2
  • sampling max tokens: 2096
  • sampling temperature: 0.7
  • max retries per env rollout: 25
  • eval interval: 5
  • eval rollouts per example: 3
  • env services: slack,linear,calendar,box
  • task horizon: 6
  • max turns: 20
  • tool efficiency weight: 0.0
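The grouping above (8 rollouts per example) is what makes GRPO "group-relative": each rollout's reward is normalized against its siblings from the same task prompt. A minimal sketch of that normalization (my illustration, not prime-rl's actual implementation):

```python
import statistics

def group_advantages(rewards):
    """Normalize one group of rollout rewards to zero mean / unit std.

    Groups where every rollout gets the same reward carry no learning
    signal, so their advantages collapse to zero.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

With sparse task-success rewards, a group where one of eight rollouts succeeds gives that rollout a large positive advantage and the failures small negative ones.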

Inference

Serve the merged SFT model and load the RL adapter checkpoint you want to evaluate.

Example vLLM settings used during training:

vllm serve hubertmarek/Ministral-3-14B-agent-diff-sft \
  --dtype bfloat16 \
  --max-model-len 65536 \
  --enable-lora \
  --lora-modules agent-diff=hubertmarek/Ministral-3-14B-Agent-Diff-SFT-GRPO-LoRA/step_11 \
  --max-lora-rank 64 \
  --enable-auto-tool-choice \
  --tool-call-parser mistral \
  --tokenizer mistralai/Ministral-3-14B-Instruct-2512-BF16 \
  --tokenizer-mode mistral \
  --enforce-eager

If you are loading from disk instead of HF, use:

--lora-modules agent-diff=outputs/agent_diff_mistral_rl/run_default/broadcasts/step_11
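Once the server is up, requests target the adapter by its registered LoRA module name (agent-diff) rather than the base model path. A minimal payload builder for vLLM's OpenAI-compatible chat endpoint; the send_message tool schema is a made-up example for illustration, not part of agent-diff-bench:

```python
def build_chat_request(user_message):
    """Build an OpenAI-compatible chat payload for the served adapter.

    POST the result as JSON to http://localhost:8000/v1/chat/completions.
    """
    return {
        "model": "agent-diff",  # the name registered via --lora-modules
        "messages": [{"role": "user", "content": user_message}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "send_message",  # hypothetical example tool
                "description": "Send a message to a Slack channel.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "channel": {"type": "string"},
                        "text": {"type": "string"},
                    },
                    "required": ["channel", "text"],
                },
            },
        }],
        "tool_choice": "auto",
        "temperature": 0.7,
    }
```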

Evaluation

prime eval hubert-marek/agent-diff-bench \
  -m agent-diff \
  --api-base-url http://localhost:8000/v1 \
  -n -1 -r 3 -c 15 \
  --max-retries 20 \
  --env-args '{"agentdiff_api_key": "YOUR_KEY"}' \
  --save-results \
  --temperature 0.5
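With -r 3 rollouts per task, I aggregate pass@1 as the mean over tasks of each task's success fraction. A sketch of that aggregation (this reflects how I read the numbers in the results table; the exact reducer inside prime eval may differ):

```python
def pass_at_1(per_task_outcomes):
    """Mean success rate over tasks.

    per_task_outcomes maps each task id to a list of 0/1 rollout
    outcomes (e.g. 3 entries per task when evaluating with -r 3).
    """
    fractions = [sum(runs) / len(runs) for runs in per_task_outcomes.values()]
    return sum(fractions) / len(fractions)
```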

Training Code

This run uses a forked PRIME-RL codebase (hubert-marek/prime-rl) with the following changes for Ministral-3 compatibility:

  • 37d7e2e3 — Trainer Mistral3 support: text config normalization, automatic state dict key remapping for merged multimodal checkpoints, FSDP meta-device loading
  • f29374f2 — vLLM inference: patched Mistral3 multimodal pipeline for text-only serving from merged checkpoints
  • 5c6de77f — Scheduler resilience: capped group reschedules with graceful degradation (zero-reward conversion for partial rollouts, clean drops for empty ones), full test coverage