# Mistral Small Agent-Diff
A LoRA fine-tune of Ministral-3-14B-Instruct-2512-BF16 for API tool-calling tasks across Box, Google Calendar, Linear, and Slack.
## Results

Evaluated on agent-diff-bench (45 tasks, test split). Per-example average reward (higher is better):
| Config | Reward | Error Rate |
|---|---|---|
| LoRA ep5 t=0.5 | 0.356 | 22.2% |
| LoRA ep4 t=0.7 | 0.341 | 21.6% |
| Base t=0.5 | 0.322 | 28.4% |
| Base t=0.7 | 0.220 | 27.8% |
Grand means (per-example average, all rollouts pooled):
- All LoRA configs: 0.350
- All Base configs: 0.282
- Delta: +0.068
Best-of per example (max reward across rollouts, averaged over examples):
- Best LoRA: 0.454
- Best Base: 0.362
- Delta: +0.092
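The grand-mean and best-of aggregations above can be reproduced from per-example rollout rewards; a minimal sketch (the reward values below are illustrative, not real benchmark data):

```python
from statistics import mean

def grand_mean(rewards_by_example):
    """Pool every rollout reward across all examples, then average."""
    pooled = [r for rollouts in rewards_by_example.values() for r in rollouts]
    return mean(pooled)

def best_of(rewards_by_example):
    """Take the best rollout per example, then average across examples."""
    return mean(max(rollouts) for rollouts in rewards_by_example.values())

# Illustrative rewards for three examples, three rollouts each (not real data).
lora = {"ex1": [0.9, 0.4, 0.5], "ex2": [0.0, 0.2, 0.1], "ex3": [0.6, 0.6, 0.3]}
print(round(grand_mean(lora), 3))  # 0.4
print(round(best_of(lora), 3))     # 0.567
```

Best-of is always at least as large as the grand mean, which is why the +0.092 best-of delta can exceed the +0.068 pooled delta.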
### Per-Service Breakdown
| Service | LoRA ep5 t=0.5 | Base t=0.5 | Delta |
|---|---|---|---|
| Box | 0.266 | 0.100 | +0.166 |
| Calendar | 0.453 | 0.369 | +0.084 |
| Linear | 0.317 | 0.142 | +0.175 |
| Slack | 0.435 | 0.452 | -0.017 |
Head-to-head (best LoRA vs best Base per example): LoRA wins 14, Base wins 5, Tied 15
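The head-to-head tally compares the best LoRA rollout against the best Base rollout on each example; a sketch of that comparison (the reward values are illustrative, not the real benchmark numbers):

```python
def head_to_head(lora_best, base_best, tol=1e-9):
    """Count per-example wins/losses/ties given best-of rewards keyed by example id."""
    wins = losses = ties = 0
    for ex in lora_best:
        diff = lora_best[ex] - base_best[ex]
        if diff > tol:
            wins += 1
        elif diff < -tol:
            losses += 1
        else:
            ties += 1
    return wins, losses, ties

# Illustrative best-of rewards (hypothetical, not from the benchmark).
lora_best = {"ex1": 0.9, "ex2": 0.2, "ex3": 0.6}
base_best = {"ex1": 0.5, "ex2": 0.2, "ex3": 0.8}
print(head_to_head(lora_best, base_best))  # (1, 1, 1)
```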
## Training

### Data
- Source: Devstral-2512 rollouts on agent-diff-bench, filtered for reward > 0.8
- Processing pipeline:
  - Native formatting (0 missing content, 0 consecutive-assistant issues)
  - Command flattening (multi-line curl commands collapsed to single lines; multi-line share reduced from 44% to 6%)
  - Error-turn removal (failed API call + error-response pairs removed; error rate reduced from 20% to 1.8%)
  - Prompt-level train/val split (0% leakage)
- Final dataset: 361 rows, 142 unique prompts, ~2.5 rollouts per prompt
- Dataset: hubertmarek/mistral-large-agent-diff-sft-mixed-old-plus-devstral-r0p8-64k
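The error-turn removal step above can be sketched as a simple filter over chat-format rollouts; the message schema and the `is_error` heuristic here are assumptions for illustration, not the exact pipeline code:

```python
def drop_error_turns(messages, is_error):
    """Remove each failed assistant call together with the error response
    that immediately follows it, preserving the rest of the turn order."""
    cleaned = []
    i = 0
    while i < len(messages):
        msg = messages[i]
        nxt = messages[i + 1] if i + 1 < len(messages) else None
        # Assumed pairing: an assistant tool call followed by an error tool reply.
        if msg["role"] == "assistant" and nxt and nxt["role"] == "tool" and is_error(nxt):
            i += 2  # skip both halves of the failed exchange
            continue
        cleaned.append(msg)
        i += 1
    return cleaned

# Illustrative transcript (hypothetical, not from the dataset).
convo = [
    {"role": "user", "content": "create an issue"},
    {"role": "assistant", "content": "curl -X POST /issues"},
    {"role": "tool", "content": "HTTP 400: bad request"},
    {"role": "assistant", "content": "curl -X POST /issue"},
    {"role": "tool", "content": "HTTP 201: created"},
]
is_error = lambda m: "HTTP 4" in m["content"] or "HTTP 5" in m["content"]
print([m["content"] for m in drop_error_turns(convo, is_error)])
# ['create an issue', 'curl -X POST /issue', 'HTTP 201: created']
```

Removing the failed exchange rather than masking it keeps the surviving turns coherent, so the model never learns to imitate a call that errored.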
### Hyperparameters

```python
SFTConfig(
    num_train_epochs=8,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=6,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.08,
    bf16=True,
    max_length=64000,
    optim="adamw_torch_fused",
    gradient_checkpointing=True,
    save_strategy="epoch",
    packing=False,
)
```
### LoRA Config
- Rank: 64
- Alpha: 128
- Target modules: all linear layers
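An adapter matching these settings might be configured with peft roughly as follows; the dropout value is an assumption (not stated in this card), and "all linear layers" maps to `target_modules="all-linear"` in recent peft versions:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                          # rank, as listed above
    lora_alpha=128,                # alpha = 2 * rank
    target_modules="all-linear",   # all linear layers
    lora_dropout=0.0,              # assumed; not specified in this card
    task_type="CAUSAL_LM",
)
```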
## Inference

### vLLM
```bash
export HF_TOKEN='your_token'
vllm serve mistralai/Ministral-3-14B-Instruct-2512 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-lora \
  --lora-modules agent-diff=ministral-3-14b-agent-diff-sft-lora \
  --enable-auto-tool-choice --tool-call-parser mistral \
  --max-model-len 64000 \
  --max-lora-rank 64 \
  --enforce-eager
```
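Once the server is up, the adapter is addressed by its LoRA module name (`agent-diff`) through the OpenAI-compatible endpoint; a minimal client sketch (the prompt is illustrative, and the server must be running locally):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server ignores the key but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="agent-diff",  # the --lora-modules name, not the base model id
    messages=[{"role": "user", "content": "Create a Linear issue titled 'Fix login bug'."}],
    temperature=0.5,
)
print(resp.choices[0].message)
```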
### Evaluation

```bash
prime eval hubert-marek/agent-diff-bench \
  -m agent-diff \
  --api-base-url http://localhost:8000/v1 \
  -n -1 -r 3 -c 15 \
  --max-retries 20 \
  --env-args '{"agentdiff_api_key": "YOUR_KEY"}' \
  --save-results \
  --temperature 0.5
```
## Checkpoints
This repo contains multiple epochs as commits:
- Epoch 3 (checkpoint-183): Recommended starting point
- Epoch 5 (checkpoint-305): Best benchmark performance
- Epoch 8 (checkpoint-488): Overfitting
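Since the epochs live as separate commits in this repo, a specific checkpoint can be pinned by commit hash with `huggingface_hub`; the revision string below is a placeholder, not a real hash:

```python
from huggingface_hub import snapshot_download

# Pin the adapter download to the commit that holds the desired epoch.
adapter_path = snapshot_download(
    "hubertmarek/Ministral-3-14B-Agent-Diff-SFT-LoRA",
    revision="<commit-hash-of-epoch-3>",  # placeholder; copy from the repo's commit history
)
```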
## Model Tree

- Repo: hubertmarek/Ministral-3-14B-Agent-Diff-SFT-LoRA
- Base model: mistralai/Ministral-3-14B-Base-2512