---
base_model: unsloth/Devstral-Small-2507-unsloth-bnb-4bit
library_name: peft
model_name: devstral-v3-sft
tags:
- base_model:adapter:unsloth/Devstral-Small-2507-unsloth-bnb-4bit
- lora
- sft
- transformers
- trl
- unsloth
license: apache-2.0
pipeline_tag: text-generation
---
# Model Card for devstral-v3-sft
This model is a fine-tuned version of [unsloth/Devstral-Small-2507-unsloth-bnb-4bit](https://huggingface.co/unsloth/Devstral-Small-2507-unsloth-bnb-4bit).
It has been trained using [TRL](https://github.com/huggingface/trl).
## Quick start
```python
from transformers import pipeline
question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline("text-generation", model="clemsail/devstral-v3-sft", device="cuda")  # repo id assumed -- replace with the actual Hub path for this adapter
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
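Because this artefact is a LoRA/PEFT adapter rather than a full checkpoint, the pipeline call above relies on `peft` being installed. Equivalently, you can attach the adapter to the base model explicitly (a minimal sketch; the adapter repo id is an assumption):
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "unsloth/Devstral-Small-2507-unsloth-bnb-4bit"
adapter_id = "clemsail/devstral-v3-sft"  # assumed repo id -- replace with the actual Hub path

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)  # attaches the LoRA weights without merging
```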
## Training procedure
This model was trained with SFT.
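A minimal TRL-style sketch of such an SFT run over a LoRA adapter follows; the dataset, hyperparameters, and any Unsloth-specific setup are illustrative assumptions, not the actual training configuration:
```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder dataset -- the actual SFT data for this adapter is not published here.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="unsloth/Devstral-Small-2507-unsloth-bnb-4bit",
    train_dataset=dataset,
    args=SFTConfig(output_dir="devstral-v3-sft"),
    # Illustrative LoRA hyperparameters; the real rank/alpha are not documented.
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
```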
### Framework versions
- PEFT: 0.18.1
- TRL: 0.24.0
- Transformers: 5.5.0
- Pytorch: 2.10.0
- Datasets: 4.3.0
- Tokenizers: 0.22.2
## Citations
Cite TRL as:
```bibtex
@misc{vonwerra2022trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
year = 2020,
journal = {GitHub repository},
publisher = {GitHub},
howpublished = {\url{https://github.com/huggingface/trl}}
}
```
# devstral-v3-sft
## 🇪🇺 EU AI Act transparency
This model is published under the AI Act framework (Regulation (EU) 2024/1689).
| Field | Value |
|---|---|
| Provider | Ailiance (clemsail) |
| Role under AI Act | GPAI provider |
| Adapter type | LoRA / PEFT — supervised fine-tune adapter |
| Base model | `mistralai/Devstral-Small-2-24B-Instruct-2512` |
| License | Apache-2.0 (this artefact); upstream Mistral licence applies separately |
| Intended use | Code generation across Python / Rust / TypeScript / C++ / SQL / shell, with stronger reasoning on engineering questions |
| Out of scope | Healthcare diagnosis, legal advice, autonomous safety-critical decisions, generation of malicious code or exploits |
| Risk classification | Limited risk — Article 50 transparency obligations apply |
| Copyright respect | Training data does not include scraped copyrighted material; it comprises public engineering documentation under permissive licences plus internal synthetic distillation. |
| Full provenance | https://ailiance.fr/transparency |
| Contact | postmaster@saillant.cc |
⚠️ **You are using an AI model.** Outputs may be inaccurate, biased or
fabricated. Do not act on them without independent verification, especially
in regulated domains.
## Benchmarks
Run via `lm-eval-harness` v0.4.x against the FUSED checkpoint (base + this
adapter merged for inference). Strict-match where applicable.
| Task | Metric | Score |
|---|---|---|
| gsm8k | `exact_match,strict-match` | **0.844** |
| ifeval | `prompt_level_strict_acc,none` | **0.691** |
| bbh_cot_fewshot | `exact_match,get-answer` | **0.795** |
| bbh_cot_fewshot_boolean_expressions | `exact_match,get-answer` | **0.900** |
| bbh_cot_fewshot_causal_judgement | `exact_match,get-answer` | **0.600** |
| bbh_cot_fewshot_date_understanding | `exact_match,get-answer` | **0.933** |
| bbh_cot_fewshot_disambiguation_qa | `exact_match,get-answer` | **0.767** |
| bbh_cot_fewshot_dyck_languages | `exact_match,get-answer` | **0.100** |
| bbh_cot_fewshot_formal_fallacies | `exact_match,get-answer` | **0.600** |
| bbh_cot_fewshot_geometric_shapes | `exact_match,get-answer` | **0.367** |
| bbh_cot_fewshot_hyperbaton | `exact_match,get-answer` | **1.000** |
| bbh_cot_fewshot_logical_deduction_five_objects | `exact_match,get-answer` | **0.767** |
| bbh_cot_fewshot_logical_deduction_seven_objects | `exact_match,get-answer` | **0.533** |
| bbh_cot_fewshot_logical_deduction_three_objects | `exact_match,get-answer` | **0.900** |
| bbh_cot_fewshot_movie_recommendation | `exact_match,get-answer` | **0.833** |
| bbh_cot_fewshot_multistep_arithmetic_two | `exact_match,get-answer` | **0.867** |
| bbh_cot_fewshot_navigate | `exact_match,get-answer` | **0.967** |
| bbh_cot_fewshot_object_counting | `exact_match,get-answer` | **0.967** |
| bbh_cot_fewshot_penguins_in_a_table | `exact_match,get-answer` | **0.933** |
| bbh_cot_fewshot_reasoning_about_colored_objects | `exact_match,get-answer` | **0.967** |
| bbh_cot_fewshot_ruin_names | `exact_match,get-answer` | **0.667** |
| bbh_cot_fewshot_salient_translation_error_detection | `exact_match,get-answer` | **0.700** |
| bbh_cot_fewshot_snarks | `exact_match,get-answer` | **0.700** |
| bbh_cot_fewshot_sports_understanding | `exact_match,get-answer` | **0.900** |
| bbh_cot_fewshot_temporal_sequences | `exact_match,get-answer` | **0.967** |
| bbh_cot_fewshot_tracking_shuffled_objects_five_objects | `exact_match,get-answer` | **0.967** |
| bbh_cot_fewshot_tracking_shuffled_objects_seven_objects | `exact_match,get-answer` | **0.933** |
| bbh_cot_fewshot_tracking_shuffled_objects_three_objects | `exact_match,get-answer` | **0.967** |
| bbh_cot_fewshot_web_of_lies | `exact_match,get-answer` | **1.000** |
| bbh_cot_fewshot_word_sorting | `exact_match,get-answer` | **0.667** |
| mmlu_pro | `exact_match,custom-extract` | **0.619** |
| mmlu_pro_biology | `exact_match,custom-extract` | **0.768** |
| mmlu_pro_business | `exact_match,custom-extract` | **0.660** |
| mmlu_pro_chemistry | `exact_match,custom-extract` | **0.580** |
| mmlu_pro_computer_science | `exact_match,custom-extract` | **0.676** |
| mmlu_pro_economics | `exact_match,custom-extract` | **0.678** |
| mmlu_pro_engineering | `exact_match,custom-extract` | **0.448** |
| mmlu_pro_health | `exact_match,custom-extract` | **0.678** |
| mmlu_pro_history | `exact_match,custom-extract` | **0.575** |
| mmlu_pro_law | `exact_match,custom-extract` | **0.432** |
| mmlu_pro_math | `exact_match,custom-extract` | **0.678** |
| mmlu_pro_other | `exact_match,custom-extract` | **0.612** |
| mmlu_pro_philosophy | `exact_match,custom-extract` | **0.549** |
| mmlu_pro_physics | `exact_match,custom-extract` | **0.630** |
| mmlu_pro_psychology | `exact_match,custom-extract` | **0.704** |
| leaderboard_math_hard | `exact_match,none` | **0.341** |
| leaderboard_math_algebra_hard | `exact_match,none` | **0.570** |
| leaderboard_math_counting_and_prob_hard | `exact_match,none` | **0.252** |
| leaderboard_math_geometry_hard | `exact_match,none` | **0.182** |
| leaderboard_math_intermediate_algebra_hard | `exact_match,none` | **0.139** |
| leaderboard_math_num_theory_hard | `exact_match,none` | **0.416** |
| leaderboard_math_prealgebra_hard | `exact_match,none` | **0.523** |
| leaderboard_math_precalculus_hard | `exact_match,none` | **0.126** |
Raw `results_*.json` files are committed under `evals/`.
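For reference, a fused checkpoint like the one evaluated above can be produced with PEFT's `merge_and_unload` (a minimal sketch; repo ids and paths are assumptions, and since merging into a 4-bit-quantized base is lossy, the full-precision base weights are the usual merge target):
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the full-precision base, attach this adapter, then fold the LoRA
# deltas into the base weights so the result loads as a single checkpoint.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Devstral-Small-2507", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base, "clemsail/devstral-v3-sft")
merged = model.merge_and_unload()
merged.save_pretrained("devstral-v3-sft-fused")
```
`lm-eval-harness` can then be pointed at the merged directory, e.g. via `--model hf --model_args pretrained=devstral-v3-sft-fused`.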
## Validated in `ailiance/ailiance-bench` v0.2
This model is referenced in the [Ailiance benchmark suite](https://github.com/ailiance/ailiance-bench)
(Phase 6 scoreboard, 7-task hardware-design evaluation).
See the full scoreboard:
[ailiance-bench README#scoreboard-lora-phase-6](https://github.com/ailiance/ailiance-bench#scoreboard-lora-phase-6--2026-05-11).
## Upstream base model — official evaluations
This LoRA fine-tunes [`mistralai/Devstral-Small-2-24B-Instruct-2512`](https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512),
Mistral's coding-specialist LLM. Headline software-engineering benchmarks
from the upstream model card:
| Benchmark | Devstral Small 2 (24B) | Devstral 2 (123B) | DeepSeek v3.2 (671B) | Claude Sonnet 4.5 |
|--------------------------|-----------------------:|------------------:|---------------------:|------------------:|
| **SWE Bench Verified** | **68.0 %** | 72.2 % | 73.1 % | 77.2 % |
| **SWE Bench Multilingual** | **55.7 %** | 61.3 % | 70.2 % | 68.0 % |
| **Terminal Bench 2** | **22.5 %** | 32.6 % | 46.4 % | 42.8 % |
(For reference, GPT-5.1 Codex High: 73.7 % SWE Verified · 52.8 % Terminal Bench 2.)
Devstral Small 2 (24B) is competitive with much larger open models on
SWE Bench Verified (e.g. it matches GLM-4.6 at 355B). The architecture uses
Llama-4-style RoPE scaling plus Scalable-Softmax ([arXiv:2501.19399](https://arxiv.org/abs/2501.19399)).
**Source:** [official Devstral-Small-2-24B-Instruct-2512 model card](https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512).
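For context on that last architectural note: Scalable-Softmax (SSMax, arXiv:2501.19399) rescales attention logits by `s * log(n)` before the usual softmax, which keeps the attention distribution from flattening as the context length `n` grows. A minimal sketch of the idea, not Devstral's actual implementation:
```python
import math

import torch


def ssmax(logits: torch.Tensor, s: float = 1.0) -> torch.Tensor:
    """Scalable-Softmax: scale logits by s * ln(n), where n is the
    number of keys, then apply the standard softmax."""
    n = logits.size(-1)
    return torch.softmax(s * math.log(n) * logits, dim=-1)
```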
> **Reading these alongside this LoRA:** Devstral Small 2 is a strong
> coding base. This LoRA inherits its SWE-Bench performance and adds
> language- or domain-specific specialization.