Which finetuning toolkits?
This seems to be the first finetune of the Ministral3 models on HF. What toolkits did you use?
I managed to finetune the model using Unsloth and TRL, but converting it back to an HF-compatible version failed.
Note: This version fails to load on inference engines as well. vllm throws:
vllm serve ramendik/miki-breeze-20260208 --tokenizer_mode mistral --config_format mistral --load_format mistral --enable-auto-tool-choice --tool-call-parser mistral --max_model_len 16384 --tensor_parallel_size 4 --enforce-eager
->
(APIServer pid=26406) s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=26406) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=26406) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=26406) Value error, Failed to load mistral 'params.json' config for model ramendik/miki-breeze-20260208. Please check if the model is a mistral-format model and if the config file exists. [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]
(APIServer pid=26406) For further information visit https://errors.pydantic.dev/2.12/v/value_error
So, the toolkit I used is TRL+peft. No Unsloth. But most critically, I had to use the custom model class, Mistral3ForConditionalGeneration. I also had to use the latest transformers library - and it is NOT compatible with vllm, so I had to keep vllm and the Ministral training setup in separate venvs.
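For context, the two-venv separation is just plain virtual environments (a sketch only; the package versions you end up with depend on when you install, and these names are illustrative, not my exact setup):

```shell
# Training venv: latest transformers + TRL + peft
python -m venv venv-train
venv-train/bin/pip install --upgrade transformers trl peft

# Inference venv: vllm brings its own (typically older) transformers pin
python -m venv venv-inference
venv-inference/bin/pip install vllm
```

Keeping the installs isolated avoids pip downgrading transformers under the training code when vllm is installed.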
The relevant code snippets follow; if you want, I can share the full script. The only catch is that it relies on my custom data prebatching, which I frankly have no time to document in detail. If you want my framework, just tell me and I'll bake a public GitHub version, and then you can either run my prebatching or rip out the custom data stuff. It also has a choice of three optimizers - AdamW, GrokAdamW, and Muon; I really like how Muon works. It is also heavily vibe-coded and tested by exactly one person, so tread with caution.
For now, here are the most relevant snippets.
```python
from transformers.models.mistral3.modeling_mistral3 import Mistral3ForConditionalGeneration
from peft import get_peft_model, LoraConfig, PeftModel

model = Mistral3ForConditionalGeneration.from_pretrained(
    args.model_name,
    torch_dtype="auto",
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True,
)

# Enable gradient checkpointing BEFORE applying PEFT
model.gradient_checkpointing_enable()
# Enable input gradients (required for PEFT with gradient checkpointing)
model.enable_input_require_grads()

# Find text attention modules (not VL)
# Ministral 3 has 40 language_model layers (layers.0 through layers.39)
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
text_targets = []
for i in range(40):
    for proj in target_modules:
        text_targets.append(f"language_model.layers.{i}.self_attn.{proj}")

# Create LoRA config (only apply to language_model layers)
lora_config = LoraConfig(
    r=args.rank,
    lora_alpha=args.alpha,
    target_modules=text_targets,
    lora_dropout=0,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA to model
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()

# ... train with Trainer as usual, lots of my custom stuff here so not pasting

# Save final merged 16-bit model (unless --skip-merge)
if not args.force_steps and not args.skip_merge:
    print("\n" + "=" * 80)
    print("SAVING MERGED 16-BIT MODEL")
    print("=" * 80)
    final_dir = output_dir / "final_merged_model"
    final_dir.mkdir(exist_ok=True)
    print(f"Merging to 16-bit and saving to: {final_dir}")
    # Merge the LoRA adapter into the base weights and drop the PEFT wrapper
    merged_model = model.merge_and_unload()
    print("Saving merged model...")
    merged_model.save_pretrained(str(final_dir))
    tokenizer.save_pretrained(str(final_dir))
```
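As a standalone sanity check on the target list construction above (no model needed): 40 layers × 4 projections should give 160 fully-qualified module names, and you can eyeball the first and last to confirm the naming scheme.

```python
# Rebuild the LoRA target list exactly as the training snippet does,
# as a list comprehension
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
text_targets = [
    f"language_model.layers.{i}.self_attn.{proj}"
    for i in range(40)
    for proj in target_modules
]

print(len(text_targets))   # 160 = 40 layers x 4 projections
print(text_targets[0])     # language_model.layers.0.self_attn.q_proj
print(text_targets[-1])    # language_model.layers.39.self_attn.o_proj
```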
Regarding the vllm issue: I have, at least for now, abandoned my attempt on Ministral. My fine-tune is style-centric and Ministral outright resists it, compared to Granite, on which I do my other attempts. I temporarily switched to Ministral because I could not get Granite 4-h Small to train with quantized experts and lacked the VRAM to train it otherwise. I have since resolved that problem (though that framework is VERY MUCH in early testing).
So, I have now tested, and the checkpoint does not work for me in vllm either, but I won't be trying to fix it now, sorry. At the time I was doing this I could not get vllm to run the original Ministral either, so I had no reason to try. Apparently Ministral is supported now?
I actually inferred with this checkpoint by converting it to GGUF, which did work with the latest llama.cpp at the time.
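For completeness, the GGUF route looked roughly like this (a sketch assuming a recent llama.cpp checkout; the paths, output names, and quant type here are illustrative, not my exact invocation):

```shell
# Convert the merged HF checkpoint to GGUF at f16
python llama.cpp/convert_hf_to_gguf.py final_merged_model \
    --outtype f16 --outfile miki-breeze-f16.gguf

# Optionally quantize the result for lighter inference
llama.cpp/build/bin/llama-quantize miki-breeze-f16.gguf \
    miki-breeze-Q4_K_M.gguf Q4_K_M
```

Note this converts the merged full-precision model, not the LoRA adapter on its own.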
Thanks for the details!
vLLM works with Ministral3. Some versions of SGLang have worked too, but with the volume of updates across the surrounding libraries it currently doesn't; the devs are working on support. Support for many new OSS models (Ministral3, Granite4, Nemotron3) is patchy and keeps breaking, even more so with quantization, tensor parallelism, and other features needed for actual deployment.
I did not have any problems running my Granite4-h finetuned checkpoints with vllm; one is benchmarking right now :)
Best of luck with your Ministral finetunes!