Gave it a try, here are my findings
@Firworks I gave this a try and (with the help of Claude Code) managed to load and run the model, but it is not reliable.
Issue Report: Firworks/Ministral-3-14B-Reasoning-2512-nvfp4 Compatibility Issues
Environment
- vLLM Version: 0.14.0rc1.dev127+g2d6001f49
- PyTorch Version: 2.9.1+cu130
- Transformers Version: 4.57.3
Summary
The Firworks/Ministral-3-14B-Reasoning-2512-nvfp4 model fails to load in vLLM out of the box and requires multiple configuration workarounds. Even after applying the fixes below, the model produces degenerate output on longer generations.
Problem 1: Unknown model_type ministral3
Error:
KeyError: 'ministral3'
Cause: text_config.model_type is set to ministral3, which doesn't exist in transformers' CONFIG_MAPPING. The valid types are ministral or mistral.
Workaround: Changed text_config.model_type from ministral3 to mistral in config.json.
Problem 2: Missing architectures in text_config
Error:
TypeError: 'NoneType' object is not iterable
Cause: text_config lacks an architectures field, causing vLLM to fail when resolving the language model class.
Workaround: Added "architectures": ["MistralForCausalLM"] to text_config.
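For reference, the two config.json edits above (Problems 1 and 2) can be scripted. This is a minimal sketch assuming the stock config.json layout; `patch_text_config` is an illustrative name, not part of any library:

```python
import json

def patch_text_config(cfg: dict) -> dict:
    # Force a model_type that exists in transformers' CONFIG_MAPPING (Problem 1)
    # and add the missing architectures list (Problem 2).
    text_cfg = cfg.setdefault("text_config", {})
    text_cfg["model_type"] = "mistral"                  # was "ministral3"
    text_cfg["architectures"] = ["MistralForCausalLM"]  # field was absent
    return cfg

# Against a file this would be:
#   with open("config.json") as f: cfg = json.load(f)
#   with open("config.json", "w") as f: json.dump(patch_text_config(cfg), f, indent=2)
example = {"model_type": "mistral3", "text_config": {"model_type": "ministral3"}}
print(patch_text_config(example)["text_config"]["model_type"])  # mistral
```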
Problem 3: Missing preprocessor/processor config files
Error:
OSError: Can't load image processor... preprocessor_config.json
Cause: The quantized model is missing preprocessor_config.json and processor_config.json required for the Pixtral vision encoder.
Workaround: Copied these files from the original mistralai/Ministral-3-14B-Reasoning-2512 model.
Problem 4: Missing tokenizer files
Error:
AssertionError: Expected to decode 1 token, got 3
Cause: The model is missing proper tokenizer files (tokenizer.json, tokenizer_config.json, special_tokens_map.json) with the [IMG], [THINK], [/THINK] special tokens properly defined.
Workaround: Downloaded tokenizer files from the original model.
Problem 5: MistralTokenizer incompatibility with HuggingFace tokenizer mode
Error:
AttributeError: 'MistralTokenizer' object has no attribute 'convert_tokens_to_ids'
Cause: When using native Mistral tokenizer mode, it lacks methods expected by vLLM's processing pipeline.
Workaround: Used --tokenizer-mode hf to force HuggingFace tokenizer.
Problem 6: Reasoning parser requires MistralTokenizer
Error:
ValueError: The tokenizer must be an instance of MistralTokenizer.
Cause: --reasoning-parser mistral requires the native MistralTokenizer, but we're forced to use HF tokenizer (see Problem 5).
Workaround: Created a custom ministral reasoning parser that uses [THINK]/[/THINK] tokens but works with HF tokenizer.
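The core of such a parser is just splitting on the literal markers. A minimal, tokenizer-agnostic sketch (the real vLLM parser must subclass vLLM's ReasoningParser and also handle streaming deltas; `split_reasoning` is an illustrative name):

```python
def split_reasoning(text: str,
                    start: str = "[THINK]",
                    end: str = "[/THINK]") -> tuple[str, str]:
    """Split model output into (reasoning, content) on the literal
    [THINK]/[/THINK] markers, so no MistralTokenizer is needed."""
    s = text.find(start)
    if s == -1:
        return "", text                    # no reasoning block at all
    e = text.find(end, s + len(start))
    if e == -1:
        return text[s + len(start):], ""   # unterminated: reasoning still streaming
    return text[s + len(start):e], text[e + len(end):]

print(split_reasoning("[THINK]2+2=4[/THINK]The answer is 4."))
# ('2+2=4', 'The answer is 4.')
```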
Problem 7: Missing chat_template
Error:
ChatTemplateResolutionError: default chat template is no longer allowed
Cause: Neither the quantized model nor the tokenizer_config.json includes a chat_template.
Workaround: Downloaded chat_template.jinja from the original model and added --chat-template flag.
Problem 8: Tokenizer regex warning
Warning:
The tokenizer you are loading with an incorrect regex pattern...
You should set fix_mistral_regex=True
Cause: Known issue with Mistral tokenizer regex patterns for contraction handling.
Status: Warning only; tokenizer.json appears to contain the correct pattern, but the warning persists.
Problem 9: Degenerate output on longer generations
Symptom: Model produces coherent output for short responses but degenerates into:
- Random Unicode characters from multiple languages (漢, Bาง, пре历, دياس, Україна, რომელიც)
- Nonsense word fragments mixed with English
- Extreme repetition loops ("harry harry harry..." or "above it, above it, above it...")
Cause (suspected):
- Architecture mismatch: Forcing MistralForCausalLM instead of native MinistralForCausalLM (which vLLM doesn't support yet)
- RoPE configuration: The original uses rope_parameters with Ministral-specific fields; we converted to standard rope_scaling format, potentially losing important parameters
- Possible quantization corruption
Status: UNRESOLVED. Model is unusable for production.
Files Missing from Quantized Model
The following files present in the original model are missing from the Firworks quantization:
- chat_template.jinja
- preprocessor_config.json
- processor_config.json
- params.json (required for --config_format mistral --load_format mistral)
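As a convenience, the missing files can be pulled from the original repo with huggingface_hub. A sketch assuming you have access to the (gated) mistralai repo; `fetch_missing` and the local_dir default are illustrative:

```python
MISSING_FILES = [
    "chat_template.jinja",
    "preprocessor_config.json",
    "processor_config.json",
    "params.json",
    # the tokenizer files from Problem 4 (tokenizer.json, tokenizer_config.json,
    # special_tokens_map.json) can be appended here as well
]

def fetch_missing(local_dir: str = "./Ministral-3-14B-Reasoning-2512-nvfp4") -> None:
    """Download each missing file from the original repo into the local
    quantized checkpoint. Requires `pip install huggingface_hub` and
    access to the gated mistralai repo."""
    from huggingface_hub import hf_hub_download
    for name in MISSING_FILES:
        hf_hub_download("mistralai/Ministral-3-14B-Reasoning-2512",
                        name, local_dir=local_dir)
```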
Recommended Fixes
- Fix text_config.model_type: Change from ministral3 to ministral (or mistral for broader transformers compatibility)
- Add architectures to text_config: Include "architectures": ["MinistralForCausalLM"] or appropriate class
- Include all required files: Bundle chat_template.jinja, preprocessor_config.json, processor_config.json, and tokenizer files
- Consider including params.json: This would allow native Mistral format loading which might work better
- Validate quantization quality: The degenerate output suggests possible weight corruption or incompatibility in the nvfp4 quantization process
- Test with vLLM: Verify the model loads and generates coherently on vLLM before publishing
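On the last point, even a minimal offline smoke test would catch the degeneration from Problem 9. A sketch using vLLM's Python API, assuming the patched checkpoint sits at a local path (path and prompt are illustrative; requires a GPU):

```python
# Smoke test: generate long enough to trigger the degeneration seen in Problem 9.
from vllm import LLM, SamplingParams

llm = LLM(
    model="./Ministral-3-14B-Reasoning-2512-nvfp4",  # local patched checkpoint (illustrative path)
    tokenizer_mode="hf",
    trust_remote_code=True,
)
params = SamplingParams(max_tokens=1024, temperature=0.7)
out = llm.generate(["Explain YaRN RoPE scaling in detail."], params)
print(out[0].outputs[0].text)  # inspect manually for repetition loops / random multilingual tokens
```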
Config Changes Applied (for reference)
text_config modifications:
```json
{
  "model_type": "mistral",                 // changed from "ministral3"
  "architectures": ["MistralForCausalLM"], // added
  "rope_theta": 1000000000.0,
  "rope_scaling": {                        // converted from "rope_parameters"
    "type": "yarn",
    "factor": 16.0,
    "original_max_position_embeddings": 16384,
    "beta_fast": 32.0,
    "beta_slow": 1.0,
    "mscale": 1.0,
    "mscale_all_dim": 1.0
  }
}
```
vLLM launch flags:
```shell
--tokenizer-mode hf \
--reasoning-parser ministral \
--chat-template /path/to/chat_template.jinja \
--trust-remote-code
# "ministral" is the custom reasoning parser from Problem 6
```
Conclusion
The Firworks/Ministral-3-14B-Reasoning-2512-nvfp4 model requires significant patching to load on vLLM, and even after all workarounds, produces unusable output for longer generations. The quantization appears to be incompatible with current vLLM/transformers infrastructure and may have quality issues in the quantized weights themselves.