Gave it a try, here are my findings

#2
by ktsaou - opened

@Firworks I gave this a try and (with the help of Claude Code) managed to load and run the model, but it is not reliable.


Issue Report: Firworks/Ministral-3-14B-Reasoning-2512-nvfp4 Compatibility Issues

Environment

  • vLLM Version: 0.14.0rc1.dev127+g2d6001f49
  • PyTorch Version: 2.9.1+cu130
  • Transformers Version: 4.57.3

Summary

The Firworks/Ministral-3-14B-Reasoning-2512-nvfp4 model fails to load on vLLM and requires multiple config workarounds. Even after applying fixes, the model produces degenerate output on longer generations.


Problem 1: Unknown model_type ministral3

Error:
KeyError: 'ministral3'

Cause: text_config.model_type is set to ministral3, which doesn't exist in transformers' CONFIG_MAPPING. The valid types are ministral or mistral.

Workaround: Changed text_config.model_type from ministral3 to mistral in config.json.


Problem 2: Missing architectures in text_config

Error:
TypeError: 'NoneType' object is not iterable

Cause: text_config lacks an architectures field, causing vLLM to fail when resolving the language model class.

Workaround: Added "architectures": ["MistralForCausalLM"] to text_config.


Problem 3: Missing preprocessor/processor config files

Error:
OSError: Can't load image processor... preprocessor_config.json

Cause: The quantized model is missing preprocessor_config.json and processor_config.json required for the Pixtral vision encoder.

Workaround: Copied these files from the original mistralai/Ministral-3-14B-Reasoning-2512 model.
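One way to script that copy, assuming `huggingface_hub` is installed (the download call is guarded behind `dry_run` so the file list can be checked without network access; the local directory path is a placeholder):

```python
# Fetch the processor configs the quantized repo lacks from the original model.
ORIGINAL_REPO = "mistralai/Ministral-3-14B-Reasoning-2512"
MISSING_FILES = ["preprocessor_config.json", "processor_config.json"]

def fetch_missing(local_dir, dry_run=True):
    """Download each missing file next to the quantized checkpoint."""
    fetched = []
    for name in MISSING_FILES:
        if not dry_run:
            from huggingface_hub import hf_hub_download
            hf_hub_download(ORIGINAL_REPO, name, local_dir=local_dir)
        fetched.append(name)
    return fetched

print(fetch_missing("/path/to/quantized-model"))
# ['preprocessor_config.json', 'processor_config.json']
```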


Problem 4: Missing tokenizer files

Error:
AssertionError: Expected to decode 1 token, got 3

Cause: The model is missing proper tokenizer files (tokenizer.json, tokenizer_config.json, special_tokens_map.json) with the [IMG], [THINK], [/THINK] special tokens properly defined.

Workaround: Downloaded tokenizer files from the original model.
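After copying the tokenizer files over, a quick sanity check that the special tokens the model relies on are actually defined helps catch this class of failure early. This is a sketch assuming the standard HF `tokenizer_config.json` layout, where added tokens live under `added_tokens_decoder`:

```python
REQUIRED_SPECIALS = ["[IMG]", "[THINK]", "[/THINK]"]

def missing_special_tokens(tokenizer_config):
    """Return required special tokens absent from added_tokens_decoder."""
    defined = {
        entry["content"]
        for entry in tokenizer_config.get("added_tokens_decoder", {}).values()
    }
    return [tok for tok in REQUIRED_SPECIALS if tok not in defined]

# Example against a minimal config that is missing [/THINK]:
cfg = {"added_tokens_decoder": {"10": {"content": "[IMG]"},
                                "34": {"content": "[THINK]"}}}
print(missing_special_tokens(cfg))  # ['[/THINK]']
```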


Problem 5: MistralTokenizer incompatibility with HuggingFace tokenizer mode

Error:
AttributeError: 'MistralTokenizer' object has no attribute 'convert_tokens_to_ids'

Cause: When using native Mistral tokenizer mode, it lacks methods expected by vLLM's processing pipeline.

Workaround: Used --tokenizer-mode hf to force HuggingFace tokenizer.


Problem 6: Reasoning parser requires MistralTokenizer

Error:
ValueError: The tokenizer must be an instance of MistralTokenizer.

Cause: --reasoning-parser mistral requires the native MistralTokenizer, but we're forced to use HF tokenizer (see Problem 5).

Workaround: Created a custom ministral reasoning parser that uses [THINK]/[/THINK] tokens but works with HF tokenizer.
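A minimal sketch of what the custom `ministral` parser does: split `[THINK]...[/THINK]` reasoning from the final answer with plain string matching, so it works regardless of tokenizer backend. The actual parser registers with vLLM's reasoning-parser plugin interface; this shows only the core extraction logic:

```python
THINK_OPEN, THINK_CLOSE = "[THINK]", "[/THINK]"

def split_reasoning(text):
    """Return (reasoning_content, final_content) from raw model output."""
    start = text.find(THINK_OPEN)
    if start == -1:
        return None, text
    end = text.find(THINK_CLOSE, start)
    if end == -1:  # unterminated reasoning block: treat everything as reasoning
        return text[start + len(THINK_OPEN):], ""
    reasoning = text[start + len(THINK_OPEN):end]
    final = text[end + len(THINK_CLOSE):]
    return reasoning.strip(), final.strip()

print(split_reasoning("[THINK]2+2 is 4[/THINK]The answer is 4."))
# ('2+2 is 4', 'The answer is 4.')
```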


Problem 7: Missing chat_template

Error:
ChatTemplateResolutionError: default chat template is no longer allowed

Cause: Neither the quantized model nor the tokenizer_config.json includes a chat_template.

Workaround: Downloaded chat_template.jinja from the original model and added --chat-template flag.


Problem 8: Tokenizer regex warning

Warning:
The tokenizer you are loading with an incorrect regex pattern...
You should set fix_mistral_regex=True

Cause: Known issue with Mistral tokenizer regex patterns for contraction handling.

Status: Warning only; tokenizer.json appears to contain the correct pattern, but the warning persists.


Problem 9: Degenerate output on longer generations

Symptom: Model produces coherent output for short responses but degenerates into:

  • Random Unicode characters from multiple languages (漢, Bาง, пре历, دياس, Україна, რომელიც)
  • Nonsense word fragments mixed with English
  • Extreme repetition loops ("harry harry harry..." or "above it, above it, above it...")

Cause (suspected):

  1. Architecture mismatch: Forcing MistralForCausalLM instead of native MinistralForCausalLM (which vLLM doesn't support yet)
  2. RoPE configuration: The original uses rope_parameters with Ministral-specific fields; we converted to standard rope_scaling format, potentially losing important parameters
  3. Possible quantization corruption

Status: UNRESOLVED. Model is unusable for production.


Files Missing from Quantized Model

The following files present in the original model are missing from the Firworks quantization:

  • chat_template.jinja
  • preprocessor_config.json
  • processor_config.json
  • params.json (required for --config_format mistral --load_format mistral)

Recommended Fixes

  1. Fix text_config.model_type: Change from ministral3 to ministral (or mistral if targeting compatibility)
  2. Add architectures to text_config: Include "architectures": ["MinistralForCausalLM"] or appropriate class
  3. Include all required files: Bundle chat_template.jinja, preprocessor_config.json, processor_config.json, and tokenizer files
  4. Consider including params.json: This would allow native Mistral format loading which might work better
  5. Validate quantization quality: The degenerate output suggests possible weight corruption or incompatibility in the nvfp4 quantization process
  6. Test with vLLM: Verify the model loads and generates coherently on vLLM before publishing

Config Changes Applied (for reference)

text_config modifications:
{
  "model_type": "mistral",                  // changed from "ministral3"
  "architectures": ["MistralForCausalLM"],  // added
  "rope_theta": 1000000000.0,
  "rope_scaling": {                         // converted from "rope_parameters"
    "type": "yarn",
    "factor": 16.0,
    "original_max_position_embeddings": 16384,
    "beta_fast": 32.0,
    "beta_slow": 1.0,
    "mscale": 1.0,
    "mscale_all_dim": 1.0
  }
}

vLLM launch flags:
--tokenizer-mode hf
--reasoning-parser ministral   # custom parser
--chat-template /path/to/chat_template.jinja
--trust-remote-code


Conclusion

The Firworks/Ministral-3-14B-Reasoning-2512-nvfp4 model requires significant patching to load on vLLM, and even after all workarounds, produces unusable output for longer generations. The quantization appears to be incompatible with current vLLM/transformers infrastructure and may have quality issues in the quantized weights themselves.
