# Voxtral-Mini-3B (ExecuTorch, XNNPACK, 8da4w)

This folder contains an ExecuTorch `.pte` export of https://huggingface.co/mistralai/Voxtral-Mini-3B-2507 for CPU inference via the XNNPACK backend, with post-training quantization applied. Voxtral is a multimodal speech-language model that accepts audio and text inputs.

## Contents

- `model.pte`: ExecuTorch program
- `voxtral_preprocessor.pte`: audio preprocessor (mel-spectrogram extractor)

## Quantization

- `--qlinear 8da4w`: text decoder linear layers use 8-bit dynamic activations + 4-bit weights
- `--qlinear_encoder 8da4w`: audio encoder linear layers use 8-bit dynamic activations + 4-bit weights
- `--qembedding 4w`: embeddings use 4-bit weights

## Export model

```
pip install mistral_common

optimum-cli export executorch \
  --model "mistralai/Voxtral-Mini-3B-2507" \
  --task "multimodal-text-to-text" \
  --recipe "xnnpack" \
  --use_custom_sdpa \
  --use_custom_kv_cache \
  --max_seq_len 2048 \
  --qlinear 8da4w \
  --qlinear_encoder 8da4w \
  --qembedding 4w \
  --output_dir="voxtral"
```

## Export audio preprocessor (supports up to 5 min / 300 s of audio)

```
python -m executorch.extension.audio.mel_spectrogram \
  --feature_size 128 \
  --stack_output \
  --max_audio_len 300 \
  --output_file voxtral_preprocessor.pte
```

## Run

Download the tokenizer:

```
curl -L https://huggingface.co/mistralai/Voxtral-Mini-3B-2507/resolve/main/tekken.json --output tekken.json
```

Build the runner from the ExecuTorch repo root:

```
make voxtral-cpu
```

Run the model:

```
./cmake-out/examples/models/voxtral/voxtral_runner \
  --model_path "model.pte" \
  --tokenizer_path "tekken.json" \
  --prompt "What can you tell me about this audio?" \
  --audio_path "audio.wav" \
  --processor_path "voxtral_preprocessor.pte" \
  --temperature 0
```
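The run command above expects an `audio.wav` on disk. If you don't have a recording handy, a minimal Python-stdlib sketch like the following can generate a placeholder clip to smoke-test the pipeline. The sample rate and format here are assumptions (mono 16-bit PCM at 16 kHz is common for speech models); check Voxtral's documentation for the format its preprocessor actually expects.

```python
# Generate a short test clip ("audio.wav") for trying out the runner.
# Assumption: mono 16-bit PCM WAV at 16 kHz is acceptable input --
# verify against the model's audio preprocessing requirements.
import math
import struct
import wave

SAMPLE_RATE = 16_000   # Hz; assumed, not confirmed by the model docs
DURATION_S = 2.0       # clip length in seconds
FREQ_HZ = 440.0        # a simple sine tone as placeholder content

n_samples = int(SAMPLE_RATE * DURATION_S)
# Pack each sample as a little-endian signed 16-bit integer at ~30% amplitude.
frames = b"".join(
    struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * FREQ_HZ * i / SAMPLE_RATE)))
    for i in range(n_samples)
)

with wave.open("audio.wav", "wb") as wf:
    wf.setnchannels(1)           # mono
    wf.setsampwidth(2)           # 16-bit samples
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(frames)

print(f"Wrote audio.wav: {n_samples} samples at {SAMPLE_RATE} Hz")
```

A synthetic tone is only useful for checking that the runner loads the model, preprocessor, and tokenizer end to end; for meaningful transcription or audio-understanding output, substitute a real speech recording.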