# Voxtral-Mini-3B (ExecuTorch, XNNPACK, 8da4w)

This folder contains an ExecuTorch `.pte` export of https://huggingface.co/mistralai/Voxtral-Mini-3B-2507 for CPU inference via the XNNPACK backend, with post-training quantization applied. Voxtral is a multimodal speech-language model that accepts audio and text inputs.

## Contents

- `model.pte`: ExecuTorch program
- `voxtral_preprocessor.pte`: audio preprocessor (mel-spectrogram extractor)

## Quantization

- `--qlinear 8da4w`: text decoder linear layers use 8-bit dynamic activations + 4-bit weights
- `--qlinear_encoder 8da4w`: audio encoder linear layers use 8-bit dynamic activations + 4-bit weights
- `--qembedding 4w`: embeddings use 4-bit weights

## Export model

```
pip install mistral_common

optimum-cli export executorch \
  --model "mistralai/Voxtral-Mini-3B-2507" \
  --task "multimodal-text-to-text" \
  --recipe "xnnpack" \
  --use_custom_sdpa \
  --use_custom_kv_cache \
  --max_seq_len 2048 \
  --qlinear 8da4w \
  --qlinear_encoder 8da4w \
  --qembedding 4w \
  --output_dir="voxtral"
```

## Export audio preprocessor (supports up to 5 min / 300 s of audio)

```
python -m executorch.extension.audio.mel_spectrogram \
  --feature_size 128 \
  --stack_output \
  --max_audio_len 300 \
  --output_file voxtral_preprocessor.pte
```

## Run

Download the tokenizer:

```
curl -L https://huggingface.co/mistralai/Voxtral-Mini-3B-2507/resolve/main/tekken.json --output tekken.json
```

Build the runner from the ExecuTorch repo root:

```
make voxtral-cpu
```

Run the model:

```
./cmake-out/examples/models/voxtral/voxtral_runner \
  --model_path "model.pte" \
  --tokenizer_path "tekken.json" \
  --prompt "What can you tell me about this audio?" \
  --audio_path "audio.wav" \
  --processor_path "voxtral_preprocessor.pte" \
  --temperature 0
```
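The run command above expects an `audio.wav` on disk. If you don't have a recording handy, a minimal Python-stdlib sketch like the following can generate a placeholder clip to smoke-test the pipeline. The sample rate and format here are assumptions (mono 16-bit PCM at 16 kHz is common for speech models); check Voxtral's documentation for the format its preprocessor actually expects.

```python
# Generate a short test clip ("audio.wav") for trying out the runner.
# Assumption: mono 16-bit PCM WAV at 16 kHz is acceptable input --
# verify against the model's audio preprocessing requirements.
import math
import struct
import wave

SAMPLE_RATE = 16_000   # Hz; assumed, not confirmed by the model docs
DURATION_S = 2.0       # clip length in seconds
FREQ_HZ = 440.0        # a simple sine tone as placeholder content

n_samples = int(SAMPLE_RATE * DURATION_S)
# Pack each sample as a little-endian signed 16-bit integer at ~30% amplitude.
frames = b"".join(
    struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * FREQ_HZ * i / SAMPLE_RATE)))
    for i in range(n_samples)
)

with wave.open("audio.wav", "wb") as wf:
    wf.setnchannels(1)           # mono
    wf.setsampwidth(2)           # 16-bit samples
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(frames)

print(f"Wrote audio.wav: {n_samples} samples at {SAMPLE_RATE} Hz")
```

A synthetic tone is only useful for checking that the runner loads the model, preprocessor, and tokenizer end to end; for meaningful transcription or audio-understanding output, substitute a real speech recording.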