Voxtral-Mini-4B-Realtime-2602-ExecuTorch-CUDA-Windows

Pre-exported ExecuTorch artifacts for Voxtral-Mini-4B-Realtime-2602 targeting the CUDA-Windows backend (NVIDIA GPU): streaming speech-to-text in bf16 precision, with 4-bit weight-only quantization for the encoder/decoder linear layers and 8-bit weight-only quantization for the embeddings.

These artifacts were exported from WSL Ubuntu using the cuda-windows export path and the CUDA 12.9 Windows installer payload, then run natively on Windows with the ExecuTorch Voxtral realtime runner.

Installation

git clone https://github.com/pytorch/executorch/ ~/executorch
cd ~/executorch
pip install -e . --no-build-isolation

Build on Windows (PowerShell):

cmake --workflow --preset llm-release-cuda
Push-Location examples/models/voxtral_realtime
cmake --workflow --preset voxtral-realtime-cuda
Pop-Location

Download

pip install huggingface_hub

hf download younghan-meta/Voxtral-Mini-4B-Realtime-2602-ExecuTorch-CUDA-Windows --local-dir voxtral_cuda_windows
hf download mistralai/Voxtral-Mini-4B-Realtime-2602 tekken.json --local-dir voxtral_tokenizer

If tekken.json is also uploaded to this repo, you can skip the second download command and point --tokenizer_path at the local copy instead.

Run

Input audio should be 16 kHz mono WAV.
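A minimal Python sketch (standard library only) for checking that a WAV file matches the expected format before invoking the runner. The 16-bit PCM sample width is an assumption on my part, typical for speech WAV input but not stated above:

```python
import wave

def check_wav(path):
    """Return True if the WAV file is 16 kHz, mono, 16-bit PCM."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)
```

Files that fail this check can be converted with any audio tool (e.g. resampled to 16 kHz and downmixed to mono) before being passed via --audio_path.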

Windows (PowerShell):

.\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe `
    --model_path voxtral_cuda_windows\model.pte `
    --data_path voxtral_cuda_windows\aoti_cuda_blob.ptd `
    --preprocessor_path voxtral_cuda_windows\preprocessor.pte `
    --tokenizer_path voxtral_tokenizer\tekken.json `
    --audio_path C:\path\to\audio.wav `
    --streaming

Optional flags:

  • --temperature 0.0 -- greedy decoding (default)
  • --mic -- live microphone input from stdin
  • --mic_chunk_ms 80 -- microphone read chunk size in ms

Export Command

These artifacts were exported from WSL Ubuntu. Follow the same Windows CUDA extraction/setup flow as the Parakeet cuda-windows instructions, but use the CUDA 12.9 Windows installer payload before setting WINDOWS_CUDA_HOME.

export WINDOWS_CUDA_HOME=/opt/cuda-windows/extracted/cuda_cudart/cudart

python -m executorch.extension.audio.mel_spectrogram \
    --feature_size 128 \
    --streaming \
    --output_file ./voxtral_rt_exports/preprocessor.pte

python examples/models/voxtral_realtime/export_voxtral_rt.py \
    --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
    --backend cuda-windows \
    --dtype bf16 \
    --streaming \
    --sliding-window 2048 \
    --output-dir ./voxtral_rt_exports \
    --qlinear-encoder 4w \
    --qlinear-encoder-packing-format tile_packed_to_4d \
    --qlinear 4w \
    --qlinear-packing-format tile_packed_to_4d \
    --qembedding 8w
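A rough sketch of why the 4w settings matter for artifact size: a 4-bit weight-only linear layer stores one quarter the bytes of its bf16 (16-bit) counterpart, ignoring scale/zero-point overhead. The layer dimensions below are hypothetical, chosen only for illustration:

```python
def linear_weight_bytes(in_features, out_features, bits):
    """Approximate weight storage for one linear layer, scales excluded."""
    return in_features * out_features * bits // 8

base = linear_weight_bytes(4096, 4096, 16)   # bf16 baseline
quant = linear_weight_bytes(4096, 4096, 4)   # 4w weight-only
print(base // quant)  # 4x smaller
```

The same arithmetic gives a 2x reduction for the 8w embedding quantization.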

Notes

  • cuda-windows exports produce model.pte and aoti_cuda_blob.ptd.
  • The streaming runner also needs preprocessor.pte.
  • Keep tekken.json matched with the same base model revision used during export.
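The notes above imply four files must be present before launching the streaming runner. A minimal pre-flight sketch, assuming the download locations used earlier in this card:

```python
from pathlib import Path

# Artifacts the streaming runner expects, at the paths used in this card.
REQUIRED = [
    "voxtral_cuda_windows/model.pte",
    "voxtral_cuda_windows/aoti_cuda_blob.ptd",
    "voxtral_cuda_windows/preprocessor.pte",
    "voxtral_tokenizer/tekken.json",
]

def missing_files(root="."):
    """Return the subset of REQUIRED paths not found under root."""
    return [p for p in REQUIRED if not (Path(root) / p).exists()]
```

Running this from the directory where you ran the download commands reports anything still missing.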
