Voxtral-Mini-4B-Realtime-2602-ExecuTorch-CUDA-Windows
Pre-exported ExecuTorch artifacts for Voxtral-Mini-4B-Realtime-2602 with the CUDA-Windows backend (NVIDIA GPU): streaming speech-to-text, bf16 precision, 4-bit weight-only quantization for encoder/decoder linear layers, and 8-bit weight-only quantization for embeddings.
These artifacts were exported from WSL Ubuntu using the cuda-windows export path and the CUDA 12.9 Windows CUDA payload, then run natively on Windows with the ExecuTorch Voxtral realtime runner.
Installation
```shell
git clone https://github.com/pytorch/executorch/ ~/executorch
cd ~/executorch
pip install -e . --no-build-isolation
```
Build on Windows (PowerShell):
```powershell
cmake --workflow --preset llm-release-cuda
Push-Location examples/models/voxtral_realtime
cmake --workflow --preset voxtral-realtime-cuda
Pop-Location
```
Download
```shell
pip install huggingface_hub
hf download younghan-meta/Voxtral-Mini-4B-Realtime-2602-ExecuTorch-CUDA-Windows --local-dir voxtral_cuda_windows
hf download mistralai/Voxtral-Mini-4B-Realtime-2602 tekken.json --local-dir voxtral_tokenizer
```
If tekken.json is also uploaded to this repo, you can skip the second download command and point --tokenizer_path at the local copy instead.
Run
Input audio should be 16 kHz mono WAV.
Windows (PowerShell):
```powershell
.\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe `
  --model_path voxtral_cuda_windows\model.pte `
  --data_path voxtral_cuda_windows\aoti_cuda_blob.ptd `
  --preprocessor_path voxtral_cuda_windows\preprocessor.pte `
  --tokenizer_path voxtral_tokenizer\tekken.json `
  --audio_path C:\path\to\audio.wav `
  --streaming
```
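Since the runner expects 16 kHz mono WAV input, it can save a failed run to sanity-check files first. A minimal sketch using only the Python standard library (the function name is mine, and the 16-bit sample width check is an assumption; the card only specifies 16 kHz mono):

```python
import wave

def is_runner_compatible(path: str) -> bool:
    """Check that a WAV file is 16 kHz, mono, 16-bit PCM.

    16-bit sample width is an assumption; the model card only
    states that input should be 16 kHz mono WAV.
    """
    with wave.open(path, "rb") as w:
        return (
            w.getframerate() == 16000
            and w.getnchannels() == 1
            and w.getsampwidth() == 2
        )
```

Audio in other formats can be converted beforehand, e.g. with ffmpeg: `ffmpeg -i input.m4a -ac 1 -ar 16000 audio.wav`.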
Optional flags:
- `--temperature 0.0`: greedy decoding (default)
- `--mic`: live microphone input from stdin
- `--mic_chunk_ms 80`: microphone read chunk size in ms
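In `--mic` mode the runner reads audio from stdin in fixed-duration chunks. Assuming raw 16 kHz, 16-bit, mono PCM on stdin (the sample format is an assumption; the card only specifies 16 kHz mono), the byte size implied by `--mic_chunk_ms` works out as:

```python
def chunk_bytes(chunk_ms: int, sample_rate: int = 16000,
                bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes per microphone read chunk for raw PCM audio."""
    samples = sample_rate * chunk_ms // 1000
    return samples * bytes_per_sample * channels

# Default 80 ms chunk at 16 kHz, 16-bit mono:
# 16000 * 0.08 samples * 2 bytes = 2560 bytes per read.
print(chunk_bytes(80))
```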
Export Command
These artifacts were exported from WSL Ubuntu. Follow the same Windows CUDA extraction/setup flow as the Parakeet cuda-windows instructions, but use the CUDA 12.9 Windows installer payload before setting WINDOWS_CUDA_HOME.
```shell
export WINDOWS_CUDA_HOME=/opt/cuda-windows/extracted/cuda_cudart/cudart

python -m executorch.extension.audio.mel_spectrogram \
  --feature_size 128 \
  --streaming \
  --output_file ./voxtral_rt_exports/preprocessor.pte
```
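The preprocessor computes a log-mel spectrogram with 128 mel features (`--feature_size 128`). As a rough illustration of how those 128 bins are spaced, here is a sketch of HTK-style mel-scale spacing between 0 Hz and the 8 kHz Nyquist of 16 kHz audio; the exact mel formula and frequency range used by the ExecuTorch preprocessor are assumptions:

```python
import math

def hz_to_mel(f: float) -> float:
    """HTK mel scale (assumed; other mel variants exist)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m: float) -> float:
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_bin_edges(n_mels: int = 128, sr: int = 16000, fmin: float = 0.0):
    """n_mels + 2 frequencies (Hz), evenly spaced on the mel scale,
    giving the edges/centers of n_mels triangular filters."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(sr / 2)
    return [mel_to_hz(lo + (hi - lo) * i / (n_mels + 1))
            for i in range(n_mels + 2)]

edges = mel_bin_edges()  # 130 points spanning 0 Hz to 8 kHz
```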
```shell
python examples/models/voxtral_realtime/export_voxtral_rt.py \
  --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
  --backend cuda-windows \
  --dtype bf16 \
  --streaming \
  --sliding-window 2048 \
  --output-dir ./voxtral_rt_exports \
  --qlinear-encoder 4w \
  --qlinear-encoder-packing-format tile_packed_to_4d \
  --qlinear 4w \
  --qlinear-packing-format tile_packed_to_4d \
  --qembedding 8w
```
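The `--qlinear 4w` and `--qlinear-encoder 4w` flags apply 4-bit weight-only quantization to linear layers. A minimal sketch of the general idea, symmetric groupwise 4-bit quantization, where the group size and symmetric scheme are illustrative assumptions and the real export uses a backend-specific layout (`tile_packed_to_4d`):

```python
def quantize_4bit_groupwise(weights, group_size=32):
    """Quantize a flat weight list to 4-bit integers per group:
    scale = max|w| / 7, q = clamp(round(w / scale), -8, 7)."""
    qs, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7.0 or 1.0  # avoid div-by-zero
        qs.append([max(-8, min(7, round(w / scale))) for w in group])
        scales.append(scale)
    return qs, scales

def dequantize(qs, scales):
    """Reconstruct approximate weights from groups and per-group scales."""
    return [q * s for qgroup, s in zip(qs, scales) for q in qgroup]
```

At 4 bits per weight plus one scale per group, the linear-layer weights shrink to roughly a quarter of their bf16 size, which is the point of the `4w` scheme.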
Notes
- `cuda-windows` exports produce `model.pte` and `aoti_cuda_blob.ptd`.
- The streaming runner also needs `preprocessor.pte`.
- Keep `tekken.json` matched to the same base model revision used during export.