# Voxtral TTS Q4 GGUF

Q4_0 quantized weights for Voxtral 4B TTS in GGUF format, for use with voxtral-mini-realtime-rs.

Try the browser demo: it runs entirely client-side via WASM + WebGPU.
## Files

| File | Size | Description |
|---|---|---|
| voxtral-tts-q4.gguf | 2.67 GB | Full Q4 model (single file, for native use) |
| shard-{aa..af} | 6 × ≤512 MB | Sharded for browser (WASM ArrayBuffer limit) |
| voice_embedding/*.safetensors | ~50-200 KB each | 20 voice presets across 9 languages |
| tekken.json | 14.9 MB | Tekken BPE tokenizer |
## Model Details
- Base model: mistralai/Voxtral-4B-TTS-2603
- Quantization: Q4_0 (4-bit, 18 bytes per 32 elements)
- File size: 2.67 GB (vs ~8 GB BF16 original)
- Format: GGUF v3 (381 tensors)
- Inference: Burn ML framework with custom WGSL compute shaders
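The "18 bytes per 32 elements" figure comes from the Q4_0 block layout: one float16 scale (2 bytes) plus 32 four-bit quants packed two-per-byte (16 bytes). A pure-Python sketch of the ggml-style round trip, with rounding details simplified and nibble packing following the current ggml layout (element i in the low nibble, element i+16 in the high nibble):

```python
import struct

def quantize_q4_0(block):
    # 32 floats -> 18 bytes: f16 scale d, then 16 bytes of packed nibbles.
    assert len(block) == 32
    amax = max(block, key=abs)                 # value with largest magnitude
    d = amax / -8 if amax != 0 else 0.0        # scale so amax maps to quant 0
    inv = 1.0 / d if d else 0.0
    qs = [max(0, min(15, int(x * inv + 8.5))) for x in block]
    data = struct.pack("<e", d)
    data += bytes((qs[i] & 0xF) | ((qs[i + 16] & 0xF) << 4) for i in range(16))
    return data

def dequantize_q4_0(data):
    # Inverse: x = d * (q - 8) for each 4-bit quant.
    d, = struct.unpack("<e", data[:2])
    out = [0.0] * 32
    for i in range(16):
        b = data[2 + i]
        out[i] = d * ((b & 0xF) - 8)
        out[i + 16] = d * ((b >> 4) - 8)
    return out
```

18 bytes for 32 weights works out to 4.5 bits per weight, which is where the 2.67 GB total (vs ~8 GB BF16) comes from.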
## What is Quantized
| Component | Quantization |
|---|---|
| Backbone (Ministral 3B, 26 layers): attention + FFN | Q4_0 |
| Flow-matching transformer (3 layers): attention + FFN + projections | Q4_0 |
| Token embeddings [131072, 3072] | Q4_0 |
| Semantic codebook output [8320, 3072] | Q4_0 |
| Codec decoder (8 transformer + 5 conv layers) | F32 |
| RMSNorm, LayerScale, QK-norm, small projections | F32 |
| Audio codebook embeddings [9088, 3072] | F32 |
Codec decoder weights are stored as F32 with weight normalization pre-fused at export time.
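Pre-fusing weight normalization means the stored tensor is already w = g · v / ‖v‖, so inference skips the runtime normalization. A sketch of the fusion step for a single output channel (illustrative, not the export script's actual code):

```python
import math

def fuse_weight_norm(v, g):
    # Weight normalization parameterizes w = g * v / ||v||; fusing
    # computes this once at export time so the runtime sees a plain weight.
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]
```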
## Benchmarks

NVIDIA DGX Spark (GB10, LPDDR5x), synthesizing "The quick brown fox jumps over the lazy dog":
| Euler Steps | RTF (real-time factor) | Quality (Whisper large-v3 transcription) |
|---|---|---|
| 8 (default) | 1.61x | Perfect |
| 4 | 1.24x | Perfect |
| 3 | ~1.0x (real-time) | Perfect |
Optimizations: batched classifier-free guidance (CFG), fused QKV and gate/up projections, and a pre-allocated KV cache.
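The Euler-steps knob controls how many fixed-size Euler updates the flow-matching transformer uses to integrate its predicted velocity field from t=0 to t=1; fewer steps mean fewer transformer forward passes per frame, which is why RTF improves. A schematic sketch (the real model integrates latent tensors, not Python lists):

```python
def euler_sample(velocity_fn, x, steps):
    # Integrate dx/dt = v(x, t) over t in [0, 1] with fixed Euler steps.
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x
```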
## Usage

### Native CLI

```bash
# Download
uv run --with huggingface_hub \
  hf download TrevorJS/voxtral-tts-q4-gguf voxtral-tts-q4.gguf --local-dir models

# Synthesize (unified voxtral CLI)
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
  speak --text "Hello world" --voice casual_female --gguf models/voxtral-tts-q4.gguf

# Real-time with 3 Euler steps
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
  speak --text "Hello world" --gguf models/voxtral-tts-q4.gguf --euler-steps 3

# List voices
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- speak --list-voices
```
### Browser (WASM + WebGPU)

Shards are pre-split for browser loading; the TTS demo fetches them automatically.

For local dev:

```bash
wasm-pack build --target web --no-default-features --features wasm
bun serve.mjs  # serves shards from models/voxtral-tts-q4-shards/
```
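If you only have the browser shards but want the single-file GGUF, byte-split shards concatenate back together in name order. A sketch (the `join_shards` helper is illustrative, assuming plain `split`-style shards):

```python
from pathlib import Path

def join_shards(shard_dir, out_path, prefix="shard-"):
    # Byte-split shards (shard-aa .. shard-af) concatenate back into
    # the original file; sorted() restores the aa..af order.
    shards = sorted(Path(shard_dir).glob(prefix + "*"))
    with open(out_path, "wb") as out:
        for shard in shards:
            out.write(shard.read_bytes())
    return len(shards)
```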
## Available Voices

20 presets across 9 languages:
| Voice | Language |
|---|---|
| casual_female, casual_male | English |
| neutral_female, neutral_male | English |
| cheerful_female | English |
| fr_female, fr_male | French |
| de_female, de_male | German |
| es_female, es_male | Spanish |
| it_female, it_male | Italian |
| pt_female, pt_male | Portuguese |
| nl_female, nl_male | Dutch |
| hi_female, hi_male | Hindi |
| ar_male | Arabic |
## Quantization Script

```bash
uv run --with safetensors --with torch --with numpy --with packaging \
  scripts/quantize_tts_gguf.py models/voxtral-tts/ -o voxtral-tts-q4.gguf
```
Source: scripts/quantize_tts_gguf.py
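The split in the "What is Quantized" table reflects a common mixed-precision export policy: large 2-D matmul weights go to Q4_0, while norms, small projections, and the numerically sensitive codec stay F32. A hypothetical sketch of such a selection rule (names and threshold are illustrative, not the script's actual logic):

```python
def pick_quant(name, shape):
    # Keep numerically sensitive / small tensors in F32; quantize
    # only large 2-D weight matrices to Q4_0.
    if "norm" in name or "codec" in name:
        return "F32"
    if len(shape) == 2 and shape[0] * shape[1] >= 1_000_000:
        return "Q4_0"
    return "F32"
```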
## Related
- Code: TrevorS/voxtral-mini-realtime-rs
- ASR Model: TrevorJS/voxtral-mini-realtime-gguf
- ASR Demo: TrevorJS/voxtral-mini-realtime
- TTS Demo: TrevorJS/voxtral-4b-tts
## Model tree

- Base model: mistralai/Ministral-3-3B-Base-2512
- Fine-tuned as: mistralai/Voxtral-4B-TTS-2603
- Quantized here: TrevorJS/voxtral-tts-q4-gguf