Voxtral TTS Q4 GGUF

Q4_0 quantized weights for Voxtral 4B TTS in GGUF format. For use with voxtral-mini-realtime-rs.

Try the browser demo: it runs entirely client-side via WASM + WebGPU.

Files

| File | Size | Description |
|------|------|-------------|
| `voxtral-tts-q4.gguf` | 2.67 GB | Full Q4 model (single file, for native use) |
| `shard-{aa..af}` | 6 × ≤512 MB | Sharded for browser (WASM ArrayBuffer limit) |
| `voice_embedding/*.safetensors` | ~50-200 KB each | 20 voice presets across 9 languages |
| `tekken.json` | 14.9 MB | Tekken BPE tokenizer |
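The ≤512 MB shard size keeps each piece under per-`ArrayBuffer` allocation limits in browsers. The `shard-aa..af` names follow the default two-letter suffix scheme of `split(1)`; a minimal Python sketch of the same splitting (paths and the helper name are illustrative, not from this repo):

```python
# Sketch: split a model file into fixed-size shards with split(1)-style
# two-letter suffixes (aa, ab, ...), as used for the shard-aa..af files.
import string
from pathlib import Path

def suffixes():
    # aa, ab, ..., az, ba, ... (plenty for 6 shards)
    for a in string.ascii_lowercase:
        for b in string.ascii_lowercase:
            yield a + b

def split_file(src: Path, out_dir: Path, shard_bytes: int = 512 * 1024 * 1024):
    """Write src into shard-aa, shard-ab, ... of at most shard_bytes each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with src.open("rb") as f:
        for suffix in suffixes():
            chunk = f.read(shard_bytes)
            if not chunk:
                break
            (out_dir / f"shard-{suffix}").write_bytes(chunk)
            written.append(f"shard-{suffix}")
    return written
```

Reassembly is the inverse: fetch the shards in suffix order and concatenate their bytes.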

Model Details

  • Base model: mistralai/Voxtral-4B-TTS-2603
  • Quantization: Q4_0 (4-bit, 18 bytes per 32 elements)
  • File size: 2.67 GB (vs ~8 GB BF16 original)
  • Format: GGUF v3 (381 tensors)
  • Inference: Burn ML framework with custom WGSL compute shaders

What is Quantized

| Component | Quantization |
|-----------|--------------|
| Backbone (Ministral 3B, 26 layers): attention + FFN | Q4_0 |
| Flow-matching transformer (3 layers): attention + FFN + projections | Q4_0 |
| Token embeddings `[131072, 3072]` | Q4_0 |
| Semantic codebook output `[8320, 3072]` | Q4_0 |
| Codec decoder (8 transformer + 5 conv layers) | F32 |
| RMSNorm, LayerScale, QK-norm, small projections | F32 |
| Audio codebook embeddings `[9088, 3072]` | F32 |

Codec weights stored as F32 with pre-fused weight normalization.
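Pre-fused weight normalization means each weight-normalized codec layer's `(g, v)` parameter pair is collapsed at export time into the single effective weight `w = g * v / ||v||`, so inference never computes the norm. A NumPy sketch, assuming torch-style `weight_norm` semantics with its default `dim=0` (per output channel):

```python
import numpy as np

def fuse_weight_norm(g: np.ndarray, v: np.ndarray) -> np.ndarray:
    # w = g * v / ||v||, with the norm taken per output channel (dim 0),
    # matching torch.nn.utils.weight_norm's default parametrization.
    flat = v.reshape(v.shape[0], -1)
    norms = np.linalg.norm(flat, axis=1).reshape((-1,) + (1,) * (v.ndim - 1))
    g = np.asarray(g, dtype=v.dtype).reshape(norms.shape)
    return g / norms * v
```

After fusion, each output row of `w` has norm `g`, so the fused weight is numerically equivalent to running the norm at inference time.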

Benchmarks

NVIDIA DGX Spark (GB10, LPDDR5x), "The quick brown fox jumps over the lazy dog":

| Euler Steps | RTF | Quality (Whisper large-v3) |
|-------------|-----|----------------------------|
| 8 (default) | 1.61x | Perfect |
| 4 | 1.24x | Perfect |
| 3 | ~1.0x (real-time) | Perfect |

Optimizations: batched CFG, fused QKV+gate/up projections, pre-allocated KV cache.
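RTF here is read as audio duration divided by wall-clock generation time, so values above 1.0x mean synthesis outpaces playback. A trivial helper to make the convention explicit; the example numbers are made up, not measured:

```python
def rtf(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: seconds of audio produced per second of compute.
    rtf > 1.0 means generation is faster than real time."""
    return audio_seconds / wall_seconds

# Hypothetical: 3.2 s of speech synthesized in 2.0 s of compute.
print(round(rtf(3.2, 2.0), 2))  # 1.6
```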

Usage

Native CLI

```sh
# Download
uv run --with huggingface_hub \
  hf download TrevorJS/voxtral-tts-q4-gguf voxtral-tts-q4.gguf --local-dir models

# Synthesize (unified voxtral CLI)
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
  speak --text "Hello world" --voice casual_female --gguf models/voxtral-tts-q4.gguf

# Real-time with 3 Euler steps
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
  speak --text "Hello world" --gguf models/voxtral-tts-q4.gguf --euler-steps 3

# List voices
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- speak --list-voices
```

Browser (WASM + WebGPU)

Shards are pre-split for browser loading. The TTS demo loads them automatically.

For local dev:

```sh
wasm-pack build --target web --no-default-features --features wasm
bun serve.mjs  # serves shards from models/voxtral-tts-q4-shards/
```

Available Voices

20 presets across 9 languages:

| Voice | Language |
|-------|----------|
| casual_female, casual_male | English |
| neutral_female, neutral_male | English |
| cheerful_female | English |
| fr_female, fr_male | French |
| de_female, de_male | German |
| es_female, es_male | Spanish |
| it_female, it_male | Italian |
| pt_female, pt_male | Portuguese |
| nl_female, nl_male | Dutch |
| hi_female, hi_male | Hindi |
| ar_male | Arabic |

Quantization Script

```sh
uv run --with safetensors --with torch --with numpy --with packaging \
  scripts/quantize_tts_gguf.py models/voxtral-tts/ -o voxtral-tts-q4.gguf
```

Source: scripts/quantize_tts_gguf.py
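For orientation, a sketch of the per-block Q4_0 math such a script performs, mirroring ggml's reference quantizer in spirit: one scale per 32 elements and codes in `[0, 15]`, stored as a 2-byte f16 scale plus 16 bytes of packed nibbles (the nibble packing itself is omitted here):

```python
import numpy as np

def quantize_q4_0_block(x: np.ndarray):
    # One Q4_0 block: 32 floats -> (scale d, 32 codes in [0, 15]),
    # with x ~ (q - 8) * d. The signed element of largest magnitude
    # sets the scale and round-trips exactly.
    assert x.shape == (32,)
    m = x[np.argmax(np.abs(x))]            # signed max-magnitude element
    d = float(m) / -8.0 if m != 0 else 0.0
    inv = 1.0 / d if d != 0.0 else 0.0
    q = np.clip(np.floor(x * inv + 8.5), 0, 15).astype(np.uint8)
    return d, q

def dequantize_q4_0_block(d: float, q: np.ndarray) -> np.ndarray:
    # Inverse mapping; error per element is bounded by |d|.
    return (q.astype(np.float32) - 8.0) * np.float32(d)
```

This is where the 18-byte figure in Model Details comes from: 2 bytes of scale + 32 codes at 4 bits each.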
