Voxtral TTS Q4 GGUF

Q4_0 quantized weights for Voxtral 4B TTS in GGUF format. For use with voxtral-mini-realtime-rs.

Try the browser demo: it runs entirely client-side via WASM + WebGPU.

Files

| File | Size | Description |
|------|------|-------------|
| `voxtral-tts-q4.gguf` | 2.67 GB | Full Q4 model (single file, for native use) |
| `shard-{aa..af}` | 6 × ≤512 MB | Sharded for browser (WASM ArrayBuffer limit) |
| `voice_embedding/*.safetensors` | ~50-200 KB each | 20 voice presets across 9 languages |
| `tekken.json` | 14.9 MB | Tekken BPE tokenizer |
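The ≤512 MB shard size keeps each piece under per-`ArrayBuffer` allocation limits in browsers. The `shard-aa..af` names follow the default two-letter suffix scheme of `split(1)`; a minimal Python sketch of the same splitting (paths and the helper name are illustrative, not from this repo):

```python
# Sketch: split a model file into fixed-size shards with split(1)-style
# two-letter suffixes (aa, ab, ...), as used for the shard-aa..af files.
import string
from pathlib import Path

def suffixes():
    # aa, ab, ..., az, ba, ... (plenty for 6 shards)
    for a in string.ascii_lowercase:
        for b in string.ascii_lowercase:
            yield a + b

def split_file(src: Path, out_dir: Path, shard_bytes: int = 512 * 1024 * 1024):
    """Write src into shard-aa, shard-ab, ... of at most shard_bytes each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with src.open("rb") as f:
        for suffix in suffixes():
            chunk = f.read(shard_bytes)
            if not chunk:
                break
            (out_dir / f"shard-{suffix}").write_bytes(chunk)
            written.append(f"shard-{suffix}")
    return written
```

Reassembly is the inverse: fetch the shards in suffix order and concatenate their bytes.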

Model Details

  • Base model: mistralai/Voxtral-4B-TTS-2603
  • Quantization: Q4_0 (4-bit, 18 bytes per 32 elements)
  • File size: 2.67 GB (vs ~8 GB BF16 original)
  • Format: GGUF v3 (381 tensors)
  • Inference: Burn ML framework with custom WGSL compute shaders

What is Quantized

| Component | Quantization |
|-----------|--------------|
| Backbone (Ministral 3B, 26 layers): attention + FFN | Q4_0 |
| Flow-matching transformer (3 layers): attention + FFN + projections | Q4_0 |
| Token embeddings `[131072, 3072]` | Q4_0 |
| Semantic codebook output `[8320, 3072]` | Q4_0 |
| Codec decoder (8 transformer + 5 conv layers) | F32 |
| RMSNorm, LayerScale, QK-norm, small projections | F32 |
| Audio codebook embeddings `[9088, 3072]` | F32 |

Codec weights stored as F32 with pre-fused weight normalization.
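Pre-fused weight normalization means each weight-normalized codec layer's `(g, v)` parameter pair is collapsed at export time into the single effective weight `w = g * v / ||v||`, so inference never computes the norm. A NumPy sketch, assuming torch-style `weight_norm` semantics with its default `dim=0` (per output channel):

```python
import numpy as np

def fuse_weight_norm(g: np.ndarray, v: np.ndarray) -> np.ndarray:
    # w = g * v / ||v||, with the norm taken per output channel (dim 0),
    # matching torch.nn.utils.weight_norm's default parametrization.
    flat = v.reshape(v.shape[0], -1)
    norms = np.linalg.norm(flat, axis=1).reshape((-1,) + (1,) * (v.ndim - 1))
    g = np.asarray(g, dtype=v.dtype).reshape(norms.shape)
    return g / norms * v
```

After fusion, each output row of `w` has norm `g`, so the fused weight is numerically equivalent to running the norm at inference time.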

Benchmarks

NVIDIA DGX Spark (GB10, LPDDR5x), "The quick brown fox jumps over the lazy dog":

| Euler Steps | RTF | Quality (Whisper large-v3) |
|-------------|-----|----------------------------|
| 8 (default) | 1.61x | Perfect |
| 4 | 1.24x | Perfect |
| 3 | ~1.0x (real-time) | Perfect |

Optimizations: batched CFG, fused QKV+gate/up projections, pre-allocated KV cache.
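RTF here is read as audio duration divided by wall-clock generation time, so values above 1.0x mean synthesis outpaces playback. A trivial helper to make the convention explicit; the example numbers are made up, not measured:

```python
def rtf(audio_seconds: float, wall_seconds: float) -> float:
    """Real-time factor: seconds of audio produced per second of compute.
    rtf > 1.0 means generation is faster than real time."""
    return audio_seconds / wall_seconds

# Hypothetical: 3.2 s of speech synthesized in 2.0 s of compute.
print(round(rtf(3.2, 2.0), 2))  # 1.6
```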

Usage

Native CLI

```sh
# Download
uv run --with huggingface_hub \
  hf download TrevorJS/voxtral-tts-q4-gguf voxtral-tts-q4.gguf --local-dir models

# Synthesize (unified voxtral CLI)
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
  speak --text "Hello world" --voice casual_female --gguf models/voxtral-tts-q4.gguf

# Real-time with 3 Euler steps
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- \
  speak --text "Hello world" --gguf models/voxtral-tts-q4.gguf --euler-steps 3

# List voices
cargo run --release --features "wgpu,cli,hub" --bin voxtral -- speak --list-voices
```

Browser (WASM + WebGPU)

Shards are pre-split for browser loading. The TTS demo loads them automatically.

For local dev:

```sh
wasm-pack build --target web --no-default-features --features wasm
bun serve.mjs  # serves shards from models/voxtral-tts-q4-shards/
```

Available Voices

20 presets across 9 languages:

| Voice | Language |
|-------|----------|
| casual_female, casual_male | English |
| neutral_female, neutral_male | English |
| cheerful_female | English |
| fr_female, fr_male | French |
| de_female, de_male | German |
| es_female, es_male | Spanish |
| it_female, it_male | Italian |
| pt_female, pt_male | Portuguese |
| nl_female, nl_male | Dutch |
| hi_female, hi_male | Hindi |
| ar_male | Arabic |

Quantization Script

```sh
uv run --with safetensors --with torch --with numpy --with packaging \
  scripts/quantize_tts_gguf.py models/voxtral-tts/ -o voxtral-tts-q4.gguf
```

Source: scripts/quantize_tts_gguf.py
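For orientation, a sketch of the per-block Q4_0 math such a script performs, mirroring ggml's reference quantizer in spirit: one scale per 32 elements and codes in `[0, 15]`, stored as a 2-byte f16 scale plus 16 bytes of packed nibbles (the nibble packing itself is omitted here):

```python
import numpy as np

def quantize_q4_0_block(x: np.ndarray):
    # One Q4_0 block: 32 floats -> (scale d, 32 codes in [0, 15]),
    # with x ~ (q - 8) * d. The signed element of largest magnitude
    # sets the scale and round-trips exactly.
    assert x.shape == (32,)
    m = x[np.argmax(np.abs(x))]            # signed max-magnitude element
    d = float(m) / -8.0 if m != 0 else 0.0
    inv = 1.0 / d if d != 0.0 else 0.0
    q = np.clip(np.floor(x * inv + 8.5), 0, 15).astype(np.uint8)
    return d, q

def dequantize_q4_0_block(d: float, q: np.ndarray) -> np.ndarray:
    # Inverse mapping; error per element is bounded by |d|.
    return (q.astype(np.float32) - 8.0) * np.float32(d)
```

This is where the 18-byte figure in Model Details comes from: 2 bytes of scale + 32 codes at 4 bits each.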
