Voxtral-Mini-4B-Realtime - GGUF

GGUF quantizations of mistralai/Voxtral-Mini-4B-Realtime-2602, a 4.4B-parameter realtime streaming speech-to-text model with a causal audio encoder and configurable transcription delay.

Converted and tested with CrispASR, a multi-model ASR framework built on ggml.

Files

File                                 Quant   Size     Description
voxtral-mini-4b-realtime.gguf        F16     8.3 GB   Full precision (reference)
voxtral-mini-4b-realtime-q8_0.gguf   Q8_0    4.5 GB   8-bit quantized
voxtral-mini-4b-realtime-q4_k.gguf   Q4_K    2.4 GB   4-bit K-quant (recommended)

Performance (CPU, 4 threads, AVX2, jfk.wav 11s)

Quant   Encoder   Prefill   Decode (ms/tok)   Total   RTFx
F16     39s       30s       430               133s    0.08×
Q8_0    30s       9s        257               79s     0.14×
Q4_K    19s       3s        177               49s     0.22×

Q4_K is recommended: 3.5× smaller than F16, 2.7× faster, with identical transcription quality.
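
The RTFx column appears to be the audio duration divided by the total wall time. That definition is an assumption, though it reproduces the table's figures for the 11 s clip. A minimal sketch, where `rtfx` is a hypothetical helper name and not part of CrispASR:

```shell
# rtfx AUDIO_SECONDS WALL_SECONDS
# Prints the speed relative to realtime, rounded to two decimals.
# Assumes RTFx = audio duration / total processing time.
rtfx() {
    LC_ALL=C awk -v a="$1" -v w="$2" 'BEGIN { printf "%.2f\n", a / w }'
}

rtfx 11 133   # F16  -> 0.08
rtfx 11 79    # Q8_0 -> 0.14
rtfx 11 49    # Q4_K -> 0.22
```

Under this reading, values below 1 mean the run was slower than realtime on this CPU configuration.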

Usage

# Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc) --target voxtral4b-main

# Download Q4_K (recommended)
huggingface-cli download cstr/voxtral-mini-4b-realtime-GGUF \
    voxtral-mini-4b-realtime-q4_k.gguf --local-dir .

# Transcribe
./build/bin/voxtral4b-main -m voxtral-mini-4b-realtime-q4_k.gguf -f audio.wav
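
To transcribe several files, the same command can be wrapped in a loop. The sketch below uses only the flags listed under CLI options and assumes the transcript is written to stdout, which is not confirmed here. `transcribe_dir` and the `VOXTRAL_BIN` override are hypothetical names introduced for illustration.

```shell
# transcribe_dir DIR MODEL
# Writes DIR/<name>.txt for every DIR/<name>.wav.
# Assumption: voxtral4b-main prints the transcript on stdout.
transcribe_dir() {
    dir="$1"
    model="$2"
    # VOXTRAL_BIN is a hypothetical override for the binary path.
    bin="${VOXTRAL_BIN:-./build/bin/voxtral4b-main}"
    for wav in "$dir"/*.wav; do
        # With no matches the glob stays literal, so skip it.
        [ -e "$wav" ] || continue
        "$bin" -m "$model" -f "$wav" -np > "${wav%.wav}.txt"
    done
}
```

For example, `transcribe_dir recordings voxtral-mini-4b-realtime-q4_k.gguf` would process every WAV file in a `recordings` directory.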

With word-level timestamps

# Also download the CTC aligner
huggingface-cli download cstr/canary-ctc-aligner-GGUF \
    canary-ctc-aligner-q4_k.gguf --local-dir .

./build/bin/voxtral4b-main -m voxtral-mini-4b-realtime-q4_k.gguf \
    -f audio.wav -am canary-ctc-aligner-q4_k.gguf -timestamps

CLI options

-m  FNAME   GGUF model file (required)
-f  FNAME   Input audio, 16 kHz mono WAV (required)
-t  N       Threads (default: 4)
-l  LANG    Language hint (default: en)
-n  N       Max new tokens (default: 512)
-am FNAME   CTC aligner GGUF for word timestamps
-timestamps Enable word-level timestamps (requires -am)
-np         Suppress stderr info
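
Since `-f` expects 16 kHz mono WAV, other inputs need converting first. One way to do this, assuming ffmpeg is installed, is sketched below. `to_wav16k` is a hypothetical helper name, and the 16-bit PCM sample format is an assumption, as the required bit depth is not stated here.

```shell
# to_wav16k INPUT [OUTPUT]
# Resamples INPUT to 16 kHz, downmixes to mono, and writes 16-bit PCM WAV.
# OUTPUT defaults to INPUT with its extension replaced by "-16k.wav".
to_wav16k() {
    src="$1"
    dst="${2:-${1%.*}-16k.wav}"
    ffmpeg -y -loglevel error -i "$src" -ar 16000 -ac 1 -c:a pcm_s16le "$dst"
}
```

Running `to_wav16k talk.mp3` would write `talk-16k.wav`, which can then be passed to `-f`.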

Architecture

  • Audio encoder: 32-layer causal transformer (RoPE, SwiGLU, RMSNorm, sliding window 750)
  • LLM decoder: 26-layer Mistral (GQA 32/8, SwiGLU, adaptive RMSNorm, sliding window 8192)
  • Projector: 4-frame stack → Linear(5120→3072) → GELU → Linear(3072→3072)
  • Tokenizer: Mistral Tekken (150K vocab, 1000 special tokens)
  • Audio injection: adapter output is added to the token embeddings (streaming format)

Key features

  • Natively streaming architecture with causal encoder
  • 13 languages: en, fr, es, de, ru, zh, ja, it, pt, nl, ar, hi, ko
  • Configurable delay: 480ms default (6 tokens × 80ms)
  • Apache 2.0 license
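
The default delay follows from the token arithmetic above. As a small sketch, assuming the delay scales linearly at 80 ms per token (`delay_ms` is a hypothetical helper, not a CrispASR command):

```shell
# delay_ms N_TOKENS
# Prints the transcription delay in milliseconds,
# assuming each delay token covers 80 ms of audio.
delay_ms() {
    echo $(( $1 * 80 ))
}

delay_ms 6   # default setting -> 480
```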

Conversion

python models/convert-voxtral4b-to-gguf.py \
    --input /path/to/Voxtral-Mini-4B-Realtime-2602 \
    --output voxtral-mini-4b-realtime.gguf

# Then quantize
./build/bin/cohere-quantize voxtral-mini-4b-realtime.gguf \
    voxtral-mini-4b-realtime-q4_k.gguf q4_k

Credits
