Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). The change is in name only; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.
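The two ingredients named above can be illustrated with a minimal NumPy sketch: rotate the weight matrix with an orthonormal Walsh-Hadamard transform, quantize the rotated entries with a 1-D Lloyd-Max codebook (32 levels for 5-bit), then invert the rotation. This is an illustrative toy, not the repository's actual implementation; all function names here are made up for the example.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal: H.T @ H == I

def lloyd_max(x, n_levels=32, iters=50):
    # 1-D Lloyd-Max: alternate nearest-codeword assignment and centroid update
    codebook = np.quantile(x, np.linspace(0.0, 1.0, n_levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
        for k in range(n_levels):
            if np.any(idx == k):
                codebook[k] = x[idx == k].mean()
    return codebook, idx

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))              # toy weight matrix
H = hadamard(64)
W_rot = H @ W                                  # deterministic rotation
codebook, idx = lloyd_max(W_rot.ravel())       # 5-bit scalar codebook
W_hat = H.T @ codebook[idx].reshape(W.shape)   # dequantize, invert rotation
cos = (W * W_hat).sum() / (np.linalg.norm(W) * np.linalg.norm(W_hat))
```

Even in this toy setting the reconstruction cosine similarity lands well above 0.99, which is the regime the cos_sim 0.9986 figure below refers to.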

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

Voxtral-Mini-4B Realtime β€” HLWQ Q5

First HLWQ quantization of a speech recognition model — Mistral's real-time streaming ASR.

8.86 GB β†’ 3.4 GB (-62%) | cos_sim 0.9986 | 408 layers quantized

ASR Validated

"Yesterday it was 35 degrees in Barcelona, but today the temperature will go down to minus 20 degrees."

11s audio transcribed in 3.3s β€” RTF 0.30 (3x faster than real-time).
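The real-time factor (RTF) is simply wall-clock processing time divided by audio duration; values below 1.0 mean faster than real time. The headline numbers check out:

```python
audio_s, wall_s = 11.0, 3.3   # clip length and transcription time from above
rtf = wall_s / audio_s        # 0.30: processing takes 30% of the audio's duration
speedup = audio_s / wall_s    # ~3.3x, quoted above as "3x faster than real-time"
print(f"RTF {rtf:.2f}, {speedup:.1f}x faster than real time")
```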

Quick Start

pip install polarquant "mistral-common[audio]"

import polarengine_vllm  # side-effect import: enables loading HLWQ-packed weights
from transformers import VoxtralRealtimeForConditionalGeneration, AutoProcessor
from mistral_common.tokens.tokenizers.audio import Audio

# Load the quantized checkpoint and its processor
model = VoxtralRealtimeForConditionalGeneration.from_pretrained(
    "caiovicentino1/Voxtral-Mini-4B-Realtime-HLWQ-Q5",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "caiovicentino1/Voxtral-Mini-4B-Realtime-HLWQ-Q5"
)

# Load the audio and resample to the rate the feature extractor expects
audio = Audio.from_file("audio.mp3", strict=False)
audio.resample(processor.feature_extractor.sampling_rate)

# Transcribe
inputs = processor(audio.audio_array, return_tensors="pt")
inputs = inputs.to(model.device, dtype=model.dtype)
outputs = model.generate(**inputs)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])

Compression Details

| Component | Original | HLWQ Q5 Packed | Reduction |
|---|---|---|---|
| Text Decoder (26 layers, head_dim=128) | ~6.8 GB | ~2.0 GB | -71% |
| Audio Encoder (32 layers, head_dim=64) | ~2.0 GB | ~0.6 GB | -70% |
| Skipped layers (norms, conv, mel) | — | 0.8 GB | kept |
| Total | 8.86 GB | 3.4 GB | -62% |
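The totals above are internally consistent, which is easy to verify from the per-component figures:

```python
# Per-component packed sizes in GB, taken from the table above
components = {"text_decoder": 2.0, "audio_encoder": 0.6, "skipped_layers": 0.8}
packed = sum(components.values())            # 3.4 GB
original = 8.86
reduction = (original - packed) / original   # ~0.62, i.e. the -62% headline
print(f"{packed:.1f} GB packed, -{reduction:.0%}")
```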

Why this matters

  • 8 GB GPU: Original barely fits → HLWQ Q5 runs comfortably
  • Edge/mobile: 3.4 GB download enables on-device ASR
  • Streaming: <500ms latency, 13 languages
  • RTF 0.30: 3x faster than real-time on GPU
  • Apache 2.0: Commercial use allowed

Architecture

  • Audio Encoder: 32 layers, hidden=1280, 32 heads, head_dim=64
  • Text Decoder: 26 layers, hidden=3072, 32 heads (8 KV), head_dim=128
  • Causal streaming with sliding window attention
  • Configurable latency (80ms to 2.4s)
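Causal sliding-window attention, which enables the streaming behavior listed above, restricts each token to a bounded lookback. A minimal NumPy sketch of such a mask (illustrative only; not the model's actual attention code, and the window length here is arbitrary):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # Token i attends to tokens in [i - window + 1, i]: causal plus bounded lookback.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=5, window=2)
```

Shrinking the window lowers latency at the cost of context, which is the trade-off behind the configurable 80 ms to 2.4 s latency range.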

8.86 GB → 3.4 GB. First HLWQ-quantized ASR model, with validated transcription.
