parakeet-tdt-0.6b-v3-mlx-encoder-int8

Encoder-only INT8 variant of nvidia/parakeet-tdt-0.6b-v3 converted for MLX / Apple Silicon inference via mlx-audio-swift.

Quantization scheme

| Component | dtype | Notes |
|---|---|---|
| Conformer encoder (`encoder.*`) Linear weights | INT8, group_size=64 | 217 layers |
| Decoder (`decoder.*`) weights | FP16 | unchanged |
| Joint network (`joint.*`) weights | FP16 | unchanged |
| Norms, biases, embeddings, convs | FP16 | never quantized |

The per-group scales and biases are stored as sidecar tensors at `<path>.scales` and `<path>.biases`. These are sibling keys of the weight, not suffixes appended to `.weight`, and both are cast to FP16. This is the layout the mlx-audio-swift loader expects.
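
To make the role of the scales and biases concrete, here is a minimal pure-Python sketch of group-wise affine quantization with `group_size=64`. It assumes the common min/max affine scheme, where each weight is reconstructed as `scale * code + bias`. The function names are hypothetical, this is not the conversion script used for this model, and the FP16 cast of the sidecars is not modeled.

```python
def quantize_group(values, bits=8):
    """Affine-quantize one group so that value ~= scale * code + bias."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1            # 255 distinct steps for 8 bits
    # Assumption: a constant group gets scale 1.0 to avoid dividing by zero.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo             # the group minimum acts as the bias


def quantize_row(row, group_size=64, bits=8):
    """Quantize one weight row; emits one scale and one bias per group."""
    assert len(row) % group_size == 0, "row length must be a multiple of group_size"
    codes, scales, biases = [], [], []
    for start in range(0, len(row), group_size):
        q, s, b = quantize_group(row[start:start + group_size], bits)
        codes.extend(q)
        scales.append(s)
        biases.append(b)
    return codes, scales, biases


def dequantize_row(codes, scales, biases, group_size=64):
    """Reconstruct approximate weights from the codes and per-group sidecars."""
    return [scales[i // group_size] * q + biases[i // group_size]
            for i, q in enumerate(codes)]
```

Under this scheme the reconstruction error of each weight is at most half of its group's scale, which is why small groups (64 weights here) keep the error tight even when the dynamic range varies across a row.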

Rationale

The Conformer encoder dominates matmul bandwidth, and its matrices are large enough that INT8 weight quantization amortizes the dequantization overhead while reducing memory pressure. The decoder's many small matmuls do not amortize that cost, so keeping the decoder in FP16 avoids the regressions seen with whole-model INT4.
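
As a rough illustration of the memory side of this trade-off, the sketch below computes the effective storage cost per weight for INT8 with `group_size=64`, assuming one FP16 scale and one FP16 bias per group as described above. The matrix shape in the usage note is hypothetical and not taken from the model.

```python
def bits_per_weight(bits=8, group_size=64, sidecar_bits=16):
    """Effective bits per weight, counting one scale and one bias per group."""
    return bits + 2 * sidecar_bits / group_size


def matrix_bytes(rows, cols, bits=8, group_size=64):
    """Approximate storage in bytes for one quantized weight matrix."""
    return rows * cols * bits_per_weight(bits, group_size) / 8
```

With these defaults, `bits_per_weight()` gives 8.5 bits, about 53% of FP16's 16 bits. For a hypothetical 1024 x 4096 Linear weight, `matrix_bytes(1024, 4096)` is 4,456,448 bytes, compared with 8,388,608 bytes in FP16.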

Usage

Load the model with the MLXAudioSTT fork's Parakeet loader. The `quantization` section in `config.json` is detected automatically, and only layers that have a `.scales` sibling in the safetensors weights are quantized at load time.
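
The selection rule can be sketched in a few lines of Python. This is an illustration of the rule as described here, not the loader's actual code (the loader itself is written in Swift). The function name and the tensor keys in the usage note are hypothetical.

```python
def quantized_layer_paths(tensor_keys):
    """Return module paths that have both a `.weight` and a `.scales` key.

    Sidecars are sibling keys: `<path>.scales` sits next to `<path>.weight`,
    so the module path is recovered by stripping the `.scales` suffix.
    """
    keys = set(tensor_keys)
    suffix = ".scales"
    paths = []
    for key in sorted(keys):
        if key.endswith(suffix):
            path = key[:-len(suffix)]
            if path + ".weight" in keys:
                paths.append(path)
    return paths
```

For example, given the hypothetical keys `encoder.layers.0.linear.weight`, `encoder.layers.0.linear.scales`, `encoder.layers.0.linear.biases`, and `decoder.embed.weight`, the function returns only `encoder.layers.0.linear`. The decoder entry has no `.scales` sibling, so it would stay in FP16.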

Related variants

beshkenadze/parakeet-tdt-0.6b-v3-mlx-fp16 — baseline FP16
beshkenadze/parakeet-tdt-0.6b-v3-mlx-4bit — whole-model INT4
