parakeet-tdt-0.6b-v3-mlx-encoder-int8

Encoder-only INT8 variant of nvidia/parakeet-tdt-0.6b-v3 converted for MLX / Apple Silicon inference via mlx-audio-swift.

Quantization scheme

| Component | dtype | notes |
|---|---|---|
| Conformer encoder (encoder.*) Linear weights | INT8, group_size=64 | 217 layers |
| Decoder (decoder.*) weights | FP16 | unchanged |
| Joint network (joint.*) weights | FP16 | unchanged |
| Norms, biases, embeddings, convs | FP16 | never quantized |

Scales/biases sidecars are emitted at <path>.scales / <path>.biases (sibling keys, not suffixes of .weight) and cast to FP16, which matches the mlx-audio-swift loader's expectation.
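As a rough illustration of what group-wise INT8 quantization with group_size=64 means, the sketch below quantizes one weight row in pure Python. It assumes an affine mapping (value ≈ scale * code + bias) with one scale and one bias per group of 64 values. The function names are hypothetical, and the real converter and MLX kernels pack codes and choose scales in their own way, so treat this as a conceptual model only.

```python
import math


def quantize_group(values, bits=8):
    """Affine-quantize one group so that value ~= scale * code + bias."""
    levels = (1 << bits) - 1              # 255 distinct steps for 8 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0     # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo               # lo plays the role of the bias


def quantize_row(row, group_size=64, bits=8):
    """Quantize a row in consecutive groups; returns codes, scales, biases."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    codes, scales, biases = [], [], []
    for start in range(0, len(row), group_size):
        q, s, b = quantize_group(row[start:start + group_size], bits)
        codes.extend(q)
        scales.append(s)
        biases.append(b)
    return codes, scales, biases


def dequantize_row(codes, scales, biases, group_size=64):
    """Reconstruct approximate float values from codes and per-group sidecars."""
    return [scales[i // group_size] * q + biases[i // group_size]
            for i, q in enumerate(codes)]


# Hypothetical 128-value row: two groups of 64.
row = [math.sin(i) for i in range(128)]
codes, scales, biases = quantize_row(row)
restored = dequantize_row(codes, scales, biases)
```

The per-group scales and biases here correspond conceptually to the <path>.scales and <path>.biases sidecars; the reconstruction error for each value is bounded by half of its group's scale.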

Rationale

The Conformer encoder dominates matmul bandwidth, and its matrices are large enough that INT8 weight quantization amortizes dequantization overhead and reduces memory pressure. The decoder's many small matmuls do not amortize the dequantization cost, so keeping them in FP16 avoids the regressions seen with whole-model INT4.
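To make the memory-pressure argument concrete, the following back-of-the-envelope sketch estimates storage per weight. It assumes one FP16 scale and one FP16 bias per group of 64 weights, as described in the quantization scheme; the 1024×1024 layer shape is purely hypothetical and not taken from the model.

```python
def effective_bits_per_weight(bits=8, group_size=64, scale_bits=16, bias_bits=16):
    """Bits stored per weight, including the per-group scale and bias overhead."""
    return bits + (scale_bits + bias_bits) / group_size


def linear_weight_bytes(out_features, in_features, bits_per_weight):
    """Approximate storage for one Linear weight matrix."""
    return out_features * in_features * bits_per_weight / 8


int8_bits = effective_bits_per_weight()      # 8 + 32/64 = 8.5 bits per weight
ratio_vs_fp16 = int8_bits / 16               # 0.53125

# Hypothetical 1024x1024 Linear layer, for illustration only.
int8_bytes = linear_weight_bytes(1024, 1024, int8_bits)
fp16_bytes = linear_weight_bytes(1024, 1024, 16)
```

Under these assumptions each quantized encoder matrix takes a little over half the space of its FP16 counterpart, with the sidecars adding 0.5 bits per weight.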

Usage

Load via the MLXAudioSTT fork's Parakeet loader. The quantization section in config.json is auto-detected, and only layers that have a .scales sibling in the safetensors file are quantized at load time.
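The selection rule can be sketched as a small key scan. This is not the loader's actual code (which is Swift); it is a hypothetical Python helper that assumes the tensor names are available as a list of strings and that a layer counts as quantized when both <path>.weight and <path>.scales are present. The example key names are made up for illustration.

```python
def quantized_layer_paths(tensor_keys):
    """Return layer paths that have a .scales sibling next to their .weight."""
    keys = set(tensor_keys)
    suffix = ".scales"
    paths = []
    for key in sorted(keys):
        if key.endswith(suffix):
            path = key[: -len(suffix)]
            # Sidecars are sibling keys: <path>.scales sits beside <path>.weight.
            if f"{path}.weight" in keys:
                paths.append(path)
    return paths


# Hypothetical tensor names; the real checkpoint's keys may differ.
example_keys = [
    "encoder.layers.0.linear_q.weight",
    "encoder.layers.0.linear_q.scales",
    "encoder.layers.0.linear_q.biases",
    "encoder.layers.0.norm.weight",
    "decoder.prediction.embed.weight",
    "joint.joint_net.weight",
]
selected = quantized_layer_paths(example_keys)
```

With these example keys only the encoder Linear layer is selected, while the norm, decoder, and joint weights stay in FP16 because they have no .scales sibling.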

Related variants

  • beshkenadze/parakeet-tdt-0.6b-v3-mlx-fp16: baseline FP16
  • beshkenadze/parakeet-tdt-0.6b-v3-mlx-4bit: whole-model INT4