parakeet-tdt-0.6b-v3-mlx-encoder-int8
Encoder-only INT8 variant of nvidia/parakeet-tdt-0.6b-v3 converted for
MLX / Apple Silicon inference via mlx-audio-swift.
Quantization scheme
| Component | dtype | notes |
|---|---|---|
| Conformer encoder (encoder.*) Linear weights | INT8, group_size=64 | 217 layers |
| Decoder (decoder.*) weights | FP16 | unchanged |
| Joint network (joint.*) weights | FP16 | unchanged |
| Norms, biases, embeddings, convs | FP16 | never quantized |
Scales/biases sidecars are emitted at <path>.scales / <path>.biases
(sibling keys, not suffixes of .weight) and cast to FP16; this matches
the mlx-audio-swift loader's expectation.
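To make the scheme concrete, here is a minimal pure-Python sketch of group-wise affine quantization with group_size=64 and 8 bits, plus the sibling-key layout described above. It is an illustration under stated assumptions, not the conversion script used for this checkpoint: the helper names (`quantize_row`, `dequantize_row`, `emit_sidecars`) and the example key `encoder.example.linear` are hypothetical, the min/max affine formula is one common way to do group quantization, and the sketch keeps codes as plain integers and scales/biases as Python floats rather than packing them or casting to FP16.

```python
GROUP_SIZE = 64  # group size taken from the table above
BITS = 8         # INT8


def quantize_row(row, group_size=GROUP_SIZE, bits=BITS):
    """Affine-quantize one weight row group by group.

    Returns (codes, scales, biases) with one scale and one bias per group.
    """
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    levels = (1 << bits) - 1  # 255 distinct steps for 8 bits
    codes, scales, biases = [], [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        lo, hi = min(group), max(group)
        # A constant group has hi == lo; fall back to 1.0 to avoid dividing by zero.
        scale = (hi - lo) / levels or 1.0
        for w in group:
            code = round((w - lo) / scale)
            codes.append(min(levels, max(0, code)))  # clamp against float error
        scales.append(scale)
        biases.append(lo)
    return codes, scales, biases


def dequantize_row(codes, scales, biases, group_size=GROUP_SIZE):
    """Reconstruct approximate weights: w ~= code * scale + bias."""
    return [
        codes[i] * scales[i // group_size] + biases[i // group_size]
        for i in range(len(codes))
    ]


def emit_sidecars(path, weight_rows):
    """Lay tensors out as sibling keys: <path>.weight, <path>.scales, <path>.biases."""
    quantized = [quantize_row(row) for row in weight_rows]
    return {
        f"{path}.weight": [q[0] for q in quantized],
        f"{path}.scales": [q[1] for q in quantized],
        f"{path}.biases": [q[2] for q in quantized],
    }


# Hypothetical layer path, used only to show the key layout.
tensors = emit_sidecars("encoder.example.linear", [[i / 10 for i in range(128)]])
```

Note that the sidecar keys share the layer path with `.weight` rather than extending it (there is no `<path>.weight.scales`), which is the distinction the paragraph above draws.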
Rationale
The Conformer encoder dominates matmul bandwidth. Its matrices are large enough that INT8 weight quantization amortizes the dequantization overhead and reduces memory pressure. The decoder's many small matmuls do not amortize that cost, so keeping them in FP16 avoids the regressions seen with whole-model INT4.
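As a rough back-of-the-envelope check on the memory argument, the sketch below estimates the effective storage per quantized weight. It assumes one FP16 scale and one FP16 bias per group of 64 weights, as the sidecar description suggests; the function name is hypothetical and the figure ignores any packing or alignment overhead in the actual file.

```python
def effective_bits_per_weight(bits, group_size, sidecar_bits=16):
    """Bits stored per weight, counting one scale and one bias per group."""
    return bits + 2 * sidecar_bits / group_size


int8_bits = effective_bits_per_weight(bits=8, group_size=64)  # 8 + 32/64 = 8.5
ratio_vs_fp16 = int8_bits / 16                                # 0.53125

print(f"INT8, group_size=64: {int8_bits} bits per weight")
print(f"Relative to FP16: {ratio_vs_fp16:.1%}")
```

Under these assumptions the quantized encoder Linear weights take a little over half the space of their FP16 counterparts, while the decoder and joint network are unaffected.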
Usage
Load via the MLXAudioSTT fork's Parakeet loader. The quantization
section in config.json is auto-detected; only layers that have a
.scales sibling in the safetensors will be quantized at load time.
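The selection rule can be sketched in a few lines. This is illustrative Python, not the Swift loader's code: the function name and the example tensor keys are hypothetical, and the only behavior taken from this card is that a layer is treated as quantized when a `.scales` key exists next to its `.weight` key.

```python
def layers_to_quantize(tensor_keys):
    """Return layer paths whose .weight key has a sibling .scales key."""
    keys = set(tensor_keys)
    layers = []
    for key in sorted(keys):
        if key.endswith(".weight"):
            path = key[: -len(".weight")]
            if f"{path}.scales" in keys:
                layers.append(path)
    return layers


# Hypothetical key names, for illustration only.
example_keys = [
    "encoder.example.linear.weight",
    "encoder.example.linear.scales",
    "encoder.example.linear.biases",
    "encoder.example.norm.weight",    # no sidecars, stays FP16
    "decoder.example.linear.weight",  # no sidecars, stays FP16
]

print(layers_to_quantize(example_keys))  # ['encoder.example.linear']
```

Keying the decision on the presence of sidecars, rather than on a list of layer names, lets the same loading path handle mixed-precision checkpoints like this one.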
Related variants
- beshkenadze/parakeet-tdt-0.6b-v3-mlx-fp16: baseline FP16
- beshkenadze/parakeet-tdt-0.6b-v3-mlx-4bit: whole-model INT4