Parakeet-TDT-ExecuTorch-CUDA-Windows-Quantized

Pre-exported ExecuTorch .pte file for Parakeet TDT 0.6B with the CUDA-Windows backend (NVIDIA GPU): bf16 precision, 4-bit weight-only quantization of the linear layers, and 8-bit weight-only quantization of the decoder embedding.

3.2× smaller and 2.5× faster prefill than the fp32 variant, with identical word-level transcription accuracy.

For the non-quantized variant, see Parakeet-TDT-ExecuTorch-CUDA-Windows.

Quantization Details

| Component | Quantization | Precision |
|---|---|---|
| Encoder linear layers | 4-bit weight-only (4w, group_size=32) | int4 weights, bf16 activations |
| Decoder linear layers | 4-bit weight-only (4w, group_size=32) | int4 weights, bf16 activations |
| Decoder embedding | 8-bit weight-only (8w) | int8 |
| Conv / LSTM / Norm | None | bf16 |
| Preprocessor | None | fp32 (always CPU) |
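To make the 4w scheme concrete, here is a minimal sketch of group-wise symmetric 4-bit weight-only quantization with group_size=32. This is an illustration of the idea, not the ExecuTorch/torchao implementation; the function names are hypothetical.

```python
# Hypothetical sketch of group-wise symmetric int4 weight-only
# quantization (group_size=32), as described in the table above.
# Not the actual torchao/ExecuTorch code.

def quantize_4w(weights, group_size=32):
    """Quantize a flat list of float weights to int4 (-8..7) per group.

    Returns (int values, per-group scales)."""
    qvals, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Symmetric scale: map the largest magnitude in the group onto
        # the int4 range; 7 is the largest positive int4 value.
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        qvals.extend(max(-8, min(7, round(w / scale))) for w in group)
    return qvals, scales

def dequantize_4w(qvals, scales, group_size=32):
    """Recover approximate float weights (done again at compute time)."""
    return [q * scales[i // group_size] for i, q in enumerate(qvals)]
```

At inference the int4 weights are dequantized back to bf16 before the matmul, which is why the activations in the table stay bf16.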

Installation

git clone https://github.com/pytorch/executorch/ ~/executorch
cd ~/executorch && pip install .

Build on Windows (PowerShell):

cmake --workflow --preset llm-release-cuda
Push-Location examples/models/parakeet
cmake --workflow --preset parakeet-cuda
Pop-Location

Download

pip install huggingface_hub
huggingface-cli download younghan-meta/Parakeet-TDT-ExecuTorch-CUDA-Windows-Quantized --local-dir parakeet_cuda_windows_quantized

Run

Windows (PowerShell):

.\cmake-out\examples\models\parakeet\Release\parakeet_runner.exe `
    --model_path parakeet_cuda_windows_quantized\model.pte `
    --data_path parakeet_cuda_windows_quantized\aoti_cuda_blob.ptd `
    --audio_path C:\path\to\audio.wav `
    --tokenizer_path parakeet_cuda_windows_quantized\tokenizer.model

Optional flags:

  • --timestamps: timestamp granularity, one of none|token|word|segment|all (default: segment)
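If you drive the runner from a script, a small wrapper keeps the flag list in one place. This is a hypothetical convenience function; the executable path and flag names come from the Run section above, everything else (function name, defaults) is illustrative.

```python
# Hypothetical wrapper that assembles the runner command line shown above.
import subprocess

RUNNER = r".\cmake-out\examples\models\parakeet\Release\parakeet_runner.exe"

def build_runner_cmd(model_dir, audio_path, timestamps="segment", runner=RUNNER):
    """Return the argument list for the parakeet runner."""
    return [
        runner,
        "--model_path", rf"{model_dir}\model.pte",
        "--data_path", rf"{model_dir}\aoti_cuda_blob.ptd",
        "--tokenizer_path", rf"{model_dir}\tokenizer.model",
        "--audio_path", audio_path,
        "--timestamps", timestamps,  # none|token|word|segment|all
    ]

# Example (uncomment to actually run the transcription):
# subprocess.run(build_runner_cmd("parakeet_cuda_windows_quantized",
#                                 r"C:\path\to\audio.wav"), check=True)
```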

Export Command

pip install "nemo_toolkit[asr]"

python examples/models/parakeet/export_parakeet_tdt.py \
    --backend cuda-windows \
    --dtype bf16 \
    --qlinear_encoder 4w \
    --qlinear_encoder_group_size 32 \
    --qlinear 4w \
    --qlinear_group_size 32 \
    --qembedding 8w \
    --output-dir ./parakeet_cuda_windows_quantized
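A back-of-envelope check on why the export above shrinks the model by roughly 3.2× rather than the raw 32/4 = 8× of the linear layers alone: each group of 32 int4 weights also stores a scale (assumed bf16 here), and the conv/LSTM/norm layers and embedding are not 4-bit. The numbers below are assumptions for illustration, not measurements.

```python
# Back-of-envelope storage cost of 4-bit group-wise quantization with
# a bf16 (16-bit) scale per group of 32, versus fp32 storage.
# Assumed figures for illustration only.

def bits_per_weight(weight_bits, group_size, scale_bits=16):
    """Weight bits plus the amortized per-group scale."""
    return weight_bits + scale_bits / group_size

linear_bits = bits_per_weight(4, 32)  # 4.5 bits per weight
fp32_bits = 32.0
print(fp32_bits / linear_bits)        # ~7.1x for the 4-bit layers alone
```

The whole-model ratio lands lower than ~7.1× because the bf16 conv/LSTM/norm layers and the int8 embedding compress less, which is consistent with the observed 3.2× overall.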

Cross-compilation requires x86_64-w64-mingw32-g++ on PATH and WINDOWS_CUDA_HOME pointing to the extracted Windows CUDA package. See the Parakeet README for detailed setup steps.

Note: the tile_packed_to_4d packing format is not supported for cuda-windows, because the cross-compilation path keeps tensors on CPU during export and aten::_convert_weight_to_int4pack only has a CUDA kernel. Quantization therefore uses the generic IntxWeightOnlyConfig path instead.
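For intuition about what an int4 pack step does, here is the simplest possible nibble packing: two signed int4 values per byte. This is only an illustration of the basic idea; the actual tile_packed_to_4d layout is a tiled, kernel-specific arrangement and differs from this.

```python
# Illustrative int4 packing: two signed nibbles per byte.
# NOT the tile_packed_to_4d layout, just the underlying idea.

def pack_int4(values):
    """Pack signed int4 values (-8..7) pairwise into bytes."""
    assert len(values) % 2 == 0
    out = bytearray()
    for lo, hi in zip(values[::2], values[1::2]):
        out.append(((hi & 0xF) << 4) | (lo & 0xF))
    return bytes(out)

def unpack_int4(packed):
    """Inverse of pack_int4, sign-extending each nibble."""
    vals = []
    for b in packed:
        for nib in (b & 0xF, b >> 4):
            vals.append(nib - 16 if nib >= 8 else nib)
    return vals
```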

Benchmark (RTX 5080, ~20s audio)

| Metric | Quantized | vs Non-Quantized |
|---|---|---|
| Prefill throughput | 3,218 tok/s | 2.5× faster |
| Decode throughput | 1,545 tok/s | 1.2× faster |
| Model load time | 1.6 s | 2.5× faster |
| Time to first token | 86 ms | 2.4× faster |
| Total inference | 144 ms | 1.9× faster |
| Real-time factor | 139× real-time | n/a |
| Model size | 763 MB | 3.2× smaller |
| Transcription accuracy | Identical words | No loss |
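The real-time factor follows directly from the other rows: audio duration divided by total inference time, ~20 s of audio over 144 ms.

```python
# Sanity check on the benchmark arithmetic: real-time factor is audio
# duration / total inference time (~20 s clip, 144 ms inference).

def real_time_factor(audio_seconds, inference_seconds):
    return audio_seconds / inference_seconds

print(round(real_time_factor(20.0, 0.144)))  # ~139x real-time
```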
