# Parakeet-TDT-ExecuTorch-CUDA-Windows-Quantized
Pre-exported ExecuTorch `.pte` file for Parakeet TDT 0.6B with the
CUDA-Windows backend (NVIDIA GPU), bf16 precision, 4-bit weight-only linear
quantization, and 8-bit weight-only embedding quantization.
3.2× smaller and 2.5× faster prefill than the fp32 variant, with identical word-level transcription accuracy.
For the non-quantized variant, see Parakeet-TDT-ExecuTorch-CUDA-Windows.
## Quantization Details
| Component | Quantization | Precision |
|---|---|---|
| Encoder linear layers | 4-bit weight-only (4w, group_size=32) | int4 weights, bf16 activations |
| Decoder linear layers | 4-bit weight-only (4w, group_size=32) | int4 weights, bf16 activations |
| Decoder embedding | 8-bit weight-only (8w) | int8 |
| Conv / LSTM / Norm | None | bf16 |
| Preprocessor | None | fp32 (always CPU) |
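To see where the size savings in the table come from, here is a back-of-envelope sketch of the storage cost of 4-bit group-wise weight-only quantization versus bf16. The layer shape is illustrative, not an actual Parakeet dimension; per-layer ratios land above the reported 3.2× overall figure because embeddings stay at 8 bits and conv/LSTM/norm layers stay in bf16.

```python
# Back-of-envelope storage math for 4-bit weight-only quantization with
# group_size=32, as used for the linear layers in the table above.
def linear_bytes_bf16(out_features, in_features):
    return out_features * in_features * 2  # 2 bytes per bf16 weight

def linear_bytes_int4_grouped(out_features, in_features, group_size=32):
    packed = out_features * in_features // 2           # two int4 weights per byte
    n_groups = out_features * (in_features // group_size)
    scales = n_groups * 2                              # one bf16 scale per group
    return packed + scales

out_f, in_f = 1024, 1024  # illustrative shape only
before = linear_bytes_bf16(out_f, in_f)
after = linear_bytes_int4_grouped(out_f, in_f)
print(f"bf16: {before} B, int4(g=32): {after} B, ratio: {before / after:.2f}x")
```

For this shape the per-layer ratio is about 3.56×, since each group of 32 int4 weights also carries a bf16 scale.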
## Installation
```shell
git clone https://github.com/pytorch/executorch/ ~/executorch
cd ~/executorch && pip install .
```
Build on Windows (PowerShell):

```powershell
cmake --workflow --preset llm-release-cuda
Push-Location examples/models/parakeet
cmake --workflow --preset parakeet-cuda
Pop-Location
```
## Download
```shell
pip install huggingface_hub
huggingface-cli download younghan-meta/Parakeet-TDT-ExecuTorch-CUDA-Windows-Quantized --local-dir parakeet_cuda_windows_quantized
```
## Run
Windows (PowerShell):

```powershell
.\cmake-out\examples\models\parakeet\Release\parakeet_runner.exe `
    --model_path parakeet_cuda_windows_quantized\model.pte `
    --data_path parakeet_cuda_windows_quantized\aoti_cuda_blob.ptd `
    --audio_path C:\path\to\audio.wav `
    --tokenizer_path parakeet_cuda_windows_quantized\tokenizer.model
```
Optional flags:

- `--timestamps segment` — timestamp granularity: `none|token|word|segment|all` (default: `segment`)
## Export Command
```shell
pip install "nemo_toolkit[asr]"
python examples/models/parakeet/export_parakeet_tdt.py \
    --backend cuda-windows \
    --dtype bf16 \
    --qlinear_encoder 4w \
    --qlinear_encoder_group_size 32 \
    --qlinear 4w \
    --qlinear_group_size 32 \
    --qembedding 8w \
    --output-dir ./parakeet_cuda_windows_quantized
```
Cross-compilation requires `x86_64-w64-mingw32-g++` on `PATH` and `WINDOWS_CUDA_HOME`
pointing to the extracted Windows CUDA package. See the
Parakeet README for detailed setup steps.
Note: the `tile_packed_to_4d` packing format is not supported for cuda-windows
(the cross-compilation path keeps tensors on CPU during export, and
`aten::_convert_weight_to_int4pack` only has a CUDA kernel). The quantization
uses the generic `IntxWeightOnlyConfig` path instead.
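For readers unfamiliar with the generic intx path, the following is a minimal sketch of group-wise symmetric 4-bit weight-only quantization — the same scheme `IntxWeightOnlyConfig` applies, shown here in plain NumPy rather than torchao; the shapes and helper names are illustrative only.

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=32):
    """Symmetric per-group 4-bit quantization: one scale per group of 32 weights."""
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    scale = np.abs(g).max(axis=-1, keepdims=True) / 7.0  # map max |w| to int4 max (7)
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover bf16-like weights at runtime: q * scale, regrouped to 2-D."""
    out_f, n_groups, group_size = q.shape
    return (q * scale).reshape(out_f, n_groups * group_size)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 64)).astype(np.float32)  # toy weight matrix
q, s = quantize_int4_groupwise(w)
w_hat = dequantize(q, s)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

The runner never materializes `w_hat` for the whole model at once; the point of weight-only quantization is that activations stay in bf16 while weights are stored as int4 plus one scale per group.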
## Benchmark (RTX 5080, ~20 s audio)
| Metric | Quantized | vs Non-Quantized |
|---|---|---|
| Prefill throughput | 3,218 tok/s | 2.5× faster |
| Decode throughput | 1,545 tok/s | 1.2× faster |
| Model load time | 1.6 s | 2.5× faster |
| Time to first token | 86 ms | 2.4× faster |
| Total inference | 144 ms | 1.9× faster |
| Real-time factor | 139× real-time | — |
| Model size | 763 MB | 3.2× smaller |
| Transcription accuracy | Identical words | No loss |
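As a sanity check on the table, the real-time factor follows directly from the other two rows: audio duration divided by total inference time, assuming the ~20 s clip length stated above.

```python
# Real-time factor derived from the benchmark rows above.
audio_seconds = 20.0               # approximate clip length
total_inference_seconds = 0.144    # "Total inference: 144 ms" from the table
rtf = audio_seconds / total_inference_seconds
print(f"{rtf:.0f}x real-time")     # matches the 139x row
```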
## More Info
Base model: nvidia/parakeet-tdt-0.6b-v3