parakeet-primeline-trt-a100

A pre-compiled TensorRT engine for the German FastConformer encoder of primeline/parakeet-primeline, ready to drop into the encoder.plan slot of NVIDIA's parakeet-0.6b-tdt NIM Triton model_repository. It skips the ~3-minute ONNX export → TensorRT engine compilation when standing up the slim NIM on an A100 host. It also gives every deployment a bit-identical engine for reproducible WER measurements.

What this is (and is not)

Is       A single TensorRT encoder.plan (~1.2 GB): strongly-typed network, batch size fixed at 1 for audio_signal and length, FP16 precision inherited from the source FastConformer weights.
Is not   A standalone model. The decoder, joint network, and tokenizer still live in the source .nemo file, and the Python backend that wraps the engine still comes from NVIDIA's NIM (see "How to use" below).

This artifact is the only piece of the slim primeline-v2 build pipeline that is freely redistributable: the surrounding Riva/Triton scaffold (config.pbtxt, niva_asr_model_cpp.py, riva_config.json, streaming_config.yaml) is NVIDIA NIM IP and stays out of this repo.

Source attribution

This engine is a derivative of primeline/parakeet-primeline (CC-BY-4.0). That model, "primeline-parakeet", is a 600M-parameter German fine-tune of NVIDIA parakeet-tdt-0.6b-v3 with a reported 2.95% average WER and 4.11% WER on Tuda-De. Full credit goes to primeline GmbH for the underlying weights; this artifact would not exist without their fine-tune. Compilation inputs and output:

Source NEMO:  primeline/parakeet-primeline (2_95_WER.nemo)
              sha256 1b7f6e4f5dcffabd44464a4f8afbb0688f71ea972c960e19efce36e73494c80f
              size   2.5 GB (2,509,332,480 bytes)
Engine sha256: 57393c716bdc7bf3510d7a03275f3fb4cd991208a82159d541246213398b3027
Engine size:   ~1.2 GB
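To confirm that a download matches the published hashes before staging it, a streaming SHA-256 check is enough. The following is a minimal sketch. It assumes the two files sit in the working directory under the names shown above (2_95_WER.nemo, encoder.plan). sha256_of and verify are illustrative helper names and are not tooling shipped with this repo.

```python
import hashlib
from pathlib import Path

# Published digests, copied from the attribution block above.
EXPECTED_SHA256 = {
    "2_95_WER.nemo": "1b7f6e4f5dcffabd44464a4f8afbb0688f71ea972c960e19efce36e73494c80f",
    "encoder.plan": "57393c716bdc7bf3510d7a03275f3fb4cd991208a82159d541246213398b3027",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so multi-GB artifacts never sit fully in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str) -> bool:
    """Return True if the file's SHA-256 matches the expected hex digest (case-insensitive)."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    for name, expected in EXPECTED_SHA256.items():
        p = Path(name)
        if not p.exists():
            print(f"{name}: missing, skipped")
            continue
        print(f"{name}: {'OK' if verify(p, expected) else 'MISMATCH'}")
```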

This derivative inherits the CC-BY-4.0 license of its source. Attribution is required if you redistribute or modify it.

Build environment (engine is bound to all of these)

Spec              Value
GPU               NVIDIA A100 80GB (compute capability sm_80)
TensorRT          10.13.2.6
ONNX export host  NVIDIA parakeet-0.6b-tdt:latest NIM container (NeMo 2.7.0rc0)
Batch size        Fixed at 1 for audio_signal and length
Precision         FP16 (inherited from the FP16 encoder in primeline's .nemo)
Network mode      Strongly typed

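The engine will fail to load on a different GPU architecture or under a different TensorRT version. sm_86, sm_89, and sm_90 each need their own compile. TRT engines are version-bound by design: by default, a serialized engine is only guaranteed to load with the exact TensorRT release that built it. Never mix major versions, and match 10.13.2.6 wherever possible.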
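A pre-flight check can catch both failure modes before Triton tries to deserialize the plan. The following is a sketch of hypothetical helpers; check_host and detect_host are illustrative names, not part of the NIM. A GPU-architecture mismatch or a TensorRT major-version mismatch is treated as fatal. Any other TensorRT release difference is treated as a warning, mirroring the constraints above. detect_host assumes PyTorch with CUDA and the tensorrt Python package are installed on the host.

```python
from dataclasses import dataclass

# Build environment of this engine, copied from the table above.
BUILT_SM = (8, 0)  # A100, sm_80
BUILT_TRT = "10.13.2.6"


@dataclass
class CompatReport:
    errors: list[str]
    warnings: list[str]

    @property
    def ok(self) -> bool:
        return not self.errors


def check_host(sm: tuple[int, int], trt_version: str) -> CompatReport:
    """Compare a host's compute capability and TensorRT version with the build env."""
    errors: list[str] = []
    warnings: list[str] = []
    if tuple(sm) != BUILT_SM:
        errors.append(
            f"GPU is sm_{sm[0]}{sm[1]}, engine was built for sm_{BUILT_SM[0]}{BUILT_SM[1]}"
        )
    host_major = trt_version.split(".")[0]
    built_major = BUILT_TRT.split(".")[0]
    if host_major != built_major:
        errors.append(f"TensorRT major {host_major} != {built_major}")
    elif trt_version != BUILT_TRT:
        # Same major, different release: flag it, rebuild if deserialization fails.
        warnings.append(f"TensorRT {trt_version} != {BUILT_TRT}; rebuild if loading fails")
    return CompatReport(errors, warnings)


def detect_host() -> tuple[tuple[int, int], str]:
    """Query the local GPU and TensorRT (needs torch with CUDA and the tensorrt wheel)."""
    import tensorrt  # lazy imports keep check_host() usable on hosts without TRT
    import torch

    return torch.cuda.get_device_capability(0), tensorrt.__version__
```

On the deploy host, call check_host(*detect_host()) and abort if .ok is False.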

How to use

This file is a drop-in for the encoder.plan slot inside NVIDIA's parakeet-0.6b-tdt NIM model_repository. Specifically, it belongs in the AM (acoustic-model) Triton submodel riva-nemo-parakeet-tdt-0.6b-multi-asr-offline-am-streaming-offline/1/. You need an NVIDIA AI Enterprise license to obtain that NIM image, which contains the Python backend and the Triton ensemble. Steps for the slim primeline-v2 build flow:

  1. Pull nvcr.io/nim/nvidia/parakeet-0.6b-tdt from NGC.
  2. Extract its model_repository to $WORK/models/ (the BLS dir and the AM riva-nemo-...-am-streaming-offline/ dir).
  3. In the AM dir's 1/ subfolder:
    • Replace model_graph.nemo with primeline's 2_95_WER.nemo.
    • Drop encoder.plan from this repo in alongside it.
  4. Apply the slim config overlay (max_batch_size=1, language_code=de-DE,multi) to the BLS and AM config.pbtxt files and to the BLS riva_bls_config.yaml.
  5. Patch the NIM's niva_asr_model_cpp.py (in the AM 1/ dir):
    • Wrap the trt_compile(...) call in a RIVA_SKIP_TRT_COMPILE env-var guard so a pre-baked engine is loaded instead of recompiled.
    • Pass "fallback": True to trt_compile (defensive only; the prebuilt engine is the actual load path).
  6. Run docker build to create a slim image on top of the upstream NIM that bakes in the staged model_repository.
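Steps 3 and 4 are plain file operations. The following is a minimal sketch that assumes $WORK/models/ already holds the extracted model_repository. stage_am_version_dir and set_max_batch_size are hypothetical helpers, not functions from the build script. The sketch also assumes the backend finds the checkpoint by the filename model_graph.nemo, so the replacement keeps that name. Only the Triton-standard top-level max_batch_size field is handled. The language_code overlay and the riva_bls_config.yaml edits are left out because their exact layout is NIM-internal.

```python
import re
import shutil
from pathlib import Path

# AM submodel directory name, as given in the NIM model_repository.
AM_DIR = "riva-nemo-parakeet-tdt-0.6b-multi-asr-offline-am-streaming-offline"


def stage_am_version_dir(models: Path, nemo_src: Path, plan_src: Path) -> Path:
    """Step 3: swap in primeline's .nemo and drop the prebuilt encoder.plan beside it."""
    version_dir = models / AM_DIR / "1"
    if not version_dir.is_dir():
        raise FileNotFoundError(f"extract the NIM model_repository first: {version_dir}")
    # Assumption: the backend looks up the checkpoint as model_graph.nemo.
    shutil.copy2(nemo_src, version_dir / "model_graph.nemo")
    shutil.copy2(plan_src, version_dir / "encoder.plan")
    return version_dir


def set_max_batch_size(config_pbtxt: Path, value: int = 1) -> None:
    """Step 4 (partial): rewrite the top-level max_batch_size of a Triton config.pbtxt."""
    text = config_pbtxt.read_text()
    # (?m)^ anchors at line start, so only the unindented top-level field matches.
    new_text, n = re.subn(r"(?m)^max_batch_size\s*:\s*\d+", f"max_batch_size: {value}", text)
    if n == 0:
        raise ValueError(f"no top-level max_batch_size in {config_pbtxt}")
    config_pbtxt.write_text(new_text)
```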

The full build script for the slim primeline-v2 image lives at marcoleder/stt-QualForschung scripts/build-slim-tdt-primeline-v2.sh. The rationale and a reproduction walkthrough are in marcoleder/core-QualForschung docs/stt-primeline-v2.md. Do not redistribute the resulting Triton model_repository: it contains NVIDIA NIM components governed by the NIM SLA, not by CC-BY-4.0.
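Step 5 of the how-to is the only source-level change to the NIM backend. The NIM's niva_asr_model_cpp.py is not reproduced here, so the following is only a generic sketch of the env-var guard pattern. compile_fn is a hypothetical stand-in for the trt_compile(...) call, and load_fn is a hypothetical stand-in for whatever loads the prebuilt encoder.plan. The set of "truthy" flag values is also an assumption.

```python
import os
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")

# Values treated as "on" for RIVA_SKIP_TRT_COMPILE (an assumption; adjust as needed).
_TRUTHY = {"1", "true", "yes", "on"}


def skip_trt_compile() -> bool:
    """True when RIVA_SKIP_TRT_COMPILE asks for the prebuilt engine instead of a compile."""
    return os.environ.get("RIVA_SKIP_TRT_COMPILE", "").strip().lower() in _TRUTHY


def maybe_compile(
    model: T,
    plan_path: Path,
    compile_fn: Callable[[T], T],
    load_fn: Callable[[T, Path], T],
) -> T:
    """Route to load_fn when the guard is set, otherwise to compile_fn."""
    if skip_trt_compile():
        if not plan_path.is_file():
            # Fail loudly rather than silently paying the compile cost on a slim image.
            raise FileNotFoundError(f"RIVA_SKIP_TRT_COMPILE is set but {plan_path} is missing")
        return load_fn(model, plan_path)
    return compile_fn(model)
```

Raising when the guard is set but the plan is missing is a design choice. A slim image that silently falls back to compiling would hide a staging error.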

Reproducing the engine

ONNX export from primeline's encoder runs into two NeMo 2.7.0rc0 bugs that both need one-line workarounds:

  • @typecheck() rejects positional args during the ONNX trace: wrap the export in a "with typecheck.disable_checks():" block.
  • polygraphy's constant folding removes a Cast node that is still referenced: pass do_constant_folding=False to torch.onnx.export.
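The following sketches the export with both workarounds applied. It assumes NeMo 2.7.0rc0 and PyTorch with CUDA are available, as inside the NIM container. It also assumes the checkpoint restores via ASRModel.restore_from. The output names, opset, and dummy trace length are hypothetical choices, not values taken from the build script. The 80 mel bins come from the card's stated input shape and should be checked against the model's preprocessor config.

```python
from pathlib import Path

N_MELS = 80  # per the card's audio_signal=[1,80,N]; verify against the .nemo preprocessor
DUMMY_FRAMES = 1000  # hypothetical trace length; the frame axis is exported as dynamic


def onnx_export_kwargs() -> dict:
    """Keyword args for torch.onnx.export, including workaround 2."""
    return {
        "input_names": ["audio_signal", "length"],
        "output_names": ["outputs", "encoded_lengths"],  # hypothetical names
        # Batch stays fixed at 1; only the frame axis is dynamic.
        "dynamic_axes": {"audio_signal": {2: "frames"}, "outputs": {2: "encoded_frames"}},
        "do_constant_folding": False,  # workaround 2: keep the still-referenced Cast
        "opset_version": 17,  # assumption; match whatever the NIM build expects
    }


def export_encoder(nemo_path: Path, onnx_path: Path) -> None:
    """Export only the FastConformer encoder of a .nemo checkpoint to ONNX."""
    import torch
    from nemo.collections.asr.models import ASRModel
    from nemo.core.classes.common import typecheck

    model = ASRModel.restore_from(str(nemo_path), map_location=torch.device("cuda"))
    encoder = model.encoder.eval()
    dtype = next(encoder.parameters()).dtype  # FP16 if the checkpoint stores FP16
    audio_signal = torch.randn(1, N_MELS, DUMMY_FRAMES, device="cuda", dtype=dtype)
    length = torch.tensor([DUMMY_FRAMES], device="cuda", dtype=torch.int64)

    # Workaround 1: @typecheck() would reject the positional args used by the trace.
    with typecheck.disable_checks(), torch.no_grad():
        torch.onnx.export(encoder, (audio_signal, length), str(onnx_path), **onnx_export_kwargs())
```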

Once the ONNX file is produced, the TRT engine is built with trtexec (strongly-typed network, batch=1, input shapes audio_signal=[1,80,N] and length=[1], FP16 inherited from the model). The full implementation is in stt-QualForschung/scripts/build-slim-tdt-primeline-v2.sh § Phase 6.
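The trtexec invocation can be assembled from the shapes above. The following is a sketch, not the Phase 6 script. The card only gives N, so the min/opt/max frame counts are hypothetical parameters. No --fp16 flag is passed, because in strongly-typed mode precision comes from the ONNX tensor types. trtexec_cmd is an illustrative helper name.

```python
import shlex
from pathlib import Path

N_MELS = 80  # from the card's audio_signal=[1,80,N]; verify against your model


def trtexec_cmd(
    onnx: Path, plan: Path, min_frames: int, opt_frames: int, max_frames: int
) -> list[str]:
    """Build a trtexec argv for a strongly-typed, batch-1 encoder engine."""
    if not 0 < min_frames <= opt_frames <= max_frames:
        raise ValueError("need 0 < min_frames <= opt_frames <= max_frames")

    def shapes(n: int) -> str:
        # length has shape [1], written as "1" in trtexec's AxBxC shape syntax
        return f"audio_signal:1x{N_MELS}x{n},length:1"

    return [
        "trtexec",
        f"--onnx={onnx}",
        f"--saveEngine={plan}",
        "--stronglyTyped",  # precision follows the ONNX types, so no --fp16 here
        f"--minShapes={shapes(min_frames)}",
        f"--optShapes={shapes(opt_frames)}",
        f"--maxShapes={shapes(max_frames)}",
    ]


if __name__ == "__main__":
    # Frame bounds here are placeholders; pick them from your longest expected input.
    print(shlex.join(trtexec_cmd(Path("encoder.onnx"), Path("encoder.plan"), 1, 1000, 6000)))
```

Passing the list to subprocess.run(..., check=True) on a host with trtexec on PATH would run the build. Keeping argv construction separate makes the shape profile easy to review and test.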

License

This derivative is released under CC-BY-4.0, inheriting from primeline/parakeet-primeline. Attribution to primeline GmbH is required for redistribution.

The TensorRT runtime and NVIDIA NIM scaffold needed at deploy time are governed by NVIDIA's own license terms, independent of this artifact's CC-BY-4.0 license.
