parakeet-primeline-trt-a100

A pre-compiled TensorRT engine for the German FastConformer encoder of primeline/parakeet-primeline, ready to drop into the encoder.plan slot of the Triton model_repository in NVIDIA's parakeet-0.6b-tdt NIM. It skips the ~3 min ONNX export and TRT engine build when standing up the slim NIM on an A100 host, and gives every deployment a bit-identical engine for reproducible WER measurements.

What this is — and is not


Is	A single TensorRT `encoder.plan` (~1.2 GB): a strongly typed network with batch=1 fixed for `audio_signal` and `length`, and FP16 inherited from the source FastConformer weights.
Is not	A standalone model. The decoder, joint network and tokenizer still live in the source `*.nemo`, and the Python backend that wraps the engine still comes from NVIDIA's NIM (see "How to use" below).

This artifact is the only piece of the slim primeline-v2 build pipeline that is freely redistributable: the surrounding Riva/Triton scaffold (config.pbtxt, niva_asr_model_cpp.py, riva_config.json, streaming_config.yaml) is NVIDIA NIM IP and stays out of this repo.

Source attribution

This engine is a derivative of primeline/parakeet-primeline (CC-BY-4.0; "primeline-parakeet", a 600M-parameter German fine-tune of NVIDIA parakeet-tdt-0.6b-v3 with 2.95% average WER and 4.11% WER on Tuda-De). Full credit to primeline GmbH for the underlying weights; this artifact would not exist without their fine-tune. Inputs to the compilation:

Source NEMO:  primeline/parakeet-primeline (2_95_WER.nemo)
              sha256 1b7f6e4f5dcffabd44464a4f8afbb0688f71ea972c960e19efce36e73494c80f
              size   2.5 GB (2,509,332,480 bytes)
Engine sha256: 57393c716bdc7bf3510d7a03275f3fb4cd991208a82159d541246213398b3027
Engine size:   ~1.2 GB
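To confirm that a download matches the digests above before staging it, a streaming SHA-256 is enough. The sketch below uses only the Python standard library; the file names are the ones listed above, while the 1 MiB chunk size and the script name in the usage comment are arbitrary, hypothetical choices.

```python
import hashlib
from pathlib import Path

# Digests copied from the attribution block above.
EXPECTED_SHA256 = {
    "2_95_WER.nemo": "1b7f6e4f5dcffabd44464a4f8afbb0688f71ea972c960e19efce36e73494c80f",
    "encoder.plan": "57393c716bdc7bf3510d7a03275f3fb4cd991208a82159d541246213398b3027",
}


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file chunk by chunk so multi-GB artifacts never sit fully in RAM."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file's SHA-256 equals the expected hex digest (case-insensitive)."""
    return sha256_file(path) == expected_hex.lower()


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python verify_artifacts.py encoder.plan 2_95_WER.nemo
    for name in sys.argv[1:]:
        ok = verify(name, EXPECTED_SHA256[Path(name).name])
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

Streaming keeps memory flat regardless of file size, which matters for the 2.5 GB checkpoint.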

This derivative inherits the CC-BY-4.0 license of the source; attribution is required if you redistribute or modify it.

Build environment (engine is bound to all of these)

Spec	Value
GPU	NVIDIA A100 80GB (compute capability sm_80)
TensorRT	10.13.2.6
ONNX export host	NVIDIA `parakeet-0.6b-tdt:latest` NIM container (NeMo 2.7.0rc0)
Batch size	fixed at 1 for `audio_signal` and `length`
Precision	FP16 (inherited from the FP16 encoder in primeline's `.nemo`)
Network mode	Strongly-typed

The engine will fail to load on a different GPU architecture (sm_86, sm_89 and sm_90 each need their own build) or under a different TensorRT version. Serialized TRT engines are version-bound by design: by default they load only with the TensorRT release that built them (10.13.2.6 here), so do not mix versions.
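A cheap pre-flight check can catch both kinds of mismatch before Triton tries to deserialize the engine. The helper below is hypothetical and not part of the NIM; on a live host its two inputs could come from `torch.cuda.get_device_capability()` and `tensorrt.__version__`. It deliberately compares the full TensorRT version string, and passing it does not guarantee the engine will load; it only rules out the obvious failures.

```python
# Build-time facts from the table above.
BUILT_CC = (8, 0)            # NVIDIA A100 = compute capability 8.0 (sm_80)
BUILT_TRT = "10.13.2.6"      # TensorRT release used to build the engine


def can_load_engine(device_cc, trt_version, built_cc=BUILT_CC, built_trt=BUILT_TRT):
    """Conservative pre-flight check for a prebuilt, non-portable TRT engine.

    device_cc:   (major, minor) compute capability of the target GPU
    trt_version: TensorRT version string of the runtime that will load it
    Returns False on any architecture or version mismatch.
    """
    same_arch = tuple(device_cc) == tuple(built_cc)
    # Compare the whole version string, not only the major part: without a
    # version-compatible build, TRT only loads engines from its own release.
    same_trt = trt_version.strip() == built_trt
    return same_arch and same_trt


def sm_name(cc):
    """Format a compute capability tuple the way the table does, e.g. sm_80."""
    major, minor = cc
    return f"sm_{major}{minor}"
```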

How to use

This file is a drop-in for the encoder.plan slot inside the model_repository of NVIDIA's parakeet-0.6b-tdt NIM (specifically the acoustic-model, or AM, Triton submodel riva-nemo-parakeet-tdt-0.6b-multi-asr-offline-am-streaming-offline/1/). You need an NVIDIA AI Enterprise license to obtain that NIM image, which is where the Python backend and the Triton ensemble live. Steps for the slim primeline-v2 build flow:

1. Pull nvcr.io/nim/nvidia/parakeet-0.6b-tdt from NGC.
2. Extract its model_repository to $WORK/models/ (the BLS dir and the AM riva-nemo-...-am-streaming-offline/ dir).
3. In the AM dir's 1/ subfolder:
   - Replace model_graph.nemo with primeline's 2_95_WER.nemo.
   - Drop encoder.plan from this repo in alongside it.
4. Apply the slim config overlay (max_batch_size=1, language_code=de-DE,multi) to the BLS and AM config.pbtxt files and to the BLS riva_bls_config.yaml.
5. Patch the NIM's niva_asr_model_cpp.py (in the AM 1/ dir):
   - Wrap the trt_compile(...) call in a RIVA_SKIP_TRT_COMPILE env-var guard so the pre-baked engine is loaded instead of recompiled.
   - Pass "fallback": True to trt_compile (a defensive setting; the prebuilt engine remains the load path).
6. Run docker build to create a slim image on top of the upstream NIM that bakes in the staged model_repository.
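The file-staging part of the AM 1/ step and the max_batch_size half of the config overlay can be sketched with the standard library alone. This is a hypothetical helper, not the author's build-slim-tdt-primeline-v2.sh. It assumes the primeline checkpoint is copied under the model_graph.nemo name it replaces, and it does not touch language_code or riva_bls_config.yaml, because their exact layout inside the NIM configs is not shown here.

```python
import re
import shutil
from pathlib import Path

# Triton's config.pbtxt is protobuf text format, e.g. a line "max_batch_size: 8".
_MAX_BATCH = re.compile(r"(?m)^([ \t]*max_batch_size[ \t]*:[ \t]*)\d+")


def stage_am_version_dir(am_dir, nemo_path, engine_path):
    """Copy primeline's .nemo and the prebuilt engine into <am_dir>/1/."""
    version_dir = Path(am_dir) / "1"
    version_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the checkpoint takes over the model_graph.nemo file name.
    shutil.copyfile(nemo_path, version_dir / "model_graph.nemo")
    shutil.copyfile(engine_path, version_dir / "encoder.plan")
    return version_dir


def force_max_batch_size_one(pbtxt_text):
    """Rewrite every max_batch_size field to 1; fail loudly if none exists."""
    # \g<1> keeps the captured prefix; the trailing "1" is the new value.
    new_text, count = _MAX_BATCH.subn(r"\g<1>1", pbtxt_text)
    if count == 0:
        raise ValueError("config.pbtxt has no max_batch_size field")
    return new_text
```

Raising on a missing field, rather than appending one, keeps the overlay from silently changing configs it does not recognize.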
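One possible shape of the RIVA_SKIP_TRT_COMPILE guard in the patched niva_asr_model_cpp.py is sketched below. It is hypothetical: `compile_fn` stands in for the NIM's trt_compile(...), whose real signature is not reproduced here, the accepted "enabled" spellings are an assumption, and falling through to a compile when encoder.plan is missing is a design choice (failing hard would be the stricter alternative).

```python
import os
from pathlib import Path

# Assumption: spellings treated as "enabled" for the guard variable.
_TRUTHY = {"1", "true", "yes", "on"}


def skip_trt_compile(env=None):
    """True when RIVA_SKIP_TRT_COMPILE is set to an enabled value."""
    env = os.environ if env is None else env
    return env.get("RIVA_SKIP_TRT_COMPILE", "").strip().lower() in _TRUTHY


def maybe_compile_encoder(encoder, engine_path, compile_fn, env=None):
    """Guarded compile step; returns True if compile_fn was invoked.

    With the guard enabled and a prebuilt encoder.plan on disk, compilation is
    skipped and the existing engine is left for the load path to pick up.
    """
    if skip_trt_compile(env) and Path(engine_path).is_file():
        return False
    # Pass "fallback": True, as in the patch notes above (a defensive setting).
    compile_fn(encoder, engine_path, args={"fallback": True})
    return True
```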

The full build script for the slim primeline-v2 image lives at marcoleder/stt-QualForschung scripts/build-slim-tdt-primeline-v2.sh, with the rationale and reproduction walkthrough in marcoleder/core-QualForschung docs/stt-primeline-v2.md. Do not redistribute the resulting Triton model_repository — it contains NVIDIA NIM components governed by the NIM SLA, not by CC-BY-4.0.

Reproducing the engine

ONNX export from primeline's encoder runs into two NeMo 2.7.0rc0 bugs that both need one-line workarounds:

- `@typecheck()` rejects positional arguments during the ONNX trace. Workaround: run the export inside `with typecheck.disable_checks():`.
- Polygraphy's constant folding removes a Cast that is still referenced. Workaround: pass `do_constant_folding=False` to `torch.onnx.export`.

Once the ONNX file is produced, the TRT engine is built with trtexec (strongly typed network, batch=1, input shapes audio_signal=[1,80,N] and length=[1], FP16 inherited from the model). The full implementation is in stt-QualForschung/scripts/build-slim-tdt-primeline-v2.sh § Phase 6.
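Because N (the frame axis of audio_signal) varies while the batch is pinned to 1, trtexec needs a min/opt/max shape profile. The helper below assembles such a command as an argument list; it is a sketch, not Phase 6 of the build script, and the frame counts in the usage block are placeholders. `--onnx`, `--saveEngine`, `--stronglyTyped` and `--minShapes/--optShapes/--maxShapes` are standard trtexec options. No `--fp16` flag is passed because a strongly typed network takes its precisions from the ONNX tensor types. The feature dimension defaults to the 80 stated above and should match the encoder's preprocessor.

```python
import shlex


def shape_spec(n_frames, n_feats=80):
    """trtexec shape syntax: name:d0xd1x...,name2:... (length is a 1-element tensor)."""
    return f"audio_signal:1x{n_feats}x{n_frames},length:1"


def trtexec_cmd(onnx_path, engine_path, min_n, opt_n, max_n, n_feats=80):
    """Argument list for a batch=1, strongly typed encoder build."""
    if not 0 < min_n <= opt_n <= max_n:
        raise ValueError("need 0 < min_n <= opt_n <= max_n")
    return [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        "--stronglyTyped",  # precisions come from the ONNX graph, so no --fp16
        f"--minShapes={shape_spec(min_n, n_feats)}",
        f"--optShapes={shape_spec(opt_n, n_feats)}",
        f"--maxShapes={shape_spec(max_n, n_feats)}",
    ]


if __name__ == "__main__":
    # Placeholder frame counts; size max_n for the longest expected utterance.
    cmd = trtexec_cmd("encoder.onnx", "encoder.plan", 16, 1000, 6000)
    # Print a shell-safe command; on the build host, pass cmd to subprocess.run.
    print(shlex.join(cmd))
```

Returning a list rather than a shell string avoids quoting problems if paths contain spaces.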

License

This derivative is released under CC-BY-4.0, inheriting from primeline/parakeet-primeline. Attribution to primeline GmbH is required for redistribution.

The TensorRT runtime and NVIDIA NIM scaffold needed at deploy time are governed by NVIDIA's own license terms, independent of this artifact's CC-BY-4.0 license.
