	---
	license: other
	license_name: embedl-models-community-licence-v1.0
	license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
	base_model: amazon/chronos-2
	quantized_from: amazon/chronos-2
	tags:
	- time-series
	- time-series-forecasting
	- chronos
	- chronos-2
	- int8
	- tensorrt
	- quantization
	- edge
	- jetson
	- orin
	library_name: onnx
	pipeline_tag: time-series-forecasting
	gated: true
	extra_gated_heading: Acknowledge Embedl Models Community Licence v1.0
	extra_gated_description: |
	  By requesting access you agree to the Embedl Models Community
	  Licence v1.0 (no redistribution as a hosted service) and to the
	  upstream chronos-2 license terms.
	extra_gated_button_content: Request access
	---

	<!-- embedl-banner:start -->
	<style>
	.embedl-btn-primary { transition: background 160ms ease, box-shadow 160ms ease; }
	.embedl-btn-primary:hover { background: #4FDCE4 !important; box-shadow: 0 8px 22px rgba(45,212,221,0.45) !important; }
	.embedl-btn-secondary { transition: background 160ms ease; }
	.embedl-btn-secondary:hover { background: rgba(45,212,221,0.15) !important; }
	.embedl-headline { font-size: clamp(11px, 2.15vw, 15px) !important; }
	.embedl-btn-primary, .embedl-btn-secondary {
	font-size: clamp(11px, 1.65vw, 13px) !important;
	padding: clamp(6px, 1.1vw, 9px) clamp(10px, 1.6vw, 14px) !important;
	}
	</style>
	<div style="background:radial-gradient(600px 220px at 0% 50%,rgba(45,212,221,0.22) 0%,rgba(45,212,221,0) 60%),radial-gradient(400px 180px at 100% 100%,rgba(45,212,221,0.10) 0%,rgba(45,212,221,0) 55%),linear-gradient(135deg,#0B1626 0%,#142338 100%);border:1px solid rgba(45,212,221,0.28);border-radius:12px;padding:22px 24px;margin:0 0 24px 0;color:#F2F6FA;box-shadow:0 4px 16px rgba(11,22,38,0.18);overflow:hidden;box-sizing:border-box;max-width:100%;">
	<table style="width:100%;border-collapse:collapse;border:0;background:transparent;">
	<tr style="background:transparent;">
	<td style="vertical-align:middle;border:0;padding:0;background:transparent;">
	<div style="display:inline-block;font-size:10px;letter-spacing:0.08em;text-transform:uppercase;font-weight:700;color:#2DD4DD;background:rgba(45,212,221,0.15);border:1px solid rgba(45,212,221,0.35);padding:4px 10px;border-radius:999px;margin-bottom:10px;white-space:nowrap;">Optimized by Embedl</div>
	<div class="embedl-headline" style="font-size:15px;font-weight:700;line-height:1.35;color:#F2F6FA;margin-bottom:4px;">Need to <span style="color:#2DD4DD;white-space:nowrap;">fine-tune</span>, hit <span style="color:#2DD4DD;white-space:nowrap;">performance targets</span>, or deploy on <span style="color:#2DD4DD;white-space:nowrap;">specific hardware</span>?</div>
	<div style="font-size:13px;color:#9BA7B5;">We've got you covered.</div>
	</td>
	<td width="1%" style="vertical-align:middle;border:0;padding:0 0 0 18px;white-space:nowrap;text-align:right;background:transparent;">
	<a href="https://www.embedl.com/models" class="embedl-btn-secondary" style="display:inline-block;font-size:13px;font-weight:600;padding:9px 14px;border-radius:6px;border:1px solid #2DD4DD;color:#2DD4DD;text-decoration:none;margin-right:8px;">Learn more</a>
	<a href="https://www.embedl.com/contact" class="embedl-btn-primary" style="display:inline-block;font-size:13px;font-weight:600;padding:9px 14px;border-radius:6px;border:1px solid #2DD4DD;background:#2DD4DD;color:#0B1626;text-decoration:none;box-shadow:0 6px 18px rgba(45,212,221,0.28);">Get in touch →</a>
	</td>
	</tr>
	</table>
	</div>
	<!-- embedl-banner:end -->

	# Embedl Chronos-2 (Quantized for TensorRT)

	Deployable INT8-quantized version of
	[`amazon/chronos-2`](https://huggingface.co/amazon/chronos-2),
	optimized with
	[embedl-deploy](https://github.com/embedl/embedl-deploy) for
	low-latency NVIDIA TensorRT inference on edge GPUs. Two
	static-context variants ship: ctx=512 for short-history
	forecasting and ctx=2048 for long-history use cases.

	## Upstream Model

	<a href="https://hfviewer.com/amazon/chronos-2?utm_source=huggingface&utm_medium=embedded_model_card&utm_campaign=amazon__chronos-2_card" target="_blank" rel="noopener">
	<img
	src="https://hfviewer.com/api/card.svg?source=amazon%2Fchronos-2&v=20260501clipcard"
	alt="Open amazon/chronos-2 in hfviewer"
	width="100%"
	/>
	</a>

	## Highlights

	- Per-tensor INT8 activations and per-channel INT8 weights, produced
	by embedl-deploy's post-training quantization (PTQ) flow on top of
	TensorRT's fused multi-head attention (MHA) kernel. No
	quantization-aware training (QAT) or distillation is needed.
	- Drop-in replacement for `amazon/chronos-2` inference: same
	`(context, group_ids) → quantile_preds` signature; 21 evenly
	spaced quantile levels with the median at index 10.
	- Validated on the [GIFT-Eval](https://huggingface.co/datasets/Salesforce/GiftEval)
	benchmark across 125 task configurations. See Accuracy below.
	- Two ctx variants so you can pick the latency/history-window
	trade-off that fits your deployment.

	## Quick Start

	```bash
	pip install tensorrt pycuda numpy
	python infer_trt.py --ctx 512 # 1.2× faster than FP16 on Orin
	python infer_trt.py --ctx 2048 # 1.3× faster than FP16 on Orin
	```

	The `infer_trt.py` helper script builds a TensorRT engine from the
	ONNX on first run (cached as `*.engine` next to the artifact) and
	feeds a synthetic seasonal context for demonstration. Replace the
	context generator with your own series, sized to match `--ctx`
	(512 or 2048 points).
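
Because both variants use a static context length, a real series usually has to be truncated or padded before it can replace the synthetic context. The sketch below shows one way to do that. The helper name `fit_context`, the `(1, ctx_len)` float32 layout, and NaN as the padding value are all assumptions made for illustration; check `infer_trt.py` for the input layout and missing-value convention the artifact actually expects.

```python
import numpy as np


def fit_context(series, ctx_len, pad_value=np.nan):
    """Return a (1, ctx_len) float32 array holding the most recent values.

    Longer series are truncated from the left; shorter ones are left-padded.
    pad_value=NaN is an assumption, not something this model card specifies.
    """
    values = np.asarray(series, dtype=np.float32).ravel()
    values = values[-ctx_len:]  # keep only the newest ctx_len points
    out = np.full((1, ctx_len), pad_value, dtype=np.float32)
    if values.size:
        out[0, -values.size:] = values  # right-align the history
    return out


if __name__ == "__main__":
    history = np.sin(np.arange(600) * 2 * np.pi / 24)  # toy seasonal series
    context = fit_context(history, ctx_len=512)
    print(context.shape, context.dtype)
```

Right-aligning keeps the newest observation at the end of the window, which is where a forecaster expects the forecast origin to be.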

	## Files

	| File | Purpose |
	|---|---|
	| `embedl_chronos_2_ctx512_int8.onnx` | INT8 ONNX with Q/DQ nodes, ctx=512, 1024-step horizon. |
	| `embedl_chronos_2_ctx2048_int8.onnx` | INT8 ONNX with Q/DQ nodes, ctx=2048, 1024-step horizon. |
	| `infer_trt.py` | ONNX Runtime / TensorRT inference example. |

	Both artifacts emit a `(1, 21, 1024)` quantile tensor: 21 quantile
	levels, each covering 64 output patches × 16 steps per patch = 1024
	horizon steps. Slice the median (`preds[0, 10]`) for a point
	forecast and truncate it to the prediction length you need.
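
As a minimal sketch of that post-processing step, the helper below takes the median row and keeps only the first `prediction_length` steps. The function name `point_forecast` is hypothetical; the `(1, 21, 1024)` shape and the median index 10 are taken from this card, and the random array only stands in for real model output.

```python
import numpy as np

MEDIAN_INDEX = 10  # per this card: the median sits at index 10 of the 21 levels
HORIZON = 1024     # 64 output patches x 16 steps per patch


def point_forecast(preds, prediction_length):
    """Return the median forecast truncated to prediction_length steps."""
    preds = np.asarray(preds)
    if preds.shape != (1, 21, HORIZON):
        raise ValueError(f"expected shape (1, 21, {HORIZON}), got {preds.shape}")
    if not 1 <= prediction_length <= HORIZON:
        raise ValueError(f"prediction_length must be in [1, {HORIZON}]")
    return preds[0, MEDIAN_INDEX, :prediction_length]


if __name__ == "__main__":
    fake_preds = np.random.default_rng(0).normal(size=(1, 21, HORIZON))
    print(point_forecast(fake_preds, prediction_length=48).shape)
```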

	## Performance

	Latency measured with TensorRT + `trtexec`, GPU compute time only
	(`--noDataTransfers`), CUDA Graph + Spin Wait enabled, clocks locked
	(`nvpmodel -m 0 && jetson_clocks` on Jetson).
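
The card does not give the exact command lines behind these numbers. The sketch below assembles a `trtexec` invocation that matches the description, mapping "CUDA Graph + Spin Wait" to the `--useCudaGraph` and `--useSpinWait` flags. The helper name, the baseline ONNX filename, and the choice of `--int8 --fp16` for the Q/DQ model are assumptions, so treat this as a starting point for reproducing the setup, not as the measured configuration.

```python
import shlex


def trtexec_command(onnx_path, precision_flags=()):
    """Build a trtexec argument list for a GPU-compute-only latency run."""
    return [
        "trtexec",
        f"--onnx={onnx_path}",
        *precision_flags,
        "--noDataTransfers",  # exclude host<->device copies from the timing
        "--useCudaGraph",     # replay inference through a captured CUDA graph
        "--useSpinWait",      # busy-wait on the GPU to reduce sync jitter
    ]


if __name__ == "__main__":
    # "chronos_2_fp16_baseline.onnx" is a hypothetical filename for illustration.
    runs = {
        "fp16": ("chronos_2_fp16_baseline.onnx", ["--fp16"]),
        "best": ("chronos_2_fp16_baseline.onnx", ["--best"]),
        # Assumed flags for an ONNX model that already carries Q/DQ nodes.
        "int8": ("embedl_chronos_2_ctx512_int8.onnx", ["--int8", "--fp16"]),
    }
    for name, (path, flags) in runs.items():
        print(name, "->", shlex.join(trtexec_command(path, flags)))
```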

	### Jetson AGX Orin (MAXN)

	#### ctx=512

	<p align="center">
	<img src="https://huggingface.co/datasets/embedl/documentation-images/resolve/main/chronos-2-quantized-trt/chronos-2-quantized-trt__orin-mountain-view__latency_ctx512.svg" alt="Chronos-2 INT8 latency, ctx=512" width="640">
	</p>

	| Build | Mean latency (ms) |
	|---|---|
	| TensorRT FP16 | 2.977 |
	| TensorRT `--best` | 2.974 |
	| embedl INT8 | 2.432 |
	| Speedup (FP16 → embedl INT8) | 1.22× |

	#### ctx=2048

	<p align="center">
	<img src="https://huggingface.co/datasets/embedl/documentation-images/resolve/main/chronos-2-quantized-trt/chronos-2-quantized-trt__orin-mountain-view__latency_ctx2048.svg" alt="Chronos-2 INT8 latency, ctx=2048" width="640">
	</p>

	| Build | Mean latency (ms) |
	|---|---|
	| TensorRT FP16 | 4.482 |
	| TensorRT `--best` | 4.482 |
	| embedl INT8 | 3.482 |
	| Speedup (FP16 → embedl INT8) | 1.29× |
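
The speedup rows are the FP16 mean latency divided by the INT8 mean latency. The short check below reproduces them from the table values; the dictionary layout and function name are only illustrative.

```python
# Mean latencies in milliseconds, copied from the two tables above.
LATENCY_MS = {
    "ctx512": {"fp16": 2.977, "int8": 2.432},
    "ctx2048": {"fp16": 4.482, "int8": 3.482},
}


def speedup(baseline_ms, optimized_ms):
    """Speedup factor: how many times faster the optimized build is."""
    return baseline_ms / optimized_ms


if __name__ == "__main__":
    for name, lat in LATENCY_MS.items():
        factor = speedup(lat["fp16"], lat["int8"])
        saved = 1.0 - lat["int8"] / lat["fp16"]
        print(f"{name}: {factor:.2f}x speedup, {saved:.1%} lower latency")
```

Expressed as a latency reduction, the same numbers correspond to roughly 18 % (ctx=512) and 22 % (ctx=2048) less GPU time per inference.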

	## Accuracy

	Evaluated on the
	[GIFT-Eval](https://huggingface.co/datasets/Salesforce/GiftEval)
	benchmark — 125 task configurations spanning 50 datasets ×
	{short, medium, long} horizons. Aggregate WQL (weighted quantile
	loss, lower is better) reported using the
	[TIME-paper normalization](https://arxiv.org/html/2602.12147v2):
	geomean of per-task ratio against the Seasonal-Naive baseline.
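
The normalization described above can be sketched as follows: divide each task's WQL by the Seasonal-Naive WQL on the same task, then take the geometric mean of those ratios. The function name and the toy numbers are illustrative only and are not taken from the evaluation.

```python
import numpy as np


def geomean_relative_wql(model_wql, baseline_wql):
    """Geometric mean of per-task WQL ratios (model / baseline)."""
    model = np.asarray(model_wql, dtype=float)
    baseline = np.asarray(baseline_wql, dtype=float)
    if model.shape != baseline.shape:
        raise ValueError("model and baseline must cover the same tasks")
    ratios = model / baseline
    # exp(mean(log(r))) is the geometric mean and avoids overflow in products.
    return float(np.exp(np.mean(np.log(ratios))))


if __name__ == "__main__":
    toy_model = [0.2, 0.8]     # made-up per-task WQL values
    toy_baseline = [0.4, 1.0]  # made-up Seasonal-Naive WQL values
    print(round(geomean_relative_wql(toy_model, toy_baseline), 4))
```

A value below 1.0 means the model beats the baseline on the geomean. Using a geometric rather than arithmetic mean keeps one task with a very large ratio from dominating the aggregate.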

	| Metric | FP32 baseline | embedl INT8 ctx=512 | embedl INT8 ctx=2048 |
	|---|---|---|---|
	| Geomean WQL / Seasonal-Naive | 0.549 | 0.634 | 0.618 |
	| Geomean WQL / FP32 | 1.000 | 1.156× | 1.126× |
	| Median WQL / FP32 | 1.000 | 1.074× | 1.045× |
	| Cells within 10 % of FP32 | n/a | 71 / 125 (57 %) | 79 / 125 (63 %) |
	| Cells within 20 % of FP32 | n/a | 96 / 125 (77 %) | 98 / 125 (78 %) |
	| Cells beating FP32 | n/a | 14 / 125 | 19 / 125 |

	How to read the headline number. A geomean WQL / Seasonal-Naive
	ratio of 0.634 (ctx=512) or 0.618 (ctx=2048) means the INT8 model
	retains the bulk of `chronos-2`'s skill margin over the no-model
	Seasonal-Naive baseline. The FP32 model sits at 0.549 by the same
	convention, so the INT8 versions score about 16 % (ctx=512) and
	13 % (ctx=2048) worse than FP32 but still convincingly beat
	Seasonal-Naive on the geomean.

	Where the regression concentrates. Worst-case cells are
	out-of-distribution low-frequency series (`us_births/M`,
	`m4_hourly/{medium,long}`) and high-frequency long-horizon
	forecasts (`solar/10T/{medium,long}`). The full per-task CSVs
	ship with the artifacts; check them before deploying to a domain
	that resembles those outliers.
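
One way to run that check is to reduce the per-task results to INT8 / FP32 WQL ratios and summarize them in the same terms as the table above. The column layout of the shipped CSVs is not documented here, so the sketch takes a plain task-to-ratio mapping. The task names and values are placeholders, and "within 10 %" is assumed to mean a ratio of at most 1.10.

```python
def summarize_ratios(ratios, worst_k=3):
    """Summarize per-task INT8/FP32 WQL ratios (lower is better)."""
    values = list(ratios.values())
    return {
        "tasks": len(values),
        "within_10pct": sum(r <= 1.10 for r in values),  # assumed definition
        "within_20pct": sum(r <= 1.20 for r in values),  # assumed definition
        "beating_fp32": sum(r < 1.0 for r in values),
        # Tasks with the largest regression first.
        "worst": sorted(ratios, key=ratios.get, reverse=True)[:worst_k],
    }


if __name__ == "__main__":
    # Placeholder task names and ratios, not real evaluation results.
    toy = {
        "task_a/short": 0.97,
        "task_b/short": 1.04,
        "task_c/medium": 1.15,
        "task_d/long": 1.62,
    }
    for key, value in summarize_ratios(toy, worst_k=2).items():
        print(f"{key}: {value}")
```

Filtering the mapping to the datasets and horizons closest to your own domain before summarizing gives a more relevant estimate than the aggregate figures.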

	## Creating Your Own Optimized Models

	This artifact was produced with
	[embedl-deploy](https://github.com/embedl/embedl-deploy), Embedl's
	open-source PyTorch → TensorRT deployment library. The same workflow
	applies to your own models — see
	[the documentation](https://github.com/embedl/embedl-deploy#readme)
	for installation and usage.

	## License

	| Component | License |
	|---|---|
	| Optimized model artifacts (this repo) | [Embedl Models Community Licence v1.0](https://github.com/embedl/embedl-models/blob/main/LICENSE) (no redistribution as a hosted service) |
	| Upstream architecture and weights | [Amazon Chronos-2 License](https://huggingface.co/amazon/chronos-2/blob/main/LICENSE) |

	## Contact

	We offer engineering support for on-prem/edge deployments and partner
	co-marketing opportunities. Reach out at
	[contact@embedl.com](mailto:contact@embedl.com), or open an issue on
	[GitHub](https://github.com/embedl/embedl-deploy).

	<!-- embedl-discord-banner:start -->
	<style>
	.embedl-discord-btn { transition: background 160ms ease, box-shadow 160ms ease; }
	.embedl-discord-btn:hover { background: #6C77F5 !important; box-shadow: 0 8px 22px rgba(88,101,242,0.55) !important; }
	</style>
	<div style="background:radial-gradient(600px 220px at 0% 50%,rgba(88,101,242,0.22) 0%,rgba(88,101,242,0) 60%),radial-gradient(400px 180px at 100% 100%,rgba(88,101,242,0.10) 0%,rgba(88,101,242,0) 55%),linear-gradient(135deg,#0B1626 0%,#142338 100%);border:1px solid rgba(88,101,242,0.35);border-radius:12px;padding:22px 24px;margin:24px 0 0 0;color:#F2F6FA;box-shadow:0 4px 16px rgba(11,22,38,0.18);overflow:hidden;box-sizing:border-box;max-width:100%;">
	<table style="width:100%;border-collapse:collapse;border:0;background:transparent;">
	<tr style="background:transparent;">
	<td style="vertical-align:middle;border:0;padding:0;background:transparent;">
	<div style="display:inline-block;font-size:10px;letter-spacing:0.08em;text-transform:uppercase;font-weight:700;color:#A5B4FC;background:rgba(88,101,242,0.18);border:1px solid rgba(88,101,242,0.45);padding:4px 10px;border-radius:999px;margin-bottom:10px;white-space:nowrap;">Community & support</div>
	<div style="font-size:15px;font-weight:700;line-height:1.35;color:#F2F6FA;margin-bottom:4px;">Need help with this model? Chat with the Embedl team and other engineers on <span style="color:#A5B4FC;white-space:nowrap;">Discord</span>.</div>
	<div style="font-size:13px;color:#9BA7B5;">Quantization gotchas, hardware questions, fine-tuning tips — bring them all.</div>
	</td>
	<td width="1%" style="vertical-align:middle;border:0;padding:0 0 0 18px;white-space:nowrap;text-align:right;background:transparent;">
	<a href="https://discord.gg/MTbMWdKqE" class="embedl-discord-btn" style="display:inline-block;font-size:13px;font-weight:600;padding:9px 14px;border-radius:6px;border:1px solid #5865F2;background:#5865F2;color:#FFFFFF;text-decoration:none;box-shadow:0 6px 18px rgba(88,101,242,0.35);">Join our Discord →</a>
	</td>
	</tr>
	</table>
	</div>
	<!-- embedl-discord-banner:end -->