Embedl Chronos-2 (Quantized for TensorRT)

Deployable INT8-quantized version of amazon/chronos-2, optimized with embedl-deploy for low-latency NVIDIA TensorRT inference on edge GPUs. Two static-context variants ship: ctx=512 for short-history forecasting and ctx=2048 for long-history use cases.

Highlights

Per-tensor INT8 activations + per-channel INT8 weights via embedl-deploy's PTQ flow on top of TensorRT's fused MHA kernel. No QAT or distillation needed.
Drop-in replacement for amazon/chronos-2 inference: same (context, group_ids) → quantile_preds signature; 21 evenly spaced quantile levels with the median at index 10.
Validated on the GIFT-Eval benchmark across 125 task configurations. See Accuracy below.
Two ctx variants so you can pick the latency/history-window trade-off that fits your deployment.
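
To make the difference between per-tensor and per-channel scaling concrete, here is a minimal NumPy sketch of symmetric INT8 quantization. It is an illustration only and not embedl-deploy's implementation: the function names, the max-abs calibration rule, and the toy weight matrix are all assumptions introduced for this example.

```python
import numpy as np

def quantize_per_tensor(x, num_bits=8):
    """Symmetric quantization with a single scale for the whole array."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for INT8
    scale = max(float(np.max(np.abs(x))), 1e-12) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def quantize_per_channel(w, axis=0, num_bits=8):
    """Symmetric quantization with one scale per slice along `axis`."""
    qmax = 2 ** (num_bits - 1) - 1
    reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
    max_abs = np.max(np.abs(w), axis=reduce_axes, keepdims=True)
    scale = np.maximum(max_abs, 1e-12) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

# Toy weight matrix (hypothetical): channel 0 has a much larger range.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 16)).astype(np.float32)
weights[0] *= 50.0

q_pc, s_pc = quantize_per_channel(weights, axis=0)
q_pt, s_pt = quantize_per_tensor(weights)

# Reconstruction error on the small-range channels (rows 1..3).
err_pc = float(np.abs(q_pc * s_pc - weights)[1:].max())
err_pt = float(np.abs(q_pt * s_pt - weights)[1:].max())
print(f"per-channel max error: {err_pc:.4f}, per-tensor max error: {err_pt:.4f}")
```

With a single scale, the one wide channel dictates the step size for every other channel; per-channel scales avoid that, which is why weights are commonly quantized per channel while activations use one scale per tensor.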

Quick Start

pip install tensorrt pycuda numpy
python infer_trt.py --ctx 512    # 1.2× faster than FP16 on Orin
python infer_trt.py --ctx 2048   # 1.3× faster than FP16 on Orin

The infer_trt.py helper script builds a TensorRT engine from the ONNX on first run (cached as *.engine next to the artifact) and feeds a synthetic seasonal context for demonstration. Replace the context generator with your own series of the right length.
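
Because both variants use a static context length, your own series has to be brought to exactly 512 or 2048 values before inference. The sketch below shows one way to do that; `fit_context` is a hypothetical helper, not part of infer_trt.py, and the padding value is an assumption. Check the script for the value and tensor shape the artifact actually expects.

```python
import numpy as np

def fit_context(series, ctx, pad_value=np.nan):
    """Return the last `ctx` values of `series`, left-padded if it is shorter.

    pad_value is an assumption: NaN is used here only as a placeholder for
    "missing history"; confirm what the quantized artifact accepts.
    """
    series = np.asarray(series, dtype=np.float32).ravel()
    if series.size >= ctx:
        return series[-ctx:]                 # keep the most recent history
    pad = np.full(ctx - series.size, pad_value, dtype=np.float32)
    return np.concatenate([pad, series])     # pad on the left (oldest side)

long_history = np.arange(600, dtype=np.float32)
short_history = np.array([1.0, 2.0, 3.0], dtype=np.float32)

ctx_long = fit_context(long_history, ctx=512)
ctx_short = fit_context(short_history, ctx=5, pad_value=0.0)
print(ctx_long.shape, ctx_short)
```

Truncating from the left keeps the most recent observations, which is usually what a forecaster needs; any batch dimension the engine requires would be added afterwards.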

Files

File	Purpose
`embedl_chronos_2_ctx512_int8.onnx`	INT8 ONNX with Q/DQ — ctx=512, 1024-step horizon.
`embedl_chronos_2_ctx2048_int8.onnx`	INT8 ONNX with Q/DQ — ctx=2048, 1024-step horizon.
`infer_trt.py`	ONNX Runtime / TensorRT inference example.

Both artifacts emit a (1, 21, 1024) quantile tensor: 21 quantile levels, each covering 64 output patches × 16 steps per patch = 1024 horizon steps. Slice the median (preds[0, 10]) for a point forecast and truncate it to the prediction length you need.
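
The slicing described above can be sketched in a few lines of NumPy. The shape and the median index come from this model card; the helper name `point_forecast` and the synthetic `preds` tensor are assumptions used only for illustration.

```python
import numpy as np

MEDIAN_INDEX = 10      # per this model card: median is index 10 of 21 levels
NUM_QUANTILES = 21
HORIZON = 1024

def point_forecast(preds, prediction_length):
    """Return the median forecast truncated to `prediction_length` steps."""
    preds = np.asarray(preds)
    if preds.shape[1:] != (NUM_QUANTILES, HORIZON):
        raise ValueError(f"unexpected output shape {preds.shape}")
    if not 1 <= prediction_length <= HORIZON:
        raise ValueError("prediction_length must be between 1 and 1024")
    return preds[0, MEDIAN_INDEX, :prediction_length]

# Synthetic stand-in for the engine output: quantile q is filled with value q.
levels = np.arange(NUM_QUANTILES, dtype=np.float32)
preds = np.tile(levels[None, :, None], (1, 1, HORIZON))

forecast = point_forecast(preds, prediction_length=48)
print(forecast.shape, forecast[:3])
```

The same indexing pattern works for prediction intervals: pick a lower and an upper quantile index instead of the median and truncate both to the same length.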

Performance

Latency measured with TensorRT + trtexec, GPU compute time only (--noDataTransfers), CUDA Graph + Spin Wait enabled, clocks locked (nvpmodel -m 0 && jetson_clocks on Jetson).
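
For readers who want to repeat the measurement, the sketch below assembles trtexec command lines that match the settings listed above. It only prints the commands. The FP32 ONNX file name is hypothetical (no FP32 export ships with this repo), and the precision flags chosen for each build are an assumption about how the three rows were produced.

```python
import shlex

# Flags corresponding to the measurement settings described above.
COMMON_FLAGS = ["--noDataTransfers", "--useCudaGraph", "--useSpinWait"]

def trtexec_command(onnx_path, precision_flags):
    """Build a trtexec argv list for one benchmark configuration."""
    return ["trtexec", f"--onnx={onnx_path}", *precision_flags, *COMMON_FLAGS]

# Hypothetical file name for an FP32 export of the upstream model.
BASELINE_ONNX = "chronos_2_ctx512_fp32.onnx"
INT8_ONNX = "embedl_chronos_2_ctx512_int8.onnx"

builds = {
    "TensorRT FP16": trtexec_command(BASELINE_ONNX, ["--fp16"]),
    "TensorRT --best": trtexec_command(BASELINE_ONNX, ["--best"]),
    # Assumption: a Q/DQ model is built with INT8 and FP16 both enabled.
    "embedl INT8": trtexec_command(INT8_ONNX, ["--int8", "--fp16"]),
}

for name, argv in builds.items():
    print(f"{name}: {shlex.join(argv)}")
```

On a Jetson, lock the clocks first as described above, and make sure trtexec is on your PATH before running the printed commands.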

Jetson AGX Orin (MAXN)

ctx=512

Chronos-2 INT8 latency, ctx=512

Build	Mean latency (ms)
TensorRT FP16	2.977
TensorRT `--best`	2.974
embedl INT8	2.432
Speedup (FP16 → embedl INT8)	1.22×

ctx=2048

Chronos-2 INT8 latency, ctx=2048

Build	Mean latency (ms)
TensorRT FP16	4.482
TensorRT `--best`	4.482
embedl INT8	3.482
Speedup (FP16 → embedl INT8)	1.29×
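
The speedup rows follow directly from the mean latencies. The short sketch below reproduces them from the FP16 and INT8 figures in the two tables above and also expresses the same change as a latency reduction; the variable and function names are illustrative.

```python
# Mean latencies in milliseconds, copied from the tables above.
latency_ms = {
    "ctx=512": {"fp16": 2.977, "int8": 2.432},
    "ctx=2048": {"fp16": 4.482, "int8": 3.482},
}

def speedup(baseline_ms, optimized_ms):
    """Speedup factor: how many times faster the optimized build is."""
    return baseline_ms / optimized_ms

def latency_reduction(baseline_ms, optimized_ms):
    """Fraction of the baseline latency that was removed."""
    return 1.0 - optimized_ms / baseline_ms

results = {}
for variant, ms in latency_ms.items():
    results[variant] = {
        "speedup": round(speedup(ms["fp16"], ms["int8"]), 2),
        "reduction_pct": round(100 * latency_reduction(ms["fp16"], ms["int8"]), 1),
    }
    print(variant, results[variant])
```

A 1.22× speedup corresponds to roughly an 18 % lower latency, and 1.29× to roughly 22 %; the two ways of quoting the gain are easy to confuse.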

Accuracy

Evaluated on the GIFT-Eval benchmark: 125 task configurations spanning 50 datasets × {short, medium, long} horizons. Aggregate WQL (weighted quantile loss, lower is better) is reported using the TIME-paper normalization: the geometric mean of per-task ratios against the Seasonal-Naive baseline.

Metric	FP32 baseline	embedl INT8 ctx=512	embedl INT8 ctx=2048
Geomean WQL / Seasonal-Naive	0.549	0.634	0.618
Geomean WQL / FP32	1.000	1.156×	1.126×
Median WQL / FP32	1.000	1.074×	1.045×
Cells within 10 % of FP32	—	71 / 125 (57 %)	79 / 125 (63 %)
Cells within 20 % of FP32	—	96 / 125 (77 %)	98 / 125 (78 %)
Cells beating FP32	—	14 / 125	19 / 125
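
The aggregate rows in this table can be recomputed from the per-task CSVs along the following lines. This is a sketch under stated assumptions: the WQL values below are made up for illustration and are not GIFT-Eval results, the function names are hypothetical, and "within 10 %" is interpreted as a ratio of at most 1.10 (so cells that beat the reference also count).

```python
import numpy as np

def geomean(values):
    """Geometric mean, computed in log space for numerical stability."""
    values = np.asarray(values, dtype=np.float64)
    return float(np.exp(np.mean(np.log(values))))

def summarize(model_wql, reference_wql, tolerance=0.10):
    """Aggregate per-task WQL ratios of a model against a reference."""
    ratios = np.asarray(model_wql, dtype=np.float64) / np.asarray(
        reference_wql, dtype=np.float64
    )
    return {
        "geomean_ratio": geomean(ratios),
        "median_ratio": float(np.median(ratios)),
        # Assumption: "within tolerance" means ratio <= 1 + tolerance.
        "within_tolerance": int(np.sum(ratios <= 1.0 + tolerance)),
        "beating_reference": int(np.sum(ratios < 1.0)),
        "num_tasks": int(ratios.size),
    }

# Illustrative per-task WQL values (NOT real benchmark numbers).
int8_wql = [0.50, 0.21, 1.30, 0.80]
fp32_wql = [0.50, 0.20, 1.00, 0.90]

summary = summarize(int8_wql, fp32_wql, tolerance=0.10)
print(summary)
```

Passing Seasonal-Naive WQL values as the reference gives the first table row; passing the FP32 values gives the remaining rows. The geometric mean is used because it treats a task that is 2× worse and a task that is 2× better as cancelling out.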

How to read the headline number. A geomean WQL/S-Naive of 0.634 (ctx=512) and 0.618 (ctx=2048) means the INT8 models retain the bulk of chronos-2's skill margin over the no-model Seasonal-Naive baseline. The FP32 model sits at 0.549 by the same convention, so the INT8 geomean WQL is about 13-16 % higher than FP32 (1.126× and 1.156× in the table above), yet both variants still convincingly beat S-Naive on the geomean.

Where the regression concentrates. Worst-case cells are out-of-distribution low-frequency series (us_births/M, m4_hourly/{medium,long}) and high-frequency long-horizon forecasts (solar/10T/{medium,long}). The full per-task CSVs ship with the artifacts; check them before deploying to a domain that resembles those outliers.

Creating Your Own Optimized Models

This artifact was produced with embedl-deploy, Embedl's open-source PyTorch → TensorRT deployment library. The same workflow applies to your own models — see the documentation for installation and usage.

License

Component	License
Optimized model artifacts (this repo)	Embedl Models Community Licence v1.0 — no redistribution as a hosted service
Upstream architecture and weights	Amazon Chronos-2 License

Contact

We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.

Community & support

Need help with this model? Chat with the Embedl team and other engineers on Discord.

Quantization gotchas, hardware questions, fine-tuning tips — bring them all.

Join our Discord →
