Embedl Chronos-2 (Quantized for TensorRT)

Deployable INT8-quantized version of amazon/chronos-2, optimized with embedl-deploy for low-latency NVIDIA TensorRT inference on edge GPUs. Two static-context variants ship: ctx=512 for short-history forecasting and ctx=2048 for long-history use cases.

Upstream Model

amazon/chronos-2

Highlights

  • Per-tensor INT8 activations + per-channel INT8 weights via embedl-deploy's PTQ flow on top of TensorRT's fused MHA kernel. No QAT or distillation needed.
  • Drop-in replacement for amazon/chronos-2 inference: same (context, group_ids) → quantile_preds signature; 21 evenly spaced quantile levels with the median at index 10.
  • Validated on the GIFT-Eval benchmark across 125 task configurations. See Accuracy below.
  • Two ctx variants so you can pick the latency/history-window trade-off that fits your deployment.

Quick Start

pip install tensorrt pycuda numpy
python infer_trt.py --ctx 512    # 1.2× faster than FP16 on Orin
python infer_trt.py --ctx 2048   # 1.3× faster than FP16 on Orin

The infer_trt.py helper script builds a TensorRT engine from the ONNX on first run (cached as *.engine next to the artifact) and feeds a synthetic seasonal context for demonstration. Replace the context generator with your own series of the right length.
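Because both artifacts use a static context length, your own series has to be trimmed or padded to exactly 512 or 2048 points before inference. The following is a minimal sketch, not code from infer_trt.py: `fit_context` is a hypothetical helper, the (1, ctx) float32 input layout is an assumption, and left-padding with NaN follows the missing-value convention of the upstream Chronos models, which the exported graph may or may not share. Check infer_trt.py for the layout the engine actually expects.

```python
import numpy as np

def fit_context(series, ctx, pad_value=np.nan):
    """Return a (1, ctx) float32 array holding the most recent `ctx` points.

    Hypothetical helper: longer series are truncated from the left,
    shorter ones are left-padded with `pad_value` (assumed to be NaN).
    """
    x = np.asarray(series, dtype=np.float32).ravel()
    if x.size >= ctx:
        x = x[-ctx:]                      # keep only the latest history
    else:
        pad = np.full(ctx - x.size, pad_value, dtype=np.float32)
        x = np.concatenate([pad, x])      # oldest positions are padding
    return x[None, :]                     # add the batch dimension

long_ctx = fit_context(np.arange(3000), ctx=2048)
short_ctx = fit_context(np.arange(100), ctx=512)
print(long_ctx.shape, short_ctx.shape)
```

Keeping the most recent points on the right means the forecast always continues from the last observed value, whichever variant you pick.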

Files

File | Purpose
embedl_chronos_2_ctx512_int8.onnx | INT8 ONNX with Q/DQ: ctx=512, 1024-step horizon.
embedl_chronos_2_ctx2048_int8.onnx | INT8 ONNX with Q/DQ: ctx=2048, 1024-step horizon.
infer_trt.py | ONNX Runtime / TensorRT inference example.

Both artifacts emit a (1, 21, 1024) quantile tensor: 21 quantile levels, each covering 64 output patches × 16 steps per patch = 1024 horizon steps. Slice the median (preds[0, 10]) for a point forecast and truncate it to the prediction length you need.
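The post-processing described above can be sketched in a few lines of NumPy. `point_forecast` is a hypothetical helper, and the random array only stands in for real engine output; the shape and the median index are taken from this card.

```python
import numpy as np

MEDIAN_INDEX = 10  # position of the median among the 21 quantile levels

def point_forecast(preds, prediction_length):
    """Slice the median quantile and keep the first `prediction_length` steps."""
    if preds.shape != (1, 21, 1024):
        raise ValueError(f"unexpected output shape {preds.shape}")
    if not 1 <= prediction_length <= preds.shape[-1]:
        raise ValueError("prediction_length must be between 1 and 1024")
    return preds[0, MEDIAN_INDEX, :prediction_length]

# Stand-in for the tensor returned by the TensorRT engine.
preds = np.random.default_rng(0).normal(size=(1, 21, 1024)).astype(np.float32)
forecast = point_forecast(preds, prediction_length=48)
print(forecast.shape)  # (48,)
```

The same slicing works for any other quantile row once you know which level each index corresponds to.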

Performance

Latency measured with TensorRT + trtexec, GPU compute time only (--noDataTransfers), CUDA Graph + Spin Wait enabled, clocks locked (nvpmodel -m 0 && jetson_clocks on Jetson).
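For readers who want to run a comparable measurement, the sketch below assembles a trtexec command line with the settings named above. The exact invocation used for this card is not stated, so the precision flags per build are an assumption (in particular, building the Q/DQ graph with `--int8 --fp16`), and the FP16 and `--best` baselines would need a non-quantized ONNX export that does not ship with this repo.

```python
import shlex

# Settings named in the measurement description above.
COMMON_FLAGS = ["--noDataTransfers", "--useCudaGraph", "--useSpinWait"]

# Assumed precision flags per build; not confirmed by the card.
PRECISION_FLAGS = {
    "fp16": ["--fp16"],
    "best": ["--best"],
    "int8": ["--int8", "--fp16"],  # explicit Q/DQ graph with FP16 fallback
}

def trtexec_command(onnx_path, precision):
    """Return the trtexec argument list for one benchmark run."""
    return ["trtexec", f"--onnx={onnx_path}",
            *PRECISION_FLAGS[precision], *COMMON_FLAGS]

cmd = trtexec_command("embedl_chronos_2_ctx512_int8.onnx", "int8")
print(shlex.join(cmd))
```

On Jetson, lock the clocks first as described above, then pass the list to `subprocess.run` or paste the printed line into a shell.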

Jetson AGX Orin (MAXN)

ctx=512

Chronos-2 INT8 latency, ctx=512

Build | Mean latency (ms)
TensorRT FP16 | 2.977
TensorRT --best | 2.974
embedl INT8 | 2.432
Speedup (FP16 → embedl INT8) | 1.22×

ctx=2048

Chronos-2 INT8 latency, ctx=2048

Build | Mean latency (ms)
TensorRT FP16 | 4.482
TensorRT --best | 4.482
embedl INT8 | 3.482
Speedup (FP16 → embedl INT8) | 1.29×
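The speedup rows are simply the FP16 latency divided by the INT8 latency. A short check using the numbers from the two tables above (the `speedup` helper is illustrative):

```python
def speedup(baseline_ms, optimized_ms):
    """Ratio of baseline latency to optimized latency (higher is better)."""
    return baseline_ms / optimized_ms

# Mean latencies in milliseconds, copied from the Jetson AGX Orin tables.
latencies_ms = {
    "ctx=512": {"fp16": 2.977, "int8": 2.432},
    "ctx=2048": {"fp16": 4.482, "int8": 3.482},
}

for name, row in latencies_ms.items():
    print(f"{name}: {speedup(row['fp16'], row['int8']):.2f}x")
# ctx=512: 1.22x
# ctx=2048: 1.29x
```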

Accuracy

Evaluated on the GIFT-Eval benchmark: 125 task configurations spanning 50 datasets × {short, medium, long} horizons. Aggregate WQL (weighted quantile loss, lower is better) is reported using the TIME-paper normalization: the geometric mean of the per-task ratio against the Seasonal-Naive baseline.
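As a rough illustration of that normalization, the sketch below divides each task's WQL by the Seasonal-Naive WQL on the same task and takes the geometric mean of the ratios. The function name and the three toy values are made up for the example; they are not benchmark results.

```python
import math

def geomean_ratio(model_wql, baseline_wql):
    """Geometric mean of per-task model/baseline WQL ratios."""
    if len(model_wql) != len(baseline_wql) or not model_wql:
        raise ValueError("need one baseline value per task")
    ratios = [m / b for m, b in zip(model_wql, baseline_wql)]
    # Averaging in log space is the usual way to compute a geometric mean.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Toy per-task WQL values (hypothetical, three tasks only).
model = [0.5, 0.2, 0.8]
seasonal_naive = [1.0, 0.4, 1.0]
score = geomean_ratio(model, seasonal_naive)
print(f"{score:.3f}")  # about 0.585; values below 1 beat the baseline
```

Using a geometric rather than arithmetic mean keeps one task with a very large ratio from dominating the aggregate.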

Metric | FP32 baseline | embedl INT8 ctx=512 | embedl INT8 ctx=2048
Geomean WQL / Seasonal-Naive | 0.549 | 0.634 | 0.618
Geomean WQL / FP32 | 1.000 | 1.156× | 1.126×
Median WQL / FP32 | 1.000 | 1.074× | 1.045×
Cells within 10 % of FP32 | - | 71 / 125 (57 %) | 79 / 125 (63 %)
Cells within 20 % of FP32 | - | 96 / 125 (77 %) | 98 / 125 (78 %)
Cells beating FP32 | - | 14 / 125 | 19 / 125
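The per-cell rows can be reproduced from per-task WQL values along the following lines. This is a sketch under assumptions: "within 10 %" is read here as an INT8/FP32 ratio of at most 1.10 (so cells that beat FP32 also count), which the card does not spell out, and the five toy values are hypothetical.

```python
import statistics

def summarize(int8_wql, fp32_wql):
    """Summarize per-task INT8/FP32 WQL ratios (a cell is one task config)."""
    ratios = [q / f for q, f in zip(int8_wql, fp32_wql)]
    return {
        "median_ratio": statistics.median(ratios),
        "within_10pct": sum(r <= 1.10 for r in ratios),  # assumed definition
        "within_20pct": sum(r <= 1.20 for r in ratios),  # assumed definition
        "beating_fp32": sum(r < 1.0 for r in ratios),
        "cells": len(ratios),
    }

# Toy per-task WQL values (hypothetical, five tasks only).
fp32 = [1.0, 1.0, 1.0, 1.0, 1.0]
int8 = [0.95, 1.05, 1.15, 1.30, 1.08]
summary = summarize(int8, fp32)
print(summary)
```

Running the same function over the per-task CSVs (with whatever column names they use) gives a quick view of how the regression is distributed across tasks.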

How to read the headline number. A geomean WQL / Seasonal-Naive of 0.634 (ctx=512) and 0.618 (ctx=2048) means the INT8 models retain the bulk of chronos-2's skill margin over the no-model Seasonal-Naive baseline. The FP32 model sits at 0.549 by the same convention; the INT8 variants' geomean WQL is roughly 13 % (ctx=2048) to 16 % (ctx=512) higher than FP32, but both still convincingly beat Seasonal-Naive on the geomean.

Where the regression concentrates. Worst-case cells are out-of-distribution low-frequency series (us_births/M, m4_hourly/{medium,long}) and high-frequency long-horizon forecasts (solar/10T/{medium,long}). The full per-task CSVs ship with the artifacts; check them before deploying to a domain that resembles those outliers.

Creating Your Own Optimized Models

This artifact was produced with embedl-deploy, Embedl's open-source PyTorch → TensorRT deployment library. The same workflow applies to your own models; see the documentation for installation and usage.

License

Component | License
Optimized model artifacts (this repo) | Embedl Models Community Licence v1.0 (no redistribution as a hosted service)
Upstream architecture and weights | Amazon Chronos-2 License

Contact

We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.

Community & support
Need help with this model? Chat with the Embedl team and other engineers on Discord.
Quantization gotchas, hardware questions, fine-tuning tips: bring them all.