---
license: other
license_name: embedl-models-community-licence-v1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model: amazon/chronos-2
quantized_from: amazon/chronos-2
tags:
- time-series
- time-series-forecasting
- chronos
- chronos-2
- int8
- tensorrt
- quantization
- edge
- jetson
- orin
library_name: onnx
pipeline_tag: time-series-forecasting
gated: true
extra_gated_heading: Acknowledge Embedl Models Community Licence v1.0
extra_gated_description: |
  By requesting access you agree to the Embedl Models Community Licence v1.0
  (no redistribution as a hosted service) and to the upstream chronos-2
  license terms.
extra_gated_button_content: Request access
---
# Embedl Chronos-2 (Quantized for TensorRT)

Deployable INT8-quantized version of [`amazon/chronos-2`](https://huggingface.co/amazon/chronos-2), optimized with [embedl-deploy](https://github.com/embedl/embedl-deploy) for low-latency NVIDIA TensorRT inference on edge GPUs. Two static-context variants ship: **ctx=512** for short-history forecasting and **ctx=2048** for long-history use cases.

## Upstream Model

[`amazon/chronos-2`](https://huggingface.co/amazon/chronos-2)

## Highlights

- **Per-tensor INT8** activations + **per-channel INT8** weights via embedl-deploy's PTQ flow on top of TensorRT's fused MHA kernel. No QAT or distillation needed.
- **Drop-in replacement** for `amazon/chronos-2` inference: same `(context, group_ids) → quantile_preds` signature; 21 evenly spaced quantile levels with the median at index 10.
- **Validated** on the [GIFT-Eval](https://huggingface.co/datasets/Salesforce/GiftEval) benchmark across 125 task configurations. See Accuracy below.
- **Two ctx variants** so you can pick the latency/history-window trade-off that fits your deployment.

## Quick Start

```bash
pip install tensorrt pycuda numpy
python infer_trt.py --ctx 512   # 1.2× faster than FP16 on Orin
python infer_trt.py --ctx 2048  # 1.3× faster than FP16 on Orin
```

The `infer_trt.py` helper script builds a TensorRT engine from the ONNX file on first run (cached as `*.engine` next to the artifact) and feeds it a synthetic seasonal context for demonstration. Replace the context generator with your own series of the right length.

## Files

| File | Purpose |
|---|---|
| `embedl_chronos_2_ctx512_int8.onnx` | INT8 ONNX with Q/DQ, ctx=512, 1024-step horizon. |
| `embedl_chronos_2_ctx2048_int8.onnx` | INT8 ONNX with Q/DQ, ctx=2048, 1024-step horizon. |
| `infer_trt.py` | ONNX Runtime / TensorRT inference example. |

Both artifacts emit a `(1, 21, 1024)` quantile tensor: 21 quantile levels, each covering 64 output patches × 16 steps per patch = 1024 horizon steps.
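As a minimal sketch of consuming that output tensor, the snippet below takes an array with the `(1, 21, 1024)` layout, picks the median quantile, and keeps only the first `prediction_length` steps. The helper name `point_forecast` and the random stand-in array are illustrative assumptions; they are not part of `infer_trt.py`, and real predictions would come from the TensorRT engine.

```python
import numpy as np

N_QUANTILES = 21   # quantile levels in the output tensor
HORIZON = 1024     # 64 output patches x 16 steps per patch
MEDIAN_INDEX = 10  # median position among the 21 quantile levels


def point_forecast(preds: np.ndarray, prediction_length: int) -> np.ndarray:
    """Return the median forecast, truncated to `prediction_length` steps.

    Hypothetical helper: assumes `preds` has shape (1, 21, 1024).
    """
    if preds.shape != (1, N_QUANTILES, HORIZON):
        raise ValueError(f"unexpected output shape {preds.shape}")
    if not 0 < prediction_length <= HORIZON:
        raise ValueError(f"prediction_length must be in 1..{HORIZON}")
    return preds[0, MEDIAN_INDEX, :prediction_length]


# Stand-in for real engine output: random values sorted along the
# quantile axis so that lower quantile indices hold lower values.
rng = np.random.default_rng(0)
dummy_preds = np.sort(
    rng.normal(size=(1, N_QUANTILES, HORIZON)), axis=1
).astype(np.float32)

forecast = point_forecast(dummy_preds, prediction_length=48)
print(forecast.shape)  # (48,)
```

The other 20 rows along axis 1 can be sliced the same way when prediction intervals are needed instead of a single point forecast.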
Slice the median (`preds[0, 10]`) for a point forecast and truncate it to the prediction length you need.

## Performance

Latency measured with TensorRT + `trtexec`, GPU compute time only (`--noDataTransfers`), CUDA Graph + Spin Wait enabled, clocks locked (`nvpmodel -m 0 && jetson_clocks` on Jetson).

### Jetson AGX Orin (MAXN)

#### ctx=512

Chronos-2 INT8 latency, ctx=512

| Build | Mean latency (ms) |
|---|---|
| TensorRT FP16 | **2.977** |
| TensorRT `--best` | 2.974 |
| **embedl INT8** | **2.432** |
| Speedup (FP16 → embedl INT8) | **1.22×** |

#### ctx=2048

Chronos-2 INT8 latency, ctx=2048

| Build | Mean latency (ms) |
|---|---|
| TensorRT FP16 | **4.482** |
| TensorRT `--best` | 4.482 |
| **embedl INT8** | **3.482** |
| Speedup (FP16 → embedl INT8) | **1.29×** |

## Accuracy

Evaluated on the [GIFT-Eval](https://huggingface.co/datasets/Salesforce/GiftEval) benchmark: 125 task configurations spanning 50 datasets × {short, medium, long} horizons. Aggregate WQL (weighted quantile loss, lower is better) is reported using the [TIME-paper normalization](https://arxiv.org/html/2602.12147v2): the geometric mean of the per-task ratio against the Seasonal-Naive baseline.

| Metric | FP32 baseline | **embedl INT8 ctx=512** | **embedl INT8 ctx=2048** |
|---|---|---|---|
| Geomean WQL / Seasonal-Naive | 0.549 | **0.634** | **0.618** |
| Geomean WQL / FP32 | 1.000 | **1.156×** | **1.126×** |
| Median WQL / FP32 | 1.000 | 1.074× | 1.045× |
| Cells within 10 % of FP32 | n/a | 71 / 125 (57 %) | 79 / 125 (63 %) |
| Cells within 20 % of FP32 | n/a | 96 / 125 (77 %) | 98 / 125 (78 %) |
| Cells beating FP32 | n/a | 14 / 125 | 19 / 125 |

**How to read the headline number.** Geomean WQL/S-Naive values of 0.634 (ctx=512) and 0.618 (ctx=2048) mean the INT8 model retains the bulk of `chronos-2`'s skill margin over the no-model Seasonal-Naive baseline. The FP32 model sits at 0.549 by the same convention; the INT8 versions' WQL is roughly 13-16 % higher than FP32 (the 1.126× and 1.156× ratios above), so they sit closer to S-Naive but still convincingly beat it on the geomean.

**Where the regression concentrates.** Worst-case cells are out-of-distribution low-frequency series (`us_births/M`, `m4_hourly/{medium,long}`) and high-frequency long-horizon forecasts (`solar/10T/{medium,long}`). The full per-task CSVs ship with the artifacts; check them before deploying to a domain that resembles those outliers.

## Creating Your Own Optimized Models

This artifact was produced with [embedl-deploy](https://github.com/embedl/embedl-deploy), Embedl's open-source PyTorch → TensorRT deployment library.
The same workflow applies to your own models; see [the documentation](https://github.com/embedl/embedl-deploy#readme) for installation and usage.

## License

| Component | License |
|---|---|
| Optimized model artifacts (this repo) | [Embedl Models Community Licence v1.0](https://github.com/embedl/embedl-models/blob/main/LICENSE) (no redistribution as a hosted service) |
| Upstream architecture and weights | [Amazon Chronos-2 License](https://huggingface.co/amazon/chronos-2/blob/main/LICENSE) |

## Contact

We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at [contact@embedl.com](mailto:contact@embedl.com), or open an issue on [GitHub](https://github.com/embedl/embedl-deploy).

## Community & support

Need help with this model? Chat with the Embedl team and other engineers on Discord. Quantization gotchas, hardware questions, fine-tuning tips: bring them all.