Embedl Chronos-2 (Quantized for TensorRT)
Deployable INT8-quantized version of
amazon/chronos-2,
optimized with
embedl-deploy for
low-latency NVIDIA TensorRT inference on edge GPUs. Two
static-context variants ship: ctx=512 for short-history
forecasting and ctx=2048 for long-history use cases.
Upstream Model
Highlights
- Per-tensor INT8 activations + per-channel INT8 weights via embedl-deploy's PTQ flow on top of TensorRT's fused MHA kernel. No QAT or distillation needed.
- Drop-in replacement for amazon/chronos-2 inference: same (context, group_ids) → quantile_preds signature; 21 evenly spaced quantile levels with the median at index 10.
- Validated on the GIFT-Eval benchmark across 125 task configurations. See Accuracy below.
- Two ctx variants so you can pick the latency/history-window trade-off that fits your deployment.
Quick Start
pip install tensorrt pycuda numpy
python infer_trt.py --ctx 512 # 1.2× faster than FP16 on Orin
python infer_trt.py --ctx 2048 # 1.3× faster than FP16 on Orin
The infer_trt.py helper script builds a TensorRT engine from the
ONNX on first run (cached as *.engine next to the artifact) and
feeds a synthetic seasonal context for demonstration. Replace the
context generator with your own series of the right length.
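As a rough illustration of what "a synthetic seasonal context" and "your own series of the right length" can look like, here is a minimal NumPy sketch. The function names, the default period of 24, and the choice to reject too-short series are assumptions made for illustration; they are not taken from infer_trt.py, and this card does not say how shorter histories should be padded.

```python
import numpy as np


def synthetic_seasonal_context(ctx: int, period: int = 24, seed: int = 0) -> np.ndarray:
    """Return a float32 series of length `ctx`: sinusoidal season + mild trend + noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(ctx, dtype=np.float64)
    season = np.sin(2.0 * np.pi * t / period)
    trend = 0.001 * t
    noise = 0.1 * rng.standard_normal(ctx)
    return (season + trend + noise).astype(np.float32)


def fit_to_context(series, ctx: int) -> np.ndarray:
    """Keep the most recent `ctx` points of a 1-D series.

    Assumption: series shorter than `ctx` are rejected rather than padded,
    because the padding convention of the static-context export is not
    documented here.
    """
    series = np.asarray(series, dtype=np.float32)
    if series.ndim != 1:
        raise ValueError("expected a 1-D series")
    if series.shape[0] < ctx:
        raise ValueError(f"need at least {ctx} points, got {series.shape[0]}")
    return series[-ctx:]


demo = synthetic_seasonal_context(512)
own = fit_to_context(np.arange(600), 512)  # keeps points 88..599
```

The trimmed array still has to be arranged into whatever input layout the engine expects; check infer_trt.py for that.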
Files
| File | Purpose |
|---|---|
| embedl_chronos_2_ctx512_int8.onnx | INT8 ONNX with Q/DQ: ctx=512, 1024-step horizon. |
| embedl_chronos_2_ctx2048_int8.onnx | INT8 ONNX with Q/DQ: ctx=2048, 1024-step horizon. |
| infer_trt.py | ONNX Runtime / TensorRT inference example. |
Both artifacts emit a (1, 21, 1024) quantile tensor: 21 quantile
levels, each covering 64 output patches × 16 steps per patch = 1024
horizon steps. Slice the median (preds[0, 10]) for a point forecast
and truncate it to the prediction length you need.
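The slicing described above can be sketched as follows. The helper names and the random stand-in tensor are illustrative assumptions; only the (1, 21, 1024) shape and the median index 10 come from this card. The card does not list the quantile level behind each index, so the band helper only assumes that indices placed symmetrically around 10 form a lower/upper pair.

```python
import numpy as np

MEDIAN_INDEX = 10  # per this card: 21 quantile levels, median at index 10


def point_forecast(preds: np.ndarray, prediction_length: int) -> np.ndarray:
    """Median forecast truncated to the first `prediction_length` steps."""
    if preds.ndim != 3 or preds.shape[0] != 1 or preds.shape[1] != 21:
        raise ValueError(f"expected shape (1, 21, horizon), got {preds.shape}")
    if not 1 <= prediction_length <= preds.shape[2]:
        raise ValueError("prediction_length must be between 1 and the horizon")
    return preds[0, MEDIAN_INDEX, :prediction_length]


def quantile_band(preds: np.ndarray, offset: int, prediction_length: int):
    """Lower/upper quantile tracks `offset` indices either side of the median."""
    if not 1 <= offset <= MEDIAN_INDEX:
        raise ValueError("offset must be between 1 and 10")
    lower = preds[0, MEDIAN_INDEX - offset, :prediction_length]
    upper = preds[0, MEDIAN_INDEX + offset, :prediction_length]
    return lower, upper


# Stand-in for real engine output: random values sorted along the quantile axis.
rng = np.random.default_rng(0)
preds = np.sort(rng.normal(size=(1, 21, 1024)).astype(np.float32), axis=1)

median = point_forecast(preds, prediction_length=96)
lower, upper = quantile_band(preds, offset=8, prediction_length=96)
```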
Performance
Latency measured with TensorRT + trtexec, GPU compute time only
(--noDataTransfers), CUDA Graph + Spin Wait enabled, clocks locked
(nvpmodel -m 0 && jetson_clocks on Jetson).
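For readers who want to approximate this measurement setup, the sketch below assembles a trtexec command line in Python. The timing flags mirror the settings named above. The precision flags per build are an assumption, since this card does not state the exact command used for the published numbers, and flag availability can vary between TensorRT versions. Clock locking (nvpmodel, jetson_clocks) is a separate step that this helper does not perform.

```python
import shlex

# Timing-related trtexec flags matching the setup described above.
TIMING_FLAGS = ["--useCudaGraph", "--useSpinWait", "--noDataTransfers"]

# Assumed precision flags for each build in the tables; not confirmed by this card.
PRECISION_FLAGS = {
    "fp16": ["--fp16"],
    "best": ["--best"],
    "int8_qdq": ["--int8", "--fp16"],  # assumption for an ONNX with explicit Q/DQ nodes
}


def trtexec_command(onnx_path: str, build: str) -> list:
    """Return the argument list for one trtexec benchmarking run."""
    if build not in PRECISION_FLAGS:
        raise ValueError(f"unknown build {build!r}; choose from {sorted(PRECISION_FLAGS)}")
    return ["trtexec", f"--onnx={onnx_path}", *PRECISION_FLAGS[build], *TIMING_FLAGS]


cmd = trtexec_command("embedl_chronos_2_ctx512_int8.onnx", "int8_qdq")
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run on a machine where trtexec is on the PATH.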
Jetson AGX Orin (MAXN)
ctx=512
| Build | Mean latency (ms) |
|---|---|
| TensorRT FP16 | 2.977 |
| TensorRT --best | 2.974 |
| embedl INT8 | 2.432 |
| Speedup (FP16 → embedl INT8) | 1.22× |
ctx=2048
| Build | Mean latency (ms) |
|---|---|
| TensorRT FP16 | 4.482 |
| TensorRT --best | 4.482 |
| embedl INT8 | 3.482 |
| Speedup (FP16 → embedl INT8) | 1.29× |
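The speedup rows are the FP16 mean latency divided by the INT8 mean latency. A small worked check using only the numbers in the two tables above (the dictionary layout and function name are illustrative):

```python
# Mean latencies in milliseconds, copied from the Jetson AGX Orin tables.
latency_ms = {
    512: {"fp16": 2.977, "best": 2.974, "int8": 2.432},
    2048: {"fp16": 4.482, "best": 4.482, "int8": 3.482},
}


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of baseline latency to optimized latency (above 1 means faster)."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / optimized_ms


speedups = {ctx: speedup(row["fp16"], row["int8"]) for ctx, row in latency_ms.items()}
for ctx, value in speedups.items():
    print(f"ctx={ctx}: {value:.2f}x vs FP16")
# ctx=512: 1.22x vs FP16
# ctx=2048: 1.29x vs FP16
```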
Accuracy
Evaluated on the GIFT-Eval benchmark: 125 task configurations spanning 50 datasets × {short, medium, long} horizons. Aggregate WQL (weighted quantile loss, lower is better) reported using the TIME-paper normalization: geomean of per-task ratio against the Seasonal-Naive baseline.
| Metric | FP32 baseline | embedl INT8 ctx=512 | embedl INT8 ctx=2048 |
|---|---|---|---|
| Geomean WQL / Seasonal-Naive | 0.549 | 0.634 | 0.618 |
| Geomean WQL / FP32 | 1.000 | 1.156× | 1.126× |
| Median WQL / FP32 | 1.000 | 1.074× | 1.045× |
| Cells within 10 % of FP32 | n/a | 71 / 125 (57 %) | 79 / 125 (63 %) |
| Cells within 20 % of FP32 | n/a | 96 / 125 (77 %) | 98 / 125 (78 %) |
| Cells beating FP32 | n/a | 14 / 125 | 19 / 125 |
How to read the headline number. A geomean WQL / S-Naive of 0.634
(ctx=512) and 0.618 (ctx=2048) means the INT8 models retain the
bulk of chronos-2's skill margin over the no-model Seasonal-Naive
baseline. The FP32 model sits at 0.549 by the same convention; the
INT8 versions score roughly 13-16 % higher (worse) than FP32 on the
geomean, yet still convincingly beat S-Naive.
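The aggregation behind these numbers can be sketched in a few lines. The per-task values below are invented toy numbers, not GIFT-Eval results, and reading "within 10 % of FP32" as an INT8/FP32 ratio of at most 1.10 is an assumption about how the table rows were counted.

```python
import math


def geomean(values):
    """Geometric mean of strictly positive values."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geomean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized_score(model_wql, baseline_wql):
    """Geomean of per-task WQL ratios against a baseline (lower is better)."""
    if len(model_wql) != len(baseline_wql):
        raise ValueError("need one baseline value per task")
    return geomean(m / b for m, b in zip(model_wql, baseline_wql))


def count_within(int8_wql, fp32_wql, tolerance):
    """Number of tasks whose INT8/FP32 ratio is at most 1 + tolerance (assumed rule)."""
    return sum(1 for q, f in zip(int8_wql, fp32_wql) if q / f <= 1.0 + tolerance)


# Toy per-task WQL values, for illustration only.
snaive = [0.20, 0.40, 0.50]
fp32 = [0.10, 0.20, 0.40]
int8 = [0.105, 0.26, 0.38]

fp32_score = normalized_score(fp32, snaive)
int8_score = normalized_score(int8, snaive)
int8_vs_fp32 = normalized_score(int8, fp32)
within_10 = count_within(int8, fp32, 0.10)

# Cross-check with the table's own aggregates: 0.634 / 0.549 is about 1.155,
# consistent with the reported 1.156x once rounding of the inputs is allowed for.
table_ratio = 0.634 / 0.549
```

Because the geomean is multiplicative, dividing the two S-Naive-normalized scores gives the same value as the geomean of the per-task INT8/FP32 ratios, which is why the first two table rows agree with each other.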
Where the regression concentrates. Worst-case cells are
out-of-distribution low-frequency series (us_births/M,
m4_hourly/{medium,long}) and high-frequency long-horizon
forecasts (solar/10T/{medium,long}). The full per-task CSVs
ship with the artifacts; check them before deploying to a domain
that resembles those outliers.
Creating Your Own Optimized Models
This artifact was produced with embedl-deploy, Embedl's open-source PyTorch → TensorRT deployment library. The same workflow applies to your own models; see the documentation for installation and usage.
License
| Component | License |
|---|---|
| Optimized model artifacts (this repo) | Embedl Models Community Licence v1.0 (no redistribution as a hosted service) |
| Upstream architecture and weights | Amazon Chronos-2 License |
Contact
We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.
Model tree for embedl/chronos-2-quantized-trt
Base model
amazon/chronos-2