---
license: other
license_name: embedl-models-community-licence-v1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model: amazon/chronos-2
quantized_from: amazon/chronos-2
tags:
- time-series
- time-series-forecasting
- chronos
- chronos-2
- int8
- tensorrt
- quantization
- edge
- jetson
- orin
library_name: onnx
pipeline_tag: time-series-forecasting
gated: true
extra_gated_heading: Acknowledge Embedl Models Community Licence v1.0
extra_gated_description: |
  By requesting access you agree to the Embedl Models Community
  Licence v1.0 (no redistribution as a hosted service) and to the
  upstream chronos-2 license terms.
extra_gated_button_content: Request access
---
<!-- embedl-banner:start -->
<style>
.embedl-btn-primary { transition: background 160ms ease, box-shadow 160ms ease; }
.embedl-btn-primary:hover { background: #4FDCE4 !important; box-shadow: 0 8px 22px rgba(45,212,221,0.45) !important; }
.embedl-btn-secondary { transition: background 160ms ease; }
.embedl-btn-secondary:hover { background: rgba(45,212,221,0.15) !important; }
.embedl-headline { font-size: clamp(11px, 2.15vw, 15px) !important; }
.embedl-btn-primary, .embedl-btn-secondary {
font-size: clamp(11px, 1.65vw, 13px) !important;
padding: clamp(6px, 1.1vw, 9px) clamp(10px, 1.6vw, 14px) !important;
}
</style>
<div style="background:radial-gradient(600px 220px at 0% 50%,rgba(45,212,221,0.22) 0%,rgba(45,212,221,0) 60%),radial-gradient(400px 180px at 100% 100%,rgba(45,212,221,0.10) 0%,rgba(45,212,221,0) 55%),linear-gradient(135deg,#0B1626 0%,#142338 100%);border:1px solid rgba(45,212,221,0.28);border-radius:12px;padding:22px 24px;margin:0 0 24px 0;color:#F2F6FA;box-shadow:0 4px 16px rgba(11,22,38,0.18);overflow:hidden;box-sizing:border-box;max-width:100%;">
<table style="width:100%;border-collapse:collapse;border:0;background:transparent;">
<tr style="background:transparent;">
<td style="vertical-align:middle;border:0;padding:0;background:transparent;">
<div style="display:inline-block;font-size:10px;letter-spacing:0.08em;text-transform:uppercase;font-weight:700;color:#2DD4DD;background:rgba(45,212,221,0.15);border:1px solid rgba(45,212,221,0.35);padding:4px 10px;border-radius:999px;margin-bottom:10px;white-space:nowrap;">Optimized by Embedl</div>
<div class="embedl-headline" style="font-size:15px;font-weight:700;line-height:1.35;color:#F2F6FA;margin-bottom:4px;">Need to <span style="color:#2DD4DD;white-space:nowrap;">fine-tune</span>, hit <span style="color:#2DD4DD;white-space:nowrap;">performance targets</span>, or deploy on <span style="color:#2DD4DD;white-space:nowrap;">specific hardware</span>?</div>
<div style="font-size:13px;color:#9BA7B5;">We've got you covered.</div>
</td>
<td width="1%" style="vertical-align:middle;border:0;padding:0 0 0 18px;white-space:nowrap;text-align:right;background:transparent;">
<a href="https://www.embedl.com/models" class="embedl-btn-secondary" style="display:inline-block;font-size:13px;font-weight:600;padding:9px 14px;border-radius:6px;border:1px solid #2DD4DD;color:#2DD4DD;text-decoration:none;margin-right:8px;">Learn more</a>
<a href="https://www.embedl.com/contact" class="embedl-btn-primary" style="display:inline-block;font-size:13px;font-weight:600;padding:9px 14px;border-radius:6px;border:1px solid #2DD4DD;background:#2DD4DD;color:#0B1626;text-decoration:none;box-shadow:0 6px 18px rgba(45,212,221,0.28);">Get in touch →</a>
</td>
</tr>
</table>
</div>
<!-- embedl-banner:end -->
# Embedl Chronos-2 (Quantized for TensorRT)
Deployable INT8-quantized version of
[`amazon/chronos-2`](https://huggingface.co/amazon/chronos-2),
optimized with
[embedl-deploy](https://github.com/embedl/embedl-deploy) for
low-latency NVIDIA TensorRT inference on edge GPUs. Two
static-context variants ship: **ctx=512** for short-history
forecasting and **ctx=2048** for long-history use cases.
## Upstream Model
<a href="https://hfviewer.com/amazon/chronos-2?utm_source=huggingface&amp;utm_medium=embedded_model_card&amp;utm_campaign=amazon__chronos-2_card" target="_blank" rel="noopener">
<img
src="https://hfviewer.com/api/card.svg?source=amazon%2Fchronos-2&amp;v=20260501clipcard"
alt="Open amazon/chronos-2 in hfviewer"
width="100%"
/>
</a>
## Highlights
- **Per-tensor INT8** activations + **per-channel INT8** weights via
embedl-deploy's PTQ flow on top of TensorRT's fused MHA kernel.
No QAT or distillation needed.
- **Drop-in replacement** for `amazon/chronos-2` inference: same
`(context, group_ids) → quantile_preds` signature; 21 evenly
spaced quantile levels with the median at index 10.
- **Validated** on the [GIFT-Eval](https://huggingface.co/datasets/Salesforce/GiftEval)
benchmark across 125 task configurations. See Accuracy below.
- **Two ctx variants** so you can pick the latency/history-window
trade-off that fits your deployment.
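
To make the first highlight concrete, here is a minimal NumPy sketch of the difference between per-tensor and per-channel symmetric INT8 quantization. It only illustrates the two granularities; it is not embedl-deploy's PTQ flow, and the `quantize_symmetric` helper and the toy weight matrix are assumptions made for this example.

```python
import numpy as np

def quantize_symmetric(x, axis=None):
    """Symmetric INT8 quantization; returns (int8 values, float scale).

    axis=None -> a single scale for the whole tensor (per-tensor).
    axis=k    -> one scale per slice along axis k (per-channel).
    """
    if axis is None:
        amax = np.max(np.abs(x))
    else:
        reduce_axes = tuple(i for i in range(x.ndim) if i != axis)
        amax = np.max(np.abs(x), axis=reduce_axes, keepdims=True)
    scale = np.maximum(amax, 1e-12) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
# Hypothetical weight matrix: 4 output channels with very different ranges.
weights = rng.normal(size=(4, 8)) * np.array([[0.01], [0.1], [1.0], [10.0]])

q_pt, s_pt = quantize_symmetric(weights)          # per-tensor
q_pc, s_pc = quantize_symmetric(weights, axis=0)  # per-channel (output axis)

# Largest reconstruction error in each output channel.
err_pt = np.abs(q_pt * s_pt - weights).max(axis=1)
err_pc = np.abs(q_pc * s_pc - weights).max(axis=1)
print("per-tensor  max error per channel:", err_pt)
print("per-channel max error per channel:", err_pc)
```

With a single scale, the small-range channels are rounded away by the scale that the largest channel dictates. Per-channel scales avoid that, which is why weights are commonly quantized per channel while activations use one scale per tensor.
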
## Quick Start
```bash
pip install tensorrt pycuda numpy
python infer_trt.py --ctx 512 # 1.2× faster than FP16 on Orin
python infer_trt.py --ctx 2048 # 1.3× faster than FP16 on Orin
```
The `infer_trt.py` helper script builds a TensorRT engine from the
ONNX file on first run (cached as `*.engine` next to the artifact) and
feeds it a synthetic seasonal context for demonstration. To forecast
real data, replace the context generator with your own series whose
length matches the chosen `--ctx` (512 or 2048).
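
As a rough sketch of what swapping in your own series could look like, the snippet below builds a synthetic seasonal series and trims it to a fixed context window. The helper names (`synthetic_seasonal`, `to_context`) and the `(1, ctx)` float32 layout are assumptions for illustration; check `infer_trt.py` for the input names, shapes, and dtypes the engine actually expects. Because the padding behaviour of the static-context graphs is not documented here, the sketch rejects series that are too short instead of padding them.

```python
import numpy as np

def synthetic_seasonal(length, period=24, noise=0.1, seed=0):
    """Sine wave + slow linear trend + Gaussian noise, as a stand-in series."""
    rng = np.random.default_rng(seed)
    t = np.arange(length)
    values = np.sin(2 * np.pi * t / period) + 0.001 * t
    values = values + noise * rng.normal(size=length)
    return values.astype(np.float32)

def to_context(series, ctx):
    """Keep the most recent `ctx` samples and add a batch axis -> (1, ctx)."""
    series = np.asarray(series, dtype=np.float32)
    if series.ndim != 1:
        raise ValueError("expected a 1-D series")
    if len(series) < ctx:
        raise ValueError(f"need at least {ctx} samples, got {len(series)}")
    return series[-ctx:][None, :]

history = synthetic_seasonal(3000)      # replace with your own measurements
context = to_context(history, ctx=512)  # use ctx=2048 for the long variant
print(context.shape, context.dtype)     # (1, 512) float32
```
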
## Files
| File | Purpose |
|---|---|
| `embedl_chronos_2_ctx512_int8.onnx` | INT8 ONNX with Q/DQ, ctx=512, 1024-step horizon. |
| `embedl_chronos_2_ctx2048_int8.onnx` | INT8 ONNX with Q/DQ, ctx=2048, 1024-step horizon. |
| `infer_trt.py` | ONNX Runtime / TensorRT inference example. |
Both artifacts emit a `(1, 21, 1024)` quantile tensor: 21 quantile
levels, each covering 64 output patches × 16 steps per patch = 1024
horizon steps. Slice the median (`preds[0, 10]`) for a point forecast
and truncate it to the prediction length you need.
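
The slicing described above can be sketched as follows. The `point_and_band` helper is hypothetical, and the random `preds` array merely stands in for a real engine output. Indices 0 and 20 are simply the lowest and highest of the 21 quantile levels; this card does not list their exact probabilities.

```python
import numpy as np

MEDIAN_INDEX = 10  # per this model card: the median sits at index 10 of 21
HORIZON = 1024     # 64 output patches x 16 steps per patch

def point_and_band(preds, prediction_length, lo_idx=0, hi_idx=20):
    """Return (median, lower, upper) truncated to `prediction_length` steps."""
    if preds.shape != (1, 21, HORIZON):
        raise ValueError(f"unexpected output shape {preds.shape}")
    if not 1 <= prediction_length <= HORIZON:
        raise ValueError("prediction_length must be in [1, 1024]")
    quantiles = preds[0, :, :prediction_length]  # (21, prediction_length)
    return quantiles[MEDIAN_INDEX], quantiles[lo_idx], quantiles[hi_idx]

# Stand-in for a model output: random values sorted along the quantile
# axis so that lower quantile indices hold lower values.
rng = np.random.default_rng(0)
preds = np.sort(rng.normal(size=(1, 21, HORIZON)), axis=1).astype(np.float32)

median, lower, upper = point_and_band(preds, prediction_length=48)
print(median.shape)  # (48,)
```
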
## Performance
Latency measured with TensorRT + `trtexec`, GPU compute time only
(`--noDataTransfers`), CUDA Graph + Spin Wait enabled, clocks locked
(`nvpmodel -m 0 && jetson_clocks` on Jetson).
### Jetson AGX Orin (MAXN)
#### ctx=512
<p align="center">
<img src="https://huggingface.co/datasets/embedl/documentation-images/resolve/main/chronos-2-quantized-trt/chronos-2-quantized-trt__orin-mountain-view__latency_ctx512.svg" alt="Chronos-2 INT8 latency, ctx=512" width="640">
</p>
| Build | Mean latency (ms) |
|---|---|
| TensorRT FP16 | **2.977** |
| TensorRT `--best` | 2.974 |
| **embedl INT8** | **2.432** |
| Speedup (FP16 → embedl INT8) | **1.22×** |
#### ctx=2048
<p align="center">
<img src="https://huggingface.co/datasets/embedl/documentation-images/resolve/main/chronos-2-quantized-trt/chronos-2-quantized-trt__orin-mountain-view__latency_ctx2048.svg" alt="Chronos-2 INT8 latency, ctx=2048" width="640">
</p>
| Build | Mean latency (ms) |
|---|---|
| TensorRT FP16 | **4.482** |
| TensorRT `--best` | 4.482 |
| **embedl INT8** | **3.482** |
| Speedup (FP16 → embedl INT8) | **1.29×** |
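
The speedup rows follow directly from the mean latencies in the two tables. A small worked check, with the values copied from the tables above (the dictionary layout is just for illustration):

```python
# Mean GPU-compute latencies in milliseconds, copied from the tables above.
latency_ms = {
    "ctx=512": {"fp16": 2.977, "int8": 2.432},
    "ctx=2048": {"fp16": 4.482, "int8": 3.482},
}

speedups = {}
for name, t in latency_ms.items():
    speedup = t["fp16"] / t["int8"]          # how many times faster INT8 is
    time_saved = 1.0 - t["int8"] / t["fp16"]  # fraction of FP16 time removed
    speedups[name] = speedup
    print(f"{name}: {speedup:.2f}x speedup, {time_saved:.1%} less GPU time")
```

These figures cover GPU compute only, as measured with `--noDataTransfers`; end-to-end latency in an application also includes host-to-device copies and pre/post-processing.
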
## Accuracy
Evaluated on the
[GIFT-Eval](https://huggingface.co/datasets/Salesforce/GiftEval)
benchmark: 125 task configurations spanning 50 datasets ×
{short, medium, long} horizons. Aggregate WQL (weighted quantile
loss, lower is better) reported using the
[TIME-paper normalization](https://arxiv.org/html/2602.12147v2):
geomean of per-task ratio against the Seasonal-Naive baseline.
| Metric | FP32 baseline | **embedl INT8 ctx=512** | **embedl INT8 ctx=2048** |
|---|---|---|---|
| Geomean WQL / Seasonal-Naive | 0.549 | **0.634** | **0.618** |
| Geomean WQL / FP32 | 1.000 | **1.156×** | **1.126×** |
| Median WQL / FP32 | 1.000 | 1.074× | 1.045× |
| Cells within 10 % of FP32 | n/a | 71 / 125 (57 %) | 79 / 125 (63 %) |
| Cells within 20 % of FP32 | n/a | 96 / 125 (77 %) | 98 / 125 (78 %) |
| Cells beating FP32 | n/a | 14 / 125 | 19 / 125 |
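
For readers unfamiliar with the aggregate, the sketch below shows one common definition of weighted quantile loss (pinball loss summed over time, scaled by the sum of absolute actuals, averaged over quantile levels) and the geometric mean of per-task ratios. It assumes this GluonTS-style WQL definition rather than reproducing the evaluation code behind the table, and the per-task numbers are made up for illustration.

```python
import numpy as np

def weighted_quantile_loss(y_true, y_pred, levels):
    """Mean weighted quantile loss.

    y_true: (T,) actuals; y_pred: (Q, T) quantile forecasts; levels: (Q,).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    levels = np.asarray(levels, dtype=float)[:, None]
    diff = y_true[None, :] - y_pred
    # Pinball loss: under-prediction weighted by q, over-prediction by 1 - q.
    pinball = np.maximum(levels * diff, (levels - 1.0) * diff)
    per_level = 2.0 * pinball.sum(axis=1) / np.abs(y_true).sum()
    return float(per_level.mean())

def geomean_ratio(model_wql, baseline_wql):
    """Geometric mean of per-task model/baseline WQL ratios."""
    ratios = np.asarray(model_wql, float) / np.asarray(baseline_wql, float)
    return float(np.exp(np.mean(np.log(ratios))))

# Hypothetical per-task WQL values for three tasks (not from this card).
snaive = [0.20, 0.10, 0.40]
fp32 = [0.10, 0.06, 0.24]
int8 = [0.11, 0.07, 0.26]

print("FP32 / S-Naive:", round(geomean_ratio(fp32, snaive), 3))
print("INT8 / S-Naive:", round(geomean_ratio(int8, snaive), 3))
print("INT8 / FP32:   ", round(geomean_ratio(int8, fp32), 3))
```

Because the geometric mean is multiplicative, the INT8 / FP32 geomean equals the ratio of the two Seasonal-Naive-normalized geomeans when all three are computed over the same set of tasks.
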
**How to read the headline number.** A geomean WQL / Seasonal-Naive
ratio of 0.634 (ctx=512) or 0.618 (ctx=2048) means the INT8 model
retains the bulk of `chronos-2`'s skill margin over the no-model
Seasonal-Naive baseline. The FP32 model sits at 0.549 by the same
convention, so the INT8 variants' geomean WQL is about 13 % (ctx=2048)
to 16 % (ctx=512) higher than FP32, yet both still convincingly beat
Seasonal-Naive on the geomean.
**Where the regression concentrates.** Worst-case cells are
out-of-distribution low-frequency series (`us_births/M`,
`m4_hourly/{medium,long}`) and high-frequency long-horizon
forecasts (`solar/10T/{medium,long}`). The full per-task CSVs
ship with the artifacts; check them before deploying to a domain
that resembles those outliers.
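
One way to screen per-task results before deployment is to summarize the INT8 / FP32 WQL ratios and list the worst cells. The sketch below uses a hypothetical in-memory dictionary, because the column layout of the shipped CSVs is not described here. It also assumes that "within 10 %" means a ratio of at most 1.10; the card does not spell out that threshold rule.

```python
import numpy as np

def summarize(ratios, top_k=2):
    """Summarize per-task INT8/FP32 WQL ratios (lower is better)."""
    names = list(ratios)
    r = np.array([ratios[n] for n in names], dtype=float)
    return {
        "geomean": float(np.exp(np.log(r).mean())),
        "median": float(np.median(r)),
        "within_10pct": int((r <= 1.10).sum()),  # assumed threshold rule
        "within_20pct": int((r <= 1.20).sum()),  # assumed threshold rule
        "beating_fp32": int((r < 1.0).sum()),
        "worst": sorted(names, key=ratios.get, reverse=True)[:top_k],
    }

# Hypothetical task names and ratios, for illustration only.
ratios = {
    "task_a": 0.97,
    "task_b": 1.04,
    "task_c": 1.08,
    "task_d": 1.15,
    "task_e": 1.62,
}

summary = summarize(ratios)
for key, value in summary.items():
    print(f"{key}: {value}")
```
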
## Creating Your Own Optimized Models
This artifact was produced with
[embedl-deploy](https://github.com/embedl/embedl-deploy), Embedl's
open-source PyTorch → TensorRT deployment library. The same workflow
applies to your own models; see
[the documentation](https://github.com/embedl/embedl-deploy#readme)
for installation and usage.
## License
| Component | License |
|---|---|
| Optimized model artifacts (this repo) | [Embedl Models Community Licence v1.0](https://github.com/embedl/embedl-models/blob/main/LICENSE) (no redistribution as a hosted service) |
| Upstream architecture and weights | [Amazon Chronos-2 License](https://huggingface.co/amazon/chronos-2/blob/main/LICENSE) |
## Contact
We offer engineering support for on-prem/edge deployments and partner
co-marketing opportunities. Reach out at
[contact@embedl.com](mailto:contact@embedl.com), or open an issue on
[GitHub](https://github.com/embedl/embedl-deploy).
<!-- embedl-discord-banner:start -->
<style>
.embedl-discord-btn { transition: background 160ms ease, box-shadow 160ms ease; }
.embedl-discord-btn:hover { background: #6C77F5 !important; box-shadow: 0 8px 22px rgba(88,101,242,0.55) !important; }
</style>
<div style="background:radial-gradient(600px 220px at 0% 50%,rgba(88,101,242,0.22) 0%,rgba(88,101,242,0) 60%),radial-gradient(400px 180px at 100% 100%,rgba(88,101,242,0.10) 0%,rgba(88,101,242,0) 55%),linear-gradient(135deg,#0B1626 0%,#142338 100%);border:1px solid rgba(88,101,242,0.35);border-radius:12px;padding:22px 24px;margin:24px 0 0 0;color:#F2F6FA;box-shadow:0 4px 16px rgba(11,22,38,0.18);overflow:hidden;box-sizing:border-box;max-width:100%;">
<table style="width:100%;border-collapse:collapse;border:0;background:transparent;">
<tr style="background:transparent;">
<td style="vertical-align:middle;border:0;padding:0;background:transparent;">
<div style="display:inline-block;font-size:10px;letter-spacing:0.08em;text-transform:uppercase;font-weight:700;color:#A5B4FC;background:rgba(88,101,242,0.18);border:1px solid rgba(88,101,242,0.45);padding:4px 10px;border-radius:999px;margin-bottom:10px;white-space:nowrap;">Community &amp; support</div>
<div style="font-size:15px;font-weight:700;line-height:1.35;color:#F2F6FA;margin-bottom:4px;">Need help with this model? Chat with the Embedl team and other engineers on <span style="color:#A5B4FC;white-space:nowrap;">Discord</span>.</div>
<div style="font-size:13px;color:#9BA7B5;">Quantization gotchas, hardware questions, fine-tuning tips: bring them all.</div>
</td>
<td width="1%" style="vertical-align:middle;border:0;padding:0 0 0 18px;white-space:nowrap;text-align:right;background:transparent;">
<a href="https://discord.gg/MTbMWdKqE" class="embedl-discord-btn" style="display:inline-block;font-size:13px;font-weight:600;padding:9px 14px;border-radius:6px;border:1px solid #5865F2;background:#5865F2;color:#FFFFFF;text-decoration:none;box-shadow:0 6px 18px rgba(88,101,242,0.35);">Join our Discord →</a>
</td>
</tr>
</table>
</div>
<!-- embedl-discord-banner:end -->