# JL-CSDI ESc1 Foundation Model
Probabilistic order-flow forecaster for E-mini S&P 500 futures (ESc1). Given a 100-window context (roughly 50,000 recent trades), the model produces Monte Carlo samples of the joint distribution of 8 trade-time order-flow features over the next 10 windows.
## Architecture
CSDI (Tashiro et al., NeurIPS 2021) with the Jump-Laplace noise model from Baule (2025), v2 architecture variant.
| Parameter | Value |
|---|---|
| Total parameters | 61.93 M |
| Residual blocks | 6 with dual attention (temporal then cross-feature) |
| Channels per block | 528 |
| Attention heads | 6 |
| Feed-forward dim | 2112 |
| Diffusion steps (training) | 800 |
| JL reverse-sampling steps | 50, uniform Δt = 0.12, T_max = 6.0 |
| Noise scale σ | 1.0 (global) |
| Noise schedule | tanh6-1 (Li et al., Nature Machine Intelligence 2024) |
The released weights are the EMA snapshot at the best validation epoch (87). Training-only auxiliaries (the distributional auxiliary head and importance reweighting) are disabled in the release config because they play no role at inference.
## Training data
ESc1 trade-by-trade data from LSEG TickHistory, aggressor-labeled (explicit field where present, qualifier-regex recovery otherwise).
| Split | Range | Windows | Trades |
|---|---|---|---|
| Train | 2016-07-01 to 2025-01-01 | 1.33 M | 665 M |
| Validation | 2025-01-01 to 2025-04-01 | 6,855 | filtered subset |
| Held-out test | 2025-04-01 to 2026-04-01 | 23,175 | filtered subset |
The training range covers Volmageddon (February 2018), the COVID crash and recovery (March 2020 onward), the 2023 banking crisis (Silicon Valley Bank, Credit Suisse), and the 2024 volatility-compression regime. Pre-July 2016 data is excluded because LSEG TickHistory does not expose trade aggressor for CME futures in that period.
Backtesting on dates inside the training window introduces leakage. For clean out-of-sample evaluation use 2026-04-01 onward, or pre-2016-07 data (which requires a different source, since TickHistory lacks aggressor labels there); a simple date guard is sketched below.
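A backtest driver can enforce this mechanically. A minimal sketch (the helper name is illustrative; the range is copied from the split table above and spans train, validation, and held-out test):

```python
from datetime import date

# Full development range: train start through held-out test end.
TRAIN_START, DATA_END = date(2016, 7, 1), date(2026, 4, 1)

def assert_out_of_sample(day: date) -> None:
    """Raise if `day` falls inside any split touched during development."""
    if TRAIN_START <= day < DATA_END:
        raise ValueError(
            f"{day} lies inside the 2016-07-01..2026-04-01 development "
            "range; backtest results would be contaminated."
        )

assert_out_of_sample(date(2026, 6, 15))  # passes: strictly out of sample
```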
## Input and output

Input: a dict with

```text
"observed_data"   (B, L=110, K=8)
"observed_mask"   (B, L=110, K=8)
"gt_mask"         (B, L=110, K=8)   1 over the 100 context windows, 0 over the 10 forecast windows
"timepoints"      (B, L=110)
"feature_id"      (B, K=8)
```

Output: `samples` of shape (B, n_samples, K=8, L=110), i.e. n_samples Monte Carlo realizations of the joint forecast.
The 8 channels are ESc1 trade-time order-flow features: the Maitrier and Bouchaud a-family imbalance at a = 0, 0.25, 0.5, 0.75, 1.0; log return; log realized variance; and log total volume. Channel order and normalization scalars are in `normalization_params.json`.
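For orientation, a minimal sketch of assembling a batch with these shapes. The tensor contents are placeholders, and treating timepoints as plain window indices is an assumption (check inference.py for the exact convention):

```python
import torch

B, L, K, CONTEXT = 4, 110, 8, 100   # 100 context windows + 10 forecast windows

observed_data = torch.zeros(B, L, K)   # normalized features; forecast rows can stay zero
observed_mask = torch.ones(B, L, K)    # 1 wherever observed_data holds a valid value
gt_mask = torch.zeros(B, L, K)
gt_mask[:, :CONTEXT, :] = 1.0          # 1 over context, 0 over the forecast horizon

batch = {
    "observed_data": observed_data,
    "observed_mask": observed_mask,
    "gt_mask": gt_mask,
    "timepoints": torch.arange(L).float().expand(B, L),  # assumption: integer window index
    "feature_id": torch.arange(K).expand(B, K),
}
```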
## Performance

Evaluated on 100 windows from the held-out test set (2025-04-01 onward), with n_samples = 32, on an RTX 5090. Results for the two production samplers:
| Metric | Vanilla JL-SDE | MGD-conditional |
|---|---|---|
| Validation loss (EMA, epoch 87) | 0.0220 | 0.0220 |
| 90% interval empirical coverage on imbalance channels, low / mid / high volatility | 0.93 / 0.93 / 0.94 | 0.88 / 0.89 / 0.88 |
| Relative error on cov(imb_a0, imb_a025) vs realized | 5.8 % | 1.9 % |
| Sample diversity ratio (within-MC std / MC-mean trajectory std) | 4.05 to 4.83 | 4.11 to 4.83 |
| Wall time per 100 windows | 493 s | 470 s |
MGD-conditional is the recommended sampler. It applies a moment-guided correction at sampling time (Lempereur et al., 2026) that enforces calibrated cross-channel coupling without retraining: its 90 % intervals sit closest to the nominal level in all three volatility regimes, and its cross-channel covariance error is roughly a third of the vanilla sampler's. A sample diversity ratio above 1.0 in every channel confirms no mean collapse.
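Both headline metrics can be recomputed straight from the sample tensor. A sketch under one plausible reading of the table's definitions (function names are illustrative):

```python
import torch

def interval_coverage(samples: torch.Tensor, realized: torch.Tensor,
                      level: float = 0.90) -> torch.Tensor:
    """Empirical coverage of the central `level` interval.

    samples:  (B, n_samples, K, L) Monte Carlo forecasts
    realized: (B, K, L) realized values on the same normalized scale
    """
    lo = samples.quantile((1 - level) / 2, dim=1)        # (B, K, L)
    hi = samples.quantile(1 - (1 - level) / 2, dim=1)    # (B, K, L)
    hit = (realized >= lo) & (realized <= hi)
    return hit.float().mean()

def diversity_ratio(samples: torch.Tensor) -> torch.Tensor:
    """Within-MC std over the std of the MC-mean trajectory, per channel."""
    within = samples.std(dim=1).mean(dim=(0, 2))   # spread across MC draws
    mean_traj = samples.mean(dim=1)                # (B, K, L)
    across = mean_traj.std(dim=2).mean(dim=0)      # variability of the mean path
    return within / across                         # > 1.0 in every channel => no mean collapse
```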
## Quick start

```bash
git lfs install
git clone https://huggingface.co/S-teven/jl-csdi-mgd
cd jl-csdi-mgd
pip install -r requirements.txt
python inference.py --sampler sde-mgd --n-samples 32
```
The `inference.py` example runs on a synthetic batch and prints the output tensor shape. For real data, normalize the raw 8-channel features with `normalization_params.json` (V5-hybrid-D: T-scaling on the imbalance channels, sqrt-T and delta-std on log_ret, MAD plus z-score on log_realized_var and log_tot_vol).
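The JSON schema itself is not documented here; purely as an illustration (the `channels`, `shift`, and `scale` key names are hypothetical, and any window-dependent T factors are ignored), most of the per-channel steps reduce to an affine map once the scalars are fixed:

```python
import json
import numpy as np

# Hypothetical layout: {"channels": [{"name": ..., "shift": ..., "scale": ...}, ...]}
params = json.load(open("normalization_params.json"))

def normalize(raw: np.ndarray) -> np.ndarray:
    """Affine per-channel normalization. raw: (L, K) in repo channel order."""
    shift = np.array([c["shift"] for c in params["channels"]])
    scale = np.array([c["scale"] for c in params["channels"]])
    return (raw - shift) / scale
```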
## Inference modes

`inference.py` accepts `--sampler {sde, sde-mgd}`.

- `sde`: vanilla Jump-Laplace SDE reverse sampling (Baule, 2025).
- `sde-mgd`: the vanilla SDE with a conditional moment-guided correction (recommended).
Programmatic usage is straightforward once the model is loaded:

```python
import yaml
import torch
from safetensors.torch import load_file

from main_model import CSDI_Forecasting

# Build the model from the release config and load the EMA weights.
cfg = yaml.safe_load(open("config.yaml"))
model = CSDI_Forecasting(cfg, "cuda", cfg["model"]["target_dim"]).to("cuda")
model.load_state_dict(load_file("model.safetensors"), strict=True)
model.eval()

# observed_data, cond_mask, side_info come from a prepared batch.
with torch.no_grad():
    samples = model.impute_jl_sde_mgd(
        observed_data, cond_mask, side_info, n_samples=32,
        mgd_target_mode="conditional",
    )
```
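`samples` has shape (B, n_samples, K=8, L=110), so downstream statistics come straight from the sample axis. For example (assuming, per the 100+10 window layout, that the forecast occupies the last 10 steps):

```python
forecast = samples[..., -10:]                  # (B, n_samples, K, 10) forecast region
median_path = forecast.median(dim=1).values    # pointwise median forecast
band_lo = forecast.quantile(0.05, dim=1)       # 90% central band, lower edge
band_hi = forecast.quantile(0.95, dim=1)       # 90% central band, upper edge
```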
## Files

| File | Purpose |
|---|---|
| `model.safetensors` | EMA weights, Lightning prefix stripped (198 tensors, 61.93 M params, 248 MB) |
| `config.yaml` | Architecture hyperparameters; inference-time configuration |
| `normalization_params.json` | V5-hybrid-D normalization scalars |
| `mgd_target_moments.npz` | Optional precomputed unconditional MGD trajectory |
| `main_model.py` | CSDI_Forecasting class |
| `diff_models_v2.py` | Diffusion network |
| `jl_noise.py` | Jump-Laplace forward and sampling primitives |
| `mgd_step_torch.py` | Moment-guided sampling-time correction (centered polynomial moments) |
| `inference.py` | Working example |
| `requirements.txt` | Pinned dependencies |
## Limitations

The model outputs Monte Carlo samples only; downstream consumers compute the quantities they need (probabilities, quantiles, joint event likelihoods) from those samples, as in the sketch below. It is trained exclusively on ESc1, and transfer to other instruments has not been evaluated. Backtest only on dates outside the training window (see Training data).
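For instance, an event probability is just a sample frequency. Assuming channel 0 is the a = 0 imbalance (per the channel list above) and the forecast occupies the last 10 of the 110 steps:

```python
# P(first-forecast-window a=0 imbalance > 0), estimated per batch element.
p_positive = (samples[:, :, 0, -10] > 0).float().mean(dim=1)   # shape (B,)
```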
## References
Tashiro, Y., Song, J., Song, Y., Ermon, S. (2021). CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation. Advances in Neural Information Processing Systems 34, 24804-24816. arXiv:2107.03502.
Baule, A. (2025). Generative modelling with jump-diffusions. arXiv:2503.06558.
Lempereur, E., Cuvelle-Magar, N., Coeurdoux, F., Mallat, S., Vanden-Eijnden, E. (2026). MGD: Moment Guided Diffusion for Maximum Entropy Generation. arXiv:2602.17211.
Maitrier, G., Bouchaud, J.-P. (2025). The Subtle Interplay between Square-root Impact, Order Imbalance and Volatility: A Unifying Framework. arXiv:2506.07711.
Li, T., Biferale, L., Bonaccorso, F., Scarpolini, M. A., Buzzicotti, M. (2024). Synthetic Lagrangian turbulence by generative diffusion models. Nature Machine Intelligence 6(4), 393-403. arXiv:2307.08529.
Yang, Y., Zha, K., Chen, Y.-C., Wang, H., Katabi, D. (2021). Delving into Deep Imbalanced Regression. Proceedings of the 38th International Conference on Machine Learning. arXiv:2102.09554.
## License
Internal and research use. Not for redistribution.