JL-CSDI ESc1 Foundation Model

Probabilistic order-flow forecaster for E-mini S&P 500 futures (ESc1). Given a 100-window context (roughly 50,000 recent trades), the model produces Monte Carlo samples of the joint distribution of 8 trade-time order-flow features over the next 10 windows.

Architecture

CSDI (Tashiro et al., NeurIPS 2021) with the Jump-Laplace noise model from Baule (2025), v2 architecture variant.

Parameter                    Value
Total parameters             61.93 M
Residual blocks              6, each with dual attention (temporal, then cross-feature)
Channels per block           528
Attention heads              6
Feed-forward dim             2112
Diffusion steps (training)   800
JL reverse-sampling steps    50 (uniform Δt = 0.12, T_max = 6.0)
Noise scale σ                1.0 (global)
Noise schedule               tanh6-1 (Li et al., Nature Machine Intelligence 2024)

The release weights are the EMA snapshot at the best validation epoch (87). Training auxiliaries (distributional aux head, importance reweighting) are disabled in the release config because they are not used at inference.

Training data

ESc1 trade-by-trade data from LSEG TickHistory, aggressor-labeled (explicit field where present, qualifier-regex recovery otherwise).

Split          Range                       Windows    Trades
Train          2016-07-01 to 2025-01-01    1.33 M     665 M
Validation     2025-01-01 to 2025-04-01    6,855      filtered subset
Held-out test  2025-04-01 to 2026-04-01    23,175     filtered subset

The training range covers Volmageddon (February 2018), the COVID crash and recovery (March 2020 onward), the 2023 banking crisis (Silicon Valley Bank, Credit Suisse), and the 2024 volatility-compression regime. Pre-July 2016 data is excluded because LSEG TickHistory does not expose the trade-aggressor field for CME futures in that period.

Backtesting on dates inside the training window introduces leakage. Use 2026-04-01 onward or pre-2016-07 (with a different data source) for clean out-of-sample evaluation.
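The date rule above can be encoded as a small guard in a backtest harness; a minimal sketch (the function name is illustrative, not part of this repo):

```python
# Guard against training-window leakage, per the note above: only dates
# from 2026-04-01 onward, or before 2016-07-01 (with a different data
# source), are clean out-of-sample for this release.
from datetime import date

TRAIN_START = date(2016, 7, 1)   # first date seen by the model pipeline
EVAL_END = date(2026, 4, 1)      # end of the held-out test range

def is_out_of_sample(d: date) -> bool:
    """True if d was never touched by training, validation, or test selection."""
    return d >= EVAL_END or d < TRAIN_START
```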

Input and output

Input  : dict with
            "observed_data"  (B, L=110, K=8)
            "observed_mask"  (B, L=110, K=8)
            "gt_mask"        (B, L=110, K=8)   1 over context, 0 over forecast
            "timepoints"     (B, L=110)
            "feature_id"     (B, K=8)

Output : samples (B, n_samples, K=8, L=110)
            n_samples Monte Carlo realizations of the joint forecast.

The 8 channels are ESc1 trade-time order-flow features (Maitrier and Bouchaud a-family imbalance at a = 0, 0.25, 0.5, 0.75, 1.0; log return; log realized variance; log total volume). Channel order and normalization scalars are in normalization_params.json.
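The input layout above can be sketched as a dummy batch of the right shapes (values are random placeholders; the dict keys and shapes follow the spec in this card):

```python
# Dummy batch matching the documented input spec; shapes only.
import torch

B, L, K = 4, 110, 8   # batch, 100 context + 10 forecast windows, 8 features
CONTEXT = 100

batch = {
    "observed_data": torch.randn(B, L, K),   # normalized features
    "observed_mask": torch.ones(B, L, K),    # 1 where a value is observed
    "gt_mask": torch.zeros(B, L, K),         # 1 over context, 0 over forecast
    "timepoints": torch.arange(L, dtype=torch.float32).expand(B, L),
    "feature_id": torch.arange(K).expand(B, K),
}
batch["gt_mask"][:, :CONTEXT, :] = 1.0       # mark the context windows
```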

Performance

Evaluated on 100 windows from the held-out test set (2025-04-01 onward), n_samples = 32, on an RTX 5090. Results are reported for both production samplers.

Metric                                                             Vanilla JL-SDE      MGD-conditional
Validation loss (EMA, epoch 87)                                    0.0220              0.0220
90% interval empirical coverage, imbalance channels
  (low / mid / high volatility)                                    0.93 / 0.93 / 0.94  0.88 / 0.89 / 0.88
Relative error on cov(imb_a0, imb_a025) vs realized                5.8 %               1.9 %
Sample diversity ratio (within-MC std / MC-mean trajectory std)    4.05 to 4.83        4.11 to 4.83

MGD-conditional is the recommended sampler. It applies a moment-guided correction at sampling time (Lempereur et al., 2026) that enforces calibrated cross-channel coupling without retraining. Coverage and cross-channel results are strongest at the 90 % nominal level; a sample diversity ratio above 1.0 in every channel confirms the sampler does not collapse to the mean.
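The two diagnostics in the table can be computed directly from the sampler output. A sketch on synthetic data, under one plausible reading of "within-MC std / MC-mean trajectory std" (this is illustrative, not the card's evaluation script):

```python
# Coverage and diversity diagnostics from a samples tensor of shape
# (B, n_samples, K, L), as in the Output spec, against a realized path.
import numpy as np

rng = np.random.default_rng(0)
B, S, K, L = 16, 32, 8, 110
samples = rng.normal(size=(B, S, K, L))   # stand-in for model output
truth = rng.normal(size=(B, K, L))        # stand-in for realized features

# 90% central-interval empirical coverage: fraction of realized values
# falling between the 5th and 95th percentiles of the Monte Carlo samples.
lo = np.quantile(samples, 0.05, axis=1)
hi = np.quantile(samples, 0.95, axis=1)
coverage = np.mean((truth >= lo) & (truth <= hi))

# Diversity: spread across MC samples relative to the time-variation of
# the MC-mean trajectory; values above 1.0 indicate no mean collapse.
within_mc_std = samples.std(axis=1).mean()
mc_mean_traj_std = samples.mean(axis=1).std(axis=-1).mean()
diversity = within_mc_std / mc_mean_traj_std
```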

Quick start

git lfs install
git clone https://huggingface.co/S-teven/jl-csdi-mgd
cd jl-csdi-mgd
pip install -r requirements.txt
python inference.py --sampler sde-mgd --n-samples 32

The inference.py example runs on a synthetic batch and prints the output tensor shape. For real data, normalize raw 8-channel features with normalization_params.json (V5-hybrid-D: T-scaling on imbalance channels, sqrt-T and delta-std on log_ret, MAD plus z-score on log_realized_var and log_tot_vol).
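The exact transforms and field names are defined by normalization_params.json itself; as a hypothetical sketch, a generic per-channel affine step looks like this (the `params` dict below is a stand-in, not the real file's schema):

```python
# Hypothetical per-channel normalization step before inference. The real
# V5-hybrid-D transforms (T-scaling, sqrt-T, delta-std, MAD + z-score)
# are channel-specific; a plain (x - offset) / scale is assumed here.
import numpy as np

params = {  # stand-in for json.load(open("normalization_params.json"))
    "imb_a0":  {"offset": 0.0, "scale": 0.12},
    "log_ret": {"offset": 0.0, "scale": 3.1e-4},
}

def normalize(raw: dict, params: dict) -> dict:
    """Apply per-channel affine normalization; invert it on model output."""
    return {k: (v - params[k]["offset"]) / params[k]["scale"]
            for k, v in raw.items()}

raw = {"imb_a0": np.array([0.06, -0.12]), "log_ret": np.array([3.1e-4])}
norm = normalize(raw, params)
```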

Inference modes

inference.py accepts --sampler {sde, sde-mgd}.

  • sde: vanilla Jump-Laplace SDE reverse sampling (Baule, 2025).
  • sde-mgd: vanilla SDE with conditional moment-guided correction (recommended).

Programmatic usage is straightforward once the model is loaded:

from safetensors.torch import load_file
import yaml, torch
from main_model import CSDI_Forecasting

cfg = yaml.safe_load(open("config.yaml"))
model = CSDI_Forecasting(cfg, "cuda", cfg["model"]["target_dim"]).to("cuda")
model.load_state_dict(load_file("model.safetensors"), strict=True)
model.eval()

samples = model.impute_jl_sde_mgd(
    observed_data, cond_mask, side_info, n_samples=32,
    mgd_target_mode="conditional",
)
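From the returned samples tensor, downstream quantities reduce to sample statistics. A sketch on random stand-in data, assuming the documented output shape (B, n_samples, K, L):

```python
# Turning Monte Carlo samples into point forecasts, intervals, and
# joint-event probabilities; `samples` stands in for model output.
import torch

B, S, K, L = 2, 32, 8, 110
samples = torch.randn(B, S, K, L)
forecast = samples[..., 100:]          # last 10 windows are the forecast

# Point forecast and 90% central interval, per channel and window.
median = forecast.median(dim=1).values
q05 = forecast.quantile(0.05, dim=1)
q95 = forecast.quantile(0.95, dim=1)

# Joint-event probability as a sample fraction, e.g. channels 5 and 6
# both positive in the first forecast window.
event = (forecast[:, :, 5, 0] > 0) & (forecast[:, :, 6, 0] > 0)
p_event = event.float().mean(dim=1)
```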

Files

File                         Purpose
model.safetensors            EMA weights, Lightning prefix stripped (198 tensors, 61.93 M params, 248 MB)
config.yaml                  Architecture hyperparameters; inference-time configuration
normalization_params.json    V5-hybrid-D normalization scalars
mgd_target_moments.npz       Optional precomputed unconditional MGD trajectory
main_model.py                CSDI_Forecasting class
diff_models_v2.py            Diffusion network
jl_noise.py                  Jump-Laplace forward and sampling primitives
mgd_step_torch.py            Moment-guided sampling-time correction (centered polynomial moments)
inference.py                 Working example
requirements.txt             Pinned dependencies

Limitations

The model outputs Monte Carlo samples. Downstream consumers compute the quantities they need (probabilities, quantiles, joint event likelihoods) from those samples. It is trained on ESc1 only; transfer to other instruments has not been evaluated. Backtest on dates outside the training window.

References

  1. Tashiro, Y., Song, J., Song, Y., Ermon, S. (2021). CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation. Advances in Neural Information Processing Systems 34, 24804-24816. arXiv:2107.03502.

  2. Baule, A. (2025). Generative modelling with jump-diffusions. arXiv:2503.06558.

  3. Lempereur, E., Cuvelle-Magar, N., Coeurdoux, F., Mallat, S., Vanden-Eijnden, E. (2026). MGD: Moment Guided Diffusion for Maximum Entropy Generation. arXiv:2602.17211.

  4. Maitrier, G., Bouchaud, J.-P. (2025). The Subtle Interplay between Square-root Impact, Order Imbalance and Volatility: A Unifying Framework. arXiv:2506.07711.

  5. Li, T., Biferale, L., Bonaccorso, F., Scarpolini, M. A., Buzzicotti, M. (2024). Synthetic Lagrangian turbulence by generative diffusion models. Nature Machine Intelligence 6(4), 393-403. arXiv:2307.08529.

  6. Yang, Y., Zha, K., Chen, Y.-C., Wang, H., Katabi, D. (2021). Delving into Deep Imbalanced Regression. Proceedings of the 38th International Conference on Machine Learning. arXiv:2102.09554.

License

Internal and research use. Not for redistribution.
