ROSS Stage-Wise Regression Models
Pre-trained XGBoost regression models for ROSS -- a dual-plane simulator for LLM serving systems.
These models power ROSS's data plane. Given a batch descriptor (request IDs, sequence lengths, model architecture features, and platform performance features), they predict per-iteration latency by decomposing each serving iteration into pre-forward, forward, and post-forward stages, which explicitly captures CPU-GPU pipeline overlap.
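To make the stage decomposition concrete, here is a minimal sketch of how three per-stage predictions might be combined into one iteration latency. The `StageLatency` type, the `iteration_latency_ms` function, and the `overlap_fraction` parameter are all hypothetical names introduced for illustration. The combination rule (hiding part of the CPU-side work behind the GPU forward pass) is an assumption and not ROSS's actual formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLatency:
    """Hypothetical container for per-stage predictions, in milliseconds."""
    pre_forward_ms: float
    forward_ms: float
    post_forward_ms: float


def iteration_latency_ms(stages: StageLatency, overlap_fraction: float = 0.0) -> float:
    """Combine stage latencies under an assumed CPU-GPU overlap model.

    overlap_fraction = 0.0 treats the stages as fully serial.
    overlap_fraction = 1.0 hides as much CPU-side work (pre- and post-forward)
    behind the forward pass as the forward pass is long enough to cover.
    """
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be in [0, 1]")
    cpu_ms = stages.pre_forward_ms + stages.post_forward_ms
    # CPU work can only be hidden while the GPU is busy with the forward pass.
    hidden_ms = min(cpu_ms * overlap_fraction, stages.forward_ms)
    return stages.forward_ms + cpu_ms - hidden_ms
```

With no overlap the result is the plain sum of the three stages; with full overlap and a long enough forward pass it collapses to the forward latency alone.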
Model Overview
| Component | Description |
|---|---|
| Algorithm | XGBoost regressor |
| Training data | Sparse profiling traces collected on NVIDIA H200 and B200 GPUs |
| Prediction target | Per-stage iteration latency (ms) for each of pre-forward, forward, and post-forward |
| Input features | Batch shape, model architecture features, platform performance features |
| Supported frameworks | vLLM, SGLang |
Directory Structure
sgl/ # SGLang backend models
dense/
prefill/
pre_forward_trained_models/xgboost_model/
forward_trained_models/xgboost_model/
decode/
pre_forward_trained_models/xgboost_model/
forward_trained_models/xgboost_model/
post_forward_trained_models/xgboost_model/
moe_foward/
prefill/
forward_trained_models/xgboost_model/
decode/
forward_trained_models/xgboost_model/
vllm/ # vLLM backend models
dense/
pre_forward_trained_models/xgboost_model/
forward_trained_models/xgboost_model/
post_forward_trained_models/xgboost_model/
moe/
forward_trained_models/xgboost_model/
Each xgboost_model/ directory contains:
- model.json -- the serialized XGBoost model
- model_metadata.json -- feature names, training metadata
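After downloading, it can be useful to confirm that every xgboost_model/ directory actually holds both files. The sketch below walks a local copy recursively, so it does not depend on the exact nesting of the tree. The function name `find_incomplete_model_dirs` is hypothetical and not part of ROSS.

```python
from pathlib import Path

# The two files each xgboost_model/ directory is expected to contain.
EXPECTED_FILES = ("model.json", "model_metadata.json")


def find_incomplete_model_dirs(modeling_dir):
    """Return {relative_dir: [missing file names]} for incomplete model directories."""
    root = Path(modeling_dir)
    problems = {}
    for model_dir in sorted(root.rglob("xgboost_model")):
        if not model_dir.is_dir():
            continue
        missing = [name for name in EXPECTED_FILES
                   if not (model_dir / name).is_file()]
        if missing:
            problems[model_dir.relative_to(root).as_posix()] = missing
    return problems
```

An empty result means every discovered directory is complete. A complete model.json can then be opened with XGBoost's own loader, for example `xgboost.Booster(model_file=...)`; whether ROSS itself loads the models this way is not stated here.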
Supported Platforms
| GPU | Status |
|---|---|
| NVIDIA H200 | Pre-trained models included |
| NVIDIA B200 | Pre-trained models included |
New platforms can be added by running the profiling scripts in the ROSS repository's collector/ directory.
Validated LLM Families
| Family | Variants |
|---|---|
| Llama-3.1 | 8B, 70B |
| Qwen2.5 | 72B-Instruct |
| Qwen3 | 32B, 30B-A3B (MoE), 235B-A22B (MoE), QwQ 32B |
| DeepSeek-V3 | 671B (MoE) |
| gpt-oss | 20b (MoE), 120b (MoE) |
The stage-wise regressor takes model configuration features as input instead of relying on per-model kernel calibration, so new models within supported families generally work out of the box.
Usage
1. Download
# Using huggingface-cli
huggingface-cli download CharlesCAOO/ross --local-dir modeling
2. Point ROSS to the downloaded models
In your ROSS config JSON:
{
"modeling_dir": "/path/to/modeling",
...
}
Or via CLI:
python ross/ross_predict.py --modeling-dir /path/to/modeling --config my_config.json
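If you generate configs from a script, the `modeling_dir` entry can also be set programmatically. This is a small sketch using only the standard library. The helper name `set_modeling_dir` is hypothetical, and it deliberately leaves all other keys in the config untouched, since their names are not listed here.

```python
import json
from pathlib import Path


def set_modeling_dir(config_path, modeling_dir):
    """Write an absolute "modeling_dir" into an existing ROSS config JSON file."""
    config_file = Path(config_path)
    config = json.loads(config_file.read_text(encoding="utf-8"))
    # Store an absolute path so the config works from any working directory.
    config["modeling_dir"] = str(Path(modeling_dir).expanduser().resolve())
    config_file.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `set_modeling_dir("my_config.json", "modeling")` would point the config at the directory created by the download step.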
3. Run simulation
python ross/ross_predict.py --config my_config.json --record-path results.csv
ROSS achieves median prediction errors below 6% for end-to-end (E2E) latency and time per output token (TPOT) across the validated models and platforms, while sustaining a simulation speedup of more than 11x over on-hardware benchmarking.
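To check a figure like this against your own measurements, one option is the median of absolute relative errors between predicted and measured values. The sketch below assumes that definition; the exact metric used by ROSS is not specified here, and the function name `median_relative_error` is hypothetical.

```python
from statistics import median


def median_relative_error(predicted, measured):
    """Median of |predicted - measured| / |measured| over paired samples."""
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    if not predicted:
        raise ValueError("at least one sample is required")
    if any(m == 0 for m in measured):
        raise ValueError("measured values must be non-zero")
    errors = [abs(p - m) / abs(m) for p, m in zip(predicted, measured)]
    return median(errors)
```

The inputs could be, for instance, per-request E2E latencies from a simulation run paired with the same requests measured on hardware.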
License
Apache 2.0