---
license: apache-2.0
tags:
- ross
- llm-serving
- simulation
- xgboost
- performance-prediction
---

# ROSS Stage-Wise Regression Models

Pre-trained XGBoost regression models for [ROSS](https://github.com/scitix/ross) -- a dual-plane simulator for LLM serving systems.

These models power ROSS's **data plane**: given a batch descriptor (request IDs, sequence lengths, model architecture features, and platform performance features), they predict per-iteration latency by decomposing each serving iteration into **pre-forward**, **forward**, and **post-forward** stages, explicitly capturing CPU-GPU pipeline overlap.

## Model Overview

| Component | Description |
|-----------|-------------|
| Algorithm | XGBoost regressor |
| Training data | Sparse profiling traces collected on NVIDIA H200 and B200 GPUs |
| Prediction target | Per-stage iteration latency (ms) for each of pre-forward, forward, and post-forward |
| Input features | Batch shape, model architecture features, platform performance features |
| Supported frameworks | vLLM, SGLang |

## Directory Structure

```
sgl/                          # SGLang backend models
  dense/
    prefill/
      pre_forward_trained_models/xgboost_model/
      forward_trained_models/xgboost_model/
    decode/
      pre_forward_trained_models/xgboost_model/
      forward_trained_models/xgboost_model/
      post_forward_trained_models/xgboost_model/
  moe_foward/
    prefill/
      forward_trained_models/xgboost_model/
    decode/
      forward_trained_models/xgboost_model/
vllm/                         # vLLM backend models
  dense/
    pre_forward_trained_models/xgboost_model/
    forward_trained_models/xgboost_model/
    post_forward_trained_models/xgboost_model/
  moe/
    forward_trained_models/xgboost_model/
```

Each `xgboost_model/` directory contains:

- `model.json` -- the serialized XGBoost model
- `model_metadata.json` -- feature names, training metadata

## Supported Platforms

| GPU | Status |
|-----|--------|
| NVIDIA H200 | Pre-trained models included |
| NVIDIA B200 | Pre-trained models included |

New platforms can be added by running the profiling
scripts in the ROSS repository's `collector/` directory.

## Validated LLM Families

| Family | Variants |
|--------|----------|
| Llama-3.1 | 8B, 70B |
| Qwen2.5 | 72B-Instruct |
| Qwen3 | 32B, 30B-A3B (MoE), 235B-A22B (MoE), QwQ 32B |
| DeepSeek-V3 | 671B (MoE) |
| gpt-oss | 20b (MoE), 120b (MoE) |

The stage-wise regressor takes model configuration features as input rather than per-model kernel calibration, so new models within supported families generally work out of the box.

## Usage

### 1. Download

```bash
# Using huggingface-cli
huggingface-cli download CharlesCAOO/ross --local-dir modeling
```

### 2. Point ROSS to the downloaded models

In your ROSS config JSON:

```json
{
  "modeling_dir": "/path/to/modeling",
  ...
}
```

Or via CLI:

```bash
python ross/ross_predict.py --modeling-dir /path/to/modeling --config my_config.json
```

### 3. Run simulation

```bash
python ross/ross_predict.py --config my_config.json --record-path results.csv
```

ROSS achieves median prediction errors below 6% for E2E latency and TPOT across the validated models and platforms, while sustaining >11x simulation speedup over on-hardware benchmarking.

## License

Apache 2.0
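
## Appendix: Inspecting a Model Directory

The layout described under Directory Structure can also be navigated outside of ROSS, for example to check which feature names a given stage model expects. The following is a minimal sketch, not part of ROSS itself: `stage_model_dir` and `load_stage_model` are hypothetical helper names, the root directory is assumed to be the `modeling` folder from the download step, and the contents of `model_metadata.json` are returned as-is because its exact keys are not documented here. Loading requires the `xgboost` package.

```python
import json
from pathlib import Path
from typing import Optional

# Directory names copied from the "Directory Structure" section. The MoE
# directory is spelled "moe_foward" under sgl/ and "moe" under vllm/.
ARCH_DIRS = {
    "sgl": {"dense": "dense", "moe": "moe_foward"},
    "vllm": {"dense": "dense", "moe": "moe"},
}


def stage_model_dir(root: str, backend: str, arch: str, stage: str,
                    phase: Optional[str] = None) -> Path:
    """Build the path to one xgboost_model/ directory (hypothetical helper).

    stage is "pre_forward", "forward", or "post_forward". phase is
    "prefill" or "decode" and applies only to the SGLang (sgl) layout.
    """
    if backend not in ARCH_DIRS or arch not in ARCH_DIRS[backend]:
        raise ValueError(f"unknown backend/arch: {backend}/{arch}")
    path = Path(root) / backend / ARCH_DIRS[backend][arch]
    if backend == "sgl":
        if phase not in ("prefill", "decode"):
            raise ValueError("sgl models require phase='prefill' or 'decode'")
        path = path / phase
    return path / f"{stage}_trained_models" / "xgboost_model"


def load_stage_model(model_dir: Path):
    """Load model.json and model_metadata.json from one model directory."""
    import xgboost as xgb  # imported lazily so path helpers work without it

    booster = xgb.Booster()
    booster.load_model(str(model_dir / "model.json"))
    with open(model_dir / "model_metadata.json") as f:
        metadata = json.load(f)  # keys are not assumed; inspect before use
    return booster, metadata


if __name__ == "__main__":
    # Assumes the models were downloaded with --local-dir modeling.
    print(stage_model_dir("modeling", "sgl", "dense", "forward", phase="decode"))
```

Not every stage exists for every combination (for instance, the MoE directories list only forward models), so a caller would want to check that the returned directory exists before loading from it.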