---
license: apache-2.0
tags:
  - ross
  - llm-serving
  - simulation
  - xgboost
  - performance-prediction
---

# ROSS Stage-Wise Regression Models

Pre-trained XGBoost regression models for [ROSS](https://github.com/scitix/ross) -- a dual-plane simulator for LLM serving systems.

These models power ROSS's **data plane**. Given a batch descriptor (request IDs, sequence lengths, model architecture features, and platform performance features), they predict per-iteration latency by decomposing each serving iteration into **pre-forward**, **forward**, and **post-forward** stages, explicitly capturing CPU-GPU pipeline overlap.

## Model Overview

| Component | Description |
|-----------|-------------|
| Algorithm | XGBoost regressor |
| Training data | Sparse profiling traces collected on NVIDIA H200 and B200 GPUs |
| Prediction target | Latency (ms) of each stage in a serving iteration: pre-forward, forward, and post-forward |
| Input features | Batch shape, model architecture features, platform performance features |
| Supported frameworks | vLLM, SGLang |

## Directory Structure

```
sgl/                              # SGLang backend models
  dense/
    prefill/
      pre_forward_trained_models/xgboost_model/
      forward_trained_models/xgboost_model/
    decode/
      pre_forward_trained_models/xgboost_model/
      forward_trained_models/xgboost_model/
    post_forward_trained_models/xgboost_model/
  moe_foward/
    prefill/
      forward_trained_models/xgboost_model/
    decode/
      forward_trained_models/xgboost_model/
vllm/                             # vLLM backend models
  dense/
    pre_forward_trained_models/xgboost_model/
    forward_trained_models/xgboost_model/
    post_forward_trained_models/xgboost_model/
  moe/
    forward_trained_models/xgboost_model/
```

Each `xgboost_model/` directory contains:
- `model.json` -- the serialized XGBoost model
- `model_metadata.json` -- feature names, training metadata
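
As a minimal sketch of how these directories could be inspected outside ROSS, the snippet below lists every `xgboost_model/` directory under a download root and loads one model with the `xgboost` Python package. The helper names `find_model_dirs` and `load_stage_model` are illustrative and not part of ROSS. The keys inside `model_metadata.json` are not assumed; the metadata is returned as whatever the JSON file contains.

```python
import json
from pathlib import Path


def find_model_dirs(root):
    """Return every xgboost_model/ directory under root that holds both files."""
    found = []
    for model_file in sorted(Path(root).rglob("model.json")):
        model_dir = model_file.parent
        if model_dir.name != "xgboost_model":
            continue
        if (model_dir / "model_metadata.json").is_file():
            found.append(model_dir)
    return found


def load_stage_model(model_dir):
    """Load the booster and its metadata from one xgboost_model/ directory."""
    # Imported lazily so the listing helper works without xgboost installed.
    import xgboost as xgb

    booster = xgb.Booster()
    booster.load_model(str(Path(model_dir) / "model.json"))
    with open(Path(model_dir) / "model_metadata.json") as f:
        metadata = json.load(f)
    return booster, metadata


# Example usage (assumes the models were downloaded to ./modeling):
# for d in find_model_dirs("modeling"):
#     print(d)
# booster, metadata = load_stage_model(find_model_dirs("modeling")[0])
```

Running a prediction additionally requires feature vectors in the order the model was trained on; consult `model_metadata.json` for the feature names rather than guessing them.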

## Supported Platforms

| GPU | Status |
|-----|--------|
| NVIDIA H200 | Pre-trained models included |
| NVIDIA B200 | Pre-trained models included |

New platforms can be added by running the profiling scripts in the ROSS repository's `collector/` directory.

## Validated LLM Families

| Family | Variants |
|--------|----------|
| Llama-3.1 | 8B, 70B |
| Qwen2.5 | 72B-Instruct |
| Qwen3 | 32B, 30B-A3B (MoE), 235B-A22B (MoE), QwQ 32B |
| DeepSeek-V3 | 671B (MoE) |
| gpt-oss | 20b (MoE), 120b (MoE) |

The stage-wise regressor takes model configuration features as input instead of relying on per-model kernel calibration, so new models within the supported families generally work out of the box.

## Usage

### 1. Download

```bash
# Using huggingface-cli
huggingface-cli download CharlesCAOO/ross --local-dir modeling
```
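
To confirm that the download produced the layout listed under Directory Structure, a small check along the following lines may help. This is a sketch, not a ROSS utility: `EXPECTED_MODEL_DIRS` is copied verbatim from the listing (including its spelling of directory names), and `missing_model_files` is a hypothetical helper name.

```python
from pathlib import Path

# Relative model directories, copied from the Directory Structure listing.
EXPECTED_MODEL_DIRS = [
    "sgl/dense/prefill/pre_forward_trained_models/xgboost_model",
    "sgl/dense/prefill/forward_trained_models/xgboost_model",
    "sgl/dense/decode/pre_forward_trained_models/xgboost_model",
    "sgl/dense/decode/forward_trained_models/xgboost_model",
    "sgl/dense/post_forward_trained_models/xgboost_model",
    "sgl/moe_foward/prefill/forward_trained_models/xgboost_model",
    "sgl/moe_foward/decode/forward_trained_models/xgboost_model",
    "vllm/dense/pre_forward_trained_models/xgboost_model",
    "vllm/dense/forward_trained_models/xgboost_model",
    "vllm/dense/post_forward_trained_models/xgboost_model",
    "vllm/moe/forward_trained_models/xgboost_model",
]
EXPECTED_FILES = ["model.json", "model_metadata.json"]


def missing_model_files(root):
    """Return the expected files (as relative paths) that are absent under root."""
    root = Path(root)
    missing = []
    for rel_dir in EXPECTED_MODEL_DIRS:
        for name in EXPECTED_FILES:
            if not (root / rel_dir / name).is_file():
                missing.append(f"{rel_dir}/{name}")
    return missing


# Example usage (assumes the download target used above):
# print(missing_model_files("modeling") or "all expected model files present")
```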

### 2. Point ROSS to the downloaded models

In your ROSS config JSON:

```json
{
    "modeling_dir": "/path/to/modeling",
    ...
}
```
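
If the config file already exists, the key can also be set programmatically. The sketch below only touches `modeling_dir` and leaves every other key as it is; `set_modeling_dir` is a hypothetical helper, and it writes an absolute path to mirror the example above (whether ROSS also accepts relative paths is not stated here).

```python
import json
from pathlib import Path


def set_modeling_dir(config_path, modeling_dir):
    """Set "modeling_dir" in an existing config file, keeping all other keys."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text())
    # Store an absolute path so the config works from any working directory.
    config["modeling_dir"] = str(Path(modeling_dir).resolve())
    config_path.write_text(json.dumps(config, indent=4) + "\n")
    return config


# Example usage:
# set_modeling_dir("my_config.json", "modeling")
```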

Or via CLI:

```bash
python ross/ross_predict.py --modeling-dir /path/to/modeling --config my_config.json
```

### 3. Run simulation

```bash
python ross/ross_predict.py --config my_config.json --record-path results.csv
```

ROSS achieves median prediction errors below 6% for end-to-end (E2E) latency and time per output token (TPOT) across the validated models and platforms, while sustaining a simulation speedup of more than 11x over on-hardware benchmarking.
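
To compare your own simulation results against hardware measurements, one common way to compute a median prediction error is shown below. This is an illustration only: it assumes the error is the relative difference `|predicted - measured| / measured`, which may differ from the exact definition used by the ROSS authors, and the sample values are made up.

```python
from statistics import median


def median_relative_error(predicted, measured):
    """Median of |predicted - measured| / measured over paired samples."""
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return median(errors)


# Made-up TPOT values in milliseconds, for illustration only.
predicted_tpot = [10.4, 19.0, 31.5, 38.8]
measured_tpot = [10.0, 20.0, 30.0, 40.0]
print(f"median relative error: {median_relative_error(predicted_tpot, measured_tpot):.1%}")
```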


## License

Apache 2.0