license: apache-2.0
tags:
  - ross
  - llm-serving
  - simulation
  - xgboost
  - performance-prediction

ROSS Stage-Wise Regression Models

Pre-trained XGBoost regression models for ROSS -- a dual-plane simulator for LLM serving systems.

These models power ROSS's data plane. Given a batch descriptor (request IDs, sequence lengths, model architecture features, and platform performance features), they predict per-iteration latency by decomposing each serving iteration into three stages (pre-forward, forward, and post-forward) and predicting the latency of each stage separately, which lets the simulator account explicitly for CPU-GPU pipeline overlap.
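To make the benefit of per-stage predictions concrete, here is a minimal sketch of how three stage latencies could be combined with and without overlap. The composition rule (`max` of the GPU stage and the summed CPU stages) is a simplifying assumption for illustration, not ROSS's actual pipeline model, and the `StageLatency` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLatency:
    """Predicted latency of one serving iteration, split by stage (ms)."""
    pre_forward_ms: float   # CPU-side work before the GPU forward pass
    forward_ms: float       # GPU forward pass
    post_forward_ms: float  # CPU-side work after the forward pass


def serial_latency_ms(s: StageLatency) -> float:
    """No overlap: the three stages run back to back."""
    return s.pre_forward_ms + s.forward_ms + s.post_forward_ms


def overlapped_latency_ms(s: StageLatency) -> float:
    """Assumed steady-state overlap: CPU and GPU work run concurrently,
    so the slower side sets the iteration time."""
    return max(s.forward_ms, s.pre_forward_ms + s.post_forward_ms)


stages = StageLatency(pre_forward_ms=1.5, forward_ms=12.0, post_forward_ms=0.5)
print(serial_latency_ms(stages))      # 14.0
print(overlapped_latency_ms(stages))  # 12.0
```

A single end-to-end latency prediction could not distinguish these two cases, which is why the stages are modeled separately.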

Model Overview

Component	Description
Algorithm	XGBoost regressor
Training data	Sparse profiling traces collected on NVIDIA H200 and B200 GPUs
Prediction target	Per-stage iteration latency (ms) for each of pre-forward, forward, and post-forward
Input features	Batch shape, model architecture features, platform performance features
Supported frameworks	vLLM, SGLang

Directory Structure

sgl/                              # SGLang backend models
  dense/
    prefill/
      pre_forward_trained_models/xgboost_model/
      forward_trained_models/xgboost_model/
    decode/
      pre_forward_trained_models/xgboost_model/
      forward_trained_models/xgboost_model/
    post_forward_trained_models/xgboost_model/
  moe_foward/
    prefill/
      forward_trained_models/xgboost_model/
    decode/
      forward_trained_models/xgboost_model/
vllm/                             # vLLM backend models
  dense/
    pre_forward_trained_models/xgboost_model/
    forward_trained_models/xgboost_model/
    post_forward_trained_models/xgboost_model/
  moe/
    forward_trained_models/xgboost_model/
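
The layout differs between backends: SGLang splits most models by prefill/decode phase, while vLLM does not. The helper below is a hypothetical sketch (`stage_model_dir` is not part of ROSS) that maps a backend, architecture, stage, and phase to a directory, assuming the indentation of the listing above is accurate.

```python
from pathlib import Path

# Stage directory names as they appear in the listing above.
STAGE_DIRS = {
    "pre_forward": "pre_forward_trained_models",
    "forward": "forward_trained_models",
    "post_forward": "post_forward_trained_models",
}


def stage_model_dir(root, backend, arch, stage, phase=None):
    """Return the xgboost_model/ directory for one stage model.

    backend: "sgl" or "vllm"; arch: "dense" or "moe";
    stage: a key of STAGE_DIRS; phase: "prefill" or "decode"
    (only needed where SGLang splits models by phase).
    """
    if arch not in ("dense", "moe"):
        raise ValueError(f"unknown architecture: {arch!r}")
    if arch == "moe" and stage != "forward":
        raise ValueError("the listing contains only forward models for MoE")
    stage_dir = STAGE_DIRS[stage]

    if backend == "vllm":
        base = Path(root) / "vllm" / arch
    elif backend == "sgl":
        # "moe_foward" is spelled exactly as in the listing.
        base = Path(root) / "sgl" / ("dense" if arch == "dense" else "moe_foward")
        # SGLang's post-forward model sits directly under dense/.
        if stage != "post_forward":
            if phase not in ("prefill", "decode"):
                raise ValueError("SGLang models need phase='prefill' or 'decode'")
            base = base / phase
    else:
        raise ValueError(f"unknown backend: {backend!r}")
    return base / stage_dir / "xgboost_model"


print(stage_model_dir("modeling", "sgl", "dense", "forward", phase="decode"))
print(stage_model_dir("modeling", "vllm", "moe", "forward"))
```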

Each xgboost_model/ directory contains:

model.json -- the serialized XGBoost model
model_metadata.json -- feature names, training metadata
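
After downloading, it can be useful to confirm that every model directory is complete and to inspect a model directly. The sketch below is not part of ROSS; the helper names are hypothetical. It relies only on the two file names listed above and on XGBoost's standard `Booster.load_model` call. The schema of `model_metadata.json` is not documented here, so the code loads it as plain JSON without assuming any key names.

```python
import json
from pathlib import Path

REQUIRED_FILES = ("model.json", "model_metadata.json")


def find_model_dirs(root):
    """Return every xgboost_model/ directory under root, sorted by path."""
    return sorted(p for p in Path(root).rglob("xgboost_model") if p.is_dir())


def missing_files(model_dir):
    """List the required files that are absent from one model directory."""
    return [name for name in REQUIRED_FILES
            if not (Path(model_dir) / name).is_file()]


def load_stage_model(model_dir):
    """Load the booster and its metadata from one xgboost_model/ directory."""
    # Imported lazily so the file checks above work without xgboost installed.
    import xgboost as xgb

    model_dir = Path(model_dir)
    booster = xgb.Booster()
    booster.load_model(str(model_dir / "model.json"))
    with open(model_dir / "model_metadata.json", encoding="utf-8") as f:
        metadata = json.load(f)  # inspect metadata.keys() to see what is stored
    return booster, metadata
```

For example, `[d for d in find_model_dirs("modeling") if missing_files(d)]` lists any incomplete directories in a download placed at `modeling/`.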

Supported Platforms

GPU	Status
NVIDIA H200	Pre-trained models included
NVIDIA B200	Pre-trained models included

New platforms can be added by running the profiling scripts in the ROSS repository's collector/ directory.

Validated LLM Families

Family	Variants
Llama-3.1	8B, 70B
Qwen2.5	72B-Instruct
Qwen3	32B, 30B-A3B (MoE), 235B-A22B (MoE), QwQ 32B
DeepSeek-V3	671B (MoE)
gpt-oss	20b (MoE), 120b (MoE)

The stage-wise regressor takes model configuration features as input instead of relying on per-model kernel calibration, so new models within the supported families generally work out of the box.
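
As an illustration of what "model configuration features" can look like, the sketch below flattens a Hugging Face-style config dict into numeric features. The selected fields and the output names are assumptions made for this example; the feature list ROSS actually uses is recorded in each `model_metadata.json`. The sample values are in the style of an 8B Llama-class config.

```python
def architecture_features(config):
    """Flatten a Hugging Face-style model config dict into numeric features.

    The chosen fields are an illustrative guess, not ROSS's actual feature list.
    """
    heads = config["num_attention_heads"]
    return {
        "num_layers": config["num_hidden_layers"],
        "hidden_size": config["hidden_size"],
        "ffn_size": config["intermediate_size"],
        "num_heads": heads,
        # Configs without grouped-query attention omit this key.
        "num_kv_heads": config.get("num_key_value_heads", heads),
        # Some configs set head_dim explicitly; otherwise derive it.
        "head_dim": config.get("head_dim", config["hidden_size"] // heads),
        "vocab_size": config["vocab_size"],
    }


example_config = {
    "num_hidden_layers": 32,
    "hidden_size": 4096,
    "intermediate_size": 14336,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "vocab_size": 128256,
}
features = architecture_features(example_config)
print(features["head_dim"], features["num_kv_heads"])  # 128 8
```

Because the inputs are architectural quantities rather than model names, a different size within the same family only changes the feature values, not the regressor.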

Usage

1. Download

# Using huggingface-cli
huggingface-cli download CharlesCAOO/ross --local-dir modeling
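
The same download can also be scripted from Python. This is a minimal sketch using `snapshot_download` from the `huggingface_hub` package; the wrapper name `download_models` is hypothetical, and the call requires network access.

```python
REPO_ID = "CharlesCAOO/ross"


def download_models(local_dir="modeling"):
    """Download the whole model repository into local_dir and return its path."""
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Usage (needs network access):
#   path = download_models("modeling")
```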

2. Point ROSS to the downloaded models

In your ROSS config JSON:

{
    "modeling_dir": "/path/to/modeling",
    ...
}

Or via CLI:

python ross/ross_predict.py --modeling-dir /path/to/modeling --config my_config.json
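
If you prefer to set the path programmatically, the `modeling_dir` key can be written into an existing config file while leaving every other key untouched. The helper below is a hypothetical sketch and assumes only that the config is a JSON object.

```python
import json
from pathlib import Path


def set_modeling_dir(config_path, modeling_dir):
    """Set "modeling_dir" in a ROSS config JSON, preserving all other keys."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Store an absolute path so the config works from any working directory.
    config["modeling_dir"] = str(Path(modeling_dir).expanduser().resolve())
    config_path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config


# Usage:
#   set_modeling_dir("my_config.json", "modeling")
```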

3. Run simulation

python ross/ross_predict.py --config my_config.json --record-path results.csv

ROSS achieves median prediction errors below 6% for end-to-end (E2E) latency and time per output token (TPOT) across the validated models and platforms, while running more than 11x faster than on-hardware benchmarking.
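
To check these figures against your own measurements, you can compare simulated and measured values directly. The sketch below assumes the error metric is the median absolute relative error, which this README does not define explicitly; the function names and sample numbers are hypothetical.

```python
from statistics import median


def median_relative_error(predicted, measured):
    """Median of |predicted - measured| / measured over paired samples."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return median(abs(p - m) / m for p, m in zip(predicted, measured))


def speedup(benchmark_seconds, simulation_seconds):
    """How many times faster the simulation ran than the hardware benchmark."""
    return benchmark_seconds / simulation_seconds


# Made-up sample values, e.g. E2E latency in ms for three workloads.
predicted = [98.0, 105.0, 51.0]
measured = [100.0, 100.0, 50.0]
print(f"{median_relative_error(predicted, measured):.1%}")  # 2.0%
print(speedup(660.0, 60.0))                                 # 11.0
```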

License

Apache 2.0