ROSS Stage-Wise Regression Models

Pre-trained XGBoost regression models for ROSS -- a dual-plane simulator for LLM serving systems.

These models power ROSS's data plane. Given a batch descriptor (request IDs, sequence lengths, model architecture features, and platform performance features), they predict per-iteration latency by decomposing each serving iteration into pre-forward, forward, and post-forward stages, which lets the simulator capture CPU-GPU pipeline overlap explicitly.

Model Overview

Component	Description
Algorithm	XGBoost regressor
Training data	Sparse profiling traces collected on NVIDIA H200 and B200 GPUs
Prediction target	Per-stage iteration latency (ms) for each of pre-forward, forward, and post-forward
Input features	Batch shape, model architecture features, platform performance features
Supported frameworks	vLLM, SGLang

Directory Structure

sgl/                              # SGLang backend models
  dense/
    prefill/
      pre_forward_trained_models/xgboost_model/
      forward_trained_models/xgboost_model/
    decode/
      pre_forward_trained_models/xgboost_model/
      forward_trained_models/xgboost_model/
    post_forward_trained_models/xgboost_model/
  moe_foward/
    prefill/
      forward_trained_models/xgboost_model/
    decode/
      forward_trained_models/xgboost_model/
vllm/                             # vLLM backend models
  dense/
    pre_forward_trained_models/xgboost_model/
    forward_trained_models/xgboost_model/
    post_forward_trained_models/xgboost_model/
  moe/
    forward_trained_models/xgboost_model/

Each xgboost_model/ directory contains:

model.json -- the serialized XGBoost model
model_metadata.json -- feature names, training metadata
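
After downloading, it can be useful to confirm that every xgboost_model/ directory in the layout above actually contains both files. The following is a minimal sketch using only the Python standard library; the helper name find_model_dirs is hypothetical and is not part of ROSS.

```python
from pathlib import Path

# The two files each xgboost_model/ directory is expected to contain.
REQUIRED_FILES = ("model.json", "model_metadata.json")


def find_model_dirs(modeling_root):
    """Map each xgboost_model/ directory (path relative to the root)
    to the list of required files that are missing from it."""
    root = Path(modeling_root)
    report = {}
    for model_dir in sorted(root.rglob("xgboost_model")):
        if not model_dir.is_dir():
            continue  # skip stray files that happen to share the name
        missing = [
            name for name in REQUIRED_FILES
            if not (model_dir / name).is_file()
        ]
        report[model_dir.relative_to(root).as_posix()] = missing
    return report
```

An empty list for a directory means both files are present; any other value names the files to re-download.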

Supported Platforms

GPU	Status
NVIDIA H200	Pre-trained models included
NVIDIA B200	Pre-trained models included

New platforms can be added by running the profiling scripts in the ROSS repository's collector/ directory.

Validated LLM Families

Family	Variants
Llama-3.1	8B, 70B
Qwen2.5	72B-Instruct
Qwen3	32B, 30B-A3B (MoE), 235B-A22B (MoE), QwQ 32B
DeepSeek-V3	671B (MoE)
gpt-oss	20b (MoE), 120b (MoE)

The stage-wise regressor takes model configuration features as input instead of relying on per-model kernel calibration, so new models within the supported families generally work out of the box.

Usage

1. Download

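# Using huggingface-cli (gated repository: accept the access conditions
# on the model page and run `huggingface-cli login` first)
huggingface-cli download CharlesCAOO/ross --local-dir modeling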

2. Point ROSS to the downloaded models

In your ROSS config JSON:

{
    "modeling_dir": "/path/to/modeling",
    ...
}

Or via CLI:

python ross/ross_predict.py --modeling-dir /path/to/modeling --config my_config.json
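
The config-file route can also be scripted. The sketch below assumes only what is shown above, namely that the config is a JSON object with a "modeling_dir" key; the helper name set_modeling_dir is hypothetical, and every other key in the file is preserved untouched.

```python
import json
from pathlib import Path


def set_modeling_dir(config_path, modeling_dir):
    """Write modeling_dir into an existing config JSON, keeping other keys."""
    config_path = Path(config_path)
    modeling_dir = Path(modeling_dir).expanduser().resolve()
    if not modeling_dir.is_dir():
        # Fail early rather than letting the simulator hit a bad path later.
        raise FileNotFoundError(f"modeling directory not found: {modeling_dir}")
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["modeling_dir"] = str(modeling_dir)
    config_path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config
```

For example, set_modeling_dir("my_config.json", "modeling") stores the absolute path of the download directory in the config.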

3. Run simulation

python ross/ross_predict.py --config my_config.json --record-path results.csv

ROSS achieves median prediction errors below 6% for end-to-end (E2E) latency and time per output token (TPOT) across the validated models and platforms, while running more than 11x faster than on-hardware benchmarking.
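
To check a median-error figure of this kind against your own measurements, one common definition is the median of the per-run relative errors |predicted - measured| / measured. The sketch below assumes that definition; the exact metric used in the ROSS evaluation is not specified here, and the function name and the sample numbers are purely illustrative.

```python
from statistics import median


def median_relative_error(predicted, measured):
    """Median of |predicted - measured| / measured over paired runs."""
    if not predicted or len(predicted) != len(measured):
        raise ValueError("need two non-empty sequences of equal length")
    if any(m <= 0 for m in measured):
        raise ValueError("measured latencies must be positive")
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return median(errors)


# Hypothetical latencies in ms (not ROSS results).
measured_ms = [100.0, 200.0, 50.0]
predicted_ms = [104.0, 190.0, 51.0]
print(f"{median_relative_error(predicted_ms, measured_ms):.1%}")
```

The per-run errors here are 4%, 5%, and 2%, so the reported median is 4%.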

License

Apache 2.0
