ROSS Stage-Wise Regression Models

Pre-trained XGBoost regression models for ROSS -- a dual-plane simulator for LLM serving systems.

These models power ROSS's data plane: given a batch descriptor (request IDs, sequence lengths, model architecture features, and platform performance features), they predict per-iteration latency by decomposing each serving iteration into pre-forward, forward, and post-forward stages, explicitly capturing CPU-GPU pipeline overlap.
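To make the idea of stage-wise decomposition concrete, here is a minimal sketch of how three per-stage predictions could be combined into one iteration latency. It is an illustration only, not ROSS's actual composition rule: the `StageLatency` type, both function names, and the assumption that pre-forward and post-forward run on the CPU while forward runs on the GPU are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StageLatency:
    """Hypothetical container for one iteration's per-stage predictions (ms)."""
    pre_forward_ms: float   # assumed CPU-side work before the forward pass
    forward_ms: float       # assumed GPU-side forward pass
    post_forward_ms: float  # assumed CPU-side work after the forward pass


def serial_latency(s: StageLatency) -> float:
    """No overlap: the three stages run back to back."""
    return s.pre_forward_ms + s.forward_ms + s.post_forward_ms


def overlapped_latency(s: StageLatency) -> float:
    """Idealized full overlap (an assumption, not ROSS's formula):
    CPU work is hidden behind the GPU forward pass, so the slower
    of the two sides bounds the steady-state iteration time."""
    return max(s.forward_ms, s.pre_forward_ms + s.post_forward_ms)


if __name__ == "__main__":
    stages = StageLatency(pre_forward_ms=2.0, forward_ms=10.0, post_forward_ms=1.0)
    print(serial_latency(stages), overlapped_latency(stages))
```

The gap between the two numbers is why predicting the stages separately matters: a single end-to-end regressor cannot tell whether CPU work is hidden behind the GPU or sits on the critical path.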

Model Overview

Component             Description
Algorithm             XGBoost regressor
Training data         Sparse profiling traces collected on NVIDIA H200 and B200 GPUs
Prediction target     Per-stage iteration latency (ms) for each of pre-forward, forward, and post-forward
Input features        Batch shape, model architecture features, platform performance features
Supported frameworks  vLLM, SGLang

Directory Structure

sgl/                              # SGLang backend models
  dense/
    prefill/
      pre_forward_trained_models/xgboost_model/
      forward_trained_models/xgboost_model/
    decode/
      pre_forward_trained_models/xgboost_model/
      forward_trained_models/xgboost_model/
    post_forward_trained_models/xgboost_model/
  moe_foward/
    prefill/
      forward_trained_models/xgboost_model/
    decode/
      forward_trained_models/xgboost_model/
vllm/                             # vLLM backend models
  dense/
    pre_forward_trained_models/xgboost_model/
    forward_trained_models/xgboost_model/
    post_forward_trained_models/xgboost_model/
  moe/
    forward_trained_models/xgboost_model/

Each xgboost_model/ directory contains:

  • model.json -- the serialized XGBoost model
  • model_metadata.json -- feature names, training metadata
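The layout above can be checked and loaded with a few lines of Python. The sketch below walks a downloaded copy, keeps every xgboost_model/ directory that has both files, and loads one regressor with XGBoost's `Booster.load_model`. The function names are hypothetical, and because the metadata schema is not documented here, the metadata is returned as a plain dictionary without assuming any keys.

```python
import json
from pathlib import Path


def find_stage_models(modeling_dir):
    """Return {relative xgboost_model/ path: (model.json, model_metadata.json)}.

    Directories missing either file are skipped.
    """
    root = Path(modeling_dir)
    found = {}
    for model_dir in sorted(root.rglob("xgboost_model")):
        if not model_dir.is_dir():
            continue
        model_file = model_dir / "model.json"
        meta_file = model_dir / "model_metadata.json"
        if model_file.is_file() and meta_file.is_file():
            key = model_dir.relative_to(root).as_posix()
            found[key] = (model_file, meta_file)
    return found


def load_stage_model(model_file, meta_file):
    """Load one regressor and its metadata (requires the xgboost package)."""
    import xgboost as xgb  # imported lazily so discovery works without it

    booster = xgb.Booster()
    booster.load_model(str(model_file))
    # Schema is not documented here, so return the parsed JSON untouched.
    metadata = json.loads(Path(meta_file).read_text())
    return booster, metadata


if __name__ == "__main__":
    for name in find_stage_models("modeling"):
        print(name)
```

Listing the directories first is a cheap way to confirm that a download is complete before starting a simulation.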

Supported Platforms

GPU          Status
NVIDIA H200  Pre-trained models included
NVIDIA B200  Pre-trained models included

New platforms can be added by running the profiling scripts in the ROSS repository's collector/ directory.

Validated LLM Families

Family       Variants
Llama-3.1    8B, 70B
Qwen2.5      72B-Instruct
Qwen3        32B, 30B-A3B (MoE), 235B-A22B (MoE), QwQ 32B
DeepSeek-V3  671B (MoE)
gpt-oss      20b (MoE), 120b (MoE)

The stage-wise regressor takes model configuration features as input instead of relying on per-model kernel calibration, so new models within the supported families generally work out of the box.

Usage

1. Download

# Using huggingface-cli
huggingface-cli download CharlesCAOO/ross --local-dir modeling

2. Point ROSS to the downloaded models

In your ROSS config JSON:

{
    "modeling_dir": "/path/to/modeling",
    ...
}

Or via CLI:

python ross/ross_predict.py --modeling-dir /path/to/modeling --config my_config.json
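If you manage several configs, the `modeling_dir` entry can also be set with a small script instead of editing each file by hand. The helper below is a hypothetical convenience, not part of ROSS. It assumes only that the config is a JSON object and leaves every other key untouched.

```python
import json
from pathlib import Path


def set_modeling_dir(config_path, modeling_dir):
    """Write an absolute modeling_dir into an existing ROSS config JSON.

    All other keys in the file are preserved as they are.
    """
    config_file = Path(config_path)
    config = json.loads(config_file.read_text())
    config["modeling_dir"] = str(Path(modeling_dir).expanduser().resolve())
    config_file.write_text(json.dumps(config, indent=4) + "\n")
    return config


if __name__ == "__main__":
    set_modeling_dir("my_config.json", "modeling")
```

Storing an absolute path keeps the config valid no matter which directory the simulation is launched from.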

3. Run simulation

python ross/ross_predict.py --config my_config.json --record-path results.csv

ROSS achieves median prediction errors below 6% for end-to-end (E2E) latency and time per output token (TPOT) across the validated models and platforms, while sustaining >11x simulation speedup over on-hardware benchmarking.
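To compare simulated results against your own hardware measurements, a median relative error can be computed as sketched below. The definition used here, |predicted - measured| / measured, is an assumption. This card does not state the exact error formula behind the reported numbers, and the function name is hypothetical.

```python
from statistics import median


def median_relative_error(predicted, measured):
    """Median of |predicted - measured| / measured over paired samples.

    Both sequences must be non-empty, equally long, and hold positive
    measured values (for example latencies in ms).
    """
    if not predicted or len(predicted) != len(measured):
        raise ValueError("need two non-empty sequences of equal length")
    return median(abs(p - m) / m for p, m in zip(predicted, measured))


if __name__ == "__main__":
    # Made-up numbers for illustration only, not ROSS results.
    predicted_ms = [105.0, 98.0, 210.0]
    measured_ms = [100.0, 100.0, 200.0]
    print(f"{median_relative_error(predicted_ms, measured_ms):.1%}")
```

The median is less sensitive than the mean to a few badly mispredicted runs, which makes it a reasonable summary when workloads vary widely.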

License

Apache 2.0
