---
license: apache-2.0
tags:
- llm-routing
- model-selection
- budget-optimization
- nearest-neighbor
language:
- en
library_name: sklearn
pipeline_tag: text-classification
---

# R2-Router: A New Paradigm for LLM Routing with Reasoning

**R2-Router** intelligently routes each query to the optimal (LLM, token budget) pair, jointly optimizing accuracy and inference cost. Ranked **#1** on the [RouterArena](https://routerarena.github.io/) leaderboard.

**Paper**: [R2-Router (arXiv)](https://arxiv.org/abs/2602.02823)

## RouterArena Performance

Official leaderboard results on 8,400 queries:

| Metric | Value |
|--------|-------|
| Accuracy | 71.23% |
| Cost per 1K Queries | $0.061 |
| Arena Score (beta=0.1) | **71.60** |
| Robustness Score | 45.71% |
| Rank | **#1** |

## Quick Start

### Installation

We recommend using [uv](https://docs.astral.sh/uv/) for fast, reliable environment setup:

```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create environment and install dependencies
uv venv .venv && source .venv/bin/activate
uv pip install scikit-learn numpy joblib huggingface_hub vllm
```

### With vLLM Server (Recommended)

Start the embedding server once, then route from any process without reloading the model:

```bash
# Terminal 1: Start vLLM embedding server (runs once, stays alive)
uv pip install vllm
vllm serve Qwen/Qwen3-0.6B --runner pooling --port 8000
```

```python
# Terminal 2: Route queries (connects to the running server)
import sys

from huggingface_hub import snapshot_download

path = snapshot_download("JiaqiXue/r2-router")
sys.path.insert(0, path)

from router import R2Router

router = R2Router.from_pretrained(path, embed_url="http://localhost:8000")
result = router.route_text("Solve this integral")
print(f"Model: {result['model_full_name']}, Budget: {result['token_limit']}")
print(f"Estimated Quality: {result['predicted_quality']:.3f}, Estimated Cost: ${result['predicted_cost']:.6f}")
```

### Adjusting Lambda (Cost-Accuracy Tradeoff)

The `lambda` parameter controls the tradeoff between accuracy and cost:
- **lambda → 1.0**: Minimize cost (routes to cheaper models)
- **lambda → 0.0**: Maximize accuracy (routes to the best model regardless of cost)
- **Default: 0.999** (strongly cost-sensitive, as used in our RouterArena submission)

```python
# Cost-sensitive (default, as submitted to RouterArena)
router = R2Router.from_pretrained(path, lambda_val=0.999)

# Balanced accuracy vs cost
router = R2Router.from_pretrained(path, lambda_val=0.5)

# Accuracy-first (ignores cost, always picks highest quality)
router = R2Router.from_pretrained(path, lambda_val=0.0)

# Override lambda per query
result = router.route_text("Solve this integral", lambda_val=0.5)
```

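The effect of lambda can be seen with a self-contained toy example. The quality and cost numbers below are made up for illustration; the real router obtains them from its learned predictors:

```python
# Toy illustration of the lambda tradeoff between two hypothetical candidates.
candidates = {
    "cheap-model": {"quality": 0.60, "cost": 0.00003},   # cost in $ per query
    "strong-model": {"quality": 0.75, "cost": 0.00200},
}

def pick(lam):
    """Candidate maximizing (1 - lam) * quality - lam * cost."""
    return max(
        candidates,
        key=lambda m: (1 - lam) * candidates[m]["quality"] - lam * candidates[m]["cost"],
    )

print(pick(0.0))    # -> strong-model (accuracy-first)
print(pick(0.999))  # -> cheap-model (cost-sensitive)
```

At lambda near 1 the tiny cost gap dominates the score, so the cheaper candidate wins even though its quality is lower.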
### Train from Scratch

```python
import sys

from huggingface_hub import snapshot_download

path = snapshot_download("JiaqiXue/r2-router")
sys.path.insert(0, path)

from router import R2Router

# Train predictors with custom hyperparameters
router = R2Router.from_training_data(path, k=80, lambda_val=0.999)
```

## Architecture

R2-Router jointly optimizes **which model** to use and **how many tokens** to allocate per query.

### Routing Formula

```
risk(M, b) = (1 - lambda) * predicted_quality(query, M, b) - lambda * predicted_tokens(query, M) * price_M / 1e6
(M*, b*) = argmax_{M, b} risk(M, b)
```

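The rule above can be sketched directly. The `predicted_quality` and `predicted_tokens` functions below are hypothetical stand-ins (in R2-Router they are k-NN predictors over query embeddings), and the prices are illustrative output $/M tokens:

```python
# Minimal sketch of the routing rule: score every (model, budget) pair and
# take the argmax. All predictor outputs here are made up for illustration.
PRICES = {"model-A": 0.33, "model-B": 2.50}   # output $/M tokens (illustrative)
BUDGETS = [100, 200, 400, 800]

def predicted_quality(query, model, budget):
    base = {"model-A": 0.55, "model-B": 0.70}[model]
    return base * min(1.0, budget / 400)      # toy: quality saturates with budget

def predicted_tokens(query, model):
    return {"model-A": 250, "model-B": 320}[model]

def route(query, lam):
    def risk(pair):
        model, budget = pair
        cost = predicted_tokens(query, model) * PRICES[model] / 1e6
        return (1 - lam) * predicted_quality(query, model, budget) - lam * cost
    return max(((m, b) for m in PRICES for b in BUDGETS), key=risk)

print(route("Solve this integral", lam=0.0))    # -> ('model-B', 400), quality-first
print(route("Solve this integral", lam=0.999))  # -> ('model-A', 400), cost-sensitive
```

Note that, as in the formula, the token predictor depends only on the model: the budget enters through the quality predictor, not the cost term.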
### Pipeline

```
Input Query
    |
[1] Embed with Qwen3-0.6B -> 1024-dim vector
    |
[2] For each (model, budget) pair:
      - Predict quality (accuracy)
      - Predict output token count
      - Compute risk = (1-lambda) * quality - lambda * cost
    |
[3] Select (model, budget) with highest risk
    |
Output: (model_name, token_budget)
```

### Model Pool (6 LLMs)

| Model | Output $/M tokens |
|-------|-------------------|
| Qwen3-235B-A22B | $0.463 |
| Qwen3-Next-80B-A3B | $1.10 |
| Qwen3-30B-A3B | $0.33 |
| Qwen3-Coder-Next | $0.30 |
| Gemini 2.5 Flash | $2.50 |
| Claude 3 Haiku | $1.25 |

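A query's estimated output cost follows directly from these prices: predicted output tokens × price / 1e6. As a sanity check, the cost of a full 800-token response (the largest budget) for two of the models above:

```python
# cost = output_tokens * price_per_million_tokens / 1e6
price_per_m = {"Qwen3-30B-A3B": 0.33, "Gemini 2.5 Flash": 2.50}
for model, price in price_per_m.items():
    cost = 800 * price / 1e6
    print(f"{model}: ${cost:.6f}")
# Qwen3-30B-A3B: $0.000264
# Gemini 2.5 Flash: $0.002000
```

Even the priciest model costs well under a cent per query, which is why lambda must sit very close to 1 before cost meaningfully outweighs quality.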
### Token Budgets

Four output token limits: **100, 200, 400, 800** tokens.

### Key Parameters

| Parameter | Value |
|-----------|-------|
| K (neighbors) | 80 |
| Lambda | 0.999 |
| Distance Metric | Cosine |
| Weights | Distance-weighted |
| Embedding Dim | 1024 |

## Repository Contents

```
config.json              # Router configuration (models, budgets, prices, hyperparams)
router.py                # Self-contained inference code (embed + route)
training_data/
  embeddings.npy         # Sub_10 training embeddings (809 x 1024)
  labels.json            # Per-(model, budget) accuracy & token labels
checkpoints/
  quality_knn_*.joblib   # Pre-fitted quality predictors (18 total)
  token_knn_*.joblib     # Pre-fitted token predictors (6 total)
```

### Ways to Use

| Method | GPU? | Description |
|--------|------|-------------|
| `route_text()` + vLLM server | Yes (server) | Start `vllm serve` once, route from anywhere via HTTP |
| `route_text()` + local vLLM | Yes (local) | Auto-loads Qwen3-0.6B on first call, caches it |
| `route(embedding)` | No | Route from a pre-computed 1024-dim embedding |
| `from_training_data(path)` | No | Train your own predictors with custom hyperparameters |

## Training Details

Following [chayan](https://huggingface.co/adaptive-classifier/chayan), we use only the official **sub_10 split** (809 queries, 10% of the full 8,400) for training. No full-set data is used during training or hyperparameter tuning.

- **Training Data**: RouterArena sub_10 split (809 queries)
- **Method**: Nearest-neighbor regression with cosine distance, distance-weighted
- **Evaluation**: Full 8,400 RouterArena queries (no data leakage)
- **Training Time**: < 1 second

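The predictor family described above is small enough to sketch in a few lines. Below is an approximation of distance-weighted cosine k-NN regression on random stand-in data (the real router fits on the sub_10 embeddings and labels shipped in `training_data/`):

```python
import numpy as np

# Sketch of one predictor: k-NN regression with cosine distance and
# inverse-distance weighting. Random data stands in for the real
# 809 x 1024 embeddings (embeddings.npy) and accuracy labels (labels.json).
rng = np.random.default_rng(0)
train_X = rng.normal(size=(809, 1024))   # stand-in for sub_10 query embeddings
train_y = rng.uniform(size=809)          # stand-in for per-(model, budget) accuracy

def knn_predict(query_emb, k=80, eps=1e-9):
    X = train_X / np.linalg.norm(train_X, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    dist = 1.0 - X @ q                   # cosine distance
    idx = np.argsort(dist)[:k]           # k nearest neighbors
    w = 1.0 / (dist[idx] + eps)          # inverse-distance weights
    return float(np.sum(w * train_y[idx]) / np.sum(w))

print(knn_predict(rng.normal(size=1024)))
```

An equivalent fitted estimator in scikit-learn would be `KNeighborsRegressor(n_neighbors=80, metric='cosine', weights='distance')`, which is consistent with the sub-second training time: "fitting" a k-NN model is just storing the 809 training points.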
## Citation

```bibtex
@article{xue2026r2,
  title={R2-Router: A New Paradigm for LLM Routing with Reasoning},
  author={Xue, Jiaqi and Lou, Qian and Xing, Jiarong and Huang, Heng},
  journal={arXiv preprint arXiv:2602.02823},
  year={2026}
}
```

## License

Apache 2.0