docs: note sub_10 only training following chayan

771fd80 verified about 2 months ago

5.99 kB

	---
	license: apache-2.0
	tags:
	- llm-routing
	- model-selection
	- budget-optimization
	- nearest-neighbor
	language:
	- en
	library_name: sklearn
	pipeline_tag: text-classification
	---

	# R2-Router: A New Paradigm for LLM Routing with Reasoning

	R2-Router intelligently routes each query to the optimal (LLM, token budget) pair, jointly optimizing accuracy and inference cost. Ranked #1 on the [RouterArena](https://routerarena.github.io/) leaderboard.

	Paper: [R2-Router (arxiv)](https://arxiv.org/abs/2602.02823)

	## RouterArena Performance

	![RouterArena Leaderboard](leaderboard.png)

	Official leaderboard results on 8,400 queries:

	\| Metric \| Value \|
	\|--------\|-------\|
	\| Accuracy \| 71.23% \|
	\| Cost per 1K Queries \| $0.061 \|
	\| Arena Score (beta=0.1) \| 71.60 \|
	\| Robustness Score \| 45.71% \|
	\| Rank \| #1 \|

	## Quick Start

	### Installation

	We recommend using [uv](https://docs.astral.sh/uv/) for fast, reliable environment setup:

	```bash
	# Install uv (if not already installed)
	curl -LsSf https://astral.sh/uv/install.sh \| sh

	# Create environment and install dependencies
	uv venv .venv && source .venv/bin/activate
	uv pip install scikit-learn numpy joblib huggingface_hub vllm
	```

	### With vLLM Server (Recommended)

	Start the embedding server once, then route from any process without reloading the model:

	```bash
	# Terminal 1: Start vLLM embedding server (runs once, stays alive)
	uv pip install vllm
	vllm serve Qwen/Qwen3-0.6B --runner pooling --port 8000
	```

	```python
	# Terminal 2: Route queries (connects to the running server)
	from huggingface_hub import snapshot_download
	import sys

	path = snapshot_download("JiaqiXue/r2-router")
	sys.path.insert(0, path)

	from router import R2Router

	router = R2Router.from_pretrained(path, embed_url="http://localhost:8000")
	result = router.route_text("Solve this integral")
	print(f"Model: {result['model_full_name']}, Budget: {result['token_limit']}")
	print(f"Estimated Quality: {result['predicted_quality']:.3f}, Estimated Cost: ${result['predicted_cost']:.6f}")
	```

	### Adjusting Lambda (Cost-Accuracy Tradeoff)

	The `lambda` parameter controls the tradeoff between accuracy and cost:
	- lambda → 1.0: Minimize cost (routes to cheaper models)
	- lambda → 0.0: Maximize accuracy (routes to the best model regardless of cost)
	- Default: 0.999 (strongly cost-sensitive, as used in our RouterArena submission)

	```python
	# Cost-sensitive (default, as submitted to RouterArena)
	router = R2Router.from_pretrained(path, lambda_val=0.999)

	# Balanced accuracy vs cost
	router = R2Router.from_pretrained(path, lambda_val=0.5)

	# Accuracy-first (ignores cost, always picks highest quality)
	router = R2Router.from_pretrained(path, lambda_val=0.0)

	# Override lambda per query
	result = router.route_text("Solve this integral", lambda_val=0.5)
	```

	### Train from Scratch

	```python
	from huggingface_hub import snapshot_download
	import sys

	path = snapshot_download("JiaqiXue/r2-router")
	sys.path.insert(0, path)

	from router import R2Router

	# Train predictors with custom hyperparameters
	router = R2Router.from_training_data(path, k=80, lambda_val=0.999)
	```

	## Architecture

	R2-Router jointly optimizes which model to use and how many tokens to allocate per query.

	### Routing Formula

	```
	risk(M, b) = (1 - lambda) * predicted_quality(query, M, b) - lambda * predicted_tokens(query, M) * price_M / 1e6
	(M, b) = argmax risk
	```

	### Pipeline

	```
	Input Query
	\|
	[1] Embed with Qwen3-0.6B -> 1024-dim vector
	\|
	[2] For each (model, budget) pair:
	- Predict quality (accuracy)
	- Predict output token count
	- Compute risk = (1-lambda) * quality - lambda * cost
	\|
	[3] Select (model, budget) with highest risk
	\|
	Output: (model_name, token_budget)
	```

	### Model Pool (6 LLMs)

	\| Model \| Output $/M tokens \|
	\|-------\|------------------\|
	\| Qwen3-235B-A22B \| $0.463 \|
	\| Qwen3-Next-80B-A3B \| $1.10 \|
	\| Qwen3-30B-A3B \| $0.33 \|
	\| Qwen3-Coder-Next \| $0.30 \|
	\| Gemini 2.5 Flash \| $2.50 \|
	\| Claude 3 Haiku \| $1.25 \|

	### Token Budgets

	4 output token limits: 100, 200, 400, 800 tokens.

	### Key Parameters

	\| Parameter \| Value \|
	\|-----------\|-------\|
	\| K (neighbors) \| 80 \|
	\| Lambda \| 0.999 \|
	\| Distance Metric \| Cosine \|
	\| Weights \| Distance-weighted \|
	\| Embedding Dim \| 1024 \|

	## Repository Contents

	```
	config.json # Router configuration (models, budgets, prices, hyperparams)
	router.py # Self-contained inference code (embed + route)
	training_data/
	embeddings.npy # Sub_10 training embeddings (809 x 1024)
	labels.json # Per-(model, budget) accuracy & token labels
	checkpoints/
	quality_knn_*.joblib # Pre-fitted quality predictors (18 total)
	token_knn_*.joblib # Pre-fitted token predictors (6 total)
	```

	### Ways to Use

	\| Method \| GPU? \| Description \|
	\|--------\|------\|-------------\|
	\| `route_text()` + vLLM server \| Yes (server) \| Start `vllm serve` once, route from anywhere via HTTP \|
	\| `route_text()` + local vLLM \| Yes (local) \| Auto-loads Qwen3-0.6B on first call, caches it \|
	\| `route(embedding)` \| No \| Route from pre-computed 1024-dim embedding \|
	\| `from_training_data(path)` \| No \| Train your own predictors with custom hyperparameters \|

	## Training Details

	Following [chayan](https://huggingface.co/adaptive-classifier/chayan), we only use the official sub_10 split (809 queries, 10% of the full 8,400) for training. No full-set data is used during training or hyperparameter tuning.

	- Training Data: RouterArena sub_10 split (809 queries)
	- Method: Nearest-neighbor regression with cosine distance, distance-weighted
	- Evaluation: Full 8,400 RouterArena queries (no data leakage)
	- Training Time: < 1 second

	## Citation

	```bibtex
	@article{xue2026r2,
	title={R2-Router: A New Paradigm for LLM Routing with Reasoning},
	author={Xue, Jiaqi and Lou, Qian and Xing, Jiarong and Huang, Heng},
	journal={arXiv preprint arXiv:2602.02823},
	year={2026}
	}
	```

	## License

	Apache 2.0