---
license: apache-2.0
tags:
- oscar
- int2
- kv-cache
- quantization
- rotation
- sglang
pipeline_tag: text-generation
---
<p align="center">
<img src="https://huggingface.co/Zhongzhu/OSCAR-RotationZoo/resolve/main/oscar_logo_kv_transparent.png" alt="OSCAR INT2 KV-Cache" width="180"/>
</p>
# OSCAR RotationZoo
Precomputed K/V rotation matrices for **OSCAR INT2 KV-cache quantization**.
This repository contains the artifacts for the paper:
**OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization**
*Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu*
- 📄 **Paper**: [arXiv:2605.17757](https://arxiv.org/abs/2605.17757)
- 🌐 **Website**: https://oscar-quantize.github.io/
- 💻 **Code**: https://github.com/FutureMLS-Lab/OSCAR
OSCAR captures Q/K/V activations on a small calibration set, estimates attention-aware K/V covariance offline, and derives per-layer orthogonal rotations that align INT2 quantization with the directions attention actually consumes. The result is ~7× compression of the KV-cache memory footprint with a single-digit percentage-point accuracy drop on GPQA for dense reasoning models; for the models in the table below, GPQA changes by between -1.11 and +1.91 points relative to BF16.
This repo packages the rotations as drop-in `.pt` files so you don't need to re-run the Q/K/V dump and eigendecomposition yourself.
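As a rough sanity check on the ~7× figure, the sketch below computes the effective bits per cached value for group-wise INT2 storage. It assumes each group of 128 values stores one 16-bit scale and one 16-bit zero point (an assumption; the storage layout is not specified here) and ignores any tokens kept at higher precision. The function name and parameters are illustrative only.

```python
def kv_compression_ratio(
    baseline_bits: float = 16.0,            # BF16 baseline
    code_bits: float = 2.0,                 # INT2 codes
    group_size: int = 128,                  # values sharing one set of metadata
    metadata_bits_per_group: float = 32.0,  # ASSUMPTION: 16-bit scale + 16-bit zero point
) -> float:
    """Return baseline bits divided by effective bits per cached value."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    effective_bits = code_bits + metadata_bits_per_group / group_size
    return baseline_bits / effective_bits


if __name__ == "__main__":
    # 16 / (2 + 32 / 128) = 16 / 2.25
    print(f"{kv_compression_ratio():.2f}x")
```

Under these assumptions the per-group metadata costs 0.25 bits per value, which is why the ratio lands slightly above 7 rather than at the ideal 8.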
## Available rotations
| Model | Calibration | GPQA (BF16) | GPQA (OSCAR INT2) |
|---|---|---:|---:|
| `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt83_group128` | 67.27 | 67.17 |
| `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt85_group128` (fresh re-dump) | 67.27 | n/a |
| `Qwen/Qwen3-8B` | `seq20000_prompt83_group128` | 56.67 | 55.56 |
| `Qwen/Qwen3-32B` | `seq16000_prompt69_group128` | 58.49 | 60.40 |
| `zai-org/GLM-4.7-FP8` | `seq10000_prompt43_group128` | 73.23 | 73.57 |
`seq<T>_prompt<N>_group<G>` notation: `T` = total calibration tokens, `N` = calibration prompt count, `G` = INT2 quant group size along head_dim.
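The directory naming scheme above is regular enough to parse mechanically. A minimal sketch follows; `CalibrationTag` and `parse_calibration_tag` are hypothetical helper names, not part of the OSCAR codebase.

```python
import re
from dataclasses import dataclass

# Matches tags such as "seq20000_prompt83_group128".
_TAG_RE = re.compile(r"seq(?P<T>\d+)_prompt(?P<N>\d+)_group(?P<G>\d+)")


@dataclass(frozen=True)
class CalibrationTag:
    tokens: int      # T: total calibration tokens
    prompts: int     # N: calibration prompt count
    group_size: int  # G: INT2 quant group size along head_dim


def parse_calibration_tag(tag: str) -> CalibrationTag:
    """Split a seq<T>_prompt<N>_group<G> tag into its three integers."""
    match = _TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a calibration tag: {tag!r}")
    return CalibrationTag(int(match["T"]), int(match["N"]), int(match["G"]))


if __name__ == "__main__":
    print(parse_calibration_tag("seq20000_prompt83_group128"))
```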
## File format
Each rotation directory contains:
- `k_rotation_qqt_r_h_pbr.pt`: K-side rotation `R_K = R · H · P_br`, where `R = eigvec(Σ_Q)` is fit on Q's attention-aware covariance, `H` is a head-dim Hadamard matrix, and `P_br` is the eigenvalue-sorted bit-reversal permutation
- `v_rotation_sst_r_h_pbr.pt`: V-side rotation built on the score-weighted V covariance `Σ_V = V^T diag(K^T (Q^T Q) K) V`
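The K-side composition of an eigenvector basis, a Hadamard matrix, and a bit-reversal permutation can be illustrated with a toy NumPy sketch. Everything here is an assumption rather than the OSCAR implementation: the covariance is the plain second moment of random stand-in activations (not the attention-aware estimate), eigenvectors are ordered by descending eigenvalue, `H` is a normalized Sylvester Hadamard matrix, and `P_br` is a plain bit-reversal column permutation. The sketch only shows that a product of orthogonal factors stays orthogonal, so rotating queries and keys by the same matrix leaves their dot products unchanged before quantization.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Sylvester-construction Hadamard matrix, scaled to be orthonormal."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def bit_reversal_permutation(n: int) -> np.ndarray:
    """Permutation matrix whose column i is the unit vector at bit-reversed i."""
    bits = n.bit_length() - 1
    idx = [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]
    return np.eye(n)[:, idx]


def toy_k_rotation(q: np.ndarray) -> np.ndarray:
    """Compose R @ H @ P_br from stand-in activations of shape (tokens, head_dim)."""
    head_dim = q.shape[1]
    cov = q.T @ q / q.shape[0]        # ASSUMPTION: plain second moment
    _, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    r = eigvecs[:, ::-1]              # ASSUMPTION: descending eigenvalue order
    return r @ hadamard(head_dim) @ bit_reversal_permutation(head_dim)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((512, 8))
    k = rng.standard_normal((512, 8))
    rot = toy_k_rotation(q)
    # Orthogonality, and invariance of attention logits under a shared rotation.
    print(np.allclose(rot.T @ rot, np.eye(8)))
    print(np.allclose(q @ k.T, (q @ rot) @ (k @ rot).T))
```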
File layout (PyTorch state-dict):
```python
{
    "format_version": 1,
    "objective": "qqt_r_h_pbr",  # or "sst_r_h_pbr" for V
    "source_grouping": "layer",
    "layers": {
        0: {"layer_id": 0, "rotation": tensor(head_dim, head_dim)},
        1: {"layer_id": 1, "rotation": tensor(head_dim, head_dim)},
        ...
    }
}
```
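Before pointing a server at a downloaded file, it can help to check that it matches the layout above. The sketch below is a hypothetical validator (`validate_rotation_payload` and `load_rotation` are illustrative names, not OSCAR APIs). It assumes every layer shares one `head_dim`, and that the file loads with a plain `torch.load`.

```python
from typing import Any, Mapping

EXPECTED_OBJECTIVES = {"qqt_r_h_pbr", "sst_r_h_pbr"}


def validate_rotation_payload(payload: Mapping[str, Any]) -> int:
    """Check the documented keys and return the shared head_dim."""
    if payload["format_version"] != 1:
        raise ValueError(f"unexpected format_version: {payload['format_version']!r}")
    if payload["objective"] not in EXPECTED_OBJECTIVES:
        raise ValueError(f"unexpected objective: {payload['objective']!r}")
    head_dims = set()
    for key, entry in payload["layers"].items():
        if entry["layer_id"] != key:
            raise ValueError(f"layer key {key!r} does not match its layer_id")
        rows, cols = tuple(entry["rotation"].shape)
        if rows != cols:
            raise ValueError(f"layer {key}: rotation is not square")
        head_dims.add(rows)
    if len(head_dims) != 1:
        # ASSUMPTION: all layers use the same head_dim.
        raise ValueError(f"expected one head_dim, found {sorted(head_dims)}")
    return head_dims.pop()


def load_rotation(path: str) -> Mapping[str, Any]:
    """Load a rotation file on CPU and validate it."""
    import torch  # imported lazily so the validator works without PyTorch

    payload = torch.load(path, map_location="cpu")
    validate_rotation_payload(payload)
    return payload
```

For example, `load_rotation("./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128/k_rotation_qqt_r_h_pbr.pt")` would return the dictionary or raise if a key is missing or a rotation is not square.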
## How to use
### 1. Download the rotation for your model
```bash
pip install huggingface_hub
```
```python
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="Zhongzhu/OSCAR-RotationZoo",
    allow_patterns="Qwen3-8B/**",
    local_dir="./oscar_rotations",
)
# rotations now at ./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128/
```
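After the download, a quick way to confirm that both rotation files arrived is to scan the local directory for folders containing the K and V files named in the file-format section. This is a small standard-library sketch; `find_rotation_dirs` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List, Union

K_FILE = "k_rotation_qqt_r_h_pbr.pt"
V_FILE = "v_rotation_sst_r_h_pbr.pt"


def find_rotation_dirs(root: Union[str, Path]) -> List[Path]:
    """Return every directory under root that holds both rotation files."""
    root = Path(root)
    return sorted(
        k_path.parent
        for k_path in root.rglob(K_FILE)
        if (k_path.parent / V_FILE).is_file()
    )


if __name__ == "__main__":
    for rot_dir in find_rotation_dirs("./oscar_rotations"):
        print(rot_dir)
```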
### 2. Serve with sglang-research using the rotation
Clone https://github.com/FutureMLS-Lab/OSCAR and set up the single `oscar` conda env, then point the eval driver at your downloaded rotation:
```bash
ROT_DIR=./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128 \
bash rotation/qwen3-8B/eval_gpqa.sh
```
The driver internally launches sglang with these flags:
```bash
SGLANG_ENABLE_MIXED_KV_WINDOWS=1 \
SGLANG_OSCAR_ROTATION_MODE=oscar \
SGLANG_OSCAR_K_ROTATION_PATH=$ROT_DIR/k_rotation_qqt_r_h_pbr.pt \
SGLANG_OSCAR_V_ROTATION_PATH=$ROT_DIR/v_rotation_sst_r_h_pbr.pt \
SGLANG_OSCAR_K_CLIP_RATIO=0.96 \
SGLANG_OSCAR_V_CLIP_RATIO=0.92 \
SGLANG_OSCAR_ABSORB_V_ROTATION=1 \
SGLANG_MIXED_KV_PREFIX_TOKENS=64 \
SGLANG_MIXED_KV_RECENT_TOKENS=256 \
HADAMARD_ORDER=128 \
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B \
  --tensor-parallel-size 1 \
  --kv-cache-dtype int2 \
  --kv-cache-quant-group-size 128 \
  --prefill-attention-backend fa3 \
  --decode-attention-backend triton \
  --disable-radix-cache \
  --disable-custom-all-reduce \
  --trust-remote-code
```
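If you prefer to launch the server from Python instead of the shell driver, the same environment variables and flags can be assembled programmatically. The sketch below only mirrors the values listed above; `oscar_env` and `launch_command` are hypothetical helper names, and the flags are assumed to be understood by the sglang build shipped with the OSCAR repository, not necessarily by upstream sglang.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def oscar_env(rot_dir: str, base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Copy the base environment and add the OSCAR settings listed above."""
    rot = Path(rot_dir)
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "SGLANG_ENABLE_MIXED_KV_WINDOWS": "1",
        "SGLANG_OSCAR_ROTATION_MODE": "oscar",
        "SGLANG_OSCAR_K_ROTATION_PATH": str(rot / "k_rotation_qqt_r_h_pbr.pt"),
        "SGLANG_OSCAR_V_ROTATION_PATH": str(rot / "v_rotation_sst_r_h_pbr.pt"),
        "SGLANG_OSCAR_K_CLIP_RATIO": "0.96",
        "SGLANG_OSCAR_V_CLIP_RATIO": "0.92",
        "SGLANG_OSCAR_ABSORB_V_ROTATION": "1",
        "SGLANG_MIXED_KV_PREFIX_TOKENS": "64",
        "SGLANG_MIXED_KV_RECENT_TOKENS": "256",
        "HADAMARD_ORDER": "128",
    })
    return env


def launch_command(model_path: str, group_size: int = 128) -> List[str]:
    """Build the argument list for the launch shown above."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tensor-parallel-size", "1",
        "--kv-cache-dtype", "int2",
        "--kv-cache-quant-group-size", str(group_size),
        "--prefill-attention-backend", "fa3",
        "--decode-attention-backend", "triton",
        "--disable-radix-cache",
        "--disable-custom-all-reduce",
        "--trust-remote-code",
    ]


if __name__ == "__main__":
    rot_dir = "./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128"
    print(" ".join(launch_command("Qwen/Qwen3-8B")))
    print(oscar_env(rot_dir, base_env={})["SGLANG_OSCAR_K_ROTATION_PATH"])
```

Passing the two results to `subprocess.run(cmd, env=env, check=True)` would start the server; the sketch prints them instead so it can be inspected without a GPU.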
## Reproducing from scratch
If you want to fit your own rotation on a different calibration set, the OSCAR pipeline is end-to-end reproducible:
```bash
git clone https://github.com/FutureMLS-Lab/OSCAR.git
cd OSCAR
bash rotation/qwen3-8B/save_qkv_8b.sh        # phase 1: dump Q/K/V
bash rotation/qwen3-8B/compute_rotation.sh   # phase 2: fit R = eigvec(Σ_Q)
```
## Citation
```bibtex
@article{zhou2026oscar,
  title   = {OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization},
  author  = {Zhou, Zhongzhu and Zhuang, Donglin and Li, Jisen and Chen, Ziyan and Song, Shuaiwen Leon and Athiwaratkun, Ben and Wu, Xiaoxia},
  journal = {arXiv preprint arXiv:2605.17757},
  year    = {2026},
  note    = {Together AI; University of Sydney; UIUC},
}
```