---
license: apache-2.0
tags:
- oscar
- int2
- kv-cache
- quantization
- rotation
- sglang
pipeline_tag: text-generation
---
<p align="center">
<img src="https://huggingface.co/Zhongzhu/OSCAR-RotationZoo/resolve/main/oscar_logo_kv_transparent.png" alt="OSCAR INT2 KV-Cache" width="180"/>
</p>
# OSCAR RotationZoo
Precomputed K/V rotation matrices for **OSCAR INT2 KV-cache quantization**.
This repository contains the artifacts for the paper:
**OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization**
*Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu*
- 📄 **Paper** — [arXiv:2605.17757](https://arxiv.org/abs/2605.17757)
- 🌐 **Website** — https://oscar-quantize.github.io/
- 💻 **Code** — https://github.com/FutureMLS-Lab/OSCAR
OSCAR captures Q/K/V activations on a small calibration set, estimates attention-aware K/V covariance offline, and derives per-layer orthogonal rotations that align INT2 quantization with the directions attention actually consumes. The result is roughly 7× compression of the KV-cache memory footprint with at most a single-digit percentage-point (pp) accuracy drop on GPQA for dense reasoning models.
This repo packages the rotations as drop-in `.pt` files so you don't need to re-run the Q/K/V dump and eigendecomposition yourself.
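As a rough sanity check on the ~7× figure, the sketch below computes the effective bits per cached element for group-wise INT2. It assumes each group of 128 values stores one 16-bit scale and one 16-bit zero-point next to the 2-bit codes; that metadata layout is an assumption for illustration, not something this card specifies, and `kv_compression_ratio` is a hypothetical helper.
```python
def kv_compression_ratio(group_size=128, code_bits=2, scale_bits=16,
                         zero_point_bits=16, baseline_bits=16):
    """Ratio of baseline (BF16) bits to quantized bits per cached element.

    scale_bits and zero_point_bits are ASSUMED per-group metadata sizes.
    """
    # Per-group metadata is amortized over every element in the group.
    metadata_bits_per_element = (scale_bits + zero_point_bits) / group_size
    return baseline_bits / (code_bits + metadata_bits_per_element)


print(f"{kv_compression_ratio():.2f}x")  # 16 / 2.25 -> 7.11x
```
Under these assumptions the per-element ratio lands just above 7×. Any tokens held at higher precision would pull the end-to-end ratio below this figure.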
## Available rotations
| Model | Calibration | GPQA (BF16) | GPQA (OSCAR INT2) |
|---|---|---:|---:|
| `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt83_group128` | 67.27 | 67.17 |
| `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt85_group128` (fresh re-dump) | 67.27 | — |
| `Qwen/Qwen3-8B` | `seq20000_prompt83_group128` | 56.67 | 55.56 |
| `Qwen/Qwen3-32B` | `seq16000_prompt69_group128` | 58.49 | 60.40 |
| `zai-org/GLM-4.7-FP8` | `seq10000_prompt43_group128` | 73.23 | 73.57 |
`seq<T>_prompt<N>_group<G>` notation: `T` = total calibration tokens, `N` = calibration prompt count, `G` = INT2 quant group size along head_dim.
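If you script over several rotation directories, the tag can be parsed mechanically. This is a minimal sketch; `CalibrationTag` and `parse_calibration_tag` are hypothetical names, not part of the OSCAR codebase.
```python
import re
from typing import NamedTuple


class CalibrationTag(NamedTuple):
    seq_tokens: int    # T
    prompt_count: int  # N
    group_size: int    # G


_TAG_RE = re.compile(r"seq(\d+)_prompt(\d+)_group(\d+)")


def parse_calibration_tag(tag: str) -> CalibrationTag:
    """Parse a directory name such as 'seq20000_prompt83_group128'."""
    match = _TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a seq<T>_prompt<N>_group<G> tag: {tag!r}")
    return CalibrationTag(*map(int, match.groups()))


print(parse_calibration_tag("seq20000_prompt83_group128"))
# CalibrationTag(seq_tokens=20000, prompt_count=83, group_size=128)
```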
## File format
Each rotation directory contains:
- `k_rotation_qqt_r_h_pbr.pt` — K-side rotation `R_K = R · H · P_br` where `R = eigvec(Σ_Q)` is fit on Q's attention-aware covariance, `H` is a head-dim Hadamard, and `P_br` is the eigenvalue-sorted bit-reversal permutation
- `v_rotation_sst_r_h_pbr.pt` — V-side rotation built on the score-weighted V covariance `Σ_V = V^T diag(K^T (Q^T Q) K) V`
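The two data-independent factors in `R_K` can be illustrated in a few lines. The sketch below builds a Sylvester-construction Hadamard matrix and a bit-reversal index permutation, then checks that the combined transform preserves dot products, which is the property that lets an orthogonal rotation be applied consistently to queries and keys without changing attention logits. The normalization, the Hadamard ordering, and how the permutation combines with the eigenvalue sort in OSCAR are not specified here, so treat those choices as assumptions; all helper names are hypothetical.
```python
import math


def sylvester_hadamard(n):
    """n x n Hadamard matrix with +/-1 entries (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Sylvester step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def bit_reversal_permutation(n):
    """Index i maps to the integer whose log2(n)-bit pattern is i reversed."""
    bits = n.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]


def rotate(vec, n):
    """Apply the orthonormal Hadamard H / sqrt(n), then permute coordinates."""
    h = sylvester_hadamard(n)
    mixed = [sum(h[r][c] * vec[c] for c in range(n)) / math.sqrt(n)
             for r in range(n)]
    return [mixed[j] for j in bit_reversal_permutation(n)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 3.0]
k = [1.0, 0.0, -2.0, 4.0, 0.5, 0.5, -1.0, 2.0]
print(bit_reversal_permutation(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
# The two dot products agree up to floating-point rounding.
print(dot(q, k), dot(rotate(q, 8), rotate(k, 8)))
```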
File layout (PyTorch state-dict):
```python
{
    "format_version": 1,
    "objective": "qqt_r_h_pbr",  # or "sst_r_h_pbr" for V
    "source_grouping": "layer",
    "layers": {
        0: {"layer_id": 0, "rotation": tensor(head_dim, head_dim)},
        1: {"layer_id": 1, "rotation": tensor(head_dim, head_dim)},
        ...
    }
}
```
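Given that layout, a small loader can check the structure before the rotations are handed to anything else. This is a sketch with hypothetical helper names; it assumes the file holds only tensors and plain Python types, so `torch.load` with `weights_only=True` is enough, and it imports PyTorch lazily so the validation helper stays usable on its own.
```python
def load_rotation_file(path):
    """Load a rotation file onto the CPU. Requires PyTorch."""
    import torch  # lazy import: only needed when reading a real .pt file

    # weights_only=True restricts unpickling to tensors and plain types
    # (assumed sufficient for the layout shown above).
    return torch.load(path, map_location="cpu", weights_only=True)


def layer_rotations(payload):
    """Return {layer_id: rotation} after basic structural checks."""
    version = payload.get("format_version")
    if version != 1:
        raise ValueError(f"unexpected format_version: {version!r}")
    rotations = {}
    for key, entry in payload["layers"].items():
        if entry["layer_id"] != key:
            raise ValueError(f"layer key {key} != layer_id {entry['layer_id']}")
        shape = tuple(entry["rotation"].shape)
        if len(shape) != 2 or shape[0] != shape[1]:
            raise ValueError(f"layer {key}: expected a square matrix, got {shape}")
        rotations[key] = entry["rotation"]
    return rotations
```
With PyTorch installed, `layer_rotations(load_rotation_file(path))` yields one `(head_dim, head_dim)` tensor per layer. Checking that `R @ R.T` is close to the identity is a cheap way to confirm each matrix is orthogonal.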
## How to use
### 1. Download the rotation for your model
```bash
pip install huggingface_hub
```
```python
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="Zhongzhu/OSCAR-RotationZoo",
    allow_patterns="Qwen3-8B/**",
    local_dir="./oscar_rotations",
)
# rotations now at ./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128/
```
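After the download, it can help to confirm that both rotation files landed where you expect. The sketch below walks the local directory and lists every folder that contains the K and V files named in the file format section; `find_rotation_dirs` is a hypothetical helper.
```python
from pathlib import Path

K_FILE = "k_rotation_qqt_r_h_pbr.pt"
V_FILE = "v_rotation_sst_r_h_pbr.pt"


def find_rotation_dirs(root):
    """Return every directory under root that holds both rotation files."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        p.parent for p in root.rglob(K_FILE) if (p.parent / V_FILE).is_file()
    )


for rot_dir in find_rotation_dirs("./oscar_rotations"):
    print(rot_dir)
```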
### 2. Serve with sglang-research using the rotation
Clone https://github.com/FutureMLS-Lab/OSCAR and set up the single `oscar` conda env, then point the eval driver at your downloaded rotation:
```bash
ROT_DIR=./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128 \
bash rotation/qwen3-8B/eval_gpqa.sh
```
The driver internally launches sglang with these flags:
```bash
SGLANG_ENABLE_MIXED_KV_WINDOWS=1 \
SGLANG_OSCAR_ROTATION_MODE=oscar \
SGLANG_OSCAR_K_ROTATION_PATH=$ROT_DIR/k_rotation_qqt_r_h_pbr.pt \
SGLANG_OSCAR_V_ROTATION_PATH=$ROT_DIR/v_rotation_sst_r_h_pbr.pt \
SGLANG_OSCAR_K_CLIP_RATIO=0.96 \
SGLANG_OSCAR_V_CLIP_RATIO=0.92 \
SGLANG_OSCAR_ABSORB_V_ROTATION=1 \
SGLANG_MIXED_KV_PREFIX_TOKENS=64 \
SGLANG_MIXED_KV_RECENT_TOKENS=256 \
HADAMARD_ORDER=128 \
python -m sglang.launch_server \
--model-path Qwen/Qwen3-8B \
--tensor-parallel-size 1 \
--kv-cache-dtype int2 \
--kv-cache-quant-group-size 128 \
--prefill-attention-backend fa3 \
--decode-attention-backend triton \
--disable-radix-cache \
--disable-custom-all-reduce \
--trust-remote-code
```
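If you prefer to launch the server from Python rather than through the eval driver, the same environment can be assembled programmatically. This sketch only mirrors the variables and values listed above; `oscar_env` is a hypothetical helper, and whether every variable is required is not something this card states.
```python
import os
from pathlib import Path


def oscar_env(rot_dir, base_env=None):
    """Copy of the environment with the OSCAR variables shown above added."""
    rot_dir = Path(rot_dir)
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "SGLANG_ENABLE_MIXED_KV_WINDOWS": "1",
        "SGLANG_OSCAR_ROTATION_MODE": "oscar",
        "SGLANG_OSCAR_K_ROTATION_PATH": str(rot_dir / "k_rotation_qqt_r_h_pbr.pt"),
        "SGLANG_OSCAR_V_ROTATION_PATH": str(rot_dir / "v_rotation_sst_r_h_pbr.pt"),
        "SGLANG_OSCAR_K_CLIP_RATIO": "0.96",
        "SGLANG_OSCAR_V_CLIP_RATIO": "0.92",
        "SGLANG_OSCAR_ABSORB_V_ROTATION": "1",
        "SGLANG_MIXED_KV_PREFIX_TOKENS": "64",
        "SGLANG_MIXED_KV_RECENT_TOKENS": "256",
        "HADAMARD_ORDER": "128",
    })
    return env
```
The returned dict can be passed as `env=` to `subprocess.run` together with the `python -m sglang.launch_server ...` argument list from the command above.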
## Reproducing from scratch
If you want to fit your own rotation on a different calibration set, the OSCAR pipeline is end-to-end reproducible:
```bash
git clone https://github.com/FutureMLS-Lab/OSCAR.git
cd OSCAR
bash rotation/qwen3-8B/save_qkv_8b.sh # phase 1 — dump Q/K/V
bash rotation/qwen3-8B/compute_rotation.sh # phase 2 — fit R = eigvec(Σ_Q)
```
## Citation
```bibtex
@article{zhou2026oscar,
  title   = {OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization},
  author  = {Zhou, Zhongzhu and Zhuang, Donglin and Li, Jisen and Chen, Ziyan and Song, Shuaiwen Leon and Athiwaratkun, Ben and Wu, Xiaoxia},
  journal = {arXiv preprint arXiv:2605.17757},
  year    = {2026},
  note    = {Together AI; University of Sydney; UIUC},
}
```