---
license: apache-2.0
tags:
- oscar
- int2
- kv-cache
- quantization
- rotation
- sglang
pipeline_tag: text-generation
---

<p align="center">
  <img src="https://huggingface.co/Zhongzhu/OSCAR-RotationZoo/resolve/main/oscar_logo_kv_transparent.png" alt="OSCAR INT2 KV-Cache" width="180"/>
</p>

# OSCAR RotationZoo

Precomputed K/V rotation matrices for **OSCAR INT2 KV-cache quantization**.

This repository contains the artifacts for the paper:
**OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization**
*Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu*

- 📄 **Paper** — [arXiv:2605.17757](https://arxiv.org/abs/2605.17757)
- 🌐 **Website** — https://oscar-quantize.github.io/
- 💻 **Code** — https://github.com/FutureMLS-Lab/OSCAR

OSCAR captures Q/K/V activations on a small calibration set, estimates attention-aware K/V covariance offline, and derives per-layer orthogonal rotations that align INT2 quantization with the directions attention actually consumes. The result is ~7× compression of the KV-cache memory footprint with a single-digit percentage-point (pp) accuracy drop on GPQA for dense reasoning models; on the models listed below, the largest drop is 1.11 pp.

This repo packages the rotations as drop-in `.pt` files so you don't need to re-run the Q/K/V dump and eigendecomposition yourself.
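As a rough sanity check on the ~7× figure, the arithmetic can be sketched as follows. The per-group metadata cost (one 16-bit scale plus one 16-bit zero point per group) is an assumption made for illustration, not something this card specifies, and the sketch ignores any tokens kept at higher precision.

```python
def kv_compression_ratio(
    baseline_bits: float = 16.0,           # BF16 baseline
    quant_bits: float = 2.0,               # INT2 payload per element
    group_size: int = 128,                 # quant group size along head_dim
    metadata_bits_per_group: float = 32.0  # ASSUMPTION: fp16 scale + fp16 zero point
) -> float:
    """Return baseline bits per element divided by effective quantized bits."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    # Metadata is amortized over every element in the group.
    effective_bits = quant_bits + metadata_bits_per_group / group_size
    return baseline_bits / effective_bits


if __name__ == "__main__":
    # 16 / (2 + 32/128) = 16 / 2.25, i.e. about 7.11
    print(f"{kv_compression_ratio():.2f}x")
```

Under these assumptions a group size of 128 costs 0.25 extra bits per element, which is consistent with the headline figure; a different metadata layout would shift the ratio.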

## Available rotations

| Model | Calibration | GPQA (BF16) | GPQA (OSCAR INT2) |
|---|---|---:|---:|
| `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt83_group128` | 67.27 | 67.17 |
| `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt85_group128` (fresh re-dump) | 67.27 | — |
| `Qwen/Qwen3-8B`               | `seq20000_prompt83_group128` | 56.67 | 55.56 |
| `Qwen/Qwen3-32B`              | `seq16000_prompt69_group128` | 58.49 | 60.40 |
| `zai-org/GLM-4.7-FP8`         | `seq10000_prompt43_group128` | 73.23 | 73.57 |

`seq<T>_prompt<N>_group<G>` notation: `T` = total calibration tokens, `N` = calibration prompt count, `G` = INT2 quant group size along head_dim.
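If you need these values programmatically (for example, to pick a matching `--kv-cache-quant-group-size`), a directory name can be parsed with a small helper. This is an illustrative sketch; `CalibrationTag` and `parse_calibration_tag` are hypothetical names, not part of the OSCAR codebase.

```python
import re
from typing import NamedTuple


class CalibrationTag(NamedTuple):
    total_tokens: int   # T: total calibration tokens
    prompt_count: int   # N: number of calibration prompts
    group_size: int     # G: INT2 quant group size along head_dim


_TAG_RE = re.compile(r"seq(\d+)_prompt(\d+)_group(\d+)")


def parse_calibration_tag(tag: str) -> CalibrationTag:
    """Parse a 'seq<T>_prompt<N>_group<G>' directory name."""
    match = _TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a calibration tag: {tag!r}")
    return CalibrationTag(*(int(g) for g in match.groups()))


if __name__ == "__main__":
    print(parse_calibration_tag("seq20000_prompt83_group128"))
```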

## File format

Each rotation directory contains:

- `k_rotation_qqt_r_h_pbr.pt` — K-side rotation `R_K = R · H · P_br` where `R = eigvec(Σ_Q)` is fit on Q's attention-aware covariance, `H` is a head-dim Hadamard, and `P_br` is the eigenvalue-sorted bit-reversal permutation
- `v_rotation_sst_r_h_pbr.pt` — V-side rotation built on the score-weighted V covariance `Σ_V = V^T diag(K^T (Q^T Q) K) V`
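To make the `H` and `P_br` factors concrete, the sketch below builds an orthonormal Sylvester Hadamard matrix and a standard bit-reversal permutation in plain Python, then checks that their product is still orthogonal. It is an illustration only: how OSCAR combines the eigenvalue sort with the bit reversal is not described in this card, and the function names are hypothetical.

```python
def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix of order n (a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # Sylvester doubling: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    scale = n ** -0.5
    return [[x * scale for x in row] for row in h]


def bit_reversal_permutation(n):
    """Map each index in range(n) to the index with its bits reversed."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    perm = []
    for i in range(n):
        r = 0
        for b in range(bits):
            if (i >> b) & 1:
                r |= 1 << (bits - 1 - b)
        perm.append(r)
    return perm


def permute_columns(m, perm):
    """Right-multiply m by the permutation matrix that reorders columns by perm."""
    return [[row[p] for p in perm] for row in m]


def is_orthogonal(m, tol=1e-9):
    """Check m @ m.T == I within tol."""
    n = len(m)
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(m[i], m[j]))
            if abs(dot - (1.0 if i == j else 0.0)) > tol:
                return False
    return True


if __name__ == "__main__":
    hp = permute_columns(hadamard(8), bit_reversal_permutation(8))
    print(bit_reversal_permutation(8), is_orthogonal(hp))
```

Because `H` and `P_br` are both orthogonal, multiplying them onto an orthogonal `R` keeps the full rotation orthogonal, so it can be undone exactly by its transpose.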

File layout (PyTorch state-dict):
```python
{
  "format_version": 1,
  "objective":      "qqt_r_h_pbr",      # or "sst_r_h_pbr" for V
  "source_grouping": "layer",
  "layers": {
    0:  {"layer_id": 0,  "rotation": tensor(head_dim, head_dim)},
    1:  {"layer_id": 1,  "rotation": tensor(head_dim, head_dim)},
    ...
  }
}
```
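A downloaded file can be checked against this layout before serving. The following is a minimal sketch, assuming the file loads with `torch.load` and matches the schema shown above; `load_rotation_file` and `check_rotation_layout` are hypothetical helper names, not part of the OSCAR codebase.

```python
from pathlib import Path


def load_rotation_file(path):
    """Load a rotation file on CPU. Requires PyTorch (imported lazily)."""
    import torch

    # weights_only=True restricts unpickling to tensors and basic containers.
    return torch.load(Path(path), map_location="cpu", weights_only=True)


def check_rotation_layout(obj, expected_objective=None):
    """Validate the documented layout; return {layer_id: rotation shape}."""
    if obj.get("format_version") != 1:
        raise ValueError(f"unexpected format_version: {obj.get('format_version')!r}")
    if expected_objective is not None and obj.get("objective") != expected_objective:
        raise ValueError(f"unexpected objective: {obj.get('objective')!r}")
    shapes = {}
    for key, entry in obj["layers"].items():
        if entry["layer_id"] != key:
            raise ValueError(f"layer key {key!r} != layer_id {entry['layer_id']!r}")
        shape = tuple(entry["rotation"].shape)
        # Each rotation is documented as a square (head_dim, head_dim) matrix.
        if len(shape) != 2 or shape[0] != shape[1]:
            raise ValueError(f"layer {key}: rotation is not square: {shape}")
        shapes[key] = shape
    return shapes


if __name__ == "__main__":
    rot_dir = Path("./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128")
    k_rot = load_rotation_file(rot_dir / "k_rotation_qqt_r_h_pbr.pt")
    print(check_rotation_layout(k_rot, expected_objective="qqt_r_h_pbr"))
```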

## How to use

### 1. Download the rotation for your model

```bash
pip install huggingface_hub
```

```python
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="Zhongzhu/OSCAR-RotationZoo",
    allow_patterns="Qwen3-8B/**",
    local_dir="./oscar_rotations",
)
# rotations now at ./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128/
```

### 2. Serve with sglang-research using the rotation

Clone https://github.com/FutureMLS-Lab/OSCAR and set up the single `oscar` conda env, then point the eval driver at your downloaded rotation:

```bash
ROT_DIR=./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128 \
  bash rotation/qwen3-8B/eval_gpqa.sh
```

The driver internally launches sglang with these flags:

```bash
SGLANG_ENABLE_MIXED_KV_WINDOWS=1 \
SGLANG_OSCAR_ROTATION_MODE=oscar \
SGLANG_OSCAR_K_ROTATION_PATH=$ROT_DIR/k_rotation_qqt_r_h_pbr.pt \
SGLANG_OSCAR_V_ROTATION_PATH=$ROT_DIR/v_rotation_sst_r_h_pbr.pt \
SGLANG_OSCAR_K_CLIP_RATIO=0.96 \
SGLANG_OSCAR_V_CLIP_RATIO=0.92 \
SGLANG_OSCAR_ABSORB_V_ROTATION=1 \
SGLANG_MIXED_KV_PREFIX_TOKENS=64 \
SGLANG_MIXED_KV_RECENT_TOKENS=256 \
HADAMARD_ORDER=128 \
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B \
  --tensor-parallel-size 1 \
  --kv-cache-dtype int2 \
  --kv-cache-quant-group-size 128 \
  --prefill-attention-backend fa3 \
  --decode-attention-backend triton \
  --disable-radix-cache \
  --disable-custom-all-reduce \
  --trust-remote-code
```
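If you launch the server from Python rather than through the eval driver, the rotation-related variables can be assembled as in the sketch below. It is an assumption-laden illustration: `oscar_rotation_env` is a hypothetical helper, and the variable names and values are copied from the command above rather than from any documented API.

```python
import os
from pathlib import Path


def oscar_rotation_env(rot_dir):
    """Return a copy of os.environ extended with the OSCAR variables shown above."""
    rot = Path(rot_dir)
    k_path = rot / "k_rotation_qqt_r_h_pbr.pt"
    v_path = rot / "v_rotation_sst_r_h_pbr.pt"
    # Fail early if either rotation file is missing.
    missing = [p.name for p in (k_path, v_path) if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"missing rotation files in {rot}: {missing}")
    env = dict(os.environ)
    env.update({
        "SGLANG_ENABLE_MIXED_KV_WINDOWS": "1",
        "SGLANG_OSCAR_ROTATION_MODE": "oscar",
        "SGLANG_OSCAR_K_ROTATION_PATH": str(k_path),
        "SGLANG_OSCAR_V_ROTATION_PATH": str(v_path),
        "SGLANG_OSCAR_K_CLIP_RATIO": "0.96",
        "SGLANG_OSCAR_V_CLIP_RATIO": "0.92",
        "SGLANG_OSCAR_ABSORB_V_ROTATION": "1",
        "SGLANG_MIXED_KV_PREFIX_TOKENS": "64",
        "SGLANG_MIXED_KV_RECENT_TOKENS": "256",
        "HADAMARD_ORDER": "128",
    })
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` together with the `python -m sglang.launch_server ...` arguments listed above.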

## Reproducing from scratch

If you want to fit your own rotation on a different calibration set, the OSCAR pipeline is end-to-end reproducible:

```bash
git clone https://github.com/FutureMLS-Lab/OSCAR.git
cd OSCAR
bash rotation/qwen3-8B/save_qkv_8b.sh        # phase 1 — dump Q/K/V
bash rotation/qwen3-8B/compute_rotation.sh   # phase 2 — fit R = eigvec(Σ_Q)
```

## Citation

```bibtex
@article{zhou2026oscar,
  title  = {OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization},
  author = {Zhou, Zhongzhu and Zhuang, Donglin and Li, Jisen and Chen, Ziyan and Song, Shuaiwen Leon and Athiwaratkun, Ben and Wu, Xiaoxia},
  journal = {arXiv preprint arXiv:2605.17757},
  year   = {2026},
  note   = {Together AI; University of Sydney; UIUC},
}
```