Add README
README.md
ADDED
@@ -0,0 +1,143 @@
---
license: apache-2.0
tags:
- oscar
- int2
- kv-cache
- quantization
- rotation
- sglang
---

# OSCAR RotationZoo

Precomputed K/V rotation matrices for **OSCAR INT2 KV-cache quantization**.

📄 Paper: *OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization*
💻 Code: https://github.com/FutureMLS-Lab/OSCAR

OSCAR captures Q/K/V activations on a small calibration set, estimates
attention-aware K/V covariance offline, and derives per-layer orthogonal
rotations that align INT2 quantization with the directions attention actually
consumes. The result is ~7× compression of the KV-cache memory footprint with
a single-digit percentage-point (pp) accuracy drop on GPQA for dense reasoning
models.

This repo packages the rotations as drop-in `.pt` files so you don't need to
re-run the Q/K/V dump and eigendecomposition yourself.

## Available rotations

| Model | Calibration | GPQA (BF16) | GPQA (OSCAR INT2) |
|---|---|---:|---:|
| `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt83_group128` | 67.27 | 64.95 |
| `Qwen/Qwen3-8B` | `seq20000_prompt83_group128` | 56.67 | 55.05 |
| `Qwen/Qwen3-32B` | `seq16000_prompt69_group128` | 58.49 | 60.40 |
| `zai-org/GLM-4.7-FP8` | `seq10000_prompt43_group128` | 73.23 | 73.57 |

`seq<T>_prompt<N>_group<G>` notation: `T` = total calibration tokens,
`N` = calibration prompt count, `G` = INT2 quant group size along head_dim.

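If you script over several rotation directories, the `seq<T>_prompt<N>_group<G>` names can be parsed mechanically. The following is a minimal sketch; `CalibrationTag` and `parse_calibration_tag` are hypothetical helper names for illustration and are not part of the OSCAR codebase.

```python
import re
from typing import NamedTuple


class CalibrationTag(NamedTuple):
    total_tokens: int  # T: total calibration tokens
    prompt_count: int  # N: calibration prompt count
    group_size: int    # G: INT2 quant group size along head_dim


_TAG_RE = re.compile(r"seq(\d+)_prompt(\d+)_group(\d+)")


def parse_calibration_tag(name: str) -> CalibrationTag:
    """Split a seq<T>_prompt<N>_group<G> directory name into its three integers."""
    match = _TAG_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a seq<T>_prompt<N>_group<G> name: {name!r}")
    return CalibrationTag(*(int(group) for group in match.groups()))


tag = parse_calibration_tag("seq20000_prompt83_group128")
print(tag.total_tokens, tag.prompt_count, tag.group_size)  # 20000 83 128
```

The parsed `group_size` should match the value passed to `--kv-cache-quant-group-size` when serving.
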
## File format

Each rotation directory contains:

- `k_rotation_qqt_r_h_pbr.pt`: K-side rotation `R_K = R · H · P_br`, where
  `R = eigvec(Σ_Q)` is fit on Q's attention-aware covariance, `H` is a
  head-dim Hadamard matrix, and `P_br` is the eigenvalue-sorted bit-reversal
  permutation
- `v_rotation_sst_r_h_pbr.pt`: V-side rotation built on the score-weighted
  V covariance `Σ_V = V^T diag(K^T (Q^T Q) K) V`

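To make the `R · H · P_br` composition concrete, here is a small pure-Python sketch that builds an orthonormal Hadamard matrix and a bit-reversal permutation and checks that their product with an orthogonal `R` is still orthogonal. It rests on several assumptions: `placeholder_r` is a hypothetical stand-in for the fitted eigenvector basis, the Sylvester construction and the plain (not eigenvalue-sorted) bit reversal may differ from what OSCAR actually uses, and `head_dim = 8` is chosen only to keep the example small.

```python
import math


def hadamard(n):
    """Sylvester Hadamard matrix scaled by 1/sqrt(n), so it is orthogonal."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[x * scale for x in row] for row in h]


def bit_reversal_permutation(n):
    """perm[i] is the integer whose log2(n)-bit pattern is that of i, reversed."""
    bits = n.bit_length() - 1
    return [int(format(i, "b").zfill(bits)[::-1], 2) for i in range(n)]


def permutation_matrix(perm):
    n = len(perm)
    return [[1.0 if perm[i] == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def is_orthogonal(m, tol=1e-9):
    """True if m @ m.T is the identity up to tol."""
    n = len(m)
    gram = matmul(m, [list(col) for col in zip(*m)])
    return all(
        abs(gram[i][j] - (1.0 if i == j else 0.0)) < tol
        for i in range(n)
        for j in range(n)
    )


def placeholder_r(n, angle=0.3):
    """Block-diagonal 2x2 rotations: a stand-in for eigvec(Σ_Q), NOT a fitted basis."""
    r = [[0.0] * n for _ in range(n)]
    c, s = math.cos(angle), math.sin(angle)
    for i in range(0, n, 2):
        r[i][i], r[i][i + 1] = c, -s
        r[i + 1][i], r[i + 1][i + 1] = s, c
    return r


head_dim = 8  # assumption: tiny size for illustration only
h = hadamard(head_dim)
p_br = permutation_matrix(bit_reversal_permutation(head_dim))
r_k = matmul(matmul(placeholder_r(head_dim), h), p_br)
print(is_orthogonal(r_k))  # True
```

Because each factor is orthogonal, the composed matrix is too, so its transpose is also its inverse.
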
File layout (PyTorch state-dict):
```python
{
    "format_version": 1,
    "objective": "qqt_r_h_pbr",  # or "sst_r_h_pbr" for V
    "source_grouping": "layer",
    "layers": {
        # one entry per layer, keyed by layer index
        0: {"layer_id": 0, "rotation": tensor(head_dim, head_dim)},
        1: {"layer_id": 1, "rotation": tensor(head_dim, head_dim)},
        ...
    }
}
```

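Before handing a file to the server, it can be useful to sanity-check it against the layout above. The sketch below assumes the file has already been loaded into a Python dict (for example with `torch.load(path, map_location="cpu")`, which requires PyTorch) and only relies on each rotation exposing a `.shape`. `validate_rotation_state` is a hypothetical helper, not an OSCAR API.

```python
def validate_rotation_state(state, expected_objective):
    """Check the documented layout and return {layer_id: rotation}.

    `state` is the dict stored in a rotation file; `expected_objective` is
    "qqt_r_h_pbr" for K files or "sst_r_h_pbr" for V files.
    """
    version = state.get("format_version")
    if version != 1:
        raise ValueError(f"unsupported format_version: {version!r}")
    objective = state.get("objective")
    if objective != expected_objective:
        raise ValueError(f"expected {expected_objective!r}, found {objective!r}")

    rotations = {}
    for key, entry in state["layers"].items():
        if entry["layer_id"] != key:
            raise ValueError(f"layer key {key!r} != layer_id {entry['layer_id']!r}")
        shape = tuple(entry["rotation"].shape)
        if len(shape) != 2 or shape[0] != shape[1]:
            raise ValueError(f"layer {key}: rotation is not square, got {shape}")
        rotations[key] = entry["rotation"]
    return rotations


# Typical use (assumes PyTorch is installed and the file was downloaded):
#   import torch
#   state = torch.load("k_rotation_qqt_r_h_pbr.pt", map_location="cpu")
#   k_rotations = validate_rotation_state(state, "qqt_r_h_pbr")
```
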
## How to use

### 1. Download the rotation for your model

```bash
pip install huggingface_hub
```

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Zhongzhu/OSCAR-RotationZoo",
    allow_patterns="Qwen3-8B/**",
    local_dir="./oscar_rotations",
)
# rotations now at ./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128/
```

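After the download, a quick way to confirm that both files arrived is to scan the local directory for folders holding a K and a V rotation. This is a minimal sketch using only the standard library; `find_rotation_dirs` is a hypothetical helper, and the file names are the ones listed under "File format".

```python
from pathlib import Path

K_FILE = "k_rotation_qqt_r_h_pbr.pt"
V_FILE = "v_rotation_sst_r_h_pbr.pt"


def find_rotation_dirs(root):
    """Return every directory under `root` that holds both rotation files."""
    root = Path(root)
    return sorted(
        k_path.parent
        for k_path in root.rglob(K_FILE)
        if (k_path.parent / V_FILE).is_file()
    )


for rot_dir in find_rotation_dirs("./oscar_rotations"):
    print(rot_dir)  # candidate values for ROT_DIR
```
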
### 2. Serve with sglang-research using the rotation

Clone https://github.com/FutureMLS-Lab/OSCAR and set up the single `oscar`
conda env, then point the eval driver at your downloaded rotation:

```bash
ROT_DIR=./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128 \
bash rotation/qwen3-8B/eval_gpqa.sh
```

The driver internally launches sglang with these flags:

```bash
SGLANG_ENABLE_MIXED_KV_WINDOWS=1 \
SGLANG_OSCAR_ROTATION_MODE=oscar \
SGLANG_OSCAR_K_ROTATION_PATH=$ROT_DIR/k_rotation_qqt_r_h_pbr.pt \
SGLANG_OSCAR_V_ROTATION_PATH=$ROT_DIR/v_rotation_sst_r_h_pbr.pt \
SGLANG_OSCAR_K_CLIP_RATIO=0.96 \
SGLANG_OSCAR_V_CLIP_RATIO=0.92 \
SGLANG_OSCAR_ABSORB_V_ROTATION=1 \
SGLANG_MIXED_KV_PREFIX_TOKENS=64 \
SGLANG_MIXED_KV_RECENT_TOKENS=256 \
HADAMARD_ORDER=128 \
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-8B \
  --tensor-parallel-size 1 \
  --kv-cache-dtype int2 \
  --kv-cache-quant-group-size 128 \
  --prefill-attention-backend fa3 \
  --decode-attention-backend triton \
  --disable-radix-cache \
  --disable-custom-all-reduce \
  --trust-remote-code
```

Sink tokens (`SGLANG_MIXED_KV_PREFIX_TOKENS=64`) and recent-window tokens
(`SGLANG_MIXED_KV_RECENT_TOKENS=256`) stay in BF16; the rest of the KV cache
is INT2-quantized in 128-element groups along head_dim using these rotations.

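The effect of the BF16 windows on memory can be estimated with a few lines of arithmetic. The sketch below is a rough model, not a description of the actual storage format: it assumes each 128-element INT2 group carries 32 bits of metadata (for example a 16-bit scale and a 16-bit zero point), and both function names are hypothetical.

```python
def split_kv_tokens(seq_len, prefix_tokens=64, recent_tokens=256):
    """Return (bf16_tokens, int2_tokens) for `seq_len` cached tokens."""
    bf16 = min(seq_len, prefix_tokens + recent_tokens)
    return bf16, seq_len - bf16


def kv_compression_ratio(seq_len, group_size=128, meta_bits_per_group=32,
                         prefix_tokens=64, recent_tokens=256):
    """BF16 footprint divided by the mixed BF16/INT2 footprint.

    meta_bits_per_group is an ASSUMPTION about per-group scale/zero-point
    storage; the real layout may differ.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    bf16, int2 = split_kv_tokens(seq_len, prefix_tokens, recent_tokens)
    int2_bits_per_value = 2 + meta_bits_per_group / group_size
    stored_bits = bf16 * 16 + int2 * int2_bits_per_value
    return (seq_len * 16) / stored_bits


print(split_kv_tokens(20000))                  # (320, 19680)
print(round(kv_compression_ratio(20000), 2))   # 6.48
```

Under these assumptions the long-context limit is 16 / 2.25 ≈ 7.1×, which is in line with the ~7× figure quoted in the introduction. Sequences shorter than 320 tokens stay entirely in BF16.
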
## Reproducing from scratch

If you want to fit your own rotation on a different calibration set, the
OSCAR pipeline is end-to-end reproducible:

```bash
git clone https://github.com/FutureMLS-Lab/OSCAR.git
cd OSCAR
bash rotation/qwen3-8B/save_qkv_8b.sh        # phase 1: dump Q/K/V
bash rotation/qwen3-8B/compute_rotation.sh   # phase 2: fit R = eigvec(Σ_Q)
```

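For intuition about what phase 2 computes, the following sketch fits an eigenvector basis to a covariance matrix in pure Python. It is an illustration under stated assumptions, not the repository's implementation: it uses the plain second moment `QᵀQ / n` as a stand-in for the attention-aware covariance, a textbook cyclic Jacobi solver instead of a library routine such as `torch.linalg.eigh`, and toy 3-dimensional data. All function names are hypothetical.

```python
import math


def second_moment(rows):
    """Σ = XᵀX / n for a list of equally sized row vectors."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]


def jacobi_eigh(sym, sweeps=100, tol=1e-24):
    """Eigen-decompose a symmetric matrix with cyclic Jacobi rotations.

    Returns (eigenvalues, v); column k of v is the eigenvector for eigenvalue k.
    """
    n = len(sym)
    a = [row[:] for row in sym]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Angle that zeroes a[p][q], folded into [-pi/4, pi/4].
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                if theta > math.pi / 4:
                    theta -= math.pi / 2
                elif theta < -math.pi / 4:
                    theta += math.pi / 2
                c, s = math.cos(theta), math.sin(theta)
                for m in (a, v):  # right-multiply a and v by the rotation
                    for k in range(n):
                        mp, mq = m[k][p], m[k][q]
                        m[k][p], m[k][q] = c * mp - s * mq, s * mp + c * mq
                for k in range(n):  # left-multiply a by the transposed rotation
                    ap, aq = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * ap - s * aq, s * ap + c * aq
    return [a[i][i] for i in range(n)], v


def fit_rotation(rows):
    """Columns are eigenvectors of the second moment, largest eigenvalue first."""
    eigvals, v = jacobi_eigh(second_moment(rows))
    order = sorted(range(len(eigvals)), key=lambda k: -eigvals[k])
    return [[v[i][k] for k in order] for i in range(len(v))]


# Toy "Q activations": 4 samples, head_dim = 3 (illustrative values only).
q_samples = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [2.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
r = fit_rotation(q_samples)
```

With `r` in hand, `rᵀ Σ r` is diagonal, meaning the rotated coordinates are decorrelated and ordered by variance.
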
## Citation

```bibtex
@article{zhou2026oscar,
  title  = {OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization},
  author = {Zhou, Zhongzhu and Zhuang, Donglin and Li, Jisen and Chen, Ziyan and Song, Shuaiwen Leon and Athiwaratkun, Ben and Wu, Xiaoxia},
  year   = {2026},
  note   = {Together AI; University of Sydney; UIUC},
}
```