Zhongzhu committed
Commit 3544876 · verified · 1 Parent(s): 6dde274

Add README

  1. README.md +143 -0
README.md ADDED
@@ -0,0 +1,143 @@
+ ---
+ license: apache-2.0
+ tags:
+ - oscar
+ - int2
+ - kv-cache
+ - quantization
+ - rotation
+ - sglang
+ ---
+
+ # OSCAR RotationZoo
+
+ Precomputed K/V rotation matrices for **OSCAR INT2 KV-cache quantization**.
+
+ 📄 Paper: *OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization*
+ 💻 Code: https://github.com/FutureMLS-Lab/OSCAR
+
+ OSCAR captures Q/K/V activations on a small calibration set, estimates
+ attention-aware K/V covariance offline, and derives per-layer orthogonal
+ rotations that align INT2 quantization with the directions attention actually
+ consumes. The result is ~7× compression of the KV-cache memory footprint with
+ a single-digit percentage-point (pp) accuracy drop on GPQA for dense reasoning
+ models.
+
+ This repo packages the rotations as drop-in `.pt` files so you don't need to
+ re-run the Q/K/V dump and eigendecomposition yourself.
+
+ ## Available rotations
+
+ | Model | Calibration | GPQA (BF16) | GPQA (OSCAR INT2) |
+ |---|---|---:|---:|
+ | `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt83_group128` | 67.27 | 64.95 |
+ | `Qwen/Qwen3-8B` | `seq20000_prompt83_group128` | 56.67 | 55.05 |
+ | `Qwen/Qwen3-32B` | `seq16000_prompt69_group128` | 58.49 | 60.40 |
+ | `zai-org/GLM-4.7-FP8` | `seq10000_prompt43_group128` | 73.23 | 73.57 |
+
+ `seq<T>_prompt<N>_group<G>` notation: `T` = total calibration tokens,
+ `N` = calibration prompt count, `G` = INT2 quant group size along head_dim.
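
The directory-name notation above can be decoded mechanically. The following is a minimal sketch, not part of the OSCAR code base: `CalibrationTag` and `parse_calibration_tag` are hypothetical names introduced here for illustration.

```python
import re
from typing import NamedTuple


class CalibrationTag(NamedTuple):
    """Fields of a seq<T>_prompt<N>_group<G> directory name (hypothetical helper)."""
    total_tokens: int   # T: total calibration tokens
    prompt_count: int   # N: calibration prompt count
    group_size: int     # G: INT2 quant group size along head_dim


_TAG_RE = re.compile(r"seq(\d+)_prompt(\d+)_group(\d+)")


def parse_calibration_tag(tag: str) -> CalibrationTag:
    """Split a calibration tag into its three integer fields."""
    match = _TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a seq<T>_prompt<N>_group<G> tag: {tag!r}")
    return CalibrationTag(*(int(group) for group in match.groups()))


tag = parse_calibration_tag("seq20000_prompt83_group128")
print(tag.total_tokens, tag.prompt_count, tag.group_size)
```

Reading `group_size` from the tag is one way to keep a group-size setting at serving time consistent with the value used during calibration.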
+
+ ## File format
+
+ Each rotation directory contains:
+
+ - `k_rotation_qqt_r_h_pbr.pt`: K-side rotation `R_K = R · H · P_br` where
+ `R = eigvec(Σ_Q)` is fit on Q's attention-aware covariance, `H` is a
+ head-dim Hadamard, and `P_br` is the eigenvalue-sorted bit-reversal
+ permutation
+ - `v_rotation_sst_r_h_pbr.pt`: V-side rotation built on the score-weighted
+ V covariance `Σ_V = V^T diag(K^T (Q^T Q) K) V`
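
To make the two data-independent factors in `R · H · P_br` concrete, here is a minimal pure-Python sketch of a Sylvester-construction Hadamard matrix and a plain bit-reversal permutation. It makes several assumptions: the function names are hypothetical, the learned eigenvector factor `R` is omitted, and the eigenvalue-based ordering OSCAR applies before the bit reversal is not reproduced, so the actual construction in the repository may differ.

```python
import math


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # [[H, H], [H, -H]] doubles the size at every step.
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def bit_reversal_permutation(n):
    """Index list where position i maps to i with its log2(n) bits reversed."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    if bits == 0:
        return [0]
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]


def hadamard_then_permute(vec):
    """Apply the orthonormal Hadamard (H / sqrt(n)), then reorder by bit reversal."""
    n = len(vec)
    h = hadamard(n)
    scale = 1.0 / math.sqrt(n)
    mixed = [scale * sum(h[i][j] * vec[j] for j in range(n)) for i in range(n)]
    return [mixed[p] for p in bit_reversal_permutation(n)]


print(bit_reversal_permutation(8))
print(hadamard_then_permute([3.0, -1.0, 2.0, 0.5]))
```

Both factors are orthogonal, so they preserve vector norms; composing them with an orthogonal eigenvector matrix `R` yields another orthogonal matrix.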
+
+ File layout (PyTorch state-dict):
+ ```python
+ {
+     "format_version": 1,
+     "objective": "qqt_r_h_pbr",  # or "sst_r_h_pbr" for V
+     "source_grouping": "layer",
+     "layers": {
+         # each "rotation" is a tensor of shape (head_dim, head_dim)
+         0: {"layer_id": 0, "rotation": tensor(head_dim, head_dim)},
+         1: {"layer_id": 1, "rotation": tensor(head_dim, head_dim)},
+         # ... one entry per layer
+     }
+ }
+ ```
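
A loaded file can be sanity-checked against this layout before it is handed to the server. The sketch below is illustrative only: `check_rotation_layout` is a hypothetical helper, and it assumes that every layer shares one `head_dim` and that each dictionary key equals its `layer_id`, as the example layout suggests.

```python
def check_rotation_layout(obj, expected_objective):
    """Validate a loaded rotation dict against the layout above; return head_dim.

    Hypothetical helper, not part of OSCAR. `obj` is the already-loaded dict.
    """
    if obj.get("format_version") != 1:
        raise ValueError(f"unexpected format_version: {obj.get('format_version')!r}")
    if obj.get("objective") != expected_objective:
        raise ValueError(f"objective is {obj.get('objective')!r}, "
                         f"expected {expected_objective!r}")
    layers = obj.get("layers")
    if not layers:
        raise ValueError("no layers found")
    head_dim = None
    for key, entry in layers.items():
        if entry["layer_id"] != key:
            raise ValueError(f"layer key {key!r} does not match layer_id")
        shape = tuple(entry["rotation"].shape)
        if len(shape) != 2 or shape[0] != shape[1]:
            raise ValueError(f"layer {key}: rotation is not square: {shape}")
        if head_dim is None:
            head_dim = shape[0]
        elif shape[0] != head_dim:
            raise ValueError(f"layer {key}: head_dim {shape[0]} != {head_dim}")
    return head_dim
```

With PyTorch installed, the input would typically come from `torch.load(path, map_location="cpu")`, called with `"qqt_r_h_pbr"` for the K file and `"sst_r_h_pbr"` for the V file.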
+
+ ## How to use
+
+ ### 1. Download the rotation for your model
+
+ ```bash
+ pip install huggingface_hub
+ ```
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ snapshot_download(
+     repo_id="Zhongzhu/OSCAR-RotationZoo",
+     allow_patterns="Qwen3-8B/**",
+     local_dir="./oscar_rotations",
+ )
+ # rotations now at ./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128/
+ ```
+
+ ### 2. Serve with sglang-research using the rotation
+
+ Clone https://github.com/FutureMLS-Lab/OSCAR and set up the single `oscar`
+ conda env, then point the eval driver at your downloaded rotation:
+
+ ```bash
+ ROT_DIR=./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128 \
+     bash rotation/qwen3-8B/eval_gpqa.sh
+ ```
+
+ The driver internally launches sglang with these flags:
+
+ ```bash
+ SGLANG_ENABLE_MIXED_KV_WINDOWS=1 \
+ SGLANG_OSCAR_ROTATION_MODE=oscar \
+ SGLANG_OSCAR_K_ROTATION_PATH=$ROT_DIR/k_rotation_qqt_r_h_pbr.pt \
+ SGLANG_OSCAR_V_ROTATION_PATH=$ROT_DIR/v_rotation_sst_r_h_pbr.pt \
+ SGLANG_OSCAR_K_CLIP_RATIO=0.96 \
+ SGLANG_OSCAR_V_CLIP_RATIO=0.92 \
+ SGLANG_OSCAR_ABSORB_V_ROTATION=1 \
+ SGLANG_MIXED_KV_PREFIX_TOKENS=64 \
+ SGLANG_MIXED_KV_RECENT_TOKENS=256 \
+ HADAMARD_ORDER=128 \
+ python -m sglang.launch_server \
+     --model-path Qwen/Qwen3-8B \
+     --tensor-parallel-size 1 \
+     --kv-cache-dtype int2 \
+     --kv-cache-quant-group-size 128 \
+     --prefill-attention-backend fa3 \
+     --decode-attention-backend triton \
+     --disable-radix-cache \
+     --disable-custom-all-reduce \
+     --trust-remote-code
+ ```
+
+ The first 64 sink tokens (`SGLANG_MIXED_KV_PREFIX_TOKENS=64`) and the most
+ recent 256 tokens (`SGLANG_MIXED_KV_RECENT_TOKENS=256`) stay in BF16; the rest
+ of the KV cache is INT2-quantized in 128-element groups along head_dim using
+ these rotations.
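
The mixed layout also determines how much memory is actually saved. The back-of-the-envelope sketch below is not taken from the OSCAR code: `kv_compression_ratio` is a hypothetical helper, and the 32 bits of per-group metadata (for example a 16-bit scale plus a 16-bit zero point) is an assumption about the storage format rather than something this README states.

```python
def kv_compression_ratio(seq_len, prefix_tokens=64, recent_tokens=256,
                         group_size=128, code_bits=2,
                         meta_bits_per_group=32, full_bits=16):
    """Estimate the BF16-to-mixed KV-cache size ratio for one sequence.

    Assumption (not from the README): each quantization group stores
    `meta_bits_per_group` bits of metadata next to its 2-bit codes.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    # Sink and recent-window tokens stay at full precision.
    kept = min(seq_len, prefix_tokens + recent_tokens)
    quantized = seq_len - kept
    # Effective bits per quantized element, including amortized metadata.
    bits_per_quantized = code_bits + meta_bits_per_group / group_size
    mixed_bits = kept * full_bits + quantized * bits_per_quantized
    return seq_len * full_bits / mixed_bits


for length in (320, 4096, 20000, 200000):
    print(length, round(kv_compression_ratio(length), 2))
```

Under these assumptions a quantized element costs 2.25 bits, so the ratio approaches 16 / 2.25 ≈ 7.1 for long sequences, and it falls to 1 when the whole sequence fits inside the BF16 windows.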
+
+ ## Reproducing from scratch
+
+ If you want to fit your own rotation on a different calibration set, the
+ OSCAR pipeline is end-to-end reproducible:
+
+ ```bash
+ git clone https://github.com/FutureMLS-Lab/OSCAR.git
+ cd OSCAR
+ bash rotation/qwen3-8B/save_qkv_8b.sh        # phase 1: dump Q/K/V
+ bash rotation/qwen3-8B/compute_rotation.sh   # phase 2: fit R = eigvec(Σ_Q)
+ ```
+
+ ## Citation
+
+ ```bibtex
+ @article{zhou2026oscar,
+   title = {OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization},
+   author = {Zhou, Zhongzhu and Zhuang, Donglin and Li, Jisen and Chen, Ziyan and Song, Shuaiwen Leon and Athiwaratkun, Ben and Wu, Xiaoxia},
+   year = {2026},
+   note = {Together AI; University of Sydney; UIUC},
+ }
+ ```