+---
+license: apache-2.0
+tags:
+- oscar
+- int2
+- kv-cache
+- quantization
+- rotation
+- sglang
+---
+# OSCAR RotationZoo
+Precomputed K/V rotation matrices for **OSCAR INT2 KV-cache quantization**.
+📄 Paper: *OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization*
+💻 Code: https://github.com/FutureMLS-Lab/OSCAR
+OSCAR captures Q/K/V activations on a small calibration set, estimates
+attention-aware K/V covariance offline, and derives per-layer orthogonal
+rotations that align INT2 quantization with the directions attention actually
+consumes. The result is roughly 7× compression of the KV-cache memory footprint
+with a single-digit percentage-point (pp) accuracy drop on GPQA for dense
+reasoning models (at most 2.32 pp among the models listed below).
+This repo packages the rotations as drop-in `.pt` files so you don't need to
+re-run the Q/K/V dump and eigendecomposition yourself.
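As a back-of-the-envelope check on the ~7× figure, the sketch below estimates the compression ratio for a single sequence. It rests on assumptions that are not stated in this card: each 128-element INT2 group is taken to carry 32 bits of metadata (for example a 16-bit scale plus a 16-bit zero point), and the BF16 sink and recent windows are taken to be 64 and 256 tokens, as in the launch flags in the "How to use" section. The function name and its parameters are illustrative and not part of OSCAR.

```python
def kv_compression_ratio(seq_len, group_size=128, meta_bits_per_group=32,
                         prefix_tokens=64, recent_tokens=256):
    """Estimate BF16 -> mixed BF16/INT2 KV-cache compression for one sequence.

    meta_bits_per_group is an ASSUMPTION (16-bit scale + 16-bit zero point).
    """
    bf16_bits = 16.0
    # Effective bits per INT2 element: 2 payload bits plus amortized metadata.
    int2_bits = 2.0 + meta_bits_per_group / group_size
    # Tokens kept in BF16: the sink prefix plus the recent window.
    kept = min(seq_len, prefix_tokens + recent_tokens)
    quantized = seq_len - kept
    total_bits = kept * bf16_bits + quantized * int2_bits
    return seq_len * bf16_bits / total_bits


for n in (4_096, 32_768, 131_072):
    print(n, round(kv_compression_ratio(n), 2))
```

Under these assumptions the ratio approaches 16 / 2.25 ≈ 7.1 for long contexts; shorter contexts compress less because the BF16 windows take a larger share of the cache.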
+## Available rotations
+| Model | Calibration | GPQA (BF16) | GPQA (OSCAR INT2) |
+|---|---|---:|---:|
+| `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt83_group128` | 67.27 | 64.95 |
+| `Qwen/Qwen3-8B`               | `seq20000_prompt83_group128` | 56.67 | 55.05 |
+| `Qwen/Qwen3-32B`              | `seq16000_prompt69_group128` | 58.49 | 60.40 |
+| `zai-org/GLM-4.7-FP8`         | `seq10000_prompt43_group128` | 73.23 | 73.57 |
+`seq<T>_prompt<N>_group<G>` notation: `T` = total calibration tokens,
+`N` = calibration prompt count, `G` = INT2 quant group size along head_dim.
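The calibration tag can be split into its three fields with a small helper. This is a minimal sketch: `CalibrationTag` and `parse_calibration_tag` are hypothetical names introduced here for illustration, not part of the OSCAR code base.

```python
import re
from typing import NamedTuple


class CalibrationTag(NamedTuple):
    total_tokens: int   # T: total calibration tokens
    prompt_count: int   # N: number of calibration prompts
    group_size: int     # G: INT2 quant group size along head_dim


_TAG_RE = re.compile(r"^seq(\d+)_prompt(\d+)_group(\d+)$")


def parse_calibration_tag(tag: str) -> CalibrationTag:
    """Parse a seq<T>_prompt<N>_group<G> directory name."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a seq<T>_prompt<N>_group<G> tag: {tag!r}")
    return CalibrationTag(*(int(g) for g in match.groups()))


tag = parse_calibration_tag("seq20000_prompt83_group128")
print(tag)
```

The parsed `group_size` presumably needs to match the `--kv-cache-quant-group-size` value used when serving.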
+## File format
+Each rotation directory contains:
+- `k_rotation_qqt_r_h_pbr.pt` — K-side rotation `R_K = R · H · P_br` where
+  `R = eigvec(Σ_Q)` is fit on Q's attention-aware covariance, `H` is a
+  head-dim Hadamard, and `P_br` is the eigenvalue-sorted bit-reversal
+  permutation
+- `v_rotation_sst_r_h_pbr.pt` — V-side rotation built on the score-weighted
+  V covariance `Σ_V = V^T diag(K (Q^T Q) K^T) V`, with `Q`, `K`, `V` stored
+  as (tokens × head_dim) matrices
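The K-side construction above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the repository's implementation: the plain second moment `Q^T Q` stands in for the attention-aware covariance, eigenvectors are sorted by descending eigenvalue, `H` is a normalized Sylvester Hadamard matrix, and all function names are hypothetical.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def bit_reversal_permutation(n: int) -> np.ndarray:
    """Permutation matrix; right-multiplying reorders columns in bit-reversed order."""
    bits = n.bit_length() - 1
    idx = [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]
    return np.eye(n)[:, idx]


def fit_k_rotation(q: np.ndarray) -> np.ndarray:
    """Sketch of R_K = R @ H @ P_br from Q activations of shape (tokens, head_dim)."""
    sigma_q = q.T @ q / q.shape[0]               # ASSUMPTION: plain second moment of Q
    eigvals, eigvecs = np.linalg.eigh(sigma_q)   # eigh returns ascending eigenvalues
    r = eigvecs[:, np.argsort(eigvals)[::-1]]    # ASSUMPTION: descending eigenvalue order
    d = q.shape[1]
    return r @ hadamard(d) @ bit_reversal_permutation(d)


rng = np.random.default_rng(0)
q = rng.standard_normal((1024, 128)) * np.linspace(0.1, 3.0, 128)  # anisotropic toy data
k = rng.standard_normal((16, 128))
r_k = fit_k_rotation(q)

# Both values should be zero up to floating-point error.
print("orthogonality error:", np.abs(r_k.T @ r_k - np.eye(128)).max())
print("score change:", np.abs((q[:16] @ r_k) @ (k @ r_k).T - q[:16] @ k.T).max())
```

Because each factor is orthogonal, so is their product, and rotating both Q and K by it leaves the attention scores `Q K^T` unchanged. That invariance is what allows K to be quantized in the rotated basis.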
+File layout (PyTorch state-dict):
+```python
+{
+  "format_version": 1,
+  "objective":      "qqt_r_h_pbr",     # or "sst_r_h_pbr" for V
+  "source_grouping": "layer",
+  "layers": {
+    0:  {"layer_id": 0,  "rotation": tensor(head_dim, head_dim)},
+    1:  {"layer_id": 1,  "rotation": tensor(head_dim, head_dim)},
+    ...
+  }
+}
+```
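A quick sanity check after downloading is to load a file and confirm that every per-layer matrix is square and close to orthogonal. The sketch below assumes the layout shown above; the helper names and the `1e-3` tolerance are illustrative choices, and the loader needs PyTorch installed.

```python
import numpy as np


def load_rotation_file(path):
    """Load one RotationZoo .pt file on CPU (needs PyTorch installed)."""
    import torch  # lazy import: the checks below also work on plain arrays
    return torch.load(path, map_location="cpu")


def to_numpy(rotation) -> np.ndarray:
    """Accept a torch tensor or array-like and return a float64 NumPy array."""
    if hasattr(rotation, "detach"):  # looks like a torch.Tensor
        rotation = rotation.detach().float().cpu().numpy()
    return np.asarray(rotation, dtype=np.float64)


def check_rotations(payload, atol=1e-3):
    """Return {layer_id: max |R^T R - I|}; raise if a layer is not orthogonal."""
    errors = {}
    for key, entry in payload["layers"].items():
        r = to_numpy(entry["rotation"])
        if r.ndim != 2 or r.shape[0] != r.shape[1]:
            raise ValueError(f"layer {key}: expected a square matrix, got {r.shape}")
        err = float(np.abs(r.T @ r - np.eye(r.shape[0])).max())
        if err > atol:
            raise ValueError(f"layer {key}: orthogonality error {err:.2e}")
        errors[entry["layer_id"]] = err
    return errors


# Synthetic payload in the documented layout, so the check runs without a download.
demo = {
    "format_version": 1,
    "objective": "qqt_r_h_pbr",
    "source_grouping": "layer",
    "layers": {0: {"layer_id": 0, "rotation": np.eye(4)}},
}
print(check_rotations(demo))

# With a real file (after the download step in "How to use"):
# payload = load_rotation_file(
#     "./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128/k_rotation_qqt_r_h_pbr.pt")
# print(payload["objective"], len(payload["layers"]), check_rotations(payload))
```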
+## How to use
+### 1. Download the rotation for your model
+```bash
+pip install huggingface_hub
+```
+```python
+from huggingface_hub import snapshot_download
+snapshot_download(
+    repo_id="Zhongzhu/OSCAR-RotationZoo",
+    allow_patterns="Qwen3-8B/**",
+    local_dir="./oscar_rotations",
+)
+# rotations now at ./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128/
+```
+### 2. Serve with sglang-research using the rotation
+Clone https://github.com/FutureMLS-Lab/OSCAR and set up the single `oscar`
+conda env, then point the eval driver at your downloaded rotation:
+```bash
+ROT_DIR=./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128 \
+  bash rotation/qwen3-8B/eval_gpqa.sh
+```
+The driver internally launches sglang with these flags:
+```bash
+SGLANG_ENABLE_MIXED_KV_WINDOWS=1 \
+SGLANG_OSCAR_ROTATION_MODE=oscar \
+SGLANG_OSCAR_K_ROTATION_PATH=$ROT_DIR/k_rotation_qqt_r_h_pbr.pt \
+SGLANG_OSCAR_V_ROTATION_PATH=$ROT_DIR/v_rotation_sst_r_h_pbr.pt \
+SGLANG_OSCAR_K_CLIP_RATIO=0.96 \
+SGLANG_OSCAR_V_CLIP_RATIO=0.92 \
+SGLANG_OSCAR_ABSORB_V_ROTATION=1 \
+SGLANG_MIXED_KV_PREFIX_TOKENS=64 \
+SGLANG_MIXED_KV_RECENT_TOKENS=256 \
+HADAMARD_ORDER=128 \
+python -m sglang.launch_server \
+  --model-path Qwen/Qwen3-8B \
+  --tensor-parallel-size 1 \
+  --kv-cache-dtype int2 \
+  --kv-cache-quant-group-size 128 \
+  --prefill-attention-backend fa3 \
+  --decode-attention-backend triton \
+  --disable-radix-cache \
+  --disable-custom-all-reduce \
+  --trust-remote-code
+```
+The first 64 tokens (attention sinks, `SGLANG_MIXED_KV_PREFIX_TOKENS=64`) and
+the most recent 256 tokens (`SGLANG_MIXED_KV_RECENT_TOKENS=256`) stay in BF16.
+The rest of the KV cache is rotated with these matrices and quantized to INT2
+in 128-element groups along head_dim.
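To make the group-wise INT2 step concrete, here is a NumPy sketch of asymmetric 2-bit quantization over 128-element groups, comparing the round-trip error with and without a rotation. It is not the sglang kernel: packing four codes per byte is omitted, the rotation is a random orthogonal stand-in rather than an OSCAR rotation, and `clip_ratio` shrinking the per-group range around its midpoint is only one plausible reading of the `*_CLIP_RATIO` settings.

```python
import numpy as np


def quantize_int2(x: np.ndarray, group_size: int = 128, clip_ratio: float = 1.0):
    """Asymmetric 2-bit quantization over groups along the last axis.

    Returns (codes in {0, 1, 2, 3}, scale, zero_point), one scale/zero per group.
    """
    if x.shape[-1] % group_size:
        raise ValueError("last dimension must be a multiple of group_size")
    g = x.reshape(*x.shape[:-1], -1, group_size)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    mid = (hi + lo) / 2.0
    half = (hi - lo) / 2.0 * clip_ratio          # ASSUMPTION: clip shrinks the range
    zero_point = mid - half
    scale = np.maximum(2.0 * half, 1e-8) / 3.0   # 4 levels -> 3 steps
    codes = np.clip(np.round((g - zero_point) / scale), 0, 3).astype(np.uint8)
    return codes, scale, zero_point


def dequantize_int2(codes, scale, zero_point, shape):
    """Map 2-bit codes back to floats and restore the original shape."""
    return (codes * scale + zero_point).reshape(shape)


def roundtrip_mse(x: np.ndarray, clip_ratio: float = 0.96) -> float:
    codes, scale, zero = quantize_int2(x, clip_ratio=clip_ratio)
    return float(np.mean((dequantize_int2(codes, scale, zero, x.shape) - x) ** 2))


rng = np.random.default_rng(0)
k = rng.standard_normal((512, 128))
k[:, :4] *= 20.0                                          # a few outlier channels
rot, _ = np.linalg.qr(rng.standard_normal((128, 128)))    # stand-in orthogonal rotation

print("no rotation  :", roundtrip_mse(k))
print("with rotation:", roundtrip_mse(k @ rot))
```

In this toy setup the rotation spreads the outlier channels across all 128 dimensions, so each group's range (and therefore the quantization step) is smaller. Since the rotation is orthogonal, the error measured in the rotated basis equals the error after rotating back.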
+## Reproducing from scratch
+To fit your own rotations on a different calibration set, run the OSCAR
+pipeline end to end in two phases:
+```bash
+git clone https://github.com/FutureMLS-Lab/OSCAR.git
+cd OSCAR
+bash rotation/qwen3-8B/save_qkv_8b.sh        # phase 1 — dump Q/K/V
+bash rotation/qwen3-8B/compute_rotation.sh   # phase 2 — fit R = eigvec(Σ_Q)
+```
+## Citation
+```bibtex
+@article{zhou2026oscar,
+  title  = {OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization},
+  author = {Zhou, Zhongzhu and Zhuang, Donglin and Li, Jisen and Chen, Ziyan and Song, Shuaiwen Leon and Athiwaratkun, Ben and Wu, Xiaoxia},
+  year   = {2026},
+  note   = {Together AI; University of Sydney; UIUC},
+}
+```