| --- |
| license: apache-2.0 |
| tags: |
| - oscar |
| - int2 |
| - kv-cache |
| - quantization |
| - rotation |
| - sglang |
| pipeline_tag: text-generation |
| --- |
| |
| <p align="center"> |
| <img src="https://huggingface.co/Zhongzhu/OSCAR-RotationZoo/resolve/main/oscar_logo_kv_transparent.png" alt="OSCAR INT2 KV-Cache" width="180"/> |
| </p> |
|
|
| # OSCAR RotationZoo |
|
|
| Precomputed K/V rotation matrices for **OSCAR INT2 KV-cache quantization**. |
|
|
| This repository contains the artifacts for the paper: |
| **OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization** |
| *Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu* |
|
|
| - π **Paper** β [arXiv:2605.17757](https://arxiv.org/abs/2605.17757) |
| - π **Website** β https://oscar-quantize.github.io/ |
| - π» **Code** β https://github.com/FutureMLS-Lab/OSCAR |
|
|
| OSCAR captures Q/K/V activations on a small calibration set, estimates attention-aware K/V covariance offline, and derives per-layer orthogonal rotations that align INT2 quantization with the directions attention actually consumes. The result is ~7Γ compression of the KV-cache memory footprint with single-digit pp accuracy drop on GPQA for dense reasoning models. |
|
|
| This repo packages the rotations as drop-in `.pt` files so you don't need to re-run the Q/K/V dump and eigendecomposition yourself. |
|
|
| ## Available rotations |
|
|
| | Model | Calibration | GPQA (BF16) | GPQA (OSCAR INT2) | |
| |---|---|---:|---:| |
| | `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt83_group128` | 67.27 | 67.17 | |
| | `Qwen/Qwen3-4B-Thinking-2507` | `seq20000_prompt85_group128` (fresh re-dump) | 67.27 | β | |
| | `Qwen/Qwen3-8B` | `seq20000_prompt83_group128` | 56.67 | 55.56 | |
| | `Qwen/Qwen3-32B` | `seq16000_prompt69_group128` | 58.49 | 60.40 | |
| | `zai-org/GLM-4.7-FP8` | `seq10000_prompt43_group128` | 73.23 | 73.57 | |
|
|
| `seq<T>_prompt<N>_group<G>` notation: `T` = total calibration tokens, `N` = calibration prompt count, `G` = INT2 quant group size along head_dim. |
| |
| ## File format |
| |
| Each rotation directory contains: |
| |
| - `k_rotation_qqt_r_h_pbr.pt` β K-side rotation `R_K = R Β· H Β· P_br` where `R = eigvec(Ξ£_Q)` is fit on Q's attention-aware covariance, `H` is a head-dim Hadamard, and `P_br` is the eigenvalue-sorted bit-reversal permutation |
| - `v_rotation_sst_r_h_pbr.pt` β V-side rotation built on the score-weighted V covariance `Ξ£_V = V^T diag(K^T (Q^T Q) K) V` |
|
|
| File layout (PyTorch state-dict): |
| ```python |
| { |
| "format_version": 1, |
| "objective": "qqt_r_h_pbr", # or "sst_r_h_pbr" for V |
| "source_grouping": "layer", |
| "layers": { |
| 0: {"layer_id": 0, "rotation": tensor(head_dim, head_dim)}, |
| 1: {"layer_id": 1, "rotation": tensor(head_dim, head_dim)}, |
| ... |
| } |
| } |
| ``` |
|
|
| ## How to use |
|
|
| ### 1. Download the rotation for your model |
|
|
| ```bash |
| pip install huggingface_hub |
| ``` |
|
|
| ```python |
| from huggingface_hub import snapshot_download |
| snapshot_download( |
| repo_id="Zhongzhu/OSCAR-RotationZoo", |
| allow_patterns="Qwen3-8B/**", |
| local_dir="./oscar_rotations", |
| ) |
| # rotations now at ./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128/ |
| ``` |
|
|
| ### 2. Serve with sglang-research using the rotation |
|
|
| Clone https://github.com/FutureMLS-Lab/OSCAR and set up the single `oscar` conda env, then point the eval driver at your downloaded rotation: |
|
|
| ```bash |
| ROT_DIR=./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128 \ |
| bash rotation/qwen3-8B/eval_gpqa.sh |
| ``` |
|
|
| The driver internally launches sglang with these flags: |
|
|
| ```bash |
| SGLANG_ENABLE_MIXED_KV_WINDOWS=1 \ |
| SGLANG_OSCAR_ROTATION_MODE=oscar \ |
| SGLANG_OSCAR_K_ROTATION_PATH=$ROT_DIR/k_rotation_qqt_r_h_pbr.pt \ |
| SGLANG_OSCAR_V_ROTATION_PATH=$ROT_DIR/v_rotation_sst_r_h_pbr.pt \ |
| SGLANG_OSCAR_K_CLIP_RATIO=0.96 \ |
| SGLANG_OSCAR_V_CLIP_RATIO=0.92 \ |
| SGLANG_OSCAR_ABSORB_V_ROTATION=1 \ |
| SGLANG_MIXED_KV_PREFIX_TOKENS=64 \ |
| SGLANG_MIXED_KV_RECENT_TOKENS=256 \ |
| HADAMARD_ORDER=128 \ |
| python -m sglang.launch_server \ |
| --model-path Qwen/Qwen3-8B \ |
| --tensor-parallel-size 1 \ |
| --kv-cache-dtype int2 \ |
| --kv-cache-quant-group-size 128 \ |
| --prefill-attention-backend fa3 \ |
| --decode-attention-backend triton \ |
| --disable-radix-cache \ |
| --disable-custom-all-reduce \ |
| --trust-remote-code |
| ``` |
|
|
| ## Reproducing from scratch |
|
|
| If you want to fit your own rotation on a different calibration set, the OSCAR pipeline is end-to-end reproducible: |
|
|
| ```bash |
| git clone https://github.com/FutureMLS-Lab/OSCAR.git |
| cd OSCAR |
| bash rotation/qwen3-8B/save_qkv_8b.sh # phase 1 β dump Q/K/V |
| bash rotation/qwen3-8B/compute_rotation.sh # phase 2 β fit R = eigvec(Ξ£_Q) |
| ``` |
|
|
| ## Citation |
|
|
| ```bibtex |
| @article{zhou2026oscar, |
| title = {OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization}, |
| author = {Zhou, Zhongzhu and Zhuang, Donglin and Li, Jisen and Chen, Ziyan and Song, Shuaiwen Leon and Athiwaratkun, Ben and Wu, Xiaoxia}, |
| year = {2026}, |
| note = {Together AI; University of Sydney; UIUC}, |
| } |
| ``` |