Improve model card metadata and documentation
This PR improves the model card for the OSCAR RotationZoo. Key changes include:
- Adding the `text-generation` pipeline tag to the metadata for better discoverability.
- Adding the paper authors for better attribution.
- Ensuring links to the paper, project page, and code repository are easily accessible.
- Maintaining the detailed usage instructions and precomputed rotation tables.
README.md
CHANGED
|
@@ -7,6 +7,7 @@ tags:
|
|
| 7 |
- quantization
|
| 8 |
- rotation
|
| 9 |
- sglang
|
|
|
|
| 10 |
---
|
| 11 |
|
| 12 |
<p align="center">
|
|
@@ -17,18 +18,17 @@ tags:
|
|
| 17 |
|
| 18 |
Precomputed K/V rotation matrices for **OSCAR INT2 KV-cache quantization**.
|
| 19 |
|
| 20 |
-
|
| 21 |
- **Website**: https://oscar-quantize.github.io/
|
| 22 |
- **Code**: https://github.com/FutureMLS-Lab/OSCAR
|
| 23 |
|
| 24 |
-
OSCAR captures Q/K/V activations on a small calibration set, estimates
|
| 25 |
-
attention-aware K/V covariance offline, and derives per-layer orthogonal
|
| 26 |
-
rotations that align INT2 quantization with the directions attention actually
|
| 27 |
-
consumes. The result is ~7× compression of the KV-cache memory footprint with
|
| 28 |
-
single-digit pp accuracy drop on GPQA for dense reasoning models.
|
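As a rough sanity check on the "~7×" figure, the arithmetic can be sketched as below. The BF16 baseline and the group size of 128 come from the model card; the per-group metadata cost (one FP16 scale plus one FP16 zero point, 32 bits in total) is an assumption made for illustration, and the sketch ignores the sink and recent-window tokens that stay in BF16.

```python
def kv_compression_ratio(
    baseline_bits: float = 16.0,            # BF16 element width
    quant_bits: float = 2.0,                # INT2 payload per element
    group_size: int = 128,                  # quantization group size along head_dim
    metadata_bits_per_group: float = 32.0,  # ASSUMPTION: FP16 scale + FP16 zero point
) -> float:
    """Ratio of baseline KV-cache bits to quantized bits per element."""
    effective_bits = quant_bits + metadata_bits_per_group / group_size
    return baseline_bits / effective_bits


ratio = kv_compression_ratio()
print(f"{ratio:.2f}x")  # 7.11x under the assumed metadata layout
```

With zero metadata the ratio would be exactly 8×, so any realistic per-group overhead lands somewhat below that, which is consistent with the "~7×" claim.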
| 29 |
|
| 30 |
-
This repo packages the rotations as drop-in `.pt` files so you don't need to
|
| 31 |
-
re-run the Q/K/V dump and eigendecomposition yourself.
|
| 32 |
|
| 33 |
## Available rotations
|
| 34 |
|
|
@@ -40,25 +40,20 @@ re-run the Q/K/V dump and eigendecomposition yourself.
|
|
| 40 |
| `Qwen/Qwen3-32B` | `seq16000_prompt69_group128` | 58.49 | 60.40 |
|
| 41 |
| `zai-org/GLM-4.7-FP8` | `seq10000_prompt43_group128` | 73.23 | 73.57 |
|
| 42 |
|
| 43 |
-
`seq<T>_prompt<N>_group<G>` notation: `T` = total calibration tokens,
|
| 44 |
-
`N` = calibration prompt count, `G` = INT2 quant group size along head_dim.
|
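The directory naming scheme can be unpacked programmatically. The following is a minimal sketch, not part of the OSCAR repository: `RotationTag` and `parse_rotation_tag` are hypothetical helper names introduced here for illustration.

```python
import re
from typing import NamedTuple


class RotationTag(NamedTuple):
    """Fields encoded in a `seq<T>_prompt<N>_group<G>` directory name."""
    total_tokens: int   # T: total calibration tokens
    prompt_count: int   # N: number of calibration prompts
    group_size: int     # G: INT2 quantization group size along head_dim


_TAG_RE = re.compile(r"seq(\d+)_prompt(\d+)_group(\d+)")


def parse_rotation_tag(name: str) -> RotationTag:
    """Parse a rotation directory name; raise ValueError if it does not match."""
    match = _TAG_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a seq<T>_prompt<N>_group<G> name: {name!r}")
    return RotationTag(*(int(group) for group in match.groups()))


tag = parse_rotation_tag("seq16000_prompt69_group128")
print(tag)  # RotationTag(total_tokens=16000, prompt_count=69, group_size=128)
```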
| 45 |
|
| 46 |
## File format
|
| 47 |
|
| 48 |
Each rotation directory contains:
|
| 49 |
|
| 50 |
-
- `k_rotation_qqt_r_h_pbr.pt`: K-side rotation `R_K = R · H · P_br` where
|
| 51 |
-
`R = eigvec(Σ_Q)` is fit on Q's attention-aware covariance, `H` is a
|
| 52 |
-
head-dim Hadamard, and `P_br` is the eigenvalue-sorted bit-reversal
|
| 53 |
-
permutation
|
| 54 |
-
- `v_rotation_sst_r_h_pbr.pt`: V-side rotation built on the score-weighted
|
| 55 |
-
V covariance `Σ_V = V^T diag(K^T (Q^T Q) K) V`
|
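The reason a composed rotation such as `R · H · P_br` is safe to apply is that a product of orthogonal matrices is itself orthogonal, so rotating both a query and a key by it leaves their dot product unchanged. The sketch below illustrates only that property in plain Python. It is not the OSCAR implementation: `R` is taken to be the identity, the permutation is a plain (not eigenvalue-sorted) bit reversal, the composition order is an assumption, and every function name is hypothetical.

```python
import math


def hadamard(n):
    """Sylvester-construction Hadamard matrix, scaled to have orthonormal rows."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # Block form [[H, H], [H, -H]] doubles the size at each step.
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[x * scale for x in row] for row in h]


def bit_reversal_permutation(n):
    """Map index i to the index whose log2(n)-bit binary form is reversed."""
    bits = n.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]


def rotate(x, h, perm):
    """Compute x @ H, then reorder the result by the permutation."""
    n = len(x)
    y = [sum(x[i] * h[i][j] for i in range(n)) for j in range(n)]
    return [y[perm[j]] for j in range(n)]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


head_dim = 8  # toy size; real head dimensions are larger
H = hadamard(head_dim)
P = bit_reversal_permutation(head_dim)

q = [0.5, -1.0, 2.0, 0.0, 1.5, -0.25, 3.0, 1.0]
k = [1.0, 0.5, -2.0, 4.0, 0.0, 1.0, -1.0, 0.25]

score_before = dot(q, k)
score_after = dot(rotate(q, H, P), rotate(k, H, P))
print(P)  # [0, 4, 2, 6, 1, 5, 3, 7]
print(math.isclose(score_before, score_after, abs_tol=1e-9))  # True
```

The rotation changes how the values are distributed across coordinates, which is what matters for quantization error, while the attention score itself is preserved in exact arithmetic.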
| 56 |
|
| 57 |
File layout (PyTorch state-dict):
|
| 58 |
```python
|
| 59 |
{
|
| 60 |
"format_version": 1,
|
| 61 |
-
"objective": "qqt_r_h_pbr" # or "sst_r_h_pbr" for V
|
| 62 |
"source_grouping": "layer",
|
| 63 |
"layers": {
|
| 64 |
0: {"layer_id": 0, "rotation": tensor(head_dim, head_dim)},
|
|
@@ -88,8 +83,7 @@ snapshot_download(
|
|
| 88 |
|
| 89 |
### 2. Serve with sglang-research using the rotation
|
| 90 |
|
| 91 |
-
Clone https://github.com/FutureMLS-Lab/OSCAR and set up the single `oscar`
|
| 92 |
-
conda env, then point the eval driver at your downloaded rotation:
|
| 93 |
|
| 94 |
```bash
|
| 95 |
ROT_DIR=./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128 \
|
|
@@ -121,14 +115,9 @@ python -m sglang.launch_server \
|
|
| 121 |
--trust-remote-code
|
| 122 |
```
|
| 123 |
|
| 124 |
-
Sink (`PREFIX_TOKENS=64`) and recent window (`RECENT_TOKENS=256`) tokens stay
|
| 125 |
-
in BF16; the bulk of the KV cache is INT2-quantized into 128-element groups
|
| 126 |
-
along head_dim using these rotations.
|
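The split between BF16 and INT2 regions can be pictured with a small position-based sketch. The two defaults mirror `PREFIX_TOKENS=64` and `RECENT_TOKENS=256` from the text; the boundary handling for short sequences and the function name `split_kv_positions` are assumptions for illustration, and the actual serving code may track the recent window differently during decoding.

```python
def split_kv_positions(seq_len, prefix_tokens=64, recent_tokens=256):
    """Return (sink, quantized, recent) position ranges for a sequence.

    ASSUMED policy: the first `prefix_tokens` and the last `recent_tokens`
    positions stay in BF16; everything in between is INT2-quantized.
    """
    sink_end = min(prefix_tokens, seq_len)
    recent_start = max(sink_end, seq_len - recent_tokens)
    return (
        range(0, sink_end),             # sink tokens, kept in BF16
        range(sink_end, recent_start),  # bulk of the cache, INT2
        range(recent_start, seq_len),   # recent window, kept in BF16
    )


sink, quantized, recent = split_kv_positions(4096)
print(len(sink), len(quantized), len(recent))  # 64 3776 256
```

Under this assumed policy, a sequence shorter than the two windows combined has no quantized region at all, so the memory savings only appear for long contexts.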
| 127 |
-
|
| 128 |
## Reproducing from scratch
|
| 129 |
|
| 130 |
-
If you want to fit your own rotation on a different calibration set, the
|
| 131 |
-
OSCAR pipeline is end-to-end reproducible:
|
| 132 |
|
| 133 |
```bash
|
| 134 |
git clone https://github.com/FutureMLS-Lab/OSCAR.git
|
|
@@ -146,4 +135,4 @@ bash rotation/qwen3-8B/compute_rotation.sh # phase 2 - fit R = eigvec(Σ_Q)
|
|
| 146 |
year = {2026},
|
| 147 |
note = {Together AI; University of Sydney; UIUC},
|
| 148 |
}
|
| 149 |
-
```
|
|
|
|
| 7 |
- quantization
|
| 8 |
- rotation
|
| 9 |
- sglang
|
| 10 |
+
pipeline_tag: text-generation
|
| 11 |
---
|
| 12 |
|
| 13 |
<p align="center">
|
|
|
|
| 18 |
|
| 19 |
Precomputed K/V rotation matrices for **OSCAR INT2 KV-cache quantization**.
|
| 20 |
|
| 21 |
+
This repository contains the artifacts for the paper:
|
| 22 |
+
**OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization**
|
| 23 |
+
*Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu*
|
| 24 |
+
|
| 25 |
+
- **Paper**: [arXiv:2605.17757](https://arxiv.org/abs/2605.17757)
|
| 26 |
- **Website**: https://oscar-quantize.github.io/
|
| 27 |
- **Code**: https://github.com/FutureMLS-Lab/OSCAR
|
| 28 |
|
| 29 |
+
OSCAR captures Q/K/V activations on a small calibration set, estimates attention-aware K/V covariance offline, and derives per-layer orthogonal rotations that align INT2 quantization with the directions attention actually consumes. The result is ~7× compression of the KV-cache memory footprint with single-digit pp accuracy drop on GPQA for dense reasoning models.
|
| 30 |
|
| 31 |
+
This repo packages the rotations as drop-in `.pt` files so you don't need to re-run the Q/K/V dump and eigendecomposition yourself.
|
|
|
|
| 32 |
|
| 33 |
## Available rotations
|
| 34 |
|
|
|
|
| 40 |
| `Qwen/Qwen3-32B` | `seq16000_prompt69_group128` | 58.49 | 60.40 |
|
| 41 |
| `zai-org/GLM-4.7-FP8` | `seq10000_prompt43_group128` | 73.23 | 73.57 |
|
| 42 |
|
| 43 |
+
`seq<T>_prompt<N>_group<G>` notation: `T` = total calibration tokens, `N` = calibration prompt count, `G` = INT2 quant group size along head_dim.
|
|
|
|
| 44 |
|
| 45 |
## File format
|
| 46 |
|
| 47 |
Each rotation directory contains:
|
| 48 |
|
| 49 |
+
- `k_rotation_qqt_r_h_pbr.pt`: K-side rotation `R_K = R · H · P_br` where `R = eigvec(Σ_Q)` is fit on Q's attention-aware covariance, `H` is a head-dim Hadamard, and `P_br` is the eigenvalue-sorted bit-reversal permutation
|
| 50 |
+
- `v_rotation_sst_r_h_pbr.pt`: V-side rotation built on the score-weighted V covariance `Σ_V = V^T diag(K^T (Q^T Q) K) V`
|
| 51 |
|
| 52 |
File layout (PyTorch state-dict):
|
| 53 |
```python
|
| 54 |
{
|
| 55 |
"format_version": 1,
|
| 56 |
+
"objective": "qqt_r_h_pbr", # or "sst_r_h_pbr" for V
|
| 57 |
"source_grouping": "layer",
|
| 58 |
"layers": {
|
| 59 |
0: {"layer_id": 0, "rotation": tensor(head_dim, head_dim)},
|
|
|
|
| 83 |
|
| 84 |
### 2. Serve with sglang-research using the rotation
|
| 85 |
|
| 86 |
+
Clone https://github.com/FutureMLS-Lab/OSCAR and set up the single `oscar` conda env, then point the eval driver at your downloaded rotation:
|
|
|
|
| 87 |
|
| 88 |
```bash
|
| 89 |
ROT_DIR=./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128 \
|
|
|
|
| 115 |
--trust-remote-code
|
| 116 |
```
|
| 117 |
|
| 118 |
## Reproducing from scratch
|
| 119 |
|
| 120 |
+
If you want to fit your own rotation on a different calibration set, the OSCAR pipeline is end-to-end reproducible:
|
|
|
|
| 121 |
|
| 122 |
```bash
|
| 123 |
git clone https://github.com/FutureMLS-Lab/OSCAR.git
|
|
|
|
| 135 |
year = {2026},
|
| 136 |
note = {Together AI; University of Sydney; UIUC},
|
| 137 |
}
|
| 138 |
+
```
|