nielsr (HF Staff) committed · Commit 9c629f3 · verified · 1 Parent(s): e2ee6f0

Improve model card metadata and documentation

This PR improves the model card for the OSCAR RotationZoo. Key changes include:
- Adding the `text-generation` pipeline tag to the metadata for better discoverability.
- Adding the paper authors for better attribution.
- Ensuring links to the paper, project page, and code repository are easily accessible.
- Maintaining the detailed usage instructions and precomputed rotation tables.

Files changed (1)
  1. README.md +15 -26
README.md CHANGED
@@ -7,6 +7,7 @@ tags:
7
  - quantization
8
  - rotation
9
  - sglang
 
10
  ---
11
 
12
  <p align="center">
@@ -17,18 +18,17 @@ tags:
17
 
18
  Precomputed K/V rotation matrices for **OSCAR INT2 KV-cache quantization**.
19
 
20
- - 📄 **Paper** — [*OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization*](https://arxiv.org/pdf/2605.17757)
 
 
 
 
21
  - 🌐 **Website** β€” https://oscar-quantize.github.io/
22
  - πŸ’» **Code** β€” https://github.com/FutureMLS-Lab/OSCAR
23
 
24
- OSCAR captures Q/K/V activations on a small calibration set, estimates
25
- attention-aware K/V covariance offline, and derives per-layer orthogonal
26
- rotations that align INT2 quantization with the directions attention actually
27
- consumes. The result is ~7× compression of the KV-cache memory footprint with
28
- single-digit pp accuracy drop on GPQA for dense reasoning models.
29
 
30
- This repo packages the rotations as drop-in `.pt` files so you don't need to
31
- re-run the Q/K/V dump and eigendecomposition yourself.
32
 
33
  ## Available rotations
34
 
@@ -40,25 +40,20 @@ re-run the Q/K/V dump and eigendecomposition yourself.
40
  | `Qwen/Qwen3-32B` | `seq16000_prompt69_group128` | 58.49 | 60.40 |
41
  | `zai-org/GLM-4.7-FP8` | `seq10000_prompt43_group128` | 73.23 | 73.57 |
42
 
43
- `seq<T>_prompt<N>_group<G>` notation: `T` = total calibration tokens,
44
- `N` = calibration prompt count, `G` = INT2 quant group size along head_dim.
45
 
46
  ## File format
47
 
48
  Each rotation directory contains:
49
 
50
- - `k_rotation_qqt_r_h_pbr.pt` — K-side rotation `R_K = R · H · P_br` where
51
- `R = eigvec(Σ_Q)` is fit on Q's attention-aware covariance, `H` is a
52
- head-dim Hadamard, and `P_br` is the eigenvalue-sorted bit-reversal
53
- permutation
54
- - `v_rotation_sst_r_h_pbr.pt` — V-side rotation built on the score-weighted
55
- V covariance `Σ_V = V^T diag(K^T (Q^T Q) K) V`
56
 
57
  File layout (PyTorch state-dict):
58
  ```python
59
  {
60
  "format_version": 1,
61
- "objective": "qqt_r_h_pbr" # or "sst_r_h_pbr" for V
62
  "source_grouping": "layer",
63
  "layers": {
64
  0: {"layer_id": 0, "rotation": tensor(head_dim, head_dim)},
@@ -88,8 +83,7 @@ snapshot_download(
88
 
89
  ### 2. Serve with sglang-research using the rotation
90
 
91
- Clone https://github.com/FutureMLS-Lab/OSCAR and set up the single `oscar`
92
- conda env, then point the eval driver at your downloaded rotation:
93
 
94
  ```bash
95
  ROT_DIR=./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128 \
@@ -121,14 +115,9 @@ python -m sglang.launch_server \
121
  --trust-remote-code
122
  ```
123
 
124
- Sink (`PREFIX_TOKENS=64`) and recent window (`RECENT_TOKENS=256`) tokens stay
125
- in BF16; the bulk of the KV cache is INT2-quantized into 128-element groups
126
- along head_dim using these rotations.
127
-
128
  ## Reproducing from scratch
129
 
130
- If you want to fit your own rotation on a different calibration set, the
131
- OSCAR pipeline is end-to-end reproducible:
132
 
133
  ```bash
134
  git clone https://github.com/FutureMLS-Lab/OSCAR.git
@@ -146,4 +135,4 @@ bash rotation/qwen3-8B/compute_rotation.sh # phase 2 — fit R = eigvec(Σ_Q)
146
  year = {2026},
147
  note = {Together AI; University of Sydney; UIUC},
148
  }
149
- ```
 
7
  - quantization
8
  - rotation
9
  - sglang
10
+ pipeline_tag: text-generation
11
  ---
12
 
13
  <p align="center">
 
18
 
19
  Precomputed K/V rotation matrices for **OSCAR INT2 KV-cache quantization**.
20
 
21
+ This repository contains the artifacts for the paper:
22
+ **OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization**
23
+ *Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu*
24
+
25
+ - 📄 **Paper** — [arXiv:2605.17757](https://arxiv.org/abs/2605.17757)
26
  - 🌐 **Website** — https://oscar-quantize.github.io/
27
  - 💻 **Code** — https://github.com/FutureMLS-Lab/OSCAR
28
 
29
+ OSCAR captures Q/K/V activations on a small calibration set, estimates attention-aware K/V covariance offline, and derives per-layer orthogonal rotations that align INT2 quantization with the directions attention actually consumes. The result is ~7× compression of the KV-cache memory footprint with single-digit pp accuracy drop on GPQA for dense reasoning models.
 
 
 
 
30
 
31
+ This repo packages the rotations as drop-in `.pt` files so you don't need to re-run the Q/K/V dump and eigendecomposition yourself.
 
32
 
33
  ## Available rotations
34
 
 
40
  | `Qwen/Qwen3-32B` | `seq16000_prompt69_group128` | 58.49 | 60.40 |
41
  | `zai-org/GLM-4.7-FP8` | `seq10000_prompt43_group128` | 73.23 | 73.57 |
42
 
43
+ `seq<T>_prompt<N>_group<G>` notation: `T` = total calibration tokens, `N` = calibration prompt count, `G` = INT2 quant group size along head_dim.
 
44
 
45
  ## File format
46
 
47
  Each rotation directory contains:
48
 
49
+ - `k_rotation_qqt_r_h_pbr.pt` — K-side rotation `R_K = R · H · P_br` where `R = eigvec(Σ_Q)` is fit on Q's attention-aware covariance, `H` is a head-dim Hadamard, and `P_br` is the eigenvalue-sorted bit-reversal permutation
50
+ - `v_rotation_sst_r_h_pbr.pt` — V-side rotation built on the score-weighted V covariance `Σ_V = V^T diag(K^T (Q^T Q) K) V`
 
 
 
 
51
 
52
  File layout (PyTorch state-dict):
53
  ```python
54
  {
55
  "format_version": 1,
56
+ "objective": "qqt_r_h_pbr", # or "sst_r_h_pbr" for V
57
  "source_grouping": "layer",
58
  "layers": {
59
  0: {"layer_id": 0, "rotation": tensor(head_dim, head_dim)},
 
83
 
84
  ### 2. Serve with sglang-research using the rotation
85
 
86
+ Clone https://github.com/FutureMLS-Lab/OSCAR and set up the single `oscar` conda env, then point the eval driver at your downloaded rotation:
 
87
 
88
  ```bash
89
  ROT_DIR=./oscar_rotations/Qwen3-8B/seq20000_prompt83_group128 \
 
115
  --trust-remote-code
116
  ```
117
 
 
 
 
 
118
  ## Reproducing from scratch
119
 
120
+ If you want to fit your own rotation on a different calibration set, the OSCAR pipeline is end-to-end reproducible:
 
121
 
122
  ```bash
123
  git clone https://github.com/FutureMLS-Lab/OSCAR.git
 
135
  year = {2026},
136
  note = {Together AI; University of Sydney; UIUC},
137
  }
138
+ ```
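
Reviewer note: the README kept by this commit names each rotation directory `seq<T>_prompt<N>_group<G>`, where `T` is the total calibration token count, `N` the calibration prompt count, and `G` the INT2 quant group size along head_dim. A minimal sketch of how a script could decode those names is shown below. The helper `parse_rotation_dir` and the `RotationConfig` dataclass are hypothetical illustrations and are not part of the OSCAR repository.

```python
import re
from dataclasses import dataclass

# Matches directory names such as "seq16000_prompt69_group128".
_NAME_RE = re.compile(r"seq(\d+)_prompt(\d+)_group(\d+)")


@dataclass(frozen=True)
class RotationConfig:
    """Hypothetical container for the three fields encoded in the name."""
    total_tokens: int   # T: total calibration tokens
    prompt_count: int   # N: calibration prompt count
    group_size: int     # G: INT2 quant group size along head_dim


def parse_rotation_dir(name: str) -> RotationConfig:
    """Decode a seq<T>_prompt<N>_group<G> directory name (illustrative helper)."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a seq<T>_prompt<N>_group<G> name: {name!r}")
    total_tokens, prompt_count, group_size = (int(x) for x in match.groups())
    return RotationConfig(total_tokens, prompt_count, group_size)


if __name__ == "__main__":
    # Directory name taken from the README table for Qwen/Qwen3-32B.
    print(parse_rotation_dir("seq16000_prompt69_group128"))
```

Using `fullmatch` rejects names with extra prefixes or suffixes, so a mistyped directory fails loudly instead of being silently misread.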