Add comprehensive README
README.md
ADDED

# DKM: Differentiable K-Means Clustering Layer for Neural Network Compression

**PyTorch implementation of the ICLR 2022 paper by Cho et al.**

[Paper (arXiv:2108.12659)](https://arxiv.org/abs/2108.12659) | [ICLR 2022](https://openreview.net/forum?id=J_F_qqCE3Z5)

## Overview

DKM casts **k-means weight clustering** as a differentiable **attention problem**, enabling joint optimization of DNN parameters and clustering centroids through standard backpropagation. Unlike prior weight-clustering methods that rely on hard assignments and approximated gradients, DKM uses soft attention-based assignment that is fully differentiable.

### Key Innovation

```
Traditional: weights → hard k-means assignment → fixed centroids (not differentiable)
DKM:         weights → attention-based soft assignment → differentiable centroids
```

The DKM layer:
1. Computes a **distance matrix** D between weights W and centroids C
2. Applies **softmax with temperature τ** to obtain the attention matrix A = softmax(−D/τ), so nearer centroids receive higher attention
3. Updates centroids: c_j = Σ_i(a_ij · w_i) / Σ_i(a_ij)
4. Iterates steps 1-3 until convergence
5. Returns compressed weights: W̃ = A × C
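
A minimal sketch of this loop (scalar and d-dimensional clustering both fit the (N, d) layout below; the function name and defaults are illustrative, not this repo's API):

```python
import torch

def dkm_iterate(w, c, tau=2e-5, eps=1e-4, max_iters=5):
    """Soft k-means: w is (N, d) weight vectors, c is (k, d) centroids."""
    for _ in range(max_iters):
        dist = torch.cdist(w, c).pow(2)        # (N, k) squared distances
        a = torch.softmax(-dist / tau, dim=1)  # soft assignment; rows sum to 1
        c_new = (a.T @ w) / (a.sum(0).unsqueeze(1) + 1e-12)  # attention-weighted means
        converged = torch.norm(c_new - c) < eps
        c = c_new
        if converged:
            break
    return a @ c, c                            # compressed weights W̃ = A × C
```

Because every step is differentiable, calling `loss.backward()` on a model that uses W̃ reaches both the original weights and the centroids.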

### Paper Results

| Model | Config | Top-1 Acc (%) | Size (MB) | Compression |
|-------|--------|---------------|-----------|-------------|
| ResNet50 | cv:6/6, fc:6/4 | 74.5 | 3.32 | 29.4× |
| MobileNet-v1 | cv:4/4, fc:4/2 | 63.9 | 0.72 | 22.4× |
| MobileNet-v2 | cv:2/1, fc:4/4 | 68.0 | 0.84 | 15.8× |
| DistilBERT | - | -1.1% acc drop | - | 11.8× |

## Installation

```bash
git clone https://huggingface.co/syedmohaiminulhoque/dkm-compression
cd dkm-compression
pip install torch torchvision
```

## Quick Start

```python
import torch
import torch.nn as nn
import torchvision  # needed for the pre-trained model below

from dkm import compress_model
from dkm.utils import print_compression_summary

# Load any pre-trained model
model = torchvision.models.resnet18(weights="DEFAULT")

# Compress with DKM (2-bit clustering)
compressor = compress_model(
    model,
    bits=2,                # k = 2^bits = 4 clusters
    dim=1,                 # scalar clustering (dim=1) or multi-dim
    tau=2e-5,              # temperature (controls softness of assignment)
    skip_first_last=True,  # skip first/last layers (per paper protocol)
)

# Print compression statistics
info = compressor.get_compression_info()
print_compression_summary(info)

# Train with a standard PyTorch loop (paper: SGD, lr=0.008, momentum=0.9)
optimizer = torch.optim.SGD(compressor.parameters(), lr=0.008, momentum=0.9)
criterion = nn.CrossEntropyLoss()

compressor.train()
for images, labels in dataloader:
    optimizer.zero_grad()
    outputs = compressor(images)
    loss = criterion(outputs, labels)
    loss.backward()  # gradients flow through the DKM attention layers
    optimizer.step()

# Snap to nearest centroids for inference
compressor.snap_weights()

# Export the compressed model (codebook + assignments)
export = compressor.export_compressed()
torch.save(export, "compressed_model.pt")
```
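
After `snap_weights()`, each compressed layer should hold at most k = 2^bits distinct values (for `dim=1`). A quick sanity check, continuing from the snippet above and assuming the compressor is a standard `nn.Module` wrapper so `named_modules()` is available:

```python
# With bits=2 and dim=1, every compressed conv layer should show at most 4
# unique weight values after snapping
for name, module in compressor.named_modules():
    if isinstance(module, nn.Conv2d):
        n_unique = module.weight.detach().unique().numel()
        print(f"{name}: {n_unique} unique weight values")
```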

## Multi-Dimensional Clustering (Section 3.3)

DKM supports multi-dimensional weight clustering for higher compression:

```python
# Paper notation: "bits/dim", e.g. "4/4" means 4 bits over 4-dimensional vectors
# Effective bits-per-weight (bpw) = bits / dim

# Configuration cv:6/8, fc:6/4 (as in Table 3 of the paper)
compressor = compress_model(
    model,
    bits=6,
    conv_config={"bits": 6, "dim": 8},  # 6 bits, 8 dims → 0.75 bpw
    fc_config={"bits": 6, "dim": 4},    # 6 bits, 4 dims → 1.5 bpw
    tau=2e-5,
)
```

| Config | Clusters | Dim | Effective BPW |
|--------|----------|-----|---------------|
| 3-bit | 8 | 1 | 3.0 |
| 2-bit | 4 | 1 | 2.0 |
| 1-bit | 2 | 1 | 1.0 |
| 4/4 | 16 | 4 | 1.0 |
| 8/8 | 256 | 8 | 1.0 |
| 4/8 | 16 | 8 | 0.5 |
| 8/16 | 256 | 16 | 0.5 |
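
The multi-dimensional variant clusters groups of `dim` consecutive weights instead of scalars. A minimal sketch of that grouping step (the helper is illustrative, not this repo's API):

```python
import torch

def to_vectors(weight: torch.Tensor, dim: int) -> torch.Tensor:
    """Flatten a weight tensor and group entries into (N/dim, dim) vectors.

    Each row is assigned to one of k = 2**bits centroids, so indexing costs
    `bits` bits per row, i.e. bits/dim bits per weight.
    """
    flat = weight.reshape(-1)
    pad = (-flat.numel()) % dim  # zero-pad so the length divides evenly
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    return flat.view(-1, dim)

vecs = to_vectors(torch.randn(64, 3, 3, 3), dim=8)  # conv kernel → (216, 8) rows
```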

## Temperature τ Guidelines (Appendix B)

The temperature controls the softness of cluster assignment:
- **Smaller τ** → harder assignment (near one-hot), closer to standard k-means
- **Larger τ** → softer assignment and more gradient flow, better for aggressive compression targets

| Model | 3-bit | 2-bit | 1-bit | 4/4 | 8/8 |
|-------|-------|-------|-------|-----|-----|
| ResNet18 | 8e-6 | 2e-5 | 5e-5 | 5e-5 | 8e-5 |
| ResNet50 | 8e-6 | 2e-5 | 5e-5 | 4e-5 | OOM |
| MobileNet-v1 | 5e-5 | 1e-4 | 3e-4 | 1e-4 | 1e-4 |
| MobileNet-v2 | 5e-5 | 1e-4 | 1.5e-4 | 1e-4 | 1e-4 |
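
To see the effect concretely, a toy illustration (arbitrary distances, not values from the paper):

```python
import torch

# One weight at squared distances 0, 1, and 4 from three centroids
d = torch.tensor([[0.0, 1.0, 4.0]])

for tau in (0.1, 1.0, 10.0):
    a = torch.softmax(-d / tau, dim=1)
    print(f"tau={tau:>4}: {[round(x, 3) for x in a.squeeze().tolist()]}")
# tau=0.1 → near one-hot (hard k-means); tau=10 → near uniform (soft, more gradient)
```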

## Architecture

```
dkm/
├── __init__.py    # Package exports
├── dkm_layer.py   # Core DKM layer (Sections 3.2-3.3)
├── compressor.py  # Model wrapper with DKM layers (Section 4)
└── utils.py       # Compression analysis utilities

tests/
└── test_dkm.py    # 16 comprehensive test groups (all passing)

train.py           # Full training pipeline (CIFAR-10 demo)
```

### Core Components

- **`DKMLayer`**: The differentiable k-means clustering layer. Implements the iterative attention-based clustering from Fig. 2 of the paper, with k-means++ initialization, warm start across batches, and convergence checking.

- **`DKMCompressor`**: Wraps any PyTorch model by inserting DKM layers via forward pre-hooks (see the sketch after this list). Handles per-layer configuration (different bits/dim for conv vs. fc), the paper's protocol for small layers (<10K params → 8-bit), and first/last layer skipping.

- **`compress_model`**: High-level API matching the paper's notation (cv:bits/dim, fc:bits/dim).
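
As a rough sketch of the pre-hook mechanism (an illustration of the general technique, not this repo's actual code): the raw weight stays trainable, and the compressed weight is recomputed as a plain tensor on every forward pass, so autograd reaches both the weights and the centroids.

```python
import torch.nn as nn

def attach_dkm(module: nn.Module, dkm_layer: nn.Module) -> None:
    """Recompute the DKM-compressed weight before each forward pass."""
    w = module.weight
    del module._parameters["weight"]  # "weight" becomes a plain attribute
    module.register_parameter("weight_orig", nn.Parameter(w.detach()))

    def prehook(mod, inputs):
        # W̃ = A @ C, assigned as a regular tensor so the loss differentiates
        # through the DKM layer back to weight_orig and the centroids
        mod.weight = dkm_layer(mod.weight_orig)

    module.register_forward_pre_hook(prehook)
```

The newer `torch.nn.utils.parametrize` API expresses the same idea more declaratively.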

## Training Protocol (Section 4)

Following the paper exactly:
- **Optimizer**: SGD with momentum 0.9
- **Learning rate**: 0.008 (fixed, no per-layer tuning)
- **Loss**: Original task loss (no regularizers or modifications)
- **Epochs**: 200 for ImageNet, varies for GLUE
- **Batch size**: 128 per GPU (paper used 8× V100)
- **Convergence**: ε = 1e-4, max 5 DKM iterations per layer
- **Small layers**: Layers with <10,000 parameters get 8-bit clustering

## Compressed Model Format

After training, `export_compressed()` returns:
- **state_dict**: Standard PyTorch state dict (with snapped weights)
- **codebooks**: Per-layer centroid tensors (k × d float32)
- **assignments**: Per-layer cluster index tensors (N/d integers, b bits each)
- **layer_configs**: Per-layer DKM configuration

The actual compressed size is Σ(codebook_bits + assignment_bits) over all layers, plus any parameters left uncompressed.
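
For intuition, a back-of-envelope size calculation for one layer (illustrative numbers, not taken from this repo):

```python
bits, dim = 4, 4               # "4/4": k = 16 centroids over 4-dim vectors
n_weights = 512 * 512 * 3 * 3  # one large conv layer, ~2.36M weights

k = 2 ** bits
codebook_bits = k * dim * 32                 # k × d float32 centroids
assignment_bits = (n_weights // dim) * bits  # one 4-bit index per vector
compressed_mb = (codebook_bits + assignment_bits) / 8 / 1e6
print(f"{compressed_mb:.2f} MB vs {n_weights * 4 / 1e6:.2f} MB in float32 (~32×)")
```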

## Tests

All 16 test groups pass, covering:
1. Shape preservation (train & eval)
2. Distance matrix correctness
3. Attention matrix properties (rows sum to 1, temperature effect)
4. Centroid convergence to cluster means
5. Gradient flow (differentiability, the paper's key contribution)
6. Multi-dimensional clustering
7. Iterative convergence
8. Full compressor pipeline
9. Weight snapping for inference
10. Model export
11. Multi-step training stability
12. Paper configurations (Table 1)
13. K-means++ initialization
14. Warm start across batches
15. Numerical stability (large/small/uniform weights)
16. ResNet-like model compression

```bash
python tests/test_dkm.py
```

## Citation

```bibtex
@inproceedings{cho2022dkm,
  title={DKM: Differentiable k-Means Clustering Layer for Neural Network Compression},
  author={Cho, Minsik and Alizadeh-Vahid, Keivan and Adya, Saurabh and Rastegari, Mohammad},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2022},
  url={https://openreview.net/forum?id=J_F_qqCE3Z5}
}
```

## License

This is a research implementation. The original paper is by Apple Research (Cho et al., ICLR 2022).