# DKM: Differentiable K-Means Clustering Layer for Neural Network Compression

**PyTorch implementation of the ICLR 2022 paper by Cho et al.**

📄 [Paper (arXiv:2108.12659)](https://arxiv.org/abs/2108.12659) | 🏛️ [ICLR 2022](https://openreview.net/forum?id=J_F_qqCE3Z5)

## Overview

DKM casts **k-means weight clustering** as a differentiable **attention problem**, enabling joint optimization of DNN parameters and clustering centroids through standard backpropagation. Unlike prior weight-clustering methods that rely on hard assignments and approximated gradients, DKM uses soft attention-based assignment that is fully differentiable.
### Key Innovation

```
Traditional: weights → hard k-means assignment → fixed centroids (not differentiable)
DKM:         weights → attention-based soft assignment → differentiable centroids
```

The DKM layer:
1. Computes a **distance matrix** D between weights W and centroids C
2. Applies **softmax with temperature τ** over the negated distances to get the attention matrix A = softmax(-D/τ), so nearer centroids receive higher attention
3. Updates centroids: c_j = Σ_i(a_ij × w_i) / Σ_i(a_ij)
4. Iterates until convergence
5. Returns compressed weights: W̃ = A × C
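Concretely, here is a minimal sketch of that loop (illustrative only; the function name and the squared-Euclidean distance choice are assumptions, not the repo's exact `DKMLayer` internals):

```python
import torch

def dkm_forward(w, c, tau=2e-5, iters=5, eps=1e-4):
    """One DKM forward pass over flattened weights.

    w: (N, d) weight vectors, c: (k, d) centroids. The repo-default tau
    assumes distances on a commensurately small scale.
    """
    for _ in range(iters):
        dist = torch.cdist(w, c) ** 2                 # (N, k) squared distances
        attn = torch.softmax(-dist / tau, dim=1)      # soft assignment A
        # Attention-weighted centroid update: c_j = sum_i a_ij w_i / sum_i a_ij
        c_new = (attn.t() @ w) / attn.sum(dim=0, keepdim=True).t()
        converged = torch.norm(c_new - c) < eps
        c = c_new
        if converged:
            break
    attn = torch.softmax(-torch.cdist(w, c) ** 2 / tau, dim=1)
    return attn @ c                                   # compressed weights W~ = A C
```

Because every step is an ordinary differentiable tensor op, `loss.backward()` reaches both the weights and the centroids.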

### Paper Results

| Model | Config | Top-1 Acc (%) | Size (MB) | Compression |
|-------|--------|---------------|-----------|-------------|
| ResNet50 | cv:6/6, fc:6/4 | 74.5 | 3.32 | 29.4× |
| MobileNet-v1 | cv:4/4, fc:4/2 | 63.9 | 0.72 | 22.4× |
| MobileNet-v2 | cv:2/1, fc:4/4 | 68.0 | 0.84 | 15.8× |
| DistilBERT | - | -1.1% acc drop | - | 11.8× |

## Installation

```bash
git clone https://huggingface.co/syedmohaiminulhoque/dkm-compression
cd dkm-compression
pip install torch torchvision
```

## Quick Start

```python
import torch
import torch.nn as nn
import torchvision
from dkm import compress_model
from dkm.utils import print_compression_summary

# Load any pre-trained model
model = torchvision.models.resnet18(weights="DEFAULT")

# Compress with DKM (2-bit clustering)
compressor = compress_model(
    model,
    bits=2,                # k = 2^bits = 4 clusters
    dim=1,                 # scalar clustering (dim=1) or multi-dim
    tau=2e-5,              # temperature (controls softness of assignment)
    skip_first_last=True,  # skip first/last layers (per paper protocol)
)

# Print compression statistics
info = compressor.get_compression_info()
print_compression_summary(info)

# Train with a standard PyTorch loop (paper: SGD, lr=0.008, momentum=0.9)
optimizer = torch.optim.SGD(compressor.parameters(), lr=0.008, momentum=0.9)
criterion = nn.CrossEntropyLoss()

compressor.train()
for images, labels in dataloader:  # assumes `dataloader` yields (images, labels)
    optimizer.zero_grad()
    outputs = compressor(images)
    loss = criterion(outputs, labels)
    loss.backward()  # gradients flow through DKM attention layers
    optimizer.step()

# Snap to nearest centroids for inference
compressor.snap_weights()

# Export compressed model (codebook + assignments)
export = compressor.export_compressed()
torch.save(export, "compressed_model.pt")
```

## Multi-Dimensional Clustering (Section 3.3)

DKM supports multi-dimensional weight clustering for higher compression:

```python
# Paper notation: "bits/dim", e.g. "4/4" means 4 bits, 4 dimensions
# Effective bits-per-weight = bits / dim

# Configuration cv:6/8, fc:6/4 (as in Table 3 of the paper)
compressor = compress_model(
    model,
    bits=6,
    conv_config={"bits": 6, "dim": 8},  # 6 bits, 8 dims → 0.75 bpw
    fc_config={"bits": 6, "dim": 4},    # 6 bits, 4 dims → 1.5 bpw
    tau=2e-5,
)
```

| Config | Clusters | Dim | Effective BPW |
|--------|----------|-----|---------------|
| 3-bit | 8 | 1 | 3.0 |
| 2-bit | 4 | 1 | 2.0 |
| 1-bit | 2 | 1 | 1.0 |
| 4/4 | 16 | 4 | 1.0 |
| 8/8 | 256 | 8 | 1.0 |
| 4/8 | 16 | 8 | 0.5 |
| 8/16 | 256 | 16 | 0.5 |
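
To make the bits/dim accounting concrete, here is a hedged sketch of how a weight tensor would be grouped into d-dimensional vectors before clustering (`group_weights` is a hypothetical helper, not the repo's code):

```python
import torch

def group_weights(w, d):
    """Reshape a weight tensor into (N/d, d) vectors for d-dim clustering.

    Pads with zeros if the weight count is not divisible by d.
    """
    flat = w.reshape(-1)
    pad = (-flat.numel()) % d
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    return flat.view(-1, d)

# 4/4 config: k = 2**4 = 16 centroids over 4-dim vectors.
# Each group of 4 weights stores one 4-bit index -> 4/4 = 1.0 bit per weight.
vectors = group_weights(torch.randn(64, 128), d=4)   # (2048, 4)
```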

## Temperature τ Guidelines (Appendix B)

The temperature controls the softness of cluster assignment:
- **Smaller τ** → harder assignment (near one-hot), closer to standard k-means
- **Larger τ** → softer assignment, more gradient flow, better for hard compression tasks

| Model | 3-bit | 2-bit | 1-bit | 4/4 | 8/8 |
|-------|-------|-------|-------|-----|-----|
| ResNet18 | 8e-6 | 2e-5 | 5e-5 | 5e-5 | 8e-5 |
| ResNet50 | 8e-6 | 2e-5 | 5e-5 | 4e-5 | OOM |
| MobileNet-v1 | 5e-5 | 1e-4 | 3e-4 | 1e-4 | 1e-4 |
| MobileNet-v2 | 5e-5 | 1e-4 | 1.5e-4 | 1e-4 | 1e-4 |
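
A quick numeric illustration of the softness effect (toy distances and temperatures chosen for readability; real distances and τ values are far smaller, as in the table above):

```python
import torch

# One weight's squared distances to three centroids
d = torch.tensor([0.1, 0.2, 0.4])
for tau in (0.01, 1.0):
    print(tau, torch.softmax(-d / tau, dim=0))
# tau=0.01 -> ~[1.00, 0.00, 0.00]   near one-hot, like hard k-means
# tau=1.00 -> ~[0.38, 0.34, 0.28]   soft, gradients reach every centroid
```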

## Architecture

```
dkm/
├── __init__.py      # Package exports
├── dkm_layer.py     # Core DKM layer (Section 3.2-3.3)
├── compressor.py    # Model wrapper with DKM layers (Section 4)
└── utils.py         # Compression analysis utilities

tests/
└── test_dkm.py      # 16 comprehensive test groups (all passing)

train.py             # Full training pipeline (CIFAR-10 demo)
```

### Core Components

- **`DKMLayer`**: The differentiable k-means clustering layer. Implements the iterative attention-based clustering from Fig. 2 of the paper, with k-means++ initialization, warm start across batches, and convergence checking.

- **`DKMCompressor`**: Wraps any PyTorch model by inserting DKM layers via forward pre-hooks (see the sketch below). Handles per-layer configuration (different bits/dim for conv vs. fc), the paper's protocol for small layers (<10K params → 8-bit), and first/last layer skipping.

- **`compress_model`**: High-level API matching the paper's notation (cv:bits/dim, fc:bits/dim).
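
To illustrate the pre-hook mechanism, here is a minimal self-contained sketch in the style of `torch.nn.utils.weight_norm`; the function and attribute names are hypothetical, and the real `DKMCompressor` additionally handles k-means++ initialization, warm starts, and per-layer configs:

```python
import torch
import torch.nn as nn

def attach_dkm_hook(module: nn.Module, k: int = 4, tau: float = 0.05):
    """Sketch: substitute the DKM reconstruction of `weight` before each forward.

    `tau` here is a toy value; see the temperature table for realistic ones.
    """
    w = module.weight.detach()
    del module._parameters["weight"]        # reparameterize: raw weights stay
    module.weight_raw = nn.Parameter(w)     # trainable under a new name
    init = w.reshape(-1)[torch.randperm(w.numel())[:k]].unsqueeze(1)
    module.centroids = nn.Parameter(init)   # (k, 1) scalar centroids (dim=1)

    def pre_hook(mod, inputs):
        flat = mod.weight_raw.reshape(-1, 1)                          # (N, 1)
        attn = torch.softmax(-torch.cdist(flat, mod.centroids) ** 2 / tau, dim=1)
        mod.weight = (attn @ mod.centroids).view_as(mod.weight_raw)   # W~ = A C

    module.register_forward_pre_hook(pre_hook)

layer = nn.Linear(16, 8)
attach_dkm_hook(layer, k=4)
out = layer(torch.randn(2, 16))
out.sum().backward()    # gradients reach both weight_raw and centroids
```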

## Training Protocol (Section 4)

Following the paper exactly:
- **Optimizer**: SGD with momentum 0.9
- **Learning rate**: 0.008 (fixed, no per-layer tuning)
- **Loss**: original task loss (no regularizers or modifications)
- **Epochs**: 200 for ImageNet, varies for GLUE
- **Batch size**: 128 per GPU (paper used 8× V100)
- **Convergence**: ε = 1e-4, max 5 DKM iterations per layer
- **Small layers**: layers with <10,000 parameters get 8-bit clustering

## Compressed Model Format

After training, `export_compressed()` returns:
- **state_dict**: standard PyTorch state dict (with snapped weights)
- **codebooks**: per-layer centroid tensors (k × d float32)
- **assignments**: per-layer cluster index tensors (N/d integers, b bits each)
- **layer_configs**: per-layer DKM configuration

The actual compressed size is Σ(codebook_bits + assignment_bits) over all layers, plus any uncompressed parameters.
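
As a back-of-the-envelope example of that formula (a sketch; `layer_compressed_bits` is a hypothetical helper, not part of the export API):

```python
def layer_compressed_bits(n_weights: int, bits: int, dim: int) -> int:
    """Estimated storage for one DKM-compressed layer, in bits.

    Codebook: 2**bits centroids x dim x 32-bit floats.
    Assignments: one `bits`-wide index per group of `dim` weights.
    """
    k = 2 ** bits
    codebook = k * dim * 32
    assignments = (n_weights // dim) * bits
    return codebook + assignments

# A 2.36M-weight conv layer at cv:4/4:
total = layer_compressed_bits(2_359_296, bits=4, dim=4)
print(total / 8 / 1024**2, "MB")   # ~0.28 MB vs ~9 MB uncompressed fp32
```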

## Tests

All 16 test groups pass, covering:
1. Shape preservation (train & eval)
2. Distance matrix correctness
3. Attention matrix properties (row-sum = 1, temperature effect)
4. Centroid convergence to cluster means
5. Gradient flow (differentiability, the paper's key contribution)
6. Multi-dimensional clustering
7. Iterative convergence
8. Full compressor pipeline
9. Weight snapping for inference
10. Model export
11. Multi-step training stability
12. Paper configurations (Table 1)
13. K-means++ initialization
14. Warm start across batches
15. Numerical stability (large/small/uniform weights)
16. ResNet-like model compression

```bash
python tests/test_dkm.py
```
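
For instance, the gradient-flow property (item 5) can be checked in isolation with a few lines (a standalone sketch, not the repo's actual test code):

```python
import torch

# Soft assignment must let gradients reach both weights and centroids.
w = torch.randn(32, 1, requires_grad=True)
c = torch.randn(4, 1, requires_grad=True)
attn = torch.softmax(-torch.cdist(w, c) ** 2 / 0.1, dim=1)
w_hat = attn @ c                     # reconstructed weights, W~ = A C
w_hat.sum().backward()
assert w.grad is not None and c.grad is not None
assert torch.isfinite(w.grad).all() and torch.isfinite(c.grad).all()
```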

## Citation

```bibtex
@inproceedings{cho2022dkm,
  title={DKM: Differentiable k-Means Clustering Layer for Neural Network Compression},
  author={Cho, Minsik and Alizadeh-Vahid, Keivan and Adya, Saurabh and Rastegari, Mohammad},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2022},
  url={https://openreview.net/forum?id=J_F_qqCE3Z5}
}
```

## License

This is a research implementation. The original paper is by Apple Research (Cho et al., ICLR 2022).