Add comprehensive README
README.md
ADDED

# DKM: Differentiable K-Means Clustering Layer for Neural Network Compression

**PyTorch implementation of the ICLR 2022 paper by Cho et al.**

[Paper (arXiv:2108.12659)](https://arxiv.org/abs/2108.12659) | [ICLR 2022](https://openreview.net/forum?id=J_F_qqCE3Z5)

## Overview

DKM casts **k-means weight clustering** as a differentiable **attention problem**, enabling joint optimization of DNN parameters and clustering centroids through standard backpropagation. Unlike prior weight-clustering methods that rely on hard assignments and approximated gradients, DKM uses soft attention-based assignment that is fully differentiable.

### Key Innovation

```
Traditional: weights → hard k-means assignment → fixed centroids (not differentiable)
DKM:         weights → attention-based soft assignment → differentiable centroids
```

The DKM layer:
1. Computes a **distance matrix** D between weights W and centroids C
2. Applies **softmax with temperature τ** to obtain the attention matrix A = softmax(−D/τ), so nearer centroids receive higher attention
3. Updates centroids: c_j = Σ_i(a_ij · w_i) / Σ_i(a_ij)
4. Iterates steps 1-3 until convergence
5. Returns compressed weights: W̃ = A × C
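
A minimal sketch of this loop (scalar and d-dimensional clustering both fit the (N, d) layout below; the function name and defaults are illustrative, not this repo's API):

```python
import torch

def dkm_iterate(w, c, tau=2e-5, eps=1e-4, max_iters=5):
    """Soft k-means: w is (N, d) weight vectors, c is (k, d) centroids."""
    for _ in range(max_iters):
        dist = torch.cdist(w, c).pow(2)        # (N, k) squared distances
        a = torch.softmax(-dist / tau, dim=1)  # soft assignment; rows sum to 1
        c_new = (a.T @ w) / (a.sum(0).unsqueeze(1) + 1e-12)  # attention-weighted means
        converged = torch.norm(c_new - c) < eps
        c = c_new
        if converged:
            break
    return a @ c, c                            # compressed weights W̃ = A × C
```

Because every step is differentiable, calling `loss.backward()` on a model that uses W̃ reaches both the original weights and the centroids.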

### Paper Results

| Model | Config | Top-1 Acc (%) | Size (MB) | Compression |
|-------|--------|---------------|-----------|-------------|
| ResNet50 | cv:6/6, fc:6/4 | 74.5 | 3.32 | 29.4× |
| MobileNet-v1 | cv:4/4, fc:4/2 | 63.9 | 0.72 | 22.4× |
| MobileNet-v2 | cv:2/1, fc:4/4 | 68.0 | 0.84 | 15.8× |
| DistilBERT | - | -1.1% acc drop | - | 11.8× |

## Installation

```bash
git clone https://huggingface.co/syedmohaiminulhoque/dkm-compression
cd dkm-compression
pip install torch torchvision
```

## Quick Start

```python
import torch
import torch.nn as nn
import torchvision  # needed for the pre-trained model below

from dkm import compress_model
from dkm.utils import print_compression_summary

# Load any pre-trained model
model = torchvision.models.resnet18(weights="DEFAULT")

# Compress with DKM (2-bit clustering)
compressor = compress_model(
    model,
    bits=2,                # k = 2^bits = 4 clusters
    dim=1,                 # scalar clustering (dim=1) or multi-dim
    tau=2e-5,              # temperature (controls softness of assignment)
    skip_first_last=True,  # skip first/last layers (per paper protocol)
)

# Print compression statistics
info = compressor.get_compression_info()
print_compression_summary(info)

# Train with a standard PyTorch loop (paper: SGD, lr=0.008, momentum=0.9)
optimizer = torch.optim.SGD(compressor.parameters(), lr=0.008, momentum=0.9)
criterion = nn.CrossEntropyLoss()

compressor.train()
for images, labels in dataloader:
    optimizer.zero_grad()
    outputs = compressor(images)
    loss = criterion(outputs, labels)
    loss.backward()  # gradients flow through the DKM attention layers
    optimizer.step()

# Snap to nearest centroids for inference
compressor.snap_weights()

# Export the compressed model (codebook + assignments)
export = compressor.export_compressed()
torch.save(export, "compressed_model.pt")
```
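
After `snap_weights()`, each compressed layer should hold at most k = 2^bits distinct values (for `dim=1`). A quick sanity check, continuing from the snippet above and assuming the compressor is a standard `nn.Module` wrapper so `named_modules()` is available:

```python
# With bits=2 and dim=1, every compressed conv layer should show at most 4
# unique weight values after snapping
for name, module in compressor.named_modules():
    if isinstance(module, nn.Conv2d):
        n_unique = module.weight.detach().unique().numel()
        print(f"{name}: {n_unique} unique weight values")
```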

## Multi-Dimensional Clustering (Section 3.3)

DKM supports multi-dimensional weight clustering for higher compression:

```python
# Paper notation: "bits/dim", e.g. "4/4" means 4 bits over 4-dimensional vectors
# Effective bits-per-weight (bpw) = bits / dim

# Configuration cv:6/8, fc:6/4 (as in Table 3 of the paper)
compressor = compress_model(
    model,
    bits=6,
    conv_config={"bits": 6, "dim": 8},  # 6 bits, 8 dims → 0.75 bpw
    fc_config={"bits": 6, "dim": 4},    # 6 bits, 4 dims → 1.5 bpw
    tau=2e-5,
)
```

| Config | Clusters | Dim | Effective BPW |
|--------|----------|-----|---------------|
| 3-bit | 8 | 1 | 3.0 |
| 2-bit | 4 | 1 | 2.0 |
| 1-bit | 2 | 1 | 1.0 |
| 4/4 | 16 | 4 | 1.0 |
| 8/8 | 256 | 8 | 1.0 |
| 4/8 | 16 | 8 | 0.5 |
| 8/16 | 256 | 16 | 0.5 |
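
The multi-dimensional variant clusters groups of `dim` consecutive weights instead of scalars. A minimal sketch of that grouping step (the helper is illustrative, not this repo's API):

```python
import torch

def to_vectors(weight: torch.Tensor, dim: int) -> torch.Tensor:
    """Flatten a weight tensor and group entries into (N/dim, dim) vectors.

    Each row is assigned to one of k = 2**bits centroids, so indexing costs
    `bits` bits per row, i.e. bits/dim bits per weight.
    """
    flat = weight.reshape(-1)
    pad = (-flat.numel()) % dim  # zero-pad so the length divides evenly
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    return flat.view(-1, dim)

vecs = to_vectors(torch.randn(64, 3, 3, 3), dim=8)  # conv kernel → (216, 8) rows
```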

## Temperature τ Guidelines (Appendix B)

The temperature controls the softness of cluster assignment:
- **Smaller τ** → harder assignment (near one-hot), closer to standard k-means
- **Larger τ** → softer assignment and more gradient flow, better for aggressive compression targets

| Model | 3-bit | 2-bit | 1-bit | 4/4 | 8/8 |
|-------|-------|-------|-------|-----|-----|
| ResNet18 | 8e-6 | 2e-5 | 5e-5 | 5e-5 | 8e-5 |
| ResNet50 | 8e-6 | 2e-5 | 5e-5 | 4e-5 | OOM |
| MobileNet-v1 | 5e-5 | 1e-4 | 3e-4 | 1e-4 | 1e-4 |
| MobileNet-v2 | 5e-5 | 1e-4 | 1.5e-4 | 1e-4 | 1e-4 |
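
To see the effect concretely, a toy illustration (arbitrary distances, not values from the paper):

```python
import torch

# One weight at squared distances 0, 1, and 4 from three centroids
d = torch.tensor([[0.0, 1.0, 4.0]])

for tau in (0.1, 1.0, 10.0):
    a = torch.softmax(-d / tau, dim=1)
    print(f"tau={tau:>4}: {[round(x, 3) for x in a.squeeze().tolist()]}")
# tau=0.1 → near one-hot (hard k-means); tau=10 → near uniform (soft, more gradient)
```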

## Architecture

```
dkm/
├── __init__.py    # Package exports
├── dkm_layer.py   # Core DKM layer (Sections 3.2-3.3)
├── compressor.py  # Model wrapper with DKM layers (Section 4)
└── utils.py       # Compression analysis utilities

tests/
└── test_dkm.py    # 16 comprehensive test groups (all passing)

train.py           # Full training pipeline (CIFAR-10 demo)
```

### Core Components

- **`DKMLayer`**: The differentiable k-means clustering layer. Implements the iterative attention-based clustering from Fig. 2 of the paper, with k-means++ initialization, warm start across batches, and convergence checking.

- **`DKMCompressor`**: Wraps any PyTorch model by inserting DKM layers via forward pre-hooks (see the sketch after this list). Handles per-layer configuration (different bits/dim for conv vs. fc), the paper's protocol for small layers (<10K params → 8-bit), and first/last layer skipping.

- **`compress_model`**: High-level API matching the paper's notation (cv:bits/dim, fc:bits/dim).
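
As a rough sketch of the pre-hook mechanism (an illustration of the general technique, not this repo's actual code): the raw weight stays trainable, and the compressed weight is recomputed as a plain tensor on every forward pass, so autograd reaches both the weights and the centroids.

```python
import torch.nn as nn

def attach_dkm(module: nn.Module, dkm_layer: nn.Module) -> None:
    """Recompute the DKM-compressed weight before each forward pass."""
    w = module.weight
    del module._parameters["weight"]  # "weight" becomes a plain attribute
    module.register_parameter("weight_orig", nn.Parameter(w.detach()))

    def prehook(mod, inputs):
        # W̃ = A @ C, assigned as a regular tensor so the loss differentiates
        # through the DKM layer back to weight_orig and the centroids
        mod.weight = dkm_layer(mod.weight_orig)

    module.register_forward_pre_hook(prehook)
```

The newer `torch.nn.utils.parametrize` API expresses the same idea more declaratively.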

## Training Protocol (Section 4)

Following the paper exactly:
- **Optimizer**: SGD with momentum 0.9
- **Learning rate**: 0.008 (fixed, no per-layer tuning)
- **Loss**: Original task loss (no regularizers or modifications)
- **Epochs**: 200 for ImageNet, varies for GLUE
- **Batch size**: 128 per GPU (paper used 8× V100)
- **Convergence**: ε = 1e-4, max 5 DKM iterations per layer
- **Small layers**: Layers with <10,000 parameters get 8-bit clustering

## Compressed Model Format

After training, `export_compressed()` returns:
- **state_dict**: Standard PyTorch state dict (with snapped weights)
- **codebooks**: Per-layer centroid tensors (k × d float32)
- **assignments**: Per-layer cluster index tensors (N/d integers, b bits each)
- **layer_configs**: Per-layer DKM configuration

The actual compressed size is Σ(codebook_bits + assignment_bits) over all layers, plus any parameters left uncompressed.
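
For intuition, a back-of-envelope size calculation for one layer (illustrative numbers, not taken from this repo):

```python
bits, dim = 4, 4               # "4/4": k = 16 centroids over 4-dim vectors
n_weights = 512 * 512 * 3 * 3  # one large conv layer, ~2.36M weights

k = 2 ** bits
codebook_bits = k * dim * 32                 # k × d float32 centroids
assignment_bits = (n_weights // dim) * bits  # one 4-bit index per vector
compressed_mb = (codebook_bits + assignment_bits) / 8 / 1e6
print(f"{compressed_mb:.2f} MB vs {n_weights * 4 / 1e6:.2f} MB in float32 (~32×)")
```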

## Tests

All 16 test groups pass, covering:
1. Shape preservation (train & eval)
2. Distance matrix correctness
3. Attention matrix properties (rows sum to 1, temperature effect)
4. Centroid convergence to cluster means
5. Gradient flow (differentiability, the paper's key contribution)
6. Multi-dimensional clustering
7. Iterative convergence
8. Full compressor pipeline
9. Weight snapping for inference
10. Model export
11. Multi-step training stability
12. Paper configurations (Table 1)
13. K-means++ initialization
14. Warm start across batches
15. Numerical stability (large/small/uniform weights)
16. ResNet-like model compression

```bash
python tests/test_dkm.py
```

## Citation

```bibtex
@inproceedings{cho2022dkm,
  title={DKM: Differentiable k-Means Clustering Layer for Neural Network Compression},
  author={Cho, Minsik and Alizadeh-Vahid, Keivan and Adya, Saurabh and Rastegari, Mohammad},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2022},
  url={https://openreview.net/forum?id=J_F_qqCE3Z5}
}
```

## License

This is a research implementation. The original paper is by Apple Research (Cho et al., ICLR 2022).