# DKM: Differentiable K-Means Clustering Layer for Neural Network Compression
|
|
**PyTorch implementation of the ICLR 2022 paper by Cho et al.**
|
|
📄 [Paper (arXiv:2108.12659)](https://arxiv.org/abs/2108.12659) | 🏛️ [ICLR 2022](https://openreview.net/forum?id=J_F_qqCE3Z5)
|
|
## Overview
|
|
DKM casts **k-means weight clustering** as a differentiable **attention problem**, enabling joint optimization of DNN parameters and clustering centroids through standard backpropagation. Unlike prior weight-clustering methods that rely on hard assignments and approximated gradients, DKM uses soft attention-based assignment that is fully differentiable.
|
|
### Key Innovation
|
|
```
Traditional: weights → hard k-means assignment → fixed centroids (not differentiable)
DKM:         weights → attention-based soft assignment → differentiable centroids
```
|
|
The DKM layer:
1. Computes a **distance matrix** D between weights W and centroids C
2. Applies **softmax with temperature τ** over the negated distances to get the attention matrix A = softmax(−D/τ), so the nearest centroid receives the largest attention weight
3. Updates centroids: c_j = Σ_i(a_ij · w_i) / Σ_i(a_ij)
4. Iterates until convergence
5. Returns compressed weights: W̃ = A × C
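
Concretely, one pass of this loop can be sketched in a few lines of PyTorch. This follows the steps above, not the package's `DKMLayer` internals; the defaults `tau=2e-5`, `max_iters=5`, and `eps=1e-4` echo values quoted elsewhere in this README:

```python
import torch

def dkm_pass(W, C, tau=2e-5, max_iters=5, eps=1e-4):
    """Sketch of one DKM pass. W: (N, d) weight vectors, C: (k, d) centroids."""
    for _ in range(max_iters):
        D = torch.cdist(W, C)               # 1. (N, k) distance matrix
        A = torch.softmax(-D / tau, dim=1)  # 2. soft assignment; rows sum to 1
        # 3. attention-weighted centroid update
        C_new = (A.t() @ W) / A.sum(dim=0).unsqueeze(1)
        converged = torch.norm(C_new - C) < eps
        C = C_new
        if converged:                       # 4. stop once centroids settle
            break
    A = torch.softmax(-torch.cdist(W, C) / tau, dim=1)
    return A @ C, C                         # 5. soft-compressed weights W̃ = A × C
```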
|
|
### Paper Results
|
|
| Model | Config | Top-1 Acc (%) | Size (MB) | Compression |
|-------|--------|---------------|-----------|-------------|
| ResNet50 | cv:6/6, fc:6/4 | 74.5 | 3.32 | 29.4× |
| MobileNet-v1 | cv:4/4, fc:4/2 | 63.9 | 0.72 | 22.4× |
| MobileNet-v2 | cv:2/1, fc:4/4 | 68.0 | 0.84 | 15.8× |
| DistilBERT | - | -1.1% acc drop | - | 11.8× |
|
|
## Installation
|
|
```bash
git clone https://huggingface.co/syedmohaiminulhoque/dkm-compression
cd dkm-compression
pip install torch torchvision
```
|
|
## Quick Start
|
|
```python
import torch
import torch.nn as nn
import torchvision
from dkm import compress_model
from dkm.utils import print_compression_summary

# Load any pre-trained model
model = torchvision.models.resnet18(weights="DEFAULT")

# Compress with DKM (2-bit clustering)
compressor = compress_model(
    model,
    bits=2,                # k = 2^bits = 4 clusters
    dim=1,                 # scalar clustering (dim=1) or multi-dim
    tau=2e-5,              # temperature (controls softness of assignment)
    skip_first_last=True,  # skip first/last layers (per paper protocol)
)

# Print compression statistics
info = compressor.get_compression_info()
print_compression_summary(info)

# Train with standard PyTorch loop (paper: SGD, lr=0.008, momentum=0.9)
optimizer = torch.optim.SGD(compressor.parameters(), lr=0.008, momentum=0.9)
criterion = nn.CrossEntropyLoss()

compressor.train()
for images, labels in dataloader:  # your training DataLoader
    optimizer.zero_grad()
    outputs = compressor(images)
    loss = criterion(outputs, labels)
    loss.backward()  # Gradients flow through DKM attention layers
    optimizer.step()

# Snap to nearest centroids for inference
compressor.snap_weights()

# Export compressed model (codebook + assignments)
export = compressor.export_compressed()
torch.save(export, "compressed_model.pt")
```
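
For intuition, `snap_weights()` amounts to a hard argmin assignment of each weight (vector) to its nearest centroid. A minimal illustrative sketch, not the package API:

```python
import torch

def snap_to_centroids(W: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    """Hard-assign each weight vector in W (N, d) to its nearest centroid
    in C (k, d). Illustrative stand-in for what snapping does."""
    idx = torch.cdist(W, C).argmin(dim=1)  # (N,) nearest-centroid indices
    return C[idx]                          # snapped weights, same shape as W
```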
|
|
## Multi-Dimensional Clustering (Section 3.3)
|
|
DKM supports multi-dimensional weight clustering for higher compression:
|
|
```python
# Paper notation: "bits/dim" e.g., "4/4" means 4 bits, 4 dimensions
# Effective bits-per-weight = bits / dim

# Configuration cv:6/8, fc:6/4 (as in Table 3 of the paper)
compressor = compress_model(
    model,
    bits=6,
    conv_config={"bits": 6, "dim": 8},  # 6 bits, 8 dims → 0.75 bpw
    fc_config={"bits": 6, "dim": 4},    # 6 bits, 4 dims → 1.5 bpw
    tau=2e-5,
)
```
|
|
| Config | Clusters | Dim | Effective BPW |
|--------|----------|-----|---------------|
| 3-bit | 8 | 1 | 3.0 |
| 2-bit | 4 | 1 | 2.0 |
| 1-bit | 2 | 1 | 1.0 |
| 4/4 | 16 | 4 | 1.0 |
| 8/8 | 256 | 8 | 1.0 |
| 4/8 | 16 | 8 | 0.5 |
| 8/16 | 256 | 16 | 0.5 |
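
The table is just the arithmetic `bits / dim`. To make the mechanics concrete, here is a sketch of how a weight tensor could be grouped into `dim`-dimensional vectors before clustering (the helper and its zero-padding choice are assumptions, not the package's code):

```python
import torch

def make_cluster_vectors(weight: torch.Tensor, dim: int) -> torch.Tensor:
    """Group a weight tensor into (N/dim, dim) vectors for clustering.
    Zero-padding when numel is not divisible by dim is an assumption."""
    flat = weight.reshape(-1)
    pad = (-flat.numel()) % dim
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    return flat.view(-1, dim)

vecs = make_cluster_vectors(torch.randn(64, 3, 3, 64), dim=4)  # 4/4 config
# 16 clusters (4 bits) over 4-d vectors -> one 4-bit index per 4 weights = 1.0 bpw
print(vecs.shape)  # torch.Size([9216, 4])
```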
|
|
## Temperature τ Guidelines (Appendix B)
|
|
The temperature controls the softness of cluster assignment:
- **Smaller τ** → harder assignment (near one-hot), closer to standard k-means
- **Larger τ** → softer assignment and more gradient flow, better for difficult compression settings (lower bits)
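
A quick numerical illustration of this effect (the distance values are made up):

```python
import torch

# Distances from one weight to 3 centroids
D = torch.tensor([[0.00, 0.01, 0.05]])

for tau in (2e-5, 1e-2):
    A = torch.softmax(-D / tau, dim=1)
    print(f"tau={tau:g}:", [round(a, 3) for a in A.squeeze().tolist()])
# tau=2e-05: [1.0, 0.0, 0.0]        -> near one-hot, like hard k-means
# tau=0.01:  [0.727, 0.268, 0.005]  -> soft, every centroid receives gradient
```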
|
|
| Model | 3-bit | 2-bit | 1-bit | 4/4 | 8/8 |
|-------|-------|-------|-------|-----|-----|
| ResNet18 | 8e-6 | 2e-5 | 5e-5 | 5e-5 | 8e-5 |
| ResNet50 | 8e-6 | 2e-5 | 5e-5 | 4e-5 | OOM |
| MobileNet-v1 | 5e-5 | 1e-4 | 3e-4 | 1e-4 | 1e-4 |
| MobileNet-v2 | 5e-5 | 1e-4 | 1.5e-4 | 1e-4 | 1e-4 |
|
|
## Architecture
|
|
```
dkm/
├── __init__.py     # Package exports
├── dkm_layer.py    # Core DKM layer (Section 3.2-3.3)
├── compressor.py   # Model wrapper with DKM layers (Section 4)
└── utils.py        # Compression analysis utilities

tests/
└── test_dkm.py     # 16 comprehensive test groups (all passing)

train.py            # Full training pipeline (CIFAR-10 demo)
```
|
|
### Core Components
|
|
- **`DKMLayer`**: The differentiable k-means clustering layer. Implements the iterative attention-based clustering from Fig. 2 of the paper, with k-means++ initialization (see the seeding sketch below), warm start across batches, and convergence checking.
|
|
- **`DKMCompressor`**: Wraps any PyTorch model by inserting DKM layers via forward pre-hooks. Handles per-layer configuration (different bits/dim for conv vs fc), the paper's protocol for small layers (<10K params → 8-bit), and first/last layer skipping.
|
|
- **`compress_model`**: High-level API matching the paper's notation (cv:bits/dim, fc:bits/dim).
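
For reference, k-means++ seeding (mentioned for `DKMLayer` above) picks a random first centroid, then samples each next one with probability proportional to the squared distance from the nearest centroid chosen so far. A minimal sketch, an assumption about the exact implementation details:

```python
import torch

def kmeans_pp_init(W: torch.Tensor, k: int) -> torch.Tensor:
    """k-means++ seeding over weight vectors W (N, d); returns (k, d)."""
    centroids = [W[torch.randint(W.shape[0], (1,))].squeeze(0)]
    for _ in range(k - 1):
        # squared distance of each point to its nearest chosen centroid
        d2 = torch.cdist(W, torch.stack(centroids)).min(dim=1).values ** 2
        probs = d2 / d2.sum()  # far points are sampled more often
        centroids.append(W[torch.multinomial(probs, 1)].squeeze(0))
    return torch.stack(centroids)
```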

## Training Protocol (Section 4)

Following the paper exactly:
- **Optimizer**: SGD with momentum 0.9
- **Learning rate**: 0.008 (fixed, no per-layer tuning)
- **Loss**: Original task loss (no regularizers or modifications)
- **Epochs**: 200 for ImageNet, varies for GLUE
- **Batch size**: 128 per GPU (paper used 8× V100)
- **Convergence**: ε = 1e-4, max 5 DKM iterations per layer
- **Small layers**: Layers with <10,000 parameters get 8-bit clustering

## Compressed Model Format

After training, `export_compressed()` returns:
- **state_dict**: Standard PyTorch state dict (with snapped weights)
- **codebooks**: Per-layer centroid tensors (k × d float32)
- **assignments**: Per-layer cluster index tensors (N/d integers, b bits each)
- **layer_configs**: Per-layer DKM configuration

The actual compressed size = Σ(codebook_bits + assignment_bits) per layer + uncompressed params.
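
As a worked instance of that formula (the helper and the layer size are illustrative):

```python
def layer_compressed_bits(n_weights: int, bits: int, dim: int) -> int:
    """Per-layer size: a k × d float32 codebook plus one `bits`-bit index
    per d-dimensional weight vector (assumes n_weights divisible by dim)."""
    k = 2 ** bits
    codebook_bits = k * dim * 32
    assignment_bits = (n_weights // dim) * bits
    return codebook_bits + assignment_bits

# Example: a 2,359,296-weight conv layer under the 4/4 config
total = layer_compressed_bits(2_359_296, bits=4, dim=4)
print(total / 2_359_296)  # ~1.0009 effective bits per weight
```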

## Tests

All 16 test groups pass, covering:
1. Shape preservation (train & eval)
2. Distance matrix correctness
3. Attention matrix properties (row-sum=1, temperature effect)
4. Centroid convergence to cluster means
5. Gradient flow (differentiability, the key paper contribution)
6. Multi-dimensional clustering
7. Iterative convergence
8. Full compressor pipeline
9. Weight snapping for inference
10. Model export
11. Multi-step training stability
12. Paper configurations (Table 1)
13. K-means++ initialization
14. Warm start across batches
15. Numerical stability (large/small/uniform weights)
16. ResNet-like model compression

```bash
python tests/test_dkm.py
```

## Citation

```bibtex
@inproceedings{cho2022dkm,
  title={DKM: Differentiable k-Means Clustering Layer for Neural Network Compression},
  author={Cho, Minsik and Alizadeh-Vahid, Keivan and Adya, Saurabh and Rastegari, Mohammad},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2022},
  url={https://openreview.net/forum?id=J_F_qqCE3Z5}
}
```

## License

This is a research implementation. The original paper is by Apple Research (Cho et al., ICLR 2022).