---
license: mit
tags:
- speculative-decoding
- inference-optimization
- llm
- adaptive-inference
language:
- en
pipeline_tag: other
---

# SpecKV-MLP16: Adaptive Gamma Selector for Speculative Decoding

This is the trained acceptance-rate predictor from the SpecKV paper. At each speculation step it selects the speculation length (gamma) from draft-model signals, yielding 56.0% more tokens per speculation step than the fixed gamma=4 default.

## Quick Start

```python
import pickle

import numpy as np

# load the trained acceptance-rate predictor
with open("speckv_mlp16.pkl", "rb") as f:
    model = pickle.load(f)

# at each speculation step, extract these from the draft token distributions:
draft_entropy = 1.5      # mean entropy across draft tokens
draft_confidence = 0.72  # mean top-1 confidence
max_entropy = 2.3        # max entropy in the step
min_confidence = 0.45    # min confidence in the step
comp_enc = 0             # 0=fp16, 1=int8, 2=nf4

# pick the gamma with the highest expected token yield
best_gamma, best_expected = 2, 0
for gamma in [2, 4, 6, 8]:
    features = np.array([[draft_entropy, draft_confidence, max_entropy,
                          min_confidence, comp_enc, gamma]])
    pred_ar = np.clip(model.predict(features)[0], 0, 1)
    expected_tokens = pred_ar * gamma + 1
    if expected_tokens > best_expected:
        best_expected = expected_tokens
        best_gamma = gamma

print(f"Use gamma={best_gamma} (expected {best_expected:.1f} tokens)")
```
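
The four draft-signal features above come from the draft model's token distributions. A minimal sketch of how they might be computed, assuming you have the step's draft-token logits as an `(n_tokens, vocab_size)` array (the function name `draft_features` is illustrative, not part of this package):

```python
import numpy as np

def draft_features(logits):
    """Step-level signal features from draft-token logits of shape (n_tokens, vocab)."""
    # numerically stable softmax over the vocabulary for each draft token
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # per-token entropy (nats) and top-1 confidence
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    confidence = probs.max(axis=-1)
    return {
        "draft_entropy": float(entropy.mean()),
        "draft_confidence": float(confidence.mean()),
        "max_entropy": float(entropy.max()),
        "min_confidence": float(confidence.min()),
    }
```

High entropy and low top-1 confidence signal draft tokens the target model is less likely to accept, which is what pushes the selector toward a shorter gamma.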

## Framework-Agnostic Loading

If you do not want a scikit-learn dependency, load the raw weights directly:

```python
import numpy as np

weights = np.load("speckv_mlp16_weights.npz")
W1, b1 = weights["W1"], weights["b1"]  # shapes (6, 16), (16,)
W2, b2 = weights["W2"], weights["b2"]  # shapes (16, 1), (1,)

def predict(x):
    """Forward pass for a single 6-feature row; clip the result to [0, 1] before use."""
    h = np.maximum(0, x @ W1 + b1)  # ReLU hidden layer
    return float(h @ W2 + b2)
```
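
The same forward pass also works on a batch of rows, so all four gamma candidates can be scored in a single matrix multiply. A sketch using randomly initialized placeholder weights of the documented shapes (the real values come from `speckv_mlp16_weights.npz`):

```python
import numpy as np

rng = np.random.default_rng(0)
# placeholder weights with the documented shapes; load the real ones from the .npz
W1, b1 = rng.normal(size=(6, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

def predict_batch(X):
    """Forward pass for X of shape (n, 6); returns acceptance rates clipped to [0, 1]."""
    h = np.maximum(0, X @ W1 + b1)               # ReLU hidden layer
    return np.clip((h @ W2 + b2).ravel(), 0, 1)  # one prediction per row

# score all four gamma candidates at once
signals = [1.5, 0.72, 2.3, 0.45, 0]  # draft signals + compression code
X = np.array([signals + [g] for g in (2, 4, 6, 8)], dtype=float)
pred_ar = predict_batch(X)
```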

## Model Details

| Property | Value |
|:---|:---|
| Architecture | MLP, 1 hidden layer, 16 units, ReLU |
| Input | 6 features (entropy, confidence, max/min variants, compression, gamma) |
| Output | Acceptance-rate prediction (0-1) |
| Training data | 5,112 step-level records |
| Test MSE | 0.090 |
| Test correlation | 0.685 |
| Decision overhead | 0.34 ms (4 predictions per decision) |
| Improvement over fixed gamma=4 | 56.0% |
| Statistical significance | p < 0.001 |
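
The overhead figure reflects four predictor calls per decision, one per candidate gamma. The trade-off the selector navigates follows from the expected-token formula `pred_ar * gamma + 1`: a longer speculation only pays off if the acceptance rate holds up. An illustrative calculation with made-up per-gamma acceptance rates:

```python
# hypothetical acceptance-rate predictions for one step (not real model output)
preds = {2: 0.90, 4: 0.70, 6: 0.45, 8: 0.30}

# expected tokens per step: accepted draft tokens plus the target model's bonus token
expected = {g: ar * g + 1 for g, ar in preds.items()}
best_gamma = max(expected, key=expected.get)
```

Here gamma=4 wins (3.8 expected tokens) even though gamma=6 and gamma=8 speculate further, because the predicted acceptance rate falls off faster than the extra draft length adds.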

## Files

- `speckv_mlp16.pkl` - Full scikit-learn model (pickle)
- `speckv_mlp16_weights.npz` - Raw NumPy weights (W1, b1, W2, b2)
- `config.json` - Model configuration and metadata
- `requirements.txt` - Python dependencies

## Citation

```bibtex
@article{shukla2026speckv,
  title={SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection},
  author={Shukla, Shikhar},
  journal={arXiv preprint},
  year={2026}
}
```

## Links

- [Paper (arXiv)](https://arxiv.org/abs/2605.02888)
- [Code and Data (GitHub)](https://github.com/Amorfati123/SpecKV)