metadata
license: mit
tags:
- speculative-decoding
- inference-optimization
- llm
- adaptive-inference
language:
- en
pipeline_tag: other
SpecKV-MLP16: Adaptive Gamma Selector for Speculative Decoding
This is the trained acceptance rate predictor from the SpecKV paper. It selects the optimal speculation length (gamma) per step using draft model signals, achieving 56.0% more tokens per speculation step than the fixed gamma=4 default.
Quick Start
import pickle
import numpy as np
# load model
with open("speckv_mlp16.pkl", "rb") as f:
    model = pickle.load(f)
# at each speculation step, extract these from draft token distributions:
draft_entropy = 1.5 # mean entropy across draft tokens
draft_confidence = 0.72 # mean top-1 confidence
max_entropy = 2.3 # max entropy in the step
min_confidence = 0.45 # min confidence in the step
comp_enc = 0 # 0=fp16, 1=int8, 2=nf4
# pick best gamma
best_gamma, best_expected = 2, 0
for gamma in [2, 4, 6, 8]:
    features = np.array([[draft_entropy, draft_confidence, max_entropy, min_confidence, comp_enc, gamma]])
    pred_ar = np.clip(model.predict(features)[0], 0, 1)
    expected_tokens = pred_ar * gamma + 1
    if expected_tokens > best_expected:
        best_expected = expected_tokens
        best_gamma = gamma
print(f"Use gamma={best_gamma} (expected {best_expected:.1f} tokens)")
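The four draft signals in the example above are hard-coded. As a minimal sketch, they could be derived from the draft model's per-token probability distributions as shown below. The function name `draft_signals`, the input layout (one row per draft token), and the use of natural-log entropy are assumptions made for illustration; the model card does not state the log base or the exact extraction code used in the paper.

```python
import numpy as np

def draft_signals(probs, eps=1e-12):
    """Summarize one speculation step (hypothetical helper, not from the paper).

    probs: array of shape (num_draft_tokens, vocab_size); each row is the
    draft model's next-token distribution and is assumed to sum to 1.
    """
    probs = np.asarray(probs, dtype=np.float64)
    # Shannon entropy per draft token, in nats (assumption: natural log)
    entropy = -np.sum(probs * np.log(probs + eps), axis=-1)
    # Top-1 probability per draft token
    confidence = probs.max(axis=-1)
    return {
        "draft_entropy": float(entropy.mean()),
        "draft_confidence": float(confidence.mean()),
        "max_entropy": float(entropy.max()),
        "min_confidence": float(confidence.min()),
    }

# Toy example: 2 draft tokens over a 4-token vocabulary
example = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
])
signals = draft_signals(example)
print(signals)
```

If the predictor was trained on entropy in bits rather than nats, swap `np.log` for `np.log2` so the inputs match the training scale.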
Framework-Agnostic Loading
If you do not want a sklearn dependency, load the raw weights:
import numpy as np
weights = np.load("speckv_mlp16_weights.npz")
W1, b1 = weights["W1"], weights["b1"] # (6, 16), (16,)
W2, b2 = weights["W2"], weights["b2"] # (16, 1), (1,)
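def predict(x):
    # x: the 6 features in the Quick Start order, shape (6,) or (1, 6)
    h = np.maximum(0, x @ W1 + b1)  # hidden layer with ReLU
    return (h @ W2 + b2).item()  # raw scalar; clip to [0, 1] as in Quick Start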
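The gamma search from Quick Start can also be run on the raw weights without scikit-learn. The sketch below scores all candidate gammas in one batched forward pass and applies the same clipping and `pred_ar * gamma + 1` scoring as the Quick Start loop. The function name `select_gamma` and its argument layout are illustrative assumptions, and the weights in the demo are placeholders chosen so the result is easy to verify by hand, not the released weights.

```python
import numpy as np

def select_gamma(W1, b1, W2, b2, signals, comp_enc, candidates=(2, 4, 6, 8)):
    """Return (best_gamma, expected_tokens) for one speculation step.

    signals: (draft_entropy, draft_confidence, max_entropy, min_confidence),
    in the same order as the Quick Start feature vector.
    """
    gammas = np.asarray(candidates, dtype=np.float64)
    base = np.array(list(signals) + [comp_enc], dtype=np.float64)
    # One row per candidate gamma, gamma as the last feature: shape (n, 6)
    X = np.column_stack([np.tile(base, (len(gammas), 1)), gammas])
    h = np.maximum(0, X @ W1 + b1)                       # (n, 16), ReLU
    pred_ar = np.clip((h @ W2 + b2).ravel(), 0.0, 1.0)   # (n,)
    expected = pred_ar * gammas + 1
    best = int(np.argmax(expected))  # first maximum wins, like the strict '>' loop
    return int(gammas[best]), float(expected[best])

# Placeholder weights (NOT the trained ones): every input maps to pred_ar = 0.5
W1_demo, b1_demo = np.zeros((6, 16)), np.ones(16)
W2_demo, b2_demo = np.full((16, 1), 0.5 / 16), np.zeros(1)

best_gamma, best_expected = select_gamma(
    W1_demo, b1_demo, W2_demo, b2_demo,
    signals=(1.5, 0.72, 2.3, 0.45), comp_enc=0,
)
print(best_gamma, best_expected)
```

With the real weights, pass the arrays loaded from `speckv_mlp16_weights.npz` in place of the demo arrays.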
Model Details
| Property | Value |
|---|---|
| Architecture | MLP, 1 hidden layer, 16 units, ReLU |
| Input | 6 features (entropy, confidence, max/min variants, compression, gamma) |
| Output | Acceptance rate prediction (0-1) |
| Training data | 5,112 step-level records |
| Test MSE | 0.090 |
| Test correlation | 0.685 |
| Decision overhead | 0.34 ms (4 predictions per decision) |
| Improvement over fixed gamma=4 | 56.0% |
| Statistical significance | p < 0.001 |
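To check the decision overhead on your own hardware, a rough timing sketch such as the following may help. It times one decision (four predictions, one per candidate gamma) on the raw-NumPy path using random placeholder weights with the documented shapes. The table's figure was not necessarily measured this way, so expect different numbers; the names `decide_once` and `n_trials` are illustrative.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
# Placeholder weights with the documented shapes (not the trained values)
W1, b1 = rng.normal(size=(6, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 1)), rng.normal(size=1)

def decide_once(features_by_gamma):
    """Score all candidate gammas in one batched forward pass."""
    h = np.maximum(0, features_by_gamma @ W1 + b1)
    return (h @ W2 + b2).ravel()

# One row per candidate gamma, using the Quick Start example signals
X = np.array(
    [[1.5, 0.72, 2.3, 0.45, 0, g] for g in (2, 4, 6, 8)],
    dtype=np.float64,
)

n_trials = 1000
start = time.perf_counter()
for _ in range(n_trials):
    decide_once(X)
elapsed_ms = (time.perf_counter() - start) * 1000 / n_trials
print(f"mean decision overhead: {elapsed_ms:.4f} ms")
```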
Files
- speckv_mlp16.pkl - Full scikit-learn model (pickle)
- speckv_mlp16_weights.npz - Raw numpy weights (W1, b1, W2, b2)
- config.json - Model configuration and metadata
- requirements.txt - Python dependencies
Citation
@article{shukla2026speckv,
title={SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection},
author={Shukla, Shikhar},
journal={arXiv preprint},
year={2026}
}