---
license: mit
tags:
- speculative-decoding
- inference-optimization
- llm
- adaptive-inference
language:
- en
pipeline_tag: other
---

# SpecKV-MLP16: Adaptive Gamma Selector for Speculative Decoding

This is the trained acceptance-rate predictor from the SpecKV paper. At each speculation step it selects the speculation length (gamma) from draft-model signals, yielding 56.0% more tokens per speculation step than the fixed gamma=4 default.

## Quick Start

```python
import pickle

import numpy as np

# load the trained acceptance-rate predictor
with open("speckv_mlp16.pkl", "rb") as f:
    model = pickle.load(f)

# at each speculation step, extract these from the draft token distributions:
draft_entropy = 1.5      # mean entropy across draft tokens
draft_confidence = 0.72  # mean top-1 confidence
max_entropy = 2.3        # max entropy in the step
min_confidence = 0.45    # min confidence in the step
comp_enc = 0             # 0=fp16, 1=int8, 2=nf4

# pick the gamma with the highest expected token yield
best_gamma, best_expected = 2, 0
for gamma in [2, 4, 6, 8]:
    features = np.array([[draft_entropy, draft_confidence, max_entropy,
                          min_confidence, comp_enc, gamma]])
    pred_ar = np.clip(model.predict(features)[0], 0, 1)
    expected_tokens = pred_ar * gamma + 1
    if expected_tokens > best_expected:
        best_expected = expected_tokens
        best_gamma = gamma

print(f"Use gamma={best_gamma} (expected {best_expected:.1f} tokens)")
```
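
The four draft-signal features above come from the draft model's token distributions. A minimal sketch of how they might be computed, assuming you have the step's draft-token logits as an `(n_tokens, vocab_size)` array (the function name `draft_features` is illustrative, not part of this package):

```python
import numpy as np

def draft_features(logits):
    """Step-level signal features from draft-token logits of shape (n_tokens, vocab)."""
    # numerically stable softmax over the vocabulary for each draft token
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # per-token entropy (nats) and top-1 confidence
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    confidence = probs.max(axis=-1)
    return {
        "draft_entropy": float(entropy.mean()),
        "draft_confidence": float(confidence.mean()),
        "max_entropy": float(entropy.max()),
        "min_confidence": float(confidence.min()),
    }
```

High entropy and low top-1 confidence signal draft tokens the target model is less likely to accept, which is what pushes the selector toward a shorter gamma.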

## Framework-Agnostic Loading

If you do not want a scikit-learn dependency, load the raw weights directly:

```python
import numpy as np

weights = np.load("speckv_mlp16_weights.npz")
W1, b1 = weights["W1"], weights["b1"]  # shapes (6, 16), (16,)
W2, b2 = weights["W2"], weights["b2"]  # shapes (16, 1), (1,)

def predict(x):
    """Forward pass for a single 6-feature row; clip the result to [0, 1] before use."""
    h = np.maximum(0, x @ W1 + b1)  # ReLU hidden layer
    return float(h @ W2 + b2)
```
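
The same forward pass also works on a batch of rows, so all four gamma candidates can be scored in a single matrix multiply. A sketch using randomly initialized placeholder weights of the documented shapes (the real values come from `speckv_mlp16_weights.npz`):

```python
import numpy as np

rng = np.random.default_rng(0)
# placeholder weights with the documented shapes; load the real ones from the .npz
W1, b1 = rng.normal(size=(6, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

def predict_batch(X):
    """Forward pass for X of shape (n, 6); returns acceptance rates clipped to [0, 1]."""
    h = np.maximum(0, X @ W1 + b1)               # ReLU hidden layer
    return np.clip((h @ W2 + b2).ravel(), 0, 1)  # one prediction per row

# score all four gamma candidates at once
signals = [1.5, 0.72, 2.3, 0.45, 0]  # draft signals + compression code
X = np.array([signals + [g] for g in (2, 4, 6, 8)], dtype=float)
pred_ar = predict_batch(X)
```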

## Model Details

| Property | Value |
|:---|:---|
| Architecture | MLP, 1 hidden layer, 16 units, ReLU |
| Input | 6 features (entropy, confidence, max/min variants, compression, gamma) |
| Output | Acceptance-rate prediction (0-1) |
| Training data | 5,112 step-level records |
| Test MSE | 0.090 |
| Test correlation | 0.685 |
| Decision overhead | 0.34 ms (4 predictions per decision) |
| Improvement over fixed gamma=4 | 56.0% |
| Statistical significance | p < 0.001 |
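
The overhead figure reflects four predictor calls per decision, one per candidate gamma. The trade-off the selector navigates follows from the expected-token formula `pred_ar * gamma + 1`: a longer speculation only pays off if the acceptance rate holds up. An illustrative calculation with made-up per-gamma acceptance rates:

```python
# hypothetical acceptance-rate predictions for one step (not real model output)
preds = {2: 0.90, 4: 0.70, 6: 0.45, 8: 0.30}

# expected tokens per step: accepted draft tokens plus the target model's bonus token
expected = {g: ar * g + 1 for g, ar in preds.items()}
best_gamma = max(expected, key=expected.get)
```

Here gamma=4 wins (3.8 expected tokens) even though gamma=6 and gamma=8 speculate further, because the predicted acceptance rate falls off faster than the extra draft length adds.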

## Files

- `speckv_mlp16.pkl` - Full scikit-learn model (pickle)
- `speckv_mlp16_weights.npz` - Raw NumPy weights (W1, b1, W2, b2)
- `config.json` - Model configuration and metadata
- `requirements.txt` - Python dependencies

## Citation

```bibtex
@article{shukla2026speckv,
  title={SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection},
  author={Shukla, Shikhar},
  journal={arXiv preprint},
  year={2026}
}
```

## Links

- [Paper (arXiv)](https://arxiv.org/abs/2605.02888)
- [Code and Data (GitHub)](https://github.com/Amorfati123/SpecKV)