---
license: mit
tags:
- speculative-decoding
- inference-optimization
- llm
- adaptive-inference
language:
- en
pipeline_tag: other
---

# SpecKV-MLP16: Adaptive Gamma Selector for Speculative Decoding

This is the trained acceptance-rate predictor from the SpecKV paper. It selects the speculation length (gamma) per step from draft-model signals, achieving 56.0% more tokens per speculation step than the fixed gamma=4 default.

## Quick Start

```python
import pickle

import numpy as np

# load the trained scikit-learn MLP
with open("speckv_mlp16.pkl", "rb") as f:
    model = pickle.load(f)

# at each speculation step, extract these from the draft token distributions:
draft_entropy = 1.5      # mean entropy across draft tokens
draft_confidence = 0.72  # mean top-1 confidence
max_entropy = 2.3        # max entropy in the step
min_confidence = 0.45    # min confidence in the step
comp_enc = 0             # KV-cache compression: 0=fp16, 1=int8, 2=nf4

# pick the gamma with the highest expected token yield
best_gamma, best_expected = 2, 0.0
for gamma in [2, 4, 6, 8]:
    features = np.array([[draft_entropy, draft_confidence,
                          max_entropy, min_confidence, comp_enc, gamma]])
    pred_ar = np.clip(model.predict(features)[0], 0, 1)
    expected_tokens = pred_ar * gamma + 1
    if expected_tokens > best_expected:
        best_expected = expected_tokens
        best_gamma = gamma

print(f"Use gamma={best_gamma} (expected {best_expected:.1f} tokens)")
```

## Framework-Agnostic Loading

If you do not want a scikit-learn dependency, load the raw weights:

```python
import numpy as np

weights = np.load("speckv_mlp16_weights.npz")
W1, b1 = weights["W1"], weights["b1"]  # (6, 16), (16,)
W2, b2 = weights["W2"], weights["b2"]  # (16, 1), (1,)

def predict(x):
    h = np.maximum(0, x @ W1 + b1)  # ReLU hidden layer
    # clip to [0, 1], as the Quick Start does, since the raw output is unbounded
    return float(np.clip(h @ W2 + b2, 0, 1))
```

## Model Details

| Property | Value |
|:---|:---|
| Architecture | MLP, 1 hidden layer, 16 units, ReLU |
| Input | 6 features (entropy, confidence, max/min variants, compression, gamma) |
| Output | Acceptance-rate prediction (0-1) |
| Training data | 5,112 step-level records |
| Test MSE | 0.090 |
| Test correlation | 0.685 |
| Decision overhead | 0.34 ms (4 predictions per decision) |
| Improvement over fixed gamma=4 | 56.0% |
| Statistical significance | p < 0.001 |

## Files

- `speckv_mlp16.pkl` - Full scikit-learn model (pickle)
- `speckv_mlp16_weights.npz` - Raw NumPy weights (W1, b1, W2, b2)
- `config.json` - Model configuration and metadata
- `requirements.txt` - Python dependencies

## Citation

```bibtex
@article{shukla2026speckv,
  title={SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection},
  author={Shukla, Shikhar},
  journal={arXiv preprint},
  year={2026}
}
```

## Links

- [Paper (arXiv)](https://arxiv.org/abs/2605.02888)
- [Code and Data (GitHub)](https://github.com/Amorfati123/SpecKV)
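
## Computing the Input Features

The Quick Start assumes you already have the four distribution features in hand. As a minimal sketch, they can be derived from the draft model's per-token softmax outputs for one speculation step; the function name `extract_step_features` and the `(num_draft_tokens, vocab_size)` input layout here are illustrative assumptions, not part of the released API:

```python
import numpy as np

def extract_step_features(draft_probs, comp_enc=0):
    """Compute the selector's four distribution features for one step.

    draft_probs: (num_draft_tokens, vocab_size) softmax outputs
                 from the draft model (illustrative layout).
    comp_enc: KV-cache compression encoding (0=fp16, 1=int8, 2=nf4).
    """
    p = np.clip(draft_probs, 1e-12, 1.0)
    entropies = -(p * np.log(p)).sum(axis=-1)  # per-token entropy (nats)
    confidences = draft_probs.max(axis=-1)     # per-token top-1 probability
    return {
        "draft_entropy": float(entropies.mean()),
        "draft_confidence": float(confidences.mean()),
        "max_entropy": float(entropies.max()),
        "min_confidence": float(confidences.min()),
        "comp_enc": comp_enc,
    }

# toy example: 4 draft tokens over a 5-token vocabulary
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
feats = extract_step_features(probs)
print(feats)
```

These features, together with `comp_enc` and a candidate `gamma`, form the 6-dimensional input vector expected by the predictor.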