---
license: mit
tags:
- speculative-decoding
- inference-optimization
- llm
- adaptive-inference
language:
- en
pipeline_tag: other
---

# SpecKV-MLP16: Adaptive Gamma Selector for Speculative Decoding

This is the trained acceptance-rate predictor from the SpecKV paper. It selects the speculation length (gamma) per step from draft-model signals, achieving 56.0% more tokens per speculation step than the fixed gamma=4 default.

## Quick Start

```python
import pickle

import numpy as np

# load the trained scikit-learn MLP
with open("speckv_mlp16.pkl", "rb") as f:
    model = pickle.load(f)

# at each speculation step, extract these from the draft token distributions:
draft_entropy = 1.5      # mean entropy across draft tokens
draft_confidence = 0.72  # mean top-1 confidence
max_entropy = 2.3        # max entropy in the step
min_confidence = 0.45    # min confidence in the step
comp_enc = 0             # KV-cache compression: 0=fp16, 1=int8, 2=nf4

# pick the gamma with the highest expected token yield
best_gamma, best_expected = 2, 0.0
for gamma in [2, 4, 6, 8]:
    features = np.array([[draft_entropy, draft_confidence,
                          max_entropy, min_confidence, comp_enc, gamma]])
    pred_ar = np.clip(model.predict(features)[0], 0, 1)
    expected_tokens = pred_ar * gamma + 1
    if expected_tokens > best_expected:
        best_expected = expected_tokens
        best_gamma = gamma

print(f"Use gamma={best_gamma} (expected {best_expected:.1f} tokens)")
```

## Framework-Agnostic Loading

If you do not want a scikit-learn dependency, load the raw weights:

```python
import numpy as np

weights = np.load("speckv_mlp16_weights.npz")
W1, b1 = weights["W1"], weights["b1"]  # (6, 16), (16,)
W2, b2 = weights["W2"], weights["b2"]  # (16, 1), (1,)

def predict(x):
    h = np.maximum(0, x @ W1 + b1)  # ReLU hidden layer
    # clip to [0, 1], as the Quick Start does, since the raw output is unbounded
    return float(np.clip(h @ W2 + b2, 0, 1))
```

## Model Details

| Property | Value |
|:---|:---|
| Architecture | MLP, 1 hidden layer, 16 units, ReLU |
| Input | 6 features (entropy, confidence, max/min variants, compression, gamma) |
| Output | Acceptance-rate prediction (0-1) |
| Training data | 5,112 step-level records |
| Test MSE | 0.090 |
| Test correlation | 0.685 |
| Decision overhead | 0.34 ms (4 predictions per decision) |
| Improvement over fixed gamma=4 | 56.0% |
| Statistical significance | p < 0.001 |

## Files

- `speckv_mlp16.pkl` - Full scikit-learn model (pickle)
- `speckv_mlp16_weights.npz` - Raw NumPy weights (W1, b1, W2, b2)
- `config.json` - Model configuration and metadata
- `requirements.txt` - Python dependencies

## Citation

```bibtex
@article{shukla2026speckv,
  title={SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection},
  author={Shukla, Shikhar},
  journal={arXiv preprint},
  year={2026}
}
```

## Links

- [Paper (arXiv)](https://arxiv.org/abs/2605.02888)
- [Code and Data (GitHub)](https://github.com/Amorfati123/SpecKV)
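
## Computing the Input Features

The Quick Start assumes you already have the four distribution features in hand. As a minimal sketch, they can be derived from the draft model's per-token softmax outputs for one speculation step; the function name `extract_step_features` and the `(num_draft_tokens, vocab_size)` input layout here are illustrative assumptions, not part of the released API:

```python
import numpy as np

def extract_step_features(draft_probs, comp_enc=0):
    """Compute the selector's four distribution features for one step.

    draft_probs: (num_draft_tokens, vocab_size) softmax outputs
                 from the draft model (illustrative layout).
    comp_enc: KV-cache compression encoding (0=fp16, 1=int8, 2=nf4).
    """
    p = np.clip(draft_probs, 1e-12, 1.0)
    entropies = -(p * np.log(p)).sum(axis=-1)  # per-token entropy (nats)
    confidences = draft_probs.max(axis=-1)     # per-token top-1 probability
    return {
        "draft_entropy": float(entropies.mean()),
        "draft_confidence": float(confidences.mean()),
        "max_entropy": float(entropies.max()),
        "min_confidence": float(confidences.min()),
        "comp_enc": comp_enc,
    }

# toy example: 4 draft tokens over a 5-token vocabulary
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
feats = extract_step_features(probs)
print(feats)
```

These features, together with `comp_enc` and a candidate `gamma`, form the 6-dimensional input vector expected by the predictor.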