license: mit
tags:
  - speculative-decoding
  - inference-optimization
  - llm
  - adaptive-inference
language:
  - en
pipeline_tag: other

SpecKV-MLP16: Adaptive Gamma Selector for Speculative Decoding

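This is the trained acceptance-rate predictor from the SpecKV paper: a small MLP that, at each speculation step, predicts the acceptance rate for each candidate speculation length (gamma) from draft-model signals and picks the gamma with the highest expected token yield. It is reported to achieve 56.0% more tokens per speculation step than the fixed gamma=4 default.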

Quick Start

import pickle
import numpy as np

# load the pickled scikit-learn model (scikit-learn must be installed)
with open("speckv_mlp16.pkl", "rb") as f:
    model = pickle.load(f)

# at each speculation step, extract these from draft token distributions:
draft_entropy = 1.5       # mean entropy across draft tokens
draft_confidence = 0.72   # mean top-1 confidence
max_entropy = 2.3         # max entropy in the step
min_confidence = 0.45     # min confidence in the step
comp_enc = 0              # 0=fp16, 1=int8, 2=nf4

# pick best gamma
best_gamma, best_expected = 2, 0
for gamma in [2, 4, 6, 8]:
    features = np.array([[draft_entropy, draft_confidence, max_entropy, min_confidence, comp_enc, gamma]])
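    # predicted per-token acceptance rate, clipped to the valid [0, 1] range
    pred_ar = np.clip(model.predict(features)[0], 0, 1)
    # expected accepted draft tokens, plus the one token each verification step yields
    expected_tokens = pred_ar * gamma + 1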
    if expected_tokens > best_expected:
        best_expected = expected_tokens
        best_gamma = gamma

print(f"Use gamma={best_gamma} (expected {best_expected:.1f} tokens)")
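
The four draft-side features above are hard-coded placeholders. A minimal sketch of one way to compute them, assuming `probs` is a `(num_draft_tokens, vocab_size)` array of draft-model softmax probabilities and that entropy is measured in nats. The function name `draft_features` and both assumptions are illustrative and not taken from the SpecKV code, so check them against the paper before relying on the numbers:

```python
import numpy as np

def draft_features(probs, eps=1e-12):
    """Summarize per-token draft distributions into four scalar features.

    probs: array of shape (num_draft_tokens, vocab_size); each row sums to 1.
    Assumption: entropy uses the natural log (nats).
    """
    probs = np.asarray(probs, dtype=np.float64)
    # Shannon entropy of each drafted token's distribution
    entropy = -np.sum(probs * np.log(probs + eps), axis=-1)
    # Top-1 probability of each drafted token's distribution
    confidence = probs.max(axis=-1)
    return {
        "draft_entropy": float(entropy.mean()),
        "draft_confidence": float(confidence.mean()),
        "max_entropy": float(entropy.max()),
        "min_confidence": float(confidence.min()),
    }

# Toy example: two drafted tokens over a 4-word vocabulary
toy_probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
feats = draft_features(toy_probs)
print(feats)
```

The returned values map one-to-one onto the first four entries of the `features` row built in the loop above.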

Framework-Agnostic Loading

If you do not want a sklearn dependency, load the raw weights:

import numpy as np

weights = np.load("speckv_mlp16_weights.npz")
W1, b1 = weights["W1"], weights["b1"]  # (6, 16), (16,)
W2, b2 = weights["W2"], weights["b2"]  # (16, 1), (1,)

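def predict(x):
    # x: the 6 features in Quick Start order, shape (6,) or (1, 6)
    h = np.maximum(0, x @ W1 + b1)  # ReLU hidden layer
    # raw, unclipped output; .item() avoids float() on a non-scalar array (deprecated in recent NumPy)
    return (h @ W2 + b2).item()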
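
With the raw weights, all candidate gammas can be scored in one batched forward pass instead of a Python loop. The sketch below mirrors the Quick Start selection rule. `select_gamma` is a hypothetical helper name, and the random weights only stand in for `speckv_mlp16_weights.npz` so the snippet runs without the file:

```python
import numpy as np

def select_gamma(W1, b1, W2, b2, step_features, gammas=(2, 4, 6, 8)):
    """Return (best_gamma, expected_tokens) for one speculation step.

    step_features: [draft_entropy, draft_confidence, max_entropy,
                    min_confidence, comp_enc] (same order as Quick Start).
    """
    gammas = np.asarray(gammas, dtype=np.float64)
    # One row per candidate gamma: the 5 step features followed by gamma
    base = np.tile(np.asarray(step_features, dtype=np.float64), (len(gammas), 1))
    X = np.column_stack([base, gammas])            # shape (n_gammas, 6)
    h = np.maximum(0.0, X @ W1 + b1)               # ReLU hidden layer
    pred_ar = np.clip((h @ W2 + b2).ravel(), 0.0, 1.0)
    expected = pred_ar * gammas + 1.0              # same rule as Quick Start
    best = int(np.argmax(expected))                # ties go to the earlier candidate
    return int(gammas[best]), float(expected[best])

# Stand-in weights with the documented shapes (NOT the trained model)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(6, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 1)), rng.normal(size=1)

best_gamma, best_expected = select_gamma(W1, b1, W2, b2, [1.5, 0.72, 2.3, 0.45, 0])
print(best_gamma, best_expected)
```

To use the trained model, replace the stand-in arrays with the ones loaded from the `.npz` file.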

Model Details

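| Property | Value |
|---|---|
| Architecture | MLP, 1 hidden layer, 16 units, ReLU |
| Input | 6 features: mean entropy, mean top-1 confidence, max entropy, min confidence, compression encoding (0=fp16, 1=int8, 2=nf4), gamma |
| Output | Predicted acceptance rate (0-1; Quick Start clips raw outputs to this range) |
| Training data | 5,112 step-level records |
| Test MSE | 0.090 |
| Test correlation | 0.685 |
| Decision overhead | 0.34 ms (4 predictions per decision) |
| Improvement over fixed gamma=4 | 56.0% more tokens per speculation step |
| Statistical significance | p < 0.001 |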
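
The decision overhead will vary with hardware and library versions. A rough way to measure it locally is to time four single-row forward passes per decision. The sketch below uses random stand-in weights of the documented shapes, so it times the NumPy arithmetic only, not the pickled scikit-learn model, and `decision_overhead_ms` is a hypothetical helper name:

```python
import time
import numpy as np

# Stand-in weights with the documented shapes (NOT the trained model)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(6, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 1)), rng.normal(size=1)

def predict(x):
    h = np.maximum(0, x @ W1 + b1)  # ReLU hidden layer
    return (h @ W2 + b2).item()

def decision_overhead_ms(n_decisions=1000, gammas=(2, 4, 6, 8)):
    """Average wall-clock time per decision (one prediction per candidate gamma)."""
    step = [1.5, 0.72, 2.3, 0.45, 0.0]  # placeholder step features
    start = time.perf_counter()
    for _ in range(n_decisions):
        for gamma in gammas:
            predict(np.array(step + [gamma]))
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / n_decisions

overhead_ms = decision_overhead_ms()
print(f"{overhead_ms:.3f} ms per decision")
```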

Files

speckv_mlp16.pkl - Full scikit-learn model (pickle)
speckv_mlp16_weights.npz - Raw numpy weights (W1, b1, W2, b2)
config.json - Model configuration and metadata
requirements.txt - Python dependencies

Citation

@article{shukla2026speckv,
  title={SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection},
  author={Shukla, Shikhar},
  journal={arXiv preprint},
  year={2026}
}
