Dogacel/paper-attention-drift-llama-post-gated
Models trained as part of the "Attention Drift: What Speculative Decoding Models Learn" paper, shared for reproducing its experiments.
- Llama 3.1 8B Drafter - Post-norm + Gated Attention
- Llama 3.1 8B Drafter - Pre-norm, trained with TTT=2
- Llama 3.1 8B Drafter - Post-norm, trained with TTT=2
- Llama 3.1 8B Drafter - Pre-norm + Gated Attention
- Llama 3.1 8B Drafter - Post-norm
- Qwen3.5 9B Drafter - Pre-norm
- Qwen3.5 9B Drafter - Post-norm
- GPT-oss 20B Drafter - Pre-norm
- GPT-oss 20B Drafter - Post-norm
- Qwen3 8B Drafter - Pre-norm
- Qwen3 8B Drafter - Post-norm
- Llama 3.1 8B Drafter - No-norm (not used in paper)
- Llama 3.1 8B Drafter - No-norm + Gated Attention (not used in paper)