Dogacel/paper-attention-drift-llama-post-gated
Models trained as part of the "Attention Drift: What Speculative Decoding Models Learn" paper, shared for reproducing its experiments.
- Llama 3.1 8B Drafter - Post-norm + Gated Attention
- Llama 3.1 8B Drafter - Pre-norm, trained with TTT=2
- Llama 3.1 8B Drafter - Post-norm, trained with TTT=2
- Llama 3.1 8B Drafter - Pre-norm + Gated Attention
- Llama 3.1 8B Drafter - Post-norm
- Qwen3.5 9B Drafter - Pre-norm
- Qwen3.5 9B Drafter - Post-norm
- GPT-oss 20B Drafter - Pre-norm
- GPT-oss 20B Drafter - Post-norm
- Qwen3 8B Drafter - Pre-norm
- Qwen3 8B Drafter - Post-norm
- Llama 3.1 8B Drafter - No-norm (not used in paper)
- Llama 3.1 8B Drafter - No-norm + Gated Attention (not used in paper)