LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference
Abstract
Layer-aligned distillation and convergence-based early exit are incompatible under standard deployment conditions; LEAP (Layer-wise Exit-Aware Pretraining) resolves this incompatibility through an auxiliary training objective that aligns intermediate layers with final-layer representations, recovering a 1.61× wall-clock speedup in transformer inference.
Layer-aligned distillation and convergence-based early exit are two predominant efficiency paradigms for transformer inference, yet we establish that they are systematically incompatible under standard deployment conditions. Distillation objectives that align intermediate student layers to teacher representations suppress the representational convergence that early-exit mechanisms exploit, rendering those mechanisms ineffective on distilled models. We introduce LEAP (Layer-wise Exit-Aware Pretraining), an auxiliary training objective that reconciles the two paradigms. LEAP requires no architectural modifications; it augments standard distillation with a single constraint ensuring that intermediate layers approximate final-layer representations. At exit threshold θ = 0.95, LEAP-MiniLM achieves a 1.61× measured wall-clock speedup (batch size 1, NVIDIA L4), with 91.9% of samples exiting by layer 7 and a 1.80× theoretical layer reduction, where standard distilled models achieve zero effective speedup. We validate across sentence similarity (STS-B: 0.760 ± 0.006) and retrieval benchmarks (BEIR), and provide operational guidance including latency measurements, decision thresholds, and deployment criteria.
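The abstract does not give the exact form of the LEAP constraint, but the mechanism it describes (augmenting a standard distillation loss so that each intermediate layer approximates the final-layer representation) can be sketched. Below is a minimal PyTorch sketch; the MSE distance, the detached final-layer target, and the weighting `alpha` are illustrative assumptions, not the paper's specified objective.

```python
import torch
import torch.nn.functional as F

def leap_augmented_loss(hidden_states, distill_loss, alpha=1.0):
    """Hypothetical LEAP-style objective: a standard distillation loss plus
    an alignment term pulling every intermediate layer toward the
    final-layer representation.

    hidden_states: list of [batch, seq, dim] tensors, one per layer,
    with hidden_states[-1] the final layer.
    """
    target = hidden_states[-1].detach()  # final-layer representations as a fixed target
    align = sum(F.mse_loss(h, target) for h in hidden_states[:-1])
    align = align / (len(hidden_states) - 1)  # average over intermediate layers
    return distill_loss + alpha * align
```

Convergence-based early exit, the mechanism LEAP re-enables, stops the forward pass once successive layer representations stop changing. A sketch under the same caveat, reusing the imports above: the cosine-similarity test and the per-batch exit rule are assumptions (at batch size 1, as in the reported measurements, this is a per-sample exit), while θ = 0.95 matches the threshold reported in the abstract.

```python
@torch.no_grad()
def forward_with_early_exit(layers, x, theta=0.95):
    """Run transformer layers sequentially, exiting once consecutive
    layer outputs converge (mean cosine similarity >= theta)."""
    prev = x
    for depth, layer in enumerate(layers, start=1):
        curr = layer(prev)
        # compare this layer's output to the previous layer's output
        sim = F.cosine_similarity(curr.flatten(1), prev.flatten(1), dim=-1).mean()
        if sim >= theta:
            return curr, depth  # representations converged: exit here
        prev = curr
    return prev, len(layers)  # no early exit: full depth used
```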
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- River-LLM: Large Language Model Seamless Exit Based on KV Share (2026)
- TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference (2026)
- The Diminishing Returns of Early-Exit Decoding in Modern LLMs (2026)
- Two-dimensional early exit optimisation of LLM inference (2026)
- AutoCompress: Critical Layer Isolation for Efficient Transformer Compression (2026)
- MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation (2026)
- SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration (2026)