CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference
Long-context LLM inference demands high accuracy at low latency, yet as context grows, decoding becomes increasingly constrained by KV-cache accesses. Prior pruning methods are largely context-agnostic: their token selection ignores step-wise relevance and local semantics, which degrades output quality. Moreover, their irregular memory accesses and selection overheads yield only limited wall-clock speedups. To address these issues, we propose CHESS, an algorithm-system co-designed KV-cache management system. Algorithmically, CHESS introduces a context-aware, hierarchical selection policy that dynamically reconstructs a coherent context for the current decoding step. On the system side, coarse-grained selection eliminates expensive data movement, translating theoretical sparsity into practical acceleration. Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only 1\% of the KV cache, delivers stable low-latency inference with up to $4.56\times$ higher throughput, and consistently outperforms strong baselines. Code is available at https://anonymous.4open.science/r/CHESS-9958/.
