arxiv:2512.06443

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

Published on Dec 6, 2025
Abstract

Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table (LUT)-based inference, CPUs run these ultra-low-bit LLMs even faster than NPUs, opening new opportunities for ubiquitous on-device intelligence. However, this paper identifies that LUT-based inference underutilizes memory bandwidth during parallel inference, which is required for prefilling, test-time scaling, and other multi-token scenarios. The root cause is the scalar LUT paradigm, which performs repetitive and non-contiguous memory accesses for each token. To solve the issue, we propose vector LUT, a new lookup paradigm that constructs a unified LUT across parallel tokens and performs a single 1→N lookup per index. To realize it efficiently, we further introduce (1) Vector LUT-Centric Tensor Layout and (2) Cache-Aware Streamed Lookup techniques. Evaluations on 5 edge devices across 3 LLMs show that Vec-LUT outperforms state-of-the-art baselines by up to 4.2×. Our implementation is integrated into llama.cpp. The code is available at https://github.com/Cipherxzc/vlut.cpp.

Community

Paper author

Vec-LUT provides a fast mixed-precision GeMM (mpGeMM) kernel for parallel inference of ternary (1.58-bit) LLMs, achieving up to 4.2× acceleration on CPU with llama.cpp integration.

We have released the source code on GitHub. Feel free to try vlut.cpp with the pre-converted models out of the box. Pure C/C++, no third-party dependencies.


Parallel inference scenarios include:

  • Prefilling (parallel input, most common).
  • Serving (mixed parallel input and output).
  • Parallel test-time scaling and speculative decoding (parallel output).

The Vec-LUT kernel is fast with:

  • Lookup table (LUT)-based design that replaces dequantization and multiplication with efficient table lookup.
  • Vector LUT paradigm that performs efficient 1→N lookup and turns random lookup into contiguous vector addition.
  • Vector LUT-centric tensor layout and cache-aware streamed lookup that optimize the memory access patterns.

