Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices
Abstract
Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table (LUT)-based inference, CPUs run these ultra-low-bit LLMs even faster than NPUs, opening new opportunities for ubiquitous on-device intelligence. However, this paper identifies that LUT-based inference underutilizes memory bandwidth during parallel inference, which is required for prefilling, test-time scaling, and other multi-token scenarios. The root cause is the scalar LUT paradigm, which performs repetitive and non-contiguous memory accesses for each token. To solve the issue, we propose vector LUT, a new lookup paradigm that constructs a unified LUT across parallel tokens and performs a single 1→N lookup per index. To realize it efficiently, we further introduce (1) Vector LUT-Centric Tensor Layout and (2) Cache-Aware Streamed Lookup techniques. Evaluations on 5 edge devices across 3 LLMs show that Vec-LUT outperforms state-of-the-art baselines by up to 4.2×. Our implementation is integrated into llama.cpp. The code is available at https://github.com/Cipherxzc/vlut.cpp.
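To make the paradigm shift in the abstract concrete: in the scalar paradigm each token effectively owns its own table, so every weight index costs one scattered scalar load per token, whereas the vector paradigm lays the table out so a single lookup returns a contiguous vector of partial sums for all parallel tokens, accumulated with a contiguous vector addition. The following is an illustrative sketch under assumed layouts (ternary weights grouped 4 at a time, so 3^4 = 81 table entries per group); the function names and tensor layouts are our assumptions, not the paper's actual kernel.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr int ENTRIES = 81;  // 3^4 ternary weight patterns per group (assumed grouping)

// Scalar LUT paradigm: one table per (token, group), layout [tokens][groups][ENTRIES].
// Each weight index triggers one scalar load per token, and loads for
// different tokens land far apart in memory (non-contiguous).
void scalar_lut_mm(const std::vector<float>& lut,    // [tokens][groups][ENTRIES]
                   const std::vector<uint8_t>& idx,  // one packed index per group
                   int tokens, int groups,
                   std::vector<float>& out) {        // [tokens]
    for (int t = 0; t < tokens; ++t) {
        float acc = 0.0f;
        for (int g = 0; g < groups; ++g)
            acc += lut[(t * groups + g) * ENTRIES + idx[g]];  // repeated scattered load
        out[t] = acc;
    }
}

// Vector LUT paradigm: a unified table laid out [groups][ENTRIES][tokens].
// A single 1→N lookup per weight index yields a contiguous run of `tokens`
// partial sums, which is folded in with a contiguous (vectorizable) addition.
void vector_lut_mm(const std::vector<float>& lut,    // [groups][ENTRIES][tokens]
                   const std::vector<uint8_t>& idx,
                   int tokens, int groups,
                   std::vector<float>& out) {
    std::fill(out.begin(), out.end(), 0.0f);
    for (int g = 0; g < groups; ++g) {
        const float* row = &lut[(g * ENTRIES + idx[g]) * tokens];  // one lookup
        for (int t = 0; t < tokens; ++t)                           // contiguous add
            out[t] += row[t];
    }
}
```

Both routines compute the same result; the difference is purely the memory access pattern, which is what the paper identifies as the bandwidth bottleneck.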
Community
Vec-LUT is a fast mpGeMM kernel for parallel inference of ternary (1.58-bit) LLMs, achieving up to 4.2× speedup on CPU, with llama.cpp integration.
We have released the source code on GitHub. Try vlut.cpp out of the box with pre-converted models. Pure C/C++, no third-party dependencies.
Parallel inference scenarios include:
- Prefilling (parallel input, most common).
- Serving (mixed parallel input and output).
- Parallel test-time scaling and speculative decoding (parallel output).
The Vec-LUT kernel is fast with:
- Lookup table (LUT)-based design that replaces dequantization and multiplication with efficient table lookup.
- Vector LUT paradigm that performs efficient 1→N lookup and turns random lookup into contiguous vector addition.
- Vector LUT-centric tensor layout and cache-aware streamed lookup that optimize the memory access patterns.
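To illustrate the unified-LUT construction the bullets describe, here is a hedged sketch of how a per-group vector LUT across parallel tokens could be precomputed for ternary weights grouped 4 at a time (3^4 = 81 patterns). The `[ENTRIES][tokens]` layout, the base-3 index encoding, and the function name are assumptions for illustration; the actual Vec-LUT kernel uses its own tensor layout and streaming schedule.

```cpp
#include <vector>

constexpr int G = 4;         // ternary weights per group (assumed)
constexpr int ENTRIES = 81;  // 3^G possible weight patterns

// Build one group's vector LUT across all parallel tokens.
// acts: [tokens][G] activations for this group.
// lut (output): [ENTRIES][tokens], so entry e holds a contiguous
// vector of partial sums, one per token — the target of a 1→N lookup.
void build_vector_lut(const std::vector<float>& acts, int tokens,
                      std::vector<float>& lut) {
    lut.assign(ENTRIES * tokens, 0.0f);
    for (int e = 0; e < ENTRIES; ++e) {
        int d = e;
        for (int k = 0; k < G; ++k) {
            int w = d % 3 - 1;  // base-3 digit -> ternary weight {-1, 0, +1}
            d /= 3;
            if (w == 0) continue;
            for (int t = 0; t < tokens; ++t)  // contiguous writes per entry
                lut[e * tokens + t] += w * acts[t * G + k];
        }
    }
}
```

The precomputation cost is amortized: the same table serves every weight group in the output row, and each subsequent lookup replaces a dequantize-multiply-accumulate with one contiguous vector load and add.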