Is this per-channel int4 quantized?
https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf
The tech report says:
"Based on the most popular open source quantization inference engines (e.g. llama.cpp), we focus on three weight representations: per-channel int4, per-block int4, and switched fp8. In Table 3, we report the memory filled by raw...."
Does that mean this is the "per-channel int4" variant described there? Q4_0 is clearly block-32 per
this:
https://github.com/ggml-org/llama.cpp/wiki/Tensor-Encoding-Schemes
Hi, sorry for the late reply.
In the context of llama.cpp, Q4_0 is a specific implementation of a per-block quantization scheme. As you correctly noted from the llama.cpp wiki, it uses a block size of 32. So while the Gemma report uses the general term "per-block int4", Q4_0 is a concrete example of that technique with a specific block size.
So the report is not saying that Gemma uses Q4_0. It is saying that the model was evaluated with a general "per-block int4" method, and Q4_0 is a specific, popular implementation of that concept in llama.cpp.
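To make the distinction concrete, here is a minimal NumPy sketch of the two granularities being discussed: one int4 scale per block of 32 weights (Q4_0-style) versus one scale per whole channel (row). This is a simplified illustration, not llama.cpp's actual Q4_0 layout (which packs nibbles and stores an fp16 scale per block); the function names and the symmetric scale choice `amax / 7` are my own for the example.

```python
import numpy as np

BLOCK = 32  # Q4_0 uses blocks of 32 weights per scale

def int4_round_trip(x, scale):
    """Symmetric int4 quantize/dequantize: integers in [-8, 7] times a scale."""
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale

def per_block_int4(row):
    """One scale per block of 32 weights (the 'per-block int4' scheme)."""
    out = np.empty_like(row)
    for i in range(0, len(row), BLOCK):
        blk = row[i:i + BLOCK]
        amax = np.max(np.abs(blk))
        scale = amax / 7.0 if amax > 0 else 1.0
        out[i:i + BLOCK] = int4_round_trip(blk, scale)
    return out

def per_channel_int4(row):
    """A single scale for the whole channel (row) of the weight matrix."""
    amax = np.max(np.abs(row))
    scale = amax / 7.0 if amax > 0 else 1.0
    return int4_round_trip(row, scale)

rng = np.random.default_rng(0)
w = rng.normal(size=128).astype(np.float32)
err_block = np.mean((w - per_block_int4(w)) ** 2)
err_chan = np.mean((w - per_channel_int4(w)) ** 2)
print(err_block, err_chan)  # finer per-block granularity usually gives lower error
```

The trade-off the report's three representations capture: finer granularity (per-block) adapts the scale to local weight magnitudes and typically lowers reconstruction error, at the cost of storing more scales per tensor.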
Thank you.