Is this per-channel int4 quantized?
https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf
The tech report says:
"Based on the most popular open source quantization inference engines (e.g. llama.cpp), we focus on three weight representations: per-channel int4, per-block int4, and switched fp8. In Table 3, we report the memory filled by raw...."
Does that mean this is the "per-channel int4" variant described there? Q4_0 is clearly block-32 per
this:
https://github.com/ggml-org/llama.cpp/wiki/Tensor-Encoding-Schemes
Hi, sorry for the late reply.
In the context of llama.cpp, Q4_0 is a specific implementation of a per-block quantization scheme. As you correctly noted from the llama.cpp wiki, it uses a block size of 32. So while the Gemma report uses the general term "per-block int4", Q4_0 is a concrete example of that technique with a specific block size.
So the report is not saying that Gemma uses Q4_0. It is saying that the model was evaluated with a general "per-block int4" method, and Q4_0 is a specific, popular implementation of that concept in llama.cpp.
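To make the distinction concrete, here is a minimal NumPy sketch of the two granularities being discussed: one int4 scale per block of 32 weights (Q4_0-style) versus one scale per whole channel (row). This is a simplified illustration, not llama.cpp's actual Q4_0 layout (which packs nibbles and stores an fp16 scale per block); the function names and the symmetric scale choice `amax / 7` are my own for the example.

```python
import numpy as np

BLOCK = 32  # Q4_0 uses blocks of 32 weights per scale

def int4_round_trip(x, scale):
    """Symmetric int4 quantize/dequantize: integers in [-8, 7] times a scale."""
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale

def per_block_int4(row):
    """One scale per block of 32 weights (the 'per-block int4' scheme)."""
    out = np.empty_like(row)
    for i in range(0, len(row), BLOCK):
        blk = row[i:i + BLOCK]
        amax = np.max(np.abs(blk))
        scale = amax / 7.0 if amax > 0 else 1.0
        out[i:i + BLOCK] = int4_round_trip(blk, scale)
    return out

def per_channel_int4(row):
    """A single scale for the whole channel (row) of the weight matrix."""
    amax = np.max(np.abs(row))
    scale = amax / 7.0 if amax > 0 else 1.0
    return int4_round_trip(row, scale)

rng = np.random.default_rng(0)
w = rng.normal(size=128).astype(np.float32)
err_block = np.mean((w - per_block_int4(w)) ** 2)
err_chan = np.mean((w - per_channel_int4(w)) ** 2)
print(err_block, err_chan)  # finer per-block granularity usually gives lower error
```

The trade-off the report's three representations capture: finer granularity (per-block) adapts the scale to local weight magnitudes and typically lowers reconstruction error, at the cost of storing more scales per tensor.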
Thank you.