Good quant!

#1
by qenme - opened

I've been testing various quants for this model, and so far this one gave the best output compared to Q8_0 and UD Q8 K XL (in llama.cpp and ik_llama.cpp). It also gives better output than the INT4 variants in vLLM, but that's not too surprising. My test cases involved prompting for atypical games written in Rust with the Bevy game engine in an agentic scenario. I ran a few of them per model, and this one consistently gave much better quality; for example, the procedurally generated trees and clouds were much better.

I'd be very curious to see a similar quant for Gemma 4 31B. Could I take your recipe and do that myself? I've never used Intel AutoRound; I've only done some basic quants in llama.cpp by choosing layers and sublayers to leave in BF16. I suspect Gemma 4 might be the better model, whereas Qwen might be more benchmark-tuned. However, I can't run either of them in full BF16 precision.

I concur about the quality and would love to see a Gemma-4-31B-INT8-AutoRound version as well. It would put Qwen3.6 and Gemma-4 on equal footing for a better comparison.

Thank you for using my model! I'm really glad to hear that it's performing well.
Regarding a Gemma-4-31B-INT8-AutoRound version, I agree that it would be great for comparison. I actually tried to quantize it, but unfortunately, unlike with Qwen3.6, I ran into some issues. In my environment, the quantization either threw errors or processed at an impractically slow speed, so I haven't been able to successfully create it yet.
If you'd like to try quantizing it yourself, using the auto-round API within llm-compressor might be a good approach; I'm actually testing that method myself at the moment.
Please note that the layer names in Qwen and Gemma 4 are different, so you won't be able to just copy and paste the exact same configuration. If you do attempt it, I highly recommend excluding vision_tower, embed_vision, and lm_head from being quantized. You can find the exact layer names for Gemma 4 by checking the model.safetensors.index.json file in its repository.
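If you want to poke at those names before committing to a config, here's a rough sketch of how you could list them from that index file; it's untested, and the local path is just a placeholder for wherever you downloaded the model:

# Sketch: list module-name prefixes from the checkpoint's safetensors index.
import json
from collections import Counter

# Placeholder path; point it at your local copy of the model.
with open("gemma-4-31B-it/model.safetensors.index.json") as f:
    index = json.load(f)

# Keys in weight_map look like "model.language_model.layers.0.self_attn.q_proj.weight",
# so the first couple of name components are enough to spot vision_tower, embed_vision and lm_head.
prefixes = Counter(".".join(name.split(".")[:2]) for name in index["weight_map"])
for prefix, count in prefixes.most_common():
    print(f"{prefix}: {count} tensors")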

TBH I've never quantized a model before, so with zero experience I'd prefer not to make a dog's dinner out of any model. :)

So I gave it a shot, but I'm getting OOM with any batch size (2, 4, 8), and I have 3x 3090. 😒

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32,max_non_split_rounding_mb:1024 \
auto-round \
    google_gemma-4-31B-it \
    --device 0,1,2 \
    --enable_torch_compile \
    --data_type 'int' \
    --group_size 128 \
    --batch_size 2 \
    --nsamples 512 \
    --seqlen 2048 \
    --iters 1000 \
    --to_quant_block_names 'model.language_model.layers' \
    --output_dir gemma-4-31B-it-INT8-AutoRound \
    --scheme W8A16 \
    --dataset NeelNanda/pile-10k \
    --format "auto_round:auto_gptq"

I don't know much about the CLI, but I do have a Python script for quantizing the model (which didn't work for me, but it's worth a try):

import re
from collections import Counter
from typing import Optional

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

MODEL_PATH = "."                        # local path (or HF repo id) of the model to quantize
OUTPUT_DIR = "./INT8-AutoRound"
DATASET_NAME = "NeelNanda/pile-10k"     # calibration dataset

# Linear modules whose names contain these substrings stay in 16-bit (not quantized).
FP16_PATTERNS = ("lm_head", "vision_tower")
# Linear modules whose names contain these substrings are quantized to INT8.
INT8_PATTERNS = ("self_attn", "mlp")
# Keep the first/last N decoder layers of the INT8 group in 16-bit instead.
INT8_PROTECT_FIRST = 0
INT8_PROTECT_LAST = 0
# Optional INT4 group (empty here, so this config is effectively pure INT8).
INT4_PATTERNS = ()
INT4_PROTECT_FIRST = 0
INT4_PROTECT_LAST = 0

GROUP_SIZE = 128
SYM = True

MAX_SAMPLES = 512   # calibration samples
SEQ_LEN = 2048      # calibration sequence length


def get_layer_idx(module_name: str) -> Optional[int]:
    # Extract the decoder-layer index from names like "model.language_model.layers.12.mlp.gate_proj".
    match = re.search(r"\.layers\.(\d+)\.", module_name)
    return int(match.group(1)) if match else None


def compute_boundary_set(num_layers: int, n_first: int, n_last: int) -> set:
    # Return the indices of the first n_first and last n_last layers (the "protected" boundary layers).
    if n_first <= 0 and n_last <= 0:
        return set()
    head = set(range(min(max(n_first, 0), num_layers)))
    tail = set(range(max(0, num_layers - max(n_last, 0)), num_layers))
    return head | tail


def matches_any(name: str, patterns) -> bool:
    if not patterns:
        return False
    return any(p in name for p in patterns)


def build_layer_config(model: torch.nn.Module) -> dict:
    # Build the per-module {"bits": ...} dict passed to AutoRound via layer_config.
    indices = {get_layer_idx(name) for name, _ in model.named_modules()}
    indices.discard(None)
    num_layers = max(indices) + 1 if indices else 0
    print(f"language_model layer count: {num_layers}")

    int8_boundary = compute_boundary_set(num_layers, INT8_PROTECT_FIRST, INT8_PROTECT_LAST)
    int4_boundary = compute_boundary_set(num_layers, INT4_PROTECT_FIRST, INT4_PROTECT_LAST)
    print(f"INT8 boundary layers (-> FP16): {sorted(int8_boundary)}")
    print(f"INT4 boundary layers (-> INT8): {sorted(int4_boundary)}")

    layer_config = {}
    for name, module in model.named_modules():
        if not isinstance(module, torch.nn.Linear):
            continue

        if matches_any(name, FP16_PATTERNS):
            layer_config[name] = {"bits": 16}
            continue

        if matches_any(name, INT8_PATTERNS):
            idx = get_layer_idx(name)
            if idx is not None and idx in int8_boundary:
                layer_config[name] = {"bits": 16}
            else:
                layer_config[name] = {"bits": 8, "group_size": GROUP_SIZE, "sym": SYM}
            continue

        if matches_any(name, INT4_PATTERNS):
            idx = get_layer_idx(name)
            if idx is not None and idx in int4_boundary:
                layer_config[name] = {"bits": 8, "group_size": GROUP_SIZE, "sym": SYM}
            else:
                layer_config[name] = {"bits": 4, "group_size": GROUP_SIZE, "sym": SYM}
            continue

    return layer_config


def collect_calibration_samples(tokenizer) -> list:
    # Tokenize the calibration set and keep only samples that reach the full SEQ_LEN.
    dataset = load_dataset(DATASET_NAME, split="train")
    samples = []

    for item in dataset:
        tokenized = tokenizer(
            item["text"],
            truncation=True,
            max_length=SEQ_LEN,
            return_tensors="pt",
        )

        if tokenized["input_ids"].shape[-1] >= SEQ_LEN:
            samples.append(tokenized.data)

        if len(samples) >= MAX_SAMPLES:
            break

    return samples


def main():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

    # Load on the meta device just to enumerate the module names; no weights are materialized here.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        torch_dtype="auto",
        device_map="meta",
        trust_remote_code=True,
    )

    layer_config = build_layer_config(model)

    bits_counter = Counter(cfg["bits"] for cfg in layer_config.values())
    print(f"Layer count by bits (in layer_config): {dict(bits_counter)}")

    del model
    torch.cuda.empty_cache()

    tokens_list = collect_calibration_samples(tokenizer)

    print(f"len(tokens_list) = {len(tokens_list)}")
    print(f"first input_ids shape  = {tokens_list[0]['input_ids'].shape}")
    print(f"last  input_ids shape  = {tokens_list[-1]['input_ids'].shape}")
    print(f"first dtype = {tokens_list[0]['input_ids'].dtype}")


    # AutoRound reloads the model from MODEL_PATH itself and runs block-wise calibration/tuning.
    ar = AutoRound(
        model=MODEL_PATH,
        tokenizer=tokenizer,
        scheme="W8A16",
        enable_torch_compile=True,
        group_size=GROUP_SIZE,
        sym=SYM,
        layer_config=layer_config,
        dataset=tokens_list,
        device_map="auto",
        batch_size=8,
        seqlen=SEQ_LEN,
        iters=1000,
        nsamples=MAX_SAMPLES,
        low_gpu_mem_usage=True,
    )
    ar.quantize_and_save(OUTPUT_DIR, format="auto_round")


if __name__ == "__main__":
    main()

The script above didn't work for me, as auto-round got stuck at the cache-block-inputs phase (I only have 2x 3090s, which isn't enough to load the full FP16 model). My Qwen quantizations didn't have this problem, since they didn't need much VRAM for that step, but it seems the auto-round implementation for Gemma 4 is different. Gemma 4 is about 58GiB, so with your 3x 3090 I think you can go with this script. If you still get OOM, check the auto-round documentation and change the args in the AutoRound call.

You need to install auto-round-nightly.
Also, auto-round doesn't seem to handle multi-GPU processing very well (it gets an OOM error on GPU 0 even when GPUs 1 and 2 have plenty of free VRAM) and requires a lot of bandwidth between GPUs. So if the process is running really slow, you might want to change device_map="auto" to device_map="0" in the AutoRound function.
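If it helps, this is roughly how I'd tighten the AutoRound call from the script for a memory-constrained run; it's just a sketch, and the numbers are guesses I haven't verified on Gemma 4:

    # Lower-memory variant of the AutoRound call from the script above (untested guesses).
    ar = AutoRound(
        model=MODEL_PATH,
        tokenizer=tokenizer,
        scheme="W8A16",
        enable_torch_compile=False,   # worth trying with compile off if memory is tight
        group_size=GROUP_SIZE,
        sym=SYM,
        layer_config=layer_config,
        dataset=tokens_list,
        device_map="0",               # single GPU, avoids the cross-GPU traffic mentioned above
        batch_size=1,                 # smallest calibration batches
        seqlen=SEQ_LEN,
        iters=200,                    # fewer tuning iterations per block
        nsamples=128,
        low_gpu_mem_usage=True,
    )
    ar.quantize_and_save(OUTPUT_DIR, format="auto_round")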

I was going by what Intel wrote for their INT4 quant of Gemma 4 at https://huggingface.co/Intel/gemma-4-31B-it-int4-AutoRound#generate-the-model, and by their https://huggingface.co/Intel/gemma-4-31B-it-int4-AutoRound/blob/main/quantization_config.json.

Really IDK... with 3x 3090 + 256GB of RAM it used up every bit of it, plus another 42GB of swap, and then crashed when it fully ran out of everything. It did appear to be using all 3 GPUs, as they were all at 97% VRAM capacity and compute was going up and down (in nvtop).

The script above is for your mixed model, not pure INT8, though?

I reused the script from the mixed quant, but I changed the config so that it's essentially the same as pure INT8. I still don't know why it uses that huge amount of RAM, though. I have a 2x 3090 + 64GB RAM setup, and quantizing the Qwen3.6 models used very little VRAM; IIRC it used only about 16GB of VRAM and ~30GB of RAM in total. My best guess is that auto-round still doesn't have an optimized implementation for Gemma 4.
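For reference, the mixed version only differed in the pattern constants at the top of the script; it looked roughly like this (illustrative values, not my exact recipe):

# Hypothetical mixed config for the same script: attention stays INT8, MLP goes INT4,
# and the first/last couple of decoder layers keep their MLP in INT8.
FP16_PATTERNS = ("lm_head", "vision_tower")
INT8_PATTERNS = ("self_attn",)
INT8_PROTECT_FIRST = 0
INT8_PROTECT_LAST = 0
INT4_PATTERNS = ("mlp",)
INT4_PROTECT_FIRST = 2
INT4_PROTECT_LAST = 2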

Trying your script now, since the auto-round command was giving me OOMs...

No idea what the difference is; technically there shouldn't be one. Yet your script is running where the plain CLI command failed. I even disabled low_gpu_mem_usage, and I changed the format to "auto_round:auto_gptq".

(screenshots: top, nvtop, and the quantization progress)

Uh oh, it crashed!

quantized 7/7 layers in the block, loss iter 0: 0.000013 -> iter 954: 0.000009
2026-05-06 18:48:07 INFO device.py L1802: 'peak_ram': 152.55GB, 'peak_vram': {'0': 21.62GB, '1': 23.08GB, '2': 21.72GB}
Quantizing model.language_model.layers.1:   2%|β–ˆβ–Ž                                                                             | 1/60 [19:58<19:38:13, 1198.20s/it]Traceback (most recent call last):
  File "/home/user/models/quant-gemma-mixed.py", line 156, in 
    main()
  File "/home/user/models/quant-gemma-mixed.py", line 152, in main
    ar.quantize_and_save(OUTPUT_DIR, format="auto_round:auto_gptq")
  File "/home/user/Envs/llm/lib/python3.12/site-packages/auto_round/compressors_new/base.py", line 1322, in quantize_and_save
    self.quantize()
  File "/home/user/Envs/llm/lib/python3.12/site-packages/auto_round/compressors_new/calib.py", line 1154, in quantize
    self._quantize_blocks(
  File "/home/user/Envs/llm/lib/python3.12/site-packages/auto_round/compressors_new/calib.py", line 977, in _quantize_blocks
    reference_output = self.quantizer._get_block_outputs(m, input_ids, input_others, bs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Envs/llm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Envs/llm/lib/python3.12/site-packages/auto_round/algorithms/quantization/base.py", line 341, in _get_block_outputs
    tmp_output = _bf(
                 ^^^^
  File "/home/user/Envs/llm/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1024, in compile_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/user/Envs/llm/lib/python3.12/site-packages/auto_round/compressors_new/utils.py", line 167, in block_forward
    output = block(input_ids, *input_tuple, **input_others)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Envs/llm/lib/python3.12/site-packages/transformers/modeling_layers.py", line 93, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Envs/llm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/Envs/llm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Gemma4TextDecoderLayer.forward() got multiple values for argument 'hidden_states'
Quantizing model.language_model.layers.1:   2%|β–ˆβ–Ž                                                                             | 1/60 [19:59<19:39:31, 1199.52s/it]

And why is it so slow?!

According to Intel it should do a 70B model in about 2 hours: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#quantization-costs

The documentation benchmark used a single A100 GPU, and you're using RTX 3090s. Also, your configuration (iters=1000, nsamples=512) matches the auto-round-best preset, and I'm pretty sure the benchmark didn't use that. That's why it's so slow. I don't know about the crash, but based on the log it's probably an auto-round issue.
