I got a model loading error:
.venv-polarquant/lib/python3.12/site-packages/transformers/modeling_utils.py", line 687, in _get_resolved_checkpoint_files
raise OSError(
OSError: caiovicentino1/Gemma-4-31B-it-PolarQuant-Q5-Vision does not appear to have a file named pytorch_model.bin or model.safetensors.
I ran it with the following command:
polarquant demo caiovicentino1/Gemma-4-31B-it-PolarQuant-Q5-Vision --share
Hi @attashe ! Thanks for trying PolarQuant!
The model weights are stored as model_vision_int4.pt (torchao INT4 format), not as standard model.safetensors. This is because PolarQuant uses a two-step pipeline: PQ5 dequant → torchao INT4.
How to load this model:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, Int4WeightOnlyConfig
# 1. Load model structure on CUDA
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-4-31B-it",
dtype=torch.bfloat16,
device_map={"":"cuda:0"},
attn_implementation="sdpa",
trust_remote_code=True
)
# 2. Apply INT4 weight-only quantization (converts the Linear weights to the torchao INT4 layout the checkpoint expects)
quantize_(model, Int4WeightOnlyConfig(group_size=128))
# 3. Load PolarQuant INT4 weights
from huggingface_hub import hf_hub_download
path = hf_hub_download("caiovicentino1/Gemma-4-31B-it-PolarQuant-Q5-Vision", "model_vision_int4.pt")
sd = torch.load(path, map_location="cuda:0", weights_only=False)  # trusted checkpoint; contains torchao tensor subclasses
model.load_state_dict(sd, strict=False, assign=True)  # assign=True keeps the INT4 tensor subclasses intact
# 4. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Gemma-4-31B-it-PolarQuant-Q5-Vision")
# Generate
inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
Requirements:
- GPU with ~22 GB VRAM (RTX 4090, A100, etc.)
- pip install torchao transformers accelerate
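As a rough sanity check on the ~22 GB figure (my own back-of-envelope arithmetic, not an official number): 31B parameters at 4 bits each account for most of it, with the rest going to non-quantized layers, activations, and the KV cache.

```python
# 4-bit weights = 0.5 bytes per parameter.
params = 31e9
int4_weights_gb = params * 0.5 / 1e9
print(f"INT4 weights: ~{int4_weights_gb:.1f} GB")                      # ~15.5 GB
print(f"headroom on a 22 GB budget: ~{22 - int4_weights_gb:.1f} GB")   # ~6.5 GB
```

The remaining ~6.5 GB is consumed by bf16 embeddings/norms, group-size-128 scales, activations, and the KV cache during generation, which is why a 24 GB card like the RTX 4090 is a comfortable fit.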
I know this loading process is not as simple as standard from_pretrained. We have an open issue to add native PolarQuant support to transformers: https://github.com/huggingface/transformers/issues/45203
Let me know if you run into any issues!
@attashe, update: we've migrated most of our models to the CompressedTensors format, which loads natively in vLLM (no plugin needed).
This Vision model still uses the older model_vision_int4.pt format; converting it is on our list. In the meantime, the loading code in my previous reply should work.
For our other models that ARE converted (Qwopus3.5-9B, Qwen3.5-9B/27B, etc.), loading is now just:
vllm serve caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5 --language-model-only --enforce-eager
We'll update this Vision model to the same format soon.
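For reference, vllm serve exposes an OpenAI-compatible API (by default on port 8000), so once the server one-liner above is running you can query it with a standard chat-completions request. A minimal sketch that just builds the payload (POST it to http://localhost:8000/v1/chat/completions, or use any OpenAI client pointed at that base URL):

```python
import json

# Minimal OpenAI-style chat-completions request body for the served model.
payload = {
    "model": "caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100,
}
print(json.dumps(payload, indent=2))
```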
Thank you! I'll try it soon
P.S. Qwopus3.5-9B-v3-PolarQuant-Q5 is working on my machine; I'll wait for this one.
@attashe, great to hear the 9B is working on your setup!
I'll prioritize converting this Gemma 4 31B Vision model to a vLLM-native format so you can skip the manual load_state_dict dance. The vllm serve one-liner should work once the conversion lands. I'll post here when the updated repo is live.