I got a model loading error:
.venv-polarquant/lib/python3.12/site-packages/transformers/modeling_utils.py", line 687, in _get_resolved_checkpoint_files
raise OSError(
OSError: caiovicentino1/Gemma-4-31B-it-PolarQuant-Q5-Vision does not appear to have a file named pytorch_model.bin or model.safetensors.
I ran it with the following command:
polarquant demo caiovicentino1/Gemma-4-31B-it-PolarQuant-Q5-Vision --share
Hi @attashe ! Thanks for trying PolarQuant!
The model weights are stored as model_vision_int4.pt (torchao INT4 format), not as standard model.safetensors. This is because PolarQuant uses a two-step pipeline: PQ5 dequant → torchao INT4.
How to load this model:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, Int4WeightOnlyConfig
# 1. Load model structure on CUDA
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-4-31B-it",
dtype=torch.bfloat16,
device_map={"":"cuda:0"},
attn_implementation="sdpa",
trust_remote_code=True
)
# 2. Apply INT4 weight-only quantization (converts the Linear weights to the torchao INT4 layout the checkpoint expects)
quantize_(model, Int4WeightOnlyConfig(group_size=128))
# 3. Load PolarQuant INT4 weights
from huggingface_hub import hf_hub_download
path = hf_hub_download("caiovicentino1/Gemma-4-31B-it-PolarQuant-Q5-Vision", "model_vision_int4.pt")
sd = torch.load(path, map_location="cuda:0", weights_only=False)  # trusted checkpoint; contains torchao tensor subclasses
model.load_state_dict(sd, strict=False, assign=True)  # assign=True keeps the INT4 tensor subclasses intact
# 4. Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Gemma-4-31B-it-PolarQuant-Q5-Vision")
# Generate
inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
Requirements:
- GPU with ~22 GB VRAM (RTX 4090, A100, etc.)
- pip install torchao transformers accelerate
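As a rough sanity check on the ~22 GB figure (my own back-of-envelope arithmetic, not an official number): 31B parameters at 4 bits each account for most of it, with the rest going to non-quantized layers, activations, and the KV cache.

```python
# 4-bit weights = 0.5 bytes per parameter.
params = 31e9
int4_weights_gb = params * 0.5 / 1e9
print(f"INT4 weights: ~{int4_weights_gb:.1f} GB")                      # ~15.5 GB
print(f"headroom on a 22 GB budget: ~{22 - int4_weights_gb:.1f} GB")   # ~6.5 GB
```

The remaining ~6.5 GB is consumed by bf16 embeddings/norms, group-size-128 scales, activations, and the KV cache during generation, which is why a 24 GB card like the RTX 4090 is a comfortable fit.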
I know this loading process is not as simple as standard from_pretrained. We have an open issue to add native PolarQuant support to transformers: https://github.com/huggingface/transformers/issues/45203
Let me know if you run into any issues!
@attashe, update: we've migrated most of our models to the CompressedTensors format, which loads natively in vLLM (no plugin needed).
This Vision model still uses the older model_vision_int4.pt format; converting it is on our list. In the meantime, the loading code in my previous reply should work.
For our other models that ARE converted (Qwopus3.5-9B, Qwen3.5-9B/27B, etc.), loading is now just:
vllm serve caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5 --language-model-only --enforce-eager
We'll update this Vision model to the same format soon.
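For reference, vllm serve exposes an OpenAI-compatible API (by default on port 8000), so once the server one-liner above is running you can query it with a standard chat-completions request. A minimal sketch that just builds the payload (POST it to http://localhost:8000/v1/chat/completions, or use any OpenAI client pointed at that base URL):

```python
import json

# Minimal OpenAI-style chat-completions request body for the served model.
payload = {
    "model": "caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100,
}
print(json.dumps(payload, indent=2))
```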
Thank you! I'll try it soon
P.S. Qwopus3.5-9B-v3-PolarQuant-Q5 is working on my machine; I'll wait for this one.
@attashe, great to hear the 9B is working on your setup!
I'll prioritize converting this Gemma 4 31B Vision model to a vLLM-native format so you can skip the manual load_state_dict dance. The vllm serve one-liner should work once the conversion lands. I'll post here when the updated repo is live.