Is this AWQ type quant compatible with transformers/vLLM

#1
by jc2375 - opened

What backend/runtime can we use to test it?

Hi @jc2375! Thanks for your interest.

EOQ v3 is not standard AWQ: it's a custom format (PolarQuant with AWQ-style pre-scaling), so it isn't directly compatible with
transformers' AWQ loader or with vLLM.

How to use it today:

  1. Dequant to FP16 (easiest): Load with our EOQ codebase and dequantize to FP16, then use the model normally with transformers. Takes
    about 5 seconds on GPU.

pip install git+https://github.com/caiovicentino/eoq-quantization

from core.weight_loader import load_eoq_model

# Dequantizes the EOQ weights to FP16 during load (~5 s on GPU)
model = load_eoq_model("caiovicentino1/Qwen3.5-35B-A3B-EOQ-v3")

Now it's a standard FP16 model; it works with transformers normally.

  2. PolarEngine (native quantized inference, WIP): Keeps weights quantized in VRAM (~12 GB instead of 18 GB) using a custom
    Triton kernel. Currently at 34 tok/s on a 9B model (74% of FP16 speed); still in active development.
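To make the "dequant to FP16" step in option 1 concrete, here's a minimal, self-contained sketch of a generic AWQ-style per-channel quantize/dequantize round trip in NumPy. This is only an illustration of the general idea (scale, store low-bit integers, multiply back); EOQ's actual PolarQuant format stores weights differently, and the array shapes here are toy values.

```python
import numpy as np

# Toy stand-in for one weight matrix (NOT EOQ's real format)
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)

# Per-output-channel scale mapping each row onto the int4 range [-8, 7]
scales = np.abs(w).max(axis=1, keepdims=True) / 7.0

# "Quantized" storage: one signed 4-bit value per weight (held in int8 here)
q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)

# Dequantization: multiply back by the scales to recover approximate FP16 weights
w_fp16 = (q.astype(np.float32) * scales).astype(np.float16)

# Reconstruction error is bounded by about half a quantization step per element
print(np.abs(w - w_fp16.astype(np.float32)).max())
```

The real EOQ loader does the analogous expansion for every layer at load time, which is why the result behaves like an ordinary FP16 checkpoint afterward.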

We're working on a proper integration with transformers/vLLM. For now, option 1 is the simplest path.
