Is this AWQ type quant compatible with transformers/vLLM

#1
by jc2375 - opened

What backend/runtime can we use to test it?

Hi @jc2375! Thanks for your interest.

EOQ v3 is not standard AWQ: it's a custom format (PolarQuant with AWQ-style pre-scaling), so it isn't directly compatible with
transformers' AWQ loader or with vLLM.

How to use it today:

  1. Dequant to FP16 (easiest): Load with our EOQ codebase and dequantize to FP16, then use the model normally with transformers. Takes
    about 5 seconds on GPU.

pip install git+https://github.com/caiovicentino/eoq-quantization

from core.weight_loader import load_eoq_model

# Dequantizes the EOQ weights to FP16 during load (~5 s on GPU)
model = load_eoq_model("caiovicentino1/Qwen3.5-35B-A3B-EOQ-v3")

Now it's a standard FP16 model; it works with transformers normally.

  2. PolarEngine (native quantized inference, WIP): Keeps weights quantized in VRAM (~12 GB instead of 18 GB) using a custom
    Triton kernel. Currently at 34 tok/s on a 9B model (74% of FP16 speed); still in active development.
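To make the "dequant to FP16" step in option 1 concrete, here's a minimal, self-contained sketch of a generic AWQ-style per-channel quantize/dequantize round trip in NumPy. This is only an illustration of the general idea (scale, store low-bit integers, multiply back); EOQ's actual PolarQuant format stores weights differently, and the array shapes here are toy values.

```python
import numpy as np

# Toy stand-in for one weight matrix (NOT EOQ's real format)
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)

# Per-output-channel scale mapping each row onto the int4 range [-8, 7]
scales = np.abs(w).max(axis=1, keepdims=True) / 7.0

# "Quantized" storage: one signed 4-bit value per weight (held in int8 here)
q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)

# Dequantization: multiply back by the scales to recover approximate FP16 weights
w_fp16 = (q.astype(np.float32) * scales).astype(np.float16)

# Reconstruction error is bounded by about half a quantization step per element
print(np.abs(w - w_fp16.astype(np.float32)).max())
```

The real EOQ loader does the analogous expansion for every layer at load time, which is why the result behaves like an ordinary FP16 checkpoint afterward.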

We're working on a proper integration with transformers/vLLM. For now, option 1 is the simplest path.
