Is this AWQ type quant compatible with transformers/vLLM
What backend/runtime can we use to test it?
Hi @jc2375 ! Thanks for the interest.
EOQ v3 is not standard AWQ; it's a custom format (PolarQuant + AWQ pre-scaling), so it's not directly compatible with transformers' AWQ loader or vLLM.
How to use it today:
- Dequant to FP16 (easiest): load with our EOQ codebase, dequantize to FP16, then use the model normally with transformers. Takes ~5 s on GPU.

  ```bash
  pip install git+https://github.com/caiovicentino/eoq-quantization
  ```

  ```python
  from core.weight_loader import load_eoq_model

  model = load_eoq_model("caiovicentino1/Qwen3.5-35B-A3B-EOQ-v3")
  # Now it's a standard FP16 model; works with transformers normally
  ```
- PolarEngine (native quantized inference, WIP): Keeps weights quantized in VRAM (~12 GB instead of 18 GB) with a custom
Triton kernel. Currently at 34 tok/s on a 9B model (74% of FP16 speed). Still in active development.
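For intuition on what the dequant step in option 1 does, here is a generic sketch of AWQ-style 4-bit group dequantization. This is illustrative only: EOQ v3's PolarQuant encoding uses its own math, and the function and shapes below are assumptions, not our actual implementation.

```python
import numpy as np

def dequantize(q, scales, zeros, group_size=128):
    """Affine dequant of 4-bit codes back to FP16.

    q:      int4 codes stored as uint8, shape (out, in)
    scales: per-group scales, shape (out, in // group_size)
    zeros:  per-group zero points, same shape as scales
    """
    # Broadcast each group's scale/zero across its group_size columns
    s = np.repeat(scales, group_size, axis=1)
    z = np.repeat(zeros, group_size, axis=1)
    return ((q.astype(np.float32) - z) * s).astype(np.float16)

# Round-trip check on random weights: quantize symmetrically to 4 bits
# (zero point 8, 15 levels), then dequantize and measure the error.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
g = 128
scales = (np.abs(w).reshape(4, -1, g).max(axis=2) / 7).reshape(4, -1)
zeros = np.full_like(scales, 8.0)
q = np.clip(np.round(w / np.repeat(scales, g, axis=1)) + 8, 0, 15).astype(np.uint8)
w_hat = dequantize(q, scales, zeros, g)
print(np.abs(w - w_hat.astype(np.float32)).max())  # bounded by scale / 2 per group
```

The same idea explains the VRAM trade-off: the quantized codes are 4 bits per weight plus small per-group metadata, versus 16 bits per weight after dequant.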
We're working on a proper integration with transformers/vLLM. For now, option 1 is the simplest path.