Requested 122B-A10B-FP8-abliterated please and thanks

#3
by easonchow0419 - opened


will do it later this week, thx for asking!

Hi Steve, thanks for the GGUF β€” really appreciate the work on abliterix.

I'm also looking for an FP8 safetensors version specifically for vLLM serving (need parallel tool calling + continuous batching for production).

I've spent a few weeks trying to quantize both heretic and abliterix to FP8 on 2Γ—H100 80GB. Here's what I found:

- vLLM's on-the-fly `--quantization fp8` works perfectly with heretic (good quality, full inference) but not with abliterix.
- llm-compressor offline FP8 produces a checkpoint, but it gives garbled output — the MoE gates, GatedDeltaNet attention, and shared experts need to stay in BF16 (an 8-pattern ignore list is required). There's also a transformers version conflict: llm-compressor pins ≤4.57, but qwen3_5_moe needs ≥5.2.
- `save_sharded_state` can't persist the on-the-fly FP8 to disk — vLLM doesn't save the FP8 attention scales (`q_scale`, `k_scale`, `v_scale`).
- Qwen has no official self-quantization guide — they only provide a pre-quantized base (censored) model.
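To make the BF16 exclusion concrete: the idea is that the quantizer skips any module whose name matches an ignore pattern, the way llm-compressor's `ignore` list works. A minimal glob-style sketch — the module names and patterns below are hypothetical placeholders, not verified qwen3_5_moe internals or my actual 8-pattern list:

```python
from fnmatch import fnmatch

# Hypothetical glob patterns for modules to keep in BF16 rather than
# quantize to FP8 (placeholder names, not confirmed module paths).
IGNORE = [
    "lm_head",
    "*.mlp.gate",         # MoE router gates
    "*.shared_expert.*",  # shared experts
    "*.linear_attn.*",    # GatedDeltaNet attention
]

def keep_bf16(name: str) -> bool:
    """True if FP8 quantization should skip this module."""
    return any(fnmatch(name, pat) for pat in IGNORE)

print(keep_bf16("model.layers.0.mlp.gate"))               # True: router stays BF16
print(keep_bf16("model.layers.0.mlp.experts.3.up_proj"))  # False: expert weights quantize
```

In llm-compressor the equivalent patterns go into the `ignore` field of the quantization recipe (it also accepts `re:`-prefixed regexes).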
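On the missing attention scales: FP8 E4M3 has a maximum finite value of 448, so per-tensor scales like `k_scale` are typically derived as amax / 448 during calibration — which is exactly the state that gets lost when `save_sharded_state` doesn't serialize them. A toy sketch of that derivation (not vLLM's actual implementation):

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_scale(values) -> float:
    """Per-tensor dequant scale: x ≈ q * scale, with q clamped to E4M3 range."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX

# Hypothetical calibration data for a K projection output:
k_scale = fp8_scale([-0.5, 2.24, -1.12])
print(k_scale)  # 2.24 / 448 = 0.005
```

Without `q_scale`/`k_scale`/`v_scale` in the checkpoint, a reloaded model has no way to dequantize the attention tensors correctly.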
I have working scripts, patches, and a full write-up of everything I tried. Would you be open to connecting to discuss this? I'd love to collaborate on getting a proper FP8 abliterix checkpoint out there.

Happy to share everything I have β€” just let me know the best way to reach you.

BTW, I got it working myself, so there's no need anymore.
