Requested 122B-A10B-FP8-abliterated please and thanks
will do it later this week, thx for asking!
Hi Steve, thanks for the GGUF – really appreciate the work on abliterix.
I'm also looking for an FP8 safetensors version specifically for vLLM serving (need parallel tool calling + continuous batching for production).
I've spent a few weeks trying to quantize both heretic and abliterix to FP8 on 2×H100 80GB. Here's what I found:
- vLLM's on-the-fly `--quantization fp8` works perfectly with heretic (good quality, full inference), but not with abliterix
- llm-compressor offline FP8 produces a checkpoint, but it gives garbled output: the MoE gates, GatedDeltaNet attention, and shared experts have to stay in BF16 (an 8-pattern ignore list is required). There's also a transformers version conflict: llm-compressor pins ≤4.57, but qwen3_5_moe needs ≥5.2
- `save_sharded_state` can't persist the on-the-fly FP8 to disk: vLLM doesn't save the FP8 attention scales (`q_scale`, `k_scale`, `v_scale`)
- Qwen has no official self-quantization guide; they only provide a pre-quantized base (censored) model
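For reference, the on-the-fly path that works with heretic is just a standard `vllm serve` invocation; the model ID and tool-call parser below are placeholders, not the exact values from my runs:

```shell
# On-the-fly FP8: weights are quantized at load time, nothing is written to disk.
# Model ID and parser name are illustrative -- substitute your actual checkpoint.
vllm serve some-org/heretic-checkpoint \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```

This is the setup that gives the parallel tool calling + continuous batching combination I need in production, which is why losing it on abliterix hurts.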
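To illustrate the ignore-list point: llm-compressor takes an `ignore` list where plain strings match module names and `re:`-prefixed entries are regexes. The patterns below are hypothetical stand-ins (the real 8-pattern list for this model differs, and the matching semantics are approximated here); they just show the idea of carving MoE gates, GatedDeltaNet attention, and shared experts out of FP8:

```python
import re

# Hypothetical ignore patterns in llm-compressor's convention:
# "re:" prefix = regex, plain string = exact module name / name suffix.
# NOT the actual 8-pattern list -- illustrative only.
IGNORE = [
    "lm_head",
    "re:.*mlp\\.gate$",      # MoE router gates stay BF16
    "re:.*shared_expert.*",  # shared experts stay BF16
    "re:.*linear_attn.*",    # GatedDeltaNet attention stays BF16
]

def is_ignored(module_name: str) -> bool:
    """Return True if a module would be excluded from FP8 quantization."""
    for pat in IGNORE:
        if pat.startswith("re:"):
            if re.match(pat[3:], module_name):
                return True
        elif module_name == pat or module_name.endswith("." + pat):
            return True
    return False

print(is_ignored("model.layers.0.mlp.gate"))               # True: router kept in BF16
print(is_ignored("model.layers.0.mlp.experts.3.up_proj"))  # False: expert weight gets FP8
```

Without entries like these, the router/attention weights get quantized too, which is exactly where the garbled output comes from.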
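The `save_sharded_state` failure is easy to detect after the fact: just scan the saved tensor names for the attention scales. A minimal sketch (the dummy key list is illustrative of what a sharded save actually contains):

```python
# vLLM's FP8 attention scales live in tensors named *.q_scale / *.k_scale / *.v_scale.
# If a saved checkpoint has none of them, it can't be reloaded as full FP8.
SCALE_SUFFIXES = (".q_scale", ".k_scale", ".v_scale")

def has_attn_scales(keys):
    """True if any FP8 attention scale tensors are present in the key list."""
    return any(k.endswith(SCALE_SUFFIXES) for k in keys)

# Illustrative keys, mimicking what save_sharded_state emitted in my runs:
# weight scales survive, attention scales do not.
saved_keys = [
    "model.layers.0.self_attn.qkv_proj.weight",
    "model.layers.0.self_attn.qkv_proj.weight_scale",
]
print(has_attn_scales(saved_keys))  # False: the q/k/v scales were dropped
```

Running this over the real sharded output is how I confirmed the scales were silently missing rather than misnamed.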
I have working scripts, patches, and a full write-up of everything I tried. Would you be open to connecting to discuss this? I'd love to collaborate on getting a proper FP8 abliterix checkpoint out there.
Happy to share everything I have β just let me know the best way to reach you.
BTW, I got it working myself, so there's no need anymore.