Excellent work (2.57bpw-tuned) ... and a small kind request

#6
by dehnhaide - opened

Hi Mamy! First of all, thank you so very much for all the effort and competence you've put into exllamav3 model support, quantization, tuning, and not least the enjoyable and open documentation you provide at almost every step! Such an inspiration!
I have just deployed the 2.57bpw-tuned version (on 1x4090 + 4x3090) and I am so relieved that the t/s makes using the model bearable, as opposed to llama.cpp (at least on my setup).

The kind request (please don't be offended!) was related to ... any plans of releasing similarly quantized and tuned versions for MiniMax M2.1? It seems to be quite a good upgrade vs v2.0 and I would love to add it to my collection of EXL3 models.
If there's any way to buy you a "coffee" to incentivize your hard work, do let me/us know! :)

dehnhaide changed discussion title from Excellent work & small kind request to Excellent work (2.57bpw-tuned) ... and a small kind request

Well, following this comment by @remichu https://huggingface.co/mratsim/GLM-4.7-EXL3/discussions/1#694e490e4e963c170ba5ca23, I started looking into MiniMax M2.1.

One thing I'm unsure about, though, is whether tool calls for it work in TabbyAPI.
A coding model without tool calls would be unusable for me.

I tried to make tool calls work for GLM-4.7, but I hit a wall that I couldn't get past without diving deep into TabbyAPI.

And I don't want to hit the same issue with MiniMax M2.1 in EXL3, so I started quantizing it in llmcompressor instead, but it's such a chore and a game of whack-a-mole:

Ouch, mama... that terrible barf of errors, that silence in the thread... not funny! πŸ‘Ή
But, pardon my ignorance... is it something organically not there, not yet there with tabbyAPI for tool calling... or is it more about the model's initial design & quantization?

not yet there with tabbyAPI for tool calling...

It's that tool calling needs a specific config: a chat template that you put there: https://github.com/theroyallab/tabbyAPI/tree/main/templates/tool_calls

while vLLM and SGLang do the heavy lifting for you ... and are actually tested and benchmarked before the model is even released.
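Concretely, that means pointing TabbyAPI's model config at one of those templates. A minimal sketch, assuming the field names from TabbyAPI's sample config.yml; the model name and template filename below are placeholders, not real files:

```yaml
# config.yml fragment -- field names follow TabbyAPI's sample config;
# the model name and template filename are illustrative placeholders.
model:
  model_name: GLM-4.7-EXL3
  # Point at one of the chat templates shipped in templates/tool_calls
  # so the server can format and parse tool-call messages:
  prompt_template: tool_calls/your_model_template
```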

Well, following this comment by @remichu https://huggingface.co/mratsim/GLM-4.7-EXL3/discussions/1#694e490e4e963c170ba5ca23, I started looking into MiniMax M2.1.

One thing I'm unsure about, though, is whether tool calls for it work in TabbyAPI.
A coding model without tool calls would be unusable for me.

I tried to make tool calls work for GLM-4.7, but I hit a wall that I couldn't get past without diving deep into TabbyAPI.

And I don't want to hit the same issue with MiniMax M2.1 in EXL3, so I started quantizing it in llmcompressor instead, but it's such a chore and a game of whack-a-mole:

The tool calling works perfectly, but I don't know if TabbyAPI implements the parser for it or not.
I just uploaded a 4.0bpw quant of it and I think it is solidly good from initial testing. If you don't have time to quant it yourself, you can just download my 4.0bpw and give it a try.
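One quick way to check whether an endpoint actually parses tool calls is to POST an OpenAI-style chat request carrying a `tools` array and look for structured `tool_calls` in the response. A minimal sketch of such a payload (the model name and the `get_weather` tool are hypothetical examples, not values from this thread):

```python
# Build an OpenAI-compatible chat request with one tool definition.
# The model name and the get_weather tool are illustrative placeholders.
payload = {
    "model": "MiniMax-M2.1-exl3",
    "messages": [
        {"role": "user", "content": "What's the weather in Berlin?"},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        },
    ],
}

# POST this to the server's /v1/chat/completions route (e.g. with
# requests.post). If tool-call parsing works, the first choice in the
# response carries a structured `tool_calls` list instead of the raw
# tool-call tokens leaking into the plain text content.
```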

I just uploaded a 4.0bpw quant of it and I think it is solidly good from initial testing. If you don't have time to quant it yourself, you can just download my 4.0bpw and give it a try.
Maybe I'm just blind and haven't spotted it... care to share where to look / a link? Thanks!

I just uploaded a 4.0bpw quant of it and I think it is solidly good from initial testing. If you don't have time to quant it yourself, you can just download my 4.0bpw and give it a try.
Maybe I'm just blind and haven't spotted it... care to share where to look / a link? Thanks!
Here you go! From my testing the performance is quite good, and it is easier to run for VRAM-poor folks compared to GLM: https://huggingface.co/remichu/MiniMax-M2.1-exl3

Here you go! From my testing the performance is quite good, and it is easier to run for VRAM-poor folks compared to GLM: https://huggingface.co/remichu/MiniMax-M2.1-exl3
Many thanks remichu, but I've run into a problem... while trying to load the model in TabbyAPI, I get:
"NotImplementedError: Tensor-parallel is not currently implemented for MiniMaxM2ForCausalLM"

I do indeed use "tensor_parallel: true" on my setup (1x4090 + 4x3090). Is there something I'm missing? Thanks for the help!
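For what it's worth, when tensor parallelism isn't implemented for an architecture, the usual fallback is plain layer splitting across the GPUs. A hedged sketch of the relevant fields, assuming TabbyAPI's sample config.yml names; the split values are placeholders, not tuned numbers:

```yaml
# config.yml fragment: disable tensor parallelism and split layers
# across the five devices instead (values are illustrative placeholders).
model:
  tensor_parallel: false
  gpu_split_auto: false
  # Approximate VRAM budget in GB per GPU, one entry per device:
  gpu_split: [23, 23, 23, 23, 23]
```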

Owner

@dehnhaide

The kind request (please don't be offended!) was related to ... any plans of releasing similarly quantized and tuned versions for MiniMax M2.1? It seems to be quite a good upgrade vs v2.0 and I would love to add it to my collection of EXL3 models.

I just released https://huggingface.co/mratsim/MiniMax-M2.1-FP8-INT4-AWQ

Not EXL3 though, because I really needed to be 100% sure tool calling worked, in a timely manner.

Many thanks for the effort! I am one GPU shy of running it, but nonetheless it's great to know that it exists. Any observations or cross-"benchmark" remarks from your side so far?

With 5x24GB I cannot seem to load this with more than about 8k tokens of context. How do you achieve 100k+? I'm using a k4v4 cache and no GUI, with latest master exllamav3.
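For rough sizing: KV-cache memory grows linearly with context length, and a quantized k4v4 cache takes 4x less space than fp16. A back-of-the-envelope sketch (the layer/head/dim numbers are illustrative placeholders, not MiniMax M2.1's actual architecture; check the model's config.json for the real values):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, bits_per_elem: int) -> float:
    """Approximate KV-cache size: one K and one V tensor per layer,
    each of context x kv_heads x head_dim elements."""
    return 2 * layers * kv_heads * head_dim * context * bits_per_elem / 8

# Illustrative placeholder dimensions (NOT MiniMax M2.1's real config):
layers, kv_heads, head_dim = 60, 8, 128

fp16 = kv_cache_bytes(layers, kv_heads, head_dim, 100_000, 16)
k4v4 = kv_cache_bytes(layers, kv_heads, head_dim, 100_000, 4)
print(f"100k ctx: fp16 ~{fp16 / 2**30:.1f} GiB, k4v4 ~{k4v4 / 2**30:.1f} GiB")
```

If the quantized cache at 100k context comes out at only a few GiB with the model's real dimensions, an 8k ceiling more likely points at the weights and activation buffers filling the VRAM than at the cache itself.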
