Excellent work (2.57bpw-tuned) ... and a small kind request

#6
by dehnhaide - opened

Hi Mamy! First of all, thank you so very much for all the effort and competence you've put into exllamav3 model support, quantization, tuning, and not least the enjoyable and open documentation you provide at almost every step! Such an inspiration!
I have just deployed the 2.57bpw-tuned version (on 1x4090 + 4x3090) and I am so relieved that the t/s makes using the model bearable, as opposed to llama.cpp (at least on my setup).

The kind request (please don't be offended!) was related to ... any plans of releasing similarly quantized and tuned versions for MiniMax M2.1? It seems to be quite a good upgrade vs v2.0 and I would love to add it to my collection of EXL3 models.
If there's any way to buy you a "coffee" to incentivize your hard work, do let me/us know! :)

dehnhaide changed discussion title from Excellent work & small kind request to Excellent work (2.57bpw-tuned) ... and a small kind request

Well, following this comment by @remichu https://huggingface.co/mratsim/GLM-4.7-EXL3/discussions/1#694e490e4e963c170ba5ca23, I started looking into MiniMax M2.1.

One thing I'm unsure about, though, is whether tool calls for it work in TabbyAPI.
A coding model without tool calls would be unusable for me.

I tried to make tool calls work for GLM-4.7, but I hit a wall that I couldn't get past without diving deep into TabbyAPI.

And I don't want to hit the same issue with MiniMax M2.1 in EXL3, so I started quantizing it in llmcompressor instead, but it's such a chore and a game of whack-a-mole:

Ouch, mama... that terrible barf of errors, that silence in the thread... not funny! πŸ‘Ή
But, pardon my ignorance... is it something organically not there, not yet there with tabbyAPI for tool calling... or is it more about the model's initial design & quantization?

not yet there with tabbyAPI for tool calling...

It's that tool calling needs a specific config: a chat template that you put there: https://github.com/theroyallab/tabbyAPI/tree/main/templates/tool_calls

while vLLM and SGLang do the heavy lifting for you ... and are actually tested and benchmarked before the model is even released.
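Concretely, that means pointing TabbyAPI's model config at one of those templates. A minimal sketch, assuming the field names from TabbyAPI's sample config.yml; the model name and template filename below are placeholders, not real files:

```yaml
# config.yml fragment -- field names follow TabbyAPI's sample config;
# the model name and template filename are illustrative placeholders.
model:
  model_name: GLM-4.7-EXL3
  # Point at one of the chat templates shipped in templates/tool_calls
  # so the server can format and parse tool-call messages:
  prompt_template: tool_calls/your_model_template
```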

Well, following this comment by @remichu https://huggingface.co/mratsim/GLM-4.7-EXL3/discussions/1#694e490e4e963c170ba5ca23, I started looking into MiniMax M2.1.

One thing I'm unsure about, though, is whether tool calls for it work in TabbyAPI.
A coding model without tool calls would be unusable for me.

I tried to make tool calls work for GLM-4.7, but I hit a wall that I couldn't get past without diving deep into TabbyAPI.

And I don't want to hit the same issue with MiniMax M2.1 in EXL3, so I started quantizing it in llmcompressor instead, but it's such a chore and a game of whack-a-mole:

The tool calling works perfectly, but I don't know if TabbyAPI implements the parser for it or not.
I just uploaded a 4.0bpw quant of it and I think it is solidly good from initial testing. If you don't have time to quant it yourself, you can just download my 4.0bpw and give it a try.
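One quick way to check whether an endpoint actually parses tool calls is to POST an OpenAI-style chat request carrying a `tools` array and look for structured `tool_calls` in the response. A minimal sketch of such a payload (the model name and the `get_weather` tool are hypothetical examples, not values from this thread):

```python
# Build an OpenAI-compatible chat request with one tool definition.
# The model name and the get_weather tool are illustrative placeholders.
payload = {
    "model": "MiniMax-M2.1-exl3",
    "messages": [
        {"role": "user", "content": "What's the weather in Berlin?"},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        },
    ],
}

# POST this to the server's /v1/chat/completions route (e.g. with
# requests.post). If tool-call parsing works, the first choice in the
# response carries a structured `tool_calls` list instead of the raw
# tool-call tokens leaking into the plain text content.
```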

I just uploaded a 4.0bpw quant of it and I think it is solidly good from initial testing. If you don't have time to quant it yourself, you can just download my 4.0bpw and give it a try.
Maybe I'm just blind and haven't spotted it... care to share where to look / a link? Thanks!

I just uploaded a 4.0bpw quant of it and I think it is solidly good from initial testing. If you don't have time to quant it yourself, you can just download my 4.0bpw and give it a try.
Maybe I'm just blind and haven't spotted it... care to share where to look / a link? Thanks!
Here you go! From my testing the performance is quite good, and it is easier to run for VRAM-poor folks compared to GLM: https://huggingface.co/remichu/MiniMax-M2.1-exl3

Here you go! From my testing the performance is quite good, and it is easier to run for VRAM-poor folks compared to GLM: https://huggingface.co/remichu/MiniMax-M2.1-exl3
Many thanks remichu, but I've run into a problem... while trying to load the model in TabbyAPI, I get:
"NotImplementedError: Tensor-parallel is not currently implemented for MiniMaxM2ForCausalLM"

I do indeed use "tensor_parallel: true" on my setup (1x4090 + 4x3090). Is there something I'm missing? Thanks for the help!
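For what it's worth, when tensor parallelism isn't implemented for an architecture, the usual fallback is plain layer splitting across the GPUs. A hedged sketch of the relevant fields, assuming TabbyAPI's sample config.yml names; the split values are placeholders, not tuned numbers:

```yaml
# config.yml fragment: disable tensor parallelism and split layers
# across the five devices instead (values are illustrative placeholders).
model:
  tensor_parallel: false
  gpu_split_auto: false
  # Approximate VRAM budget in GB per GPU, one entry per device:
  gpu_split: [23, 23, 23, 23, 23]
```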

Owner

@dehnhaide

The kind request (please don't be offended!) was related to ... any plans of releasing similarly quantized and tuned versions for MiniMax M2.1? It seems to be quite a good upgrade vs v2.0 and I would love to add it to my collection of EXL3 models.

I just released https://huggingface.co/mratsim/MiniMax-M2.1-FP8-INT4-AWQ

Not EXL3 though, because I really needed to be 100% sure tool calling worked, in a timely manner.

Many thanks for the effort! I am one GPU shy of running it, but nonetheless it's great to know that it exists. Any observations or cross-"benchmark" remarks from your side so far?

With 5x24GB I cannot seem to load this with more than about 8k tokens of context. How do you achieve 100k+? I'm using a k4v4 cache and no GUI, with latest master exllamav3.
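For rough sizing: KV-cache memory grows linearly with context length, and a quantized k4v4 cache takes 4x less space than fp16. A back-of-the-envelope sketch (the layer/head/dim numbers are illustrative placeholders, not MiniMax M2.1's actual architecture; check the model's config.json for the real values):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, bits_per_elem: int) -> float:
    """Approximate KV-cache size: one K and one V tensor per layer,
    each of context x kv_heads x head_dim elements."""
    return 2 * layers * kv_heads * head_dim * context * bits_per_elem / 8

# Illustrative placeholder dimensions (NOT MiniMax M2.1's real config):
layers, kv_heads, head_dim = 60, 8, 128

fp16 = kv_cache_bytes(layers, kv_heads, head_dim, 100_000, 16)
k4v4 = kv_cache_bytes(layers, kv_heads, head_dim, 100_000, 4)
print(f"100k ctx: fp16 ~{fp16 / 2**30:.1f} GiB, k4v4 ~{k4v4 / 2**30:.1f} GiB")
```

If the quantized cache at 100k context comes out at only a few GiB with the model's real dimensions, an 8k ceiling more likely points at the weights and activation buffers filling the VRAM than at the cache itself.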
