---
license: apache-2.0
base_model: MiniMaxAI/MiniMax-M2.7
base_model_relation: quantized
tags:
- auto-round
- int4
- w4a16
- quantization
- moe
library_name: transformers
---

# MiniMax-M2.7 INT4 AutoRound

4-bit quantized version of [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7) using [Intel AutoRound](https://github.com/intel/auto-round).

## Quantization Config

| Setting | Value |
|---|---|
| Scheme | W4A16 (INT4 weights, FP16 activations) |
| Group size | 128 |
| Ignored layers | MoE `gate` layers (kept at full precision) |
| Method | RTN (iters=0) |

## Usage

### vLLM

```bash
vllm serve Lasimeri/MiniMax-M2.7-int4-AutoRound \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
```

### SGLang

```bash
python -m sglang.launch_server \
  --model-path Lasimeri/MiniMax-M2.7-int4-AutoRound \
  --trust-remote-code \
  --tp 8 \
  --reasoning-parser minimax-append-think \
  --tool-call-parser minimax-m2
```

## Quantization Hardware

Quantized on a single-node rig:

| Component | Spec |
|---|---|
| CPU | AMD EPYC 7742 (64C / 128T) |
| RAM | 251 GB DDR4 |
| GPUs | 8× RTX 3080 (20 GB modded) |

Peak resource usage during quantization: ~25.6 GB RAM, ~5 GB VRAM on GPU 0, and ~1.3 GB on each remaining GPU.
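## How W4A16 RTN Works

For intuition, the W4A16 RTN scheme used here (round-to-nearest, one scale per group of 128 weights) can be sketched in a few lines of numpy. This is a toy illustration, not AutoRound's actual implementation, which additionally handles zero-points, packed INT4 storage, and per-layer layouts:

```python
import numpy as np

def rtn_quantize(weights, group_size=128, bits=4):
    """Toy round-to-nearest (RTN) group quantization of a flat weight vector.

    Illustrative only -- real W4A16 kernels also track zero-points and
    pack two INT4 values per byte.
    """
    qmax = 2 ** (bits - 1) - 1                # 7 for signed INT4
    w = weights.reshape(-1, group_size)       # one scale per group of 128
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                 # guard all-zero groups
    q = np.clip(np.round(w / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    # At inference, INT4 weights are rescaled back to FP16 on the fly;
    # activations stay in FP16 throughout (the "A16" half of W4A16).
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scales = rtn_quantize(w)
err = float(np.abs(dequantize(q, scales) - w).max())
```

With `iters=0`, AutoRound skips its gradient-based rounding refinement entirely, so the result is plain RTN as above: fast to produce, at the cost of slightly higher quantization error than tuned rounding.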