good fit for M2 Max 96GB - it's 104B-A7B4
by ljupco
Thanks for your vids - love your YT channel! :-)
For any other local LLM fans: Ling-2.6-flash is 104B-A7B4, i.e. 104B total parameters with a 7.4B-active MoE, which seems like a great fit for people with 96GB-128GB of RAM. Being constrained to 96GB (old M2 Max), I'm running the MLX quants from
https://huggingface.co/mlx-community/Ling-2.6-flash-mlx-4bit-DWQ/
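Back-of-the-envelope on why it fits (my arithmetic, not from the model card): 104B parameters at ~4 bits each is roughly 104e9 × 0.5 bytes ≈ 52 GB of weights, which lines up with the ~55 GB baseline peak memory in the benchmarks below and still leaves headroom for a 64K KV cache on a 96GB machine.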
The mlx-lm patch that made the model run for me is
https://github.com/ml-explore/mlx-lm/pull/1227
The patch is a single file: mlx-lm/mlx_lm/models/bailing_hybrid.py
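In case it helps anyone reproduce, here's a minimal sketch of running the quant with mlx-lm (the PR-branch install line and the prompt are just illustrative; once the PR is merged a normal mlx-lm install should be enough):

```python
# Minimal sketch: load the 4-bit DWQ quant and generate with mlx-lm.
# Until PR #1227 is merged, install mlx-lm from the PR branch, e.g.:
#   pip install "git+https://github.com/ml-explore/mlx-lm.git@refs/pull/1227/head"

from mlx_lm import load, generate

# Weights land in unified memory (~52 GB for the 4-bit quant).
model, tokenizer = load("mlx-community/Ling-2.6-flash-mlx-4bit-DWQ")

messages = [{"role": "user", "content": "Summarize MoE routing in two sentences."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# verbose=True streams tokens and prints prompt/generation tok/s stats.
response = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```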
Getting TG > 30 tok/s with a 64K context is a great unlock for me for running local agents, including an auto-research/dev agent.
To load the model and run the benchmarks I used oMLX - LLM inference, optimized for your Mac:
https://github.com/jundot/omlx
Benchmark Model: Ling-2.6-flash-mlx-4bit-DWQ

Single Request Results

| Test | TTFT (ms) | TPOT (ms) | PP (tok/s) | TG (tok/s) | E2E (s) | Throughput (tok/s) | Peak Mem (GB) |
|---|---|---|---|---|---|---|---|
| pp1024/tg128 | 3513.2 | 27.31 | 291.5 | 36.9 | 6.982 | 165.0 | 55.33 |
| pp4096/tg128 | 16395.7 | 25.69 | 249.8 | 39.2 | 19.659 | 214.9 | 55.99 |
| pp8192/tg128 | 32562.9 | 27.07 | 251.6 | 37.2 | 36.001 | 231.1 | 56.57 |
| pp16384/tg128 | 66354.9 | 27.73 | 246.9 | 36.3 | 69.876 | 236.3 | 57.75 |
| pp32768/tg128 | 136722.8 | 28.98 | 239.7 | 34.8 | 140.403 | 234.3 | 60.10 |
| pp65536/tg128 | 296867.5 | 32.72 | 220.8 | 30.8 | 301.022 | 218.1 | 64.80 |
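For anyone decoding the columns (my reading, not from the oMLX docs): PP ≈ prompt tokens / TTFT (1024 / 3.5132 s ≈ 291.5 tok/s for the first row), TG ≈ 1000 / TPOT (1000 / 27.31 ≈ 36.6 tok/s, roughly the reported 36.9), and Throughput counts prompt plus generated tokens over the whole run: (1024 + 128) / 6.982 s ≈ 165.0 tok/s.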
Continuous Batching (pp1024/tg128)

| Batch | TG (tok/s) | Speedup | PP (tok/s) | PP per req (tok/s) | TTFT (ms) | E2E (s) |
|---|---|---|---|---|---|---|
| 1x | 36.9 | 1.00x | 291.5 | 291.5 | 3513.2 | 6.982 |
| 2x | 46.1 | 1.25x | 212.6 | 106.3 | 9536.4 | 15.186 |
| 4x | 51.4 | 1.39x | 213.1 | 53.3 | 18897.6 | 29.179 |
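If I read these right, the TG numbers are aggregate across requests: at 4x, 51.4 tok/s total works out to about 51.4 / 4 ≈ 12.9 tok/s per request, so batching buys 1.39x total throughput at the cost of per-request speed and TTFT.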
M2 Max (38-core GPU) · Ling-2.6-flash-mlx-4bit-DWQ · 4-bit · 26-04-30 18:16:04

1k: PP 291.5 · TG 36.9 tok/s
4k: PP 249.8 · TG 39.2 tok/s
8k: PP 251.6 · TG 37.2 tok/s
16k: PP 246.9 · TG 36.3 tok/s
32k: PP 239.7 · TG 34.8 tok/s
64k: PP 220.8 · TG 30.8 tok/s