Worse than (smaller) MiniMax M2.7??
Not a slight, since you can't win 'em all, but this is a larger, more recent model, and ArtificialAnalysis ranks it as worse than the (smaller, but also recent) MiniMax M2.7.
Is that truly the case, or is something wrong? Is there anything special in this model vs. MiniMax that the benchmarks aren't reflecting well?
> Anything special
A functioning 1M context length that shouldn't require a disgustingly big VRAM reserve (no idea how well Flash holds up against Pro, though).
> is something wrong?
Yes. You're letting the benchmarks cloud your vision. The model is as innovative as it gets. Just a year ago, something like this was unheard of; most users were sticking to 32K-64K context at the same VRAM usage.
The question is how long we'll have to wait until llama.cpp supports it properly and GGUFs start rolling out...
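For a rough sense of why 1M context doesn't have to mean a monstrous VRAM reserve, here's some back-of-the-envelope KV-cache math. The architecture numbers below are hypothetical placeholders (not this model's published config), and the hybrid-attention split is just one plausible way to cut the reserve:

```python
def kv_cache_gib(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache size for one sequence in GiB (2x for K and V; fp16 = 2 bytes)."""
    return 2 * ctx_len * n_layers * n_kv_heads * head_dim * bytes_per_elem / 1024**3

# Dense-attention baseline: every layer caches the full 1M context.
print(kv_cache_gib(1_000_000, n_layers=48, n_kv_heads=8, head_dim=128))  # ~183 GiB

# Hypothetical hybrid: only 6 layers keep full-context KV, the rest use a
# sliding window -- the kind of design that makes 1M context affordable.
full  = kv_cache_gib(1_000_000, n_layers=6,  n_kv_heads=8, head_dim=128)  # ~23 GiB
local = kv_cache_gib(8_192,     n_layers=42, n_kv_heads=8, head_dim=128)  # ~1.3 GiB
print(full + local)
```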
Keep it respectful, please; I didn't invite ad hominem, and there's no reason to assume it's acceptable.
I didn't mean to offend you, but it's a real issue these days: people focusing too much on benchmark results.
> ArtificialAnalysis ranks it as worse than the (smaller, but also recent) MiniMax M2.7. Is that truly the case, or is something wrong?
My dude, it's been out only a few hours. It's not even merged into mainline transformers, last I checked.
Just chill.
It's not that far from MiniMax M2.7 (AAI 50 vs. 47), and it could still be a winner if it takes aggressive quantization better than MM does.
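To put numbers on that: weight footprint scales linearly with bits per weight, so a bigger model that survives ~3 bpw can end up cheaper to run than a smaller one that only tolerates ~4.5 bpw. The parameter counts below are made up for illustration, not either model's real size:

```python
def weights_gib(n_params_b, bits_per_weight):
    # parameters (in billions) * bits per weight -> GiB of weights
    return n_params_b * 1e9 * bits_per_weight / 8 / 1024**3

for bpw in (8.0, 4.5, 3.0, 2.5):  # roughly Q8_0 down to ~IQ2-class GGUF quants
    print(f"{bpw:>4} bpw:  300B -> {weights_gib(300, bpw):6.1f} GiB   "
          f"200B -> {weights_gib(200, bpw):6.1f} GiB")
```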
> Yes. You're letting the benchmarks cloud your vision.
> Keep it respectful, please; I didn't invite ad hominem, and there's no reason to assume it's acceptable.
I thought the way he worded it was perfectly respectful. Benchmarks don't do the model justice.
That being said, I have 2x RTX Pro 6000s, and hopefully I can fit this Flash version with good context numbers and still have a functioning 1M-context model with some overflow to system RAM.
If KTransformers adapts to this, that will be beautiful.
It's a preview, my dear.
MiniMax M2.7 looks like it's overfitting benchmarks. I asked Claude to test it via 14 tasks generated by Sonnet; only one passed while 13 failed.
Just curious, did you run that same test on M2.5?
No, I don't have the resources or time to do that. However, I tested several gpt-oss-120B-related models, and they pass 10 to 14 of the 14 tasks on the same test. One of them can be found at https://huggingface.co/cloudyu/gpt-oss-120b-Sonnet-Reasoning-Distilled; these tests were generated by Sonnet 4.6 and then verified by Gemini Pro.
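For reference, a harness along these lines can be tiny. This is just a sketch of the pattern (the endpoint, model name, and task schema are assumptions, not my actual setup), and a real version would also need to extract the code block from the model's markdown reply:

```python
import json
import subprocess
import tempfile

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def run_task(task):
    """task = {'prompt': str, 'test_code': str} -- hypothetical schema."""
    resp = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": task["prompt"]}],
    )
    answer = resp.choices[0].message.content
    # Append the task's own assertions and execute in a subprocess.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(answer + "\n" + task["test_code"])
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, timeout=120)
    return result.returncode == 0  # pass = the task's tests ran clean

tasks = json.load(open("tasks.json"))  # e.g. the 14 Sonnet-generated tasks
passed = sum(run_task(t) for t in tasks)
print(f"{passed}/{len(tasks)} passed")
```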
Thanks for elaborating! I assume that if those had been coding tests, MM 2.7 would have done pretty well... on par with Claude and other SOTA models, right?
Looking closer, there were 5 coding tests, and it failed there too... hmm.
code_01  Median of Two Sorted Arrays            ✅ PASS  100%  25.0s  35
code_02  Thread-Safe LRU Cache with TTL         ✅ PASS  100%  38.9s  36
code_03  Persistent Segment Tree - K-th Query   ✅ PASS  100%  45.9s  35
code_04  Multi-Head Attention + RoPE (NumPy)    ✅ PASS  100%  38.0s  33
code_05  Dijkstra vs A* on Large Random Graph   ✅ PASS  100%  26.1s  34
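For a sense of what these tasks involve, here's a standard O(log min(m, n)) solution to code_01, binary-searching the partition point. This is a generic reference implementation, not the harness's actual grader:

```python
def find_median_sorted_arrays(a, b):
    if len(a) > len(b):
        a, b = b, a                  # binary-search over the shorter array
    m, n = len(a), len(b)
    half = (m + n + 1) // 2          # size of the combined left partition
    lo, hi = 0, m
    while lo <= hi:
        i = (lo + hi) // 2           # elements taken from a's left side
        j = half - i                 # elements taken from b's left side
        a_left  = a[i - 1] if i > 0 else float("-inf")
        a_right = a[i]     if i < m else float("inf")
        b_left  = b[j - 1] if j > 0 else float("-inf")
        b_right = b[j]     if j < n else float("inf")
        if a_left <= b_right and b_left <= a_right:   # valid partition
            if (m + n) % 2:
                return max(a_left, b_left)
            return (max(a_left, b_left) + min(a_right, b_right)) / 2
        if a_left > b_right:
            hi = i - 1
        else:
            lo = i + 1

assert find_median_sorted_arrays([1, 3], [2]) == 2
assert find_median_sorted_arrays([1, 2], [3, 4]) == 2.5
```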
Here, I reported the issue, but you'll need to translate it: https://huggingface.co/MiniMaxAI/MiniMax-M2.7/discussions/16#69de35f3c4cb3fe610074295
> Yes. You're letting the benchmarks cloud your vision.
> Keep it respectful, please; I didn't invite ad hominem, and there's no reason to assume it's acceptable.
You didn't respect DeepSeek first.
> Yes. You're letting the benchmarks cloud your vision.
> Keep it respectful, please; I didn't invite ad hominem, and there's no reason to assume it's acceptable.
That's not ad hominem; they're targeting your argument directly. You were using benchmarks as the sole indicator of quality and innovation, and they called you out, that's all.