This LLM is a test maxer, not a general purpose AI model.
You should probably add descriptors to your model names, such as STEM, Math, or Code, if you're going to grossly overfit your models to a handful of domains rather than make a balanced general purpose AI model.
For example, this model's general popular knowledge is extremely low for its size; it scores lower on broad knowledge tests, including my own, than much smaller general purpose AI models like Gemma 4 26b-a4b, and even Qwen3.5 34b-a3b.
And it's not only that your corpus was heavily lopsided for the sake of STEM test maxing (e.g. MMLU-Pro); you didn't even account for broad popular knowledge when training the model for thinking, causing it to loop endlessly, chasing reasoning tangents that make no sense given the simple knowledge retrieval question being asked of it.
It's perfectly understandable to make a model for specific tasks like STEM, coding, and math, but please don't pass it off as a general purpose AI model, because outside the domains you test maxed for, this model is reduced to little more than a hallucination generator that almost always enters an endless thinking loop in response to simple prompts outside its overfit domains.
Edit: You can even see a huge regression compared to v2.5 on the LMSYS arena, such as dropping from rank 79 to 108 in creative writing, and from 60 to 91 in math. As you well know, LLMs are a balancing act: you can't just train hard on coding and advanced math without laying waste to the model's general abilities and knowledge. I'm not trying to single you out, since the bulk of the AI community has shifted from making general purpose AI models to test maxing specialist tools as parameter counts have increased. Training on countless coding and math tokens, followed by a large number of CoT examples with thinking enabled, is an easy way to climb higher on STEM tests like coding, but in the end you're left with an LLM that is unreliable and hallucinates across most other domains despite using hundreds of billions of parameters.
The blog posts and README are heavily biased towards software engineering, and the MiniMax in the name is a reference to the minimax algorithm.
Materials released with the model are explicit about its use for software engineering. Demanding a name change is pedantic, and it should be obvious what this model is for.
@domcx That's true. The model card was clear about its focus, as were the tests touting its abilities. However, this doesn't change the fact that the general shift toward test maxing, especially on coding and math, has all but ended the era of general purpose open-source AI models. Even Gemma 4 has a much smaller knowledge horizon than Gemma 3, which in turn has a smaller horizon than Gemma 2.
Since the open-source community is >95% autistic coders, there's strong pressure to trade a notable amount of generalist ability and knowledge for small gains in select domains (e.g. test maxing coding).
And the end result is Microsoft and Google trying to integrate grossly overfit models into their respective OSes, contributing to the general population's dislike of AI. For example, <5% of the population codes, and those who do wouldn't code on their phones or with an edge model, yet Google wasted tons of parameter capacity on coding when making E4B and E2B, resulting in models that do nothing but hallucinate about humanity's most popular knowledge. So when normal people use E4B on their phones, it's going to lead to a very frustrating experience. Point being, this widespread shift from generalist models to autistic specialists is doing great damage to how the general population views AI.
Domain-specific models are a good thing: they provide better results for the target use case at lower inference cost. A model trained for software engineering will do well at software engineering. You pick the tool for the job. Why hit everything with a hammer?
The rest of your argument does not resonate with me.