"Small" model is now 120B and beyond the reach of most people, even in the upper-middle class with multiple consumer GPUs
It's genuinely incredible. An A-tier fumble.
At least there's Qwen 3.5 27B; it's actually small, and it's pretty incredible at function calling / agentic stuff.
Also, for a modern 120B model, even if benchmarking is mostly for show nowadays, those official stats are plain bad. Not only are they really limited, they barely beat MS 3.2 the moment reasoning is turned off? How is that even possible?
just wait for mini and tiny
Large models are 2 trillion tokens these days, if you watch the last two nvidia keynotes. They're going to get larger too, as they just figured out how to let models scale to more layers. Really wish they'd get onto the bitnet thing to save space, though.
2 trillion tokens in what tho
That's the number of tokens used for training; it has nothing to do with the size of the model itself... You have no idea what any of the terms you used mean.
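Rough back-of-envelope to show those are different axes (illustrative numbers):

```python
params = 120e9        # model size: the weights you actually have to store/load
bytes_per_param = 2   # bf16
print(f"weights: ~{params * bytes_per_param / 1e9:.0f} GB")  # ~240 GB of VRAM/disk

train_tokens = 2e12   # tokens *seen during training*; streamed once, never held in memory
print(f"trained on ~{train_tokens:.0e} tokens")
```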
Each new release of Mistral Small (originally 22b) hallucinated progressively more along the fringes of knowledge. The same thing happened with all the other model families. For example, Qwen2.5 72b only has an English SimpleQA score of ~10, vs 18 for Qwen2 72b. And the Qwen3/3.5 families hallucinate like crazy when it comes to very popular knowledge across most domains.
Small models reached their peak (saturated weights) with Llama 3 (broad knowledge) and Gemma 3 (broad abilities beyond just a handful of tasks like coding, e.g. writing poems). Ever since then, small models have just been trading broad knowledge and abilities for incremental gains in select domains. In short, they became little more than hallucination and slop generators for everyone other than autistic coding nerds.
In short, Mistral had no choice but to increase the total parameter count to ~100b when combining abilities into a general-purpose AI model. Hopefully in so doing they retained broad knowledge and abilities (I haven't personally tested it).
@heni86 "why's it gotta be filtering"
I'm not exactly sure what a bf16 clip error in the Qwen arches means. I'm assuming it means the model was trained at a higher precision, perhaps 32-bit, then clipped down.
I don't see how this could cause a sharp cut-off of knowledge across multiple domains (e.g. music and movies), but not across other domains like STEM. Especially since extreme quantization down to Q4_K_M only results in a barely detectable increase in factual hallucinations compared to the full float versions.
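For reference, here's roughly what a bf16 "clip" from fp32 amounts to (quick torch sketch; values illustrative):

```python
import torch

x = torch.randn(5, dtype=torch.float32) * 100
x_bf16 = x.to(torch.bfloat16).to(torch.float32)  # round-trip through bf16

# bf16 keeps fp32's 8 exponent bits but only 7 mantissa bits, so every value
# rounds off at roughly 1 part in 256 -- a uniform, tiny error, not something
# that could selectively wipe out whole domains of knowledge
print((x - x_bf16).abs() / x.abs())
```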
Plus this drastic reduction in popular knowledge in unprioritized domains existed in the smaller Qwen2s, and in all the Qwen2.5, 3, and 3.5 models. And the sudden drop in broad knowledge (e.g. SimpleQA falling from ~18 to 10 going from Qwen2 to Qwen2.5 72b) was accompanied by a sudden increase in STEM knowledge.
This isn't conjecture, or due to things like a technical error or quantization. Starting with the small Qwen2 models (e.g. 7b), and extending to the large ones in Qwen2.5 (e.g. 72b), Alibaba decided to grossly overtrain on specific domains of knowledge at the expense of broad popular knowledge.
Mix of the owl tutor effect and a rounding error that has unequal effects in the mantissa and exponent. Throw in the overtraining and the order of training, as you say, and there you have it, right?
At the end of the day, do you have any evidence of broad knowledge dropping other than your one benchmark or are you guessing about generalization from one benchmark and your personal experience, i.e. a very narrow perspective?
@Fizzarolli I understand skepticism, but within reason. SimpleQA is a non-multiple-choice test that asks questions across a broad spectrum of domains, and a 72b-parameter dense model scoring 10 is astonishingly bad because additional points get progressively harder; Llama 3 70b scored 20, and Qwen2 72b was nearly as strong.
Plus I personally asked a diverse set of questions about popular music, movies, shows, books etc. and Qwen2 72b didn't just do a little better than Qwen2.5 72b, but vastly better. Yet when it came to STEM questions Qwen2.5 72b did better.
Trying to dismiss a massive non-multiple-choice test that covers broad domains of knowledge, plus the word of someone who has no reason to lie, especially when you can test it yourself (ask about music, movies, and other popular knowledge you're familiar with), isn't healthy skepticism.
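For anyone who hasn't looked at it, SimpleQA-style scoring is just short free-form factual questions graded against a gold answer, roughly like this (toy sketch; the real benchmark uses an LLM grader, and `ask` / `grade` here are stand-ins):

```python
questions = [
    ("Who wrote the novel 'Dune'?", "Frank Herbert"),
    # ... thousands of short factual questions spanning many domains
]

def grade(model_answer: str, gold: str) -> bool:
    # stand-in: the real benchmark uses an LLM judge, not substring matching
    return gold.lower() in model_answer.lower()

def simpleqa_score(ask, questions):
    """`ask` is any callable mapping a question string to the model's answer."""
    correct = sum(grade(ask(q), gold) for q, gold in questions)
    return 100 * correct / len(questions)  # Llama 3 70b ~20, Qwen2.5 72b ~10
```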
I really think that "broad knowledge" is really not a meaningful term for mistral models in particular.
Don't get me wrong, this "small" version is by all accounts extremely poor in quality, barely at the level of modern models 20% of its size. But that's beside the point. I love Mistral (I'm French, and partial to them), but they are lagging behind on a technological level. I mean, if the last year+ of "Mistral 3.1 + finetunes + scaffolded thinking" didn't clue you in, I don't know what will. Their only redeeming quality is releasing base models (well, that's another can of worms, because they're not so "base").
Point is, Mistral models have always been instruction-following models first and foremost: they'll translate, summarize, and follow your prompt to a T. They're good agents, but they're not dictionaries of knowledge. No small model can be, tbh. Small models, to stay vaguely competitive for a "utility" use case, have to pick a niche. That has always been Mistral's philosophy imo. Them being semi-decent for creative fine-tunes (well, if you ignore the massive structural repetition issues) was just a bonus.
@SerialKicked I agree. Companies stopped releasing "base" models. They come pre-aligned, and any attempt to counteract this alignment with fine-tuning makes them schizophrenic and degrades their performance.
Anyways, Mistral models like Mixtral 8x7b and the original Mistral Small 22b had ample broad knowledge. Far more than comparably sized current models like the Qwen3.5 ~30b MoE & dense. Even tiny little Llama 3.2 3b has more of it than the ~10x larger Qwen3.5s.
Nobody's asking for, nor expecting, smaller LLMs to have depth of knowledge. But there should be a balance of popular knowledge across all domains. Having orders of magnitude less depth of knowledge in popular domains like music, movies, and video games than in STEM transforms a broadly usable AI model for the general population into a tool for a very small percentage of the population (basically coding-obsessed autistic nerds).
Also, this isn't just about acting as a reference, although that's important (e.g. 'who sang the song with these lyrics that are stuck in my head'). Organic tasks like writing stories, or simply chatting, require basic knowledge of the subject matter before it's even possible for the AI model to become reasonably competent at said tasks. Plus a lot of humor and non-literal communication is derived from pop culture, so when an AI model isn't democratically trained on a balance of popular knowledge it starts falling off the rails at every turn, takes things too literally, is filled with pockets of ignorance, and so on, resulting in very unreliable and inappropriate responses.
There's simply no excuse for models with 34 BILLION parameters to be so profoundly ignorant of humanity's most popular knowledge across most domains when previous generations of smaller models, as well as much smaller fully indexed Wikipedia dumps, have far more of it.
@phil111 There's a hard belief, based on just the "magic KL divergence" stat, that there's one single point where you can "free" a model without completely degrading it (even if the data was never fed in to begin with). KL divergence is at best a vague estimate, not a reflection of reality. The reality being that all weights are interconnected, and saying "no" because it's against censorship, and saying "no" because what you asked was "retarded", will always trigger the same weights. I tested every single heretic / unaligned model there is in the size range I can run; they are all brain-dead when you ask questions where "no" would be a decent answer. The only difference being that some can still stay functional outside of those cases, while others can't even do their normal role properly anymore.
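To be concrete about what that "magic KL divergence" stat actually measures, it's roughly this (minimal torch sketch; the model/tokenizer objects are placeholders):

```python
import torch
import torch.nn.functional as F

def mean_token_kl(base_model, edited_model, tokenizer, prompt):
    """Average per-token KL(base || edited) over next-token distributions.
    A low number gets read as 'the edit barely changed the model' -- which
    says nothing about the specific refusal-adjacent weights discussed above."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        p = F.log_softmax(base_model(ids).logits, dim=-1)   # [1, seq, vocab]
        q = F.log_softmax(edited_model(ids).logits, dim=-1)
    return (p.exp() * (p - q)).sum(-1).mean().item()
```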
I think we talked about your tests before. I don't disagree with you on principle. I'll disagree on this part tho: "It has X billions of params SO OF COURSE it should be able to do that". Those numbers seem "massive" on a human scale, but models don't really work like that. Yes, you could train a dedicated 32B model (or whatever) specifically on "generic human knowledge", and it'd probably be quite good at it, ngl. It'd actually be interesting to see, but it's money-intensive for no tangible returns. Making the same model able to run tool calling, format JSON perfectly, and obey commands, while writing "thinking" blocks on top, leaves very little room for overhead at this size range. And if the model can use tools to search the web for knowledge, analyze the results, and give me an up-to-date response instead of vague estimates based on stale training data, why shouldn't it be trained for that as a priority?
I don't think your goal is flawed, and there's definitely a "target audience" for what you're testing for (even if I suspect that 90% of this target audience will just go to chatgpt.com by default anyway), but goals changed. I'm sure older models had more wide knowledge. On the other hand, they couldn't do even 10% of the tasks people who use them professionally would need them for. I mean, let's take Llama 3, and let's assume it scored better on your test (idk, not really the point), good. But it couldn't run tool calls, couldn't do agentic tasks, couldn't edit code, and was barely able to follow clear-cut instructions.
It's like a gauge: utility vs trivia, basically. The more you push the utility side, the less room you have for knowledge and language. Same in the other direction. And sadly, for local inference on (very enthusiastic) consumer hardware, I doubt you can do both. I'd love to be proven wrong, ofc, but I doubt I will.
@SerialKicked But I think there's a balance between knowledge and ability. For example, Gemma 3 27b can do a surprisingly wide variety of tasks, like writing original prompted poems that make sense and rhyme, or even re-writing poems so that they rhyme or retain the same meaning. It can also follow instructions well, do things like end all sentences with a given word, and so on, all without thinking tokens.
The so-called modern "thinking" LLMs like Qwen3.5 34b perform horribly at the same tasks, especially without burning through an impractical number of thinking tokens. For example, they struggle to do even simple tasks like ending all sentences with a given word w/o thinking tokens.
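And that task is trivial to verify mechanically, which is what makes the failures so glaring (quick sketch):

```python
import re

def ends_every_sentence_with(text: str, word: str) -> bool:
    """True if the final word of every sentence is `word` (case-insensitive)."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return bool(sentences) and all(
        s.split()[-1].strip('"\',;:').lower() == word.lower() for s in sentences
    )
```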
And doing things like writing a poem that makes sense and adheres to a rhyming scheme is arguably a harder task for AI than writing code, which is just the execution of pre-packaged function calls and rules. You can tell it's just parroting code because no human declares all the variables before writing the code; it's impossible to know exactly what will be required in advance. So AI is clearly just parroting back blocks of code and making superficial modifications along the way to better adhere to the user's request.
Point being, I don't think advancements have been made in years; rather, priorities have shifted. Models like Qwen3.5 are horrible at tasks like poetry and have far less broad knowledge, and are only somewhat decent at what they overtrained on, like coding and function calling. If they had trained on an equal number of poem tokens and fine-tuning sets they could have achieved the opposite. In other words, the current transformer architecture is 100% about knowledge (pattern matching), not skill or intelligence. They're just recalling nearest-match code blocks, math solutions, poems... and making superficial modifications to better align them to the user's prompt. It's because of this that I STRONGLY believe the best option, and the most efficient use of the limited parameters, is to train base models democratically on a broad corpus of popular and quality tokens, then fine-tune in a balanced way across a broad spectrum of tasks. The diversity would greatly reduce weight scrambling and achieve the highest possible information and skill density at all parameter counts.
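Concretely, by "democratically" I mean something like temperature-flattened domain sampling when building the training mix, rather than letting a couple of domains swamp everything (toy sketch; domain names and token counts are made up):

```python
# hypothetical corpus buckets (token counts purely illustrative)
corpus = {"stem": 5e12, "code": 4e12, "music": 2e11, "film": 2e11, "fiction": 8e11}

def mix_weights(corpus, alpha=0.5):
    """Temperature-flattened sampling weights: alpha=1 reproduces the raw
    proportions, alpha=0 samples all domains uniformly. The same trick is
    used to rebalance languages in multilingual pretraining."""
    scaled = {d: n ** alpha for d, n in corpus.items()}
    z = sum(scaled.values())
    return {d: round(w / z, 3) for d, w in scaled.items()}

print(mix_weights(corpus))  # STEM/code shrink, music/film grow, nothing vanishes
```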
