Gemma 4, Compared to Gemma 3, Has Notable Improvements and Regressions
Although this model is great overall, there are notable regressions compared to Gemma 3 27b that persist across various hosts using the full-precision weights, so they don't appear to be due to things like a tokenizer bug or quantization (e.g. Q4_K_M).
Firstly, thinking burns through far too many tokens (even when selecting a lower thinking budget). It almost always overthinks and second-guesses itself, and regularly enters infinite loops. But thinking isn't practical anyway, since burning through hundreds of tokens and waiting about a minute to perform a simple task isn't reasonable.
But with thinking disabled, this model performs worse than Gemma 3 at a large number of tasks. For example, when asked for 8 synonyms of a word like "extra", the list often includes the word itself (extra), repeated words, and so on; about half the time this model immediately detects the errors and fixes them, which is impressive. However, Gemma 3 27b, given the same prompts with the same settings, quantization, etc., rarely includes the search term, repeats words, or uses sub-par synonyms.
Another example is ending 8 sentences with a given word. Again, Gemma 3 27b only makes rare mistakes, while Gemma 4 (same settings and quantization) regularly makes errors, but again, often immediately detects and fixes them. The two primary errors are not ending a sentence with the given word, or forcing the sentence to end with it, resulting in nonsense.
One more thing I'll mention is that Gemma 4 has less broad knowledge. That is, when you ask about the singers of progressively less popular songs, or the casts of progressively less popular movies and TV shows, the hallucination rate climbs notably faster, with errors such as ending names with the wrong token (e.g. Leroux vs Leroy), swapping the actor and character names, and so on. But since such errors are less common within STEM than in Gemma 3, this looks like a design choice (overtraining STEM while undertraining most other domains of popular knowledge), so it isn't an issue with the model itself, which as a whole has more knowledge than Gemma 3.
Anyways, based on your inexplicable recommendation of temp 1 and top-p 0.95, you seem to be aware of these shortcomings and are trying to counteract them by exploring the token pool. That is, those settings produce much larger token pools, filled with nonsensical tokens, than say temp 0.7 w/ min-p 0.2. Then during thinking the model is pushed to explore that pool (e.g. 'it could be this, or that') in the hope of finding the right token. But even after burning through a vast number of tokens, and with a bit of luck avoiding an infinite loop, the success rate only improves by ~10%. And with thinking disabled, temp 1 w/ top-p 0.95 has a far higher hallucination rate at the knowledge horizon than temp 0.7 w/ min-p 0.2. Even with temp 0.7 w/ min-p 0.2, Gemma 3 27b does much better at the knowledge horizon (it's basically only worse when it comes to STEM, which you grossly overtrained relative to nearly all other popular domains of knowledge).
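To make the token-pool point concrete, here's a toy illustration (not the actual Gemma sampler; the logit distribution is made up) of how top-p 0.95 at temp 1 keeps a long tail of low-probability tokens alive, while temp 0.7 with min-p 0.2 cuts the pool down to the few tokens near the top:

```python
import math

def pool_sizes(logits, temp, top_p, min_p):
    """Return (top-p pool size, min-p pool size) for one toy distribution."""
    # Apply temperature, then softmax (numerically stable form).
    scaled = [l / temp for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = sorted((e / z for e in exps), reverse=True)

    # top-p (nucleus): smallest set whose cumulative mass reaches top_p.
    cum, top_p_n = 0.0, 0
    for p in probs:
        cum += p
        top_p_n += 1
        if cum >= top_p:
            break

    # min-p: keep only tokens whose prob is at least min_p * (top prob).
    threshold = min_p * probs[0]
    min_p_n = sum(1 for p in probs if p >= threshold)
    return top_p_n, min_p_n

# Toy logits: a few strong candidates plus a long tail of weak ones,
# standing in for a 250k+ vocabulary.
logits = [8.0, 6.0, 5.5] + [3.0] * 200

# temp 1 + top-p 0.95: the nucleus has to absorb much of the tail,
# so the pool contains well over a hundred tokens.
print(pool_sizes(logits, temp=1.0, top_p=0.95, min_p=0.0)[0])

# temp 0.7 + min-p 0.2: the sharpened distribution plus the relative
# cutoff leaves only the strongest candidate(s).
print(pool_sizes(logits, temp=0.7, top_p=1.0, min_p=0.2)[1])
```

The exact counts depend entirely on the made-up logits, but the direction of the effect is the point: top-p is an absolute mass cutoff, so a flat tail stays in; min-p is relative to the top token, so the tail drops out.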
But again, this model is still great, and seeing it regularly make, then fix, stupid errors in real-time w/ thinking mode disabled actually gives me hope for the future of LLMs.
This model keeps trying to make up for sub-par knowledge across most non-STEM domains by thinking, and almost always fails to do so.
For example, acarpous (producing no fruit) is a relatively common word found in both my desktop dictionary and WordWeb, and even tiny LLMs reliably define it correctly. However, this model only reliably gets it right when thinking (typically by extracting its meaning from its roots). Without thinking, it keeps insisting no such word exists and suggesting that perhaps you meant Acarpus or Apocarpous.
This is actually a perfect example of what happens when you don't balance the training of an AI model, but instead heavily favor select domains, in this case STEM. Instead of recognizing common words found in desktop dictionaries, it says they don't exist and then proposes more esoteric STEM words typically found only in unabridged or specialty dictionaries.
Similarly, this LLM can work through relatively complex code and math, yet fails at extremely simple tasks, like ending all sentences with a given word, that humans find laughably easy. This discrepancy is due to the mountains of math and coding examples used during training, so an easy-to-modify nearest match is commonly found. The actual "thinking" is cringe nonsense and clearly only serves to eliminate stupid errors and tangents, while never taking things to a higher level. Yann LeCun is undeniably right: the transformer architecture alone is clearly not a path to AGI. Having every token selected by rolling the dice over a pool of probable tokens is not a path forward. Key tokens simply need to be selected in a different way.
Seriously though, you can't combine a temp of 1 with a vocabulary of over 250k tokens and a model of this size. Even when I decrease the temp to 0.7 and lower top-k, the token pool is still filled with far too many nonsensical tokens, and it's the same for the dense version.
Yes, enabling thinking mode can usually rule out the words produced when nonsensical tokens are chosen, but it's not reasonable to burn through tons of tokens and risk looping for every little thing.
For example, the token 索 in "The owner of the comic book store in The Big Bang Theory was portrayed by actor Mike Mc索 (often credited as Mike McClean)..." (note: it's actually Stuart Bloom, which this model gets right most of the time).
I seriously question the reasoning behind creating a model that hallucinates more than Gemma 3 27b across most (non-STEM) domains, then compensating by suggesting people increase the hallucination rate even further: raising the temp to 1 and enabling thinking so the model can burn through mountains of tokens in the hope that the correct response is discovered amongst the flood of hallucinations.
Edit: Deleted this bug because it was due to a configuration issue.
Hey phil, what do you think about the E4B model?
@userBo Funny you ask, because I was just playing around with it.
Not sure how embeddings work, but it's amazing how much they fit into such a small model (e.g. ~150 languages) while maintaining coherency.
However, I personally strongly prefer Llama 3.3 3b because it scored significantly higher on my broad knowledge test. Gemma 4 E4B is both larger and slower, yet it started hallucinating about wildly popular music, movies, shows, and other areas of pop culture. For example, it even hallucinated while creating a main-character list for one of the most-watched and longest-running TV shows in human history (Friends).
I think at the very least AI models need to function, at least somewhat reliably, as personal assistants/sages, such as answering simple questions and discussing topics. If they can't do that, nothing else matters. They'll never be able to do things like write coherent stories with depth or bug-free code. Humans should be coding, writing stories, etc., and AI should primarily be detecting mistakes, bugs in the code, and so on.
So in short, while E4B impressed me, it's little more than a hallucination and AI-slop generator, so I already deleted it. They really need to ship smaller know-nothing models with training-aware relational databases (just core facts linking singers to albums, actors to movies, etc.) that populate the context with key facts before generating tokens. They'll still hallucinate a lot, but at least those hallucinations will be constrained to what you're talking about instead of being 100% hallucinations about an entirely different movie, show, album, celebrity, etc. Plus, once you feed an LLM core skeletal facts about a subject, it sparks its memory and greatly mitigates hallucinations about related info that isn't in the relational database.
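The "core facts" idea above can be sketched in a few lines. This is purely illustrative: the fact store, entity names, and matching logic are all hypothetical stand-ins for whatever a real training-aware database would contain, and real entity linking would be far less naive than substring matching:

```python
# Tiny hypothetical fact store: entity -> core relational facts.
# In practice this would be built alongside training, not hand-written.
FACTS = {
    "Friends": {
        "type": "TV show",
        "main_cast": [
            "Jennifer Aniston", "Courteney Cox", "Lisa Kudrow",
            "Matt LeBlanc", "Matthew Perry", "David Schwimmer",
        ],
    },
    "Thriller": {"type": "album", "artist": "Michael Jackson"},
}

def ground(prompt: str) -> str:
    """Prepend skeletal facts for any known entity found in the prompt."""
    lines = []
    for name, facts in FACTS.items():
        # Naive entity match; a real system would do proper entity linking.
        if name.lower() in prompt.lower():
            pairs = ", ".join(f"{k}: {v}" for k, v in facts.items())
            lines.append(f"[fact] {name} ({pairs})")
    # The model then generates from the fact lines plus the original prompt.
    return "\n".join(lines + [prompt])

print(ground("List the main characters of Friends."))
```

Even this trivial version shows the intended effect: the model's generation is anchored to the correct entity and its core relations, so any remaining hallucination at least stays on topic.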