Gemma 4 E4B will be as encyclopedically well-read as the 12b model?

#48
by Regrin - opened

Hello!
Gemma is my favorite model family.
I often consulted Gemma 3 12b when there was no internet connection.
My laptop only has 16 gigabytes of RAM, so it won't fit anything larger than Gemma 3 12b. Even GPT-OSS-20b barely fit, and that was after I closed everything possible.

But here's the problem! The Gemma 4 family doesn't have a model with 12 billion parameters! There are either significantly more (it won't fit on my computer) or fewer. The largest model that will fit on my computer is the Gemma 4 E4B. But it's slightly smaller than the 12b. However, it beats it in benchmarks.

Please tell me, do you think the Gemma 4 E4B will be as encyclopedically well-read as the 12b model? Will it no longer have hallucinations?

After all, you can cram more information into 12 billion parameters than into 8 billion.

@Regrin I'm obsessed with the hallucination problem of LLMs since they, especially smaller ones, produce a flood of hallucinations across nearly all popular domains of knowledge, and it's getting worse with time. And in my opinion this makes them less than useless.

I personally tested Gemma 4 E4B and it hallucinates like mad about basic facts across most popular domains of knowledge, and frankly, so does Gemma 3 12b. If you want a small AI model that hallucinates less across a broad spectrum of domains, you're better off with Gemma 2 9b or Llama 3.1 8b. They're smaller and weren't as grossly overfit to a handful of mainly STEM domains (e.g. coding, math & science) like Gemma 3 was. Gemma 3 got better at tasks like coding, math, and story writing, but regressed significantly in broad knowledge per parameter count.

Gemma 4 is even worse. It's more skilled than Gemma 3, but its knowledge horizon per parameter count shrank, and Gemma 3's was pretty bad to start with (a much lower knowledge horizon per parameter than Mistral Small and Llama 3). Gemma 4 still has more total knowledge than Gemma 3, but they traded democratic training on equally popular knowledge across all domains for gross over-training on math, coding, and science tokens, resulting in a notable regression in broad knowledge.

This is even more true for Gemma 4 E4B. They wasted about half its size on embeddings due to the absurdly inappropriate use of a 250k-token vocabulary for an edge LLM. Plus they wasted a lot of parameters on things like coding that nobody is going to use (<5% of the population codes, and almost no coder would use an edge model on a phone rather than a far more powerful model on a PC). So in the end Gemma 4 E4B is ignorant as f*ck when it comes to popular English knowledge across nearly every popular domain. Its brains are far better than Llama 3.2 3b's (e.g. story writing, coding, and math), but L3 3b has notably more broad knowledge while being significantly smaller and a little faster.

They say that inline correction of obvious facts, which shouldn't be left up to the probabilistic nature of next-token prediction, is very hard to accomplish, even when using a smaller AI model to scan the output of the larger model, identify such facts, pause next-token prediction, and search a compact relational database of core facts to correct/verify each one before moving on. However, since an immutable fact is obvious to the user, there should at least be the option for the user to select said fact and tell the LLM to use its context to search an included compact relational database from the SSD (not memory) and verify/correct it. Considering the obscene frequency of factual hallucinations about very commonly known popular things, especially by smaller LLMs, implementing such a feature is non-optional.
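To make the lookup half of that idea concrete, here's a minimal Python sketch using SQLite as the compact relational store. Everything in it (the `facts` schema, `verify_claim`, the sample fact) is a hypothetical illustration, not any shipping product's API, and the genuinely hard parts — extracting (subject, predicate, object) claims from model output and pausing decoding — are deliberately left out:

```python
import sqlite3

# Hypothetical sketch: a tiny relational store of core facts keyed by
# (subject, predicate), queried to verify a claim the model just emitted.

def build_fact_db(path=":memory:"):
    """Create a compact (subject, predicate, object) fact table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts ("
        "subject TEXT, predicate TEXT, object TEXT, "
        "PRIMARY KEY (subject, predicate))"
    )
    return conn

def verify_claim(conn, subject, predicate, claimed_object):
    """Return (is_correct, stored_value); (None, None) if the fact is unknown."""
    row = conn.execute(
        "SELECT object FROM facts WHERE subject = ? AND predicate = ?",
        (subject, predicate),
    ).fetchone()
    if row is None:
        return (None, None)  # not in the DB: can't verify either way
    return (row[0] == claimed_object, row[0])

conn = build_fact_db()
conn.execute("INSERT INTO facts VALUES (?, ?, ?)",
             ("Mount Everest", "height_m", "8849"))
# Model claims Everest is 8,848 m; the stored value would override it.
ok, stored = verify_claim(conn, "Mount Everest", "height_m", "8848")
```

A real deployment would keep the database on disk (SQLite already memory-maps pages on demand, which matches the "SSD, not memory" constraint above) and fall back to leaving the token stream untouched whenever the lookup returns unknown.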

Hmm... What should I do?
In principle, is it possible to create a model with the same knowledge as more successful models, and the same intelligence as Gemma 4? So that it's the same size.

I tried Gemma 4 31b and Qwen 3.5 27b. I asked about the hero of a book well-known in certain circles.
At first, they didn't understand who it was because I wasn't asking in English.
Then I asked who it was. And they told me. I didn't see a single hallucination in Gemma. Qwen somehow recounted the basic plot, but the details were nonsensical.

@Regrin Based on what I've been reading the short answer is no. It isn't possible to create models that both maximize general knowledge and achieve strong abilities like Gemma 4.

When the English SimpleQA was released (a broad domain non-multiple choice test) there was a reliable correlation between the total parameters of a model and how high the test scores were across both OS and proprietary models. However, once select domains were aggressively over-trained, especially coding, math, STEM knowledge, and creative writing, the scores plummeted (e.g. Qwen2.5 72b dropped to a SimpleQA score of ~10, and Opus dropped to the score of the previous Sonnet, although the updated Opus regained some of the lost broad knowledge).

When it comes to Qwen3.5 and Gemma 4, Qwen3.5 is a full generation behind Gemma 4 in both broad knowledge and abilities. For example, Qwen3.5's stories are highly repetitive, regularly repeating the same phrases, plot lines etc. in response to very different prompts. Plus Qwen3.5's thinking uses a lot more tokens to do an inferior job, so it's a much less reliable model. Still, Qwen3.5 is far better than Qwen3, and there are pockets of knowledge and skills where Qwen3.5 beats Gemma 4. But neither model comes close to the minimum required reliability outside of STEM, and to a lesser degree academia in general. For example, the original Mistral Small has far more broad knowledge than both with only 22b parameters, but isn't nearly as competent across tasks.

So to sum up, the only real hope for greatly reducing the factual hallucination rate of moderate to small models like Gemma 4 31b and Gemma E4B, while retaining their size and abilities, is to seamlessly couple them with a grounding relational database of core facts. Other forms of RAG are far too unstructured, high latency, context hogging, and so on. The probabilistic next token prediction nature of LLMs, coupled with relatively small parameter counts and the gross overfitting of select domains, simply will not allow them to function as general purpose AI models. The factual hallucination rate will always be far too high. They simply must be fused and shipped with a compact relational database of humanity's core popular knowledge to keep them from constantly falling off the rails.
