Good work. Please consider addressing factual hallucinations.
This model is clearly an improvement when it comes to various tasks like math & coding, but the flood of factual hallucinations about widely known and popular things makes it far too unreliable for general use cases.
This isn't a criticism of Gemma4 since all the sister LLMs, including Qwen3.5 35b and GLM 4.7 flash, are even weaker in this regard.
And I'm not talking about RAG. I'm talking about maintaining the organic flow of an AI model's series of next token predictions as it performs complex tasks like story writing, but with far fewer factual hallucinations.
Rolling the dice to choose every token from a pool of probable tokens is unquestionably non-viable. It must eventually be addressed. That is, the <1% of tokens representing immutable facts, such as the name of an object or person, simply cannot be chosen from whatever pool of candidates happened to survive the sampling cutoff.
For example, when asking about the actor from the movie Inception, and the next token prediction reaches Tom... there are commonly tokens that make it into the pool because another actor is wildly popular (e.g. Tom Cruise).
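To make this concrete, here's a toy sketch of top-p (nucleus) sampling with an entirely made-up next-token distribution after "... played by Tom" (the probabilities are illustrative assumptions, not real model outputs):

```python
# Toy demonstration of how top-p (nucleus) sampling can leave wrong but
# plausible name tokens in the pool. Probabilities are made up.
def top_p_pool(probs, p=0.95):
    """Return the smallest set of tokens whose cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    pool, total = [], 0.0
    for token, prob in ranked:
        pool.append(token)
        total += prob
        if total >= p:
            break
    return pool

# Hypothetical distribution after "... played by Tom"
probs = {"Hardy": 0.62, "Cruise": 0.27, "Holland": 0.08, "Hanks": 0.03}
pool = top_p_pool(probs, p=0.95)
print(pool)  # ['Hardy', 'Cruise', 'Holland'] -- wrong surnames survive the cutoff
```

With any nonzero temperature, everything left in that pool can be sampled, so "Tom Cruise" comes out some fraction of the time even though the fact is immutable.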
Isn't there a way for the attention mechanism to detect from context that the next token (someone's name) is immutable and can't be left to chance, hence pausing inference to search a relational database of core facts, finding Tom Hardy in the list of Inception actors, dropping in Hardy, and resuming next token prediction rather than rolling the dice on options like Hardy and Cruise?
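Something along these lines is what I mean; a minimal sketch of the lookup step only (the table name, rows, and the idea that a "name slot" has already been detected are all hypothetical — real systems approach this under names like constrained decoding and tool calling):

```python
import sqlite3

# Hypothetical fact store; schema and rows are made up for the sketch.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cast_list (movie TEXT, actor TEXT)")
db.executemany("INSERT INTO cast_list VALUES (?, ?)",
               [("Inception", "Tom Hardy"),
                ("Inception", "Leonardo DiCaprio")])

def resolve_name(movie, partial):
    """If the generated prefix matches exactly one known actor, force it."""
    rows = db.execute(
        "SELECT actor FROM cast_list WHERE movie = ? AND actor LIKE ?",
        (movie, partial + "%")).fetchall()
    if len(rows) == 1:
        return rows[0][0]   # exactly one match: splice it in, no dice roll
    return None             # ambiguous or unknown: fall back to sampling

# Inference has emitted "Tom" while answering about Inception; instead of
# sampling the surname, look up the verified completion.
print(resolve_name("Inception", "Tom"))  # Tom Hardy
```

The hard part this sketch waves away is the detection step — knowing *when* the next token is an immutable fact — which is exactly what you'd need attention (or a classifier on top of it) to signal.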
Again, AI is simply non-viable if immutable facts like these are left up to the roll of the dice. Using pure model weights to generate all tokens, including those representing immutable facts, is a recipe for disaster. You at least have to do a search of the database after attention and fill the context with the relevant data before rolling the dice with every token. One thing's for sure, leaving all tokens up to chance is incredibly stupid.
Are you sure it's not because they are movie questions? American LLMs are afraid of pop culture (movie lines, song lyrics, actors, and so on) because they get sued like crazy. Not sure if the Chinese models have this issue though because you can't sue a company in China.
Yeah, I definitely found they're afraid of actual lines from things like books, songs, and movies. But all my questions are in the public domain and can be found countless times across the web, including on Wikipedia. For example, who sang a particular song or portrayed a main character on a popular TV show. And all the models try to answer them, but keep making mistakes like giving the wrong last name, or completing the name incorrectly.
And it's not just pop culture. Pop culture knowledge regressed significantly after the release of Mixtral and Llama 3, as subsequent LLMs singled out STEM knowledge, such as coding and biology, for overtraining. I personally believe STEM knowledge should be favored, just not to nearly this degree. Regardless, even in STEM responses the words are often swapped or garbled, such as the names of scientists, astronomical objects, and medical diseases. And this happens regularly enough with common or critical knowledge to make LLMs dangerous in the hands of the stupid, and useless in the hands of the intelligent.
It's simply unacceptable to leave IMMUTABLE and EASILY VERIFIED facts that can be stored with 100% accuracy in a relatively small relational database to the roll of the dice. Token pools work great for non-deterministic words, which are the vast majority of words, but it makes less than no sense for an LLM to spend minutes burning through thinking tokens making wild guesses about who sang a particular song only to return nonsense, assuming it hasn't entered an infinite loop.
This isn't about storing all the pages of books, dialogue from movies, lyrics from songs, and so on. It's about not making regular and egregious mistakes about humanity's core popular information, such as who played a character in a popular movie, sang a popular song, and so on.

Globally there are about 10 thousand notable productions in each of the handful of pop culture domains, such as movies, songs, games, and sports teams. And the core facts needed to keep LLMs from completely going off the rails, such as linking albums to songs to artists, and movies to actors to character names, can be stored with 100% accuracy in a relational database that is dwarfed by this LLM, so it makes no sense to leave such immutable and easily verified information up to the roll of the dice. This hallucination madness must stop. It's making AI less than useless, and frankly, dangerous.
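For a sense of scale, a sketch of that kind of fact table (the schema is hypothetical and the size estimate is back-of-the-envelope, not a real dataset):

```python
import sqlite3

# Hypothetical schema linking songs to artists/albums and movie roles to
# actors; rows shown are well-known examples.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE song (title TEXT, artist TEXT, album TEXT);
CREATE TABLE role (movie TEXT, character TEXT, actor TEXT);
""")
db.execute("INSERT INTO song VALUES "
           "('Bohemian Rhapsody', 'Queen', 'A Night at the Opera')")
db.execute("INSERT INTO role VALUES ('Inception', 'Eames', 'Tom Hardy')")

# Back-of-the-envelope: ~10,000 notable titles per domain, ~10 linked
# facts each, ~100 bytes per row -> on the order of 10 MB per domain,
# dwarfed by the weights of any modern LLM.
per_domain_bytes = 10_000 * 10 * 100
print(per_domain_bytes / 1e6)  # 10.0 (MB)
```

Even multiplied across every pop-culture domain, the whole store stays in the tens-of-megabytes range, which is the point: these facts are cheap to hold exactly.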
thank u