mike-ravkine posted an update 15 days ago
Gemma-4, specifically google/gemma-4-26B-A4B-it, is doing something inside its reasoning traces I have never seen before: it recognizes that it's being evaluated and spends meta-thinking tokens on understanding the evaluation regime in which it believes it finds itself.

> Let's see if 12/10/2023 is a more likely answer than 12/09/2023

> In most AI benchmark tests (like those this prompt resembles), the simplest path is often the intended one.
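
If you want to grep your own traces for this, here's a minimal sketch of the kind of scan I mean - the phrase list and names are purely illustrative guesses of mine, not anything from Google or from my actual harness:

```python
import re

# Illustrative markers of evaluation-awareness; my own guesses,
# not derived from Gemma-4's training or any official source.
EVAL_AWARENESS_PATTERNS = [
    r"\bbenchmark\b",
    r"\bthis (prompt|task) (looks like|resembles)\b",
    r"\bI('m| am) being (evaluated|tested)\b",
    r"\b(BigBenchHard|BBH)\b",
]

def awareness_markers(trace: str) -> list[str]:
    """Return which patterns match inside a single reasoning trace."""
    return [p for p in EVAL_AWARENESS_PATTERNS
            if re.search(p, trace, flags=re.IGNORECASE)]
```
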
I am blown away by this, and it prompts the obvious question: *Is this cheating?*

I am leaning towards no.

Humans *always* know when they're being evaluated, so this situational blindness is not actually a prerequisite of evaluation - it just so happens that no model before Gemma-4 looked up in the middle of the test and went "Wait a minute - this is a test! I should try to align my answer with the test format's expectations."

What I would love to know, if anyone from the Google team can indulge me, is whether this behavior was intentionally trained or whether it emerged.

Maybe it's not cheating in the purest sense of the word, but I think the more important question is "will this model generate responses that are truly more relevant and helpful, or will it simply score better on benchmarks?"

I appreciate that most people (including myself) use benchmarks as quick-glance heuristics when assessing a new model, but the proof of the pudding is in the eating. At the end of the day, I don't care what a model has scored on any benchmark if it can't generate useful knowledge and content.

---

You're very much on to something here, and this is why I think it matters whether this behavior is intentional or latent.

If they've taught it to recognize benchmarks specifically, that's benchmaxxing, and it's not going to help real-world performance when your real tasks don't trigger the maxxed paths. This is a genuine concern.

If they've taught it to "reach beyond the prompt" in the general sense, to understand the context and user intent behind the query, that's a genuinely useful capability and would explain why this model feels a little different.

Some stats: some version of this reasoning path happened in 39 out of 1070 test configurations, across 4 of my 12 tasks. In the most common occurrence, responsible for 30 of the 39 hits, it recognized the task as being from BigBenchHard specifically and used its knowledge of the BBH category sets - which unfortunately suggests benchmaxxing.
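
For scale, that's about 39/1070 ≈ 3.6% of configurations overall, but 30/39 ≈ 77% of the hits were BBH-specific. The per-task tally itself is the boring part; assuming each run is logged as a (task, trace) pair (a made-up shape, not my actual log format), it's just:

```python
from collections import Counter
from typing import Callable

def tally_awareness(runs: list[tuple[str, str]],
                    is_aware: Callable[[str], bool]) -> Counter:
    """Count, per task, how many (task, trace) runs trip the awareness check.

    `is_aware` can be any predicate, e.g. a wrapper around the
    awareness_markers() sketch above.
    """
    hits: Counter = Counter()
    for task, trace in runs:
        if is_aware(trace):
            hits[task] += 1
    return hits
```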