You're very much on to something here, and this is why I think it matters if this behavior is intentional or latent.
If they've taught it to recognize benchmarks specifically, that's benchmaxxing, and it's not going to help real-world performance when your real tasks don't trigger the maxxed paths. This is a genuine concern.
If they've taught it to "reach beyond the prompt" in the general sense, to understand the context and user intent behind the query, that's a genuinely useful capability and would explain why this model feels a little different.
Some stats: some version of this reasoning path happened in 39 out of 1070 test configurations, across 4 of my 12 tasks. In the most common occurrence, responsible for 30 of the 39 hits, it recognized the task as being from BigBenchHard specifically and used its knowledge of the BBH category sets - which unfortunately suggests benchmaxxing.
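For context, here's a minimal sketch of how counts like these could be produced, assuming reasoning traces are saved one JSON file per test configuration. The `traces/` layout, the `task`/`trace` field names, and the recognition phrases are illustrative assumptions, not the actual ReasonScape harness:

```python
import json
import re
from pathlib import Path

# Phrases that indicate the model noticed it was being benchmarked.
# These patterns are illustrative guesses, not the real detection rules.
RECOGNITION_PATTERNS = [
    re.compile(r"big[- ]?bench", re.IGNORECASE),
    re.compile(r"this (looks like|is) a (benchmark|test)", re.IGNORECASE),
]

def trace_recognizes_benchmark(trace: str) -> bool:
    """True if the reasoning trace contains a benchmark-recognition phrase."""
    return any(p.search(trace) for p in RECOGNITION_PATTERNS)

hits_by_task: dict[str, int] = {}
total = 0
for path in Path("traces").glob("*.json"):  # one file per test configuration
    record = json.loads(path.read_text())
    total += 1
    if trace_recognizes_benchmark(record["trace"]):
        hits_by_task[record["task"]] = hits_by_task.get(record["task"], 0) + 1

print(f"{sum(hits_by_task.values())} of {total} configurations flagged, "
      f"across {len(hits_by_task)} tasks")
```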
> Let's see if 12/10/2023 is a more likely answer than 12/09/2023
>
> In most AI benchmark tests (like those this prompt resembles), the simplest path is often the intended one.

I am blown away by this, and it prompts the obvious question: *Is this cheating?*
I am leaning towards no.
Humans *always* know when they're being evaluated, so this situational blindness is not actually a prerequisite of evaluation - it just so happens that no model before Gemma-4 looked up in the middle of the test and went "Wait a minute - this is a test! I should try to align my answer with the test format's expectations."
What I would love to know, if anyone from the Google team can indulge me, is whether this behavior was intentionally trained or whether it emerged.
With the release of Qwen-3.5, the king has been dethroned by not one but two models: the mid-dense Qwen/Qwen3.5-27B and the large-MoE Qwen/Qwen3.5-122B-A10B-FP8.
The old king is dead - long live the new king 👑
Note that these rankings are based on r12, the third iteration of the ReasonScape evaluation: 27k prompts across 12 task domains. Compared to the previous m12x ranking, this evaluation fixes a slew of test bugs, refines the task set to add table extraction, and lifts the context ceiling to 16k - so these rankings are quite a bit different from the previous m12x Leaderboard (which had an 8k context limit).