What do you mean by "novel"? Google has been doing this for ages

#9
by davidboring - opened

This paper keeps insisting the architecture is "novel," but let's be real:

  • They label it "LM as Vision Encoder", but not quite: you're actually using SAM as the real vision encoder. The "Vision tokenizer" label in the diagram is misleading; it's just the patch-embedding layer from that backbone. Call things what they are.
  • "By customizing the attention mask, visual tokens utilize bidirectional attention while learnable queries adopt causal attention" --> This isn't new. Gemma 3 has been using exactly this kind of split attention-mask setup (bidirectional for image tokens, causal for text) for quite some time now. It's practically standard at this point.
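To make concrete what that split mask looks like, here's a minimal sketch (my own toy construction, not the paper's or Gemma 3's actual code): image tokens sit at the front of the sequence and attend bidirectionally among themselves, while the text/query tokens that follow attend causally to each other and freely to the image block.

```python
import numpy as np

def build_mixed_mask(n_vis: int, n_txt: int) -> np.ndarray:
    """Boolean attention mask: True = attention allowed.
    Layout: [vision tokens | text tokens]."""
    n = n_vis + n_txt
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_vis, :n_vis] = True          # vision <-> vision: bidirectional
    mask[n_vis:, :n_vis] = True          # text sees every vision token
    mask[n_vis:, n_vis:] = np.tril(      # text -> text: causal
        np.ones((n_txt, n_txt), dtype=bool)
    )
    return mask

m = build_mixed_mask(3, 3)
```

Note that the vision rows never attend to text columns here; that matches the usual prefix-LM convention where the image block comes first.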

[image]

Gemma 3's bidirectional mask:

[image]

You're also not breaking new ground by replacing ViT with a different encoder. Gemma-3n already made that move, swapping ViT for MobileNetV5 as its vision backbone.

So... where exactly is the claimed novelty hiding? Because right now it mostly looks like a reshuffling of fairly established ideas.

Unrelated question: what do you use to visualize the sliding-window mask? It looks good.
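The thread doesn't say what tool was used, but such masks are easy to render yourself. A minimal sketch: build the boolean sliding-window mask with numpy and print it; `plt.imshow(mask)` over the same array gives the graphical version.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal mask where each query sees at most `window` recent keys."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    return (j <= i) & (j > i - window)

def render(mask: np.ndarray) -> str:
    """ASCII rendering; matplotlib's imshow gives the same picture."""
    return "\n".join("".join("■" if v else "·" for v in row) for row in mask)

print(render(sliding_window_mask(8, 3)))
```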

China never does novel. They take our novels and claim to have written them, and written them faster.

In Chinese, they call it 旧瓶新酒 ("new wine in old bottles").


Anyone could think of using non-causal attention for images and causal attention for text. The key point of this paper is that it converts the bidirectional, non-causal representation of the image into a causal sequence, which makes mixed image-text processing much more convenient, while preserving more information than CLIP-style early conversion to text.
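The dataflow described in the comment above can be sketched roughly like this (a toy illustration with made-up shapes, not the paper's implementation): learnable queries cross-attend to all image tokens at once, since those are non-causal, but attend to each other under a causal mask, so the output behaves like an ordinary left-to-right sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n_img, n_q, d = 16, 4, 8
img = rng.normal(size=(n_img, d))      # bidirectional image features
queries = rng.normal(size=(n_q, d))    # learnable queries

# Each query cross-attends to ALL image tokens (they are non-causal)...
attn = softmax(queries @ img.T / np.sqrt(d))
summary = attn @ img                   # (n_q, d)

# ...but among themselves the queries use a causal mask, so the
# resulting n_q tokens form an ordinary left-to-right sequence.
causal = np.tril(np.ones((n_q, n_q), dtype=bool))
self_scores = queries @ queries.T / np.sqrt(d)
self_scores[~causal] = -np.inf
seq = softmax(self_scores) @ summary   # causal sequence of visual tokens
```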
