What do you mean by "novel"? Google has been doing this for ages

#9
by davidboring - opened

This paper keeps insisting the architecture is "novel," but let's be real:

  • They label it "LM as Vision Encoder", but not quite: you're actually using SAM as the real vision encoder. The "Vision tokenizer" label in the diagram is misleading; it's just the patch-embedding layer from that backbone. Call things what they are.
  • "By customizing the attention mask, visual tokens utilize bidirectional attention while learnable queries adopt causal attention" --> This isn't new. Gemma 3 has been using exactly this kind of split attention-mask setup (bidirectional for image tokens, causal for text) for quite some time now. It's practically standard at this point.
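To make concrete what that split mask looks like, here's a minimal sketch (my own toy construction, not the paper's or Gemma 3's actual code): image tokens sit at the front of the sequence and attend bidirectionally among themselves, while the text/query tokens that follow attend causally to each other and freely to the image block.

```python
import numpy as np

def build_mixed_mask(n_vis: int, n_txt: int) -> np.ndarray:
    """Boolean attention mask: True = attention allowed.
    Layout: [vision tokens | text tokens]."""
    n = n_vis + n_txt
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_vis, :n_vis] = True          # vision <-> vision: bidirectional
    mask[n_vis:, :n_vis] = True          # text sees every vision token
    mask[n_vis:, n_vis:] = np.tril(      # text -> text: causal
        np.ones((n_txt, n_txt), dtype=bool)
    )
    return mask

m = build_mixed_mask(3, 3)
```

Note that the vision rows never attend to text columns here; that matches the usual prefix-LM convention where the image block comes first.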

[image]

Gemma 3's bidirectional mask:

[image]

You're also not breaking new ground by replacing ViT with a different encoder. Gemma-3n already made that move, swapping ViT for MobileNetV5 as its vision backbone.

So... where exactly is the claimed novelty hiding? Because right now it mostly looks like a reshuffling of fairly established ideas.

Unrelated question: what do you use to visualize the sliding-window mask? It looks good.
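The thread doesn't say what tool was used, but such masks are easy to render yourself. A minimal sketch: build the boolean sliding-window mask with numpy and print it; `plt.imshow(mask)` over the same array gives the graphical version.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal mask where each query sees at most `window` recent keys."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    return (j <= i) & (j > i - window)

def render(mask: np.ndarray) -> str:
    """ASCII rendering; matplotlib's imshow gives the same picture."""
    return "\n".join("".join("■" if v else "·" for v in row) for row in mask)

print(render(sliding_window_mask(8, 3)))
```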

China never does novel. They take our novels and claim to have written them, and written them faster.

In Chinese, they call it 旧瓶新酒 ("new wine in old bottles").


Anyone could think of using non-causal attention for images and causal attention for text. The key point of this paper is that it converts the bidirectional, non-causal representation of the image into a causal sequence, which makes mixed image-text processing much more convenient, while preserving more information than CLIP-style early conversion to text.
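The dataflow described in the comment above can be sketched roughly like this (a toy illustration with made-up shapes, not the paper's implementation): learnable queries cross-attend to all image tokens at once, since those are non-causal, but attend to each other under a causal mask, so the output behaves like an ordinary left-to-right sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n_img, n_q, d = 16, 4, 8
img = rng.normal(size=(n_img, d))      # bidirectional image features
queries = rng.normal(size=(n_q, d))    # learnable queries

# Each query cross-attends to ALL image tokens (they are non-causal)...
attn = softmax(queries @ img.T / np.sqrt(d))
summary = attn @ img                   # (n_q, d)

# ...but among themselves the queries use a causal mask, so the
# resulting n_q tokens form an ordinary left-to-right sequence.
causal = np.tril(np.ones((n_q, n_q), dtype=bool))
self_scores = queries @ queries.T / np.sqrt(d)
self_scores[~causal] = -np.inf
seq = softmax(self_scores) @ summary   # causal sequence of visual tokens
```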
