ValueError: Number of images does not match number of special image tokens in the input text. Got 256 image tokens in the text but 256 tokens from image embeddings.

#91 opened by zml31415

Hello everyone,
I get the error in the title when using inputs_embeds together with pixel_values. Here is a small code example to reproduce it:

import requests
from PIL import Image
from io import BytesIO
from transformers.models.gemma3 import modeling_gemma3
from transformers import AutoProcessor
import torch


model_name = "google/gemma-3-12b-it"
model = modeling_gemma3.Gemma3ForConditionalGeneration.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
device = model.device

processor = AutoProcessor.from_pretrained(model_name, use_fast=True)

img = Image.open(BytesIO(requests.get("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg").content)).convert("RGB")
prompt = ["Analyse and explain the image: <start_of_image>\n"]

inputs = processor(text=prompt, images=img, return_tensors="pt").to(device)

pixel_values = inputs.pixel_values
input_ids = inputs.input_ids

# Passing inputs_embeds instead of input_ids (together with pixel_values) triggers the error.
inputs_embeds = model.get_input_embeddings()(input_ids)
outputs = model(
    inputs_embeds=inputs_embeds,
    # input_ids=input_ids,  # using input_ids instead of inputs_embeds works fine
    pixel_values=pixel_values,
    use_cache=False,
)

The issue is that in modeling_gemma3.py, lines 898 and 899 create a mask that identifies all the placeholders for the outputs of the vision tower (i.e., the multi_modal_projector):

if input_ids is None:
    special_image_mask = inputs_embeds == self.get_input_embeddings()(
        torch.tensor(self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device)
    )

This seems to produce a wrong special_image_mask, since the alternative path (lines 901-903) that uses the input_ids works perfectly fine:

else:
    special_image_mask = (input_ids == self.config.image_token_id).unsqueeze(-1)
    special_image_mask = special_image_mask.expand_as(inputs_embeds).to(inputs_embeds.device)

Because of the if input_ids is None: branch, the later check (line 905) that throws the error

if not is_torchdynamo_compiling() and inputs_embeds[special_image_mask].numel() != image_features.numel():

ends up with mismatched numbers. In my case inputs_embeds[special_image_mask].numel() is 983041 while image_features.numel() is 983040, exactly 1 off. This seems weird to me, since image_features.shape is torch.Size([1, 256, 3840]), so 256 * 3840 gives the expected 983040. And inputs_embeds.shape is torch.Size([1, 269, 3840]); applying a correct mask and accounting for the 13 text tokens, the relevant tensor should also be torch.Size([1, 256, 3840]) and therefore again 983040 elements, but it is 1 more. I don't get why this is.

My current workaround is to feed inputs_embeds as well as input_ids and to disable the check (lines 867 and 868) that forbids passing both, which effectively disables the strange if input_ids is None mask calculation (lines 897 to 899). Then everything works fine. But I cannot keep working with custom modifications to the modeling_gemma3.py code forever :)
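To illustrate the difference between the two mask constructions, here is a toy sketch with hypothetical sizes (vocabulary 10, hidden size 4, with a per-coordinate collision forced on purpose, so these are not the real Gemma weights) showing how the elementwise comparison of the input_ids is None branch can select one extra element compared to the input_ids-based mask:

import torch

torch.manual_seed(0)

# Toy embedding table with hypothetical sizes (not the real Gemma weights).
embedding = torch.nn.Embedding(10, 4)
image_token_id = 7

# Force one coordinate of an unrelated token to coincide with the image token
# embedding, to mimic a chance per-coordinate collision.
with torch.no_grad():
    embedding.weight[2, 0] = embedding.weight[image_token_id, 0]

input_ids = torch.tensor([[1, 7, 7, 7, 2]])   # three image placeholders plus two text tokens
inputs_embeds = embedding(input_ids)          # shape (1, 5, 4)

# Branch taken when input_ids is available: one bool per token, then expanded.
mask_from_ids = (input_ids == image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
print(inputs_embeds[mask_from_ids].numel())   # 3 * 4 = 12

# Branch taken when only inputs_embeds is given: elementwise comparison per coordinate.
image_embed = embedding(torch.tensor(image_token_id, dtype=torch.long))
mask_from_embeds = inputs_embeds == image_embed
print(inputs_embeds[mask_from_embeds].numel())  # 13: one extra element from the forced collision

In the real model, any text token whose embedding happens to share a single coordinate value with the image token embedding would inflate the count in the same way, which would at least be consistent with the exactly-one-off numbers above.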
Is there something wrong with what I am doing, or is there something strange in the modeling_gemma3.py code?

Hi @zml31415,

Thanks for reaching out to us. google/gemma-3-27b-it and google/gemma-3-12b-it are instruction-tuned (IT) models; they follow a specific prompt format and chat template to process your query/prompt, which means any IT Gemma model expects role-based instructions to process your request. Please find the following sample prompt message for your reference:

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

Please adjust your prompt based on the sample prompt message given above. Thanks for your interest in Gemma models.
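For completeness, a minimal generation call using the inputs built above could look like the following (a sketch assuming model and processor are already loaded as in your snippet):

with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(generated_ids[0][prompt_len:], skip_special_tokens=True))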

Thanks.

edit: due to a code mistake while posting, I changed the line inputs_embeds = model.get_input_embeddings()(generated_ids) to inputs_embeds = model.get_input_embeddings()(input_ids)

Thank you for your response, but I feel like your answer is not exactly addressing my question. Let me try to clarify: if I understand the model's code correctly, one can provide the inputs to the model in two ways:
first, via the input_ids coming from the processor, and
second, via inputs_embeds, which I can get from the model's function model.get_input_embeddings()(input_ids).
I already tried apply_chat_template, and it works because internally it uses the input_ids route at the deciding if-statement (lines 897-900, modeling_gemma3.py). That route calculates the special_image_mask in a way that works, even with the code and prompt that I provided; one just needs to disable the assert in lines 867-868. But providing the inputs_embeds, i.e. taking the input_ids is None branch (lines 897-900), does not work. I wonder if this code path got tested, or if there is a bug, or if I overlooked something.
Interestingly, the inputs_embeds are also calculated internally if they are not provided (lines 884-885, modeling_gemma3.py), resulting in an almost identical calculation path (with the exception of lines 897-900).
So my point is that any method (apply_chat_template, input_ids=..., **inputs) that provides the input_ids to the model works fine, but providing the inputs_embeds does not, even though the inputs_embeds are still calculated internally (lines 884-885, modeling_gemma3.py). The deciding if-statement is in lines 897-900, because the calculation of the token mask does something strange when input_ids are not provided but inputs_embeds are. This is the only difference I could track down between using input_ids and inputs_embeds as input for the model together with pixel_values.
If I understood your answer wrong and it is simply a case of "you are not supposed to use our model this way", fine, but why is the inputs_embeds path implemented at all then?
The funny thing is that the input_ids in modeling_gemma3.py are only really used in two places:
in line 902, where the working implementation of the special_image_mask is done,
and, funnily, in line 885, where exactly the same thing is done as in my code: inputs_embeds = model.get_input_embeddings()(input_ids). All further processing is done on those inputs_embeds, yet providing them directly leads to the error in the thread title.
I hope you look into it, since using the chat template isn't really applicable for my application.
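For reference, a possible way to avoid patching modeling_gemma3.py could be to do the image-feature merge manually and then pass only the merged inputs_embeds (without pixel_values), so the if input_ids is None branch is never reached. This is an untested sketch based on what the working branch does, and it assumes get_image_features() is exposed on the model, as it appears to be in modeling_gemma3.py:

# Untested sketch: merge the projected image features into the text embeddings
# ourselves, then call the model with inputs_embeds only (no pixel_values).
inputs_embeds = model.get_input_embeddings()(input_ids)

with torch.no_grad():
    image_features = model.get_image_features(pixel_values)  # projected vision features

# Build the mask exactly like the working input_ids branch (lines 901-903) does.
special_image_mask = (input_ids == model.config.image_token_id).unsqueeze(-1)
special_image_mask = special_image_mask.expand_as(inputs_embeds).to(inputs_embeds.device)

# Scatter the image features into the placeholder positions.
image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)

outputs = model(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    token_type_ids=inputs.get("token_type_ids"),  # if the processor returned them
    use_cache=False,
)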

BTW, I also tried it with the PT model, where the prompt I used is given as an example in the documentation, and I still get the same error. Only the inputs_embeds part differs from that documentation; there, the input_ids are provided to the model. Edit: for clarification, this is the documentation code:

# pip install accelerate

from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/gemma-3-12b-pt"

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)

model = Gemma3ForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

prompt = "<start_of_image> in this image, there is"
model_inputs = processor(text=prompt, images=image, return_tensors="pt")

input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

This error message is a bit misleading at first glance: it says the counts match (256 vs 256) but still throws. The actual issue is almost certainly in how the processor is handling multiple images versus a single image with multiple patches. Gemma 3's vision tokenizer uses a tiling strategy where a single image can expand into multiple <image_soft_token> placeholders depending on resolution, and the processor expects the number of image objects passed in to match the number of <start_of_image> placeholder tokens in your prompt text, not the total expanded patch count. So if you're passing one image and your prompt has one <start_of_image> token, the processor internally expands that to 256 patch tokens; if something in your pipeline is pre-inserting those 256 tokens into the text before the processor runs, you get this collision.

The fix is usually to make sure you're not manually constructing the prompt with expanded image tokens and then also passing the raw image to the processor. Let the AutoProcessor for google/gemma-3-27b-it handle the expansion entirely. Concretely: pass the raw PIL images in the images argument and use only the high-level <start_of_image> placeholder in your text, then call processor(text=..., images=..., return_tensors="pt") and let it do the patch math. If you're in an agentic pipeline where the prompt is being assembled by multiple components before hitting the model (which is increasingly common with multi-agent orchestration frameworks), this is a common point of failure: one agent constructs the prompt with expanded tokens, another passes the image separately, and the processor sees a mismatch. We've run into analogous context-assembly problems in AgentGraph when building pipelines where different agents contribute parts of a multimodal context, and the solution there was enforcing a strict contract about which stage of the pipeline is responsible for token expansion.

Check your transformers version as well; Gemma 3 support landed in transformers 4.50, and there have been follow-up fixes to the multimodal token handling in later releases. Running transformers.__version__ and cross-referencing against the model card's recommended version is worth doing before going deeper into debugging.
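For example:

import transformers

# Compare against the minimum version listed on the model card before digging deeper.
print(transformers.__version__)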
