Fix for large images

#9
by ordisbold - opened

The llm-compressor Tokenizer Bug

When llm-compressor quantizes a model (in this case, to NVFP4), it runs a forward pass over a calibration dataset. To keep memory usage manageable during this calibration phase, the script temporarily caps the processor's sequence length, most commonly to 4096 or 2048 tokens.

The bug in the library is that calling the processor modifies the tokenizer object in place. When the script finishes and saves the quantized model weights to disk, it inadvertently bakes this temporary calibration limit into the tokenizer.json file permanently.
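You can see the mechanism directly with the Hugging Face tokenizers library: whatever truncation setting is active on the `Tokenizer` object at save time gets serialized into tokenizer.json. A minimal sketch (the toy WordLevel vocabulary is illustrative, not the model's actual tokenizer):

```python
import json
import os
import tempfile

from tokenizers import Tokenizer
from tokenizers.models import WordLevel

# Toy tokenizer standing in for the real one; the vocabulary is illustrative.
tok = Tokenizer(WordLevel({"[UNK]": 0}, unk_token="[UNK]"))

# What the calibration pass effectively does to the in-memory object:
tok.enable_truncation(max_length=4096)

# Saving now writes the limit into the file.
path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
tok.save(path)

with open(path) as f:
    saved = json.load(f)
print(saved["truncation"])  # the 4096 limit survives the round trip
```

Any tokenizer saved while that setting is live will carry it forever, which is exactly what happened here.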

This is what triggers the warning in your logs:

Truncation was not explicitly activated but max_length is provided a specific value...

The tokenizer is silently overriding your 262,144-token sequence limit and snipping your 11,844 image tokens down to 4,095.
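You can confirm whether your copy of the model is affected without loading anything heavy: tokenizer.json is plain JSON, so a quick check of its top-level truncation key is enough. A small sketch (the helper name and path are mine, used here only for illustration):

```python
import json

def baked_in_truncation(tokenizer_json_path):
    """Return the truncation limit saved in tokenizer.json, or None if unset."""
    with open(tokenizer_json_path) as f:
        trunc = json.load(f).get("truncation")
    return trunc["max_length"] if trunc else None

# Example (path is a placeholder for your model weights directory):
# baked_in_truncation("/path/to/model/tokenizer.json")  # e.g. 4096 if affected
```

If this returns a number instead of None, the calibration limit was baked in.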

How to Fix tokenizer.json

You need to strip the calibration truncation limit out of the tokenizer file.

  1. Navigate to your downloaded model weights directory.
  2. Open tokenizer.json (note: this is a different file from the tokenizer_config.json you pasted earlier).
  3. Near the very top of the file, look for a block that looks exactly like this:

```json
"truncation": {
  "direction": "Right",
  "max_length": 4096,
  "strategy": "LongestFirst",
  "stride": 0
},
```

  4. Change that entire block to simply:

```json
"truncation": null,
```

  5. Save the file and restart your vLLM server.

This will stop the Hugging Face processor from artificially truncating the sequence, and the [4095] vs [11844] mismatch will disappear.
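If you'd rather not hand-edit the file, the same fix can be scripted. A minimal sketch (the function name and path are mine, not part of any tool):

```python
import json

def clear_truncation(tokenizer_json_path):
    """Replace any saved truncation block in tokenizer.json with null."""
    with open(tokenizer_json_path) as f:
        data = json.load(f)
    if data.get("truncation") is not None:
        data["truncation"] = None
        with open(tokenizer_json_path, "w") as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
        return True   # file was modified
    return False      # nothing to fix

# Example (path is a placeholder for your model weights directory):
# clear_truncation("/path/to/model/tokenizer.json")
```

Back up the original tokenizer.json first if you want a quick way to revert.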

I believe this is expected behavior. Calibration was run at a sequence length of 4096, and behavior outside that range is not guaranteed.
