Extend transformers version compatibility 4.57.x through 5.1.x
Summary
- Fix `extra_special_tokens` list-vs-dict crash on transformers <5.0 (fixes #2)
- Add `rope_scaling` to `text_config` for transformers <5.0 compatibility
- Remove unused `video_processor` from processor attributes to avoid a type-check failure on transformers <5.0
- Override `forward()` to return hidden states directly, bypassing `lm_head`: fixes a silent embedding correctness regression on transformers >=5.0.0 and ensures correct results regardless of whether callers use the high-level methods or the model directly
Details
tokenizer_config.json
`extra_special_tokens` was serialized as a list by transformers 5.0.0rc0. Versions <5.0 call `.keys()` on it, causing an `AttributeError`. Changed it to `{}`; all 13 tokens are already registered in `tokenizer.json` `added_tokens`.
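The crash can be reproduced in isolation. A minimal sketch (token names are illustrative; only the list-vs-dict shape matters):

```python
# extra_special_tokens as serialized by transformers 5.0.0rc0 (a list):
extra_special_tokens_list = ["<tok_a>", "<tok_b>"]

# extra_special_tokens after the fix (an empty dict; the tokens themselves
# live in tokenizer.json added_tokens):
extra_special_tokens_dict = {}

# transformers <5.0 does roughly this, which fails on a list:
try:
    names = extra_special_tokens_list.keys()
except AttributeError:
    names = None  # AttributeError: 'list' object has no attribute 'keys'

# the dict form supports .keys() and iterates to nothing:
assert list(extra_special_tokens_dict.keys()) == []
```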
config.json
Added a `rope_scaling` key to `text_config` alongside the existing `rope_parameters`. Transformers <5.0 reads `rope_scaling`; >=5.0 reads `rope_parameters`. Both now find what they need.
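Schematically, `text_config` now carries the same RoPE settings under both key names (the values below are placeholders, not the model's actual parameters):

```json
{
  "text_config": {
    "rope_parameters": { "rope_type": "default" },
    "rope_scaling": { "rope_type": "default" }
  }
}
```

Since older transformers ignores unknown keys, the duplicated entry is harmless on >=5.0.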
processing_qwen3_vl_nemotron_embed.py
Overrode `attributes` to `["image_processor", "tokenizer"]` and set `video_processor_class = None`. This model doesn't use video; removing it avoids a `BaseVideoProcessor` type-check failure on transformers <5.0.
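The change amounts to two class attributes on the custom processor. A minimal sketch, with a stub standing in for `transformers.ProcessorMixin` (the real class and its default attribute values differ):

```python
class ProcessorMixin:
    """Stub for illustration only; the real base class lives in transformers."""
    attributes = ["image_processor", "tokenizer", "video_processor"]
    video_processor_class = "BaseVideoProcessor"

class Qwen3VLNemotronEmbedProcessor(ProcessorMixin):
    # Drop video_processor entirely: transformers <5.0 type-checks each
    # listed attribute, and this model never uses video inputs.
    attributes = ["image_processor", "tokenizer"]
    video_processor_class = None
```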
modeling_qwen3_vl_nemotron_embed.py
Three related changes:
1. Override `forward()` on `Qwen3VLNemotronEmbedForConditionalGeneration`: it calls `self.model()` directly and returns its output, bypassing the language modeling head (`lm_head`). This is an embedding model, not a generation model, so returning logits was misleading. More importantly, this ensures `model(**inputs).last_hidden_state` gives correct embeddings whether callers use the high-level methods (`forward_queries`, `forward_images`) or call the model directly with the processor.
2. Skip the final RMSNorm in `Qwen3VLNemotronEmbedTextModel.forward()`: this model extracts embeddings from the pre-norm last-layer output (masked and L2-normalized downstream). The norm weights remain in the checkpoint for architecture compatibility but are not applied.
3. Simplify `_extract_embeddings`: it uses `outputs.last_hidden_state` directly instead of a forward hook on the last decoder layer. This became possible after the two changes above made `forward()` return the correct pre-norm hidden states.
Root cause: in transformers 5.0.0, the `@check_model_inputs` decorator was replaced by `@can_return_tuple`, changing the semantics of the `hidden_states` tuple: the last element became the post-norm output instead of the pre-norm last-layer output.
The original code read `hidden_states[-1]`, causing a silent correctness regression (embeddings were wrong but no error was raised, max diff ~0.6). By overriding `forward()` to return hidden states directly from the inner model, we bypass the decorator-managed `hidden_states` entirely.
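Schematically (the element names are placeholders, not real tensors), the indexing that silently broke:

```python
# transformers <5.0: hidden_states[-1] is the pre-norm last-layer output
hidden_states_v4 = ("embeddings", "layer1_out", "layerN_out_prenorm")

# transformers >=5.0: the final element is the post-norm output,
# so hidden_states[-1] returns a different tensor with no error raised
hidden_states_v5 = ("embeddings", "layer1_out", "layerN_out_postnorm")

assert hidden_states_v4[-1] != hidden_states_v5[-1]
```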
Tested versions
All produce exact zero diff against golden reference embeddings (both text queries and images):
| transformers | Status |
|---|---|
| 4.57.6 | PASS |
| 5.0.0rc0 | PASS |
| 5.0.0 | PASS |
| 5.1.0 | PASS |
Hey there! Thank you for your work updating this model's compatibility with newer versions of transformers. If it's not too much to ask, could you do the same for the "Nemotron Parse v1.1" model, to allow serving via transformers in addition to vLLM? I made a thread about this, which you can find here:
Thanks for your pull request. I will close the discussion after merging this pull request.
https://huggingface.co/nvidia/nemotron-colembed-vl-4b-v2/discussions/2
Also, I plan to contribute an implementation of this model to vLLM after merging.
https://github.com/vllm-project/vllm/pull/34398