Extend transformers version compatibility 4.57.x through 5.1.x

#3

Summary

  • Fix extra_special_tokens list-vs-dict crash on transformers <5.0 (fixes #2)
  • Add rope_scaling to text_config for transformers <5.0 compatibility
  • Remove unused video_processor from processor attributes to avoid type-check failure on transformers <5.0
  • Override forward() to return hidden states directly, bypassing lm_head; this fixes a silent embedding-correctness regression on transformers >=5.0.0 and ensures correct results regardless of whether callers use the high-level methods or the model directly

Details

tokenizer_config.json

extra_special_tokens was serialized as a list by transformers 5.0.0rc0. Versions <5.0 call .keys() on it, causing an AttributeError. Changed it to {}: all 13 tokens are already registered in tokenizer.json under added_tokens.
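A minimal sketch of the failure mode. `register_extra_special_tokens` is a hypothetical stand-in for the transformers <5.0 code path, and the token strings are illustrative:

```python
# Hypothetical stand-in for the transformers <5.0 code path, which assumes
# extra_special_tokens is a dict mapping attribute names to token strings.
def register_extra_special_tokens(extra_special_tokens):
    return list(extra_special_tokens.keys())

register_extra_special_tokens({})  # the fixed config value: no crash
try:
    # the value serialized by 5.0.0rc0 was a list, which has no .keys()
    register_extra_special_tokens(["<img>", "<vision_start>"])
except AttributeError as exc:
    print(exc)  # 'list' object has no attribute 'keys'
```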

config.json

Added rope_scaling key to text_config alongside existing rope_parameters. Transformers <5.0 reads rope_scaling; >=5.0 reads rope_parameters. Both now find what they need.
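A sketch of the dual-key arrangement. The RoPE values shown are illustrative, not copied from the actual config, and `read_rope_config` is a hypothetical helper modeling which key each version branch reads:

```python
# Illustrative values only; the real text_config carries the model's
# actual RoPE settings.
text_config = {
    "rope_parameters": {"rope_type": "default"},  # read by transformers >=5.0
}
# Mirror the same settings under the legacy key read by transformers <5.0:
text_config["rope_scaling"] = text_config["rope_parameters"]

def read_rope_config(cfg, transformers_major):
    # Each branch reads only its own key, so both must be present.
    key = "rope_parameters" if transformers_major >= 5 else "rope_scaling"
    return cfg[key]
```

With both keys present, old and new versions resolve the same settings.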

processing_qwen3_vl_nemotron_embed.py

Overrode attributes to ["image_processor", "tokenizer"] and set video_processor_class = None. This model doesn't use video, and removing the entry avoids a BaseVideoProcessor type-check failure on transformers <5.0.
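A self-contained sketch of the idea. The real class derives from transformers' ProcessorMixin; the init loop below is a simplified stand-in for its attribute wiring:

```python
class SketchProcessor:
    # After the fix: only the two components this embedding model uses.
    # Dropping "video_processor" means transformers <5.0 never runs its
    # BaseVideoProcessor type check against this model's processor.
    attributes = ["image_processor", "tokenizer"]
    video_processor_class = None  # explicitly opt out of video handling

    def __init__(self, image_processor, tokenizer):
        # Simplified stand-in for ProcessorMixin's attribute wiring.
        for name, component in zip(self.attributes, (image_processor, tokenizer)):
            setattr(self, name, component)

processor = SketchProcessor(image_processor=object(), tokenizer=object())
assert not hasattr(processor, "video_processor")
```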

modeling_qwen3_vl_nemotron_embed.py

Three related changes:

  1. Override forward() on Qwen3VLNemotronEmbedForConditionalGeneration: it calls self.model() directly and returns its output, bypassing the language modeling head (lm_head). This is an embedding model, not a generation model, so returning logits was misleading. More importantly, this ensures model(**inputs).last_hidden_state gives correct embeddings whether callers use the high-level methods (forward_queries, forward_images) or call the model directly with the processor.

  2. Skip the final RMSNorm in Qwen3VLNemotronEmbedTextModel.forward(): this model extracts embeddings from the pre-norm last-layer output (masked and L2-normalized downstream). The norm weights remain in the checkpoint for architecture compatibility but are not applied.

  3. Simplify _extract_embeddings: it now uses outputs.last_hidden_state directly instead of a forward hook on the last decoder layer. This became possible once the two changes above made forward() return the pre-norm hidden states.
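The three changes above can be sketched with plain Python lists standing in for tensors. Class and method names mirror the patch, but every body below is an assumed simplification, not the real implementation:

```python
import math

def decoder_layers(x):
    # Stand-in for the decoder stack; returns the last layer's hidden states.
    return [v * 2.0 for v in x]

def rms_norm(h):
    # Stand-in for the final RMSNorm: kept in the checkpoint, never applied.
    scale = math.sqrt(sum(v * v for v in h) / len(h)) or 1.0
    return [v / scale for v in h]

class TextModel:
    def forward(self, x):
        h = decoder_layers(x)
        # Change 2: skip the final RMSNorm; embeddings use the pre-norm output.
        return {"last_hidden_state": h}

class EmbedModel:
    def __init__(self):
        self.model = TextModel()

    def forward(self, x):
        # Change 1: return the inner model's output directly, bypassing lm_head.
        return self.model.forward(x)

def extract_embeddings(model, x):
    # Change 3: read last_hidden_state directly instead of a forward hook.
    return model.forward(x)["last_hidden_state"]

extract_embeddings(EmbedModel(), [1.0, 2.0])  # -> [2.0, 4.0]
```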

Root cause: in transformers 5.0.0, the @check_model_inputs decorator was replaced by @can_return_tuple, changing the semantics of the hidden_states tuple: the last element became the post-norm output instead of the pre-norm last-layer output. The original code read hidden_states[-1], causing a silent correctness regression (embeddings were wrong but no error was raised; max diff ~0.6). Overriding forward() to return hidden states directly from the inner model bypasses the decorator-managed hidden_states entirely.
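A toy reproduction of the semantics change. The numbers are illustrative; only the shape of the hidden_states collection matters:

```python
def run_layers(x, append_post_norm):
    # Stand-in for the decorator-managed hidden_states collection.
    pre_norm = [v * 2.0 for v in x]          # last decoder layer output
    post_norm = [v * 0.5 for v in pre_norm]  # stand-in for the final norm
    hidden_states = [list(x), pre_norm]
    if append_post_norm:
        # transformers >=5.0: the post-norm output becomes the last element
        hidden_states.append(post_norm)
    return hidden_states

old_last = run_layers([1.0], append_post_norm=False)[-1]  # pre-norm: correct
new_last = run_layers([1.0], append_post_norm=True)[-1]   # post-norm: wrong
assert old_last != new_last  # same indexing, silently different embeddings
```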

Tested versions

All produce exact zero diff against golden reference embeddings (both text queries and images):

  transformers   Status
  4.57.6         PASS
  5.0.0rc0       PASS
  5.0.0          PASS
  5.1.0          PASS
nvidia-oliver-holworthy changed pull request status to open

Hey there! Thank you for your work updating this model's compatibility with newer versions of transformers. If it's not too much to ask, could you do the same for the "Nemotron Parse v1.1" model, to allow serving via transformers in addition to vLLM? I made a thread about this, which you can find here.

Thanks for your pull request. I will close the linked discussion after merging this pull request.
https://huggingface.co/nvidia/nemotron-colembed-vl-4b-v2/discussions/2

Also, I plan to contribute this model's implementation to vLLM after merging.
https://github.com/vllm-project/vllm/pull/34398

nemotron-colembed-vl-8b-v2 also needs this patch.

nvidia-oliver-holworthy changed pull request status to merged
