Extend transformers version compatibility 4.57.x through 5.1.x
Summary
- Fix `extra_special_tokens` list-vs-dict crash on transformers <5.0 (fixes #2)
- Add `rope_scaling` to `text_config` for transformers <5.0 compatibility
- Remove unused `video_processor` from processor attributes to avoid a type-check failure on transformers <5.0
- Override `forward()` to return hidden states directly, bypassing `lm_head`: fixes a silent embedding correctness regression on transformers >=5.0.0 and ensures correct results regardless of whether callers use the high-level methods or the model directly
Details
tokenizer_config.json
`extra_special_tokens` was serialized as a list by transformers 5.0.0rc0. Versions <5.0 call `.keys()` on it, causing an `AttributeError`. Changed it to `{}`; all 13 tokens are already registered in `tokenizer.json` `added_tokens`.
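The crash can be reproduced in isolation. A minimal sketch (token names are illustrative; only the list-vs-dict shape matters):

```python
# extra_special_tokens as serialized by transformers 5.0.0rc0 (a list):
extra_special_tokens_list = ["<tok_a>", "<tok_b>"]

# extra_special_tokens after the fix (an empty dict; the tokens themselves
# live in tokenizer.json added_tokens):
extra_special_tokens_dict = {}

# transformers <5.0 does roughly this, which fails on a list:
try:
    names = extra_special_tokens_list.keys()
except AttributeError:
    names = None  # AttributeError: 'list' object has no attribute 'keys'

# the dict form supports .keys() and iterates to nothing:
assert list(extra_special_tokens_dict.keys()) == []
```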
config.json
Added a `rope_scaling` key to `text_config` alongside the existing `rope_parameters`. Transformers <5.0 reads `rope_scaling`; >=5.0 reads `rope_parameters`. Both now find what they need.
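Schematically, `text_config` now carries the same RoPE settings under both key names (the values below are placeholders, not the model's actual parameters):

```json
{
  "text_config": {
    "rope_parameters": { "rope_type": "default" },
    "rope_scaling": { "rope_type": "default" }
  }
}
```

Since older transformers ignores unknown keys, the duplicated entry is harmless on >=5.0.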
processing_qwen3_vl_nemotron_embed.py
Overrode `attributes` to `["image_processor", "tokenizer"]` and set `video_processor_class = None`. This model doesn't use video; removing it avoids a `BaseVideoProcessor` type-check failure on transformers <5.0.
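The change amounts to two class attributes on the custom processor. A minimal sketch, with a stub standing in for `transformers.ProcessorMixin` (the real class and its default attribute values differ):

```python
class ProcessorMixin:
    """Stub for illustration only; the real base class lives in transformers."""
    attributes = ["image_processor", "tokenizer", "video_processor"]
    video_processor_class = "BaseVideoProcessor"

class Qwen3VLNemotronEmbedProcessor(ProcessorMixin):
    # Drop video_processor entirely: transformers <5.0 type-checks each
    # listed attribute, and this model never uses video inputs.
    attributes = ["image_processor", "tokenizer"]
    video_processor_class = None
```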
modeling_qwen3_vl_nemotron_embed.py
Three related changes:
1. Override `forward()` on `Qwen3VLNemotronEmbedForConditionalGeneration`: it calls `self.model()` directly and returns its output, bypassing the language modeling head (`lm_head`). This is an embedding model, not a generation model, so returning logits was misleading. More importantly, this ensures `model(**inputs).last_hidden_state` gives correct embeddings whether callers use the high-level methods (`forward_queries`, `forward_images`) or call the model directly with the processor.
2. Skip the final RMSNorm in `Qwen3VLNemotronEmbedTextModel.forward()`: this model extracts embeddings from the pre-norm last-layer output (masked and L2-normalized downstream). The norm weights remain in the checkpoint for architecture compatibility but are not applied.
3. Simplify `_extract_embeddings`: it uses `outputs.last_hidden_state` directly instead of a forward hook on the last decoder layer. This became possible after the two changes above made `forward()` return the correct pre-norm hidden states.
Root cause: in transformers 5.0.0, the `@check_model_inputs` decorator was replaced by `@can_return_tuple`, changing the semantics of the `hidden_states` tuple: the last element became the post-norm output instead of the pre-norm last-layer output.
The original code read `hidden_states[-1]`, causing a silent correctness regression (embeddings were wrong but no error was raised, max diff ~0.6). By overriding `forward()` to return hidden states directly from the inner model, we bypass the decorator-managed `hidden_states` entirely.
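Schematically (the element names are placeholders, not real tensors), the indexing that silently broke:

```python
# transformers <5.0: hidden_states[-1] is the pre-norm last-layer output
hidden_states_v4 = ("embeddings", "layer1_out", "layerN_out_prenorm")

# transformers >=5.0: the final element is the post-norm output,
# so hidden_states[-1] returns a different tensor with no error raised
hidden_states_v5 = ("embeddings", "layer1_out", "layerN_out_postnorm")

assert hidden_states_v4[-1] != hidden_states_v5[-1]
```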
Tested versions
All produce exact zero diff against golden reference embeddings (both text queries and images):
| transformers | Status |
|---|---|
| 4.57.6 | PASS |
| 5.0.0rc0 | PASS |
| 5.0.0 | PASS |
| 5.1.0 | PASS |
Hey there! Thank you for your work updating this model's compatibility with newer versions of transformers. If it's not too much to ask, could you do the same for the "Nemotron Parse v1.1" model, to allow serving via transformers in addition to vLLM? I made a thread about this, which you can find here:
Thanks for your pull request. I will close the discussion after merging this pull request.
https://huggingface.co/nvidia/nemotron-colembed-vl-4b-v2/discussions/2
Also, I plan to contribute an implementation of this model to vLLM after merging.
https://github.com/vllm-project/vllm/pull/34398