[Bug] google/gemma-4-E4B-it tokenizer fails with `AttributeError: 'list' object has no attribute 'keys'` — `extra_special_tokens` format incompatibility with transformers v4
## Summary

Loading the tokenizer for `google/gemma-4-E4B-it` fails with an `AttributeError` on transformers v4.x. The model's `tokenizer_config.json` stores `extra_special_tokens` as a list, but transformers v4 expects a dict. The list format was introduced in transformers v5; users on v4 (which is still what a plain `pip install transformers` resolves to in many environments) hit this crash with no clear error message.
## Error

```
File .../transformers/tokenization_utils_base.py:1181
    self.SPECIAL_TOKENS_ATTRIBUTES = self.SPECIAL_TOKENS_ATTRIBUTES + list(special_tokens.keys())
AttributeError: 'list' object has no attribute 'keys'
```
## Root cause

The `tokenizer_config.json` for this model contains:

```json
"extra_special_tokens": ["<token1>", "<token2>", ...]
```

Transformers v4 expects this field to be a dict. Transformers v5 handles both formats, but v5 is not yet the default install for most users.
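The failure is easy to reproduce in isolation: the v4 code path concatenates its attribute list with `list(special_tokens.keys())`, which assumes a dict. A minimal standalone sketch (no transformers needed; names are illustrative):

```python
# Mirrors the failing line in tokenization_utils_base.py, which assumes
# special_tokens is a dict and calls .keys() on it.
SPECIAL_TOKENS_ATTRIBUTES = ["bos_token", "eos_token"]

def extend_attributes(special_tokens):
    return SPECIAL_TOKENS_ATTRIBUTES + list(special_tokens.keys())

print(extend_attributes({"image_token": "<image>"}))  # dict form: fine

try:
    extend_attributes(["<token1>", "<token2>"])       # list form: crashes
except AttributeError as err:
    print(err)  # 'list' object has no attribute 'keys'
```

This is exactly why the error surfaces as an `AttributeError` rather than a meaningful config-validation message.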
## Environment

- `transformers==4.x` (any version before v5)
- Python 3.12 / 3.13
- macOS (Apple Silicon) — also reproducible on Linux
## Workaround (patch the cached config)

Run this once before loading the model:

```python
import json
from pathlib import Path

from huggingface_hub import hf_hub_download

model_id = "google/gemma-4-E4B-it"
config_path = Path(hf_hub_download(model_id, "tokenizer_config.json"))

with open(config_path) as f:
    config = json.load(f)

# Transformers v4 expects a dict here; drop the v5-style list.
if isinstance(config.get("extra_special_tokens"), list):
    config["extra_special_tokens"] = {}
    with open(config_path, "w") as f:
        json.dump(config, f)
    print("Patched successfully")
```

After patching, loading works as usual:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```
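Note that the cached file is re-downloaded whenever the repo revision changes, which silently undoes the patch. A minimal sketch of the same patch as a reusable function (hypothetical helper name, same logic as above) makes it cheap to re-run before every load:

```python
import json
from pathlib import Path

def patch_extra_special_tokens(config_path):
    """Replace a v5-style list value with an empty dict.

    Returns True if the file was modified, False if it was already fine.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    if isinstance(config.get("extra_special_tokens"), list):
        config["extra_special_tokens"] = {}
        path.write_text(json.dumps(config))
        return True
    return False
```

Call it with the path returned by `hf_hub_download(model_id, "tokenizer_config.json")` immediately before `AutoTokenizer.from_pretrained`.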
## Proper fix (either of the following)

- Update `tokenizer_config.json` in this repo to store `extra_special_tokens` as a dict (or remove the field if empty), so v4 users are not broken by default.
- Or, add backward-compatible handling in transformers v4 to accept both list and dict formats for `extra_special_tokens`.
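For the transformers-side option, one possible shape for the backward-compatible handling (hypothetical helper and key scheme, not the actual v5 implementation):

```python
def normalize_extra_special_tokens(value):
    """Coerce extra_special_tokens into the dict form the v4 code expects.

    Hypothetical key scheme: list entries are assigned generated names.
    """
    if value is None:
        return {}
    if isinstance(value, list):
        return {f"extra_special_token_{i}": tok for i, tok in enumerate(value)}
    return dict(value)
```

Applying this before `list(special_tokens.keys())` in `tokenization_utils_base.py` would make both config formats load cleanly on v4.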