[Bug] google/gemma-4-E4B-it tokenizer fails with `AttributeError: 'list' object has no attribute 'keys'` — `extra_special_tokens` format incompatibility with transformers v4
## Summary

Loading the tokenizer for `google/gemma-4-E4B-it` fails with an `AttributeError` on transformers v4.x. The model's `tokenizer_config.json` stores `extra_special_tokens` as a list, but transformers v4 expects a dict. The list format was introduced in transformers v5; users on v4 (which is still what a plain `pip install transformers` resolves to in many environments) hit this crash with no clear error message.
## Error

```
File .../transformers/tokenization_utils_base.py:1181
    self.SPECIAL_TOKENS_ATTRIBUTES = self.SPECIAL_TOKENS_ATTRIBUTES + list(special_tokens.keys())
AttributeError: 'list' object has no attribute 'keys'
```
## Root cause

The `tokenizer_config.json` for this model contains:

```json
"extra_special_tokens": ["<token1>", "<token2>", ...]
```

Transformers v4 expects this field to be a dict. Transformers v5 handles both formats, but v5 is not yet the default install for most users.
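The failure is easy to reproduce in isolation: the v4 code path concatenates its attribute list with `list(special_tokens.keys())`, which assumes a dict. A minimal standalone sketch (no transformers needed; names are illustrative):

```python
# Mirrors the failing line in tokenization_utils_base.py, which assumes
# special_tokens is a dict and calls .keys() on it.
SPECIAL_TOKENS_ATTRIBUTES = ["bos_token", "eos_token"]

def extend_attributes(special_tokens):
    return SPECIAL_TOKENS_ATTRIBUTES + list(special_tokens.keys())

print(extend_attributes({"image_token": "<image>"}))  # dict form: fine

try:
    extend_attributes(["<token1>", "<token2>"])       # list form: crashes
except AttributeError as err:
    print(err)  # 'list' object has no attribute 'keys'
```

This is exactly why the error surfaces as an `AttributeError` rather than a meaningful config-validation message.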
## Environment

- `transformers==4.x` (any version before v5)
- Python 3.12 / 3.13
- macOS (Apple Silicon) — also reproducible on Linux
## Workaround (patch the cached config)

Run this once before loading the model:

```python
import json
from pathlib import Path

from huggingface_hub import hf_hub_download

model_id = "google/gemma-4-E4B-it"
config_path = Path(hf_hub_download(model_id, "tokenizer_config.json"))

with open(config_path) as f:
    config = json.load(f)

# Transformers v4 expects a dict here; drop the v5-style list.
if isinstance(config.get("extra_special_tokens"), list):
    config["extra_special_tokens"] = {}
    with open(config_path, "w") as f:
        json.dump(config, f)
    print("Patched successfully")
```

After patching, loading works as usual:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```
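Note that the cached file is re-downloaded whenever the repo revision changes, which silently undoes the patch. A minimal sketch of the same patch as a reusable function (hypothetical helper name, same logic as above) makes it cheap to re-run before every load:

```python
import json
from pathlib import Path

def patch_extra_special_tokens(config_path):
    """Replace a v5-style list value with an empty dict.

    Returns True if the file was modified, False if it was already fine.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    if isinstance(config.get("extra_special_tokens"), list):
        config["extra_special_tokens"] = {}
        path.write_text(json.dumps(config))
        return True
    return False
```

Call it with the path returned by `hf_hub_download(model_id, "tokenizer_config.json")` immediately before `AutoTokenizer.from_pretrained`.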
## Proper fix (either of the following)

- Update `tokenizer_config.json` in this repo to store `extra_special_tokens` as a dict (or remove the field if empty), so v4 users are not broken by default.
- Or, add backward-compatible handling in transformers v4 to accept both list and dict formats for `extra_special_tokens`.
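For the transformers-side option, one possible shape for the backward-compatible handling (hypothetical helper and key scheme, not the actual v5 implementation):

```python
def normalize_extra_special_tokens(value):
    """Coerce extra_special_tokens into the dict form the v4 code expects.

    Hypothetical key scheme: list entries are assigned generated names.
    """
    if value is None:
        return {}
    if isinstance(value, list):
        return {f"extra_special_token_{i}": tok for i, tok in enumerate(value)}
    return dict(value)
```

Applying this before `list(special_tokens.keys())` in `tokenization_utils_base.py` would make both config formats load cleanly on v4.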