KeyError: 'deepseek_v4'

#1
by lefromage - opened

used PR: https://github.com/ml-explore/mlx-lm/pull/1192
and also: uv pip install 'git+https://github.com/huggingface/transformers.git'

yet still getting an error:

mlx_lm.server --model mlx-community/DeepSeek-V4-Flash-2bit-DQ --chat-template-args '{"enable_thinking":false}' --host 0.0.0.0 --port 8080 --max-tokens 8192 --temp 0.0

2026-04-26 18:45:13,186 - INFO - HTTP Request: GET https://huggingface.co/api/models/mlx-community/DeepSeek-V4-Flash-2bit-DQ/revision/main "HTTP/1.1 200 OK"
Fetching 25 files: 100%|██████████| 25/25 [00:00<00:00, 257635.38it/s]
Download complete: : 0.00B [00:00, ?B/s] | 0/25 [00:00<?, ?it/s]
/Users/xjxr170/mlx/mlx-lm-PR-1192-2026-04-26/mlx_lm/server.py:1723: UserWarning: mlx_lm.server is not recommended for production as it only implements basic security checks.
warnings.warn(
2026-04-26 18:45:13,261 - INFO - Starting httpd at 0.0.0.0 on port 8080...
[transformers] You are using a model of type deepseek_v4 to instantiate a model of type ``. This may be expected if you are loading a checkpoint that shares a subset of the architecture (e.g., loading a sam2_video checkpoint into Sam2Model), but is otherwise not supported and can yield errors. Please verify that the checkpoint is compatible with the model you are instantiating.
[transformers] PreTrainedConfig got key=rope_scaling in kwargs but hasn't set it as attribute. For RoPE standardization you need to set self.rope_parameters in model's config.
Exception in thread Thread-1 (_generate):
Traceback (most recent call last):
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 405, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 105, in getitem
raise KeyError(key)
KeyError: 'deepseek_v4'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 686, in from_pretrained
config = AutoConfig.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 407, in from_pretrained
raise ValueError(
ValueError: The checkpoint you are trying to load has model type deepseek_v4 but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command pip install --upgrade transformers. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command pip install git+https://github.com/huggingface/transformers.git

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/xjxr170/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/Users/xjxr170/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "/Users/xjxr170/mlx/mlx-lm-PR-1192-2026-04-26/mlx_lm/server.py", line 695, in _generate
self.model_provider.load_default()
File "/Users/xjxr170/mlx/mlx-lm-PR-1192-2026-04-26/mlx_lm/server.py", line 385, in load_default
self.load("default_model", None, "default_model")
File "/Users/xjxr170/mlx/mlx-lm-PR-1192-2026-04-26/mlx_lm/server.py", line 394, in load
self._load(*model_key)
File "/Users/xjxr170/mlx/mlx-lm-PR-1192-2026-04-26/mlx_lm/server.py", line 349, in _load
model, tokenizer = load(
^^^^^
File "/Users/xjxr170/mlx/mlx-lm-PR-1192-2026-04-26/mlx_lm/utils.py", line 542, in load
tokenizer = load_tokenizer(
^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/mlx-lm-PR-1192-2026-04-26/mlx_lm/utils.py", line 493, in load_tokenizer
return _load_tokenizer(
^^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/mlx-lm-PR-1192-2026-04-26/mlx_lm/tokenizer_utils.py", line 614, in load
tokenizer = AutoTokenizer.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 690, in from_pretrained
config = PreTrainedConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 653, in from_pretrained
return cls.from_dict(config_dict, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 830, in from_dict
config = cls(**config_dict)
^^^^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/huggingface_hub/dataclasses.py", line 275, in init_with_validate
initial_init(self, *args, **kwargs) # type: ignore [call-arg]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/huggingface_hub/dataclasses.py", line 190, in init
self.post_init(**additional_kwargs)
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 276, in post_init
kwargs = self.convert_rope_params_to_dict(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/modeling_rope_utils.py", line 724, in convert_rope_params_to_dict
self.standardize_rope_params()
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/modeling_rope_utils.py", line 758, in standardize_rope_params
self.rope_parameters.setdefault("original_max_position_embeddings", self.max_position_embeddings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 425, in getattribute
return super().getattribute(key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PreTrainedConfig' object has no attribute 'max_position_embeddings'

MLX Community org

Please install this transformers PR from source

pip install git+https://github.com/huggingface/transformers.git@refs/pull/45643/head

MLX Community org

Step — Convert diagnosis into likely GitHub issue framing

Yes — I think this is issue-worthy.

Proposed bug summary

Title
mlx_lm.server fails loading DeepSeek-V4-Flash-2bit-DQ despite PR #1192 due to tokenizer/config path using unsupported AutoConfig ("deepseek_v4")

Core bug claim

PR #1192 appears insufficient for server inference because mlx_lm.server still routes tokenizer loading through:

mlx_lm/tokenizer_utils.py:614
AutoTokenizer.from_pretrained(...)

which triggers:

KeyError: 'deepseek_v4'

and later:

AttributeError:
PreTrainedConfig has no attribute max_position_embeddings

Why it looks like an mlx-lm bug

  • weights download succeeds
  • server starts
  • failure occurs in tokenizer/config dispatch before inference
  • model-side PR applied
  • failure appears in server path, not model fetch

Suspected root cause

mlx_lm.server appears to still rely on the unsupported Transformers AutoConfig path for DeepSeek-V4, even though DeepSeek indicates V4 is not yet adapted to Transformers. (huggingface.co)

Possible need:

  • DeepSeek-V4-specific tokenizer bypass
  • avoid AutoConfig lookup for deepseek_v4

Minimal repro

mlx_lm.server \
  --model mlx-community/DeepSeek-V4-Flash-2bit-DQ \
  --chat-template-args '{"enable_thinking":false}' \
  --host 0.0.0.0 \
  --port 8080 \
  --max-tokens 8192 \
  --temp 0.0

Suggested maintainer focus

Inspect:

  • mlx_lm/tokenizer_utils.py line 614
  • the tokenizer loading path in mlx_lm/utils.py

Possible dispatch issue: the DeepSeek-V4 model path is supported, but the DeepSeek-V4 tokenizer path is not.

That is a clean bug report.

One extra diagnostic before filing (important)

Run this:

python - <<'PY'
from transformers.models.auto.configuration_auto import CONFIG_MAPPING
print("deepseek_v4" in CONFIG_MAPPING)
PY

If it prints False, that strengthens the issue materially; I would include the result in the bug report.

My assessment: high probability this is a real integration bug, not user error.

Next, if you want, we can test whether there is a likely one-line local patch around line 614.

Step — Likely local patch experiment (diagnostic only)

Known state: the probable choke point is:

tokenizer_utils.py line 614
AutoTokenizer.from_pretrained(...)

Most likely one-line diagnostic patch to test

Goal: determine whether AutoTokenizer dispatch is the sole blocker.

Candidate patch A (lowest-risk diagnostic)

If the line resembles:

tokenizer = AutoTokenizer.from_pretrained(path, ...)

test adding:

trust_remote_code=True

so:

tokenizer = AutoTokenizer.from_pretrained(
    path,
    trust_remote_code=True,
)

Why only diagnostic

I do not expect this to fully fix it, but if the behavior changes, it narrows the failure mode.

Candidate patch B (more interesting)

Bypass config introspection.

If the code is doing something like:

AutoTokenizer.from_pretrained(path)

try:

AutoTokenizer.from_pretrained(
    path,
    use_fast=False,
    trust_remote_code=True,
)

Sometimes the fast-tokenizer path trips on unsupported configs.

Candidate patch C (most likely real workaround)

Load the tokenizer directly from the repo artifacts, avoiding AutoConfig.

If model repo has:

  • tokenizer.json
  • tokenizer_config.json

then the AutoTokenizer logic can be bypassed entirely, as in the sketch below.

That may be actual fix territory.
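
A minimal sketch of what that bypass could look like, assuming the repo ships tokenizer.json and tokenizer_config.json (the repo id comes from the command above; everything else here is illustrative, not mlx-lm's actual code):

# Hypothetical bypass: load the tokenizer straight from tokenizer.json,
# skipping AutoConfig's model_type lookup entirely.
from huggingface_hub import snapshot_download
from transformers import PreTrainedTokenizerFast

repo_path = snapshot_download(
    "mlx-community/DeepSeek-V4-Flash-2bit-DQ",
    allow_patterns=["tokenizer.json", "tokenizer_config.json"],
)
# Caveat: constructing from tokenizer.json alone may drop chat-template and
# special-token metadata that normally comes from tokenizer_config.json.
tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"{repo_path}/tokenizer.json")
print(tokenizer.encode("hello"))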

What I would test first (single best experiment)

Patch only:

trust_remote_code=True, use_fast=False

If the same

KeyError: 'deepseek_v4'

appears, then the issue is deeper architecture registration, not tokenizer flavor. Very informative either way.
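
For context, "registration" here means mapping the checkpoint's model_type string onto a config class so the CONFIG_MAPPING lookup succeeds. A toy sketch of that mechanism (the stub config class is a made-up placeholder purely to illustrate; a real fix needs the actual DeepSeek-V4 classes inside transformers):

# Illustrative only: how model_type -> config registration works.
from transformers import AutoConfig, PretrainedConfig  # newer builds also name this PreTrainedConfig
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

class DeepseekV4StubConfig(PretrainedConfig):  # hypothetical stand-in class
    model_type = "deepseek_v4"

AutoConfig.register("deepseek_v4", DeepseekV4StubConfig)
print("deepseek_v4" in CONFIG_MAPPING)  # True once registered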

One more thing I noticed

The warning:

instantiate model of type ``

(empty target type)

is abnormal.

That smells like malformed dispatch logic and may be a separate bug signal.

My expectation

Probability outcomes:

  • 70% patch changes nothing → confirms architecture registration bug
  • 20% gets farther, fails later → strong bug isolation win
  • 10% unexpectedly works

Worth trying.

Also relevant: MLX maintainers have been actively adding DeepSeek-V4 support around PRs in this area, which makes a server-side integration gap plausible. (github.com)

My recommendation: test patches A/B before filing the issue.

After using these in requirements.txt:
git+https://github.com/huggingface/transformers.git@refs/pull/45643/head
git+https://github.com/blaizzy/mlx-lm.git@refs/pull/17/head

mlx_lm.server --model mlx-community/DeepSeek-V4-Flash-2bit-DQ --chat-template-args '{"enable_thinking":false}' --host 0.0.0.0 --port 8080 --max-tokens 8192 --temp 0.0

2026-04-27 11:51:59,228 - INFO - HTTP Request: GET https://huggingface.co/api/models/mlx-community/DeepSeek-V4-Flash-2bit-DQ/revision/main "HTTP/1.1 200 OK"
Fetching 25 files: 100%|██████████| 25/25 [00:00<00:00, 105596.78it/s]
Download complete: : 0.00B [00:00, ?B/s] | 0/25 [00:00<?, ?it/s]
/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/mlx_lm/server.py:1723: UserWarning: mlx_lm.server is not recommended for production as it only implements basic security checks.
warnings.warn(
2026-04-27 11:51:59,380 - INFO - Starting httpd at 0.0.0.0 on port 8080...
[ERROR] n_group is part of DeepseekV4Config.__init__'s signature, but not documented. Make sure to add it to the docstring of the function in /Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/models/deepseek_v4/configuration_deepseek_v4.py.
[ERROR] first_k_dense_replace is part of DeepseekV4Config.__init__'s signature, but not documented. Make sure to add it to the docstring of the function in /Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/models/deepseek_v4/configuration_deepseek_v4.py.
[ERROR] rope_interleave is part of DeepseekV4Config.__init__'s signature, but not documented. Make sure to add it to the docstring of the function in /Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/models/deepseek_v4/configuration_deepseek_v4.py.
[ERROR] rope_theta is part of DeepseekV4Config.__init__'s signature, but not documented. Make sure to add it to the docstring of the function in /Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/models/deepseek_v4/configuration_deepseek_v4.py.
[ERROR] o_lora_rank is part of DeepseekV4Config.__init__'s signature, but not documented. Make sure to add it to the docstring of the function in /Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/models/deepseek_v4/configuration_deepseek_v4.py.
[ERROR] index_n_heads is part of DeepseekV4Config.__init__'s signature, but not documented. Make sure to add it to the docstring of the function in /Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/models/deepseek_v4/configuration_deepseek_v4.py.
[ERROR] index_head_dim is part of DeepseekV4Config.__init__'s signature, but not documented. Make sure to add it to the docstring of the function in /Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/models/deepseek_v4/configuration_deepseek_v4.py.
[ERROR] index_topk is part of DeepseekV4Config.__init__'s signature, but not documented. Make sure to add it to the docstring of the function in /Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/models/deepseek_v4/configuration_deepseek_v4.py.
[ERROR] partial_rotary_factor is part of DeepseekV4Config.__init__'s signature, but not documented. Make sure to add it to the docstring of the function in /Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/models/deepseek_v4/configuration_deepseek_v4.py.
Exception in thread Thread-1 (_generate):
Traceback (most recent call last):
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/huggingface_hub/dataclasses.py", line 144, in strict_setattr
validator(value)
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/huggingface_hub/dataclasses.py", line 625, in validator
type_validator(field.name, value, field.type)
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/huggingface_hub/dataclasses.py", line 472, in type_validator
_validate_simple_type(name, value, expected_type)
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/huggingface_hub/dataclasses.py", line 615, in _validate_simple_type
raise TypeError(
TypeError: Field 'rope_theta' expected float, got int (value: 10000)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/xjxr170/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/Users/xjxr170/.local/share/uv/python/cpython-3.12.11-macos-aarch64-none/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/mlx_lm/server.py", line 695, in _generate
self.model_provider.load_default()
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/mlx_lm/server.py", line 385, in load_default
self.load("default_model", None, "default_model")
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/mlx_lm/server.py", line 394, in load
self._load(*model_key)
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/mlx_lm/server.py", line 349, in _load
model, tokenizer = load(
^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/mlx_lm/utils.py", line 542, in load
tokenizer = load_tokenizer(
^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/mlx_lm/utils.py", line 493, in load_tokenizer
return _load_tokenizer(
^^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/mlx_lm/tokenizer_utils.py", line 614, in load
tokenizer = AutoTokenizer.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/models/auto/tokenization_auto.py", line 687, in from_pretrained
config = AutoConfig.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 417, in from_pretrained
return config_class.from_dict(config_dict, **unused_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 839, in from_dict
config = cls(**config_dict)
^^^^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/huggingface_hub/dataclasses.py", line 275, in init_with_validate
initial_init(self, *args, **kwargs) # type: ignore [call-arg]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/transformers/configuration_utils.py", line 99, in init
setattr(self, f.name, standard_kwargs[f.name])
File "/Users/xjxr170/mlx/.venv/lib/python3.12/site-packages/huggingface_hub/dataclasses.py", line 146, in strict_setattr
raise StrictDataclassFieldValidationError(field=name, cause=e) from e
huggingface_hub.errors.StrictDataclassFieldValidationError: Validation error for field 'rope_theta':
TypeError: Field 'rope_theta' expected float, got int (value: 10000)

fixed it by modifying config.json
in $HF_HUB_CACHE/models--mlx-community--DeepSeek-V4-Flash-2bit-DQ/snapshots/722bf559b7de93575b2320973cf2002e05bfe6c9/config.json

  • "compress_rope_theta": 160000,
  • "compress_rope_theta": 160000.0,

...

  • "rope_theta": 10000,
  • "rope_theta": 10000.0,

it is working now for simple queries 😄
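
For anyone hitting the same validation error, a small sketch that automates the same edit (the snapshot hash is the one above; the cache root is assumed to be $HF_HUB_CACHE or the default hub cache):

# Coerce the int-valued RoPE fields to floats so huggingface_hub's strict
# dataclass validation ("Field 'rope_theta' expected float, got int") passes.
import json
import os
from pathlib import Path

cache = Path(os.environ.get("HF_HUB_CACHE", Path.home() / ".cache/huggingface/hub"))
config_path = (
    cache
    / "models--mlx-community--DeepSeek-V4-Flash-2bit-DQ"
    / "snapshots/722bf559b7de93575b2320973cf2002e05bfe6c9/config.json"
)

config = json.loads(config_path.read_text())
for key in ("rope_theta", "compress_rope_theta"):
    if isinstance(config.get(key), int):
        config[key] = float(config[key])  # 10000 -> 10000.0 once re-serialized

config_path.write_text(json.dumps(config, indent=2))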

a longer, harder query gives repeating output:

using : mlx_lm.server --model mlx-community/DeepSeek-V4-Flash-2bit-DQ --chat-template-args '{"enable_thinking":false}' --host 0.0.0.0 --port 8080 --max-tokens 8192 --temp 0.0

tricking me into thinking it's okay to compare pears and apples by making them slightly similar but otherwise competing products.
I reason tricking me into thinking it's okay to compare pears and apples by making them slightly similar but otherwise competing products.
I reason tricking me into thinking it's okay to compare pears and apples by making them slightly similar but otherwise competing products.
I reason tricking me into thinking it's okay to compare pears and apples by making them slightly similar but otherwise competing products.
....

generates it fast though ... lol
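
One hedged experiment: --temp 0.0 means greedy decoding, which is prone to exactly this kind of repetition loop. Resending the same prompt with a nonzero per-request temperature (assuming the server's OpenAI-style /v1/chat/completions endpoint honors it) would show whether the loop is a sampling artifact rather than a model or quantization problem:

# Probe: send the looping prompt with temperature > 0 and compare outputs.
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "<the long prompt that loops>"}],
    "temperature": 0.7,  # instead of the server-wide --temp 0.0
    "max_tokens": 512,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])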

on M4 Max 128GB:
python3 -m mlx_lm.generate --model mlx-community/DeepSeek-V4-Flash-2bit-DQ --prompt "write a paragraph about quantum computing" --max-tokens 200
...

Quantum computing represents a revolutionary shift in the way we process information, leveraging the principles of quantum mechanics to perform calculations that are fundamentally beyond the reach of classical computers. Unlike traditional bits, which exist as either 0 or 1, quantum bits—or qubits—can exist in superpositions of states, meaning they can be both 0 and 1 simultaneously, and can be entangled with one another, creating a vast computational space. This allows quantum computers to solve certain types of problems exponentially faster than classical ones, particularly in areas like cryptography, optimization, and simulation of quantum systems. While still in its early stages, with significant challenges in building stable, error-corrected hardware, the potential of quantum computing is immense, promising to revolutionize fields from drug discovery to artificial intelligence. The journey from theoretical concept to practical, fault-tolerant machines is long and fraught with technical hurdles, but the promise of unlocking new capabilities in problem-solving and simulation is a powerful driver for continued research and development.

Prompt: 10 tokens, 1.332 tokens-per-sec
Generation: 198 tokens, 39.155 tokens-per-sec (M4 Max 128GB)
Peak memory: 96.739 GB

Is that the right speed:
1.3 tok/sec PP and ~39 tok/sec TG?
It does not seem right.
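
One plausible reading of those numbers, as back-of-envelope arithmetic: with only a 10-token prompt, the prompt phase is dominated by fixed first-call overhead (weight paging, warm-up), so the 1.3 tok/sec figure may not reflect steady-state prefill throughput:

# Back-of-envelope from the reported stats above.
prompt_tokens = 10
pp_tps = 1.332   # reported prompt tokens-per-sec
gen_tokens = 198
tg_tps = 39.155  # reported generation tokens-per-sec

print(f"prompt phase: ~{prompt_tokens / pp_tps:.1f} s")  # ~7.5 s for 10 tokens
print(f"generation:   ~{gen_tokens / tg_tps:.1f} s")     # ~5.1 s for 198 tokens
# ~7.5 s to process 10 tokens looks like warm-up cost, not prefill speed.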
