xinhe committed
Commit 0adcbb3 · verified · 1 Parent(s): 4e721c3

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes.

Files changed (50)
  1. .gitattributes +3 -0
  2. DeepSeek_V4.pdf +3 -0
  3. LICENSE +21 -0
  4. README.md +54 -0
  5. assets/dsv4_performance.png +3 -0
  6. config.json +140 -0
  7. encoding/README.md +156 -0
  8. encoding/encoding_dsv4.py +744 -0
  9. encoding/test_encoding_dsv4.py +89 -0
  10. encoding/tests/test_input_1.json +81 -0
  11. encoding/tests/test_input_2.json +24 -0
  12. encoding/tests/test_input_3.json +159 -0
  13. encoding/tests/test_input_4.json +28 -0
  14. encoding/tests/test_output_1.txt +36 -0
  15. encoding/tests/test_output_2.txt +1 -0
  16. encoding/tests/test_output_3.txt +38 -0
  17. encoding/tests/test_output_4.txt +29 -0
  18. generation_config.json +9 -0
  19. inference/README.md +25 -0
  20. inference/config.json +35 -0
  21. inference/config_w4a16.json +34 -0
  22. inference/convert.py +168 -0
  23. inference/convert_w4a16.py +246 -0
  24. inference/generate.py +159 -0
  25. inference/kernel.py +536 -0
  26. inference/model.py +992 -0
  27. inference/requirements.txt +6 -0
  28. model-00001-of-00064.safetensors +3 -0
  29. model-00002-of-00064.safetensors +3 -0
  30. model-00003-of-00064.safetensors +3 -0
  31. model-00004-of-00064.safetensors +3 -0
  32. model-00005-of-00064.safetensors +3 -0
  33. model-00006-of-00064.safetensors +3 -0
  34. model-00007-of-00064.safetensors +3 -0
  35. model-00008-of-00064.safetensors +3 -0
  36. model-00009-of-00064.safetensors +3 -0
  37. model-00010-of-00064.safetensors +3 -0
  38. model-00011-of-00064.safetensors +3 -0
  39. model-00012-of-00064.safetensors +3 -0
  40. model-00013-of-00064.safetensors +3 -0
  41. model-00014-of-00064.safetensors +3 -0
  42. model-00015-of-00064.safetensors +3 -0
  43. model-00016-of-00064.safetensors +3 -0
  44. model-00017-of-00064.safetensors +3 -0
  45. model-00018-of-00064.safetensors +3 -0
  46. model-00019-of-00064.safetensors +3 -0
  47. model-00020-of-00064.safetensors +3 -0
  48. model-00021-of-00064.safetensors +3 -0
  49. model-00022-of-00064.safetensors +3 -0
  50. model-00023-of-00064.safetensors +3 -0
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ model.safetensors.index.json filter=lfs diff=lfs merge=lfs -text
+ *.pdf filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
DeepSeek_V4.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fa4a3490e2dcc03c9da61b04a8be471795e9966ebbbf292a3899fa62683a330e
+ size 4479901
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2023 DeepSeek
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md ADDED
@@ -0,0 +1,54 @@
+ ---
+ license: mit
+ library_name: transformers
+ base_model:
+ - deepseek-ai/DeepSeek-V4-Pro
+ ---
+
+ This model is an int4 model with group_size 128 of [deepseek-ai/DeepSeek-V4-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) generated by [intel/auto-round](https://github.com/intel/auto-round) with RTN mode. Please follow the license of the original model.
+
+ ## How to Run Locally
+
+ **vLLM and SGLang are not currently supported: https://huggingface.co/Intel/DeepSeek-V4-Flash-W4A16-AutoRound/discussions/1**
+
+ Please refer to the [inference](inference/README.md) folder for detailed instructions on running DeepSeek-V4 locally, including model weight conversion and interactive chat demos.
+
+ For local deployment, we recommend setting the sampling parameters to `temperature = 1.0, top_p = 1.0`. For the Think Max reasoning mode, we recommend setting the context window to at least **384K** tokens.
+
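+ As a rough sketch of applying these settings with `transformers` (the path and prompt below are placeholders, and the custom architecture requires `trust_remote_code=True`; see the inference folder for the supported workflow):
+
+ ~~~python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_dir = "./DeepSeek-V4-Pro-W4A16"  # placeholder: local checkpoint path
+ tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, device_map="auto")
+
+ inputs = tokenizer("What is 2+2?", return_tensors="pt").to(model.device)
+ outputs = model.generate(
+     **inputs,
+     do_sample=True,
+     temperature=1.0,  # recommended sampling parameters
+     top_p=1.0,
+     max_new_tokens=256,
+ )
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ~~~
+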
+ ## Generate the Model
+
+ This PR is required: [Support model_free WOQ quantization](https://github.com/intel/auto-round/pull/1699)
+
+ ~~~bash
+ auto-round deepseek-ai/DeepSeek-V4-Pro --model_free --output_dir "./DeepSeek-V4-Pro-W4A16"
+ ~~~
+
+ ## Ethical Considerations and Limitations
+
+ The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.
+
+ Therefore, before deploying any applications of the model, developers should perform safety testing.
+
+ ## Caveats and Recommendations
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
+
+ Here is a useful link to learn more about Intel's AI software:
+
+ - [Intel Neural Compressor](https://github.com/intel/neural-compressor)
+
+ ## Disclaimer
+
+ The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
+
+ ## Cite
+
+ @article{cheng2023optimize,
+   title={Optimize weight rounding via signed gradient descent for the quantization of llms},
+   author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
+   journal={arXiv preprint arXiv:2309.05516},
+   year={2023}
+ }
+
+ [arxiv](https://arxiv.org/abs/2309.05516) [github](https://github.com/intel/auto-round)
assets/dsv4_performance.png ADDED

Git LFS Details

  • SHA256: 8fd472981a4c8d40c1845c51c5e8961fc4ef3ac22e7ec77801f534c239c1b30f
  • Pointer size: 132 Bytes
  • Size of remote file: 1 MB
config.json ADDED
@@ -0,0 +1,140 @@
+ {
+   "architectures": [
+     "DeepseekV4ForCausalLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 0,
+   "eos_token_id": 1,
+   "hc_eps": 1e-06,
+   "hc_mult": 4,
+   "hc_sinkhorn_iters": 20,
+   "head_dim": 512,
+   "hidden_act": "silu",
+   "hidden_size": 7168,
+   "index_head_dim": 128,
+   "index_n_heads": 64,
+   "index_topk": 1024,
+   "initializer_range": 0.02,
+   "max_position_embeddings": 1048576,
+   "model_type": "deepseek_v4",
+   "moe_intermediate_size": 3072,
+   "n_routed_experts": 384,
+   "n_shared_experts": 1,
+   "norm_topk_prob": true,
+   "num_attention_heads": 128,
+   "num_experts_per_tok": 6,
+   "num_hidden_layers": 61,
+   "num_hash_layers": 3,
+   "num_key_value_heads": 1,
+   "num_nextn_predict_layers": 1,
+   "o_groups": 16,
+   "o_lora_rank": 1024,
+   "q_lora_rank": 1536,
+   "qk_rope_head_dim": 64,
+   "quantization_config": {
+     "quant_method": "auto-round",
+     "packing_format": "auto_round:auto_gptq",
+     "bits": 4,
+     "group_size": 128,
+     "sym": true,
+     "data_type": "int",
+     "iters": 0,
+     "model_free": true,
+     "autoround_version": "0.13.0",
+     "extra_config": {
+       "embed": {
+         "bits": 16,
+         "data_type": "float"
+       },
+       "head": {
+         "bits": 16,
+         "data_type": "float"
+       }
+     }
+   },
+   "rms_norm_eps": 1e-06,
+   "rope_scaling": {
+     "beta_fast": 32,
+     "beta_slow": 1,
+     "factor": 16,
+     "original_max_position_embeddings": 65536,
+     "type": "yarn"
+   },
+   "rope_theta": 10000,
+   "routed_scaling_factor": 2.5,
+   "scoring_func": "sqrtsoftplus",
+   "sliding_window": 128,
+   "swiglu_limit": 10.0,
+   "tie_word_embeddings": false,
+   "topk_method": "noaux_tc",
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.57.1",
+   "use_cache": true,
+   "vocab_size": 129280,
+   "compress_rope_theta": 160000,
+   "compress_ratios": [
+     128, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128,
+     4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128,
+     4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128,
+     4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 0
+   ]
+ }
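The `quantization_config` above declares symmetric int4 weights with `group_size` 128. As a minimal illustration of what that means (a toy sketch, not auto-round's actual kernel or packing):

```python
import numpy as np

def quantize_sym_int4(w: np.ndarray, group_size: int = 128):
    """Toy symmetric int4 quantizer: one scale per group of 128 weights."""
    groups = w.reshape(-1, group_size)
    # int4 range is [-8, 7]; scale each group so its max magnitude maps to 7
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7 + 1e-12
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q * scale).reshape(-1)

row = np.random.randn(7168).astype(np.float32)  # hidden_size from this config
q, scale = quantize_sym_int4(row)
print("max reconstruction error:", np.abs(dequantize(q, scale) - row).max())
```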
encoding/README.md ADDED
@@ -0,0 +1,156 @@
+ # DeepSeek-V4 Encoding
+
+ This document describes the prompt encoding format used by DeepSeek-V4 series models. The encoding handles multi-turn conversations, tool calling, extended thinking (reasoning), and quick instruction tasks.
+
+ A self-contained reference implementation is provided in `encoding_dsv4.py`.
+
+ ## Quick Start
+
+ ```python
+ from encoding_dsv4 import encode_messages, parse_message_from_completion_text
+
+ # Encode a conversation
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant."},
+     {"role": "user", "content": "What is 2+2?"},
+ ]
+ prompt = encode_messages(messages, thinking_mode="thinking")
+ # => "<|begin▁of▁sentence|>You are a helpful assistant.<|User|>What is 2+2?<|Assistant|><think>"
+
+ # Parse model output back to structured message
+ completion = "Simple arithmetic.</think>2 + 2 = 4.<|end▁of▁sentence|>"
+ parsed = parse_message_from_completion_text(completion, thinking_mode="thinking")
+ # => {"role": "assistant", "reasoning_content": "Simple arithmetic.", "content": "2 + 2 = 4.", "tool_calls": []}
+ ```
+
+ > **Note:** The `parse_message_from_completion_text` function is designed to handle well-formatted model output only. It does not attempt to correct or recover from malformed output that the model might occasionally generate. For production use, additional error handling is recommended.
+
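+ For example, a caller might guard against malformed completions like this (a minimal sketch; the parser raises `ValueError` or `AssertionError` on bad input):
+
+ ```python
+ try:
+     parsed = parse_message_from_completion_text(completion, thinking_mode="thinking")
+ except (ValueError, AssertionError):
+     # Fallback: treat the whole completion as plain content.
+     parsed = {"role": "assistant", "content": completion,
+               "reasoning_content": "", "tool_calls": []}
+ ```
+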
+ ## Message Format
+
+ ### Special Tokens
+
+ | Token | Purpose |
+ |-------|---------|
+ | `<|begin▁of▁sentence|>` | Beginning of sequence (BOS) |
+ | `<|end▁of▁sentence|>` | End of assistant turn (EOS) |
+ | `<|User|>` | User turn prefix |
+ | `<|Assistant|>` | Assistant turn prefix |
+ | `<|latest_reminder|>` | Latest reminder (date, locale, etc.) |
+ | `<think>` / `</think>` | Reasoning block delimiters |
+ | `|DSML|` | DSML markup token |
+
+ ### Roles
+
+ The encoding supports the following message roles: `system`, `user`, `assistant`, `tool`, `latest_reminder`, and `developer`.
+
+ > **Note on the `developer` role:** The `developer` role is used exclusively in the internal search agent pipeline. It is not needed for general-purpose chat or tool-calling tasks, and the official API does not accept messages with this role.
+
+ ### Basic Chat
+
+ A simple multi-turn conversation is encoded as:
+
+ ```
+ <|begin▁of▁sentence|>{system_prompt}
+ <|User|>{user_message}<|Assistant|></think>{response}<|end▁of▁sentence|>
+ <|User|>{user_message_2}<|Assistant|></think>{response_2}<|end▁of▁sentence|>
+ ```
+
+ - The BOS token is prepended at the very beginning of the conversation.
+ - In **chat mode** (`thinking_mode="chat"`), `</think>` is placed right after `<|Assistant|>` to immediately close the thinking block, so the model generates content directly; see the example below.
+
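+ For example, reusing the Quick Start messages:
+
+ ```python
+ prompt = encode_messages(messages, thinking_mode="chat")
+ # => "<|begin▁of▁sentence|>You are a helpful assistant.<|User|>What is 2+2?<|Assistant|></think>"
+ ```
+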
+ ### Interleaved Thinking Mode
+
+ In **thinking mode** (`thinking_mode="thinking"`), the model produces explicit reasoning inside `<think>...</think>` blocks before responding.
+
+ ```
+ <|begin▁of▁sentence|>{system_prompt}
+ <|User|>{message}<|Assistant|><think>{reasoning}</think>{response}<|end▁of▁sentence|>
+ ```
+
+ The `drop_thinking` parameter (default `True`) controls whether reasoning from earlier turns is preserved:
+
+ - **Without tools**: `drop_thinking` takes effect. Reasoning content from assistant turns **before** the last user message is stripped. Only the final assistant turn retains its `<think>...</think>` block.
+ - **With tools** (on system or developer message): `drop_thinking` is automatically disabled. All turns retain their reasoning, because tool-calling conversations require full context for the model to track multi-step reasoning across tool calls.
+
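+ A small example of `drop_thinking` in action (earlier reasoning is stripped; the final turn keeps its `<think>` block open for generation):
+
+ ```python
+ messages = [
+     {"role": "user", "content": "Hello"},
+     {"role": "assistant", "reasoning_content": "Greeting.", "content": "Hi!"},
+     {"role": "user", "content": "What is 2+2?"},
+ ]
+ prompt = encode_messages(messages, thinking_mode="thinking")
+ # => "<|begin▁of▁sentence|><|User|>Hello<|Assistant|></think>Hi!<|end▁of▁sentence|><|User|>What is 2+2?<|Assistant|><think>"
+ # "Greeting." was dropped because it precedes the last user message.
+ ```
+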
+ ### Tool Calling (DSML Format)
+
+ Tools are defined on the `system` or `developer` message via the `tools` field (OpenAI-compatible format). When tools are present, the following schema block is injected into the system/user prompt:
+
+ ```
+ ## Tools
+
+ You have access to a set of tools to help answer the user's question. You can invoke tools by writing a "<|DSML|tool_calls>" block like the following:
+
+ <|DSML|tool_calls>
+ <|DSML|invoke name="$TOOL_NAME">
+ <|DSML|parameter name="$PARAMETER_NAME" string="true|false">$PARAMETER_VALUE</|DSML|parameter>
+ ...
+ </|DSML|invoke>
+ <|DSML|invoke name="$TOOL_NAME2">
+ ...
+ </|DSML|invoke>
+ </|DSML|tool_calls>
+
+ String parameters should be specified as is and set `string="true"`. For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set `string="false"`.
+
+ If thinking_mode is enabled (triggered by <think>), you MUST output your complete reasoning inside <think>...</think> BEFORE any tool calls or final response.
+
+ Otherwise, output directly after </think> with tool calls or final response.
+
+ ### Available Tool Schemas
+
+ {tool_definitions_json}
+
+ You MUST strictly follow the above defined tool name and parameter schemas to invoke tool calls.
+ ```
+
+ An actual tool call in the assistant turn looks like:
+
+ ```xml
+ <|DSML|tool_calls>
+ <|DSML|invoke name="function_name">
+ <|DSML|parameter name="param" string="true">string_value</|DSML|parameter>
+ <|DSML|parameter name="count" string="false">5</|DSML|parameter>
+ </|DSML|invoke>
+ </|DSML|tool_calls><|end▁of▁sentence|>
+ ```
+
+ - `string="true"`: the parameter value is a raw string.
+ - `string="false"`: the parameter value is JSON (number, boolean, array, object).
+
+ Tool execution results are wrapped in `<tool_result>` tags within user messages:
+
+ ```
+ <|User|><tool_result>{result_json}</tool_result><|Assistant|><think>...
+ ```
+
+ When multiple tool results are present, they are sorted by the order of the corresponding `tool_calls` in the preceding assistant message.
+
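+ A minimal end-to-end sketch with a hypothetical `get_weather` tool (encode the conversation, then parse a tool-calling completion):
+
+ ```python
+ tools = [{"type": "function", "function": {
+     "name": "get_weather",
+     "description": "Get the weather for a city",
+     "parameters": {"type": "object",
+                    "properties": {"location": {"type": "string"}},
+                    "required": ["location"]}}}]
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant.", "tools": tools},
+     {"role": "user", "content": "Weather in Beijing?"},
+     {"role": "assistant", "reasoning_content": "Need the tool.",
+      "tool_calls": [{"id": "call_0", "type": "function", "function": {
+          "name": "get_weather", "arguments": "{\"location\": \"Beijing\"}"}}]},
+     {"role": "tool", "tool_call_id": "call_0", "content": "{\"temperature\": 22}"},
+ ]
+ prompt = encode_messages(messages, thinking_mode="thinking")
+ # The tool message is merged into a user turn, so the prompt ends with:
+ # ...<|User|><tool_result>{"temperature": 22}</tool_result><|Assistant|><think>
+
+ completion = ('Got the result.</think>'
+               '\n\n<|DSML|tool_calls>\n<|DSML|invoke name="get_weather">\n'
+               '<|DSML|parameter name="location" string="true">Beijing</|DSML|parameter>\n'
+               '</|DSML|invoke>\n</|DSML|tool_calls><|end▁of▁sentence|>')
+ parsed = parse_message_from_completion_text(completion, thinking_mode="thinking")
+ # parsed["tool_calls"][0]["function"]
+ # => {"name": "get_weather", "arguments": '{"location": "Beijing"}'}
+ ```
+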
+ ### Reasoning Effort
+
+ When `reasoning_effort="max"` is set, a special prefix is prepended at the very beginning of the prompt (before the system message) to instruct the model to maximize its reasoning depth:
+
+ ```
+ Reasoning Effort: Absolute maximum with no shortcuts permitted.
+ You MUST be very thorough in your thinking and comprehensively decompose the problem to resolve the root cause, rigorously stress-testing your logic against all potential paths, edge cases, and adversarial scenarios.
+ Explicitly write out your entire deliberation process, documenting every intermediate step, considered alternative, and rejected hypothesis to ensure absolutely no assumption is left unchecked.
+ ```
+
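+ Usage, reusing the Quick Start messages (in the reference implementation the prefix lands after the BOS token, before the system prompt):
+
+ ```python
+ prompt = encode_messages(messages, thinking_mode="thinking", reasoning_effort="max")
+ assert prompt.startswith("<|begin▁of▁sentence|>Reasoning Effort: Absolute maximum")
+ ```
+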
+ ### Quick Instruction Special Tokens
+
+ Quick instruction tokens are used for auxiliary classification and generation tasks. They are appended to messages via the `"task"` field to trigger specialized model behavior for a single-token or short-form output.
+
+ | Special Token | Description | Format |
+ |:---|:---|:---|
+ | `<|action|>` | Determines whether the user prompt requires a web search or can be answered directly. | `...<|User|>{prompt}<|Assistant|><think><|action|>` |
+ | `<|title|>` | Generates a concise conversation title after the first assistant response. | `...<|Assistant|>{response}<|end▁of▁sentence|><|title|>` |
+ | `<|query|>` | Generates search queries for the user prompt. | `...<|User|>{prompt}<|query|>` |
+ | `<|authority|>` | Classifies the user prompt's demand for source authoritativeness. | `...<|User|>{prompt}<|authority|>` |
+ | `<|domain|>` | Identifies the domain of the user prompt. | `...<|User|>{prompt}<|domain|>` |
+ | `<|extracted_url|>` `<|read_url|>` | Determines whether each URL in the user prompt should be fetched and read. | `...<|User|>{prompt}<|extracted_url|>{url}<|read_url|>` |
+
+ Usage in message format:
+
+ - **`action`** on a user message: the `<|action|>` token is placed after the assistant prefix and thinking token, triggering a routing decision (e.g., "Search" or "Answer").
+ - **Other tasks** (`query`, `authority`, `domain`, `read_url`) on a user message: the task token is appended directly after the user content.
+ - **`title`** on an assistant message: the `<|title|>` token is appended after the assistant's EOS. The next assistant message provides the generated title.
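+
+ For example (chat mode):
+
+ ```python
+ # Routing decision ("action"): token goes after <|Assistant|> and the closed think block.
+ msgs = [{"role": "user", "content": "Best hot springs in the world?", "task": "action"}]
+ encode_messages(msgs, thinking_mode="chat")
+ # => "<|begin▁of▁sentence|><|User|>Best hot springs in the world?<|Assistant|></think><|action|>"
+
+ # Query generation ("query"): token is appended directly after the user content.
+ msgs = [{"role": "user", "content": "latest LLM quantization papers", "task": "query"}]
+ encode_messages(msgs, thinking_mode="chat")
+ # => "<|begin▁of▁sentence|><|User|>latest LLM quantization papers<|query|>"
+ ```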
encoding/encoding_dsv4.py ADDED
@@ -0,0 +1,744 @@
+ """
+ DeepSeek-V4 Encoding
+
+ A self-contained implementation for encoding/decoding DeepSeek-V4 chat messages
+ with tool calling, thinking mode, and quick instruction task support.
+ """
+
+ from typing import Any, Dict, List, Union, Optional, Tuple
+ import copy
+ import json
+ import re
+
+ # ============================================================
+ # Special Tokens
+ # ============================================================
+
+ bos_token: str = "<|begin▁of▁sentence|>"
+ eos_token: str = "<|end▁of▁sentence|>"
+ thinking_start_token: str = "<think>"
+ thinking_end_token: str = "</think>"
+ dsml_token: str = "|DSML|"
+
+ USER_SP_TOKEN = "<|User|>"
+ ASSISTANT_SP_TOKEN = "<|Assistant|>"
+ LATEST_REMINDER_SP_TOKEN = "<|latest_reminder|>"
+
+ # Task special tokens for internal classification tasks
+ DS_TASK_SP_TOKENS = {
+     "action": "<|action|>",
+     "query": "<|query|>",
+     "authority": "<|authority|>",
+     "domain": "<|domain|>",
+     "title": "<|title|>",
+     "read_url": "<|read_url|>",
+ }
+ VALID_TASKS = set(DS_TASK_SP_TOKENS.keys())
+
+ # ============================================================
+ # Templates
+ # ============================================================
+
+ system_msg_template: str = "{content}"
+ user_msg_template: str = "{content}"
+ latest_reminder_msg_template: str = "{content}"
+ assistant_msg_template: str = "{reasoning}{content}{tool_calls}" + eos_token
+ assistant_msg_wo_eos_template: str = "{reasoning}{content}{tool_calls}"
+ thinking_template: str = "{reasoning_content}"
+
+ response_format_template: str = (
+     "## Response Format:\n\nYou MUST strictly adhere to the following schema to reply:\n{schema}"
+ )
+ tool_call_template: str = (
+     "<{dsml_token}invoke name=\"{name}\">\n{arguments}\n</{dsml_token}invoke>"
+ )
+ tool_calls_template = (
+     "<{dsml_token}{tc_block_name}>\n{tool_calls}\n</{dsml_token}{tc_block_name}>"
+ )
+ tool_calls_block_name: str = "tool_calls"
+
+ tool_output_template: str = (
+     "<tool_result>{content}</tool_result>"
+ )
+
+ REASONING_EFFORT_MAX = (
+     "Reasoning Effort: Absolute maximum with no shortcuts permitted.\n"
+     "You MUST be very thorough in your thinking and comprehensively decompose the problem to resolve the root cause, rigorously stress-testing your logic against all potential paths, edge cases, and adversarial scenarios.\n"
+     "Explicitly write out your entire deliberation process, documenting every intermediate step, considered alternative, and rejected hypothesis to ensure absolutely no assumption is left unchecked.\n\n"
+ )
+
+ TOOLS_TEMPLATE = """## Tools
+
+ You have access to a set of tools to help answer the user's question. You can invoke tools by writing a "<{dsml_token}tool_calls>" block like the following:
+
+ <{dsml_token}tool_calls>
+ <{dsml_token}invoke name="$TOOL_NAME">
+ <{dsml_token}parameter name="$PARAMETER_NAME" string="true|false">$PARAMETER_VALUE</{dsml_token}parameter>
+ ...
+ </{dsml_token}invoke>
+ <{dsml_token}invoke name="$TOOL_NAME2">
+ ...
+ </{dsml_token}invoke>
+ </{dsml_token}tool_calls>
+
+ String parameters should be specified as is and set `string="true"`. For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set `string="false"`.
+
+ If thinking_mode is enabled (triggered by {thinking_start_token}), you MUST output your complete reasoning inside {thinking_start_token}...{thinking_end_token} BEFORE any tool calls or final response.
+
+ Otherwise, output directly after {thinking_end_token} with tool calls or final response.
+
+ ### Available Tool Schemas
+
+ {tool_schemas}
+
+ You MUST strictly follow the above defined tool name and parameter schemas to invoke tool calls.
+ """
+
+ # ============================================================
+ # Utility Functions
+ # ============================================================
+
+ def to_json(value: Any) -> str:
+     """Serialize a value to JSON string."""
+     try:
+         return json.dumps(value, ensure_ascii=False)
+     except Exception:
+         return json.dumps(value, ensure_ascii=True)
+
+
+ def tools_from_openai_format(tools):
+     """Extract function definitions from OpenAI-format tool list."""
+     return [tool["function"] for tool in tools]
+
+
+ def tool_calls_from_openai_format(tool_calls):
+     """Convert OpenAI-format tool calls to internal format."""
+     return [
+         {
+             "name": tool_call["function"]["name"],
+             "arguments": tool_call["function"]["arguments"],
+         }
+         for tool_call in tool_calls
+     ]
+
+
+ def tool_calls_to_openai_format(tool_calls):
+     """Convert internal tool calls to OpenAI format."""
+     return [
+         {
+             "type": "function",
+             "function": {
+                 "name": tool_call["name"],
+                 "arguments": tool_call["arguments"],
+             }
+         }
+         for tool_call in tool_calls
+     ]
+
+
+ def encode_arguments_to_dsml(tool_call: Dict[str, str]) -> str:
+     """
+     Encode tool call arguments into DSML parameter format.
+
+     Args:
+         tool_call: Dict with "name" and "arguments" (JSON string) keys.
+
+     Returns:
+         DSML-formatted parameter string.
+     """
+     p_dsml_template = '<{dsml_token}parameter name="{key}" string="{is_str}">{value}</{dsml_token}parameter>'
+     p_dsml_strs = []
+
+     try:
+         arguments = json.loads(tool_call["arguments"])
+     except Exception:
+         arguments = {"arguments": tool_call["arguments"]}
+
+     for k, v in arguments.items():
+         p_dsml_str = p_dsml_template.format(
+             dsml_token=dsml_token,
+             key=k,
+             is_str="true" if isinstance(v, str) else "false",
+             value=v if isinstance(v, str) else to_json(v),
+         )
+         p_dsml_strs.append(p_dsml_str)
+
+     return "\n".join(p_dsml_strs)
+
+
+ def decode_dsml_to_arguments(tool_name: str, tool_args: Dict[str, Tuple[str, str]]) -> Dict[str, str]:
+     """
+     Decode DSML parameters back to a tool call dict.
+
+     Args:
+         tool_name: Name of the tool.
+         tool_args: Dict mapping param_name -> (value, is_string_flag).
+
+     Returns:
+         Dict with "name" and "arguments" (JSON string) keys.
+     """
+     def _decode_value(key: str, value: str, string: str):
+         if string == "true":
+             value = to_json(value)
+         return f"{to_json(key)}: {value}"
+
+     tool_args_json = "{" + ", ".join([_decode_value(k, v, string=is_str) for k, (v, is_str) in tool_args.items()]) + "}"
+     return dict(name=tool_name, arguments=tool_args_json)
+
+
+ def render_tools(tools: List[Dict[str, Union[str, Dict[str, Any]]]]) -> str:
+     """
+     Render tool schemas into the system prompt format.
+
+     Args:
+         tools: List of tool schema dicts (each with name, description, parameters).
+
+     Returns:
+         Formatted tools section string.
+     """
+     tools_json = [to_json(t) for t in tools]
+
+     return TOOLS_TEMPLATE.format(
+         tool_schemas="\n".join(tools_json),
+         dsml_token=dsml_token,
+         thinking_start_token=thinking_start_token,
+         thinking_end_token=thinking_end_token,
+     )
+
+
+ def find_last_user_index(messages: List[Dict[str, Any]]) -> int:
+     """Find the index of the last user/developer message."""
+     last_user_index = -1
+     for idx in range(len(messages) - 1, -1, -1):
+         if messages[idx].get("role") in ["user", "developer"]:
+             last_user_index = idx
+             break
+     return last_user_index
+
+
+ # ============================================================
+ # Message Rendering
+ # ============================================================
+
+ def render_message(index: int, messages: List[Dict[str, Any]], thinking_mode: str, drop_thinking: bool = True, reasoning_effort: Optional[str] = None) -> str:
+     """
+     Render a single message at the given index into its encoded string form.
+
+     This is the core function that converts each message in the conversation
+     into the DeepSeek-V4 format.
+
+     Args:
+         index: Index of the message to render.
+         messages: Full list of messages in the conversation.
+         thinking_mode: Either "chat" or "thinking".
+         drop_thinking: Whether to drop reasoning content from earlier turns.
+         reasoning_effort: Optional reasoning effort level ("max", "high", or None).
+
+     Returns:
+         Encoded string for this message.
+     """
+     assert 0 <= index < len(messages)
+     assert thinking_mode in ["chat", "thinking"], f"Invalid thinking_mode `{thinking_mode}`"
+
+     prompt = ""
+     msg = messages[index]
+     last_user_idx = find_last_user_index(messages)
+
+     role = msg.get("role")
+     content = msg.get("content")
+     tools = msg.get("tools")
+     response_format = msg.get("response_format")
+     tool_calls = msg.get("tool_calls")
+     reasoning_content = msg.get("reasoning_content")
+     wo_eos = msg.get("wo_eos", False)
+
+     if tools:
+         tools = tools_from_openai_format(tools)
+     if tool_calls:
+         tool_calls = tool_calls_from_openai_format(tool_calls)
+
+     # Reasoning effort prefix (only at index 0 in thinking mode with max effort)
+     assert reasoning_effort in ['max', None, 'high'], f"Invalid reasoning effort: {reasoning_effort}"
+     if index == 0 and thinking_mode == "thinking" and reasoning_effort == 'max':
+         prompt += REASONING_EFFORT_MAX
+
+     if role == "system":
+         prompt += system_msg_template.format(content=content or "")
+         if tools:
+             prompt += "\n\n" + render_tools(tools)
+         if response_format:
+             prompt += "\n\n" + response_format_template.format(schema=to_json(response_format))
+
+     elif role == "developer":
+         assert content, f"Invalid message for role `{role}`: {msg}"
+
+         content_developer = USER_SP_TOKEN
+         content_developer += content
+
+         if tools:
+             content_developer += "\n\n" + render_tools(tools)
+         if response_format:
+             content_developer += "\n\n" + response_format_template.format(schema=to_json(response_format))
+
+         prompt += user_msg_template.format(content=content_developer)
+
+     elif role == "user":
+         prompt += USER_SP_TOKEN
+
+         # Handle content blocks (tool results mixed with text)
+         content_blocks = msg.get("content_blocks")
+         if content_blocks:
+             parts = []
+             for block in content_blocks:
+                 block_type = block.get("type")
+                 if block_type == "text":
+                     parts.append(block.get("text", ""))
+                 elif block_type == "tool_result":
+                     tool_content = block.get("content", "")
+                     if isinstance(tool_content, list):
+                         text_parts = []
+                         for b in tool_content:
+                             if b.get("type") == "text":
+                                 text_parts.append(b.get("text", ""))
+                             else:
+                                 text_parts.append(f"[Unsupported {b.get('type')}]")
+                         tool_content = "\n\n".join(text_parts)
+                     parts.append(tool_output_template.format(content=tool_content))
+                 else:
+                     parts.append(f"[Unsupported {block_type}]")
+             prompt += "\n\n".join(parts)
+         else:
+             prompt += content or ""
+
+     elif role == "latest_reminder":
+         prompt += LATEST_REMINDER_SP_TOKEN + latest_reminder_msg_template.format(content=content)
+
+     elif role == "tool":
+         raise NotImplementedError("deepseek_v4 merges tool messages into user; please preprocess with merge_tool_messages()")
+
+     elif role == "assistant":
+         thinking_part = ""
+         tc_content = ""
+
+         if tool_calls:
+             tc_list = [
+                 tool_call_template.format(
+                     dsml_token=dsml_token,
+                     name=tc.get("name"),
+                     arguments=encode_arguments_to_dsml(tc)
+                 )
+                 for tc in tool_calls
+             ]
+             tc_content += '\n\n' + tool_calls_template.format(
+                 dsml_token=dsml_token,
+                 tool_calls="\n".join(tc_list),
+                 tc_block_name=tool_calls_block_name,
+             )
+
+         summary_content = content or ""
+         rc = reasoning_content or ""
+
+         # Check if previous message has a task - if so, this is a task output (no thinking)
+         prev_has_task = index - 1 >= 0 and messages[index - 1].get("task") is not None
+
+         if thinking_mode == "thinking" and not prev_has_task:
+             if not drop_thinking or index > last_user_idx:
+                 thinking_part = thinking_template.format(reasoning_content=rc) + thinking_end_token
+             else:
+                 thinking_part = ""
+
+         if wo_eos:
+             prompt += assistant_msg_wo_eos_template.format(
+                 reasoning=thinking_part,
+                 content=summary_content,
+                 tool_calls=tc_content,
+             )
+         else:
+             prompt += assistant_msg_template.format(
+                 reasoning=thinking_part,
+                 content=summary_content,
+                 tool_calls=tc_content,
+             )
+     else:
+         raise NotImplementedError(f"Unknown role: {role}")
+
+     # Append transition tokens based on what follows
+     if index + 1 < len(messages) and messages[index + 1].get("role") not in ["assistant", "latest_reminder"]:
+         return prompt
+
+     task = messages[index].get("task")
+     if task is not None:
+         # Task special token for internal classification tasks
+         assert task in VALID_TASKS, f"Invalid task: '{task}'. Valid tasks are: {list(VALID_TASKS)}"
+         task_sp_token = DS_TASK_SP_TOKENS[task]
+
+         if task != "action":
+             # Non-action tasks: append task sp token directly after the message
+             prompt += task_sp_token
+         else:
+             # Action task: append Assistant + thinking token + action sp token
+             prompt += ASSISTANT_SP_TOKEN
+             prompt += thinking_end_token if thinking_mode != "thinking" else thinking_start_token
+             prompt += task_sp_token
+
+     elif messages[index].get("role") in ["user", "developer"]:
+         # Normal generation: append Assistant + thinking token
+         prompt += ASSISTANT_SP_TOKEN
+         if not drop_thinking and thinking_mode == "thinking":
+             prompt += thinking_start_token
+         elif drop_thinking and thinking_mode == "thinking" and index >= last_user_idx:
+             prompt += thinking_start_token
+         else:
+             prompt += thinking_end_token
+
+     return prompt
+
+
+ # ============================================================
+ # Preprocessing
+ # ============================================================
+
+ def merge_tool_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+     """
+     Merge tool messages into the preceding user message using content_blocks format.
+
+     DeepSeek-V4 does not have a standalone "tool" role; instead, tool results
+     are encoded as <tool_result> blocks within user messages.
+
+     This function converts a standard OpenAI-format conversation (with separate
+     "tool" role messages) into V4 format where tool results are merged into
+     user messages.
+
+     Args:
+         messages: List of message dicts in OpenAI format.
+
+     Returns:
+         Processed message list with tool messages merged into user messages.
+     """
+     merged: List[Dict[str, Any]] = []
+
+     for msg in messages:
+         msg = copy.deepcopy(msg)
+         role = msg.get("role")
+
+         if role == "tool":
+             # Convert tool message to a user message with tool_result block
+             tool_block = {
+                 "type": "tool_result",
+                 "tool_use_id": msg.get("tool_call_id", ""),
+                 "content": msg.get("content", ""),
+             }
+             # Merge into previous message if it's already a user (merged tool)
+             if merged and merged[-1].get("role") == "user" and "content_blocks" in merged[-1]:
+                 merged[-1]["content_blocks"].append(tool_block)
+             else:
+                 merged.append({
+                     "role": "user",
+                     "content_blocks": [tool_block],
+                 })
+         elif role == "user":
+             text_block = {"type": "text", "text": msg.get("content", "")}
+             if merged and merged[-1].get("role") == "user" and "content_blocks" in merged[-1] and merged[-1].get("task") is None:
+                 merged[-1]["content_blocks"].append(text_block)
+             else:
+                 new_msg = {
+                     "role": "user",
+                     "content": msg.get("content", ""),
+                     "content_blocks": [text_block],
+                 }
+                 # Preserve extra fields (task, wo_eos, mask, etc.)
+                 for key in ("task", "wo_eos", "mask"):
+                     if key in msg:
+                         new_msg[key] = msg[key]
+                 merged.append(new_msg)
+         else:
+             merged.append(msg)
+
+     return merged
+
+
+ def sort_tool_results_by_call_order(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+     """
+     Sort tool_result blocks within user messages by the order of tool_calls
+     in the preceding assistant message.
+
+     Args:
+         messages: Preprocessed message list (after merge_tool_messages).
+
+     Returns:
+         Message list with sorted tool result blocks.
+     """
+     last_tool_call_order: Dict[str, int] = {}
+
+     for msg in messages:
+         role = msg.get("role")
+         if role == "assistant" and msg.get("tool_calls"):
+             last_tool_call_order = {}
+             for idx, tc in enumerate(msg["tool_calls"]):
+                 tc_id = tc.get("id") or tc.get("function", {}).get("id", "")
+                 if tc_id:
+                     last_tool_call_order[tc_id] = idx
+
+         elif role == "user" and msg.get("content_blocks"):
+             tool_blocks = [b for b in msg["content_blocks"] if b.get("type") == "tool_result"]
+             if len(tool_blocks) > 1 and last_tool_call_order:
+                 sorted_blocks = sorted(
+                     tool_blocks,
+                     key=lambda b: last_tool_call_order.get(b.get("tool_use_id", ""), 0)
+                 )
+                 sorted_idx = 0
+                 new_blocks = []
+                 for block in msg["content_blocks"]:
+                     if block.get("type") == "tool_result":
+                         new_blocks.append(sorted_blocks[sorted_idx])
+                         sorted_idx += 1
+                     else:
+                         new_blocks.append(block)
+                 msg["content_blocks"] = new_blocks
+
+     return messages
+
+
+ # ============================================================
+ # Main Encoding Function
+ # ============================================================
+
+ def encode_messages(
+     messages: List[Dict[str, Any]],
+     thinking_mode: str,
+     context: Optional[List[Dict[str, Any]]] = None,
+     drop_thinking: bool = True,
+     add_default_bos_token: bool = True,
+     reasoning_effort: Optional[str] = None,
+ ) -> str:
+     """
+     Encode a list of messages into the DeepSeek-V4 prompt format.
+
+     This is the main entry point for encoding conversations. It handles:
+     - BOS token insertion
+     - Thinking mode with optional reasoning content dropping
+     - Tool message merging into user messages
+     - Multi-turn conversation context
+
+     Args:
+         messages: List of message dicts to encode.
+         thinking_mode: Either "chat" or "thinking".
+         context: Optional preceding context messages (already encoded prefix).
+         drop_thinking: If True, drop reasoning_content from earlier assistant turns
+             (only keep reasoning for messages after the last user message).
+         add_default_bos_token: Whether to prepend BOS token at conversation start.
+         reasoning_effort: Optional reasoning effort level ("max", "high", or None).
+
+     Returns:
+         The encoded prompt string.
+     """
+     context = context if context else []
+
+     # Preprocess: merge tool messages and sort tool results
+     messages = merge_tool_messages(messages)
+     messages = sort_tool_results_by_call_order(context + messages)[len(context):]
+     if context:
+         context = merge_tool_messages(context)
+         context = sort_tool_results_by_call_order(context)
+
+     full_messages = context + messages
+
+     prompt = bos_token if add_default_bos_token and len(context) == 0 else ""
+
+     # Resolve drop_thinking: if any message has tools defined, don't drop thinking
+     effective_drop_thinking = drop_thinking
+     if any(m.get("tools") for m in full_messages):
+         effective_drop_thinking = False
+
+     if thinking_mode == "thinking" and effective_drop_thinking:
+         full_messages = _drop_thinking_messages(full_messages)
+         # After dropping, recalculate how many messages to render
+         # (context may have shrunk too)
+         num_to_render = len(full_messages) - len(_drop_thinking_messages(context))
+         context_len = len(full_messages) - num_to_render
+     else:
+         num_to_render = len(messages)
+         context_len = len(context)
+
+     for idx in range(num_to_render):
+         prompt += render_message(
+             idx + context_len,
+             full_messages,
+             thinking_mode=thinking_mode,
+             drop_thinking=effective_drop_thinking,
+             reasoning_effort=reasoning_effort,
+         )
+
+     return prompt
+
+
+ def _drop_thinking_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+     """
+     Drop reasoning_content and non-essential messages before the last user message.
+
+     Behavior:
+     - Messages with role in ["user", "system", "tool", "latest_reminder"] are always kept.
+     - Messages at or after the last user index are always kept.
+     - Assistant messages before the last user get reasoning_content removed.
+     - Developer messages before the last user are dropped entirely.
+     """
+     last_user_idx = find_last_user_index(messages)
+     result = []
+     keep_roles = {"user", "system", "tool", "latest_reminder", "direct_search_results"}
+
+     for idx, msg in enumerate(messages):
+         role = msg.get("role")
+         if role in keep_roles or idx >= last_user_idx:
+             result.append(msg)
+         elif role == "assistant":
+             msg = copy.copy(msg)
+             msg.pop("reasoning_content", None)
+             result.append(msg)
+         # developer and other roles before last_user_idx are dropped
+
+     return result
+
+
+ # ============================================================
+ # Parsing (Decoding model output)
+ # ============================================================
+
+ def _read_until_stop(index: int, text: str, stop: List[str]) -> Tuple[int, str, Optional[str]]:
+     """
+     Read text from index until one of the stop strings is found.
+
+     Returns:
+         Tuple of (new_index, content_before_stop, matched_stop_string_or_None).
+     """
+     min_pos = len(text)
+     matched_stop = None
+
+     for s in stop:
+         pos = text.find(s, index)
+         if pos != -1 and pos < min_pos:
+             min_pos = pos
+             matched_stop = s
+
+     if matched_stop:
+         content = text[index:min_pos]
+         return min_pos + len(matched_stop), content, matched_stop
+     else:
+         content = text[index:]
+         return len(text), content, None
+
+
+ def parse_tool_calls(index: int, text: str) -> Tuple[int, Optional[str], List[Dict[str, str]]]:
+     """
+     Parse DSML tool calls from text starting at the given index.
+
+     Args:
+         index: Starting position in text.
+         text: The full text to parse.
+
+     Returns:
+         Tuple of (new_index, last_stop_token, list_of_tool_call_dicts).
+         Each tool call dict has "name" and "arguments" keys.
+     """
+     tool_calls: List[Dict[str, Any]] = []
+     stop_token = None
+     tool_calls_end_token = f"</{dsml_token}{tool_calls_block_name}>"
+
+     while index < len(text):
+         index, gap, stop_token = _read_until_stop(index, text, [f"<{dsml_token}invoke", tool_calls_end_token])
+         if gap != ">\n":
+             raise ValueError(f"Tool call format error: expected '>\\n' but got '{gap}'")
+
+         if stop_token == tool_calls_end_token:
+             break
+
+         if stop_token is None:
+             raise ValueError("Missing special token in tool calls")
+
+         index, tool_name_content, stop_token = _read_until_stop(index, text, [f"<{dsml_token}parameter", f"</{dsml_token}invoke"])
+
+         p_tool_name = re.findall(r'^\s*name="(.*?)">\n$', tool_name_content, flags=re.DOTALL)
+         if len(p_tool_name) != 1:
+             raise ValueError(f"Tool name format error: '{tool_name_content}'")
+         tool_name = p_tool_name[0]
+
+         tool_args: Dict[str, Tuple[str, str]] = {}
+         while stop_token == f"<{dsml_token}parameter":
+             index, param_content, stop_token = _read_until_stop(index, text, [f"/{dsml_token}parameter"])
+
+             param_kv = re.findall(r'^ name="(.*?)" string="(true|false)">(.*?)<$', param_content, flags=re.DOTALL)
+             if len(param_kv) != 1:
+                 raise ValueError(f"Parameter format error: '{param_content}'")
+             param_name, string, param_value = param_kv[0]
+
+             if param_name in tool_args:
+                 raise ValueError(f"Duplicate parameter name: '{param_name}'")
+             tool_args[param_name] = (param_value, string)
+
+             index, content, stop_token = _read_until_stop(index, text, [f"<{dsml_token}parameter", f"</{dsml_token}invoke"])
+             if content != ">\n":
+                 raise ValueError(f"Parameter format error: expected '>\\n' but got '{content}'")
+
+         tool_call = decode_dsml_to_arguments(tool_name=tool_name, tool_args=tool_args)
+         tool_calls.append(tool_call)
+
+     return index, stop_token, tool_calls
+
+
+ def parse_message_from_completion_text(text: str, thinking_mode: str) -> Dict[str, Any]:
+     """
+     Parse a model completion text into a structured assistant message.
+
+     This function takes the raw text output from the model (a single assistant turn)
+     and extracts:
+     - reasoning_content (thinking block)
+     - content (summary/response)
+     - tool_calls (if any)
+
+     NOTE: This function is designed to parse only correctly formatted strings and
+     will raise ValueError for malformed output.
+
+     Args:
+         text: The raw completion text (including EOS token).
+         thinking_mode: Either "chat" or "thinking".
+
+     Returns:
+         Dict with keys: "role", "content", "reasoning_content", "tool_calls".
+         tool_calls are in OpenAI format.
+     """
+     summary_content, reasoning_content, tool_calls = "", "", []
+     index, stop_token = 0, None
+     tool_calls_start_token = f"\n\n<{dsml_token}{tool_calls_block_name}"
+
+     is_thinking = thinking_mode == "thinking"
+     is_tool_calling = False
+
+     if is_thinking:
+         index, content_delta, stop_token = _read_until_stop(index, text, [thinking_end_token, tool_calls_start_token])
+         reasoning_content = content_delta
+         assert stop_token == thinking_end_token, "Invalid thinking format: missing </think>"
+
+     index, content_delta, stop_token = _read_until_stop(index, text, [eos_token, tool_calls_start_token])
+     summary_content = content_delta
+     if stop_token == tool_calls_start_token:
+         is_tool_calling = True
+     else:
+         assert stop_token == eos_token, "Invalid format: missing EOS token"
+
+     if is_tool_calling:
+         index, stop_token, tool_calls = parse_tool_calls(index, text)
+
+         index, tool_ends_text, stop_token = _read_until_stop(index, text, [eos_token])
+         assert not tool_ends_text, "Unexpected content after tool calls"
+
+     assert len(text) == index and stop_token in [eos_token, None], "Unexpected content at end"
+
+     for sp_token in [bos_token, eos_token, thinking_start_token, thinking_end_token, dsml_token]:
+         assert sp_token not in summary_content and sp_token not in reasoning_content, \
+             f"Unexpected special token '{sp_token}' in content"
+
+     return {
+         "role": "assistant",
+         "content": summary_content,
+         "reasoning_content": reasoning_content,
+         "tool_calls": tool_calls_to_openai_format(tool_calls)
+     }
encoding/test_encoding_dsv4.py ADDED
@@ -0,0 +1,89 @@
+ """
+ Test suite for DeepSeek-V4 Encoding.
+
+ Run: python test_encoding_dsv4.py
+ """
+
+ import json
+ import os
+
+ from encoding_dsv4 import encode_messages, parse_message_from_completion_text
+
+ TESTS_DIR = os.path.join(os.path.dirname(__file__), "tests")
+
+
+ def test_case_1():
+     """Thinking mode with tool calls (multi-turn, tool results merged into user)."""
+     with open(os.path.join(TESTS_DIR, "test_input_1.json")) as f:
+         td = json.load(f)
+     messages = td["messages"]
+     messages[0]["tools"] = td["tools"]
+     gold = open(os.path.join(TESTS_DIR, "test_output_1.txt")).read()
+     prompt = encode_messages(messages, thinking_mode="thinking")
+     assert prompt == gold
+
+     # Parse: assistant turn with tool call
+     marker = "<|Assistant|><think>"
+     first_start = prompt.find(marker) + len(marker)
+     first_end = prompt.find("<|User|>", first_start)
+     parsed_tc = parse_message_from_completion_text(prompt[first_start:first_end], thinking_mode="thinking")
+     assert parsed_tc["reasoning_content"] == "The user wants to know the weather in Beijing. I should use the get_weather tool."
+     assert parsed_tc["content"] == ""
+     assert len(parsed_tc["tool_calls"]) == 1
+     assert parsed_tc["tool_calls"][0]["function"]["name"] == "get_weather"
+     assert json.loads(parsed_tc["tool_calls"][0]["function"]["arguments"]) == {"location": "Beijing", "unit": "celsius"}
+
+     # Parse: final assistant turn with content
+     last_start = prompt.rfind(marker) + len(marker)
+     parsed_final = parse_message_from_completion_text(prompt[last_start:], thinking_mode="thinking")
+     assert parsed_final["reasoning_content"] == "Got the weather data. Let me format a nice response."
+     assert "22°C" in parsed_final["content"]
+     assert parsed_final["tool_calls"] == []
+
+     print(" [PASS] case 1: thinking with tools (encode + parse)")
+
+
+ def test_case_2():
+     """Thinking mode without tools (drop_thinking removes earlier reasoning)."""
+     messages = json.load(open(os.path.join(TESTS_DIR, "test_input_2.json")))
+     gold = open(os.path.join(TESTS_DIR, "test_output_2.txt")).read()
+     prompt = encode_messages(messages, thinking_mode="thinking")
+     assert prompt == gold
+
+     # Parse: last assistant turn
+     marker = "<|Assistant|><think>"
+     last_start = prompt.rfind(marker) + len(marker)
+     parsed = parse_message_from_completion_text(prompt[last_start:], thinking_mode="thinking")
+     assert parsed["reasoning_content"] == "The user asks about the capital of France. It is Paris."
+     assert parsed["content"] == "The capital of France is Paris."
+     assert parsed["tool_calls"] == []
+
+     # Verify drop_thinking: first assistant's reasoning should be absent
+     assert "The user said hello" not in prompt
+
+     print(" [PASS] case 2: thinking without tools (encode + parse)")
+
+
+ def test_case_3():
+     """Interleaved thinking + search (developer with tools, latest_reminder)."""
+     messages = json.load(open(os.path.join(TESTS_DIR, "test_input_3.json")))
+     gold = open(os.path.join(TESTS_DIR, "test_output_3.txt")).read()
+     assert encode_messages(messages, thinking_mode="thinking") == gold
+     print(" [PASS] case 3: interleaved thinking + search")
+
+
+ def test_case_4():
+     """Quick instruction task with latest_reminder (chat mode, action task)."""
+     messages = json.load(open(os.path.join(TESTS_DIR, "test_input_4.json")))
+     gold = open(os.path.join(TESTS_DIR, "test_output_4.txt")).read()
+     assert encode_messages(messages, thinking_mode="chat") == gold
+     print(" [PASS] case 4: quick instruction task")
+
+
+ if __name__ == "__main__":
+     print("Running DeepSeek-V4 Encoding Tests...\n")
+     test_case_1()
+     test_case_2()
+     test_case_3()
+     test_case_4()
+     print("\nAll 4 tests passed!")
encoding/tests/test_input_1.json ADDED
@@ -0,0 +1,81 @@
+ {
+   "tools": [
+     {
+       "type": "function",
+       "function": {
+         "name": "get_weather",
+         "description": "Get the weather for a specific location",
+         "parameters": {
+           "type": "object",
+           "properties": {
+             "location": {
+               "type": "string",
+               "description": "The city name"
+             },
+             "unit": {
+               "type": "string",
+               "enum": ["celsius", "fahrenheit"],
+               "description": "Temperature unit"
+             }
+           },
+           "required": ["location"]
+         }
+       }
+     },
+     {
+       "type": "function",
+       "function": {
+         "name": "search",
+         "description": "Search the web for information",
+         "parameters": {
+           "type": "object",
+           "properties": {
+             "query": {
+               "type": "string",
+               "description": "Search query"
+             },
+             "num_results": {
+               "type": "integer",
+               "description": "Number of results to return"
+             }
+           },
+           "required": ["query"]
+         }
+       }
+     }
+   ],
+   "messages": [
+     {
+       "role": "system",
+       "content": "You are a helpful assistant."
+     },
+     {
+       "role": "user",
+       "content": "What's the weather in Beijing?"
+     },
+     {
+       "role": "assistant",
+       "reasoning_content": "The user wants to know the weather in Beijing. I should use the get_weather tool.",
+       "tool_calls": [
+         {
+           "id": "call_001",
+           "type": "function",
+           "function": {
+             "name": "get_weather",
+             "arguments": "{\"location\": \"Beijing\", \"unit\": \"celsius\"}"
+           }
+         }
+       ]
+     },
+     {
+       "role": "tool",
+       "tool_call_id": "call_001",
+       "content": "{\"temperature\": 22, \"condition\": \"sunny\", \"humidity\": 45}"
+     },
+     {
+       "role": "assistant",
+       "reasoning_content": "Got the weather data. Let me format a nice response.",
+       "content": "The weather in Beijing is currently sunny with a temperature of 22°C and 45% humidity."
+     }
+   ]
+ }
encoding/tests/test_input_2.json ADDED
@@ -0,0 +1,24 @@
+ [
+   {
+     "role": "system",
+     "content": "You are a helpful assistant."
+   },
+   {
+     "role": "user",
+     "content": "Hello"
+   },
+   {
+     "role": "assistant",
+     "reasoning_content": "The user said hello, I should greet back.",
+     "content": "Hi there! How can I help you?"
+   },
+   {
+     "role": "user",
+     "content": "What is the capital of France?"
+   },
+   {
+     "role": "assistant",
+     "reasoning_content": "The user asks about the capital of France. It is Paris.",
+     "content": "The capital of France is Paris."
+   }
+ ]
encoding/tests/test_input_3.json ADDED
@@ -0,0 +1,159 @@
+ [
+   {
+     "role": "system",
+     "content": "该助手为DeepSeek,由深度求索公司创造。"
+   },
+   {
+     "role": "latest_reminder",
+     "content": "2026-02-21,星期六,广州,App,中文"
+   },
+   {
+     "role": "developer",
+     "content": "小柴胡冲剂和布洛芬能一起吃吗?\n\nCITATION FORMAT: 【{cursor_id}†L{start_line_id}(-L{end_line_id})?】",
+     "tools": [
+       {
+         "type": "function",
+         "function": {
+           "name": "search",
+           "description": "Web search. Split multiple queries with '||'.",
+           "parameters": {
+             "type": "object",
+             "properties": {
+               "queries": {
+                 "type": "string",
+                 "description": "query1||query2"
+               }
+             },
+             "required": [
+               "queries"
+             ],
+             "additionalProperties": false,
+             "$schema": "http://json-schema.org/draft-07/schema#"
+           }
+         }
+       },
+       {
+         "type": "function",
+         "function": {
+           "name": "open",
+           "description": "Batch open IDs (format 【{id}†...】) or URLs.",
+           "parameters": {
+             "type": "object",
+             "properties": {
+               "open_list": {
+                 "type": "array",
+                 "items": {
+                   "type": "object",
+                   "properties": {
+                     "id": {
+                       "description": "ID or URL",
+                       "anyOf": [
+                         {
+                           "type": "integer"
+                         },
+                         {
+                           "type": "string"
+                         }
+                       ],
+                       "default": -1
+                     },
+                     "cursor": {
+                       "type": "integer",
+                       "description": "",
+                       "default": -1
+                     },
+                     "loc": {
+                       "type": "integer",
+                       "description": "Start line",
+                       "default": -1
+                     },
+                     "num_lines": {
+                       "type": "integer",
+                       "description": "",
+                       "default": -1
+                     },
+                     "view_source": {
+                       "type": "boolean",
+                       "description": "",
+                       "default": false
+                     }
+                   },
+                   "additionalProperties": false
+                 },
+                 "description": ""
+               }
+             },
+             "required": [
+               "open_list"
+             ],
+             "additionalProperties": false,
+             "$schema": "http://json-schema.org/draft-07/schema#"
+           }
+         }
+       },
+       {
+         "type": "function",
+         "function": {
+           "name": "find",
+           "description": "Find exact text pattern in pages.",
+           "parameters": {
+             "type": "object",
+             "properties": {
+               "find_list": {
+                 "type": "array",
+                 "items": {
+                   "type": "object",
+                   "properties": {
+                     "pattern": {
+                       "type": "string",
+                       "description": ""
+                     },
+                     "cursor": {
+                       "type": "integer",
+                       "description": "",
+                       "default": -1
+                     }
+                   },
+                   "required": [
+                     "pattern"
+                   ],
+                   "additionalProperties": false
+                 },
+                 "description": ""
+               }
+             },
+             "required": [
+               "find_list"
+             ],
+             "additionalProperties": false,
+             "$schema": "http://json-schema.org/draft-07/schema#"
+           }
+         }
+       }
+     ]
+   },
+   {
+     "role": "assistant",
+     "content": "",
+     "reasoning_content": "用户想知道小柴胡冲剂和布洛芬能否一起服用。",
+     "tool_calls": [
+       {
+         "type": "function",
+         "function": {
+           "name": "search",
+           "arguments": "{\"queries\": \"小柴胡冲剂 布洛芬 相互作用 一起吃\"}"
+         }
+       }
+     ]
+   },
+   {
+     "role": "tool",
+     "content": "[0]"
+   },
+   {
+     "role": "assistant",
+     "content": "请及时就医。",
+     "reasoning_content": "现在开始组织回答。",
+     "tool_calls": []
+   }
+ ]
encoding/tests/test_input_4.json ADDED
@@ -0,0 +1,28 @@
1
+ [
2
+ {
3
+ "role": "system",
4
+ "content": "该助手为DeepSeek-V3,由深度求索公司创造。\n今天是2025年10月17日,星期五。"
5
+ },
6
+ {
7
+ "role": "latest_reminder",
8
+ "content": "2024-11-15,上海市,App,中文"
9
+ },
10
+ {
11
+ "role": "user",
12
+ "content": "热海大滚锅是世界著名温泉吗"
13
+ },
14
+ {
15
+ "role": "assistant",
16
+ "content": "关于热海大滚锅是否是世界著名温泉,可以这样总结:它在**中国乃至全球的地热奇观中占有重要地位**,但“世界著名”的称号更侧重于它作为**独特的地质现象和旅游景点**,而非普遍的温泉疗养体验。\n\n为了让你快速了解,我整理了一个简要的表格:\n\n| 维度 | 热海大滚锅的地位与特点 |\n| :--- | :--- |\n| **地质奇观** | **世界罕见**的高温地热系统,泉眼多、水温高、形态多样。 |\n| **国内声誉** | **中国三大地热区之一**,**国家5A级旅游景区**,自古闻名(徐霞客曾游历并记载)。 |\n| **国际知名度** | 在一些旅游平台被提及为“世界六大温泉”之一,但此说法流传不广,其国际声誉更多建立在地质独特性上。 |\n| **核心体验** | **观赏地热奇观**(如97℃沸腾的“大滚锅”)、**体验温泉煮鸡蛋**。 |\n\n### 💡 游玩攻略与温馨提示\n\n如果你计划前往热海大滚锅,这里有一些实用信息供你参考:\n\n- **门票与开放时间**:\n - **门票**:景区门票约为**50元/人**。如果选择包含温泉沐浴的套餐,价格会更高,例如约**288元**。\n - **开放时间**:景区一般**08:00-18:00**开放,但具体时间可能变动,建议提前核实。\n\n- **特色体验**:\n - **温泉煮鸡蛋**:这几乎是必试项目。可以在景区门口购买用草绳串起的生鸡蛋(约5-8元/串),然后到“大滚锅”旁的指定区域蒸煮,几分钟便可熟食,趣味十足。\n - **金汤足浴**:可以直接用从“大滚锅”流出的温泉水泡脚,缓解旅途疲劳。\n\n- **注意事项**:\n - **安全第一**:“大滚锅”水温极高,务必遵守游览规则,在指定区域内观赏,切勿随意触碰泉水。\n - **规划行程**:建议为热海景区预留**3-4小时**的游览时间。景区内步道不走回头路,出入口有观光车接送。\n\n希望这些信息能帮助你更好地了解热海大滚锅。如果你对腾冲的其他景点或者行程规划有更多疑问,我很乐意提供进一步的信息。",
17
+ "mask": 1
18
+ },
19
+ {
20
+ "role": "user",
21
+ "content": "世界著名温泉有哪些",
22
+ "task": "action"
23
+ },
24
+ {
25
+ "role": "assistant",
26
+ "content": "Search"
27
+ }
28
+ ]
encoding/tests/test_output_1.txt ADDED
@@ -0,0 +1,36 @@
1
+ <|begin▁of▁sentence|>You are a helpful assistant.
2
+
3
+ ## Tools
4
+
5
+ You have access to a set of tools to help answer the user's question. You can invoke tools by writing a "<|DSML|tool_calls>" block like the following:
6
+
7
+ <|DSML|tool_calls>
8
+ <|DSML|invoke name="$TOOL_NAME">
9
+ <|DSML|parameter name="$PARAMETER_NAME" string="true|false">$PARAMETER_VALUE</|DSML|parameter>
10
+ ...
11
+ </|DSML|invoke>
12
+ <|DSML|invoke name="$TOOL_NAME2">
13
+ ...
14
+ </|DSML|invoke>
15
+ </|DSML|tool_calls>
16
+
17
+ String parameters should be specified as is and set `string="true"`. For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set `string="false"`.
18
+
19
+ If thinking_mode is enabled (triggered by <think>), you MUST output your complete reasoning inside <think>...</think> BEFORE any tool calls or final response.
20
+
21
+ Otherwise, output directly after </think> with tool calls or final response.
22
+
23
+ ### Available Tool Schemas
24
+
25
+ {"name": "get_weather", "description": "Get the weather for a specific location", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "The city name"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature unit"}}, "required": ["location"]}}
26
+ {"name": "search", "description": "Search the web for information", "parameters": {"type": "object", "properties": {"query": {"type": "string", "description": "Search query"}, "num_results": {"type": "integer", "description": "Number of results to return"}}, "required": ["query"]}}
27
+
28
+ You MUST strictly follow the above defined tool name and parameter schemas to invoke tool calls.
29
+ <|User|>What's the weather in Beijing?<|Assistant|><think>The user wants to know the weather in Beijing. I should use the get_weather tool.</think>
30
+
31
+ <|DSML|tool_calls>
32
+ <|DSML|invoke name="get_weather">
33
+ <|DSML|parameter name="location" string="true">Beijing</|DSML|parameter>
34
+ <|DSML|parameter name="unit" string="true">celsius</|DSML|parameter>
35
+ </|DSML|invoke>
36
+ </|DSML|tool_calls><|end▁of▁sentence|><|User|><tool_result>{"temperature": 22, "condition": "sunny", "humidity": 45}</tool_result><|Assistant|><think>Got the weather data. Let me format a nice response.</think>The weather in Beijing is currently sunny with a temperature of 22°C and 45% humidity.<|end▁of▁sentence|>
encoding/tests/test_output_2.txt ADDED
@@ -0,0 +1 @@
1
+ <|begin▁of▁sentence|>You are a helpful assistant.<|User|>Hello<|Assistant|></think>Hi there! How can I help you?<|end▁of▁sentence|><|User|>What is the capital of France?<|Assistant|><think>The user asks about the capital of France. It is Paris.</think>The capital of France is Paris.<|end▁of▁sentence|>
encoding/tests/test_output_3.txt ADDED
@@ -0,0 +1,38 @@
1
+ <|begin▁of▁sentence|>该助手为DeepSeek,由深度求索公司创造。<|latest_reminder|>2026-02-21,星期六,广州,App,中文<|User|>小柴胡冲剂和布洛芬能一起吃吗?
2
+
3
+ CITATION FORMAT: 【{cursor_id}†L{start_line_id}(-L{end_line_id})?】
4
+
5
+ ## Tools
6
+
7
+ You have access to a set of tools to help answer the user's question. You can invoke tools by writing a "<|DSML|tool_calls>" block like the following:
8
+
9
+ <|DSML|tool_calls>
10
+ <|DSML|invoke name="$TOOL_NAME">
11
+ <|DSML|parameter name="$PARAMETER_NAME" string="true|false">$PARAMETER_VALUE</|DSML|parameter>
12
+ ...
13
+ </|DSML|invoke>
14
+ <|DSML|invoke name="$TOOL_NAME2">
15
+ ...
16
+ </|DSML|invoke>
17
+ </|DSML|tool_calls>
18
+
19
+ String parameters should be specified as is and set `string="true"`. For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set `string="false"`.
20
+
21
+ If thinking_mode is enabled (triggered by <think>), you MUST output your complete reasoning inside <think>...</think> BEFORE any tool calls or final response.
22
+
23
+ Otherwise, output directly after </think> with tool calls or final response.
24
+
25
+ ### Available Tool Schemas
26
+
27
+ {"name": "search", "description": "Web search. Split multiple queries with '||'.", "parameters": {"type": "object", "properties": {"queries": {"type": "string", "description": "query1||query2"}}, "required": ["queries"], "additionalProperties": false, "$schema": "http://json-schema.org/draft-07/schema#"}}
28
+ {"name": "open", "description": "Batch open IDs (format 【{id}†...】) or URLs.", "parameters": {"type": "object", "properties": {"open_list": {"type": "array", "items": {"type": "object", "properties": {"id": {"description": "ID or URL", "anyOf": [{"type": "integer"}, {"type": "string"}], "default": -1}, "cursor": {"type": "integer", "description": "", "default": -1}, "loc": {"type": "integer", "description": "Start line", "default": -1}, "num_lines": {"type": "integer", "description": "", "default": -1}, "view_source": {"type": "boolean", "description": "", "default": false}}, "additionalProperties": false}, "description": ""}}, "required": ["open_list"], "additionalProperties": false, "$schema": "http://json-schema.org/draft-07/schema#"}}
29
+ {"name": "find", "description": "Find exact text pattern in pages.", "parameters": {"type": "object", "properties": {"find_list": {"type": "array", "items": {"type": "object", "properties": {"pattern": {"type": "string", "description": ""}, "cursor": {"type": "integer", "description": "", "default": -1}}, "required": ["pattern"], "additionalProperties": false}, "description": ""}}, "required": ["find_list"], "additionalProperties": false, "$schema": "http://json-schema.org/draft-07/schema#"}}
30
+
31
+ You MUST strictly follow the above defined tool name and parameter schemas to invoke tool calls.
32
+ <|Assistant|><think>用户想知道小柴胡冲剂和布洛芬能否一起服用。</think>
33
+
34
+ <|DSML|tool_calls>
35
+ <|DSML|invoke name="search">
36
+ <|DSML|parameter name="queries" string="true">小柴胡冲剂 布洛芬 相互作用 一起吃</|DSML|parameter>
37
+ </|DSML|invoke>
38
+ </|DSML|tool_calls><|end▁of▁sentence|><|User|><tool_result>[0]</tool_result><|Assistant|><think>现在开始组织回答。</think>请及时就医。<|end▁of▁sentence|>
encoding/tests/test_output_4.txt ADDED
@@ -0,0 +1,29 @@
1
+ <|begin▁of▁sentence|>该助手为DeepSeek-V3,由深度求索公司创造。
2
+ 今天是2025年10月17日,星期五。<|latest_reminder|>2024-11-15,上海市,App,中文<|User|>热海大滚锅是世界著名温泉吗<|Assistant|></think>关于热海大滚锅是否是世界著名温泉,可以这样总结:它在**中国乃至全球的地热奇观中占有重要地位**,但“世界著名”的称号更侧重于它作为**独特的地质现象和旅游景点**,而非普遍的温泉疗养体验。
3
+
4
+ 为了让你快速了解,我整理了一个简要的表格:
5
+
6
+ | 维度 | 热海大滚锅的地位与特点 |
7
+ | :--- | :--- |
8
+ | **地质奇观** | **世界罕见**的高温地热系统,泉眼多、水温高、形态多样。 |
9
+ | **国内声誉** | **中国三大地热区之一**,**国家5A级旅游景区**,自古闻名(徐霞客曾游历并记载)。 |
10
+ | **国际知名度** | 在一些旅游平台被提及为“世界六大温泉”之一,但此说法流传不广,其国际声誉更多建立在地质独特性上。 |
11
+ | **核心体验** | **观赏地热奇观**(如97℃沸腾的“大滚锅”)、**体验温泉煮鸡蛋**。 |
12
+
13
+ ### 💡 游玩攻略与温馨提示
14
+
15
+ 如果你计划前往热海大滚锅,这里有一些实用信息供你参考:
16
+
17
+ - **门票与开放时间**:
18
+ - **门票**:景区门票约为**50元/人**。如果选择包含温泉沐浴的套餐,价格会更高,例如约**288元**。
19
+ - **开放时间**:景区一般**08:00-18:00**开放,但具体时间可能变动,建议提前核实。
20
+
21
+ - **特色体验**:
22
+ - **温泉煮鸡蛋**:这几乎是必试项目。可以在景区门口购买用草绳串起的生鸡蛋(约5-8元/串),然后到“大滚锅”旁的指定区域蒸煮,几分钟便可熟食,趣味十足。
23
+ - **金汤足浴**:可以直接用从“大滚锅”流出的温泉水泡脚,缓解旅途疲劳。
24
+
25
+ - **注意事项**:
26
+ - **安全第一**:“大滚锅”水温极高,务必遵守游览规则,在指定区域内观赏,切勿随意触碰泉水。
27
+ - **规划行程**:建议为热海景区预留**3-4小时**的游览时间。景区内步道不走回头路,出入口有观光车接送。
28
+
29
+ 希望这些信息能帮助你更好地了解热海大滚锅。如果你对腾冲的其他景点或者行程规划有更多疑问,我很乐意提供进一步的信息。<|end▁of▁sentence|><|User|>世界著名温泉有哪些<|Assistant|></think><|action|>Search<|end▁of▁sentence|>
generation_config.json ADDED
@@ -0,0 +1,9 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 0,
4
+ "eos_token_id": 1,
5
+ "do_sample": true,
6
+ "temperature": 1.0,
7
+ "top_p": 1.0,
8
+ "transformers_version": "4.46.3"
9
+ }
inference/README.md ADDED
@@ -0,0 +1,25 @@
1
+ # Inference code for DeepSeek models
2
+
3
+ First, convert the Hugging Face model weight files to the format used by this project.
4
+ ```bash
5
+ export EXPERTS=384
6
+ export MP=8
7
+ export CONFIG=config_w4a16.json
8
+ python convert_w4a16.py --hf-ckpt-path ${HF_CKPT_PATH} --save-path ${SAVE_PATH} --n-experts ${EXPERTS} --model-parallel ${MP}
9
+ ```
10
+
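+ The example above targets the W4A16 checkpoint; `convert.py` handles the FP8/FP4 checkpoints with the same flags plus an `--expert-dtype` choice. A sketch of that invocation, paired with `config.json`:
+ ```bash
+ python convert.py --hf-ckpt-path ${HF_CKPT_PATH} --save-path ${SAVE_PATH} --n-experts ${EXPERTS} --model-parallel ${MP} --expert-dtype fp8
+ ```
+ 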
11
+ Then chat with the DeepSeek model at will!
12
+ ```bash
13
+ torchrun --nproc-per-node ${MP} generate.py --ckpt-path ${SAVE_PATH} --config ${CONFIG} --interactive
14
+ ```
15
+
16
+ Or run batch inference from a file.
17
+ ```bash
18
+ torchrun --nproc-per-node ${MP} generate.py --ckpt-path ${SAVE_PATH} --config ${CONFIG} --input-file ${FILE}
19
+ ```
20
+
21
+ Or run multi-node inference.
22
+ ```bash
23
+ torchrun --nnodes ${NODES} --nproc-per-node $((MP / NODES)) --node-rank $RANK --master-addr $ADDR generate.py --ckpt-path ${SAVE_PATH} --config ${CONFIG} --input-file ${FILE}
24
+ ```
25
+
inference/config.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "vocab_size": 129280,
3
+ "dim": 7168,
4
+ "moe_inter_dim": 3072,
5
+ "n_layers": 61,
6
+ "n_hash_layers": 3,
7
+ "n_heads": 128,
8
+ "n_routed_experts": 384,
9
+ "n_shared_experts": 1,
10
+ "n_activated_experts": 6,
11
+ "score_func": "sqrtsoftplus",
12
+ "route_scale": 2.5,
13
+ "swiglu_limit": 10.0,
14
+ "q_lora_rank": 1536,
15
+ "head_dim": 512,
16
+ "rope_head_dim": 64,
17
+ "o_groups": 16,
18
+ "o_lora_rank": 1024,
19
+ "window_size": 128,
20
+ "original_seq_len": 65536,
21
+ "rope_theta": 10000,
22
+ "rope_factor": 16,
23
+ "beta_fast": 32,
24
+ "beta_slow": 1,
25
+ "index_n_heads": 64,
26
+ "index_head_dim": 128,
27
+ "index_topk": 1024,
28
+ "hc_mult": 4,
29
+ "hc_sinkhorn_iters": 20,
30
+ "dtype": "fp8",
31
+ "scale_fmt": "ue8m0",
32
+ "expert_dtype": "fp4",
33
+ "compress_rope_theta": 160000,
34
+ "compress_ratios": [128, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 0]
35
+ }
inference/config_w4a16.json ADDED
@@ -0,0 +1,34 @@
1
+
2
+ {
3
+ "vocab_size": 129280,
4
+ "dim": 7168,
5
+ "moe_inter_dim": 3072,
6
+ "n_layers": 61,
7
+ "n_hash_layers": 3,
8
+ "n_heads": 128,
9
+ "n_routed_experts": 384,
10
+ "n_shared_experts": 1,
11
+ "n_activated_experts": 6,
12
+ "score_func": "sqrtsoftplus",
13
+ "route_scale": 2.5,
14
+ "swiglu_limit": 10.0,
15
+ "q_lora_rank": 1536,
16
+ "head_dim": 512,
17
+ "rope_head_dim": 64,
18
+ "o_groups": 16,
19
+ "o_lora_rank": 1024,
20
+ "window_size": 128,
21
+ "original_seq_len": 65536,
22
+ "rope_theta": 10000,
23
+ "rope_factor": 16,
24
+ "beta_fast": 32,
25
+ "beta_slow": 1,
26
+ "index_n_heads": 64,
27
+ "index_head_dim": 128,
28
+ "index_topk": 1024,
29
+ "hc_mult": 4,
30
+ "hc_sinkhorn_iters": 20,
31
+ "dtype": "w4a16",
32
+ "compress_rope_theta": 160000,
33
+ "compress_ratios": [128, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 128, 4, 0]
34
+ }
inference/convert.py ADDED
@@ -0,0 +1,168 @@
1
+ import os
2
+ import shutil
3
+ from argparse import ArgumentParser
4
+ from glob import glob
5
+ from tqdm import tqdm, trange
6
+
7
+ import torch
8
+ from safetensors.torch import safe_open, save_file
9
+
10
+
11
+ FP4_TABLE = torch.tensor([
12
+ 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
13
+ 0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0
14
+ ], dtype=torch.float32)
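+ # E2M1 nibble layout: the low 3 bits of each 4-bit code index the magnitude
+ # {0, 0.5, 1, 1.5, 2, 3, 4, 6} and bit 3 is the sign, hence the mirrored
+ # negative half of the table.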
15
+
16
+
17
+ def cast_e2m1fn_to_e4m3fn(x: torch.Tensor, scale: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
18
+ """
19
+ Casts a tensor from e2m1fn to e4m3fn losslessly.
20
+ """
21
+ assert x.dtype == torch.int8
22
+ assert x.ndim == 2
23
+ out_dim, in_dim = x.size()
24
+ in_dim *= 2
25
+ fp8_block_size = 128
26
+ fp4_block_size = 32
27
+ assert in_dim % fp8_block_size == 0 and out_dim % fp8_block_size == 0
28
+ assert scale.size(0) == out_dim and scale.size(1) == in_dim // fp4_block_size
29
+
30
+ x = x.view(torch.uint8)
31
+ low = x & 0x0F
32
+ high = (x >> 4) & 0x0F
33
+ x = torch.stack([FP4_TABLE[low.long()], FP4_TABLE[high.long()]], dim=-1).flatten(2)
34
+
35
+ # max_fp4 (6.0) * MAX_OFFSET must fit in e4m3fn (max 448)
36
+ # 6.0 * 2^6 = 384 < 448; 6.0 * 2^7 = 768 > 448; so MAX_OFFSET_BITS = 6
37
+ MAX_OFFSET_BITS = 6
38
+
39
+ bOut = out_dim // fp8_block_size
40
+ bIn = in_dim // fp8_block_size
41
+ # bOut, bIn, 128, 128
42
+ x = x.view(bOut, fp8_block_size, bIn, fp8_block_size).transpose(1, 2)
43
+ # bOut, bIn, 128*4
44
+ scale = scale.float().view(bOut, fp8_block_size, bIn, -1).transpose(1, 2).flatten(2)
45
+ # bOut, bIn, 1
46
+ scale_max_offset_bits = scale.amax(dim=-1, keepdim=True) / (2**MAX_OFFSET_BITS)
47
+ # bOut, bIn, 128*4
48
+ offset = scale / scale_max_offset_bits
49
+ # bOut, bIn, 128, 128
50
+ offset = offset.unflatten(-1, (fp8_block_size, -1)).repeat_interleave(fp4_block_size, dim=-1)
51
+ x = (x * offset).transpose(1, 2).reshape(out_dim, in_dim)
52
+ return x.to(torch.float8_e4m3fn), scale_max_offset_bits.squeeze(-1).to(torch.float8_e8m0fnu)
53
+
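+ # Round-trip sketch (illustrative, not executed here): with
+ #   y, s2 = cast_e2m1fn_to_e4m3fn(packed, scale)
+ # dequantising y by its 128x128 block scales recovers the original weights:
+ #   y.float().view(out//128, 128, in//128, 128) * s2.float()[:, None, :, None]
+ # equals FP4_TABLE[codes] times the original 1x32 scales, since y = fp4 * offset
+ # and offset = scale / s2.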
54
+
55
+ mapping = {
56
+ "embed_tokens": ("embed", 0),
57
+ "input_layernorm": ("attn_norm", None),
58
+ "post_attention_layernorm": ("ffn_norm", None),
59
+ "q_proj": ("wq", 0),
60
+ "q_a_proj": ("wq_a", None),
61
+ "q_a_layernorm": ("q_norm", None),
62
+ "q_b_proj": ("wq_b", 0),
63
+ "kv_a_proj_with_mqa": ("wkv_a", None),
64
+ "kv_a_layernorm": ("kv_norm", None),
65
+ "kv_b_proj": ("wkv_b", 0),
66
+ "o_proj": ("wo", 1),
67
+ "gate_proj": ("w1", 0),
68
+ "down_proj": ("w2", 1),
69
+ "up_proj": ("w3", 0),
70
+ "lm_head": ("head", 0),
71
+
72
+ "embed": ("embed", 0),
73
+ "wq_b": ("wq_b", 0),
74
+ "wo_a": ("wo_a", 0),
75
+ "wo_b": ("wo_b", 1),
76
+ "head": ("head", 0),
77
+ "attn_sink": ("attn_sink", 0),
78
+ "weights_proj": ("weights_proj", 0),
79
+ }
80
+
81
+
82
+ def main(hf_ckpt_path, save_path, n_experts, mp, expert_dtype):
83
+ """
84
+ Converts and saves model checkpoint files into a specified format.
85
+
86
+ Args:
87
+ hf_ckpt_path (str): Path to the directory containing the input checkpoint files.
88
+ save_path (str): Path to the directory where the converted checkpoint files will be saved.
89
+ n_experts (int): Total number of experts in the model.
90
+ mp (int): Model parallelism factor.
+ expert_dtype (str): Routed-expert weight format; "fp8" unpacks the packed FP4 experts to e4m3fn, any other value keeps them as packed FP4.
91
+
92
+ Returns:
93
+ None
94
+ """
95
+ torch.set_num_threads(8)
96
+ n_local_experts = n_experts // mp
97
+ state_dicts = [{} for _ in range(mp)]
98
+
99
+ for file_path in tqdm(glob(os.path.join(hf_ckpt_path, "*.safetensors"))):
100
+ with safe_open(file_path, framework="pt", device="cpu") as f:
101
+ for name in f.keys():
102
+ param: torch.Tensor = f.get_tensor(name)
103
+ if name.startswith("model."):
104
+ name = name[len("model."):]
105
+ if name.startswith("mtp.") and ("emb" in name or name.endswith("head.weight")):
106
+ continue
107
+ name = name.replace("self_attn", "attn")
108
+ name = name.replace("mlp", "ffn")
109
+ name = name.replace("weight_scale_inv", "scale")
110
+ name = name.replace("e_score_correction_bias", "bias")
111
+ if any(x in name for x in ["hc", "attn_sink", "tie2eid", "ape"]): # without .weight
112
+ key = name.split(".")[-1]
113
+ else:
114
+ key = name.split(".")[-2]
115
+ if key in mapping:
116
+ new_key, dim = mapping[key]
117
+ else:
118
+ new_key, dim = key, None
119
+ name = name.replace(key, new_key)
120
+ for i in range(mp):
121
+ new_param = param
122
+ if "experts" in name and "shared_experts" not in name:
123
+ idx = int(name.split(".")[-3])
124
+ if idx < i * n_local_experts or idx >= (i + 1) * n_local_experts:
125
+ continue
126
+ elif dim is not None:
127
+ assert param.size(dim) % mp == 0, f"Dimension {dim} must be divisible by {mp}"
128
+ shard_size = param.size(dim) // mp
129
+ new_param = param.narrow(dim, i * shard_size, shard_size).contiguous()
130
+ state_dicts[i][name] = new_param
131
+
132
+ os.makedirs(save_path, exist_ok=True)
133
+
134
+ for i in trange(mp):
135
+ names = list(state_dicts[i].keys())
136
+ for name in names:
137
+ if name.endswith("wo_a.weight"):
138
+ weight = state_dicts[i][name]
139
+ scale = state_dicts[i].pop(name.replace("weight", "scale"))
140
+ weight = weight.unflatten(0, (-1, 128)).unflatten(-1, (-1, 128)).float() * scale[:, None, :, None].float()
141
+ state_dicts[i][name] = weight.flatten(2, 3).flatten(0, 1).bfloat16()
142
+ elif "experts" in name and state_dicts[i][name].dtype == torch.int8:
143
+ if expert_dtype == "fp8":
144
+ scale_name = name.replace("weight", "scale")
145
+ weight = state_dicts[i].pop(name)
146
+ scale = state_dicts[i].pop(scale_name)
147
+ state_dicts[i][name], state_dicts[i][scale_name] = cast_e2m1fn_to_e4m3fn(weight, scale)
148
+ else:
149
+ state_dicts[i][name] = state_dicts[i][name].view(torch.float4_e2m1fn_x2)
150
+ save_file(state_dicts[i], os.path.join(save_path, f"model{i}-mp{mp}.safetensors"))
151
+
152
+ for file in ["tokenizer.json", "tokenizer_config.json"]:
153
+ old_file_path = os.path.join(hf_ckpt_path, file)
154
+ new_file_path = os.path.join(save_path, file)
155
+ if os.path.exists(old_file_path):
156
+ shutil.copyfile(old_file_path, new_file_path)
157
+
158
+
159
+ if __name__ == "__main__":
160
+ parser = ArgumentParser()
161
+ parser.add_argument("--hf-ckpt-path", type=str, required=True)
162
+ parser.add_argument("--save-path", type=str, required=True)
163
+ parser.add_argument("--n-experts", type=int, required=True)
164
+ parser.add_argument("--model-parallel", type=int, required=True)
165
+ parser.add_argument("--expert-dtype", type=str, choices=["fp8", "fp4"], required=False, default=None)
166
+ args = parser.parse_args()
167
+ assert args.n_experts % args.model_parallel == 0, "Number of experts must be divisible by model parallelism"
168
+ main(args.hf_ckpt_path, args.save_path, args.n_experts, args.model_parallel, args.expert_dtype)
inference/convert_w4a16.py ADDED
@@ -0,0 +1,246 @@
1
+ """Convert an auto-round / GPTQ W4A16 packed HuggingFace checkpoint of DeepSeek-V4
2
+ into the MP-sharded local format consumed by `model.py`/`generate.py`.
3
+
4
+ Packing convention (auto-round → auto_gptq):
5
+ - qweight : int32 [in_features // 8, out_features], LSB-first 4-bit packed along dim 0
6
+ - qzeros : int32 [in_features // group_size, out_features // 8], LSB-first 4-bit packed along dim 1
7
+ - scales : fp16 [in_features // group_size, out_features]
8
+
9
+ Sharding rules per linear:
10
+ - ColumnParallel (shard output dim, original `dim=0` in `mapping`):
11
+ qweight along dim 1; qzeros along dim 1 (must be divisible by 8 first, then by world_size);
12
+ scales along dim 1.
13
+ - RowParallel (shard input dim, original `dim=1` in `mapping`):
14
+ qweight along dim 0 (must be divisible by 8 first, then by world_size);
15
+ qzeros along dim 0 (must be divisible by group_size first, then by world_size);
16
+ scales along dim 0.
17
+
18
+ Non-quantised tensors (embed.weight, *.norm.weight, attn_sink, hc_*, ape, gate.bias,
19
+ gate.tid2eid, etc.) follow the same rules as the original `convert.py`.
20
+ """
21
+
22
+ import os
23
+ import shutil
24
+ from argparse import ArgumentParser
25
+ from glob import glob
26
+ from tqdm import tqdm, trange
27
+
28
+ import torch
29
+ from safetensors.torch import safe_open, save_file
30
+
31
+
32
+ GROUP_SIZE = 128
33
+
34
+ # Same name remapping as the original convert.py
35
+ mapping = {
36
+ "embed_tokens": ("embed", 0),
37
+ "input_layernorm": ("attn_norm", None),
38
+ "post_attention_layernorm": ("ffn_norm", None),
39
+ "q_proj": ("wq", 0),
40
+ "q_a_proj": ("wq_a", None),
41
+ "q_a_layernorm": ("q_norm", None),
42
+ "q_b_proj": ("wq_b", 0),
43
+ "kv_a_proj_with_mqa": ("wkv_a", None),
44
+ "kv_a_layernorm": ("kv_norm", None),
45
+ "kv_b_proj": ("wkv_b", 0),
46
+ "o_proj": ("wo", 1),
47
+ "gate_proj": ("w1", 0),
48
+ "down_proj": ("w2", 1),
49
+ "up_proj": ("w3", 0),
50
+ "lm_head": ("head", 0),
51
+
52
+ # Already-translated names (used by the inference checkpoints we already have)
53
+ "embed": ("embed", 0),
54
+ "wq_a": ("wq_a", None),
55
+ "wq_b": ("wq_b", 0),
56
+ "wkv": ("wkv", None),
57
+ "wo_a": ("wo_a", 0),
58
+ "wo_b": ("wo_b", 1),
59
+ "w1": ("w1", 0),
60
+ "w2": ("w2", 1),
61
+ "w3": ("w3", 0),
62
+ "head": ("head", 0),
63
+ "weights_proj": ("weights_proj", 0),
64
+ # special non-weight keys
65
+ "attn_sink": ("attn_sink", 0),
66
+ "ape": ("ape", None),
67
+ # NOTE: 'gate' is intentionally NOT in this mapping -- the routing gate is a
68
+ # plain nn.Parameter that is replicated on every rank.
69
+ }
70
+
71
+
72
+ # Suffixes that mark the three pieces of a packed W4A16 linear.
73
+ QUANT_SUFFIXES = (".qweight", ".qzeros", ".scales")
74
+
75
+
76
+ def shard_quant(qweight: torch.Tensor, qzeros: torch.Tensor, scales: torch.Tensor,
77
+ shard_dim: int, mp: int):
78
+ """Yield (qweight_i, qzeros_i, scales_i) for i in range(mp).
79
+
80
+ shard_dim is the *logical* dim of the dequantised weight: 0 == output (column parallel),
81
+ 1 == input (row parallel)."""
82
+ out = qweight.size(1)
83
+ in_packed = qweight.size(0) # in_features // 8
84
+ n_groups = scales.size(0) # in_features // group_size
85
+
86
+ if shard_dim == 0: # ColumnParallel: shard along OUTPUT
87
+ assert out % mp == 0, f"out={out} not divisible by mp={mp}"
88
+ # qzeros packs 8 outputs per int32 in dim 1, so need (out/mp) % 8 == 0
89
+ assert (out // mp) % 8 == 0, f"shard {out//mp} of out dim not divisible by 8 (qzeros packing)"
90
+ sh_out = out // mp
91
+ sh_qz_cols = qzeros.size(1) // mp # == out / 8 / mp
92
+ for i in range(mp):
93
+ yield (
94
+ qweight.narrow(1, i * sh_out, sh_out).contiguous(),
95
+ qzeros.narrow(1, i * sh_qz_cols, sh_qz_cols).contiguous(),
96
+ scales.narrow(1, i * sh_out, sh_out).contiguous(),
97
+ )
98
+ elif shard_dim == 1: # RowParallel: shard along INPUT
99
+ # qweight packs 8 inputs per int32 in dim 0, scales/qzeros are per-group on dim 0
100
+ assert in_packed % mp == 0, f"in_packed={in_packed} not divisible by mp={mp}"
101
+ assert n_groups % mp == 0, f"n_groups={n_groups} not divisible by mp={mp}"
102
+ sh_in_packed = in_packed // mp
103
+ sh_groups = n_groups // mp
104
+ for i in range(mp):
105
+ yield (
106
+ qweight.narrow(0, i * sh_in_packed, sh_in_packed).contiguous(),
107
+ qzeros.narrow(0, i * sh_groups, sh_groups).contiguous(),
108
+ scales.narrow(0, i * sh_groups, sh_groups).contiguous(),
109
+ )
110
+ else:
111
+ # Replicate
112
+ for _ in range(mp):
113
+ yield qweight, qzeros, scales
114
+
115
+
116
+ def get_layer_key(name: str):
117
+ """Return the linear-name token (e.g. wq_a, w1, head) used for the rename mapping."""
118
+ parts = name.split(".")
119
+ if name.endswith(QUANT_SUFFIXES):
120
+ return parts[-2] # ...x.qweight -> x
121
+ if name.endswith(".bias") and "gate" in name:
122
+ return "gate" # ffn.gate.bias
123
+ if name.endswith(".tid2eid"):
124
+ return "gate"
125
+ if any(k in parts for k in ("hc_attn_fn", "hc_attn_base", "hc_attn_scale",
126
+ "hc_ffn_fn", "hc_ffn_base", "hc_ffn_scale",
127
+ "hc_head_fn", "hc_head_base", "hc_head_scale",
128
+ "attn_sink", "ape")):
129
+ return parts[-1]
130
+ return parts[-2]
131
+
132
+
133
+ def main(hf_ckpt_path, save_path, n_experts, mp):
134
+ torch.set_num_threads(8)
135
+ n_local_experts = n_experts // mp
136
+ state_dicts = [{} for _ in range(mp)]
137
+
138
+ # Group all fragments belonging to the same logical linear so we can shard
139
+ # qweight/qzeros/scales together.
140
+ pending: dict[str, dict[str, torch.Tensor]] = {}
141
+
142
+ def emit_linear(base_name: str, parts: dict[str, torch.Tensor], shard_dim):
143
+ """Distribute a quantised linear (3 tensors) across `mp` shards."""
144
+ qweight = parts["qweight"]
145
+ qzeros = parts["qzeros"]
146
+ scales = parts["scales"].to(torch.bfloat16) # store bf16 instead of fp16
147
+ # Expert-local pruning: only the rank that owns this expert keeps the tensors.
148
+ if "experts" in base_name and "shared_experts" not in base_name:
149
+ idx = int(base_name.split(".experts.")[1].split(".")[0])
150
+ owner = idx // n_local_experts
151
+ state_dicts[owner][base_name + ".qweight"] = qweight
152
+ state_dicts[owner][base_name + ".qzeros"] = qzeros
153
+ state_dicts[owner][base_name + ".scales"] = scales
154
+ return
155
+ if shard_dim is None:
156
+ # Replicate across all ranks
157
+ for i in range(mp):
158
+ state_dicts[i][base_name + ".qweight"] = qweight
159
+ state_dicts[i][base_name + ".qzeros"] = qzeros
160
+ state_dicts[i][base_name + ".scales"] = scales
161
+ else:
162
+ for i, (qw, qz, sc) in enumerate(shard_quant(qweight, qzeros, scales, shard_dim, mp)):
163
+ state_dicts[i][base_name + ".qweight"] = qw
164
+ state_dicts[i][base_name + ".qzeros"] = qz
165
+ state_dicts[i][base_name + ".scales"] = sc
166
+
167
+ files = sorted(glob(os.path.join(hf_ckpt_path, "*.safetensors")))
168
+ for file_path in tqdm(files, desc="files"):
169
+ with safe_open(file_path, framework="pt", device="cpu") as f:
170
+ for orig_name in f.keys():
171
+ # ----- name remapping (mirrors original convert.py) -----
172
+ name = orig_name
173
+ if name.startswith("model."):
174
+ name = name[len("model."):]
175
+ if name.startswith("mtp.") and ("emb" in name or name.endswith("head.weight")):
176
+ continue
177
+ name = name.replace("self_attn", "attn")
178
+ name = name.replace("mlp", "ffn")
179
+ name = name.replace("weight_scale_inv", "scale")
180
+ name = name.replace("e_score_correction_bias", "bias")
181
+
182
+ key = get_layer_key(name)
183
+ if key in mapping:
184
+ new_key, dim = mapping[key]
185
+ name = name.replace(key, new_key)
186
+ else:
187
+ dim = None
188
+
189
+ tensor = f.get_tensor(orig_name)
190
+
191
+ # ----- handle the three-piece quantised linear -----
192
+ # `shared_experts` are plain (non-parallel) Linears in the model;
193
+ # never shard them even though `w1/w2/w3` are in the mapping.
194
+ if "shared_experts" in name:
195
+ dim = None
196
+
197
+ if orig_name.endswith(QUANT_SUFFIXES):
198
+ base = name.rsplit(".", 1)[0]
199
+ suf = name.rsplit(".", 1)[1] # qweight|qzeros|scales
200
+ pending.setdefault(base, {"_dim": dim})[suf] = tensor
201
+ pending[base]["_dim"] = dim
202
+ parts = pending[base]
203
+ if all(s in parts for s in ("qweight", "qzeros", "scales")):
204
+ emit_linear(base, parts, parts["_dim"])
205
+ del pending[base]
206
+ continue
207
+
208
+ # ----- non-quantised tensor -----
209
+ if "experts" in name and "shared_experts" not in name:
210
+ idx = int(name.split(".experts.")[1].split(".")[0])
211
+ owner = idx // n_local_experts
212
+ state_dicts[owner][name] = tensor
213
+ continue
214
+
215
+ if dim is None:
216
+ for i in range(mp):
217
+ state_dicts[i][name] = tensor
218
+ else:
219
+ assert tensor.size(dim) % mp == 0, f"{name} dim {dim} ({tensor.size(dim)}) not divisible by {mp}"
220
+ sh = tensor.size(dim) // mp
221
+ for i in range(mp):
222
+ state_dicts[i][name] = tensor.narrow(dim, i * sh, sh).contiguous()
223
+
224
+ if pending:
225
+ raise RuntimeError(f"Incomplete quantised linears: {list(pending)[:5]}")
226
+
227
+ os.makedirs(save_path, exist_ok=True)
228
+ for i in trange(mp, desc="write shards"):
229
+ save_file(state_dicts[i], os.path.join(save_path, f"model{i}-mp{mp}.safetensors"))
230
+
231
+ for fn in ["tokenizer.json", "tokenizer_config.json"]:
232
+ src = os.path.join(hf_ckpt_path, fn)
233
+ dst = os.path.join(save_path, fn)
234
+ if os.path.exists(src):
235
+ shutil.copyfile(src, dst)
236
+
237
+
238
+ if __name__ == "__main__":
239
+ p = ArgumentParser()
240
+ p.add_argument("--hf-ckpt-path", required=True)
241
+ p.add_argument("--save-path", required=True)
242
+ p.add_argument("--n-experts", type=int, required=True)
243
+ p.add_argument("--model-parallel", type=int, required=True)
244
+ a = p.parse_args()
245
+ assert a.n_experts % a.model_parallel == 0
246
+ main(a.hf_ckpt_path, a.save_path, a.n_experts, a.model_parallel)
inference/generate.py ADDED
@@ -0,0 +1,159 @@
1
+ import os
2
+ # Prevent gptqmodel from setting CUDA_DEVICE_ORDER=PCI_BUS_ID (breaks multi-GPU on some systems)
3
+ os.environ.setdefault("CUDA_DEVICE_ORDER", "FASTEST_FIRST")
4
+ import json
5
+ import sys
6
+ from argparse import ArgumentParser
7
+ from typing import List
8
+
9
+ import torch
10
+ import torch.distributed as dist
11
+ from transformers import AutoTokenizer
12
+ from safetensors.torch import load_model
13
+
14
+ from model import Transformer, ModelArgs
15
+ current_dir = os.path.dirname(os.path.abspath(__file__))
16
+ encoding_dir = os.path.join(current_dir, '../encoding')
17
+ sys.path.insert(0, os.path.abspath(encoding_dir))
18
+ from encoding_dsv4 import encode_messages, parse_message_from_completion_text
19
+
20
+
21
+ def sample(logits, temperature: float = 1.0):
22
+ """Gumbel-max trick: equivalent to multinomial sampling but faster on GPU,
23
+ since it avoids the GPU-to-CPU sync in torch.multinomial."""
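+ # argmax_i p_i / E_i with i.i.d. E_i ~ Exp(1) picks index i with probability p_i:
+ # the "exponential race" form of the Gumbel-max trick.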
24
+ logits = logits / max(temperature, 1e-5)
25
+ probs = torch.softmax(logits, dim=-1, dtype=torch.float32)
26
+ return probs.div_(torch.empty_like(probs).exponential_(1)).argmax(dim=-1)
27
+
28
+
29
+ @torch.inference_mode()
30
+ def generate(
31
+ model: Transformer,
32
+ prompt_tokens: List[List[int]],
33
+ max_new_tokens: int,
34
+ eos_id: int,
35
+ temperature: float = 1.0
36
+ ) -> List[List[int]]:
37
+ """Batch generation with left-padded prompts.
38
+
39
+ The first forward pass processes [min_prompt_len:] tokens (prefill phase).
40
+ Subsequent passes generate one token at a time (decode phase). For positions
41
+ still within a prompt, the ground-truth token overrides the model's prediction.
42
+ """
43
+ prompt_lens = [len(t) for t in prompt_tokens]
44
+ assert max(prompt_lens) <= model.max_seq_len, f"Prompt length exceeds model maximum sequence length (max_seq_len={model.max_seq_len})"
45
+ total_len = min(model.max_seq_len, max_new_tokens + max(prompt_lens))
46
+ tokens = torch.full((len(prompt_tokens), total_len), -1, dtype=torch.long)
47
+ for i, t in enumerate(prompt_tokens):
48
+ tokens[i, :len(t)] = torch.tensor(t, dtype=torch.long)
49
+ prev_pos = 0
50
+ finished = torch.tensor([False] * len(prompt_tokens))
51
+ prompt_mask = tokens != -1
52
+ for cur_pos in range(min(prompt_lens), total_len):
53
+ logits = model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
54
+ if temperature > 0:
55
+ next_token = sample(logits, temperature)
56
+ else:
57
+ next_token = logits.argmax(dim=-1)
58
+ next_token = torch.where(prompt_mask[:, cur_pos], tokens[:, cur_pos], next_token)
59
+ tokens[:, cur_pos] = next_token
60
+ finished |= torch.logical_and(~prompt_mask[:, cur_pos], next_token == eos_id)
61
+ prev_pos = cur_pos
62
+ if finished.all():
63
+ break
64
+ completion_tokens = []
65
+ for i, toks in enumerate(tokens.tolist()):
66
+ toks = toks[prompt_lens[i]:prompt_lens[i]+max_new_tokens]
67
+ if eos_id in toks:
68
+ toks = toks[:toks.index(eos_id)]
69
+ toks.append(eos_id)
70
+ completion_tokens.append(toks)
71
+ return completion_tokens
72
+
73
+
74
+ def main(
75
+ ckpt_path: str,
76
+ config: str,
77
+ input_file: str = "",
78
+ interactive: bool = True,
79
+ max_new_tokens: int = 100,
80
+ temperature: float = 1.0,
81
+ ) -> None:
82
+ world_size = int(os.getenv("WORLD_SIZE", "1"))
83
+ rank = int(os.getenv("RANK", "0"))
84
+ local_rank = int(os.getenv("LOCAL_RANK", "0"))
85
+ if world_size > 1:
86
+ dist.init_process_group("nccl")
87
+ global print
88
+ if rank != 0:
89
+ print = lambda *_, **__: None
90
+ torch.cuda.set_device(local_rank)
91
+ torch.cuda.memory._set_allocator_settings("expandable_segments:True")
92
+ torch.set_default_dtype(torch.bfloat16)
93
+ torch.set_num_threads(8)
94
+ torch.manual_seed(33377335)
95
+ with open(config) as f:
96
+ args = ModelArgs(**json.load(f))
97
+ if interactive:
98
+ args.max_batch_size = 1
99
+ print(args)
100
+ with torch.device("cuda"):
101
+ model = Transformer(args)
102
+ tokenizer = AutoTokenizer.from_pretrained(ckpt_path)
103
+ print("load model")
104
+ load_model(model, os.path.join(ckpt_path, f"model{rank}-mp{world_size}.safetensors"), strict=False)
105
+ if args.dtype == "w4a16":
106
+ model.init_woq_layers()
107
+ torch.set_default_device("cuda")
108
+ print("I'm DeepSeek 👋")
109
+
110
+ if interactive:
111
+ messages = []
112
+ while True:
113
+ if world_size == 1:
114
+ prompt = input(">>> ")
115
+ elif rank == 0:
116
+ prompt = input(">>> ")
117
+ objects = [prompt]
118
+ dist.broadcast_object_list(objects, 0)
119
+ else:
120
+ objects = [None]
121
+ dist.broadcast_object_list(objects, 0)
122
+ prompt = objects[0]
123
+ if prompt == "/exit":
124
+ break
125
+ elif prompt == "/clear":
126
+ messages.clear()
127
+ continue
128
+ messages.append({"role": "user", "content": prompt})
129
+ prompt_tokens = tokenizer.encode(encode_messages(messages, thinking_mode="chat"))
130
+ completion_tokens = generate(model, [prompt_tokens], max_new_tokens, tokenizer.eos_token_id, temperature)
131
+ completion = tokenizer.decode(completion_tokens[0])
132
+ print(completion)
133
+ messages.append(parse_message_from_completion_text(completion, thinking_mode="chat"))
134
+ else:
135
+ with open(input_file) as f:
136
+ prompts = f.read().split("\n\n")
137
+ prompt_tokens = [tokenizer.encode(encode_messages([{"role": "user", "content": prompt}], thinking_mode="chat")) for prompt in prompts]
138
+ completion_tokens = generate(model, prompt_tokens, max_new_tokens, tokenizer.eos_token_id, temperature)
139
+ completions = tokenizer.batch_decode(completion_tokens)
140
+ for prompt, completion in zip(prompts, completions):
141
+ print("Prompt:", prompt)
142
+ print("Completion:", completion)
143
+ print()
144
+
145
+ if world_size > 1:
146
+ dist.destroy_process_group()
147
+
148
+
149
+ if __name__ == "__main__":
150
+ parser = ArgumentParser()
151
+ parser.add_argument("--ckpt-path", type=str, required=True)
152
+ parser.add_argument("--config", type=str, required=True)
153
+ parser.add_argument("--input-file", type=str, default="")
154
+ parser.add_argument("--interactive", action="store_true")
155
+ parser.add_argument("--max-new-tokens", type=int, default=300)
156
+ parser.add_argument("--temperature", type=float, default=0.6)
157
+ args = parser.parse_args()
158
+ assert args.input_file or args.interactive, "Either input-file or interactive mode must be specified"
159
+ main(args.ckpt_path, args.config, args.input_file, args.interactive, args.max_new_tokens, args.temperature)
inference/kernel.py ADDED
@@ -0,0 +1,536 @@
1
+ import torch
2
+ import tilelang
3
+ import tilelang.language as T
4
+ from typing import Tuple, Optional
5
+
6
+
7
+ tilelang.set_log_level("WARNING")
8
+
9
+ pass_configs = {
10
+ tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True,
11
+ tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,
12
+ }
13
+
14
+ FP8 = "float8_e4m3"
15
+ FP4 = "float4_e2m1fn"
16
+ FE8M0 = "float8_e8m0fnu"
17
+ BF16 = "bfloat16"
18
+ FP32 = "float32"
19
+ INT32 = "int32"
20
+
21
+
22
+ def fast_log2_ceil(x):
23
+ """Compute ceil(log2(x)) via IEEE 754 bit manipulation. Avoids slow log/ceil intrinsics."""
24
+ bits_x = T.reinterpret("uint32", x)
25
+ exp_x = (bits_x >> 23) & 0xFF
26
+ man_bits = bits_x & ((1 << 23) - 1)
27
+ return T.Cast("int32", exp_x - 127 + T.if_then_else(man_bits != 0, 1, 0))
28
+
29
+
30
+ def fast_pow2(x):
31
+ """Compute 2^x for integer x via IEEE 754 bit manipulation."""
32
+ bits_x = (x + 127) << 23
33
+ return T.reinterpret("float32", bits_x)
34
+
35
+
36
+ def fast_round_scale(amax, fp8_max_inv):
37
+ return fast_pow2(fast_log2_ceil(amax * fp8_max_inv))
38
+
39
+
40
+ @tilelang.jit(pass_configs=pass_configs)
41
+ def act_quant_kernel(
42
+ N, block_size=128, in_dtype=BF16, out_dtype=FP8, scale_dtype=FP32,
43
+ round_scale=False, inplace=False
44
+ ):
45
+ """Block-wise FP8 quantization. inplace=True does fused quant+dequant back to BF16."""
46
+ M = T.symbolic("M")
47
+ fp8_min = -448.0
48
+ fp8_max = 448.0
49
+ fp8_max_inv = 1 / fp8_max
50
+ num_stages = 0 if round_scale or inplace else 2
51
+ blk_m = 32
52
+ group_size = block_size
53
+ # Internal computation in FP32; scale_dtype controls output storage format.
54
+ compute_dtype = FP32
55
+ out_dtype = in_dtype if inplace else out_dtype
56
+
57
+ @T.prim_func
58
+ def act_quant_kernel_(
59
+ X: T.Tensor[(M, N), in_dtype],
60
+ Y: T.Tensor[(M, N), out_dtype],
61
+ S: T.Tensor[(M, T.ceildiv(N, group_size)), scale_dtype],
62
+ ):
63
+ with T.Kernel(T.ceildiv(M, blk_m), T.ceildiv(N, group_size), threads=128) as (
64
+ pid_m,
65
+ pid_n,
66
+ ):
67
+ x_shared = T.alloc_shared((blk_m, group_size), in_dtype)
68
+ x_local = T.alloc_fragment((blk_m, group_size), in_dtype)
69
+ amax_local = T.alloc_fragment((blk_m,), compute_dtype)
70
+ s_local = T.alloc_fragment((blk_m,), compute_dtype)
71
+ y_local = T.alloc_fragment((blk_m, group_size), out_dtype)
72
+ y_shared = T.alloc_shared((blk_m, group_size), out_dtype)
73
+
74
+ for _ in T.Pipelined(1, num_stages=num_stages):
75
+ T.copy(X[pid_m * blk_m, pid_n * group_size], x_shared)
76
+ T.copy(x_shared, x_local)
77
+ T.reduce_absmax(x_local, amax_local, dim=1)
78
+ for i in T.Parallel(blk_m):
79
+ amax_local[i] = T.max(amax_local[i], 1e-4)
80
+ if round_scale:
81
+ s_local[i] = fast_round_scale(amax_local[i], fp8_max_inv)
82
+ else:
83
+ s_local[i] = amax_local[i] * fp8_max_inv
84
+ if inplace:
85
+ for i, j in T.Parallel(blk_m, group_size):
86
+ y_local[i, j] = T.Cast(
87
+ out_dtype,
88
+ T.Cast(compute_dtype, T.Cast(out_dtype, T.clamp(
89
+ x_local[i, j] / s_local[i], fp8_min, fp8_max
90
+ ))) * s_local[i],
91
+ )
92
+ else:
93
+ for i, j in T.Parallel(blk_m, group_size):
94
+ y_local[i, j] = T.clamp(
95
+ x_local[i, j] / s_local[i], fp8_min, fp8_max
96
+ )
97
+ for i in T.Parallel(blk_m):
98
+ S[pid_m * blk_m + i, pid_n] = T.Cast(scale_dtype, s_local[i])
99
+ T.copy(y_local, y_shared)
100
+ T.copy(y_shared, Y[pid_m * blk_m, pid_n * group_size])
101
+
102
+ return act_quant_kernel_
103
+
104
+
105
+ def act_quant(
106
+ x: torch.Tensor, block_size: int = 128, scale_fmt: Optional[str] = None,
107
+ scale_dtype: torch.dtype = torch.float32, inplace: bool = False,
108
+ ) -> torch.Tensor:
109
+ """Block-wise FP8 quantization. inplace=True does fused quant+dequant back to BF16.
110
+ When scale_fmt is set, scales are rounded to power-of-2 (MXFP)."""
111
+ N = x.size(-1)
112
+ assert N % block_size == 0
113
+ tl_dtype = FE8M0 if scale_dtype == torch.float8_e8m0fnu else FP32
114
+ z = x.contiguous()
115
+ y = torch.empty_like(z) if inplace else torch.empty_like(z, dtype=torch.float8_e4m3fn)
116
+ s = z.new_empty(*z.size()[:-1], N // block_size, dtype=scale_dtype)
117
+ kernel = act_quant_kernel(
118
+ N, block_size, scale_dtype=tl_dtype,
119
+ round_scale=scale_fmt is not None, inplace=inplace,
120
+ )
121
+ kernel(z.view(-1, N), y.view(-1, N), s.view(-1, N // block_size))
122
+ if inplace:
123
+ x.copy_(y)
124
+ return x
125
+ return y, s
126
+
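+ # Usage sketch (assumes a CUDA device and a bf16 input whose last dim is a
+ # multiple of block_size):
+ #   y, s = act_quant(torch.randn(4, 256, dtype=torch.bfloat16, device="cuda"))
+ #   x_hat = y.float().view(4, 2, 128) * s.float().view(4, 2, 1)  # dequantise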
127
+
128
+ @tilelang.jit(pass_configs=pass_configs)
129
+ def fp4_quant_kernel(
130
+ N, block_size=32, in_dtype=BF16, scale_dtype=FE8M0, inplace=False
131
+ ):
132
+ """Block-wise FP4 quantization. Power-of-2 scale via bit ops. inplace=True does fused quant+dequant."""
133
+ M = T.symbolic("M")
134
+ fp4_max = 6.0
135
+ fp4_max_inv = 1.0 / fp4_max
136
+ blk_m = 32
137
+ group_size = block_size
138
+ compute_dtype = FP32
139
+ out_dtype = in_dtype if inplace else FP4
140
+
141
+ @T.prim_func
142
+ def fp4_quant_kernel_(
143
+ X: T.Tensor[(M, N), in_dtype],
144
+ Y: T.Tensor[(M, N), out_dtype],
145
+ S: T.Tensor[(M, T.ceildiv(N, group_size)), scale_dtype],
146
+ ):
147
+ with T.Kernel(T.ceildiv(M, blk_m), T.ceildiv(N, group_size), threads=128) as (
148
+ pid_m,
149
+ pid_n,
150
+ ):
151
+ x_shared = T.alloc_shared((blk_m, group_size), in_dtype)
152
+ x_local = T.alloc_fragment((blk_m, group_size), in_dtype)
153
+ amax_local = T.alloc_fragment((blk_m,), compute_dtype)
154
+ s_local = T.alloc_fragment((blk_m,), compute_dtype)
155
+ y_local = T.alloc_fragment((blk_m, group_size), out_dtype)
156
+ y_shared = T.alloc_shared((blk_m, group_size), out_dtype)
157
+
158
+ for _ in T.Pipelined(1, num_stages=2):
159
+ T.copy(X[pid_m * blk_m, pid_n * group_size], x_shared)
160
+ T.copy(x_shared, x_local)
161
+ T.reduce_absmax(x_local, amax_local, dim=1)
162
+ for i in T.Parallel(blk_m):
163
+ amax_local[i] = T.max(amax_local[i], 6 * (2**-126))
164
+ s_local[i] = fast_round_scale(amax_local[i], fp4_max_inv)
165
+ if inplace:
166
+ for i, j in T.Parallel(blk_m, group_size):
167
+ y_local[i, j] = T.Cast(
168
+ out_dtype,
169
+ T.Cast(compute_dtype, T.Cast(FP4, T.clamp(
170
+ x_local[i, j] / s_local[i], -fp4_max, fp4_max
171
+ ))) * s_local[i],
172
+ )
173
+ else:
174
+ for i, j in T.Parallel(blk_m, group_size):
175
+ y_local[i, j] = T.clamp(
176
+ x_local[i, j] / s_local[i], -fp4_max, fp4_max
177
+ )
178
+ for i in T.Parallel(blk_m):
179
+ S[pid_m * blk_m + i, pid_n] = T.Cast(scale_dtype, s_local[i])
180
+ T.copy(y_local, y_shared)
181
+ T.copy(y_shared, Y[pid_m * blk_m, pid_n * group_size])
182
+
183
+ return fp4_quant_kernel_
184
+
185
+
186
+ def fp4_act_quant(
187
+ x: torch.Tensor, block_size: int = 32, inplace: bool = False,
188
+ ) -> torch.Tensor | Tuple[torch.Tensor, torch.Tensor]:
189
+ """Block-wise FP4 quantization. inplace=True does fused quant+dequant back to BF16."""
190
+ N = x.size(-1)
191
+ assert N % block_size == 0
192
+ z = x.contiguous()
193
+ y = torch.empty_like(z) if inplace else z.new_empty(*z.shape[:-1], N // 2, dtype=torch.float4_e2m1fn_x2)
194
+ s = z.new_empty(*z.size()[:-1], N // block_size, dtype=torch.float8_e8m0fnu)
195
+ kernel = fp4_quant_kernel(N, block_size, inplace=inplace)
196
+ kernel(z.view(-1, N), y.view(-1, y.size(-1)), s.view(-1, N // block_size))
197
+ if inplace:
198
+ x.copy_(y)
199
+ return x
200
+ return y, s
201
+
202
+
203
+ @tilelang.jit(pass_configs=pass_configs)
204
+ def fp8_gemm_kernel(N, K, out_dtype=BF16, accum_dtype=FP32, scale_dtype=FP32):
205
+ assert out_dtype in [BF16, FP32]
206
+
207
+ M = T.symbolic("M")
208
+ group_size = 128
209
+ block_M = 32
210
+ block_N = 128
211
+ block_K = 128
212
+
213
+ @T.prim_func
214
+ def fp8_gemm_kernel_(
215
+ A: T.Tensor[(M, K), FP8],
216
+ B: T.Tensor[(N, K), FP8],
217
+ C: T.Tensor[(M, N), out_dtype],
218
+ scales_a: T.Tensor[(M, T.ceildiv(K, group_size)), scale_dtype],
219
+ scales_b: T.Tensor[(T.ceildiv(N, group_size), T.ceildiv(K, group_size)), scale_dtype],
220
+ ):
221
+ with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (
222
+ bx,
223
+ by,
224
+ ):
225
+ A_shared = T.alloc_shared((block_M, block_K), FP8)
226
+ B_shared = T.alloc_shared((block_N, block_K), FP8)
227
+ C_shared = T.alloc_shared((block_M, block_N), out_dtype)
228
+ Scale_C_shared = T.alloc_shared((block_M), FP32)
229
+ C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
230
+ C_local_accum = T.alloc_fragment((block_M, block_N), accum_dtype)
231
+
232
+ # Improve L2 Cache
233
+ T.use_swizzle(panel_size=10)
234
+ T.clear(C_local)
235
+ T.clear(C_local_accum)
236
+
237
+ K_iters = T.ceildiv(K, block_K)
238
+ for k in T.Pipelined(K_iters, num_stages=4):
239
+ T.copy(A[by * block_M, k * block_K], A_shared)
240
+ T.copy(B[bx * block_N, k * block_K], B_shared)
241
+ # Cast scales to FP32 for computation; scales_b has one value per block_N group
242
+ Scale_B = T.Cast(FP32, scales_b[bx * block_N // group_size, k])
243
+ for i in T.Parallel(block_M):
244
+ Scale_C_shared[i] = T.Cast(FP32, scales_a[by * block_M + i, k]) * Scale_B
245
+
246
+ T.gemm(A_shared, B_shared, C_local, transpose_B=True)
247
+ # Separate accumulator for scale-corrected results (2x accumulation precision)
248
+ for i, j in T.Parallel(block_M, block_N):
249
+ C_local_accum[i, j] += C_local[i, j] * Scale_C_shared[i]
250
+ T.clear(C_local)
251
+ T.copy(C_local_accum, C_shared)
252
+ T.copy(C_shared, C[by * block_M, bx * block_N])
253
+
254
+ return fp8_gemm_kernel_
255
+
256
+
257
+ def fp8_gemm(
258
+ a: torch.Tensor, a_s: torch.Tensor, b: torch.Tensor, b_s: torch.Tensor,
259
+ scale_dtype: torch.dtype = torch.float32,
260
+ ) -> torch.Tensor:
261
+ """C[M,N] = A[M,K] @ B[N,K]^T with per-128 block FP8 scaling on both A and B."""
262
+ assert a.is_contiguous() and b.is_contiguous(), "Input tensors must be contiguous"
263
+ assert a_s.is_contiguous() and b_s.is_contiguous(), (
264
+ "Scaling factor tensors must be contiguous"
265
+ )
266
+ tl_dtype = FE8M0 if scale_dtype == torch.float8_e8m0fnu else FP32
267
+ K = a.size(-1)
268
+ M = a.numel() // K
269
+ N = b.size(0)
270
+ c = a.new_empty(*a.size()[:-1], N, dtype=torch.get_default_dtype())
271
+ kernel = fp8_gemm_kernel(N, K, scale_dtype=tl_dtype)
272
+ kernel(a.view(M, K), b, c.view(M, N), a_s.view(M, -1), b_s)
273
+ return c
274
+
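+ # Usage sketch (weights quantised offline into 128x128 blocks):
+ #   a, a_s = act_quant(x)          # per-row 1x128 activation scales along K
+ #   c = fp8_gemm(a, a_s, w, w_s)   # w: [N, K] fp8, w_s: [N/128, K/128]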
275
+
276
+ @tilelang.jit(pass_configs=pass_configs)
277
+ def sparse_attn_kernel(h: int, d: int, scale=None):
278
+ """Sparse multi-head attention via index gathering + online softmax (FlashAttention-style).
279
+ For each (batch, seq_pos), gathers top-k KV positions by index, computes attention
280
+ with numerically stable running max/sum, and includes a learnable attn_sink bias."""
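+ # The sink acts as one extra logit that only enlarges the softmax denominator:
+ #   out = (sum_j exp(s_j - m) * kv_j) / (exp(sink - m) + sum_j exp(s_j - m))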
281
+ b = T.symbolic("b")
282
+ m = T.symbolic("m")
283
+ n = T.symbolic("n")
284
+ topk = T.symbolic("topk")
285
+ if scale is None:
286
+ scale = (1.0 / d) ** 0.5
287
+
288
+ num_stages = 2
289
+ threads = 256
290
+ block = 64
291
+ num_blocks = tilelang.cdiv(topk, block)
292
+
293
+ @T.prim_func
294
+ def sparse_attn_kernel_(
295
+ q: T.Tensor[(b, m, h, d), BF16],
296
+ kv: T.Tensor[(b, n, d), BF16],
297
+ o: T.Tensor[(b, m, h, d), BF16],
298
+ attn_sink: T.Tensor[(h,), FP32],
299
+ topk_idxs: T.Tensor[(b, m, topk), INT32],
300
+ ):
301
+ with T.Kernel(m, b, threads=threads) as (bx, by):
302
+ q_shared = T.alloc_shared((h, d), BF16)
303
+ kv_shared = T.alloc_shared((block, d), BF16)
304
+ o_shared = T.alloc_shared((h, d), BF16)
305
+ acc_s_cast = T.alloc_shared((h, block), BF16)
306
+
307
+ idxs = T.alloc_fragment(block, INT32)
308
+ acc_s = T.alloc_fragment((h, block), FP32)
309
+ acc_o = T.alloc_fragment((h, d), FP32)
310
+ scores_max = T.alloc_fragment(h, FP32)
311
+ scores_max_prev = T.alloc_fragment(h, FP32)
312
+ scores_scale = T.alloc_fragment(h, FP32)
313
+ scores_sum = T.alloc_fragment(h, FP32)
314
+ sum_exp = T.alloc_fragment(h, FP32)
315
+
316
+ T.clear(acc_o)
317
+ T.clear(sum_exp)
318
+ T.fill(scores_max, -T.infinity(FP32))
319
+ T.copy(q[by, bx, :, :], q_shared)
320
+
321
+ for t in T.Pipelined(num_blocks, num_stages=num_stages):
322
+ for i in T.Parallel(block):
323
+ idxs[i] = T.if_then_else(t * block + i < topk, topk_idxs[by, bx, t * block + i], -1)
324
+ for i, j in T.Parallel(block, d):
325
+ kv_shared[i, j] = T.if_then_else(idxs[i] != -1, kv[by, idxs[i], j], 0)
326
+ for i, j in T.Parallel(h, block):
327
+ acc_s[i, j] = T.if_then_else(idxs[j] != -1, 0, -T.infinity(FP32))
328
+ T.gemm(q_shared, kv_shared, acc_s, transpose_B=True, policy=T.GemmWarpPolicy.FullRow)
329
+ for i, j in T.Parallel(h, block):
330
+ acc_s[i, j] *= scale
331
+ T.copy(scores_max, scores_max_prev)
332
+ T.reduce_max(acc_s, scores_max, dim=1, clear=False)
333
+ for i in T.Parallel(h):
334
+ scores_scale[i] = T.exp(scores_max_prev[i] - scores_max[i])
335
+ for i, j in T.Parallel(h, block):
336
+ acc_s[i, j] = T.exp(acc_s[i, j] - scores_max[i])
337
+ T.reduce_sum(acc_s, scores_sum, dim=1)
338
+ for i in T.Parallel(h):
339
+ sum_exp[i] = sum_exp[i] * scores_scale[i] + scores_sum[i]
340
+ T.copy(acc_s, acc_s_cast)
341
+ for i, j in T.Parallel(h, d):
342
+ acc_o[i, j] *= scores_scale[i]
343
+ T.gemm(acc_s_cast, kv_shared, acc_o, policy=T.GemmWarpPolicy.FullRow)
344
+
345
+ for i in T.Parallel(h):
346
+ sum_exp[i] += T.exp(attn_sink[i] - scores_max[i])
347
+ for i, j in T.Parallel(h, d):
348
+ acc_o[i, j] /= sum_exp[i]
349
+ T.copy(acc_o, o_shared)
350
+ T.copy(o_shared, o[by, bx, :, :])
351
+
352
+ return sparse_attn_kernel_
353
+
354
+
355
+ def sparse_attn(
356
+ q: torch.Tensor, kv: torch.Tensor, attn_sink: torch.Tensor, topk_idxs: torch.Tensor, softmax_scale: float
357
+ ) -> torch.Tensor:
358
+ b, s, h, d = q.size()
359
+ # Pad heads to 16 for kernel efficiency (stripped after)
360
+ if h < 16:
361
+ q = torch.cat([q, q.new_zeros(b, s, 16 - h, d)], dim=2)
362
+ attn_sink = torch.cat([attn_sink, attn_sink.new_zeros(16 - h)])
363
+ o = torch.empty_like(q)
364
+ kernel = sparse_attn_kernel(q.size(2), d, softmax_scale)
365
+ kernel(q, kv, o, attn_sink, topk_idxs)
366
+ if h < 16:
367
+ o = o.narrow(2, 0, h).contiguous()
368
+ return o
369
+
370
+
371
+ @tilelang.jit(pass_configs=pass_configs)
372
+ def hc_split_sinkhorn_kernel(hc: int, sinkhorn_iters: int, eps: float):
373
+ n = T.symbolic("n")
374
+ mix_hc = (2 + hc) * hc
375
+ threads = 64
376
+
377
+ @T.prim_func
378
+ def hc_split_sinkhorn_kernel_(
379
+ mixes: T.Tensor[(n, mix_hc), FP32],
380
+ hc_scale: T.Tensor[(3,), FP32],
381
+ hc_base: T.Tensor[(mix_hc,), FP32],
382
+ pre: T.Tensor[(n, hc), FP32],
383
+ post: T.Tensor[(n, hc), FP32],
384
+ comb: T.Tensor[(n, hc, hc), FP32],
385
+ ):
386
+ with T.Kernel(n, threads=threads) as i:
387
+ mixes_shared = T.alloc_shared(mix_hc, FP32)
388
+ comb_frag = T.alloc_fragment((hc, hc), FP32)
389
+ T.copy(mixes[i, :], mixes_shared)
390
+
391
+ for j in T.Parallel(hc):
392
+ pre[i, j] = T.sigmoid(mixes_shared[j] * hc_scale[0] + hc_base[j]) + eps
393
+ for j in T.Parallel(hc):
394
+ post[i, j] = 2 * T.sigmoid(mixes_shared[j + hc] * hc_scale[1] + hc_base[j + hc])
395
+ for j, k in T.Parallel(hc, hc):
396
+ comb_frag[j, k] = mixes_shared[j * hc + k + hc * 2] * hc_scale[2] + hc_base[j * hc + k + hc * 2]
397
+
398
+ row_sum = T.alloc_fragment(hc, FP32)
399
+ col_sum = T.alloc_fragment(hc, FP32)
400
+
401
+ # comb = comb.softmax(-1) + eps
402
+ row_max = T.alloc_fragment(hc, FP32)
403
+ T.reduce_max(comb_frag, row_max, dim=1)
404
+ for j, k in T.Parallel(hc, hc):
405
+ comb_frag[j, k] = T.exp(comb_frag[j, k] - row_max[j])
406
+ T.reduce_sum(comb_frag, row_sum, dim=1)
407
+ for j, k in T.Parallel(hc, hc):
408
+ comb_frag[j, k] = comb_frag[j, k] / row_sum[j] + eps
409
+
410
+ # comb = comb / (comb.sum(-2) + eps)
411
+ T.reduce_sum(comb_frag, col_sum, dim=0)
412
+ for j, k in T.Parallel(hc, hc):
413
+ comb_frag[j, k] = comb_frag[j, k] / (col_sum[k] + eps)
414
+
415
+ for _ in T.serial(sinkhorn_iters - 1):
416
+ # comb = comb / (comb.sum(-1) + eps)
417
+ T.reduce_sum(comb_frag, row_sum, dim=1)
418
+ for j, k in T.Parallel(hc, hc):
419
+ comb_frag[j, k] = comb_frag[j, k] / (row_sum[j] + eps)
420
+ # comb = comb / (comb.sum(-2) + eps)
421
+ T.reduce_sum(comb_frag, col_sum, dim=0)
422
+ for j, k in T.Parallel(hc, hc):
423
+ comb_frag[j, k] = comb_frag[j, k] / (col_sum[k] + eps)
424
+
425
+ T.copy(comb_frag, comb[i, :, :])
426
+
427
+ return hc_split_sinkhorn_kernel_
428
+
429
+
430
+ def hc_split_sinkhorn(mixes: torch.Tensor, hc_scale: torch.Tensor, hc_base: torch.Tensor, hc_mult: int = 4, sinkhorn_iters: int = 20, eps: float = 1e-6):
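+ """Split the hc mixes into (pre, post, comb); `comb` is driven towards a
+ doubly-stochastic matrix by `sinkhorn_iters` rounds of alternating
+ row/column normalisation (Sinkhorn-Knopp) inside the kernel above."""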
431
+ b, s, _ = mixes.size()
432
+ pre = mixes.new_empty(b, s, hc_mult)
433
+ post = mixes.new_empty(b, s, hc_mult)
434
+ comb = mixes.new_empty(b, s, hc_mult, hc_mult)
435
+ kernel = hc_split_sinkhorn_kernel(hc_mult, sinkhorn_iters, eps)
436
+ kernel(mixes.view(-1, (2 + hc_mult) * hc_mult), hc_scale, hc_base,
437
+ pre.view(-1, hc_mult), post.view(-1, hc_mult), comb.view(-1, hc_mult, hc_mult))
438
+ return pre, post, comb
439
+
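+ # Reference semantics in plain PyTorch (a sketch, not the kernel itself):
+ #   comb = comb_logits.softmax(-1) + eps
+ #   comb = comb / (comb.sum(-2, keepdim=True) + eps)
+ #   for _ in range(sinkhorn_iters - 1):
+ #       comb = comb / (comb.sum(-1, keepdim=True) + eps)
+ #       comb = comb / (comb.sum(-2, keepdim=True) + eps)
+ # Alternating row/column normalization pushes comb toward a doubly stochastic
+ # matrix, matching the in-kernel comments above.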
440
+
441
+ @tilelang.jit(pass_configs=pass_configs)
442
+ def fp4_gemm_kernel(N, K, out_dtype=BF16, accum_dtype=FP32, scale_dtype=FP32):
443
+ """FP8 act x FP4 weight GEMM kernel.
444
+
445
+ C[M, N] = A_fp8[M, K] @ B_fp4[N, K]^T
446
+
447
+ Act: 1x128 quant on K (reduce dim), FP8 with configurable scale dtype
448
+ Weight: 1x32 quant on K (reduce dim), FP4 with E8M0 scale
449
+
450
+ B is stored as [N, K//2] in float4_e2m1fn_x2, logical [N, K] in fp4.
451
+ The FP4 values are packed along the K (last) dimension.
452
+
453
+ Strategy: load FP4 sub-blocks of size [block_N, sub_K] (sub_K=32),
454
+ cast FP4 to FP8 via float, then do FP8xFP8 GEMM.
455
+ Apply act scale (per 128 on K) and weight scale (per 32 on K) to the accumulator.
456
+ """
457
+ M = T.symbolic("M")
458
+ act_group_size = 128
459
+ weight_group_size = 32
460
+ block_M = 32
461
+ block_N = 128
462
+ block_K = 32 # matches weight_group_size for simple scale handling
463
+ n_sub = act_group_size // block_K # 4 sub-blocks per act scale group
464
+
465
+ @T.prim_func
466
+ def fp4_gemm_kernel_(
467
+ A: T.Tensor[(M, K), FP8],
468
+ B: T.Tensor[(N, K), FP4],
469
+ C: T.Tensor[(M, N), out_dtype],
470
+ scales_a: T.Tensor[(M, T.ceildiv(K, act_group_size)), scale_dtype],
471
+ scales_b: T.Tensor[(N, T.ceildiv(K, weight_group_size)), scale_dtype],
472
+ ):
473
+ with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (
474
+ bx,
475
+ by,
476
+ ):
477
+ A_shared = T.alloc_shared((block_M, block_K), FP8)
478
+ B_fp4_shared = T.alloc_shared((block_N, block_K), FP4)
479
+ B_shared = T.alloc_shared((block_N, block_K), FP8)
480
+ C_shared = T.alloc_shared((block_M, block_N), out_dtype)
481
+ C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
482
+ C_local_accum = T.alloc_fragment((block_M, block_N), accum_dtype)
483
+ scale_a_frag = T.alloc_fragment((block_M,), FP32)
484
+ scale_b_frag = T.alloc_fragment((block_N,), FP32)
485
+
486
+ T.use_swizzle(panel_size=10)
487
+ T.clear(C_local)
488
+ T.clear(C_local_accum)
489
+
490
+ K_iters = T.ceildiv(K, block_K)
491
+ for k in T.Pipelined(K_iters, num_stages=2):
492
+ T.copy(A[by * block_M, k * block_K], A_shared)
493
+ T.copy(B[bx * block_N, k * block_K], B_fp4_shared)
494
+ # FP4->FP8 cast must go through FP32 to avoid ambiguous C++ overload
495
+ for i, j in T.Parallel(block_N, block_K):
496
+ B_shared[i, j] = T.Cast(FP8, T.Cast(FP32, B_fp4_shared[i, j]))
497
+
498
+ # Weight scale: per 32 on K, indexed by k (each k is one block_K=32)
499
+ for i in T.Parallel(block_N):
500
+ scale_b_frag[i] = T.Cast(FP32, scales_b[bx * block_N + i, k])
501
+
502
+ # Act scale: per 128 on K, indexed by k // 4
503
+ for i in T.Parallel(block_M):
504
+ scale_a_frag[i] = T.Cast(FP32, scales_a[by * block_M + i, k // n_sub])
505
+
506
+ T.gemm(A_shared, B_shared, C_local, transpose_B=True)
507
+
508
+ for i, j in T.Parallel(block_M, block_N):
509
+ C_local_accum[i, j] += C_local[i, j] * scale_a_frag[i] * scale_b_frag[j]
510
+ T.clear(C_local)
511
+
512
+ T.copy(C_local_accum, C_shared)
513
+ T.copy(C_shared, C[by * block_M, bx * block_N])
514
+
515
+ return fp4_gemm_kernel_
516
+
517
+
518
+ def fp4_gemm(
519
+ a: torch.Tensor, a_s: torch.Tensor, b: torch.Tensor, b_s: torch.Tensor,
520
+ scale_dtype: torch.dtype = torch.float32,
521
+ ) -> torch.Tensor:
522
+ """C[M,N] = A_fp8[M,K] @ B_fp4[N,K]^T.
523
+ A has per-128 act scale; B has per-32 E8M0 weight scale.
524
+ B is stored as [N, K//2] in float4_e2m1fn_x2 (2 FP4 values per byte, packed along K)."""
525
+ assert a.is_contiguous() and b.is_contiguous(), "Input tensors must be contiguous"
526
+ assert a_s.is_contiguous() and b_s.is_contiguous(), (
527
+ "Scaling factor tensors must be contiguous"
528
+ )
529
+ tl_dtype = FE8M0 if scale_dtype == torch.float8_e8m0fnu else FP32
530
+ K = a.size(-1)
531
+ M = a.numel() // K
532
+ N = b.size(0)
533
+ c = a.new_empty(*a.size()[:-1], N, dtype=torch.get_default_dtype())
534
+ kernel = fp4_gemm_kernel(N, K, scale_dtype=tl_dtype)
535
+ kernel(a.view(M, K), b, c.view(M, N), a_s.view(M, -1), b_s)
536
+ return c
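+
+ # Usage sketch (hypothetical sizes; assumes a torch build with FP8/FP4 dtypes):
+ #   a   [M, K]       torch.float8_e4m3fn activations
+ #   a_s [M, K//128]  per-128 activation scales
+ #   b   [N, K//2]    torch.float4_e2m1fn_x2 packed weights (logical [N, K] FP4)
+ #   b_s [N, K//32]   per-32 E8M0 weight scales
+ #   c = fp4_gemm(a, a_s, b, b_s)  # -> [M, N] in torch.get_default_dtype()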
inference/model.py ADDED
@@ -0,0 +1,992 @@
1
+ import math
2
+ from dataclasses import dataclass
3
+ from typing import Tuple, Optional, Literal
4
+ from functools import lru_cache
5
+ from contextlib import contextmanager
6
+
7
+ import torch
8
+ from torch import nn
9
+ import torch.nn.functional as F
10
+ import torch.distributed as dist
11
+
12
+ from kernel import act_quant, fp4_act_quant, fp8_gemm, fp4_gemm, sparse_attn, hc_split_sinkhorn
13
+
14
+
15
+ world_size = 1
16
+ rank = 0
17
+ block_size = 128
18
+ fp4_block_size = 32
19
+ w4a16_group_size = 128
20
+ default_dtype = torch.bfloat16
21
+ scale_fmt = None
22
+ scale_dtype = torch.float32
23
+ w4a16_mode = False # set in Transformer.__init__ when args.dtype == "w4a16"
24
+
25
+
26
+ def dequantize_w4a16(qweight: torch.Tensor, qzeros: torch.Tensor, scales: torch.Tensor,
27
+ group_size: int = 128) -> torch.Tensor:
28
+ """Auto-round / auto_gptq W4A16 packing -> BF16 weight [out, in].
29
+
30
+ qweight: int32 [in/8, out], LSB-first 4-bit packed along input dim
31
+ qzeros : int32 [in/g, out/8], LSB-first 4-bit packed along output dim
32
+ scales : bf16 [in/g, out]
33
+ """
34
+ in_packed, out_features = qweight.shape
35
+ in_features = in_packed * 8
36
+ n_groups = scales.shape[0]
37
+ device = qweight.device
38
+ shifts = torch.arange(0, 32, 4, device=device, dtype=torch.int32)
39
+ w = (qweight.unsqueeze(1) >> shifts.view(1, 8, 1)) & 0xF # [in/8, 8, out]
40
+ w = w.reshape(in_features, out_features).to(torch.float32)
41
+ z = (qzeros.unsqueeze(2) >> shifts.view(1, 1, 8)) & 0xF # [in/g, out/8, 8]
42
+ z = z.reshape(n_groups, out_features).to(torch.float32) + 1.0 # GPTQ stores zero - 1
43
+ s = scales.to(torch.float32)
44
+ w = w.view(n_groups, group_size, out_features)
45
+ deq = (w - z.unsqueeze(1)) * s.unsqueeze(1)
46
+ deq = deq.view(in_features, out_features)
47
+ return deq.t().contiguous().to(torch.bfloat16)
48
+
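+ # Worked shape example (hypothetical sizes): in_features=256, out_features=64,
+ # group_size=128 -> qweight [32, 64] int32, qzeros [2, 8] int32,
+ # scales [2, 64] bf16; returned weight is [64, 256] bf16 ([out, in]).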
49
+
50
+ @contextmanager
51
+ def set_dtype(dtype):
52
+ """Temporarily override torch default dtype, restoring it on exit (even if an exception occurs)."""
53
+ prev = torch.get_default_dtype()
54
+ torch.set_default_dtype(dtype)
55
+ try:
56
+ yield
57
+ finally:
58
+ torch.set_default_dtype(prev)
59
+
60
+ @dataclass
61
+ class ModelArgs:
62
+ """Model hyperparameters. Field names match the config JSON keys."""
63
+ max_batch_size: int = 4
64
+ max_seq_len: int = 4096
65
+ dtype: Literal["bf16", "fp8", "w4a16"] = "fp8"
66
+ scale_fmt: Literal[None, "ue8m0"] = "ue8m0"
67
+ expert_dtype: Literal[None, "fp4"] = None
68
+ scale_dtype: Literal["fp32", "fp8"] = "fp8"
69
+ vocab_size: int = 129280
70
+ dim: int = 4096
71
+ moe_inter_dim: int = 4096
72
+ n_layers: int = 7
73
+ n_hash_layers: int = 0
74
+ n_mtp_layers: int = 1
75
+ n_heads: int = 64
76
+ # moe
77
+ n_routed_experts: int = 8
78
+ n_shared_experts: int = 1
79
+ n_activated_experts: int = 2
80
+ score_func: Literal["softmax", "sigmoid", "sqrtsoftplus"] = "sqrtsoftplus"
81
+ route_scale: float = 1.
82
+ swiglu_limit: float = 0.
83
+ # mqa
84
+ q_lora_rank: int = 1024
85
+ head_dim: int = 512
86
+ rope_head_dim: int = 64
87
+ norm_eps: float = 1e-6
88
+ o_groups: int = 8
89
+ o_lora_rank: int = 1024
90
+ window_size: int = 128
91
+ compress_ratios: Tuple[int, ...] = (0, 0, 4, 128, 4, 128, 4, 0)
92
+ # yarn
93
+ compress_rope_theta: float = 40000.0
94
+ original_seq_len: int = 0
95
+ rope_theta: float = 10000.0
96
+ rope_factor: float = 40
97
+ beta_fast: int = 32
98
+ beta_slow: int = 1
99
+ # index
100
+ index_n_heads: int = 64
101
+ index_head_dim: int = 128
102
+ index_topk: int = 512
103
+ # hc
104
+ hc_mult: int = 4
105
+ hc_sinkhorn_iters: int = 20
106
+ hc_eps: float = 1e-6
107
+
108
+
109
+ class ParallelEmbedding(nn.Module):
110
+ """Embedding sharded along the vocab dimension. Each rank holds vocab_size // world_size rows.
111
+ Out-of-range indices are zero-masked before all_reduce to combine partial embeddings."""
112
+ def __init__(self, vocab_size: int, dim: int):
113
+ super().__init__()
114
+ self.vocab_size = vocab_size
115
+ self.dim = dim
116
+ assert vocab_size % world_size == 0, f"Vocabulary size must be divisible by world size (world_size={world_size})"
117
+ self.part_vocab_size = (vocab_size // world_size)
118
+ self.vocab_start_idx = rank * self.part_vocab_size
119
+ self.vocab_end_idx = self.vocab_start_idx + self.part_vocab_size
120
+ self.weight = nn.Parameter(torch.empty(self.part_vocab_size, self.dim))
121
+
122
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
123
+ if world_size > 1:
124
+ mask = (x < self.vocab_start_idx) | (x >= self.vocab_end_idx)
125
+ x = x - self.vocab_start_idx
126
+ x[mask] = 0
127
+ y = F.embedding(x, self.weight)
128
+ if world_size > 1:
129
+ y[mask] = 0
130
+ dist.all_reduce(y)
131
+ return y
132
+
133
+
134
+ def linear(x: torch.Tensor, weight: torch.Tensor, bias: Optional[torch.Tensor] = None) -> torch.Tensor:
135
+ """Dispatches to fp4_gemm / fp8_gemm / F.linear based on weight dtype.
136
+ For quantized weights, x is first quantized to FP8 via act_quant."""
137
+ assert bias is None
138
+
139
+ if weight.dtype == torch.float4_e2m1fn_x2:
140
+ x, s = act_quant(x, block_size, scale_fmt, scale_dtype)
141
+ return fp4_gemm(x, s, weight, weight.scale, scale_dtype)
142
+ elif weight.dtype == torch.float8_e4m3fn:
143
+ x, s = act_quant(x, block_size, scale_fmt, scale_dtype)
144
+ return fp8_gemm(x, s, weight, weight.scale, scale_dtype)
145
+ else:
146
+ return F.linear(x, weight)
147
+
148
+
149
+ class Linear(nn.Module):
150
+ """Linear layer supporting BF16, FP8, and FP4 weight formats with per-block scaling."""
151
+
152
+ def __init__(self, in_features: int, out_features: int, bias: bool = False, dtype = None):
153
+ super().__init__()
154
+ self.in_features = in_features
155
+ self.out_features = out_features
156
+ # In a W4A16 build every Linear becomes W4A16 regardless of the dtype the
157
+ # original FP8/FP4 model wanted. The non-quant special cases (RMSNorm,
158
+ # embed, attn_sink, etc.) are NOT instances of `Linear`, so they are
159
+ # untouched.
160
+ if w4a16_mode:
161
+ dtype = "w4a16"
162
+ else:
163
+ dtype = dtype or default_dtype
164
+ self.is_w4a16 = (dtype == "w4a16")
165
+ if self.is_w4a16:
166
+ assert in_features % 8 == 0 and in_features % w4a16_group_size == 0
167
+ assert out_features % 8 == 0
168
+ self.group_size = w4a16_group_size
169
+ self.qweight = nn.Parameter(
170
+ torch.empty(in_features // 8, out_features, dtype=torch.int32),
171
+ requires_grad=False,
172
+ )
173
+ self.qzeros = nn.Parameter(
174
+ torch.empty(in_features // self.group_size, out_features // 8, dtype=torch.int32),
175
+ requires_grad=False,
176
+ )
177
+ self.scales = nn.Parameter(
178
+ torch.empty(in_features // self.group_size, out_features, dtype=torch.bfloat16),
179
+ requires_grad=False,
180
+ )
181
+ self.register_parameter("weight", None)
182
+ self.register_parameter("scale", None)
183
+ elif dtype == torch.float4_e2m1fn_x2:
184
+ # FP4: weight is [out, in//2] in float4_e2m1fn_x2, logically [out, in] in fp4
185
+ # Scale is [out, in//32] in float8_e8m0fnu (1 scale per 32 fp4 elements along K)
186
+ self.weight = nn.Parameter(torch.empty(out_features, in_features // 2, dtype=torch.float4_e2m1fn_x2))
187
+ scale_out_features = out_features
188
+ scale_in_features = in_features // fp4_block_size
189
+ self.weight.scale = self.scale = nn.Parameter(torch.empty(scale_out_features, scale_in_features, dtype=torch.float8_e8m0fnu))
190
+ elif dtype == torch.float8_e4m3fn:
191
+ self.weight = nn.Parameter(torch.empty(out_features, in_features, dtype=dtype))
192
+ scale_out_features = (out_features + block_size - 1) // block_size
193
+ scale_in_features = (in_features + block_size - 1) // block_size
194
+ self.weight.scale = self.scale = nn.Parameter(torch.empty(scale_out_features, scale_in_features, dtype=torch.float8_e8m0fnu))
195
+ else:
196
+ self.weight = nn.Parameter(torch.empty(out_features, in_features, dtype=dtype))
197
+ self.register_parameter("scale", None)
198
+ if bias:
199
+ self.bias = nn.Parameter(torch.empty(out_features))
200
+ else:
201
+ self.register_parameter("bias", None)
202
+
203
+ def init_woq(self, QuantLinear):
204
+ """Create a QuantLinear from loaded GPTQ parameters."""
205
+ if not self.is_w4a16:
206
+ return
207
+ # Marlin requires out_features % 64 == 0; fall back to manual dequant
208
+ if self.out_features % 64 != 0:
209
+ self._woq = None
210
+ return
211
+ dev = self.qweight.device
212
+ layer = QuantLinear(
213
+ bits=4, group_size=self.group_size,
214
+ in_features=self.in_features, out_features=self.out_features,
215
+ bias=False, desc_act=False, sym=True, register_buffers=True,
216
+ )
217
+ layer = layer.to(dev)
218
+ layer.qweight.copy_(self.qweight.data)
219
+ layer.qzeros.copy_(self.qzeros.data)
220
+ layer.scales.copy_(self.scales.to(layer.scales.dtype).data)
221
+ layer.g_idx.copy_(torch.arange(self.in_features, dtype=torch.int32, device=dev) // self.group_size)
222
+ layer.post_init()
223
+ self._woq = layer
224
+ # Free original parameters to save memory
225
+ self.qweight = None
226
+ self.qzeros = None
227
+ self.scales = None
228
+
229
+ def get_weight(self) -> torch.Tensor:
230
+ """Return the dequantised BF16 weight [out, in]. For non-W4A16 modes
231
+ returns ``self.weight`` unchanged. Used only for wo_a einsum path."""
232
+ if self.is_w4a16:
233
+ if hasattr(self, '_woq') and self._woq is not None:
234
+ return dequantize_w4a16(self._woq.qweight, self._woq.qzeros, self._woq.scales, self.group_size)
235
+ return dequantize_w4a16(self.qweight, self.qzeros, self.scales, self.group_size)
236
+ return self.weight
237
+
238
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
239
+ if self.is_w4a16:
240
+ if hasattr(self, '_woq') and self._woq is not None:
241
+ y = self._woq(x.to(torch.bfloat16))
242
+ else:
243
+ w = dequantize_w4a16(self.qweight, self.qzeros, self.scales, self.group_size)
244
+ y = F.linear(x.to(w.dtype), w)
245
+ if self.bias is not None:
246
+ y = y + self.bias
247
+ return y.type_as(x)
248
+ return linear(x, self.weight, self.bias)
249
+
250
+
251
+ class ColumnParallelLinear(Linear):
252
+ """Shards output dim across TP ranks. No all-reduce needed on output."""
253
+ def __init__(self, in_features: int, out_features: int, bias: bool = False, dtype = None):
254
+ assert out_features % world_size == 0, f"Output features must be divisible by world size (world_size={world_size})"
255
+ self.part_out_features = out_features // world_size
256
+ super().__init__(in_features, self.part_out_features, bias, dtype)
257
+
258
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
259
+ if self.is_w4a16:
260
+ return Linear.forward(self, x)
261
+ return linear(x, self.weight, self.bias)
262
+
263
+
264
+ class RowParallelLinear(Linear):
265
+ """Shards input dim across TP ranks. All-reduce on output to sum partial results."""
266
+ def __init__(self, in_features: int, out_features: int, bias: bool = False, dtype = None):
267
+ assert in_features % world_size == 0, f"Input features must be divisible by world size (world_size={world_size})"
268
+ self.part_in_features = in_features // world_size
269
+ super().__init__(self.part_in_features, out_features, bias, dtype)
270
+
271
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
272
+ if self.is_w4a16:
273
+ if hasattr(self, '_woq') and self._woq is not None:
274
+ y = self._woq(x.to(torch.bfloat16))
275
+ else:
276
+ w = dequantize_w4a16(self.qweight, self.qzeros, self.scales, self.group_size)
277
+ y = F.linear(x.to(w.dtype), w)
278
+ else:
279
+ y = linear(x, self.weight, None)
280
+ if world_size > 1:
281
+ y = y.float()
282
+ dist.all_reduce(y)
283
+ if self.bias is not None:
284
+ y += self.bias
285
+ return y.type_as(x)
286
+
287
+
288
+ class RMSNorm(nn.Module):
289
+ def __init__(self, dim: int, eps: float = 1e-6):
290
+ super().__init__()
291
+ self.dim = dim
292
+ self.eps = eps
293
+ # rmsnorm in the checkpoint is stored in bf16, while the parameter here is kept in fp32 for convenience.
294
+ self.weight = nn.Parameter(torch.ones(dim, dtype=torch.float32))
295
+
296
+ def forward(self, x: torch.Tensor):
297
+ dtype = x.dtype
298
+ x = x.float()
299
+ var = x.square().mean(-1, keepdim=True)
300
+ x = x * torch.rsqrt(var + self.eps)
301
+ return (self.weight * x).to(dtype)
302
+
303
+
304
+ @lru_cache(2)
305
+ def precompute_freqs_cis(dim, seqlen, original_seq_len, base, factor, beta_fast, beta_slow) -> torch.Tensor:
306
+ """Precomputes complex exponentials for rotary embeddings with YaRN scaling.
307
+ When original_seq_len > 0, applies frequency interpolation with a smooth
308
+ linear ramp between beta_fast and beta_slow correction ranges."""
309
+
310
+ def find_correction_dim(num_rotations, dim, base, max_seq_len):
311
+ return dim * math.log(max_seq_len / (num_rotations * 2 * math.pi)) / (2 * math.log(base))
312
+
313
+ def find_correction_range(low_rot, high_rot, dim, base, max_seq_len):
314
+ low = math.floor(find_correction_dim(low_rot, dim, base, max_seq_len))
315
+ high = math.ceil(find_correction_dim(high_rot, dim, base, max_seq_len))
316
+ return max(low, 0), min(high, dim-1)
317
+
318
+ def linear_ramp_factor(min, max, dim):
319
+ if min == max:
320
+ max += 0.001
321
+ linear_func = (torch.arange(dim, dtype=torch.float32) - min) / (max - min)
322
+ ramp_func = torch.clamp(linear_func, 0, 1)
323
+ return ramp_func
324
+
325
+ freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
326
+ if original_seq_len > 0:
327
+ low, high = find_correction_range(beta_fast, beta_slow, dim, base, original_seq_len)
328
+ smooth = 1 - linear_ramp_factor(low, high, dim // 2)
329
+ freqs = freqs / factor * (1 - smooth) + freqs * smooth
330
+
331
+ t = torch.arange(seqlen)
332
+ freqs = torch.outer(t, freqs)
333
+ freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
334
+ return freqs_cis
335
+
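+ # Sketch: for the i-th frequency pair, freqs[i] = base ** (-2i / dim). With
+ # YaRN enabled (original_seq_len > 0), low-frequency dims are interpolated by
+ # 1/factor while high-frequency dims keep their original rotation rate,
+ # blended by the linear ramp between the beta_fast/beta_slow correction dims.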
336
+
337
+ def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor, inverse: bool = False) -> torch.Tensor:
338
+ """Applies rotary positional embeddings in-place. Uses conjugate for inverse (de-rotation)."""
339
+ y = x
340
+ x = torch.view_as_complex(x.float().unflatten(-1, (-1, 2)))
341
+ if inverse:
342
+ freqs_cis = freqs_cis.conj()
343
+ if x.ndim == 3:
344
+ freqs_cis = freqs_cis.view(1, x.size(1), x.size(-1))
345
+ else:
346
+ freqs_cis = freqs_cis.view(1, x.size(1), 1, x.size(-1))
347
+ x = torch.view_as_real(x * freqs_cis).flatten(-2)
348
+ y.copy_(x)
349
+ return y
350
+
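+ # Round-trip sketch: since |freqs_cis| == 1, rotating then de-rotating is the
+ # identity up to fp error:
+ #   x2 = apply_rotary_emb(apply_rotary_emb(x, f), f, inverse=True)  # x2 ≈ x
+ # Note both calls mutate x in place (via y.copy_) and return it.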
351
+
352
+ def rotate_activation(x: torch.Tensor) -> torch.Tensor:
353
+ """Applies randomized Hadamard rotation to spread information across dims before FP8 quant."""
354
+ assert x.dtype == torch.bfloat16
355
+ from fast_hadamard_transform import hadamard_transform
356
+ return hadamard_transform(x, scale=x.size(-1) ** -0.5)
357
+
358
+
359
+ @lru_cache(1)
360
+ def get_window_topk_idxs(window_size: int, bsz: int, seqlen: int, start_pos: int):
361
+ if start_pos >= window_size - 1:
362
+ start_pos %= window_size
363
+ matrix = torch.cat([torch.arange(start_pos + 1, window_size), torch.arange(0, start_pos + 1)], dim=0)
364
+ elif start_pos > 0:
365
+ matrix = F.pad(torch.arange(start_pos + 1), (0, window_size - start_pos - 1), value=-1)
366
+ else:
367
+ base = torch.arange(seqlen).unsqueeze(1)
368
+ matrix = (base - window_size + 1).clamp(0) + torch.arange(min(seqlen, window_size))
369
+ matrix = torch.where(matrix > base, -1, matrix)
370
+ return matrix.unsqueeze(0).expand(bsz, -1, -1)
371
+
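+ # Prefill example (hypothetical sizes): window_size=4, seqlen=3, start_pos=0 ->
+ #   row 0: [0, -1, -1]   (token 0 attends only to itself)
+ #   row 1: [0,  1, -1]
+ #   row 2: [0,  1,  2]
+ # Each row holds per-query window indices into the KV cache; -1 marks masked slots.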
372
+
373
+ @lru_cache(2)
374
+ def get_compress_topk_idxs(ratio: int, bsz: int, seqlen: int, start_pos: int, offset: int):
375
+ if start_pos > 0:
376
+ matrix = torch.arange(0, (start_pos + 1) // ratio) + offset
377
+ else:
378
+ matrix = torch.arange(seqlen // ratio).repeat(seqlen, 1)
379
+ mask = matrix >= torch.arange(1, seqlen + 1).unsqueeze(1) // ratio
380
+ matrix = torch.where(mask, -1, matrix + offset)
381
+ return matrix.unsqueeze(0).expand(bsz, -1, -1)
382
+
383
+
384
+ class Compressor(nn.Module):
385
+ """Compresses KV cache via learned gated pooling over `compress_ratio` consecutive tokens.
386
+ When overlap=True (ratio==4), uses overlapping windows for smoother compression boundaries."""
387
+
388
+ def __init__(self, args: ModelArgs, compress_ratio: int = 4, head_dim: int = 512, rotate: bool = False):
389
+ super().__init__()
390
+ self.dim = args.dim
391
+ self.head_dim = head_dim
392
+ self.rope_head_dim = args.rope_head_dim
393
+ self.nope_head_dim = head_dim - args.rope_head_dim
394
+ self.compress_ratio = compress_ratio
395
+ self.overlap = compress_ratio == 4
396
+ self.rotate = rotate
397
+ coff = 1 + self.overlap
398
+
399
+ self.ape = nn.Parameter(torch.empty(compress_ratio, coff * self.head_dim, dtype=torch.float32))
400
+ # wkv and wgate in the checkpoint are stored in bf16, while the parameters here are kept in fp32 for convenience.
401
+ # When overlap, the first half of dims is for overlapping compression, second half for normal.
402
+ self.wkv = Linear(self.dim, coff * self.head_dim, dtype=torch.float32)
403
+ self.wgate = Linear(self.dim, coff * self.head_dim, dtype=torch.float32)
404
+ self.norm = RMSNorm(self.head_dim, args.norm_eps)
405
+ self.kv_cache: torch.Tensor = None # assigned lazily from Attention.kv_cache
406
+ # State buffers for decode-phase incremental compression.
407
+ # With overlap: state[:, :ratio] = overlapping window, state[:, ratio:] = current window.
408
+ self.register_buffer("kv_state", torch.zeros(args.max_batch_size, coff * compress_ratio, coff * self.head_dim, dtype=torch.float32), persistent=False)
409
+ self.register_buffer("score_state", torch.full((args.max_batch_size, coff * compress_ratio, coff * self.head_dim), float("-inf"), dtype=torch.float32), persistent=False)
410
+ self.freqs_cis: torch.Tensor = None
411
+
412
+ def overlap_transform(self, tensor: torch.Tensor, value=0):
413
+ # tensor: [b,s,r,2d]
414
+ b, s, _, _ = tensor.size()
415
+ ratio, d = self.compress_ratio, self.head_dim
416
+ new_tensor = tensor.new_full((b, s, 2 * ratio, d), value)
417
+ new_tensor[:, :, ratio:] = tensor[:, :, :, d:]
418
+ new_tensor[:, 1:, :ratio] = tensor[:, :-1, :, :d]
419
+ return new_tensor
420
+
421
+ def forward(self, x: torch.Tensor, start_pos: int):
422
+ assert self.kv_cache is not None
423
+ bsz, seqlen, _ = x.size()
424
+ ratio, overlap, d, rd = self.compress_ratio, self.overlap, self.head_dim, self.rope_head_dim
425
+ dtype = x.dtype
426
+ # compression needs fp32 precision
427
+ x = x.float()
428
+ kv = self.wkv(x)
429
+ score = self.wgate(x)
430
+ if start_pos == 0:
431
+ should_compress = seqlen >= ratio
432
+ remainder = seqlen % ratio
433
+ cutoff = seqlen - remainder
434
+ offset = ratio if overlap else 0
435
+ if overlap and cutoff >= ratio:
436
+ self.kv_state[:bsz, :ratio] = kv[:, cutoff-ratio : cutoff]
437
+ self.score_state[:bsz, :ratio] = score[:, cutoff-ratio : cutoff] + self.ape
438
+ if remainder > 0:
439
+ kv, self.kv_state[:bsz, offset : offset+remainder] = kv.split([cutoff, remainder], dim=1)
440
+ self.score_state[:bsz, offset : offset+remainder] = score[:, cutoff:] + self.ape[:remainder]
441
+ score = score[:, :cutoff]
442
+ kv = kv.unflatten(1, (-1, ratio))
443
+ score = score.unflatten(1, (-1, ratio)) + self.ape
444
+ if overlap:
445
+ kv = self.overlap_transform(kv, 0)
446
+ score = self.overlap_transform(score, float("-inf"))
447
+ kv = (kv * score.softmax(dim=2)).sum(dim=2)
448
+ else:
449
+ should_compress = (start_pos + 1) % self.compress_ratio == 0
450
+ score += self.ape[start_pos % ratio]
451
+ if overlap:
452
+ self.kv_state[:bsz, ratio + start_pos % ratio] = kv.squeeze(1)
453
+ self.score_state[:bsz, ratio + start_pos % ratio] = score.squeeze(1)
454
+ if should_compress:
455
+ kv_state = torch.cat([self.kv_state[:bsz, :ratio, :d], self.kv_state[:bsz, ratio:, d:]], dim=1)
456
+ score_state = torch.cat([self.score_state[:bsz, :ratio, :d], self.score_state[:bsz, ratio:, d:]], dim=1)
457
+ kv = (kv_state * score_state.softmax(dim=1)).sum(dim=1, keepdim=True)
458
+ self.kv_state[:bsz, :ratio] = self.kv_state[:bsz, ratio:]
459
+ self.score_state[:bsz, :ratio] = self.score_state[:bsz, ratio:]
460
+ else:
461
+ self.kv_state[:bsz, start_pos % ratio] = kv.squeeze(1)
462
+ self.score_state[:bsz, start_pos % ratio] = score.squeeze(1)
463
+ if should_compress:
464
+ kv = (self.kv_state[:bsz] * self.score_state[:bsz].softmax(dim=1)).sum(dim=1, keepdim=True)
465
+ if not should_compress:
466
+ return
467
+ kv = self.norm(kv.to(dtype))
468
+ if start_pos == 0:
469
+ freqs_cis = self.freqs_cis[:cutoff:ratio]
470
+ else:
471
+ freqs_cis = self.freqs_cis[start_pos + 1 - self.compress_ratio].unsqueeze(0)
472
+ apply_rotary_emb(kv[..., -rd:], freqs_cis)
473
+ if self.rotate:
474
+ kv = rotate_activation(kv)
475
+ fp4_act_quant(kv, fp4_block_size, True)
476
+ else:
477
+ act_quant(kv[..., :-rd], 64, scale_fmt, scale_dtype, True)
478
+ if start_pos == 0:
479
+ self.kv_cache[:bsz, :seqlen // ratio] = kv
480
+ else:
481
+ self.kv_cache[:bsz, start_pos // ratio] = kv.squeeze(1)
482
+ return kv
483
+
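+ # Compression sketch: every `ratio` tokens are pooled into one KV slot via a
+ # learned softmax gate, (kv * softmax(score + ape)).sum over the window; with
+ # overlap (ratio == 4) each slot also blends the previous window, so decode
+ # keeps a 2*ratio-token state and emits one compressed slot every `ratio` steps.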
484
+
485
+ class Indexer(torch.nn.Module):
486
+ """Selects top-k compressed KV positions for sparse attention via learned scoring.
487
+ Has its own Compressor (with Hadamard rotation) to build compressed KV for scoring."""
488
+
489
+ def __init__(self, args: ModelArgs, compress_ratio: int = 4):
490
+ super().__init__()
491
+ self.dim = args.dim
492
+ self.n_heads = args.index_n_heads
493
+ self.n_local_heads = args.index_n_heads // world_size
494
+ self.head_dim = args.index_head_dim
495
+ self.rope_head_dim = args.rope_head_dim
496
+ self.index_topk = args.index_topk
497
+ self.q_lora_rank = args.q_lora_rank
498
+ self.wq_b = ColumnParallelLinear(self.q_lora_rank, self.n_heads * self.head_dim)
499
+ self.weights_proj = ColumnParallelLinear(self.dim, self.n_heads, dtype=torch.bfloat16)
500
+ self.softmax_scale = self.head_dim ** -0.5
501
+ self.compress_ratio = compress_ratio
502
+
503
+ self.compressor = Compressor(args, compress_ratio, self.head_dim, True)
504
+ self.register_buffer("kv_cache", torch.zeros(args.max_batch_size, args.max_seq_len // compress_ratio, self.head_dim), persistent=False)
505
+ self.freqs_cis = None
506
+
507
+ def forward(self, x: torch.Tensor, qr: torch.Tensor, start_pos: int, offset: int):
508
+ bsz, seqlen, _ = x.size()
509
+ freqs_cis = self.freqs_cis[start_pos:start_pos+seqlen]
510
+ ratio = self.compress_ratio
511
+ rd = self.rope_head_dim
512
+ end_pos = start_pos + seqlen
513
+ if self.compressor.kv_cache is None:
514
+ self.compressor.kv_cache = self.kv_cache
515
+ self.compressor.freqs_cis = self.freqs_cis
516
+ q = self.wq_b(qr)
517
+ q = q.unflatten(-1, (self.n_local_heads, self.head_dim))
518
+ apply_rotary_emb(q[..., -rd:], freqs_cis)
519
+ q = rotate_activation(q)
520
+ # use fp4 simulation for q and kv in indexer
521
+ fp4_act_quant(q, fp4_block_size, True)
522
+ self.compressor(x, start_pos)
523
+ weights = self.weights_proj(x) * (self.softmax_scale * self.n_heads ** -0.5)
524
+ # We performed QAT here; kv could also use the fp8 format, though the current implementation uses bf16
525
+ index_score = torch.einsum("bshd,btd->bsht", q, self.kv_cache[:bsz, :end_pos // ratio])
526
+ index_score = (index_score.relu_() * weights.unsqueeze(-1)).sum(dim=2)
527
+ if world_size > 1:
528
+ dist.all_reduce(index_score)
529
+ if start_pos == 0:
530
+ mask = torch.arange(seqlen // ratio).repeat(seqlen, 1) >= torch.arange(1, seqlen + 1).unsqueeze(1) // ratio
531
+ index_score += torch.where(mask, float("-inf"), 0)
532
+ topk_idxs = index_score.topk(min(self.index_topk, end_pos // ratio), dim=-1)[1]
533
+ if start_pos == 0:
534
+ mask = topk_idxs >= torch.arange(1, seqlen + 1).unsqueeze(1) // ratio
535
+ topk_idxs = torch.where(mask, -1, topk_idxs + offset)
536
+ else:
537
+ topk_idxs += offset
538
+ return topk_idxs
539
+
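+ # Indexer sketch: index_score[b, s, t] = sum_h relu(q_h · kv_t) * w_h, a
+ # ReLU-gated dot product per head mixed by weights_proj; the topk over t picks
+ # which compressed slots each query may attend to, offset into the KV cache.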
540
+
541
+ class Attention(nn.Module):
542
+ """Multi-head Latent Attention (MLA) with sliding window + optional KV compression.
543
+ Uses low-rank Q projection (wq_a -> q_norm -> wq_b) and grouped low-rank O projection."""
544
+ def __init__(self, layer_id: int, args: ModelArgs):
545
+ super().__init__()
546
+ self.layer_id = layer_id
547
+ self.dim = args.dim
548
+ self.n_heads = args.n_heads
549
+ self.n_local_heads = args.n_heads // world_size
550
+ self.q_lora_rank = args.q_lora_rank
551
+ self.o_lora_rank = args.o_lora_rank
552
+ self.head_dim = args.head_dim
553
+ self.rope_head_dim = args.rope_head_dim
554
+ self.nope_head_dim = args.head_dim - args.rope_head_dim
555
+ self.n_groups = args.o_groups
556
+ self.n_local_groups = self.n_groups // world_size
557
+ self.window_size = args.window_size
558
+ self.compress_ratio = args.compress_ratios[layer_id]
559
+ self.eps = args.norm_eps
560
+
561
+ self.attn_sink = nn.Parameter(torch.empty(self.n_local_heads, dtype=torch.float32))
562
+ self.wq_a = Linear(self.dim, self.q_lora_rank)
563
+ self.q_norm = RMSNorm(self.q_lora_rank, self.eps)
564
+ self.wq_b = ColumnParallelLinear(self.q_lora_rank, self.n_heads * self.head_dim)
565
+ self.wkv = Linear(self.dim, self.head_dim)
566
+ self.kv_norm = RMSNorm(self.head_dim, self.eps)
567
+ self.wo_a = ColumnParallelLinear(self.n_heads * self.head_dim // self.n_groups, self.n_groups * args.o_lora_rank, dtype=torch.bfloat16)
568
+ self.wo_b = RowParallelLinear(self.n_groups * args.o_lora_rank, self.dim)
569
+ self.softmax_scale = self.head_dim ** -0.5
570
+
571
+ if self.compress_ratio:
572
+ self.compressor = Compressor(args, self.compress_ratio, self.head_dim)
573
+ if self.compress_ratio == 4:
574
+ self.indexer = Indexer(args, self.compress_ratio)
575
+ else:
576
+ self.indexer = None
577
+
578
+ kv_cache_size = args.window_size + (args.max_seq_len // self.compress_ratio if self.compress_ratio else 0)
579
+ self.register_buffer("kv_cache", torch.zeros(args.max_batch_size, kv_cache_size, self.head_dim), persistent=False)
580
+ if self.compress_ratio:
581
+ original_seq_len, rope_theta = args.original_seq_len, args.compress_rope_theta
582
+ else:
583
+ # disable YaRN and use base rope_theta in pure sliding-window attention
584
+ original_seq_len, rope_theta = 0, args.rope_theta
585
+ freqs_cis = precompute_freqs_cis(self.rope_head_dim, args.max_seq_len, original_seq_len,
586
+ rope_theta, args.rope_factor, args.beta_fast, args.beta_slow)
587
+ self.register_buffer("freqs_cis", freqs_cis, persistent=False)
588
+
589
+ def forward(self, x: torch.Tensor, start_pos: int):
590
+ bsz, seqlen, _ = x.size()
591
+ freqs_cis = self.freqs_cis[start_pos:start_pos+seqlen]
592
+ win = self.window_size
593
+ ratio = self.compress_ratio
594
+ rd = self.rope_head_dim
595
+ if self.compress_ratio and self.compressor.kv_cache is None:
596
+ self.compressor.kv_cache = self.kv_cache[:, win:]
597
+ self.compressor.freqs_cis = self.freqs_cis
598
+ if self.indexer is not None:
599
+ self.indexer.freqs_cis = self.freqs_cis
600
+ # q
601
+ qr = q = self.q_norm(self.wq_a(x))
602
+ q = self.wq_b(q).unflatten(-1, (self.n_local_heads, self.head_dim))
603
+ q *= torch.rsqrt(q.square().mean(-1, keepdim=True) + self.eps)
604
+ apply_rotary_emb(q[..., -rd:], freqs_cis)
605
+
606
+ # win kv & topk_idxs
607
+ kv = self.wkv(x)
608
+ kv = self.kv_norm(kv)
609
+ apply_rotary_emb(kv[..., -rd:], freqs_cis)
610
+ # FP8-simulate non-rope dims to match QAT; rope dims stay bf16 for positional precision
611
+ act_quant(kv[..., :-rd], 64, scale_fmt, scale_dtype, True)
612
+ topk_idxs = get_window_topk_idxs(win, bsz, seqlen, start_pos)
613
+ if self.compress_ratio:
614
+ offset = kv.size(1) if start_pos == 0 else win
615
+ if self.indexer is not None:
616
+ compress_topk_idxs = self.indexer(x, qr, start_pos, offset)
617
+ else:
618
+ compress_topk_idxs = get_compress_topk_idxs(ratio, bsz, seqlen, start_pos, offset)
619
+ topk_idxs = torch.cat([topk_idxs, compress_topk_idxs], dim=-1)
620
+ topk_idxs = topk_idxs.int()
621
+
622
+ # compress kv & attn
623
+ if start_pos == 0:
624
+ if seqlen <= win:
625
+ self.kv_cache[:bsz, :seqlen] = kv
626
+ else:
627
+ cutoff = seqlen % win
628
+ self.kv_cache[:bsz, cutoff: win], self.kv_cache[:bsz, :cutoff] = kv[:, -win:].split([win - cutoff, cutoff], dim=1)
629
+ if self.compress_ratio:
630
+ if (kv_compress := self.compressor(x, start_pos)) is not None:
631
+ kv = torch.cat([kv, kv_compress], dim=1)
632
+ # We performed QAT here; kv could also use the fp8 format, though the current implementation uses bf16
633
+ o = sparse_attn(q, kv, self.attn_sink, topk_idxs, self.softmax_scale)
634
+ else:
635
+ self.kv_cache[:bsz, start_pos % win] = kv.squeeze(1)
636
+ if self.compress_ratio:
637
+ self.compressor(x, start_pos)
638
+ o = sparse_attn(q, self.kv_cache[:bsz], self.attn_sink, topk_idxs, self.softmax_scale)
639
+ apply_rotary_emb(o[..., -rd:], freqs_cis, True)
640
+
641
+ # o: apply wo_a per-group projection then wo_b
642
+ # Flatten groups into the feature dim, call wo_a as a normal linear, then reshape back.
643
+ # Equivalent to the per-group einsum when wo_a weight is block-diagonal across groups
644
+ # (always true here since n_local_groups = n_groups/world_size = 1 for 8-GPU deploy).
645
+ o = o.view(bsz, seqlen, self.n_local_groups, -1)
646
+ o = self.wo_a(o.flatten(2)).view(bsz, seqlen, self.n_local_groups, self.o_lora_rank)
647
+ x = self.wo_b(o.flatten(2))
648
+ return x
649
+
650
+
651
+ class Gate(nn.Module):
652
+ """MoE gating: computes expert routing scores and selects top-k experts.
653
+ Supports hash-based routing (first n_hash_layers) where expert indices are
654
+ predetermined per token ID, and score-based routing (remaining layers)."""
655
+ def __init__(self, layer_id: int, args: ModelArgs):
656
+ super().__init__()
657
+ self.dim = args.dim
658
+ self.topk = args.n_activated_experts
659
+ self.score_func = args.score_func
660
+ self.route_scale = args.route_scale
661
+ self.hash = layer_id < args.n_hash_layers
662
+ self.is_w4a16 = w4a16_mode
663
+ if self.is_w4a16:
664
+ in_f, out_f = args.dim, args.n_routed_experts
665
+ assert in_f % w4a16_group_size == 0 and out_f % 8 == 0
666
+ self.group_size = w4a16_group_size
667
+ self.qweight = nn.Parameter(
668
+ torch.empty(in_f // 8, out_f, dtype=torch.int32), requires_grad=False)
669
+ self.qzeros = nn.Parameter(
670
+ torch.empty(in_f // self.group_size, out_f // 8, dtype=torch.int32), requires_grad=False)
671
+ self.scales = nn.Parameter(
672
+ torch.empty(in_f // self.group_size, out_f, dtype=torch.bfloat16), requires_grad=False)
673
+ self.register_parameter("weight", None)
674
+ else:
675
+ self.weight = nn.Parameter(torch.empty(args.n_routed_experts, args.dim))
676
+ if self.hash:
677
+ self.tid2eid = nn.Parameter(torch.empty(args.vocab_size, args.n_activated_experts, dtype=torch.int32), requires_grad=False)
678
+ self.bias = None
679
+ else:
680
+ self.bias = nn.Parameter(torch.empty(args.n_routed_experts, dtype=torch.float32))
681
+
682
+ def init_woq(self, QuantLinear):
683
+ if not self.is_w4a16:
684
+ return
685
+ dev = self.qweight.device
686
+ in_f, out_f = self.dim, self.qweight.shape[1]
687
+ if out_f % 64 != 0:
688
+ self._woq = None
689
+ return
690
+ layer = QuantLinear(
691
+ bits=4, group_size=self.group_size,
692
+ in_features=in_f, out_features=out_f,
693
+ bias=False, desc_act=False, sym=True, register_buffers=True,
694
+ )
695
+ layer = layer.to(dev)
696
+ layer.qweight.copy_(self.qweight.data)
697
+ layer.qzeros.copy_(self.qzeros.data)
698
+ layer.scales.copy_(self.scales.to(layer.scales.dtype).data)
699
+ layer.g_idx.copy_(torch.arange(in_f, dtype=torch.int32, device=dev) // self.group_size)
700
+ layer.post_init()
701
+ self._woq = layer
702
+ self.qweight = None
703
+ self.qzeros = None
704
+ self.scales = None
705
+
706
+ def forward(self, x: torch.Tensor, input_ids: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, torch.Tensor]:
707
+ if self.is_w4a16:
708
+ if hasattr(self, '_woq') and self._woq is not None:
709
+ scores = self._woq(x.to(torch.bfloat16)).float()
710
+ else:
711
+ w = dequantize_w4a16(self.qweight, self.qzeros, self.scales, self.group_size)
712
+ scores = F.linear(x.to(w.dtype), w).float()
713
+ else:
714
+ scores = linear(x.float(), self.weight.float())
715
+ if self.score_func == "softmax":
716
+ scores = scores.softmax(dim=-1)
717
+ elif self.score_func == "sigmoid":
718
+ scores = scores.sigmoid()
719
+ else:
720
+ scores = F.softplus(scores).sqrt()
721
+ original_scores = scores
722
+ # Bias shifts scores for expert selection (topk) but does not affect routing weights.
723
+ if self.bias is not None:
724
+ scores = scores + self.bias
725
+ if self.hash:
726
+ indices = self.tid2eid[input_ids]
727
+ else:
728
+ indices = scores.topk(self.topk, dim=-1)[1]
729
+ weights = original_scores.gather(1, indices)
730
+ if self.score_func != "softmax":
731
+ weights /= weights.sum(dim=-1, keepdim=True)
732
+ weights *= self.route_scale
733
+ return weights, indices
734
+
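+ # Routing sketch: with score_func="sqrtsoftplus", scores = sqrt(softplus(logits)).
+ # The bias only reorders the topk selection; the gathered routing weights come
+ # from the unbiased scores and are renormalized to sum to 1 before route_scale.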
735
+
736
+ class Expert(nn.Module):
737
+ """Single MoE expert: SwiGLU FFN (w1, w2, w3). Computation in float32 for stability."""
738
+ def __init__(self, dim: int, inter_dim: int, dtype=None, swiglu_limit=0):
739
+ super().__init__()
740
+ self.w1 = Linear(dim, inter_dim, dtype=dtype)
741
+ self.w2 = Linear(inter_dim, dim, dtype=dtype)
742
+ self.w3 = Linear(dim, inter_dim, dtype=dtype)
743
+ self.swiglu_limit = swiglu_limit
744
+
745
+ def forward(self, x: torch.Tensor, weights: Optional[torch.Tensor] = None) -> torch.Tensor:
746
+ dtype = x.dtype
747
+ gate = self.w1(x).float()
748
+ up = self.w3(x).float()
749
+ if self.swiglu_limit > 0:
750
+ up = torch.clamp(up, min=-self.swiglu_limit, max=self.swiglu_limit)
751
+ gate = torch.clamp(gate, max=self.swiglu_limit)
752
+ x = F.silu(gate) * up
753
+ if weights is not None:
754
+ x = weights * x
755
+ return self.w2(x.to(dtype))
756
+
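+ # SwiGLU sketch with limit L = swiglu_limit (no clamping when L == 0):
+ #   y = w2( silu(min(w1(x), L)) * clamp(w3(x), -L, L) )
+ # computed in fp32 for numerical stability, then cast back to the input dtype.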
757
+
758
+ class MoE(nn.Module):
759
+ """Mixture-of-Experts: gate routes each token to top-k routed experts + 1 shared expert.
760
+ Experts are sharded across TP ranks; each rank handles n_routed_experts // world_size experts."""
761
+ def __init__(self, layer_id: int, args: ModelArgs):
762
+ super().__init__()
763
+ self.layer_id = layer_id
764
+ self.dim = args.dim
765
+ assert args.n_routed_experts % world_size == 0, f"Number of experts must be divisible by world size (world_size={world_size})"
766
+ self.n_routed_experts = args.n_routed_experts
767
+ self.n_local_experts = args.n_routed_experts // world_size
768
+ self.n_activated_experts = args.n_activated_experts
769
+ self.experts_start_idx = rank * self.n_local_experts
770
+ self.experts_end_idx = self.experts_start_idx + self.n_local_experts
771
+ self.gate = Gate(layer_id, args)
772
+ expert_dtype = torch.float4_e2m1fn_x2 if args.expert_dtype == "fp4" else None
773
+ self.experts = nn.ModuleList([Expert(args.dim, args.moe_inter_dim, dtype=expert_dtype, swiglu_limit=args.swiglu_limit) if self.experts_start_idx <= i < self.experts_end_idx else None
774
+ for i in range(self.n_routed_experts)])
775
+ assert args.n_shared_experts == 1
776
+ # shared expert: default dtype (no fp4), same swiglu_limit as routed experts
777
+ self.shared_experts = Expert(args.dim, args.moe_inter_dim, swiglu_limit=args.swiglu_limit)
778
+
779
+ def forward(self, x: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
780
+ shape = x.size()
781
+ x = x.view(-1, self.dim)
782
+ weights, indices = self.gate(x, input_ids.flatten())
783
+ y = torch.zeros_like(x, dtype=torch.float32)
784
+ counts = torch.bincount(indices.flatten(), minlength=self.n_routed_experts).tolist()
785
+ for i in range(self.experts_start_idx, self.experts_end_idx):
786
+ if counts[i] == 0:
787
+ continue
788
+ expert = self.experts[i]
789
+ idx, top = torch.where(indices == i)
790
+ y[idx] += expert(x[idx], weights[idx, top, None])
791
+ if world_size > 1:
792
+ dist.all_reduce(y)
793
+ y += self.shared_experts(x)
794
+ return y.type_as(x).view(shape)
795
+
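+ # Dispatch sketch: torch.where(indices == i) yields the (token, slot) pairs
+ # routed to expert i; each expert output is scaled by its routing weight and
+ # scattered back additively, then all_reduce sums the partial results across
+ # ranks (each rank only owns n_routed_experts // world_size experts).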
796
+
797
+ class Block(nn.Module):
798
+ """Transformer block with Hyper-Connections (HC) mixing.
799
+ Instead of a simple residual, HC maintains `hc_mult` copies of the hidden state.
800
+ hc_pre: reduces hc copies -> 1 via learned weighted sum (pre-weights from Sinkhorn).
801
+ hc_post: expands 1 -> hc copies via learned post-weights + combination matrix."""
802
+ def __init__(self, layer_id: int, args: ModelArgs):
803
+ super().__init__()
804
+ self.layer_id = layer_id
805
+ self.norm_eps = args.norm_eps
806
+ self.attn = Attention(layer_id, args)
807
+ self.ffn = MoE(layer_id, args)
808
+ self.attn_norm = RMSNorm(args.dim, self.norm_eps)
809
+ self.ffn_norm = RMSNorm(args.dim, self.norm_eps)
810
+ self.hc_mult = hc_mult = args.hc_mult
811
+ self.hc_sinkhorn_iters = args.hc_sinkhorn_iters
812
+ self.hc_eps = args.hc_eps
813
+ mix_hc = (2 + hc_mult) * hc_mult
814
+ hc_dim = hc_mult * args.dim
815
+ with set_dtype(torch.float32):
816
+ self.hc_attn_fn = nn.Parameter(torch.empty(mix_hc, hc_dim))
817
+ self.hc_ffn_fn = nn.Parameter(torch.empty(mix_hc, hc_dim))
818
+ self.hc_attn_base = nn.Parameter(torch.empty(mix_hc))
819
+ self.hc_ffn_base = nn.Parameter(torch.empty(mix_hc))
820
+ self.hc_attn_scale = nn.Parameter(torch.empty(3))
821
+ self.hc_ffn_scale = nn.Parameter(torch.empty(3))
822
+
823
+ def hc_pre(self, x: torch.Tensor, hc_fn: torch.Tensor, hc_scale: torch.Tensor, hc_base: torch.Tensor):
824
+ # x: [b,s,hc,d], hc_fn: [mix_hc,hc*d], hc_scale: [3], hc_base: [mix_hc], y: [b,s,d]
825
+ shape, dtype = x.size(), x.dtype
826
+ x = x.flatten(2).float()
827
+ rsqrt = torch.rsqrt(x.square().mean(-1, keepdim=True) + self.norm_eps)
828
+ mixes = F.linear(x, hc_fn) * rsqrt
829
+ pre, post, comb = hc_split_sinkhorn(mixes, hc_scale, hc_base, self.hc_mult, self.hc_sinkhorn_iters, self.hc_eps)
830
+ y = torch.sum(pre.unsqueeze(-1) * x.view(shape), dim=2)
831
+ return y.to(dtype), post, comb
832
+
833
+ def hc_post(self, x: torch.Tensor, residual: torch.Tensor, post: torch.Tensor, comb: torch.Tensor):
834
+ # x: [b,s,d], residual: [b,s,hc,d], post: [b,s,hc], comb: [b,s,hc,hc], y: [b,s,hc,d]
835
+ y = post.unsqueeze(-1) * x.unsqueeze(-2) + torch.sum(comb.unsqueeze(-1) * residual.unsqueeze(-2), dim=2)
836
+ return y.type_as(x)
837
+
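+ # HC shape sketch (hc = hc_mult): hc_pre reduces x [b,s,hc,d] -> [b,s,d] with
+ # pre-weights [b,s,hc]; hc_post expands the sublayer output back to [b,s,hc,d]:
+ #   y[..., k, :] = post[..., k] * x + sum_j comb[..., j, k] * residual[..., j, :]
+ # where comb is the (approximately doubly stochastic) Sinkhorn mixing matrix.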
838
+ def forward(self, x: torch.Tensor, start_pos: int, input_ids: Optional[torch.Tensor]) -> torch.Tensor:
839
+ residual = x
840
+ x, post, comb = self.hc_pre(x, self.hc_attn_fn, self.hc_attn_scale, self.hc_attn_base)
841
+ x = self.attn_norm(x)
842
+ x = self.attn(x, start_pos)
843
+ x = self.hc_post(x, residual, post, comb)
844
+
845
+ residual = x
846
+ x, post, comb = self.hc_pre(x, self.hc_ffn_fn, self.hc_ffn_scale, self.hc_ffn_base)
847
+ x = self.ffn_norm(x)
848
+ x = self.ffn(x, input_ids)
849
+ x = self.hc_post(x, residual, post, comb)
850
+ return x
851
+
852
+
853
+ class ParallelHead(nn.Module):
854
+
855
+ def __init__(self, vocab_size: int, dim: int, norm_eps: float = 1e-6, hc_eps: float = 1e-6):
856
+ super().__init__()
857
+ self.vocab_size = vocab_size
858
+ self.dim = dim
859
+ self.norm_eps = norm_eps
860
+ self.hc_eps = hc_eps
861
+ self.part_vocab_size = (vocab_size // world_size)
862
+ # lm_head is always stored as bf16 (even in W4A16 checkpoints); use fp32 for logit precision
863
+ self.weight = nn.Parameter(torch.empty(self.part_vocab_size, self.dim, dtype=torch.float32))
864
+
865
+ def get_logits(self, x):
866
+ return F.linear(x[:, -1].float(), self.weight)
867
+
868
+ def forward(self, x: torch.Tensor, hc_fn: torch.Tensor, hc_scale: torch.Tensor, hc_base: torch.Tensor, norm: RMSNorm):
869
+ # x: [b,s,hc,d]
870
+ x = self.hc_head(x, hc_fn, hc_scale, hc_base)
871
+ logits = self.get_logits(norm(x))
872
+ if world_size > 1:
873
+ all_logits = [torch.empty_like(logits) for _ in range(world_size)]
874
+ dist.all_gather(all_logits, logits)
875
+ logits = torch.cat(all_logits, dim=-1)
876
+ return logits
877
+
878
+ def hc_head(self, x: torch.Tensor, hc_fn: torch.Tensor, hc_scale: torch.Tensor, hc_base: torch.Tensor):
879
+ shape, dtype = x.size(), x.dtype
880
+ x = x.flatten(2).float()
881
+ rsqrt = torch.rsqrt(x.square().mean(-1, keepdim=True) + self.norm_eps)
882
+ mixes = F.linear(x, hc_fn) * rsqrt
883
+ pre = torch.sigmoid(mixes * hc_scale + hc_base) + self.hc_eps
884
+ y = torch.sum(pre.unsqueeze(-1) * x.view(shape), dim=2)
885
+ return y.to(dtype)
886
+
887
+
888
+ class MTPBlock(Block):
889
+
890
+ def __init__(self, layer_id: int, args: ModelArgs):
891
+ super().__init__(layer_id, args)
892
+ self.e_proj = Linear(args.dim, args.dim)
893
+ self.h_proj = Linear(args.dim, args.dim)
894
+ self.enorm = RMSNorm(args.dim, args.norm_eps)
895
+ self.hnorm = RMSNorm(args.dim, args.norm_eps)
896
+ self.norm = RMSNorm(args.dim, args.norm_eps)
897
+ self.hc_mult = hc_mult = args.hc_mult
898
+ hc_dim = hc_mult * args.dim
899
+ with set_dtype(torch.float32):
900
+ self.hc_head_fn = nn.Parameter(torch.empty(hc_mult, hc_dim))
901
+ self.hc_head_base = nn.Parameter(torch.empty(hc_mult))
902
+ self.hc_head_scale = nn.Parameter(torch.empty(1))
903
+ self.embed: ParallelEmbedding = None
904
+ self.head: ParallelHead = None
905
+
906
+ @torch.inference_mode()
907
+ def forward(self, x: torch.Tensor, start_pos: int, input_ids: torch.Tensor) -> torch.Tensor:
908
+ # x: [b,s,hc,d]
909
+ assert self.embed is not None and self.head is not None
910
+ e = self.embed(input_ids)
911
+ e = self.enorm(e)
912
+ x = self.hnorm(x)
913
+ x = self.e_proj(e).unsqueeze(2) + self.h_proj(x)
914
+ x = super().forward(x, start_pos, input_ids)
915
+ logits = self.head(x, self.hc_head_fn, self.hc_head_scale, self.hc_head_base, self.norm)
916
+ return logits
917
+
918
+
919
+ class Transformer(nn.Module):
920
+ """Full DeepSeek-V4 model: embed -> HC-expand -> N blocks -> HC-head -> logits.
921
+ Sets global state (world_size, rank, default_dtype, scale_fmt, scale_dtype) in __init__."""
922
+ def __init__(self, args: ModelArgs):
923
+ global world_size, rank, default_dtype, scale_fmt, scale_dtype, w4a16_mode
924
+ world_size = dist.get_world_size() if dist.is_initialized() else 1
925
+ rank = dist.get_rank() if dist.is_initialized() else 0
926
+ w4a16_mode = (args.dtype == "w4a16")
927
+ if w4a16_mode:
928
+ default_dtype = torch.bfloat16
929
+ scale_fmt = None
930
+ scale_dtype = torch.float32
931
+ else:
932
+ default_dtype = torch.float8_e4m3fn if args.dtype == "fp8" else torch.bfloat16
933
+ scale_fmt = "ue8m0" if args.scale_dtype == "fp8" else args.scale_fmt
934
+ scale_dtype = torch.float8_e8m0fnu if args.scale_dtype == "fp8" else torch.float32
935
+ super().__init__()
936
+ self.max_seq_len = args.max_seq_len
937
+ self.norm_eps = args.norm_eps
938
+ self.hc_eps = args.hc_eps
939
+ self.embed = ParallelEmbedding(args.vocab_size, args.dim)
940
+ self.layers = torch.nn.ModuleList()
941
+ for layer_id in range(args.n_layers):
942
+ self.layers.append(Block(layer_id, args))
943
+ self.norm = RMSNorm(args.dim, self.norm_eps)
944
+ self.head = ParallelHead(args.vocab_size, args.dim, self.norm_eps, self.hc_eps)
945
+ self.mtp = torch.nn.ModuleList()
946
+ for layer_id in range(args.n_mtp_layers):
947
+ self.mtp.append(MTPBlock(args.n_layers + layer_id, args))
948
+ self.mtp[-1].embed = self.embed
949
+ self.mtp[-1].head = self.head
950
+ self.hc_mult = hc_mult = args.hc_mult
951
+ hc_dim = hc_mult * args.dim
952
+ with set_dtype(torch.float32):
953
+ self.hc_head_fn = nn.Parameter(torch.empty(hc_mult, hc_dim))
954
+ self.hc_head_base = nn.Parameter(torch.empty(hc_mult))
955
+ self.hc_head_scale = nn.Parameter(torch.empty(1))
956
+
957
+ def init_woq_layers(self):
958
+ """After load_model(), convert all W4A16 parameters into QuantLinear layers."""
959
+ # from gptqmodel.nn_modules.qlinear.tritonv2 import TritonV2QuantLinear as QuantLinear
960
+ from gptqmodel.nn_modules.qlinear.marlin import MarlinQuantLinear as QuantLinear
961
+ for module in self.modules():
962
+ if hasattr(module, 'init_woq') and module is not self:
963
+ module.init_woq(QuantLinear)
964
+ torch.cuda.empty_cache()
965
+
966
+ @torch.inference_mode()
967
+ def forward(self, input_ids: torch.Tensor, start_pos: int = 0):
968
+ h = self.embed(input_ids)
969
+ # Expand to hc_mult copies for Hyper-Connections
970
+ h = h.unsqueeze(2).repeat(1, 1, self.hc_mult, 1)
971
+ for layer in self.layers:
972
+ h = layer(h, start_pos, input_ids)
973
+ logits = self.head(h, self.hc_head_fn, self.hc_head_scale, self.hc_head_base, self.norm)
974
+ return logits
975
+
976
+
977
+ if __name__ == "__main__":
978
+ torch.set_default_dtype(torch.bfloat16)
979
+ torch.set_default_device("cuda")
980
+ torch.manual_seed(0)
981
+ args = ModelArgs(n_hash_layers=0)
982
+ x = torch.randint(0, args.vocab_size, (2, 128))
983
+ model = Transformer(args)
984
+
985
+ print(model(x).size())
986
+ for i in range(128, 150):
987
+ print(i, model(x[:, 0:1], i).size())
988
+
989
+ h = torch.randn(2, 128, args.hc_mult, args.dim)
990
+ mtp = model.mtp[0]
991
+ print(mtp(h, 0, x).size())
992
+ print(mtp(h[:, 0:1], 1, x[:, 0:1]).size())
inference/requirements.txt ADDED
@@ -0,0 +1,6 @@
1
+ torch>=2.10.0
2
+ transformers>=5.0.0
3
+ safetensors>=0.7.0
4
+ fast_hadamard_transform
5
+ tilelang==0.1.8
6
+ gptqmodel==6.0.3
model-00001-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0ef64d991b80c86f24bd78d1bac9d452d95bc78e6b1d8feb6a182dae0240c7e5
3
+ size 1853358176
model-00002-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f7d7f2862248d66164e5bf9007a9da3b0f1d6c0b9d5143c43dbc2d9ad4a7ff12
3
+ size 13390865040
model-00003-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:881002f38770cf6e0be5c24801d11c595674c3c4d7ff3485c45bf06d848e90a7
3
+ size 13390865040
model-00004-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8c1c824e837c5c78d2467c70222e5b690c91a9bc772f6662ad48b39db06d116d
3
+ size 13403120848
model-00005-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6c7cb7212f3374b69f29e1198b572392084135bdd0bda38282f8bd9a0c8ed46b
3
+ size 13384661096
model-00006-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:683444102f6eeae98ccaa00d1aada1e44253a80541a17d457382bd070947ac59
3
+ size 13396916904
model-00007-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:863e5634abb25b3cf83625ff90f3c5783b5619435ae184112f05c2d65c407a84
3
+ size 13384661096
model-00008-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a86efce4fc51c41911f78480863ba467d97f8a65ce03365d3f0a002b892674ad
3
+ size 13396916904
model-00009-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6cdf0891b172d21168999bd0adb520b820a078351ee15a77d7b07867908347cd
3
+ size 13384661096
model-00010-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2615fd85e6a5913201d05b1cefe8e664dc910b88531ff1ae46da21ed41a9558e
3
+ size 13396916904
model-00011-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4922eb51e18579581a83bce28ab880b8dd5390cd4d03dd0c24e4b2c4f6fd3dc6
3
+ size 13384661096
model-00012-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f1b289316453fe719094a30e5cea8e577787dc7560512f872db669962addac88
3
+ size 13396920416
model-00013-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:53c5b20d54738bdcfc827f165a6543848b6ccf384ef66bf9402ea7244ef67c52
3
+ size 13384664600
model-00014-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7423ce801964f2af7f4056531432fcc1a140a7cd33a2a5b530fb30e49609e618
3
+ size 13396920416
model-00015-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d13483f05edd67b110bcf0cefe7a84f9bfe94ce47e654af4e60238955fa50989
3
+ size 13384664600
model-00016-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:75704e031c9be328e8133ee2c6603a430bcfd61979f5a561964bf29d0134104b
3
+ size 13396920416
model-00017-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c7f0c6876bade6ea29e2662ebae000ce3cfde96887c4ca66a90db77ee530db8b
3
+ size 13384664600
model-00018-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4abe0458401fc5e990e3699dbff2b98e05355f66bcc5d974ece79e7e14672765
3
+ size 13396920416
model-00019-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:29238f11c48bbb56de1c513e9806edcaf63a142b191e36668b64d93df29b5238
3
+ size 13384664600
model-00020-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6bd7d5d0e0337dc0af3b68287c237c6ac6c922596a64c88320fa8c62ddc32ceb
3
+ size 13396920416
model-00021-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:572a2a0e0910421cd7f4519da30c604cddf4a1aa61fe50e51788ac8f3e0875d7
3
+ size 13384664600
model-00022-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:077bce599585a64769465c6ef3614d4717184ce7f776162ee63884cc38e5ded3
3
+ size 13396920416
model-00023-of-00064.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b5ea118977b28e91802ff0df2cfd262c42eb2b439d6bd019bb2f830ea9ee7de2
3
+ size 13384664600