Commit ·
e0ee936
1
Parent(s): 344a634
Add files using upload-large-folder tool
Browse filesThis view is limited to 50 files because it contains too many changes. See raw diff
- .gitattributes +2 -0
- LICENSE +27 -0
- README.md +618 -0
- THIRD_PARTY_NOTICES.md +43 -0
- chat_template.jinja +112 -0
- config.json +192 -0
- configuration_deepseek.py +214 -0
- configuration_kimi_k25.py +123 -0
- docs/deploy_guidance.md +94 -0
- figures/demo_video.mp4 +3 -0
- figures/kimi-logo.png +0 -0
- generation_config.json +4 -0
- kimi_k25_processor.py +165 -0
- kimi_k25_vision_processing.py +251 -0
- media_utils.py +368 -0
- model-00001-of-000064.safetensors +3 -0
- model-00002-of-000064.safetensors +3 -0
- model-00003-of-000064.safetensors +3 -0
- model-00004-of-000064.safetensors +3 -0
- model-00005-of-000064.safetensors +3 -0
- model-00006-of-000064.safetensors +3 -0
- model-00007-of-000064.safetensors +3 -0
- model-00008-of-000064.safetensors +3 -0
- model-00009-of-000064.safetensors +3 -0
- model-00010-of-000064.safetensors +3 -0
- model-00011-of-000064.safetensors +3 -0
- model-00012-of-000064.safetensors +3 -0
- model-00013-of-000064.safetensors +3 -0
- model-00014-of-000064.safetensors +3 -0
- model-00015-of-000064.safetensors +3 -0
- model-00016-of-000064.safetensors +3 -0
- model-00017-of-000064.safetensors +3 -0
- model-00018-of-000064.safetensors +3 -0
- model-00019-of-000064.safetensors +3 -0
- model-00020-of-000064.safetensors +3 -0
- model-00021-of-000064.safetensors +3 -0
- model-00022-of-000064.safetensors +3 -0
- model-00023-of-000064.safetensors +3 -0
- model-00024-of-000064.safetensors +3 -0
- model-00025-of-000064.safetensors +3 -0
- model-00026-of-000064.safetensors +3 -0
- model-00027-of-000064.safetensors +3 -0
- model-00028-of-000064.safetensors +3 -0
- model-00029-of-000064.safetensors +3 -0
- model-00030-of-000064.safetensors +3 -0
- model-00031-of-000064.safetensors +3 -0
- model-00032-of-000064.safetensors +3 -0
- model-00033-of-000064.safetensors +3 -0
- model-00034-of-000064.safetensors +3 -0
- model-00035-of-000064.safetensors +3 -0
.gitattributes
CHANGED
|
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
|
|
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
model.safetensors.index.json filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
figures/demo_video.mp4 filter=lfs diff=lfs merge=lfs -text
|
LICENSE
ADDED
|
@@ -0,0 +1,27 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
Modified MIT License
|
| 2 |
+
|
| 3 |
+
Copyright (c) 2026 Moonshot AI
|
| 4 |
+
|
| 5 |
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
| 6 |
+
of this software and associated documentation files (the “Software”), to deal
|
| 7 |
+
in the Software without restriction, including without limitation the rights
|
| 8 |
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
| 9 |
+
copies of the Software, and to permit persons to whom the Software is
|
| 10 |
+
furnished to do so, subject to the following conditions:
|
| 11 |
+
|
| 12 |
+
The above copyright notice and this permission notice shall be included in all
|
| 13 |
+
copies or substantial portions of the Software.
|
| 14 |
+
|
| 15 |
+
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
| 16 |
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
| 17 |
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
| 18 |
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
| 19 |
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
| 20 |
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
| 21 |
+
SOFTWARE.
|
| 22 |
+
|
| 23 |
+
Our only modification part is that, if the Software (or any derivative works
|
| 24 |
+
thereof) is used for any of your commercial products or services that have
|
| 25 |
+
more than 100 million monthly active users, or more than 20 million US dollars
|
| 26 |
+
(or equivalent in other currencies) in monthly revenue, you shall prominently
|
| 27 |
+
display "Kimi K2.6" on the user interface of such product or service.
|
README.md
ADDED
|
@@ -0,0 +1,618 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
tags:
|
| 3 |
+
- compressed-tensors
|
| 4 |
+
license: other
|
| 5 |
+
license_name: modified-mit
|
| 6 |
+
library_name: transformers
|
| 7 |
+
pipeline_tag: image-text-to-text
|
| 8 |
+
---
|
| 9 |
+
<div align="center">
|
| 10 |
+
<picture>
|
| 11 |
+
<img src="figures/kimi-logo.png" width="30%" alt="Kimi K2.6">
|
| 12 |
+
</picture>
|
| 13 |
+
</div>
|
| 14 |
+
<hr>
|
| 15 |
+
<div align="center" style="line-height:1">
|
| 16 |
+
<a href="https://www.kimi.com" target="_blank"><img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-Kimi%20K2.6-ff6b6b?color=1783ff&logoColor=white"/></a>
|
| 17 |
+
<a href="https://www.moonshot.ai" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-Moonshot%20AI-white?logo=Kimi&logoColor=white"/></a>
|
| 18 |
+
</div>
|
| 19 |
+
|
| 20 |
+
<div align="center" style="line-height: 1;">
|
| 21 |
+
<a href="https://huggingface.co/moonshotai" target="_blank"><img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Moonshot%20AI-ffc107?color=ffc107&logoColor=white"/></a>
|
| 22 |
+
<a href="https://twitter.com/kimi_moonshot" target="_blank"><img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-Kimi.ai-white?logo=x&logoColor=white"/></a>
|
| 23 |
+
<a href="https://discord.gg/TYU2fdJykW" target="_blank"><img alt="Discord" src="https://img.shields.io/badge/Discord-Kimi.ai-white?logo=discord&logoColor=white"/></a>
|
| 24 |
+
</div>
|
| 25 |
+
<div align="center" style="line-height: 1;">
|
| 26 |
+
<a href="https://huggingface.co/moonshotai/Kimi-K2.6/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Modified_MIT-f5de53?&color=f5de53"/></a>
|
| 27 |
+
</div>
|
| 28 |
+
|
| 29 |
+
|
| 30 |
+
## 1. Model Introduction
|
| 31 |
+
|
| 32 |
+
Kimi K2.6 is an open-source, native multimodal agentic model that advances practical capabilities in long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration.
|
| 33 |
+
|
| 34 |
+
### Key Features
|
| 35 |
+
- **Long-Horizon Coding**: K2.6 achieves significant improvements on complex, end-to-end coding tasks, generalizing robustly across programming languages (Rust, Go, Python) and domains spanning front-end, DevOps, and performance optimization.
|
| 36 |
+
- **Coding-Driven Design**: K2.6 is capable of transforming simple prompts and visual inputs into production-ready interfaces and lightweight full-stack workflows, generating structured layouts, interactive elements, and rich animations with deliberate aesthetic precision.
|
| 37 |
+
- **Elevated Agent Swarm**: Scaling horizontally to 300 sub-agents executing 4,000 coordinated steps, K2.6 can dynamically decompose tasks into parallel, domain-specialized subtasks, delivering end-to-end outputs from documents to websites to spreadsheets in a single autonomous run.
|
| 38 |
+
- **Proactive & Open Orchestration**: For autonomous tasks, K2.6 demonstrates strong performance in powering persistent, 24/7 background agents that proactively manage schedules, execute code, and orchestrate cross-platform operations without human oversight.
|
| 39 |
+
|
| 40 |
+
## 2. Model Summary
|
| 41 |
+
|
| 42 |
+
<div align="center">
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
| | |
|
| 46 |
+
|:---:|:---:|
|
| 47 |
+
| **Architecture** | Mixture-of-Experts (MoE) |
|
| 48 |
+
| **Total Parameters** | 1T |
|
| 49 |
+
| **Activated Parameters** | 32B |
|
| 50 |
+
| **Number of Layers** (Dense layer included) | 61 |
|
| 51 |
+
| **Number of Dense Layers** | 1 |
|
| 52 |
+
| **Attention Hidden Dimension** | 7168 |
|
| 53 |
+
| **MoE Hidden Dimension** (per Expert) | 2048 |
|
| 54 |
+
| **Number of Attention Heads** | 64 |
|
| 55 |
+
| **Number of Experts** | 384 |
|
| 56 |
+
| **Selected Experts per Token** | 8 |
|
| 57 |
+
| **Number of Shared Experts** | 1 |
|
| 58 |
+
| **Vocabulary Size** | 160K |
|
| 59 |
+
| **Context Length** | 256K |
|
| 60 |
+
| **Attention Mechanism** | MLA |
|
| 61 |
+
| **Activation Function** | SwiGLU |
|
| 62 |
+
| **Vision Encoder** | MoonViT |
|
| 63 |
+
| **Parameters of Vision Encoder** | 400M |
|
| 64 |
+
</div>
|
| 65 |
+
|
| 66 |
+
## 3. Evaluation Results
|
| 67 |
+
|
| 68 |
+
<div align="center">
|
| 69 |
+
<table>
|
| 70 |
+
<thead>
|
| 71 |
+
<tr>
|
| 72 |
+
<th align="center">Benchmark</th>
|
| 73 |
+
<th align="center"><sup>Kimi K2.6</sup></th>
|
| 74 |
+
<th align="center"><sup>GPT-5.4 <br><sup>(xhigh)</sup></sup></th>
|
| 75 |
+
<th align="center"><sup>Claude Opus 4.6 <br><sup>(max effort)</sup></sup></th>
|
| 76 |
+
<th align="center"><sup>Gemini 3.1 Pro<br><sup>(thinking high)</sup></sup></th>
|
| 77 |
+
<th align="center"><sup>Kimi K2.5</sup></th>
|
| 78 |
+
</tr>
|
| 79 |
+
</thead>
|
| 80 |
+
<tbody>
|
| 81 |
+
<tr>
|
| 82 |
+
<td align="center" colspan=6><strong>Agentic</strong></td>
|
| 83 |
+
</tr>
|
| 84 |
+
<tr>
|
| 85 |
+
<td align="center" style="vertical-align: middle">HLE-Full<br>(w/ tools)</td>
|
| 86 |
+
<td align="center" style="vertical-align: middle">54.0</td>
|
| 87 |
+
<td align="center" style="vertical-align: middle">52.1</td>
|
| 88 |
+
<td align="center" style="vertical-align: middle">53.0</td>
|
| 89 |
+
<td align="center" style="vertical-align: middle">51.4</td>
|
| 90 |
+
<td align="center" style="vertical-align: middle">50.2</td>
|
| 91 |
+
</tr>
|
| 92 |
+
<tr>
|
| 93 |
+
<td align="center" style="vertical-align: middle">BrowseComp</td>
|
| 94 |
+
<td align="center" style="vertical-align: middle">83.2</td>
|
| 95 |
+
<td align="center" style="vertical-align: middle" rowspan="2">82.7</td>
|
| 96 |
+
<td align="center" style="vertical-align: middle" rowspan="2">83.7</td>
|
| 97 |
+
<td align="center" style="vertical-align: middle" rowspan="2">85.9</td>
|
| 98 |
+
<td align="center" style="vertical-align: middle">74.9</td>
|
| 99 |
+
</tr>
|
| 100 |
+
<tr>
|
| 101 |
+
<td align="center" style="vertical-align: middle">BrowseComp<br>(Agent Swarm)</td>
|
| 102 |
+
<td align="center" style="vertical-align: middle">86.3</td>
|
| 103 |
+
<td align="center" style="vertical-align: middle">78.4</td>
|
| 104 |
+
</tr>
|
| 105 |
+
<tr>
|
| 106 |
+
<td align="center" style="vertical-align: middle">DeepSearchQA<br>(f1-score)</td>
|
| 107 |
+
<td align="center" style="vertical-align: middle">92.5</td>
|
| 108 |
+
<td align="center" style="vertical-align: middle">78.6</td>
|
| 109 |
+
<td align="center" style="vertical-align: middle">91.3</td>
|
| 110 |
+
<td align="center" style="vertical-align: middle">81.9</td>
|
| 111 |
+
<td align="center" style="vertical-align: middle">89.0</td>
|
| 112 |
+
</tr>
|
| 113 |
+
<tr>
|
| 114 |
+
<td align="center" style="vertical-align: middle">DeepSearchQA<br>(accuracy)</td>
|
| 115 |
+
<td align="center" style="vertical-align: middle">83.0</td>
|
| 116 |
+
<td align="center" style="vertical-align: middle">63.7</td>
|
| 117 |
+
<td align="center" style="vertical-align: middle">80.6</td>
|
| 118 |
+
<td align="center" style="vertical-align: middle">60.2</td>
|
| 119 |
+
<td align="center" style="vertical-align: middle">77.1</td>
|
| 120 |
+
</tr>
|
| 121 |
+
<tr>
|
| 122 |
+
<td align="center" style="vertical-align: middle">WideSearch<br> (item-f1)</td>
|
| 123 |
+
<td align="center" style="vertical-align: middle">80.8</td>
|
| 124 |
+
<td align="center" style="vertical-align: middle">-</td>
|
| 125 |
+
<td align="center" style="vertical-align: middle">-</td>
|
| 126 |
+
<td align="center" style="vertical-align: middle">-</td>
|
| 127 |
+
<td align="center" style="vertical-align: middle">72.7</td>
|
| 128 |
+
</tr>
|
| 129 |
+
<tr>
|
| 130 |
+
<td align="center" style="vertical-align: middle">Toolathlon</td>
|
| 131 |
+
<td align="center" style="vertical-align: middle">50.0</td>
|
| 132 |
+
<td align="center" style="vertical-align: middle">54.6</td>
|
| 133 |
+
<td align="center" style="vertical-align: middle">47.2</td>
|
| 134 |
+
<td align="center" style="vertical-align: middle">48.8</td>
|
| 135 |
+
<td align="center" style="vertical-align: middle">27.8</td>
|
| 136 |
+
</tr>
|
| 137 |
+
<tr>
|
| 138 |
+
<td align="center" style="vertical-align: middle">MCPMark</td>
|
| 139 |
+
<td align="center" style="vertical-align: middle">55.9</td>
|
| 140 |
+
<td align="center" style="vertical-align: middle">62.5*</td>
|
| 141 |
+
<td align="center" style="vertical-align: middle">56.7*</td>
|
| 142 |
+
<td align="center" style="vertical-align: middle">55.9*</td>
|
| 143 |
+
<td align="center" style="vertical-align: middle">29.5</td>
|
| 144 |
+
</tr>
|
| 145 |
+
<tr>
|
| 146 |
+
<td align="center" style="vertical-align: middle">Claw Eval (pass^3)</td>
|
| 147 |
+
<td align="center" style="vertical-align: middle">62.3</td>
|
| 148 |
+
<td align="center" style="vertical-align: middle">60.3</td>
|
| 149 |
+
<td align="center" style="vertical-align: middle">70.4</td>
|
| 150 |
+
<td align="center" style="vertical-align: middle">57.8</td>
|
| 151 |
+
<td align="center" style="vertical-align: middle">52.3</td>
|
| 152 |
+
</tr>
|
| 153 |
+
<tr>
|
| 154 |
+
<td align="center" style="vertical-align: middle">Claw Eval (pass@3)</td>
|
| 155 |
+
<td align="center" style="vertical-align: middle">80.9</td>
|
| 156 |
+
<td align="center" style="vertical-align: middle">78.4</td>
|
| 157 |
+
<td align="center" style="vertical-align: middle">82.4</td>
|
| 158 |
+
<td align="center" style="vertical-align: middle">82.9</td>
|
| 159 |
+
<td align="center" style="vertical-align: middle">75.4</td>
|
| 160 |
+
</tr>
|
| 161 |
+
<tr>
|
| 162 |
+
<td align="center" style="vertical-align: middle">APEX-Agents</td>
|
| 163 |
+
<td align="center" style="vertical-align: middle">27.9</td>
|
| 164 |
+
<td align="center" style="vertical-align: middle">33.3</td>
|
| 165 |
+
<td align="center" style="vertical-align: middle">33.0</td>
|
| 166 |
+
<td align="center" style="vertical-align: middle">32.0</td>
|
| 167 |
+
<td align="center" style="vertical-align: middle">11.5</td>
|
| 168 |
+
</tr>
|
| 169 |
+
<tr>
|
| 170 |
+
<td align="center" style="vertical-align: middle">OSWorld-Verified</td>
|
| 171 |
+
<td align="center" style="vertical-align: middle">73.1</td>
|
| 172 |
+
<td align="center" style="vertical-align: middle">75.0</td>
|
| 173 |
+
<td align="center" style="vertical-align: middle">72.7</td>
|
| 174 |
+
<td align="center" style="vertical-align: middle">-</td>
|
| 175 |
+
<td align="center" style="vertical-align: middle">63.3</td>
|
| 176 |
+
</tr>
|
| 177 |
+
<tr>
|
| 178 |
+
<td align="center" colspan=6><strong>Coding</strong></td>
|
| 179 |
+
</tr>
|
| 180 |
+
<tr>
|
| 181 |
+
<td align="center" style="vertical-align: middle">Terminal-Bench 2.0<br>(Terminus-2)</td>
|
| 182 |
+
<td align="center" style="vertical-align: middle">66.7</td>
|
| 183 |
+
<td align="center" style="vertical-align: middle">65.4*</td>
|
| 184 |
+
<td align="center" style="vertical-align: middle">65.4</td>
|
| 185 |
+
<td align="center" style="vertical-align: middle">68.5</td>
|
| 186 |
+
<td align="center" style="vertical-align: middle">50.8</td>
|
| 187 |
+
</tr>
|
| 188 |
+
<tr>
|
| 189 |
+
<td align="center" style="vertical-align: middle">SWE-Bench Pro</td>
|
| 190 |
+
<td align="center" style="vertical-align: middle">58.6</td>
|
| 191 |
+
<td align="center" style="vertical-align: middle">57.7</td>
|
| 192 |
+
<td align="center" style="vertical-align: middle">53.4</td>
|
| 193 |
+
<td align="center" style="vertical-align: middle">54.2</td>
|
| 194 |
+
<td align="center" style="vertical-align: middle">50.7</td>
|
| 195 |
+
</tr>
|
| 196 |
+
<tr>
|
| 197 |
+
<td align="center" style="vertical-align: middle">SWE-Bench Multilingual</td>
|
| 198 |
+
<td align="center" style="vertical-align: middle">76.7</td>
|
| 199 |
+
<td align="center" style="vertical-align: middle">-</td>
|
| 200 |
+
<td align="center" style="vertical-align: middle">77.8</td>
|
| 201 |
+
<td align="center" style="vertical-align: middle">76.9*</td>
|
| 202 |
+
<td align="center" style="vertical-align: middle">73.0</td>
|
| 203 |
+
</tr>
|
| 204 |
+
<tr>
|
| 205 |
+
<td align="center" style="vertical-align: middle">SWE-Bench Verified</td>
|
| 206 |
+
<td align="center" style="vertical-align: middle">80.2</td>
|
| 207 |
+
<td align="center" style="vertical-align: middle">-</td>
|
| 208 |
+
<td align="center" style="vertical-align: middle">80.8</td>
|
| 209 |
+
<td align="center" style="vertical-align: middle">80.6</td>
|
| 210 |
+
<td align="center" style="vertical-align: middle">76.8</td>
|
| 211 |
+
</tr>
|
| 212 |
+
<tr>
|
| 213 |
+
<td align="center" style="vertical-align: middle">SciCode</td>
|
| 214 |
+
<td align="center" style="vertical-align: middle">52.2</td>
|
| 215 |
+
<td align="center" style="vertical-align: middle">56.6</td>
|
| 216 |
+
<td align="center" style="vertical-align: middle">51.9</td>
|
| 217 |
+
<td align="center" style="vertical-align: middle">58.9</td>
|
| 218 |
+
<td align="center" style="vertical-align: middle">48.7</td>
|
| 219 |
+
</tr>
|
| 220 |
+
<tr>
|
| 221 |
+
<td align="center" style="vertical-align: middle">OJBench (python)</td>
|
| 222 |
+
<td align="center" style="vertical-align: middle">60.6</td>
|
| 223 |
+
<td align="center" style="vertical-align: middle">-</td>
|
| 224 |
+
<td align="center" style="vertical-align: middle">60.3</td>
|
| 225 |
+
<td align="center" style="vertical-align: middle">70.7</td>
|
| 226 |
+
<td align="center" style="vertical-align: middle">54.7</td>
|
| 227 |
+
</tr>
|
| 228 |
+
<tr>
|
| 229 |
+
<td align="center" style="vertical-align: middle">LiveCodeBench (v6)</td>
|
| 230 |
+
<td align="center" style="vertical-align: middle">89.6</td>
|
| 231 |
+
<td align="center" style="vertical-align: middle">-</td>
|
| 232 |
+
<td align="center" style="vertical-align: middle">88.8</td>
|
| 233 |
+
<td align="center" style="vertical-align: middle">91.7</td>
|
| 234 |
+
<td align="center" style="vertical-align: middle">85.0</td>
|
| 235 |
+
</tr>
|
| 236 |
+
<tr>
|
| 237 |
+
<td align="center" colspan=6><strong>Reasoning & Knowledge</strong></td>
|
| 238 |
+
</tr>
|
| 239 |
+
<tr>
|
| 240 |
+
<td align="center" style="vertical-align: middle">HLE-Full</td>
|
| 241 |
+
<td align="center" style="vertical-align: middle">34.7</td>
|
| 242 |
+
<td align="center" style="vertical-align: middle">39.8</td>
|
| 243 |
+
<td align="center" style="vertical-align: middle">40.0</td>
|
| 244 |
+
<td align="center" style="vertical-align: middle">44.4</td>
|
| 245 |
+
<td align="center" style="vertical-align: middle">30.1</td>
|
| 246 |
+
</tr>
|
| 247 |
+
<tr>
|
| 248 |
+
<td align="center" style="vertical-align: middle">AIME 2026</td>
|
| 249 |
+
<td align="center" style="vertical-align: middle">96.4</td>
|
| 250 |
+
<td align="center" style="vertical-align: middle">99.2</td>
|
| 251 |
+
<td align="center" style="vertical-align: middle">96.7</td>
|
| 252 |
+
<td align="center" style="vertical-align: middle">98.3</td>
|
| 253 |
+
<td align="center" style="vertical-align: middle">95.8</td>
|
| 254 |
+
</tr>
|
| 255 |
+
<tr>
|
| 256 |
+
<td align="center" style="vertical-align: middle">HMMT 2026 (Feb)</td>
|
| 257 |
+
<td align="center" style="vertical-align: middle">92.7</td>
|
| 258 |
+
<td align="center" style="vertical-align: middle">97.7</td>
|
| 259 |
+
<td align="center" style="vertical-align: middle">96.2</td>
|
| 260 |
+
<td align="center" style="vertical-align: middle">94.7</td>
|
| 261 |
+
<td align="center" style="vertical-align: middle">87.1</td>
|
| 262 |
+
</tr>
|
| 263 |
+
<tr>
|
| 264 |
+
<td align="center" style="vertical-align: middle">IMO-AnswerBench</td>
|
| 265 |
+
<td align="center" style="vertical-align: middle">86.0</td>
|
| 266 |
+
<td align="center" style="vertical-align: middle">91.4</td>
|
| 267 |
+
<td align="center" style="vertical-align: middle">75.3</td>
|
| 268 |
+
<td align="center" style="vertical-align: middle">91.0*</td>
|
| 269 |
+
<td align="center" style="vertical-align: middle">81.8</td>
|
| 270 |
+
</tr>
|
| 271 |
+
<tr>
|
| 272 |
+
<td align="center" style="vertical-align: middle">GPQA-Diamond</td>
|
| 273 |
+
<td align="center" style="vertical-align: middle">90.5</td>
|
| 274 |
+
<td align="center" style="vertical-align: middle">92.8</td>
|
| 275 |
+
<td align="center" style="vertical-align: middle">91.3</td>
|
| 276 |
+
<td align="center" style="vertical-align: middle">94.3</td>
|
| 277 |
+
<td align="center" style="vertical-align: middle">87.6</td>
|
| 278 |
+
</tr>
|
| 279 |
+
<tr>
|
| 280 |
+
<td align="center" colspan=6><strong>Vision</strong></td>
|
| 281 |
+
</tr>
|
| 282 |
+
<tr>
|
| 283 |
+
<td align="center" style="vertical-align: middle">MMMU-Pro</td>
|
| 284 |
+
<td align="center" style="vertical-align: middle">79.4</td>
|
| 285 |
+
<td align="center" style="vertical-align: middle">81.2</td>
|
| 286 |
+
<td align="center" style="vertical-align: middle">73.9</td>
|
| 287 |
+
<td align="center" style="vertical-align: middle">83.0*</td>
|
| 288 |
+
<td align="center" style="vertical-align: middle">78.5</td>
|
| 289 |
+
</tr>
|
| 290 |
+
<tr>
|
| 291 |
+
<td align="center" style="vertical-align: middle">MMMU-Pro (w/ python)</td>
|
| 292 |
+
<td align="center" style="vertical-align: middle">80.1</td>
|
| 293 |
+
<td align="center" style="vertical-align: middle">82.1</td>
|
| 294 |
+
<td align="center" style="vertical-align: middle">77.3</td>
|
| 295 |
+
<td align="center" style="vertical-align: middle">85.3*</td>
|
| 296 |
+
<td align="center" style="vertical-align: middle">77.7</td>
|
| 297 |
+
</tr>
|
| 298 |
+
<tr>
|
| 299 |
+
<td align="center" style="vertical-align: middle">CharXiv (RQ)</td>
|
| 300 |
+
<td align="center" style="vertical-align: middle">80.4</td>
|
| 301 |
+
<td align="center" style="vertical-align: middle">82.8*</td>
|
| 302 |
+
<td align="center" style="vertical-align: middle">69.1</td>
|
| 303 |
+
<td align="center" style="vertical-align: middle">80.2*</td>
|
| 304 |
+
<td align="center" style="vertical-align: middle">77.5</td>
|
| 305 |
+
</tr>
|
| 306 |
+
<tr>
|
| 307 |
+
<td align="center" style="vertical-align: middle">CharXiv (RQ) (w/ python)</td>
|
| 308 |
+
<td align="center" style="vertical-align: middle">86.7</td>
|
| 309 |
+
<td align="center" style="vertical-align: middle">90.0*</td>
|
| 310 |
+
<td align="center" style="vertical-align: middle">84.7</td>
|
| 311 |
+
<td align="center" style="vertical-align: middle">89.9*</td>
|
| 312 |
+
<td align="center" style="vertical-align: middle">78.7</td>
|
| 313 |
+
</tr>
|
| 314 |
+
<tr>
|
| 315 |
+
<td align="center" style="vertical-align: middle">MathVision</td>
|
| 316 |
+
<td align="center" style="vertical-align: middle">87.4</td>
|
| 317 |
+
<td align="center" style="vertical-align: middle">92.0*</td>
|
| 318 |
+
<td align="center" style="vertical-align: middle">71.2*</td>
|
| 319 |
+
<td align="center" style="vertical-align: middle">89.8*</td>
|
| 320 |
+
<td align="center" style="vertical-align: middle">84.2</td>
|
| 321 |
+
</tr>
|
| 322 |
+
<tr>
|
| 323 |
+
<td align="center" style="vertical-align: middle">MathVision (w/ python)</td>
|
| 324 |
+
<td align="center" style="vertical-align: middle">93.2</td>
|
| 325 |
+
<td align="center" style="vertical-align: middle">96.1*</td>
|
| 326 |
+
<td align="center" style="vertical-align: middle">84.6*</td>
|
| 327 |
+
<td align="center" style="vertical-align: middle">95.7*</td>
|
| 328 |
+
<td align="center" style="vertical-align: middle">85.0</td>
|
| 329 |
+
</tr>
|
| 330 |
+
<tr>
|
| 331 |
+
<td align="center" style="vertical-align: middle">BabyVision</td>
|
| 332 |
+
<td align="center" style="vertical-align: middle">39.8</td>
|
| 333 |
+
<td align="center" style="vertical-align: middle">49.7</td>
|
| 334 |
+
<td align="center" style="vertical-align: middle">14.8</td>
|
| 335 |
+
<td align="center" style="vertical-align: middle">51.6</td>
|
| 336 |
+
<td align="center" style="vertical-align: middle">36.5</td>
|
| 337 |
+
</tr>
|
| 338 |
+
<tr>
|
| 339 |
+
<td align="center" style="vertical-align: middle">BabyVision (w/ python)</td>
|
| 340 |
+
<td align="center" style="vertical-align: middle">68.5</td>
|
| 341 |
+
<td align="center" style="vertical-align: middle">80.2*</td>
|
| 342 |
+
<td align="center" style="vertical-align: middle">38.4*</td>
|
| 343 |
+
<td align="center" style="vertical-align: middle">68.3*</td>
|
| 344 |
+
<td align="center" style="vertical-align: middle">40.5</td>
|
| 345 |
+
</tr>
|
| 346 |
+
<tr>
|
| 347 |
+
<td align="center" style="vertical-align: middle">V* (w/ python)</td>
|
| 348 |
+
<td align="center" style="vertical-align: middle">96.9</td>
|
| 349 |
+
<td align="center" style="vertical-align: middle">98.4*</td>
|
| 350 |
+
<td align="center" style="vertical-align: middle">86.4*</td>
|
| 351 |
+
<td align="center" style="vertical-align: middle">96.9*</td>
|
| 352 |
+
<td align="center" style="vertical-align: middle">86.9</td>
|
| 353 |
+
</tr>
|
| 354 |
+
</tbody>
|
| 355 |
+
</table>
|
| 356 |
+
</div>
|
| 357 |
+
|
| 358 |
+
<details>
|
| 359 |
+
<summary><b>Footnotes</b></summary>
|
| 360 |
+
|
| 361 |
+
1. **General Testing Details**
|
| 362 |
+
- We report results for Kimi K2.6 and Kimi K2.5 with thinking mode enabled, Claude Opus 4.6 with max effort, GPT-5.4 with xhigh reasoning effort, and Gemini 3.1 Pro with a high thinking level.
|
| 363 |
+
- Unless otherwise specified, all Kimi K2.6 experiments were conducted with temperature = 1.0, top-p = 1.0, and a context length of 262,144 tokens.
|
| 364 |
+
- Benchmarks without publicly available scores were re-evaluated under the same conditions used for Kimi K2.6 and are marked with an asterisk (`*`). Except where noted with an asterisk, all other results are cited from official reports.
|
| 365 |
+
2. **Reasoning Benchmarks**
|
| 366 |
+
- IMO-AnswerBench scores for GPT-5.4 and Claude 4.6 were obtained from [z.ai/blog/glm-5.1](https://z.ai/blog/glm-5.1).
|
| 367 |
+
- Humanity's Last Exam (HLE) and other reasoning tasks were evaluated with a maximum generation length of 98,304 tokens. By default, we report results on the HLE full set. For the text-only subset, Kimi K2.6 achieves 36.4% accuracy without tools and 55.5% with tools.
|
| 368 |
+
3. **Tool-Augmented / Agentic Tasks**
|
| 369 |
+
- Kimi K2.6 was equipped with search, code-interpreter, and web-browsing tools for HLE with tools, BrowseComp, DeepSearchQA, and WideSearch.
|
| 370 |
+
- For HLE-Full with tools, the maximum generation length is 262,144 tokens with a per-step limit of 49,152 tokens. We employ a simple context management strategy: once the context window exceeds the threshold, only the most recent round of tool-related messages is retained.
|
| 371 |
+
- For BrowseComp, we report scores obtained with context management using the same discard-all strategy as Kimi K2.5 and DeepSeek-V3.2.
|
| 372 |
+
- For DeepSearchQA, no context management was applied to Kimi K2.6 tests, and tasks exceeding the supported context length were directly counted as failed. Scores for Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro on DeepSearchQA are cited from the [Claude Opus 4.7 System Card](https://cdn.sanity.io/files/4zrzovbb/website/037f06850df7fbe871e206dad004c3db5fd50340.pdf).
|
| 373 |
+
- For WideSearch, we report results under the "hide tool result" context management setting. Once the context window exceeds the threshold, only the most recent round of tool-related messages is retained.
|
| 374 |
+
- The test system prompts are identical to those used in the [Kimi K2.5 technical report](https://arxiv.org/pdf/2602.02276).
|
| 375 |
+
- Claw Eval was conducted using version 1.1 with max-tokens-per-step = 16384.
|
| 376 |
+
- For APEX-Agents, we evaluate 452 tasks from the public 480-task release, as done by [Artificial Analysis](https://artificialanalysis.ai/evaluations/apex-agents-aa) (excluding Investment Banking Worlds 244 and 246, which have external runtime dependencies).
|
| 377 |
+
4. **Coding Tasks**
|
| 378 |
+
- Terminal-Bench 2.0 scores were obtained with the default agent framework (Terminus-2) and the provided JSON parser, operating in preserve thinking mode.
|
| 379 |
+
- For the SWE-Bench series of evaluations (including Verified, Multilingual, and Pro), we used an in-house evaluation framework adapted from SWE-agent. This framework includes a minimal set of tools—bash tool, createfile tool, insert tool, view tool, strreplace tool, and submit tool.
|
| 380 |
+
- All reported scores for coding tasks are averaged over 10 independent runs.
|
| 381 |
+
5. **Vision Benchmarks**
|
| 382 |
+
- Max-tokens = 98,304, averaged over three runs (avg@3).
|
| 383 |
+
- Settings with the Python tool use max-tokens-per-step = 64k and max-steps = 50 for multi-step reasoning.
|
| 384 |
+
- MMMU-Pro follows the official protocol, preserving input order and prepending images.
|
| 385 |
+
|
| 386 |
+
</details>
|
| 387 |
+
|
| 388 |
+
|
| 389 |
+
## 4. Native INT4 Quantization
|
| 390 |
+
Kimi-K2.6 adopts the same native int4 quantization method as [Kimi-K2-Thinking](https://huggingface.co/moonshotai/Kimi-K2-Thinking#4-native-int4-quantization).
|
| 391 |
+
|
| 392 |
+
## 5. Deployment
|
| 393 |
+
|
| 394 |
+
> [!Note]
|
| 395 |
+
> You can access Kimi-K2.6's API on https://platform.moonshot.ai and we provide OpenAI/Anthropic-compatible API for you. To verify the deployment is correct, we also provide the [Kimi Vendor Verifier](https://kimi.com/blog/kimi-vendor-verifier.html).
|
| 396 |
+
Currently, Kimi-K2.6 is recommended to run on the following inference engines:
|
| 397 |
+
* vLLM
|
| 398 |
+
* SGLang
|
| 399 |
+
* KTransformers
|
| 400 |
+
|
| 401 |
+
Kimi-K2.6 has the same architecture as Kimi-K2.5, and the deployment method can be directly reused.
|
| 402 |
+
|
| 403 |
+
The minimum version requirement for `transformers` is `4.57.1`.
|
| 404 |
+
|
| 405 |
+
Deployment examples can be found in the [Model Deployment Guide](docs/deploy_guidance.md).
|
| 406 |
+
|
| 407 |
+
|
| 408 |
+
---
|
| 409 |
+
## 6. Model Usage
|
| 410 |
+
|
| 411 |
+
The usage demos below demonstrate how to call our official API.
|
| 412 |
+
|
| 413 |
+
For third-party APIs deployed with vLLM or SGLang, please note that:
|
| 414 |
+
> [!Note]
|
| 415 |
+
> - Chat with video content is an experimental feature and is only supported in our official API for now.
|
| 416 |
+
>
|
| 417 |
+
> - The recommended `temperature` will be `1.0` for Thinking mode and `0.6` for Instant mode.
|
| 418 |
+
>
|
| 419 |
+
> - The recommended `top_p` is `0.95`.
|
| 420 |
+
>
|
| 421 |
+
> - To use instant mode, you need to pass `{'chat_template_kwargs': {"thinking": False}}` in `extra_body`.
|
| 422 |
+
|
| 423 |
+
### Chat Completion
|
| 424 |
+
|
| 425 |
+
This is a simple chat completion script that shows how to call the K2.6 API in Thinking and Instant modes.
|
| 426 |
+
|
| 427 |
+
```python
|
| 428 |
+
import openai
|
| 429 |
+
import base64
|
| 430 |
+
import requests
|
| 431 |
+
def simple_chat(client: openai.OpenAI, model_name: str):
|
| 432 |
+
messages = [
|
| 433 |
+
{'role': 'system', 'content': 'You are Kimi, an AI assistant created by Moonshot AI.'},
|
| 434 |
+
{
|
| 435 |
+
'role': 'user',
|
| 436 |
+
'content': [
|
| 437 |
+
{'type': 'text', 'text': 'which one is bigger, 9.11 or 9.9? think carefully.'}
|
| 438 |
+
],
|
| 439 |
+
},
|
| 440 |
+
]
|
| 441 |
+
response = client.chat.completions.create(
|
| 442 |
+
model=model_name, messages=messages, stream=False, max_tokens=4096
|
| 443 |
+
)
|
| 444 |
+
print('====== Below is reasoning content in Thinking Mode ======')
|
| 445 |
+
print(f'reasoning content: {response.choices[0].message.reasoning}')
|
| 446 |
+
print('====== Below is response in Thinking Mode ======')
|
| 447 |
+
print(f'response: {response.choices[0].message.content}')
|
| 448 |
+
|
| 449 |
+
# To use instant mode, pass {"thinking": {"type": "disabled"}}
|
| 450 |
+
response = client.chat.completions.create(
|
| 451 |
+
model=model_name,
|
| 452 |
+
messages=messages,
|
| 453 |
+
stream=False,
|
| 454 |
+
max_tokens=4096,
|
| 455 |
+
extra_body={'thinking': {'type': 'disabled'}}, # this is for official API
|
| 456 |
+
# extra_body= {'chat_template_kwargs': {"thinking": False}} # this is for vLLM/SGLang
|
| 457 |
+
)
|
| 458 |
+
print('====== Below is response in Instant Mode ======')
|
| 459 |
+
print(f'response: {response.choices[0].message.content}')
|
| 460 |
+
```
|
| 461 |
+
|
| 462 |
+
|
| 463 |
+
### Chat Completion with visual content
|
| 464 |
+
|
| 465 |
+
K2.6 supports Image and Video input.
|
| 466 |
+
|
| 467 |
+
The following example demonstrates how to call K2.6 API with image input:
|
| 468 |
+
|
| 469 |
+
```python
|
| 470 |
+
import openai
|
| 471 |
+
import base64
|
| 472 |
+
import requests
|
| 473 |
+
|
| 474 |
+
def chat_with_image(client: openai.OpenAI, model_name: str):
|
| 475 |
+
url = 'https://huggingface.co/moonshotai/Kimi-K2.6/resolve/main/figures/kimi-logo.png'
|
| 476 |
+
image_base64 = base64.b64encode(requests.get(url).content).decode()
|
| 477 |
+
messages = [
|
| 478 |
+
{
|
| 479 |
+
'role': 'user',
|
| 480 |
+
'content': [
|
| 481 |
+
{'type': 'text', 'text': 'Describe this image in detail.'},
|
| 482 |
+
{
|
| 483 |
+
'type': 'image_url',
|
| 484 |
+
'image_url': {'url': f'data:image/png;base64,{image_base64}'},
|
| 485 |
+
},
|
| 486 |
+
],
|
| 487 |
+
}
|
| 488 |
+
]
|
| 489 |
+
|
| 490 |
+
response = client.chat.completions.create(
|
| 491 |
+
model=model_name, messages=messages, stream=False, max_tokens=8192
|
| 492 |
+
)
|
| 493 |
+
print('====== Below is reasoning content in Thinking Mode ======')
|
| 494 |
+
print(f'reasoning content: {response.choices[0].message.reasoning}')
|
| 495 |
+
print('====== Below is response in Thinking Mode ======')
|
| 496 |
+
print(f'response: {response.choices[0].message.content}')
|
| 497 |
+
|
| 498 |
+
# Also supports instant mode if you pass {"thinking": {"type": "disabled"}}
|
| 499 |
+
response = client.chat.completions.create(
|
| 500 |
+
model=model_name,
|
| 501 |
+
messages=messages,
|
| 502 |
+
stream=False,
|
| 503 |
+
max_tokens=4096,
|
| 504 |
+
extra_body={'thinking': {'type': 'disabled'}}, # this is for official API
|
| 505 |
+
# extra_body= {'chat_template_kwargs': {"thinking": False}} # this is for vLLM/SGLang
|
| 506 |
+
)
|
| 507 |
+
print('====== Below is response in Instant Mode ======')
|
| 508 |
+
print(f'response: {response.choices[0].message.content}')
|
| 509 |
+
|
| 510 |
+
return response.choices[0].message.content
|
| 511 |
+
```
|
| 512 |
+
|
| 513 |
+
The following example demonstrates how to call K2.6 API with video input:
|
| 514 |
+
|
| 515 |
+
```python
|
| 516 |
+
import openai
|
| 517 |
+
import base64
|
| 518 |
+
import requests
|
| 519 |
+
|
| 520 |
+
def chat_with_video(client: openai.OpenAI, model_name: str):
|
| 521 |
+
url = 'https://huggingface.co/moonshotai/Kimi-K2.6/resolve/main/figures/demo_video.mp4'
|
| 522 |
+
video_base64 = base64.b64encode(requests.get(url).content).decode()
|
| 523 |
+
messages = [
|
| 524 |
+
{
|
| 525 |
+
"role": "user",
|
| 526 |
+
"content": [
|
| 527 |
+
{"type": "text","text": "Describe the video in detail."},
|
| 528 |
+
{
|
| 529 |
+
"type": "video_url",
|
| 530 |
+
"video_url": {"url": f"data:video/mp4;base64,{video_base64}"},
|
| 531 |
+
},
|
| 532 |
+
],
|
| 533 |
+
}
|
| 534 |
+
]
|
| 535 |
+
|
| 536 |
+
response = client.chat.completions.create(model=model_name, messages=messages)
|
| 537 |
+
print('====== Below is reasoning content in Thinking Mode ======')
|
| 538 |
+
print(f'reasoning content: {response.choices[0].message.reasoning}')
|
| 539 |
+
print('====== Below is response in Thinking Mode ======')
|
| 540 |
+
print(f'response: {response.choices[0].message.content}')
|
| 541 |
+
|
| 542 |
+
# Also supports instant mode if you pass {"thinking": {"type": "disabled"}}
|
| 543 |
+
response = client.chat.completions.create(
|
| 544 |
+
model=model_name,
|
| 545 |
+
messages=messages,
|
| 546 |
+
stream=False,
|
| 547 |
+
max_tokens=4096,
|
| 548 |
+
extra_body={'thinking': {'type': 'disabled'}}, # this is for official API
|
| 549 |
+
# extra_body= {'chat_template_kwargs': {"thinking": False}} # this is for vLLM/SGLang
|
| 550 |
+
)
|
| 551 |
+
print('====== Below is response in Instant Mode ======')
|
| 552 |
+
print(f'response: {response.choices[0].message.content}')
|
| 553 |
+
return response.choices[0].message.content
|
| 554 |
+
```
|
| 555 |
+
### Preserve Thinking
|
| 556 |
+
Kimi K2.6 supports `preserve_thinking` mode, which enables it to retain full reasoning content across multi-turn interactions and enhances performance in coding agent scenarios.
|
| 557 |
+
|
| 558 |
+
This feature is disabled by default. The following example demonstrates how to call K2.6 API in `preserve_thinking` mode:
|
| 559 |
+
|
| 560 |
+
```python
|
| 561 |
+
def chat_with_preserve_thinking(client: openai.OpenAI, model_name: str):
|
| 562 |
+
messages = [
|
| 563 |
+
{
|
| 564 |
+
"role": "user",
|
| 565 |
+
"content": "Tell me three random numbers."
|
| 566 |
+
},
|
| 567 |
+
{
|
| 568 |
+
"role": "assistant",
|
| 569 |
+
"reasoning_content": "I'll start by listing five numbers: 473, 921, 235, 215, 222, and I'll tell you the first three.",
|
| 570 |
+
"content": "473, 921, 235"
|
| 571 |
+
},
|
| 572 |
+
{
|
| 573 |
+
"role": "user",
|
| 574 |
+
"content": "What are the other two numbers you have in mind?"
|
| 575 |
+
}
|
| 576 |
+
]
|
| 577 |
+
|
| 578 |
+
response = client.chat.completions.create(
|
| 579 |
+
model=model_name,
|
| 580 |
+
messages=messages,
|
| 581 |
+
stream=False,
|
| 582 |
+
max_tokens=4096,
|
| 583 |
+
extra_body={'thinking': {'type': 'enabled', 'keep': 'all'}}, # this is for official API
|
| 584 |
+
# extra_body={"chat_template_kwargs": {"thinking":True, "preserve_thinking": True}}, # this is for vLLM/SGLang
|
| 585 |
+
# We recommend enabling preserve_thinking only in think mode.
|
| 586 |
+
)
|
| 587 |
+
# the assistant should mention 215 and 222 that appear in the prior reasoning content
|
| 588 |
+
print(f"response: {response.choices[0].message.reasoning}")
|
| 589 |
+
return response.choices[0].message.content
|
| 590 |
+
|
| 591 |
+
```
|
| 592 |
+
|
| 593 |
+
### Interleaved Thinking and Multi-Step Tool Call
|
| 594 |
+
|
| 595 |
+
K2.6 shares the same design of Interleaved Thinking and Multi-Step Tool Call as K2 Thinking. For usage example, please refer to the [K2 Thinking documentation](https://platform.moonshot.ai/docs/guide/use-kimi-k2-thinking-model#complete-example).
|
| 596 |
+
|
| 597 |
+
### Coding Agent Framework
|
| 598 |
+
|
| 599 |
+
Kimi K2.6 works best with Kimi Code CLI as its agent framework — give it a try at https://www.kimi.com/code.
|
| 600 |
+
|
| 601 |
+
|
| 602 |
+
---
|
| 603 |
+
|
| 604 |
+
## 7. License
|
| 605 |
+
|
| 606 |
+
Both the code repository and the model weights are released under the [Modified MIT License](LICENSE).
|
| 607 |
+
|
| 608 |
+
---
|
| 609 |
+
|
| 610 |
+
## 8. Third Party Notices
|
| 611 |
+
|
| 612 |
+
See [THIRD PARTY NOTICES](THIRD_PARTY_NOTICES.md)
|
| 613 |
+
|
| 614 |
+
---
|
| 615 |
+
|
| 616 |
+
## 9. Contact Us
|
| 617 |
+
|
| 618 |
+
If you have any questions, please reach out at [support@moonshot.ai](mailto:support@moonshot.ai).
|
THIRD_PARTY_NOTICES.md
ADDED
|
@@ -0,0 +1,43 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# THIRD_PARTY_NOTICES
|
| 2 |
+
|
| 3 |
+
This file lists third-party software contained in Kimi-K2.6 along with their licenses, in compliance with the redistribution clauses of those licenses.
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## 1. DeepSeek-V3
|
| 8 |
+
|
| 9 |
+
Our model architecture is DeepSeek-V3-like. Some of the modeling code is copied from the source repository.
|
| 10 |
+
|
| 11 |
+
- **Source Repository**
|
| 12 |
+
https://huggingface.co/deepseek-ai/DeepSeek-V3
|
| 13 |
+
|
| 14 |
+
- **Files / Directories Used**
|
| 15 |
+
- configuration_deepseek.py
|
| 16 |
+
- modeling_deepseek.py
|
| 17 |
+
|
| 18 |
+
- **License Type**
|
| 19 |
+
MIT License
|
| 20 |
+
|
| 21 |
+
- **Copyright Notice**
|
| 22 |
+
Copyright (c) 2023 DeepSeek
|
| 23 |
+
|
| 24 |
+
- **Full License Text**
|
| 25 |
+
```
|
| 26 |
+
MIT License
|
| 27 |
+
Copyright (c) 2023 DeepSeek
|
| 28 |
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
| 29 |
+
of this software and associated documentation files (the "Software"), to deal
|
| 30 |
+
in the Software without restriction, including without limitation the rights
|
| 31 |
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
| 32 |
+
copies of the Software, and to permit persons to whom the Software is
|
| 33 |
+
furnished to do so, subject to the following conditions:
|
| 34 |
+
The above copyright notice and this permission notice shall be included in all
|
| 35 |
+
copies or substantial portions of the Software.
|
| 36 |
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
| 37 |
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
| 38 |
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
| 39 |
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
| 40 |
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
| 41 |
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
| 42 |
+
SOFTWARE.
|
| 43 |
+
```
|
chat_template.jinja
ADDED
|
@@ -0,0 +1,112 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{%- macro render_content(msg) -%}
|
| 2 |
+
{%- set c = msg.get('content') -%}
|
| 3 |
+
{%- if c is string -%}
|
| 4 |
+
{{ c }}
|
| 5 |
+
{%- elif c is not none -%}
|
| 6 |
+
{% for content in c -%}
|
| 7 |
+
{% if content['type'] == 'image' or content['type'] == 'image_url' -%}
|
| 8 |
+
<|media_begin|>image<|media_content|><|media_pad|><|media_end|>
|
| 9 |
+
{% elif content['type'] == 'video' or content['type']== 'video_url'-%}
|
| 10 |
+
<|kimi_k25_video_placeholder|>
|
| 11 |
+
{% else -%}
|
| 12 |
+
{{ content['text'] }}
|
| 13 |
+
{%- endif -%}
|
| 14 |
+
{%- endfor -%}
|
| 15 |
+
{%- endif -%}
|
| 16 |
+
{%- endmacro -%}
|
| 17 |
+
|
| 18 |
+
{% macro set_roles(message) -%}
|
| 19 |
+
{%- set role_name = message.get('name') or message['role'] -%}
|
| 20 |
+
{%- if message['role'] == 'user' -%}
|
| 21 |
+
<|im_user|>{{role_name}}<|im_middle|>
|
| 22 |
+
{%- elif message['role'] == 'assistant' -%}
|
| 23 |
+
<|im_assistant|>{{role_name}}<|im_middle|>
|
| 24 |
+
{%- else -%}
|
| 25 |
+
<|im_system|>{{role_name}}<|im_middle|>
|
| 26 |
+
{%- endif -%}
|
| 27 |
+
{%- endmacro -%}
|
| 28 |
+
|
| 29 |
+
|
| 30 |
+
{%- macro render_toolcalls(message) -%}
|
| 31 |
+
<|tool_calls_section_begin|>
|
| 32 |
+
{%- for tool_call in message['tool_calls'] -%}
|
| 33 |
+
{%- set formatted_id = tool_call['id'] -%}
|
| 34 |
+
<|tool_call_begin|>{{ formatted_id }}<|tool_call_argument_begin|>{% if tool_call['function']['arguments'] is string %}{{ tool_call['function']['arguments'] }}{% else %}{{ tool_call['function']['arguments'] | tojson }}{% endif %}<|tool_call_end|>
|
| 35 |
+
{%- endfor -%}
|
| 36 |
+
<|tool_calls_section_end|>
|
| 37 |
+
{%- endmacro -%}
|
| 38 |
+
|
| 39 |
+
|
| 40 |
+
{%- set preserve_thinking = preserve_thinking | default(false) -%}
|
| 41 |
+
{# Find last non-tool-call assistant message. If preserve_thinking, keep -1 so hist is empty and all msgs use suffix (retain reasoning). #}
|
| 42 |
+
{%- set ns = namespace(last_non_tool_call_assistant_msg=-1) -%}
|
| 43 |
+
{%- if not preserve_thinking -%}
|
| 44 |
+
{%- for idx in range(messages|length-1, -1, -1) -%}
|
| 45 |
+
{%- if messages[idx]['role'] == 'assistant' and not messages[idx].get('tool_calls') -%}
|
| 46 |
+
{%- set ns.last_non_tool_call_assistant_msg = idx -%}
|
| 47 |
+
{%- break -%}
|
| 48 |
+
{%- endif -%}
|
| 49 |
+
{%- endfor -%}
|
| 50 |
+
{%- endif -%}
|
| 51 |
+
|
| 52 |
+
{# split all messages into history & suffix; reasoning_content in suffix should be preserved. #}
|
| 53 |
+
{%- set hist_msgs = messages[:ns.last_non_tool_call_assistant_msg+1] -%}
|
| 54 |
+
{%- set suffix_msgs = messages[ns.last_non_tool_call_assistant_msg+1:] -%}
|
| 55 |
+
|
| 56 |
+
{%- if tools -%}
|
| 57 |
+
{%- if tools_ts_str -%}
|
| 58 |
+
<|im_system|>tool_declare<|im_middle|>{{ tools_ts_str }}<|im_end|>
|
| 59 |
+
{%- else -%}
|
| 60 |
+
<|im_system|>tool_declare<|im_middle|>{{ tools | tojson(separators=(',', ':')) }}<|im_end|>
|
| 61 |
+
{%- endif -%}
|
| 62 |
+
{%- endif -%}
|
| 63 |
+
|
| 64 |
+
|
| 65 |
+
{%- for message in hist_msgs -%}
|
| 66 |
+
{{set_roles(message)}}
|
| 67 |
+
{%- if message['role'] == 'assistant' -%}
|
| 68 |
+
<think></think>{{render_content(message)}}
|
| 69 |
+
{%- if message.get('tool_calls') -%}
|
| 70 |
+
{{render_toolcalls(message)}}
|
| 71 |
+
{%- endif -%}
|
| 72 |
+
{%- elif message['role'] == 'tool' -%}
|
| 73 |
+
{%- set tool_call_id = message.tool_call_id -%}
|
| 74 |
+
## Return of {{ tool_call_id }}
|
| 75 |
+
{{render_content(message)}}
|
| 76 |
+
{%- elif message['content'] is not none -%}
|
| 77 |
+
{{render_content(message)}}
|
| 78 |
+
{%- endif -%}
|
| 79 |
+
<|im_end|>
|
| 80 |
+
{%- endfor -%}
|
| 81 |
+
|
| 82 |
+
{%- for message in suffix_msgs -%}
|
| 83 |
+
{{set_roles(message)}}
|
| 84 |
+
{%- if message['role'] == 'assistant' -%}
|
| 85 |
+
{%- if thinking is defined and thinking is false and preserve_thinking is false -%}
|
| 86 |
+
<think></think>{{render_content(message)}}
|
| 87 |
+
{%- else -%}
|
| 88 |
+
{%- set rc = message.get('reasoning', message.get('reasoning_content', '')) -%}
|
| 89 |
+
<think>{{rc}}</think>{{render_content(message)}}
|
| 90 |
+
{%- endif -%}
|
| 91 |
+
{%- if message.get('tool_calls') -%}
|
| 92 |
+
{{render_toolcalls(message)}}
|
| 93 |
+
{%- endif -%}
|
| 94 |
+
{%- elif message['role'] == 'tool' -%}
|
| 95 |
+
{%- set tool_call_id = message.tool_call_id -%}
|
| 96 |
+
## Return of {{ tool_call_id }}
|
| 97 |
+
{{render_content(message)}}
|
| 98 |
+
{%- elif message['content'] is not none -%}
|
| 99 |
+
{{render_content(message)}}
|
| 100 |
+
{%- endif -%}
|
| 101 |
+
<|im_end|>
|
| 102 |
+
{%- endfor -%}
|
| 103 |
+
|
| 104 |
+
|
| 105 |
+
{%- if add_generation_prompt -%}
|
| 106 |
+
<|im_assistant|>assistant<|im_middle|>
|
| 107 |
+
{%- if thinking is defined and thinking is false -%}
|
| 108 |
+
<think></think>
|
| 109 |
+
{%- else -%}
|
| 110 |
+
<think>
|
| 111 |
+
{%- endif -%}
|
| 112 |
+
{%- endif -%}
|
config.json
ADDED
|
@@ -0,0 +1,192 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"architectures": [
|
| 3 |
+
"KimiK25ForConditionalGeneration"
|
| 4 |
+
],
|
| 5 |
+
"auto_map": {
|
| 6 |
+
"AutoConfig": "configuration_kimi_k25.KimiK25Config",
|
| 7 |
+
"AutoModel": "modeling_kimi_k25.KimiK25ForConditionalGeneration",
|
| 8 |
+
"AutoModelForCausalLM": "modeling_kimi_k25.KimiK25ForConditionalGeneration"
|
| 9 |
+
},
|
| 10 |
+
"bos_token_id": 163584,
|
| 11 |
+
"dtype": "bfloat16",
|
| 12 |
+
"eos_token_id": 163586,
|
| 13 |
+
"ignore_index": -100,
|
| 14 |
+
"media_placeholder_token_id": 163605,
|
| 15 |
+
"model_type": "kimi_k25",
|
| 16 |
+
"pad_token_id": 163839,
|
| 17 |
+
"text_config": {
|
| 18 |
+
"_name_or_path": "",
|
| 19 |
+
"add_cross_attention": false,
|
| 20 |
+
"architectures": [
|
| 21 |
+
"DeepseekV3ForCausalLM"
|
| 22 |
+
],
|
| 23 |
+
"attention_bias": false,
|
| 24 |
+
"attention_dropout": 0.0,
|
| 25 |
+
"auto_map": {
|
| 26 |
+
"AutoConfig": "configuration_deepseek.DeepseekV3Config",
|
| 27 |
+
"AutoModel": "modeling_deepseek.DeepseekV3Model",
|
| 28 |
+
"AutoModelForCausalLM": "modeling_deepseek.DeepseekV3ForCausalLM"
|
| 29 |
+
},
|
| 30 |
+
"aux_loss_alpha": 0.001,
|
| 31 |
+
"bad_words_ids": null,
|
| 32 |
+
"begin_suppress_tokens": null,
|
| 33 |
+
"bos_token_id": 163584,
|
| 34 |
+
"chunk_size_feed_forward": 0,
|
| 35 |
+
"cross_attention_hidden_size": null,
|
| 36 |
+
"decoder_start_token_id": null,
|
| 37 |
+
"diversity_penalty": 0.0,
|
| 38 |
+
"do_sample": false,
|
| 39 |
+
"dtype": "bfloat16",
|
| 40 |
+
"early_stopping": false,
|
| 41 |
+
"encoder_no_repeat_ngram_size": 0,
|
| 42 |
+
"eos_token_id": 163586,
|
| 43 |
+
"ep_size": 1,
|
| 44 |
+
"exponential_decay_length_penalty": null,
|
| 45 |
+
"finetuning_task": null,
|
| 46 |
+
"first_k_dense_replace": 1,
|
| 47 |
+
"forced_bos_token_id": null,
|
| 48 |
+
"forced_eos_token_id": null,
|
| 49 |
+
"hidden_act": "silu",
|
| 50 |
+
"hidden_size": 7168,
|
| 51 |
+
"id2label": {
|
| 52 |
+
"0": "LABEL_0",
|
| 53 |
+
"1": "LABEL_1"
|
| 54 |
+
},
|
| 55 |
+
"initializer_range": 0.02,
|
| 56 |
+
"intermediate_size": 18432,
|
| 57 |
+
"is_decoder": false,
|
| 58 |
+
"is_encoder_decoder": false,
|
| 59 |
+
"kv_lora_rank": 512,
|
| 60 |
+
"label2id": {
|
| 61 |
+
"LABEL_0": 0,
|
| 62 |
+
"LABEL_1": 1
|
| 63 |
+
},
|
| 64 |
+
"length_penalty": 1.0,
|
| 65 |
+
"max_length": 20,
|
| 66 |
+
"max_position_embeddings": 262144,
|
| 67 |
+
"min_length": 0,
|
| 68 |
+
"model_type": "kimi_k2",
|
| 69 |
+
"moe_intermediate_size": 2048,
|
| 70 |
+
"moe_layer_freq": 1,
|
| 71 |
+
"n_group": 1,
|
| 72 |
+
"n_routed_experts": 384,
|
| 73 |
+
"n_shared_experts": 1,
|
| 74 |
+
"no_repeat_ngram_size": 0,
|
| 75 |
+
"norm_topk_prob": true,
|
| 76 |
+
"num_attention_heads": 64,
|
| 77 |
+
"num_beam_groups": 1,
|
| 78 |
+
"num_beams": 1,
|
| 79 |
+
"num_experts_per_tok": 8,
|
| 80 |
+
"num_hidden_layers": 61,
|
| 81 |
+
"num_key_value_heads": 64,
|
| 82 |
+
"num_nextn_predict_layers": 0,
|
| 83 |
+
"num_return_sequences": 1,
|
| 84 |
+
"output_attentions": false,
|
| 85 |
+
"output_hidden_states": false,
|
| 86 |
+
"output_scores": false,
|
| 87 |
+
"pad_token_id": 163839,
|
| 88 |
+
"prefix": null,
|
| 89 |
+
"pretraining_tp": 1,
|
| 90 |
+
"problem_type": null,
|
| 91 |
+
"pruned_heads": {},
|
| 92 |
+
"q_lora_rank": 1536,
|
| 93 |
+
"qk_nope_head_dim": 128,
|
| 94 |
+
"qk_rope_head_dim": 64,
|
| 95 |
+
"quantization_config": {
|
| 96 |
+
"config_groups": {
|
| 97 |
+
"group_0": {
|
| 98 |
+
"input_activations": null,
|
| 99 |
+
"output_activations": null,
|
| 100 |
+
"targets": [
|
| 101 |
+
"Linear"
|
| 102 |
+
],
|
| 103 |
+
"weights": {
|
| 104 |
+
"actorder": null,
|
| 105 |
+
"block_structure": null,
|
| 106 |
+
"dynamic": false,
|
| 107 |
+
"group_size": 32,
|
| 108 |
+
"num_bits": 4,
|
| 109 |
+
"observer": "minmax",
|
| 110 |
+
"observer_kwargs": {},
|
| 111 |
+
"strategy": "group",
|
| 112 |
+
"symmetric": true,
|
| 113 |
+
"type": "int"
|
| 114 |
+
}
|
| 115 |
+
}
|
| 116 |
+
},
|
| 117 |
+
"format": "pack-quantized",
|
| 118 |
+
"ignore": [
|
| 119 |
+
"lm_head",
|
| 120 |
+
"re:.*self_attn.*",
|
| 121 |
+
"re:.*shared_experts.*",
|
| 122 |
+
"re:.*mlp\\.(gate|up|gate_up|down)_proj.*"
|
| 123 |
+
],
|
| 124 |
+
"kv_cache_scheme": null,
|
| 125 |
+
"quant_method": "compressed-tensors",
|
| 126 |
+
"quantization_status": "compressed"
|
| 127 |
+
},
|
| 128 |
+
"remove_invalid_values": false,
|
| 129 |
+
"repetition_penalty": 1.0,
|
| 130 |
+
"return_dict": true,
|
| 131 |
+
"return_dict_in_generate": false,
|
| 132 |
+
"rms_norm_eps": 1e-05,
|
| 133 |
+
"rope_scaling": {
|
| 134 |
+
"beta_fast": 32.0,
|
| 135 |
+
"beta_slow": 1.0,
|
| 136 |
+
"factor": 64.0,
|
| 137 |
+
"mscale": 1.0,
|
| 138 |
+
"mscale_all_dim": 1.0,
|
| 139 |
+
"original_max_position_embeddings": 4096,
|
| 140 |
+
"type": "yarn"
|
| 141 |
+
},
|
| 142 |
+
"rope_theta": 50000.0,
|
| 143 |
+
"routed_scaling_factor": 2.827,
|
| 144 |
+
"scoring_func": "sigmoid",
|
| 145 |
+
"sep_token_id": null,
|
| 146 |
+
"seq_aux": true,
|
| 147 |
+
"suppress_tokens": null,
|
| 148 |
+
"task_specific_params": null,
|
| 149 |
+
"temperature": 1.0,
|
| 150 |
+
"tf_legacy_loss": false,
|
| 151 |
+
"tie_encoder_decoder": false,
|
| 152 |
+
"tie_word_embeddings": false,
|
| 153 |
+
"tokenizer_class": null,
|
| 154 |
+
"top_k": 50,
|
| 155 |
+
"top_p": 1.0,
|
| 156 |
+
"topk_group": 1,
|
| 157 |
+
"topk_method": "noaux_tc",
|
| 158 |
+
"torchscript": false,
|
| 159 |
+
"transformers_version": "4.56.2",
|
| 160 |
+
"typical_p": 1.0,
|
| 161 |
+
"use_bfloat16": false,
|
| 162 |
+
"use_cache": true,
|
| 163 |
+
"v_head_dim": 128,
|
| 164 |
+
"vocab_size": 163840
|
| 165 |
+
},
|
| 166 |
+
"tie_word_embeddings": false,
|
| 167 |
+
"use_unified_vision_chunk": true,
|
| 168 |
+
"video_placeholder": "<|kimi_k25_video_placeholder|>",
|
| 169 |
+
"vision_config": {
|
| 170 |
+
"_attn_implementation": "flash_attention_2",
|
| 171 |
+
"init_pos_emb_height": 64,
|
| 172 |
+
"init_pos_emb_time": 4,
|
| 173 |
+
"init_pos_emb_width": 64,
|
| 174 |
+
"merge_kernel_size": [
|
| 175 |
+
2,
|
| 176 |
+
2
|
| 177 |
+
],
|
| 178 |
+
"merge_type": "sd2_tpool",
|
| 179 |
+
"mm_hidden_size": 1152,
|
| 180 |
+
"mm_projector_type": "patchmerger",
|
| 181 |
+
"patch_size": 14,
|
| 182 |
+
"pos_emb_type": "divided_fixed",
|
| 183 |
+
"projector_hidden_act": "gelu",
|
| 184 |
+
"projector_ln_eps": 1e-05,
|
| 185 |
+
"text_hidden_size": 7168,
|
| 186 |
+
"video_attn_type": "spatial_temporal",
|
| 187 |
+
"vt_hidden_size": 1152,
|
| 188 |
+
"vt_intermediate_size": 4304,
|
| 189 |
+
"vt_num_attention_heads": 16,
|
| 190 |
+
"vt_num_hidden_layers": 27
|
| 191 |
+
}
|
| 192 |
+
}
|
configuration_deepseek.py
ADDED
|
@@ -0,0 +1,214 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Copy from https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/configuration_deepseek.py
|
| 2 |
+
|
| 3 |
+
from transformers.configuration_utils import PretrainedConfig
|
| 4 |
+
from transformers.utils import logging
|
| 5 |
+
|
| 6 |
+
logger = logging.get_logger(__name__)
|
| 7 |
+
|
| 8 |
+
DEEPSEEK_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
|
| 9 |
+
|
| 10 |
+
|
| 11 |
+
class DeepseekV3Config(PretrainedConfig):
|
| 12 |
+
r"""
|
| 13 |
+
This is the configuration class to store the configuration of a [`DeepseekV3Model`]. It is used to instantiate an DeepSeek
|
| 14 |
+
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
|
| 15 |
+
defaults will yield a similar configuration to that of the DeepSeek-V3.
|
| 16 |
+
|
| 17 |
+
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
| 18 |
+
documentation from [`PretrainedConfig`] for more information.
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
Args:
|
| 22 |
+
vocab_size (`int`, *optional*, defaults to 129280):
|
| 23 |
+
Vocabulary size of the Deep model. Defines the number of different tokens that can be represented by the
|
| 24 |
+
`inputs_ids` passed when calling [`DeepseekV3Model`]
|
| 25 |
+
hidden_size (`int`, *optional*, defaults to 4096):
|
| 26 |
+
Dimension of the hidden representations.
|
| 27 |
+
intermediate_size (`int`, *optional*, defaults to 11008):
|
| 28 |
+
Dimension of the MLP representations.
|
| 29 |
+
moe_intermediate_size (`int`, *optional*, defaults to 1407):
|
| 30 |
+
Dimension of the MoE representations.
|
| 31 |
+
num_hidden_layers (`int`, *optional*, defaults to 32):
|
| 32 |
+
Number of hidden layers in the Transformer decoder.
|
| 33 |
+
num_nextn_predict_layers (`int`, *optional*, defaults to 1):
|
| 34 |
+
Number of nextn predict layers in the DeepSeekV3 Model.
|
| 35 |
+
num_attention_heads (`int`, *optional*, defaults to 32):
|
| 36 |
+
Number of attention heads for each attention layer in the Transformer decoder.
|
| 37 |
+
n_shared_experts (`int`, *optional*, defaults to None):
|
| 38 |
+
Number of shared experts, None means dense model.
|
| 39 |
+
n_routed_experts (`int`, *optional*, defaults to None):
|
| 40 |
+
Number of routed experts, None means dense model.
|
| 41 |
+
routed_scaling_factor (`float`, *optional*, defaults to 1.0):
|
| 42 |
+
Scaling factor or routed experts.
|
| 43 |
+
topk_method (`str`, *optional*, defaults to `gready`):
|
| 44 |
+
Topk method used in routed gate.
|
| 45 |
+
n_group (`int`, *optional*, defaults to None):
|
| 46 |
+
Number of groups for routed experts.
|
| 47 |
+
topk_group (`int`, *optional*, defaults to None):
|
| 48 |
+
Number of selected groups for each token(for each token, ensuring the selected experts is only within `topk_group` groups).
|
| 49 |
+
num_experts_per_tok (`int`, *optional*, defaults to None):
|
| 50 |
+
Number of selected experts, None means dense model.
|
| 51 |
+
moe_layer_freq (`int`, *optional*, defaults to 1):
|
| 52 |
+
The frequency of the MoE layer: one expert layer for every `moe_layer_freq - 1` dense layers.
|
| 53 |
+
first_k_dense_replace (`int`, *optional*, defaults to 0):
|
| 54 |
+
Number of dense layers in shallow layers(embed->dense->dense->...->dense->moe->moe...->lm_head).
|
| 55 |
+
\--k dense layers--/
|
| 56 |
+
norm_topk_prob (`bool`, *optional*, defaults to False):
|
| 57 |
+
Whether to normalize the weights of the routed experts.
|
| 58 |
+
scoring_func (`str`, *optional*, defaults to 'softmax'):
|
| 59 |
+
Method of computing expert weights.
|
| 60 |
+
aux_loss_alpha (`float`, *optional*, defaults to 0.001):
|
| 61 |
+
Auxiliary loss weight coefficient.
|
| 62 |
+
seq_aux = (`bool`, *optional*, defaults to True):
|
| 63 |
+
Whether to compute the auxiliary loss for each individual sample.
|
| 64 |
+
num_key_value_heads (`int`, *optional*):
|
| 65 |
+
This is the number of key_value heads that should be used to implement Grouped Query Attention. If
|
| 66 |
+
`num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
|
| 67 |
+
`num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When
|
| 68 |
+
converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
|
| 69 |
+
by meanpooling all the original heads within that group. For more details checkout [this
|
| 70 |
+
paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
|
| 71 |
+
`num_attention_heads`.
|
| 72 |
+
hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
|
| 73 |
+
The non-linear activation function (function or string) in the decoder.
|
| 74 |
+
max_position_embeddings (`int`, *optional*, defaults to 2048):
|
| 75 |
+
The maximum sequence length that this model might ever be used with.
|
| 76 |
+
initializer_range (`float`, *optional*, defaults to 0.02):
|
| 77 |
+
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
|
| 78 |
+
rms_norm_eps (`float`, *optional*, defaults to 1e-06):
|
| 79 |
+
The epsilon used by the rms normalization layers.
|
| 80 |
+
use_cache (`bool`, *optional*, defaults to `True`):
|
| 81 |
+
Whether or not the model should return the last key/values attentions (not used by all models). Only
|
| 82 |
+
relevant if `config.is_decoder=True`.
|
| 83 |
+
pad_token_id (`int`, *optional*):
|
| 84 |
+
Padding token id.
|
| 85 |
+
bos_token_id (`int`, *optional*, defaults to 1):
|
| 86 |
+
Beginning of stream token id.
|
| 87 |
+
eos_token_id (`int`, *optional*, defaults to 2):
|
| 88 |
+
End of stream token id.
|
| 89 |
+
pretraining_tp (`int`, *optional*, defaults to 1):
|
| 90 |
+
Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
|
| 91 |
+
document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
|
| 92 |
+
necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
|
| 93 |
+
issue](https://github.com/pytorch/pytorch/issues/76232).
|
| 94 |
+
tie_word_embeddings (`bool`, *optional*, defaults to `False`):
|
| 95 |
+
Whether to tie weight embeddings
|
| 96 |
+
rope_theta (`float`, *optional*, defaults to 10000.0):
|
| 97 |
+
The base period of the RoPE embeddings.
|
| 98 |
+
rope_scaling (`Dict`, *optional*):
|
| 99 |
+
Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
|
| 100 |
+
strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
|
| 101 |
+
`{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
|
| 102 |
+
`max_position_embeddings` to the expected new maximum.
|
| 103 |
+
attention_bias (`bool`, defaults to `False`, *optional*, defaults to `False`):
|
| 104 |
+
Whether to use a bias in the query, key, value and output projection layers during self-attention.
|
| 105 |
+
attention_dropout (`float`, *optional*, defaults to 0.0):
|
| 106 |
+
The dropout ratio for the attention probabilities.
|
| 107 |
+
|
| 108 |
+
```python
|
| 109 |
+
>>> from transformers import DeepseekV3Model, DeepseekV3Config
|
| 110 |
+
|
| 111 |
+
>>> # Initializing a Deepseek-V3 style configuration
|
| 112 |
+
>>> configuration = DeepseekV3Config()
|
| 113 |
+
|
| 114 |
+
>>> # Accessing the model configuration
|
| 115 |
+
>>> configuration = model.config
|
| 116 |
+
```"""
|
| 117 |
+
|
| 118 |
+
model_type = "deepseek_v3"
|
| 119 |
+
keys_to_ignore_at_inference = ["past_key_values"]
|
| 120 |
+
|
| 121 |
+
    def __init__(
        self,
        vocab_size=129280,
        hidden_size=7168,
        intermediate_size=18432,
        moe_intermediate_size=2048,
        num_hidden_layers=61,
        num_nextn_predict_layers=1,
        num_attention_heads=128,
        num_key_value_heads=128,
        n_shared_experts=1,
        n_routed_experts=256,
        ep_size=1,
        routed_scaling_factor=2.5,
        kv_lora_rank=512,
        q_lora_rank=1536,
        qk_rope_head_dim=64,
        v_head_dim=128,
        qk_nope_head_dim=128,
        topk_method='noaux_tc',
        n_group=8,
        topk_group=4,
        num_experts_per_tok=8,
        moe_layer_freq=1,
        first_k_dense_replace=3,
        norm_topk_prob=True,
        scoring_func='sigmoid',
        aux_loss_alpha=0.001,
        seq_aux=True,
        hidden_act="silu",
        max_position_embeddings=4096,
        initializer_range=0.02,
        rms_norm_eps=1e-6,
        use_cache=True,
        pad_token_id=None,
        bos_token_id=0,
        eos_token_id=1,
        pretraining_tp=1,
        tie_word_embeddings=False,
        rope_theta=10000.0,
        rope_scaling=None,
        attention_bias=False,
        attention_dropout=0.0,
        **kwargs,
    ):
        """Store all configuration values; see the class docstring for the
        meaning and defaults of each argument. Token-id and embedding-tying
        arguments are forwarded to ``PretrainedConfig.__init__`` at the end."""
        # Core transformer dimensions.
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.moe_intermediate_size = moe_intermediate_size
        self.num_hidden_layers = num_hidden_layers
        # Extra prediction layers (multi-token / next-n prediction).
        self.num_nextn_predict_layers = num_nextn_predict_layers
        self.num_attention_heads = num_attention_heads
        # Mixture-of-experts topology: shared + routed experts, expert parallel size.
        self.n_shared_experts = n_shared_experts
        self.n_routed_experts = n_routed_experts
        self.ep_size = ep_size
        self.routed_scaling_factor = routed_scaling_factor
        # Low-rank (LoRA-style) ranks and per-head dims for the attention projections.
        self.kv_lora_rank = kv_lora_rank
        self.q_lora_rank = q_lora_rank
        self.qk_rope_head_dim = qk_rope_head_dim
        self.v_head_dim = v_head_dim
        self.qk_nope_head_dim = qk_nope_head_dim
        # Expert-routing strategy and group-limited top-k selection.
        self.topk_method = topk_method
        self.n_group = n_group
        self.topk_group = topk_group
        self.num_experts_per_tok = num_experts_per_tok
        self.moe_layer_freq = moe_layer_freq
        # The first `first_k_dense_replace` layers use dense FFNs instead of MoE.
        self.first_k_dense_replace = first_k_dense_replace
        self.norm_topk_prob = norm_topk_prob
        self.scoring_func = scoring_func
        # Auxiliary load-balancing loss settings.
        self.aux_loss_alpha = aux_loss_alpha
        self.seq_aux = seq_aux
        # for backward compatibility
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.pretraining_tp = pretraining_tp
        self.use_cache = use_cache
        # Rotary position embedding base and optional scaling config.
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        self.attention_bias = attention_bias
        self.attention_dropout = attention_dropout

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
|
configuration_kimi_k25.py
ADDED
|
@@ -0,0 +1,123 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from transformers.configuration_utils import PretrainedConfig
|
| 2 |
+
|
| 3 |
+
try:
|
| 4 |
+
from configuration_deepseek import DeepseekV3Config
|
| 5 |
+
except ImportError:
|
| 6 |
+
from .configuration_deepseek import DeepseekV3Config
|
| 7 |
+
|
| 8 |
+
|
| 9 |
+
class KimiK25VisionConfig(PretrainedConfig):
    """Configuration for the Kimi-K2.5 vision tower and multimodal projector.

    Args:
        patch_size (int): Patch size for the vision tower.
        init_pos_emb_height (int): Initial position-embedding height.
        init_pos_emb_width (int): Initial position-embedding width.
        init_pos_emb_time (int): Initial position-embedding temporal size.
        pos_emb_type (str): Type of position embedding.
        vt_num_attention_heads (int): Number of attention heads in the vision tower.
        vt_num_hidden_layers (int): Number of hidden layers in the vision tower.
        vt_hidden_size (int): Hidden size of the vision tower.
        vt_intermediate_size (int): FFN intermediate size in the vision tower.
        merge_kernel_size (tuple): Kernel size for patch merging.
        video_attn_type (str): Type of video attention.
        merge_type (str): Type of merge operation.
        _attn_implementation (str): Attention implementation to use.
        mm_projector_type (str): Type of multimodal projector.
        mm_hidden_size (int | None): Projector input size; falls back to
            ``vt_hidden_size`` when not provided.
        projector_hidden_act (str): Activation function for the projector.
        projector_ln_eps (float): Layer-norm epsilon for the projector.
        ignore_index (int): Ignore index for the loss function.
        media_placeholder_token_id (int): Token ID used for media placeholders.
        pad_token_id (int): Token ID used for padding.
        use_unified_vision_chunk (bool): Whether to use unified vision chunks.
        video_placeholder (str): Temporary placeholder string for videos.
        text_hidden_size (int): Hidden size of the paired text model.
    """

    def __init__(
            self,
            patch_size: int = 14,
            init_pos_emb_height: int = 64,
            init_pos_emb_width: int = 64,
            init_pos_emb_time: int = 4,
            pos_emb_type: str = 'divided_fixed',
            vt_num_attention_heads: int = 16,
            vt_num_hidden_layers: int = 27,
            vt_hidden_size: int = 1152,
            vt_intermediate_size: int = 4304,
            merge_kernel_size: tuple = (2, 2),
            video_attn_type: str = 'spatial_temporal',
            merge_type: str = 'sd2_tpool',
            _attn_implementation: str = 'flash_attention_2',
            # MM Projector parameters
            mm_projector_type: str = 'patchmerger',
            mm_hidden_size: int | None = None,
            projector_hidden_act: str = "gelu",
            projector_ln_eps: float = 1e-5,
            # Other parameters
            ignore_index: int = -100,
            media_placeholder_token_id: int = 163605,
            pad_token_id: int = 0,
            use_unified_vision_chunk: bool = True,
            video_placeholder="<|kimi_k25_video_placeholder|>",
            text_hidden_size=7168,
            **vision_config_kwargs):

        self.patch_size = patch_size
        self.init_pos_emb_height = init_pos_emb_height
        self.init_pos_emb_width = init_pos_emb_width
        self.init_pos_emb_time = init_pos_emb_time
        self.pos_emb_type = pos_emb_type
        self.vt_num_attention_heads = vt_num_attention_heads
        self.vt_num_hidden_layers = vt_num_hidden_layers
        self.vt_hidden_size = vt_hidden_size
        self.vt_intermediate_size = vt_intermediate_size
        self.merge_kernel_size = merge_kernel_size
        self.video_attn_type = video_attn_type
        self.merge_type = merge_type
        self._attn_implementation = _attn_implementation

        # MM Projector config
        self.mm_projector_type = mm_projector_type
        self.mm_hidden_size = mm_hidden_size if mm_hidden_size is not None else vt_hidden_size
        self.projector_hidden_act = projector_hidden_act
        self.projector_ln_eps = projector_ln_eps
        self.text_hidden_size = text_hidden_size

        # These were previously accepted but silently discarded; keep them on
        # the config so they survive save/load round-trips.
        self.ignore_index = ignore_index
        self.media_placeholder_token_id = media_placeholder_token_id
        self.use_unified_vision_chunk = use_unified_vision_chunk
        self.video_placeholder = video_placeholder

        # Bug fix: PretrainedConfig.__init__ was never invoked, so the
        # standard config machinery (serialization defaults, handling of
        # extra kwargs) was skipped and all extra kwargs were lost.
        super().__init__(pad_token_id=pad_token_id, **vision_config_kwargs)
|
| 60 |
+
|
| 61 |
+
|
| 62 |
+
class KimiK25Config(PretrainedConfig):
    """Kimi-K2.5 model configuration.

    Args:
        text_config (dict | DeepseekV3Config, *optional*): Configuration for
            the text model. A default ``DeepseekV3Config`` is used when omitted.
        vision_config (dict | KimiK25VisionConfig, *optional*): Configuration
            for the vision tower. A default ``KimiK25VisionConfig`` is used
            when omitted.
        ignore_index (int): The ignore index for the loss function.
        media_placeholder_token_id (int): The token ID to use for media
            placeholders.
        pad_token_id (int): The token ID to use for padding.
        use_unified_vision_chunk (bool): Whether to use unified vision chunks.
        video_placeholder (str): Temporary placeholder string substituted for
            videos in raw prompts.
    """

    model_type = "kimi_k25"

    def __init__(
        self,
        text_config: dict | DeepseekV3Config = None,
        vision_config: dict | KimiK25VisionConfig = None,
        # Other parameters
        ignore_index: int = -100,
        media_placeholder_token_id: int = 163605,
        pad_token_id: int = 0,
        use_unified_vision_chunk: bool = True,
        video_placeholder="<|kimi_k25_video_placeholder|>",
        **kwargs,
    ):
        # Coerce dict inputs (e.g. loaded from config.json) into config
        # objects, and fall back to default sub-configs instead of leaving
        # ``None`` behind (which broke downstream attribute access and
        # serialization of the composite config).
        if text_config is None:
            text_config = DeepseekV3Config()
        elif isinstance(text_config, dict):
            text_config = DeepseekV3Config(**text_config)
        if vision_config is None:
            vision_config = KimiK25VisionConfig()
        elif isinstance(vision_config, dict):
            vision_config = KimiK25VisionConfig(**vision_config)
        self.text_config = text_config
        self.vision_config = vision_config
        # Other config
        self.ignore_index = ignore_index
        self.media_placeholder_token_id = media_placeholder_token_id
        self.use_unified_vision_chunk = use_unified_vision_chunk
        self.video_placeholder = video_placeholder
        # Surface the text model's quantization config at the top level so
        # quantization-aware loaders can discover it.
        if getattr(self.text_config, "quantization_config", None) is not None:
            self.quantization_config = self.text_config.quantization_config

        super().__init__(pad_token_id=pad_token_id, **kwargs)
|
docs/deploy_guidance.md
ADDED
|
@@ -0,0 +1,94 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Kimi-K2.6 Deployment Guide
|
| 2 |
+
|
| 3 |
+
> [!Note]
|
| 4 |
+
> This guide only provides some examples of deployment commands for Kimi-K2.6, which may not be the optimal configuration. Since inference engines are still being updated frequently, please continue to follow the guidance from their homepage if you want to achieve better inference performance.
|
| 5 |
+
|
| 6 |
+
> [!Note]
|
| 7 |
+
> Kimi-K2.6 has the same architecture as Kimi-K2.5, and the deployment method can be directly reused.
|
| 8 |
+
## vLLM Deployment
|
| 9 |
+
|
| 10 |
+
You can refer to https://recipes.vllm.ai/moonshotai/Kimi-K2.5 for the newest deployment guide.
|
| 11 |
+
|
| 12 |
+
This model is available in nightly vLLM wheel:
|
| 13 |
+
```
|
| 14 |
+
uv pip install -U vllm \
|
| 15 |
+
--torch-backend=auto \
|
| 16 |
+
--extra-index-url https://wheels.vllm.ai/nightly
|
| 17 |
+
```
|
| 18 |
+
|
| 19 |
+
Nightly wheels may be unstable and are considered experimental. For stable production use, we recommend vLLM 0.19.1, which has been manually verified.
|
| 20 |
+
|
| 21 |
+
Here is an example of serving this model on a single H200 node with TP8 via vLLM:
|
| 22 |
+
```bash
|
| 23 |
+
vllm serve $MODEL_PATH -tp 8 --mm-encoder-tp-mode data --trust-remote-code --tool-call-parser kimi_k2 --reasoning-parser kimi_k2
|
| 24 |
+
```
|
| 25 |
+
**Key notes**
|
| 26 |
+
- `--tool-call-parser kimi_k2`: Required for enabling tool calling
|
| 27 |
+
- `--reasoning-parser kimi_k2`: Kimi-K2.6 enables thinking mode by default. Make sure to pass this for correct reasoning processing.
|
| 28 |
+
|
| 29 |
+
## SGLang Deployment
|
| 30 |
+
|
| 31 |
+
You can refer to https://cookbook.sglang.io/autoregressive/Moonshotai/Kimi-K2.5 for the newest deployment guide.
|
| 32 |
+
|
| 33 |
+
This model is available in SGLang latest main:
|
| 34 |
+
|
| 35 |
+
```
|
| 36 |
+
pip install "sglang @ git+https://github.com/sgl-project/sglang.git#subdirectory=python"
|
| 37 |
+
pip install nvidia-cudnn-cu12==9.16.0.29
|
| 38 |
+
```
|
| 39 |
+
|
| 40 |
+
Similarly, here is the example for it to run with TP8 on H200 in a single node via SGLang:
|
| 41 |
+
``` bash
|
| 42 |
+
sglang serve --model-path $MODEL_PATH --tp 8 --trust-remote-code --tool-call-parser kimi_k2 --reasoning-parser kimi_k2
|
| 43 |
+
```
|
| 44 |
+
**Key parameter notes:**
|
| 45 |
+
- `--tool-call-parser kimi_k2`: Required when enabling tool usage.
|
| 46 |
+
- `--reasoning-parser kimi_k2`: Required for correctly processing reasoning content.
|
| 47 |
+
|
| 48 |
+
## KTransformers Deployment
|
| 49 |
+
### KTransformers+SGLang Inference Deployment
|
| 50 |
+
Launch with KTransformers + SGLang for CPU+GPU heterogeneous inference:
|
| 51 |
+
|
| 52 |
+
```
|
| 53 |
+
python -m sglang.launch_server \
|
| 54 |
+
--host 0.0.0.0 \
|
| 55 |
+
--port 31245 \
|
| 56 |
+
--model /path/to/kimi-k2.6 \
|
| 57 |
+
--kt-weight-path /path/to/kimi-k2.6 \
|
| 58 |
+
--kt-cpuinfer 96 \
|
| 59 |
+
--kt-threadpool-count 2 \
|
| 60 |
+
--kt-num-gpu-experts 30 \
|
| 61 |
+
--kt-method RAWINT4 \
|
| 62 |
+
--kt-gpu-prefill-token-threshold 400 \
|
| 63 |
+
--trust-remote-code \
|
| 64 |
+
--mem-fraction-static 0.94 \
|
| 65 |
+
--served-model-name Kimi-K2.6 \
|
| 66 |
+
--enable-mixed-chunk \
|
| 67 |
+
--tensor-parallel-size 4 \
|
| 68 |
+
--enable-p2p-check \
|
| 69 |
+
--disable-shared-experts-fusion \
|
| 70 |
+
--chunked-prefill-size 32658 \
|
| 71 |
+
--max-total-tokens 50000 \
|
| 72 |
+
--attention-backend flashinfer
|
| 73 |
+
```
|
| 74 |
+
|
| 75 |
+
Achieves 640.12 tokens/s Prefill and 24.51 tokens/s Decode (48-way concurrency) on 8× NVIDIA L20 + 2× Intel 6454S.
|
| 76 |
+
|
| 77 |
+
More details: https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/Kimi-K2.5.md .
|
| 78 |
+
|
| 79 |
+
### KTransformers+LLaMA-Factory Fine-tuning Deployment
|
| 80 |
+
|
| 81 |
+
You can use below command to run LoRA SFT with KT+llamafactory.
|
| 82 |
+
|
| 83 |
+
```
|
| 84 |
+
# For LoRA SFT
|
| 85 |
+
USE_KT=1 llamafactory-cli train examples/train_lora/kimik2_lora_sft_kt.yaml
|
| 86 |
+
# For Chat with model after LoRA SFT
|
| 87 |
+
llamafactory-cli chat examples/inference/kimik2_lora_sft_kt.yaml
|
| 88 |
+
# For API with model after LoRA SFT
|
| 89 |
+
llamafactory-cli api examples/inference/kimik2_lora_sft_kt.yaml
|
| 90 |
+
```
|
| 91 |
+
|
| 92 |
+
This achieves end-to-end LoRA SFT Throughput: 44.55 token/s on 2× NVIDIA 4090 + Intel 8488C with 1.97T RAM and 200G swap memory.
|
| 93 |
+
|
| 94 |
+
More details refer to https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/SFT_Installation_Guide_KimiK2.5.md .
|
figures/demo_video.mp4
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:09b4d925aa0a7c712feef50765355f0625d8f6d46ea302fd98db9609e9070047
|
| 3 |
+
size 270100
|
figures/kimi-logo.png
ADDED
|
generation_config.json
ADDED
|
@@ -0,0 +1,4 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"max_length": 262144,
|
| 3 |
+
"eos_token_id": 163586
|
| 4 |
+
}
|
kimi_k25_processor.py
ADDED
|
@@ -0,0 +1,165 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
from transformers.feature_extraction_utils import BatchFeature
|
| 2 |
+
from transformers.processing_utils import ProcessorMixin
|
| 3 |
+
from transformers.utils import logging
|
| 4 |
+
|
| 5 |
+
logger = logging.get_logger(__name__)
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
class KimiK25Processor(ProcessorMixin):
    r"""
    Constructs a KimiK25 processor which wraps a KimiK25 image processor and a tokenizer into a single processor.

    [`KimiK25Processor`] offers all the functionalities of [`KimiK25ImageProcessor`] and [`TikTokenTokenizer`]. See the
    [`~KimiK25Processor.__call__`] and [`~KimiK25Processor.decode`] for more information.

    Args:
        image_processor ([`KimiK25ImageProcessor`], *optional*):
            The image processor is a required input.
        tokenizer ([`TikTokenTokenizer`], *optional*):
            The tokenizer is a required input.
        chat_template (`str`, *optional*): A Jinja template which will be used to convert lists of messages
            in a chat into a tokenizable string.
    """

    attributes = ["image_processor", "tokenizer"]
    valid_kwargs = ["chat_template"]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

    def __init__(
        self,
        image_processor=None,
        tokenizer=None,
        chat_template=None,
        **kwargs,
    ):
        super().__init__(image_processor,
                         tokenizer,
                         chat_template=chat_template)
        self.media_processor = image_processor
        # A special temporal placeholder to be replaced by actual video placeholders
        self.video_placeholder = "<|kimi_k25_video_placeholder|>"

    def update_raw_text(self, text: str, video_prompts: list[str]) -> str:
        """Replace each video placeholder in ``text`` with its chunk prompts.

        Args:
            text: Raw prompt text, possibly containing video placeholders.
            video_prompts: One concatenated chunk-prompt string per video,
                in order of appearance.

        Returns:
            The text with every placeholder substituted in order.

        Raises:
            ValueError: If the number of placeholders in ``text`` does not
                match ``len(video_prompts)``.
        """
        video_count = text.count(self.video_placeholder)
        if video_count == 0:
            return text
        # Raise instead of assert: asserts are stripped under `python -O`
        # and this is genuine input validation.
        if video_count != len(video_prompts):
            raise ValueError(
                f"Found {video_count} video placeholder(s) in text but "
                f"received {len(video_prompts)} video prompt(s)")
        text_parts = text.split(self.video_placeholder)
        pieces = []
        for part, prompt in zip(text_parts, video_prompts):
            pieces.append(part)
            pieces.append(prompt)
        pieces.append(text_parts[-1])
        return "".join(pieces)

    def preprocess_medias(
            self, medias: list[dict]) -> tuple[list[dict], list[str]]:
        """Expand videos into per-chunk media items.

        Images pass through unchanged; each video is split into chunks via
        the media processor, and a concatenated prompt string is collected
        per original video.

        Returns:
            A ``(updated_medias, video_prompts)`` tuple. (The original
            annotation claimed ``list[dict]`` but a 2-tuple was returned.)
        """
        updated_medias = []
        video_prompts = []
        for media in medias:
            if media['type'] == 'image':
                updated_medias.append(media)
            elif media['type'] == 'video':
                video_chunks = self.media_processor.split_video_chunks(
                    media['video'])
                updated_medias.extend(video_chunks)
                video_prompts.append("".join(
                    [vc['prompt'] for vc in video_chunks]))
            else:
                raise ValueError(f"unsupported media type: {media['type']}")
        return updated_medias, video_prompts

    def __call__(self,
                 messages: list[dict] = None,
                 medias: list[dict] = None,
                 text: str = None,
                 return_tensors: str = "pt",
                 **kwargs) -> BatchFeature:
        """
        Process multimodal inputs for Kimi-K2.5 model.

        This processor accepts ordered messages and extracts both media and text in a single pass.
        text will be automatically updated if video input detected in messages

        Args:
            messages: List of message dicts with 'role' and 'content' fields.
                If provided, medias and text will be extracted automatically.
            medias: Pre-extracted list of media dicts. If None, extracted from messages.
            text: Pre-formatted text string. If None, generated via apply_chat_template.
            return_tensors: Format of returned tensors ('pt', 'np', 'tf'). Default: 'pt'.
            **kwargs: Additional arguments passed to tokenizer.apply_chat_template.

        Returns:
            BatchFeature with fields: input_ids, attention_mask, pixel_values, grid_thws.
        """
        if messages is None and (medias is None or text is None):
            raise ValueError(
                "Provide either 'messages' or both 'medias' and 'text'")

        # Single unified path (the original duplicated this tail across two
        # branches): fill in whichever of medias/text is missing, then encode.
        if medias is None:
            medias = self._extract_medias_from_messages(messages)
        updated_medias, video_prompts = self.preprocess_medias(medias)
        preprocessed = self.media_processor.preprocess(
            updated_medias, return_tensors=return_tensors)

        if text is None:
            text = self.tokenizer.apply_chat_template(messages, **kwargs)

        text = self.update_raw_text(text, video_prompts)

        text_inputs = self.tokenizer(text, return_tensors=return_tensors)
        return BatchFeature(data={**text_inputs, **preprocessed.data})

    @staticmethod
    def _extract_medias_from_messages(messages: list[dict]) -> list[dict]:
        """
        Extract media items from messages in a single pass.

        Supports both OpenAI-style payloads ('image_url'/'video_url' with a
        nested dict) and plain 'image'/'video' keys. Kept as internal method
        since external callers should use __call__.
        """
        medias = []
        for msg in messages:
            if msg['role'] != 'user' or not msg.get('content'):
                continue

            for content_part in msg['content']:
                if not isinstance(content_part, dict):
                    continue

                content_type = content_part.get('type')
                if content_type in ['video_url', 'video']:
                    # Bug fix: type 'video' previously raised KeyError because
                    # only the 'video_url' payload shape was handled.
                    if 'video_url' in content_part:
                        video = content_part['video_url']['url']
                    else:
                        video = content_part['video']
                    medias.append({
                        'type': 'video',
                        'video': video,
                        'first_frame_timestamp': 0.0
                    })
                elif content_type in ['image_url', 'image']:
                    # Same fix for the 'image' payload shape.
                    if 'image_url' in content_part:
                        image = content_part['image_url']
                    else:
                        image = content_part['image']
                    medias.append({
                        'type': 'image',
                        'image': image,
                    })
        return medias

    def apply_chat_template(self, messages, **kwargs):
        """Delegate chat templating to the wrapped tokenizer."""
        return self.tokenizer.apply_chat_template(messages, **kwargs)

    def batch_decode(self, *args, **kwargs):
        """Delegate batch decoding to the wrapped tokenizer."""
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        """Delegate decoding to the wrapped tokenizer."""
        return self.tokenizer.decode(*args, **kwargs)

    @property
    def model_input_names(self):
        """Names of the model inputs this processor produces."""
        return ['input_ids', 'attention_mask', 'pixel_values', 'grid_thws']
|
kimi_k25_vision_processing.py
ADDED
|
@@ -0,0 +1,251 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Image processor class for Kimi-K2.5.
|
| 2 |
+
"""
|
| 3 |
+
|
| 4 |
+
import json
|
| 5 |
+
from typing import Any, Dict, Optional, Union
|
| 6 |
+
|
| 7 |
+
import numpy as np
|
| 8 |
+
import torch
|
| 9 |
+
from PIL import Image
|
| 10 |
+
from transformers.image_processing_utils import (BaseImageProcessor,
|
| 11 |
+
BatchFeature)
|
| 12 |
+
from transformers.utils import TensorType
|
| 13 |
+
|
| 14 |
+
from .media_utils import (MediaInput, VideoChunkInput, _to_tensor,
|
| 15 |
+
ensure_media_type, get_video_meta, image_to_np,
|
| 16 |
+
navit_patchify, navit_resize_image,
|
| 17 |
+
navit_resize_video, normalize,
|
| 18 |
+
real_sample_fps_and_max_num_frames, timestamp_as_str)
|
| 19 |
+
|
| 20 |
+
try:
|
| 21 |
+
from mecord import VideoReader
|
| 22 |
+
except ImportError:
|
| 23 |
+
VideoReader = None
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
def resampling(video_bytes: bytes,
               sample_indices: list[int],
               key_indices=None,
               frame_time_info=None,
               num_threads=4) -> "list[Image.Image]":
    """Decode the frames of ``video_bytes`` at the given indices.

    Args:
        video_bytes: Encoded video payload (bytes or base64 string, as
            accepted by ``mecord.VideoReader``).
        sample_indices: Frame indices to extract, in output order.
        key_indices: Optional key-frame index hints forwarded to the reader.
        frame_time_info: Optional per-frame timing info forwarded to the reader.
        num_threads: Decoder thread count.

    Returns:
        The decoded frames as ``PIL.Image.Image`` objects. (The original
        annotation incorrectly declared ``-> str``.)

    Raises:
        ImportError: If the optional ``mecord`` dependency is not installed.
    """
    # Fail with a clear message instead of "NoneType is not callable" when
    # the optional import at module top fell back to None.
    if VideoReader is None:
        raise ImportError(
            "Video decoding requires the optional 'mecord' package, "
            "which is not installed.")
    video = VideoReader(video_bytes,
                        num_threads=num_threads,
                        frame_time_info=frame_time_info,
                        key_indices=key_indices)
    # extract target frames
    frames = video[sample_indices]
    frames = [Image.fromarray(frame) for frame in frames]
    return frames
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
class KimiK25VisionProcessor(BaseImageProcessor):
|
| 42 |
+
model_type = "kimi_k25"
|
| 43 |
+
|
| 44 |
+
def __init__(
|
| 45 |
+
self,
|
| 46 |
+
media_proc_cfg: dict,
|
| 47 |
+
**kwargs,
|
| 48 |
+
):
|
| 49 |
+
super().__init__(**kwargs)
|
| 50 |
+
self.media_proc_cfg = media_proc_cfg
|
| 51 |
+
self.num_frames_per_chunk = media_proc_cfg[
|
| 52 |
+
'temporal_merge_kernel_size']
|
| 53 |
+
|
| 54 |
+
def media_tokens_calculator(self, media: MediaInput):
|
| 55 |
+
media = ensure_media_type(media)
|
| 56 |
+
ret = self.get_resize_config(media)
|
| 57 |
+
return ret['num_tokens']
|
| 58 |
+
|
| 59 |
+
@classmethod
|
| 60 |
+
def make_chunk_prompt(cls, timestamp_text: str) -> str:
|
| 61 |
+
return f"{timestamp_text}<|media_begin|>video<|media_content|><|media_pad|><|media_end|>"
|
| 62 |
+
|
| 63 |
+
    def split_video_chunks(self,
                           video_url: str | bytes) -> "list[VideoChunkInput]":
        """Sample frames from a video and group them into timestamped chunks.

        Frames are sampled uniformly at (at most) the configured
        ``sample_fps``, then grouped into chunks of
        ``temporal_merge_kernel_size`` consecutive frames. Each chunk carries
        a prompt string built from the timestamp of its first frame.

        Args:
            video_url: Encoded video as a base64 string or raw bytes.

        Returns:
            A list of ``VideoChunkInput`` items, one per chunk, each with
            its frames and timestamped prompt. (The original annotation
            said ``list[list[Image.Image]]``, which does not match the
            constructed return values.)
        """
        # video_url should be base64 str or bytes
        video_spec = get_video_meta(video_url)
        # Never sample faster than the source's native fps.
        sample_fps = min(self.media_proc_cfg['sample_fps'], video_spec.fps)
        sampled_nframes = max(
            round(video_spec.num_frames * sample_fps / video_spec.fps), 1)
        # Uniformly spaced frame indices across the whole video, inclusive
        # of the first and last frame.
        frame_inds = np.linspace(0, video_spec.num_frames - 1,
                                 sampled_nframes).round().astype(int)
        frame_inds = frame_inds.tolist()
        sampled_frame_ids = []
        temporal_merge_kernel_size = self.media_proc_cfg[
            "temporal_merge_kernel_size"]
        num_chunks = 0
        chunk_timestamp = []
        # Walk the sampled indices in chunk-sized strides; the final chunk
        # may hold fewer than temporal_merge_kernel_size frames.
        for i in range(0, len(frame_inds), temporal_merge_kernel_size):
            sampled_frame_ids.extend(frame_inds[i:i +
                                                temporal_merge_kernel_size])
            # Timestamp of the chunk's first frame, in seconds.
            start_time = frame_inds[i] / float(video_spec.fps)
            timestamp_text = timestamp_as_str(
                start_time, self.media_proc_cfg["timestamp_mode"])
            chunk_timestamp.append(timestamp_text)
            num_chunks += 1

        # Decode all selected frames in one pass, then slice per chunk.
        sampled_frames = resampling(video_url, sampled_frame_ids)
        chunks = []
        for chunk_id in range(num_chunks):
            chunk = sampled_frames[chunk_id *
                                   temporal_merge_kernel_size:(chunk_id + 1) *
                                   temporal_merge_kernel_size]
            chunks.append(
                VideoChunkInput(type="video_chunk",
                                video_chunk=chunk,
                                prompt=self.make_chunk_prompt(
                                    chunk_timestamp[chunk_id])))
        return chunks
|
| 99 |
+
|
| 100 |
+
    def get_resize_config(self, media_input: MediaInput) -> dict:
        """Compute the resize/padding plan for one media item.

        Dispatches on ``media_input['type']`` to the navit resize helpers,
        passing limits from ``self.media_proc_cfg``.

        Args:
            media_input: A dict-like item of type 'image' (with a PIL image
                under 'image') or 'video_chunk' (with a list of frames under
                'video_chunk').

        Returns:
            A dict produced by ``navit_resize_image`` / ``navit_resize_video``;
            callers read at least 'new_width', 'new_height', 'pad_width',
            'pad_height' and 'num_tokens' from it.

        Raises:
            ValueError: For any other media type.
        """
        if media_input['type'] == 'image':
            w, h = media_input['image'].size
            ret = navit_resize_image(
                w, h, self.media_proc_cfg['patch_size'],
                self.media_proc_cfg['merge_kernel_size'],
                self.media_proc_cfg['in_patch_limit'],
                self.media_proc_cfg['patch_limit_on_one_side'],
                self.media_proc_cfg['fixed_output_tokens'])
            return ret
        elif media_input['type'] == 'video_chunk':
            # All frames of a chunk share the first frame's dimensions.
            frame = media_input['video_chunk'][0]
            width, height = frame.size
            num_frames = len(media_input["video_chunk"])
            # NOTE(review): fps is hard-coded to 1.0 here — presumably chunks
            # are treated as one frame per second after sampling; confirm.
            fps = 1.0

            sample_fps, max_num_frames_each_video = real_sample_fps_and_max_num_frames(
                media_input["type"],
                self.media_proc_cfg['sample_fps'],
                self.media_proc_cfg['max_num_frames_each_video'],
            )

            # Per-frame patch limit falls back to the global image limit.
            in_patch_limit_each_frame = self.media_proc_cfg[
                'in_patch_limit_each_frame']
            if in_patch_limit_each_frame is None:
                in_patch_limit_each_frame = self.media_proc_cfg[
                    'in_patch_limit']

            ret = navit_resize_video(
                width,
                height,
                num_frames,
                fps,
                sample_fps,
                self.media_proc_cfg['patch_size'],
                self.media_proc_cfg['merge_kernel_size'],
                in_patch_limit_each_frame,
                self.media_proc_cfg['patch_limit_on_one_side'],
                self.media_proc_cfg['in_patch_limit_video'],
                max_num_frames_each_video,
                self.media_proc_cfg['fixed_output_tokens'],
            )
            return ret
        else:
            raise ValueError("Unsupported type: {}".format(
                media_input['type']))
|
| 146 |
+
|
| 147 |
+
def resize_image(self, image: Image.Image, new_width: int, new_height: int,
|
| 148 |
+
pad_width: int, pad_height: int) -> np.ndarray:
|
| 149 |
+
image_np = image_to_np(image, (new_width, new_height), "resize")
|
| 150 |
+
image_np = np.pad(
|
| 151 |
+
image_np,
|
| 152 |
+
((0, pad_height), (0, pad_width), (0, 0)),
|
| 153 |
+
mode="constant",
|
| 154 |
+
constant_values=0,
|
| 155 |
+
)
|
| 156 |
+
return image_np
|
| 157 |
+
|
| 158 |
+
def preprocess(
    self,
    medias: list[MediaInput],
    return_tensors: Optional[Union[str, TensorType]] = None,
) -> BatchFeature:
    """
    Preprocess atomic vision inputs (images / video_chunks) into model-ready tensors.

    Args:
        medias: List of MediaInput (a single item is also accepted and wrapped).
        return_tensors: Desired output format ('pt', 'np', 'tf', or None).

    Returns:
        BatchFeature containing 'pixel_values' and 'grid_thws' tensors
        (empty dict when `medias` is empty).
    """
    # Accept a single media item for convenience.
    if not isinstance(medias, list):
        medias = [medias]
    if medias:
        pixel_values = []
        for item in medias:
            # Normalize raw inputs (paths / base64 / bytes) to PIL images.
            item = ensure_media_type(item)
            # Per-item resize plan computed from the media proc config.
            resize_config = self.get_resize_config(item)
            new_width, new_height, pad_width, pad_height = resize_config[
                'new_width'], resize_config['new_height'], resize_config[
                    'pad_width'], resize_config['pad_height']
            if item['type'] == 'image':
                image = item['image']
                image_np = self.resize_image(image, new_width, new_height,
                                             pad_width, pad_height)
                # Add a leading frame axis so images look like 1-frame videos.
                pixel_values.append(np.expand_dims(image_np, axis=0))
            elif item['type'] == 'video_chunk':
                pixels = []
                for frame in item['video_chunk']:
                    # All frames of a chunk share the same resize plan.
                    frame_np = self.resize_image(frame, new_width,
                                                 new_height, pad_width,
                                                 pad_height)
                    pixels.append(frame_np)
                pixel_values.append(np.stack(pixels, axis=0))
            else:
                raise ValueError("Unsupported type: {}".format(
                    item['type']))
        normalized_pixel_values = []
        # Precompute channel statistics once; normalize() multiplies by the
        # reciprocal of std instead of dividing.
        image_std_inv = 1.0 / np.array(self.media_proc_cfg['image_std'])
        image_mean = np.array(self.media_proc_cfg['image_mean'])
        for pixels in pixel_values:
            pixels = normalize(pixels, image_mean, image_std_inv)
            # Split the (t, h, w, c) array into flattened square patches.
            pixels_and_thw = navit_patchify(
                pixels,
                self.media_proc_cfg['patch_size'],
            )
            normalized_pixel_values.append(pixels_and_thw)

        # Concatenate all patch tensors along the patch axis; grid shapes are
        # stacked so each media item keeps its own (t, h, w) grid row.
        pixel_values = torch.cat([
            _to_tensor(pixel_value['pixel_values'])
            for pixel_value in normalized_pixel_values
        ])
        grid_thws = torch.cat([
            _to_tensor(pixel_value['grid_thw'],
                       dtype=torch.int64).unsqueeze(0)
            for pixel_value in normalized_pixel_values
        ])

        data = {
            'pixel_values': pixel_values,
            'grid_thws': grid_thws,
        }

    else:
        data = {}

    return BatchFeature(data=data, tensor_type=return_tensors)
|
| 229 |
+
|
| 230 |
+
def __repr__(self):
    """Return a debug representation exposing the media processing config."""
    cfg = self.media_proc_cfg
    return "KimiK25VisionProcessor(media_proc_cfg={})".format(cfg)
|
| 232 |
+
|
| 233 |
+
def to_dict(self) -> Dict[str, Any]:
    """Serialize the config, inlining media_proc_cfg and dropping any live
    'media_processor' object that the base serializer may have picked up."""
    output = super().to_dict()
    output["media_proc_cfg"] = self.media_proc_cfg
    output.pop("media_processor", None)
    return output
|
| 239 |
+
|
| 240 |
+
@classmethod
def from_dict(cls, config_dict: Dict[str, Any], **kwargs):
    """Build an instance from a plain dict (inverse of :meth:`to_dict`).

    The input dict is not mutated; 'media_proc_cfg' is extracted and the
    remaining keys are forwarded to the constructor together with **kwargs.
    """
    remaining = dict(config_dict)
    media_proc_cfg = remaining.pop("media_proc_cfg", {})
    return cls(media_proc_cfg=media_proc_cfg, **remaining, **kwargs)
|
| 245 |
+
|
| 246 |
+
def to_json_string(self):
    """Serialize to pretty-printed JSON with sorted keys and a trailing
    newline, converting array-like values via their .tolist() method."""
    payload = self.to_dict()
    for key in list(payload):
        value = payload[key]
        if hasattr(value, 'tolist'):
            payload[key] = value.tolist()
    return json.dumps(payload, indent=2, sort_keys=True) + "\n"
|
media_utils.py
ADDED
|
@@ -0,0 +1,368 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import base64
|
| 2 |
+
import io
|
| 3 |
+
import math
|
| 4 |
+
import os
|
| 5 |
+
from datetime import datetime, timezone
|
| 6 |
+
from typing import List, Literal, Optional, TypedDict
|
| 7 |
+
|
| 8 |
+
import numpy as np
|
| 9 |
+
from PIL import Image
|
| 10 |
+
from pydantic import BaseModel, Field
|
| 11 |
+
|
| 12 |
+
try:
|
| 13 |
+
from mecord import VideoReader
|
| 14 |
+
except ImportError:
|
| 15 |
+
VideoReader = None
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
class VideoSpec(BaseModel):
    """Lightweight metadata descriptor for a video stream.

    Produced by get_video_meta(); carries dimensions, frame count, average
    fps, and optional seek-acceleration hints from the reader.
    """

    # Fixed discriminator for this media kind. The previous declaration
    # ``media_type: str = Literal['video']`` assigned the *typing object*
    # Literal['video'] as the default value instead of the string 'video'.
    media_type: Literal['video'] = 'video'
    height: int = Field(..., gt=0, description="video frame height")
    width: int = Field(..., gt=0, description="video frame width")
    num_frames: int = Field(..., gt=0, description="num frames")
    fps: float = Field(..., gt=0, description="average fps")

    # optional, help to accelerate video reading; Optional added so the
    # declared type matches the None defaults below.
    key_indices: Optional[list[int]] = Field(None, description="key indices")
    frame_time_info: Optional[dict] = Field(None,
                                            description="frame time info")
|
| 28 |
+
|
| 29 |
+
|
| 30 |
+
class ImageInput(TypedDict):
    """A single still image wrapped for the media pipeline."""

    # Tag used by dispatch code (ensure_media_type / preprocess) to select
    # the image code path.
    type: Literal['image']
    # The decoded frame; may arrive as a path/base64/bytes value that
    # ensure_media_type() normalizes to an RGB PIL image.
    image: Image.Image
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
class _VideoChunkInputRequired(TypedDict):
    """Required keys of VideoChunkInput, split out because TypedDict fields
    cannot carry default values (PEP 589)."""

    # Tag used by dispatch code to select the video_chunk code path.
    type: Literal['video_chunk']
    # Decoded frames of this chunk; normalized to RGB PIL images by
    # ensure_media_type().
    video_chunk: List[Image.Image]


class VideoChunkInput(_VideoChunkInputRequired, total=False):
    """A chunk of decoded video frames.

    ``prompt`` is optional. The original declaration used
    ``prompt: Optional[str] = None``, i.e. a default value on a TypedDict
    field, which TypedDict does not support; ``total=False`` expresses the
    optionality correctly without changing what callers may pass.
    """

    prompt: Optional[str]
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
# Union of all media payloads accepted by the processing pipeline.
MediaInput = ImageInput | VideoChunkInput
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
def get_video_meta(video_src: bytes | str | os.PathLike,
                   accurate: bool = True) -> "VideoSpec":
    """Probe a video source and return its basic metadata.

    Args:
        video_src: Raw video bytes, a filesystem path, or a
            ``data:video/mp4;base64,`` URI string.
        accurate: Forwarded as ``VideoReader(auto_init=...)``; when True the
            reader fully initializes for exact frame/timing info.

    Returns:
        A VideoSpec with dimensions, frame count, average fps, and — when
        the reader provides them — key-frame indices and frame timing info.

    Raises:
        ImportError: If the optional ``mecord`` dependency is not installed.
        AssertionError: If the source does not look like a valid video.
    """
    if VideoReader is None:
        # The mecord import at module top is optional; fail loudly here
        # instead of with an opaque "'NoneType' object is not callable".
        raise ImportError(
            "mecord is required for get_video_meta but is not installed")
    if isinstance(video_src, os.PathLike):
        video_src = str(video_src)
    # if b64 string, decode to bytes
    if isinstance(video_src,
                  str) and video_src.startswith('data:video/mp4;base64,'):
        video_src = base64.b64decode(video_src.split(',')[1])
    video = VideoReader(video_src, auto_init=accurate, num_threads=1)
    assert video.num_frames > 0, "Invalid video format."
    assert video.original_width > 0 and video.original_height > 0, (
        "Invalid video format.")
    assert video.avg_fps > 0, "Invalid video format."
    # NOTE: return annotation was previously ``dict`` although a VideoSpec
    # model is returned; quoted to avoid a forward-reference evaluation.
    return VideoSpec(media_type='video',
                     height=video.original_height,
                     width=video.original_width,
                     num_frames=video.num_frames,
                     fps=video.avg_fps,
                     key_indices=video.key_indices,
                     frame_time_info=video.frame_time_info)
|
| 65 |
+
|
| 66 |
+
|
| 67 |
+
def timestamp_as_str(timestamp: float,
                     timestamp_mode: str = "hh:mm:ss.fff") -> str:
    """Format a timestamp (seconds) as a wall-clock style string.

    Supported modes: "hh:mm:ss.fff", "mm:ss.fff", "mm:ss". Any other mode
    raises ValueError. Milliseconds come from the fractional part.
    """
    if timestamp_mode not in ("hh:mm:ss.fff", "mm:ss.fff", "mm:ss"):
        raise ValueError(f"Invalid timestamp mode: {timestamp_mode}")
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    if timestamp_mode == "mm:ss":
        return moment.strftime("%M:%S")
    clock_fmt = "%H:%M:%S" if timestamp_mode == "hh:mm:ss.fff" else "%M:%S"
    millis = f".{int((timestamp % 1) * 1000):03d}"
    return moment.strftime(clock_fmt) + millis
|
| 83 |
+
|
| 84 |
+
|
| 85 |
+
def navit_resize_image(
|
| 86 |
+
width: int,
|
| 87 |
+
height: int,
|
| 88 |
+
patch_size: int,
|
| 89 |
+
merge_kernel_size: int,
|
| 90 |
+
in_patch_limit: int,
|
| 91 |
+
patch_limit_on_one_side: int,
|
| 92 |
+
fixed_output_tokens: int | None,
|
| 93 |
+
):
|
| 94 |
+
# Apply the patch limits.
|
| 95 |
+
s1 = math.sqrt(
|
| 96 |
+
in_patch_limit /
|
| 97 |
+
(max(1.0, width // patch_size) * max(1.0, height // patch_size)))
|
| 98 |
+
s2 = patch_limit_on_one_side * patch_size / width
|
| 99 |
+
s3 = patch_limit_on_one_side * patch_size / height
|
| 100 |
+
scale = min(1.0, s1, s2, s3)
|
| 101 |
+
new_w, new_h = max(1, int(width * scale)), max(1, int(height * scale))
|
| 102 |
+
new_w = min(new_w, patch_limit_on_one_side * patch_size)
|
| 103 |
+
new_h = min(new_h, patch_limit_on_one_side * patch_size)
|
| 104 |
+
|
| 105 |
+
# Calculate the padding to make the height and width divisible by the merge kernel size and patch size.
|
| 106 |
+
factor = merge_kernel_size * patch_size
|
| 107 |
+
|
| 108 |
+
pad_height = (factor - new_h % factor) % factor
|
| 109 |
+
pad_width = (factor - new_w % factor) % factor
|
| 110 |
+
|
| 111 |
+
if fixed_output_tokens is not None:
|
| 112 |
+
num_tokens = fixed_output_tokens
|
| 113 |
+
else:
|
| 114 |
+
# Calculate new dimensions after padding and patching
|
| 115 |
+
token_height = (new_h + pad_height) // factor
|
| 116 |
+
token_width = (new_w + pad_width) // factor
|
| 117 |
+
|
| 118 |
+
assert token_height * merge_kernel_size <= patch_limit_on_one_side, (
|
| 119 |
+
f"token_height {token_height} * merge_kernel_size {merge_kernel_size} > patch_limit_on_one_side {patch_limit_on_one_side}"
|
| 120 |
+
)
|
| 121 |
+
assert token_width * merge_kernel_size <= patch_limit_on_one_side, (
|
| 122 |
+
f"token_width {token_width} * merge_kernel_size {merge_kernel_size} > patch_limit_on_one_side {patch_limit_on_one_side}"
|
| 123 |
+
)
|
| 124 |
+
|
| 125 |
+
num_tokens = token_height * token_width
|
| 126 |
+
return {
|
| 127 |
+
"num_tokens": num_tokens,
|
| 128 |
+
"new_width": new_w,
|
| 129 |
+
"new_height": new_h,
|
| 130 |
+
"pad_width": pad_width,
|
| 131 |
+
"pad_height": pad_height,
|
| 132 |
+
"sampled_nframes": 1,
|
| 133 |
+
}
|
| 134 |
+
|
| 135 |
+
|
| 136 |
+
def navit_resize_video(
    width: int,
    height: int,
    nframes: int,
    avg_fps: float,
    sample_fps: float,
    patch_size: int,
    merge_kernel_size: int,
    in_patch_limit_each_frame: int,
    patch_limit_on_one_side: int,
    in_patch_limit_total: int | None,
    max_num_frames_each_video: int | None,
    fixed_output_tokens_each_frame: int | None,
):
    """Compute the per-frame resize/pad plan for a video plus frame sampling.

    Delegates the spatial plan to navit_resize_image; this function decides
    how many frames are sampled and tightens the per-frame patch budget so
    the whole clip stays within ``in_patch_limit_total``.
    """
    # Never sample faster than the source actually plays.
    effective_fps = min(sample_fps, avg_fps)
    sampled_nframes = max(round(nframes * effective_fps / avg_fps), 1)
    if max_num_frames_each_video is not None:
        sampled_nframes = min(sampled_nframes, max_num_frames_each_video)

    # Spread a clip-wide patch budget evenly across the sampled frames.
    per_frame_limit = in_patch_limit_each_frame
    if in_patch_limit_total is not None:
        per_frame_limit = min(round(in_patch_limit_total / sampled_nframes),
                              per_frame_limit)

    plan = navit_resize_image(
        width,
        height,
        patch_size,
        merge_kernel_size,
        per_frame_limit,
        patch_limit_on_one_side,
        fixed_output_tokens_each_frame,
    )
    plan["sampled_nframes"] = sampled_nframes
    return plan
|
| 172 |
+
|
| 173 |
+
|
| 174 |
+
def real_sample_fps_and_max_num_frames(
|
| 175 |
+
type_name: Literal["video", "video_chunk"],
|
| 176 |
+
sample_fps: float,
|
| 177 |
+
max_num_frames_each_video: int | None,
|
| 178 |
+
) -> tuple[int, int | None]:
|
| 179 |
+
if type_name == "video":
|
| 180 |
+
return sample_fps, max_num_frames_each_video
|
| 181 |
+
elif type_name == "video_chunk":
|
| 182 |
+
max_num_frames_each_video = None
|
| 183 |
+
sample_fps = math.inf
|
| 184 |
+
return sample_fps, max_num_frames_each_video
|
| 185 |
+
else:
|
| 186 |
+
return math.inf, None
|
| 187 |
+
|
| 188 |
+
|
| 189 |
+
def _to_pil(data: "str | bytes | Image.Image") -> "Image.Image":
    """Coerce an image source to an RGB PIL image.

    Accepts an already-decoded PIL image (previously missing from the type
    annotation even though the first branch handles it), a ``data:`` base64
    URI, a filesystem path, or raw encoded bytes.

    Raises:
        ValueError: For any other input type.
    """
    if isinstance(data, Image.Image):
        return data.convert("RGB")
    elif isinstance(data, str):
        if data.startswith("data:"):
            # data URI: payload follows the first comma.
            raw_base64 = data.split(",")[1]
            return Image.open(io.BytesIO(
                base64.b64decode(raw_base64))).convert("RGB")
        else:
            # Treat any other string as a filesystem path.
            return Image.open(data).convert("RGB")
    elif isinstance(data, bytes):
        return Image.open(io.BytesIO(data)).convert("RGB")
    else:
        raise ValueError(f"Unsupported data type: {type(data)}")
|
| 204 |
+
|
| 205 |
+
|
| 206 |
+
def ensure_media_type(media: MediaInput) -> MediaInput:
    """Normalize *media* in place so its payload holds RGB PIL images.

    Raises ValueError for unknown media types; returns the same dict.
    """
    kind = media['type']
    if kind == 'image':
        media['image'] = _to_pil(media['image'])
    elif kind == 'video_chunk':
        media['video_chunk'] = [_to_pil(f) for f in media['video_chunk']]
    else:
        raise ValueError(f"Unsupported media type: {media['type']}")
    return media
|
| 217 |
+
|
| 218 |
+
|
| 219 |
+
def image_to_np(
    image: Image.Image,
    resize_to: tuple[int, int] | None = None,
    mode: str = "resize",
    raise_error_for_ill_resize: bool = True,
) -> np.ndarray:
    """Convert an image to a numpy array, optionally resizing/padding it.

    Args:
        image: The PIL image to convert.
        resize_to: Target (width, height), or None to keep the original size.
        mode: "resize" (direct bicubic resize), "rescale_and_pad_to_center"
            (aspect-preserving downscale, zero-pad split around the image) or
            "rescale_and_pad_to_rightbottom" (aspect-preserving downscale,
            zero-pad bottom/right only).
        raise_error_for_ill_resize: Whether to raise when the rescaled size
            collapses to zero pixels; if False an all-black canvas of the
            target size is returned instead.

    Returns:
        A numpy array of shape (height, width, 3).
    """
    assert isinstance(image, Image.Image), "image must be a PIL Image"
    if resize_to is not None:
        if mode == "resize":
            image = image.resize(resize_to, resample=Image.Resampling.BICUBIC)

        elif mode == "rescale_and_pad_to_center":
            # Aspect-preserving downscale only (scale capped at 1.0).
            scale = min(resize_to[0] / image.width,
                        resize_to[1] / image.height, 1.0)
            new_width = round(image.width * scale)
            new_height = round(image.height * scale)
            if new_width == 0 or new_height == 0:
                if raise_error_for_ill_resize:
                    raise ValueError(
                        f"Invalid resize to: {resize_to}, from image size: {image.size}"
                    )
                else:
                    # Degenerate target: return an all-black canvas.
                    return np.zeros((resize_to[1], resize_to[0], 3),
                                    dtype=np.uint8)

            image = image.resize((new_width, new_height),
                                 resample=Image.Resampling.BICUBIC)
            # Split the leftover space evenly; right/bottom absorb odd pixels.
            padding_left = (resize_to[0] - new_width) // 2
            padding_right = resize_to[0] - new_width - padding_left
            padding_top = (resize_to[1] - new_height) // 2
            padding_bottom = resize_to[1] - new_height - padding_top
            image = np.asarray(image)
            image = np.pad(
                image,
                ((padding_top, padding_bottom), (padding_left, padding_right),
                 (0, 0)),
                mode="constant",
                constant_values=0,
            )
            assert image.shape == (resize_to[1], resize_to[0], 3)

        elif mode == "rescale_and_pad_to_rightbottom":
            scale = min(resize_to[0] / image.width,
                        resize_to[1] / image.height, 1.0)
            new_width = round(image.width * scale)
            new_height = round(image.height * scale)
            if new_width == 0 or new_height == 0:
                if raise_error_for_ill_resize:
                    raise ValueError(
                        f"Invalid resize to: {resize_to}, from image size: {image.size}"
                    )
                else:
                    return np.zeros((resize_to[1], resize_to[0], 3),
                                    dtype=np.uint8)

            image = image.resize((new_width, new_height),
                                 resample=Image.Resampling.BICUBIC)
            # Anchor the image at the top-left; pad only bottom/right.
            padding_right = resize_to[0] - new_width
            padding_bottom = resize_to[1] - new_height
            image = np.asarray(image)
            image = np.pad(
                image,
                ((0, padding_bottom), (0, padding_right), (0, 0)),
                mode="constant",
                constant_values=0,
            )
            assert image.shape == (resize_to[1], resize_to[0], 3)

        else:
            raise ValueError(f"Invalid mode: {mode}")

    # The padding branches already produced an ndarray; the "resize" and
    # no-resize paths still hold a PIL image here.
    if isinstance(image, Image.Image):
        return np.asarray(image)
    else:
        return image
|
| 305 |
+
|
| 306 |
+
|
| 307 |
+
def navit_patchify(pixel_values: np.ndarray,
                   patch_size: int) -> dict[str, np.ndarray]:
    """Split a (t, h, w, c) pixel array into flattened square patches.

    Args:
        pixel_values: Array of shape (t, h, w, c) with c == 3; h and w are
            assumed divisible by patch_size.
        patch_size: Side length of each square patch.

    Returns:
        dict with:
        - "pixel_values": (t * h/p * w/p, c, p, p) patch array
        - "grid_thw": array [t, h/p, w/p] describing the patch grid
    """
    num_frames, height, width, channels = pixel_values.shape
    assert channels == 3, "pixel_values must have 3 channels"

    grid_h = height // patch_size
    grid_w = width // patch_size
    tiles = pixel_values.reshape(num_frames, grid_h, patch_size, grid_w,
                                 patch_size, channels)
    # (t, grid_h, grid_w, c, patch, patch) — then flatten the grid dims.
    tiles = tiles.transpose(0, 1, 3, 5, 2, 4)
    patches = tiles.reshape(-1, channels, patch_size, patch_size)
    return {
        "pixel_values": patches,
        "grid_thw": np.array([num_frames, grid_h, grid_w]),
    }
|
| 330 |
+
|
| 331 |
+
|
| 332 |
+
def normalize(x: np.ndarray,
              mean,
              std_inv,
              pixels_dtype: np.dtype = np.float32) -> np.ndarray:
    """Scale uint8 pixels to [0, 1], then apply (x - mean) * std_inv.

    Args:
        x: Pixel array of shape (..., 3), dtype uint8, values in [0, 255].
        mean: Per-channel mean (in [0, 1] units).
        std_inv: Per-channel reciprocal of the std.
        pixels_dtype: dtype of the returned array.

    Returns:
        Normalized array of shape (..., 3) with dtype ``pixels_dtype``.
    """
    out = (x / 255.0).astype(pixels_dtype)
    # In-place ufuncs keep pixels_dtype even when mean/std_inv are float64.
    np.subtract(out, mean, out=out)
    np.multiply(out, std_inv, out=out)
    return out
|
| 350 |
+
|
| 351 |
+
|
| 352 |
+
def _to_tensor(data, **kwargs):
    """Recursively convert numpy arrays (inside containers) to torch tensors.

    ``kwargs`` are forwarded to ``Tensor.to`` (e.g. dtype=, device=).
    None passes through unchanged; unsupported leaf types raise ValueError.
    """
    import torch

    if isinstance(data, np.ndarray):
        return torch.from_numpy(data).to(**kwargs)
    if isinstance(data, torch.Tensor):
        return data.to(**kwargs)
    if isinstance(data, list):
        return [_to_tensor(elem, **kwargs) for elem in data]
    if isinstance(data, tuple):
        return tuple(_to_tensor(elem, **kwargs) for elem in data)
    if isinstance(data, dict):
        return {key: _to_tensor(val, **kwargs) for key, val in data.items()}
    if data is None:
        return None
    raise ValueError(f"Unsupported data type: {type(data)}")
|
model-00001-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:cb6e037206c1936876c33f348bc644fd6f9f4d7ac973f8906359977c1eaebd43
|
| 3 |
+
size 995001888
|
model-00002-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:a609fccff2406505ae575850a51470474e8bb3eb825ddd208f84aae31cfb4960
|
| 3 |
+
size 9809047464
|
model-00003-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:f79ae8d220ecbba674c1224bbb36e2ad826ccc365765a89fab57555af47a8540
|
| 3 |
+
size 9809047464
|
model-00004-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:efa041ff6fd295b3a5b8bc4fe7db502d32e6f887d650353425c3074388ecaa33
|
| 3 |
+
size 9809047464
|
model-00005-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:6f772c9bf434aea41b6f6dc05eb32da4e8259548a3145d79e26607b4a9d1753b
|
| 3 |
+
size 9809047464
|
model-00006-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:5cf4c4593e5a7e5b4ababc8991245ab53121be414558099dc12dd01ab00eb920
|
| 3 |
+
size 9809047464
|
model-00007-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:66775118dec5d4f549aa19d0c8cd07dcc42c48d4d9dd6f289f772ec5d692a3af
|
| 3 |
+
size 9809047464
|
model-00008-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:cdca9b399dce8e82e0cf5bf9c59f600ea9f6001c56b84c78d9364cb8f55f48a5
|
| 3 |
+
size 9809047464
|
model-00009-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:4f8c2dcde7c51a3eca91bb892eacd8bcbcd7724842254a0000bb1c45faf92eb2
|
| 3 |
+
size 9809047464
|
model-00010-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:311ffdf698fb27c03697598cd2c652a1ec210d9a7351d5c2de1b14d605635195
|
| 3 |
+
size 9809047464
|
model-00011-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:b2a93273e323843ad067e00dc859f63d90f62bd2e81ef0a77ffe49a9f39a3607
|
| 3 |
+
size 9809050936
|
model-00012-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:fa3200710275f7e8d60cdb91d1d623477fda55d04bda5e32be29b5809383f4ca
|
| 3 |
+
size 9809050936
|
model-00013-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:8925783dd26187bbff4f9030aeff392c3fe3d931409daa341d673bc93393965d
|
| 3 |
+
size 9809050936
|
model-00014-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:ee309d11f1c4194db36e961293d6e7b9d3c3c8227da5064ffc32b806ac03cc91
|
| 3 |
+
size 9809050936
|
model-00015-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:4859b4dfe8eb7998bbda5e453f9336b139881d403b0a8c128b98842c9c27800d
|
| 3 |
+
size 9809050936
|
model-00016-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:4345e4706a01e63aef6a43025bc101ebbfb8e8c235ea44604fb9c5ad7038b4cc
|
| 3 |
+
size 9809050936
|
model-00017-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:e0b6e9d578c7f0eaa3e91aa917fb3cbdcaf5637caa3ad97e9dc002bcb35feb22
|
| 3 |
+
size 9809050936
|
model-00018-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:1bef116d936999f347b83403ee13fe479f51214a87592409121614793d7e65e7
|
| 3 |
+
size 9809050936
|
model-00019-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:71b624faee1df5cc99bad259b018befb0bdbbdd160af3d130487bf1165d48b9a
|
| 3 |
+
size 9809050936
|
model-00020-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:9c6ee52ccb57bd2fdba6077a18faaf3ea6da1b3536597af4fef6c5cf713bec34
|
| 3 |
+
size 9809050936
|
model-00021-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:fb99455eb90ddd9ab343538496c3e90b40868c0029a344d79b0b94f73820dd1f
|
| 3 |
+
size 9809050936
|
model-00022-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:f98118e5b5e700ad57b510f3c80183fc2e5e3dcd1badf43a959ce63140ae4e61
|
| 3 |
+
size 9809050936
|
model-00023-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:0a4f4fb4557c2954dc07ffd8d81af93587ab784e6810eca093fef6b0fab03d3c
|
| 3 |
+
size 9809050936
|
model-00024-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:9fa6f9aba1f02a552a66d0bcf2808fdc9192425b5554684b637d7e9d9312fb3d
|
| 3 |
+
size 9809050936
|
model-00025-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:98a0dc92227ed3beb2a1491558d2fa631dca256a5c28e1895b1bf28ae4c9731b
|
| 3 |
+
size 9809050936
|
model-00026-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:6b6e19204d6f74af7ff2c78c67a33991f5ee4105e0262d99eda6ba6bc0362630
|
| 3 |
+
size 9809050936
|
model-00027-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:8476d21ab258330c16020124d634d2cd1f22a5a7859d3dd778132dea0271e810
|
| 3 |
+
size 9809050936
|
model-00028-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:ee16b8cf5c8a9efb0c5fd89c282622e5cfbf4f49fb28d5b3638240434dadef7a
|
| 3 |
+
size 9809050936
|
model-00029-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:87e0e7b5caf5b39314caf4158b666d0b0c9fe00a96dd09ea545563f4a6f408ed
|
| 3 |
+
size 9809050936
|
model-00030-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:3f8d186de5354913a9aab43b26395c97bdf27b695586260bbd6492ba8ab795dc
|
| 3 |
+
size 9809050936
|
model-00031-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:696da932f44c248e96ca8b616424ee87e0609fb32ef640bd086f296665ad4ad8
|
| 3 |
+
size 9809050936
|
model-00032-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:eee21740e4d75ba175965f322dd8d2747411fedc1b59b9a0559630d7c94d06c6
|
| 3 |
+
size 9809050936
|
model-00033-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:cc9a7b5737151ca68fed23cb16308aff1527d8932be59d507318795025a9034e
|
| 3 |
+
size 9809050936
|
model-00034-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:783faa4cbeac1c7336070596c5a3f603bd49193c1ab6da74eb29d6fed935d905
|
| 3 |
+
size 9809050936
|
model-00035-of-000064.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:0256b0e5c83716c66871f0b63b0b87fe92aab595355aae583d5a674a691411e2
|
| 3 |
+
size 9809050936
|