caizhi1 committed (verified)
Commit 85b9552 · Parent(s): 48b05ee

Create README.md

Files changed (1):
  1. README.md +54 −10

README.md CHANGED
@@ -1,9 +1,8 @@
- ---
- license: mit
- language:
- - en
- pipeline_tag: text-generation
- ---
+ ---
+ license: mit
+ language:
+ - en
+ ---
  ## Ling-2.6-flash: Faster Responses, Stronger Execution, Higher Token Efficiency
  ### Introduction
  Today, we announce the official open-source release of **Ling-2.6-flash**, an **instruct model** with **104B total parameters** and **7.4B active parameters**.
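Not part of the commit above, but as a companion to the introduction: a minimal `transformers` generation sketch. The checkpoint path, the need for `trust_remote_code`, and the presence of a chat template are assumptions on our part, not details taken from this README.

```python
# Minimal generation sketch (assumptions: local checkpoint path,
# custom modeling code requiring trust_remote_code, chat template present).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/path/to/Ling-2.6-flash"  # hypothetical location

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype="auto",   # take the dtype from the checkpoint
    device_map="auto",    # shard across available GPUs (needs accelerate)
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```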
@@ -36,7 +35,7 @@ Beyond agent tasks, Ling-2.6-flash also delivers strong performance across **gen
  > + **PinchBench**: Comparative scores are retrieved directly from the official PinchBench leaderboard (as of April 20, 2026), adhering to their evaluation modes (potentially Reasoning Mode).
  > + **Claw-Eval**: Comparative scores are sourced from the official Claw-Eval leaderboard (version dated 2026-03-25), adhering to their evaluation modes (potentially Reasoning Mode). Official scores for GPT-OSS-120B and GPT-5.4-mini are currently unavailable and have been omitted.
  > + **TAU2-Bench**: Evaluations are conducted using the official v1.0.0 code and datasets. Following the GLM-5 evaluation protocol, we applied minor prompt adjustments in the Retail and Telecom domains to ensure users express requests clearly and to prevent premature session termination. Additionally, GPT-5.2 was utilized as the User Agent across all evaluated domains.
- > + **IFBench**: Scores for GPT-OSS-120B (low) and GPT-5.4-mini (Non-Reasoning) are sourced from the AA(Artificial Analysis) Leaderboard. All other model performance data are based on internal evaluation results.
+ > + **IFBench**: Scores for GPT-OSS-120B (low) and GPT-5.4-mini (Non-Reasoning) are sourced from the AA (Artificial Analysis) Leaderboard. All other model performance data are based on internal evaluation results.
  >

  ### Architecture
@@ -66,9 +65,9 @@ Whether the workload involves **long-context understanding** or **extended text
  ```bash
  pip install uv

- uv venv ~/my_sglang_env
+ uv venv ~/my_ling_env

- source ~/my_sglang_env/bin/activate
+ source ~/my_ling_env/bin/activate

  uv pip install sglang
  ```
@@ -92,7 +91,17 @@ python -m sglang.launch_server \
  ```

  **2. Inference with MTP (Multi-Token Prediction)**
+ _The current official SGLang implementation of MTP contains a bug. For better inference performance, we recommend installing our patched version. Our fix is currently under review and is expected to be merged into the official SGLang library shortly._

+ **Install our SGLang**
+ ```bash
+ git clone -b ling_2_6 git@github.com:antgroup/sglang.git
+ cd sglang
+
+ pip install --upgrade pip
+ pip install -e "python"
+ ```
+ Start server
  ```bash
  python -m sglang.launch_server \
  --model-path $MODEL_PATH \
@@ -120,9 +129,44 @@ curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
  ```

+ #### vLLM
+ ##### Environment Preparation
+ ```bash
+ pip install uv
+
+ uv venv ~/my_ling_env
+
+ source ~/my_ling_env/bin/activate
+
+ git clone https://github.com/vllm-project/vllm.git
+
+ cd vllm
+
+ VLLM_USE_PRECOMPILED=1 uv pip install --editable . --torch-backend=auto
+ ```
+
+ ##### Run inference
+
+ **Server**
+ ```bash
+ vllm serve $MODEL_PATH \
+ --port $PORT \
+ --served-model-name my_model \
+ --trust-remote-code --tensor-parallel-size 4 \
+ --gpu-memory-utilization 0.85
+ ```
+
+ **Client**
+
+ ```bash
+ curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
+ ```
+
  ### Limitations & Future Plans
  Ling-2.6-flash has already made meaningful progress in our pursuit of an extreme intelligence-efficiency tradeoff. The model has improved substantially in key areas such as **tool use, multi-step planning, and long-horizon task execution**. Combined with systematic optimizations in inference efficiency and interaction experience, Ling-2.6-flash is now better equipped to handle **large-scale, high-frequency automated workloads**, delivering stronger real-world value in production settings.

  At the same time, we are fully aware that pushing intelligence efficiency to the limit comes with tradeoffs. In some highly complex scenarios, the model can still exhibit **tool hallucinations** due to limited reasoning depth. In addition, there is still room for improvement in areas such as **natural bilingual switching between Chinese and English** and **compliance with highly complex instructions**.

- Looking ahead, we will continue exploring the frontier of intelligence efficiency. While preserving the model’s high-efficiency inference characteristics, we aim to further improve the balance between **output quality** and **token efficiency**, and to continuously strengthen the model’s **stability, usability, and interaction experience across a wider range of real-world scenarios**.
+ Looking ahead, we will continue exploring the frontier of intelligence efficiency. While preserving the model’s high-efficiency inference characteristics, we aim to further improve the balance between **output quality** and **token efficiency**, and to continuously strengthen the model’s **stability, usability, and interaction experience across a wider range of real-world scenarios**.
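The curl-based clients in the hunk above have a straightforward Python equivalent. This sketch is not part of the commit; it assumes the `openai` package is installed and an OpenAI-compatible server (SGLang or vLLM, launched as in the snippets) is reachable at the host and port you configured.

```python
from openai import OpenAI

# Point the SDK at the local OpenAI-compatible endpoint
# (http://${MASTER_IP}:${PORT}/v1 in the curl examples).
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="auto",  # or the --served-model-name you launched with, e.g. "my_model"
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(resp.choices[0].message.content)
```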
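For quick local checks without standing up an HTTP server, the `vllm serve` command in the diff also has an in-process counterpart. Again a sketch rather than anything from the commit: the model path is a placeholder, and the `trust_remote_code` and parallelism settings simply mirror the server flags shown above.

```python
from vllm import LLM, SamplingParams

# Mirror the server flags: tensor parallelism across 4 GPUs,
# 85% GPU memory budget. Path is a placeholder for $MODEL_PATH.
llm = LLM(
    model="/path/to/Ling-2.6-flash",
    trust_remote_code=True,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.85,
)

# chat() applies the model's chat template before generating.
outputs = llm.chat(
    [{"role": "user", "content": "What is the capital of France?"}],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```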