caizhi1 committed (verified)
Commit 85b9552 · Parent(s): 48b05ee

Create README.md

Files changed (1):
  1. README.md +54 −10

README.md CHANGED
@@ -1,9 +1,8 @@
- ---
- license: mit
- language:
- - en
- pipeline_tag: text-generation
- ---
+ ---
+ license: mit
+ language:
+ - en
+ ---
  ## Ling-2.6-flash: Faster Responses, Stronger Execution, Higher Token Efficiency
  ### Introduction
  Today, we announce the official open-source release of **Ling-2.6-flash**, an **instruct model** with **104B total parameters** and **7.4B active parameters**.
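Not part of the commit above, but as a companion to the introduction: a minimal `transformers` generation sketch. The checkpoint path, the need for `trust_remote_code`, and the presence of a chat template are assumptions on our part, not details taken from this README.

```python
# Minimal generation sketch (assumptions: local checkpoint path,
# custom modeling code requiring trust_remote_code, chat template present).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "/path/to/Ling-2.6-flash"  # hypothetical location

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype="auto",   # take the dtype from the checkpoint
    device_map="auto",    # shard across available GPUs (needs accelerate)
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```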
@@ -36,7 +35,7 @@ Beyond agent tasks, Ling-2.6-flash also delivers strong performance across **gen
  > + **PinchBench**: Comparative scores are retrieved directly from the official PinchBench leaderboard (as of April 20, 2026), adhering to their evaluation modes (potentially Reasoning Mode).
  > + **Claw-Eval**: Comparative scores are sourced from the official Claw-Eval leaderboard (version dated 2026-03-25), adhering to their evaluation modes (potentially Reasoning Mode). Official scores for GPT-OSS-120B and GPT-5.4-mini are currently unavailable and have been omitted.
  > + **TAU2-Bench**: Evaluations are conducted using the official v1.0.0 code and datasets. Following the GLM-5 evaluation protocol, we applied minor prompt adjustments in the Retail and Telecom domains to ensure users express requests clearly and to prevent premature session termination. Additionally, GPT-5.2 was utilized as the User Agent across all evaluated domains.
- > + **IFBench**: Scores for GPT-OSS-120B (low) and GPT-5.4-mini (Non-Reasoning) are sourced from the AA(Artificial Analysis) Leaderboard. All other model performance data are based on internal evaluation results.
+ > + **IFBench**: Scores for GPT-OSS-120B (low) and GPT-5.4-mini (Non-Reasoning) are sourced from the AA (Artificial Analysis) Leaderboard. All other model performance data are based on internal evaluation results.
  >

  ### Architecture
@@ -66,9 +65,9 @@ Whether the workload involves **long-context understanding** or **extended text
  ```bash
  pip install uv

- uv venv ~/my_sglang_env
+ uv venv ~/my_ling_env

- source ~/my_sglang_env/bin/activate
+ source ~/my_ling_env/bin/activate

  uv pip install sglang
  ```
@@ -92,7 +91,17 @@ python -m sglang.launch_server \
  ```

  **2. Inference with MTP (Multi-Token Prediction)**
+ _The current official SGLang implementation of MTP contains a bug. For better inference performance, we recommend installing our patched version. Our fix is currently under review and is expected to be merged into the official SGLang library shortly._

+ **Install our SGLang**
+ ```bash
+ git clone -b ling_2_6 git@github.com:antgroup/sglang.git
+ cd sglang
+
+ pip install --upgrade pip
+ pip install -e "python"
+ ```
+ Start server
  ```bash
  python -m sglang.launch_server \
  --model-path $MODEL_PATH \
@@ -120,9 +129,44 @@ curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
  ```

+ #### vLLM
+ ##### Environment Preparation
+ ```bash
+ pip install uv
+
+ uv venv ~/my_ling_env
+
+ source ~/my_ling_env/bin/activate
+
+ git clone https://github.com/vllm-project/vllm.git
+
+ cd vllm
+
+ VLLM_USE_PRECOMPILED=1 uv pip install --editable . --torch-backend=auto
+ ```
+
+ ##### Run inference
+
+ **Server**
+ ```bash
+ vllm serve $MODEL_PATH \
+ --port $PORT \
+ --served-model-name my_model \
+ --trust-remote-code --tensor-parallel-size 4 \
+ --gpu-memory-utilization 0.85
+ ```
+
+ **Client**
+
+ ```bash
+ curl -s http://${MASTER_IP}:${PORT}/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
+ ```
+
  ### Limitations & Future Plans
  Ling-2.6-flash has already made meaningful progress in our pursuit of an extreme intelligence-efficiency tradeoff. The model has improved substantially in key areas such as **tool use, multi-step planning, and long-horizon task execution**. Combined with systematic optimizations in inference efficiency and interaction experience, Ling-2.6-flash is now better equipped to handle **large-scale, high-frequency automated workloads**, delivering stronger real-world value in production settings.

  At the same time, we are fully aware that pushing intelligence efficiency to the limit comes with tradeoffs. In some highly complex scenarios, the model can still exhibit **tool hallucinations** due to limited reasoning depth. In addition, there is still room for improvement in areas such as **natural bilingual switching between Chinese and English** and **compliance with highly complex instructions**.

- Looking ahead, we will continue exploring the frontier of intelligence efficiency. While preserving the model’s high-efficiency inference characteristics, we aim to further improve the balance between **output quality** and **token efficiency**, and to continuously strengthen the model’s **stability, usability, and interaction experience across a wider range of real-world scenarios**.
+ Looking ahead, we will continue exploring the frontier of intelligence efficiency. While preserving the model’s high-efficiency inference characteristics, we aim to further improve the balance between **output quality** and **token efficiency**, and to continuously strengthen the model’s **stability, usability, and interaction experience across a wider range of real-world scenarios**.
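The curl-based clients in the hunk above have a straightforward Python equivalent. This sketch is not part of the commit; it assumes the `openai` package is installed and an OpenAI-compatible server (SGLang or vLLM, launched as in the snippets) is reachable at the host and port you configured.

```python
from openai import OpenAI

# Point the SDK at the local OpenAI-compatible endpoint
# (http://${MASTER_IP}:${PORT}/v1 in the curl examples).
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="auto",  # or the --served-model-name you launched with, e.g. "my_model"
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(resp.choices[0].message.content)
```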
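For quick local checks without standing up an HTTP server, the `vllm serve` command in the diff also has an in-process counterpart. Again a sketch rather than anything from the commit: the model path is a placeholder, and the `trust_remote_code` and parallelism settings simply mirror the server flags shown above.

```python
from vllm import LLM, SamplingParams

# Mirror the server flags: tensor parallelism across 4 GPUs,
# 85% GPU memory budget. Path is a placeholder for $MODEL_PATH.
llm = LLM(
    model="/path/to/Ling-2.6-flash",
    trust_remote_code=True,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.85,
)

# chat() applies the model's chat template before generating.
outputs = llm.chat(
    [{"role": "user", "content": "What is the capital of France?"}],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```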