tencent
/

Hy-MT2-1.8B

@@ -83,6 +83,33 @@ For more experimental results and analysis, please refer to our [report](./HY_MT
 ---
 ## Inference and Deployment
 ### transformers
 transformers>=5.6.0
@@ -91,7 +118,7 @@ transformers>=5.6.0
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
-model_path = "tencent/Hy-MT2-30B-A3B"
 # Load tokenizer
 tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
@@ -134,7 +161,7 @@ uv pip install --editable . --torch-backend=auto
 Start the vLLM server:
 ```bash
-vllm serve tencent/Hy-MT2-30B-A3B --tensor-parallel-size 1
 ```
 ### sglang
@@ -151,74 +178,9 @@ pip3 install -e "python"
 Launch SGLang server:
 ```bash
-python3 -m sglang.launch_server --model tencent/Hy-MT2-30B-A3B --tp 1
-```
-### llama_cpp
-**❕❕ This gguf depends on our STQ kernel, which is released at [PR #22836](https://github.com/ggml-org/llama.cpp/pull/22836).**
-#### Clone llama.cpp
-```bash
-git clone https://github.com/ggml-org/llama.cpp.git
-```
-#### Enter the llama.cpp folder
-```bash
-cd llama.cpp
-```
-#### Build llama.cpp
-```bash
-cmake -B build
-cmake --build build --config Release
-```
-#### Run a completion example
-```bash
-./build/bin/llama-completion \
-  --model model.gguf  \
-  -p "Translate the following segment into Chinese, without additional explanation：Hello" \
-  --jinja \
-  -ngl 0 \
-  -n 64 -st
 ```
-#### Run the llama.cpp benchmark
-```bash
-./build/bin/llama-bench -m model_zoo/model.gguf  -ngl 0
-```
-For 1.8B and 7B, we recommend using the following parameters for inference. Note that our models do not have a default system_prompt.
-```json
-{
-  "temperature": 0.7,
-  "top_p": 0.6,
-  "top_k": 20,
-  "repetition_penalty": 1.05,
-  "max_tokens": 4096
-}
-```
-For 30B-A3B, we recommend using the following parameters for inference. Note that our models do not have a default system_prompt.
-```json
-{
-  "temperature": 0.7,
-  "top_p": 1.0,
-  "top_k": -1,
-  "repetition_penalty": 1.0,
-  "max_tokens": 4096
-}
-```
 ## Model Training
 Hy-MT2 provides a complete model training pipeline, supporting both full-parameter fine-tuning and LoRA fine-tuning, as well as multiple DeepSpeed ZeRO configurations and LLaMA-Factory integration.

 ---
 ## Inference and Deployment
+For 1.8B and 7B, we recommend using the following parameters for inference. Note that our models do not have a default system_prompt.
+```json
+{
+  "temperature": 0.7,
+  "top_p": 0.6,
+  "top_k": 20,
+  "repetition_penalty": 1.05,
+  "max_tokens": 4096
+}
+```
+For 30B-A3B, we recommend using the following parameters for inference. Note that our models do not have a default system_prompt.
+```json
+{
+  "temperature": 0.7,
+  "top_p": 1.0,
+  "top_k": -1,
+  "repetition_penalty": 1.0,
+  "max_tokens": 4096
+}
+```
 ### transformers
 transformers>=5.6.0
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
+model_path = "tencent/Hy-MT2-1.8B"
 # Load tokenizer
 tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
 Start the vLLM server:
 ```bash
+vllm serve tencent/Hy-MT2-1.8B --tensor-parallel-size 1
 ```
 ### sglang
 Launch SGLang server:
 ```bash
+python3 -m sglang.launch_server --model tencent/Hy-MT2-1.8B --tp 1
 ```
 ## Model Training
 Hy-MT2 provides a complete model training pipeline, supporting both full-parameter fine-tuning and LoRA fine-tuning, as well as multiple DeepSpeed ZeRO configurations and LLaMA-Factory integration.