Instructions to use tencent/Hy-MT2-7B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use tencent/Hy-MT2-7B with Transformers:
# Use a pipeline as a high-level helper # Warning: Pipeline type "translation" is no longer supported in transformers v5. # You must load the model directly (see below) or downgrade to v4.x with: # 'pip install "transformers<5.0.0' from transformers import pipeline pipe = pipeline("translation", model="tencent/Hy-MT2-7B")# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("tencent/Hy-MT2-7B") model = AutoModelForCausalLM.from_pretrained("tencent/Hy-MT2-7B") - Notebooks
- Google Colab
- Kaggle
File size: 15,424 Bytes
26173ad | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 | <p align="left">
<a href="README_CN.md">中文</a> | English
</p>
# Model Training
Hy3 preview provides processes related to model training. This section details how to process training data for model training purposes.
## Training Data Format and Processing
The training data should be formatted as a list of messages. By default, the system prompt for both training and inference is empty, but you may customize it as needed.
Below is a training data example for a translation task:
```python
# Translation task example
{"messages": [{"role": "user", "content": "将以下中文翻译为英文,只输出翻译结果,不要额外解释:\n\n实验结果证明了假设的正确性。"}, {"role": "assistant", "content": "The experimental results demonstrate the correctness of the hypothesis."}]}
```
## Quick Start
You can quickly get started by following the instructions in the Quick Start Guide.
## Model Training
### Hardware Requirements
The following are the minimum hardware requirements for each model at max_seq_length = 8192:
#### Hy-MT2-1.8B (Dense)
| Training Method | DeepSpeed Strategy | Minimum GPU Requirement |
|----------------|-------------------|------------------------|
| LoRA Fine-tuning | ZeRO-2 (no offload) | 1 GPU (24GB+) |
| Full Fine-tuning | ZeRO-2 (no offload) | 1 GPU (24GB+) |
#### Hy-MT2-7B (Dense)
| Training Method | DeepSpeed Strategy | Minimum GPU Requirement |
|----------------|-------------------|------------------------|
| LoRA Fine-tuning | ZeRO-2 (no offload) | 1 GPU (80GB+) |
| Full Fine-tuning | ZeRO-3 (no offload) | 2 GPUs (80GB+ each) |
#### Hy-MT2-30B-A3B (MoE)
| Training Method | DeepSpeed Strategy | Minimum GPU Requirement |
|----------------|-------------------|------------------------|
| LoRA Fine-tuning | ZeRO-2 (no offload) | 8 GPUs on a single machine (80GB+ each) |
| Full Fine-tuning | ZeRO-3 + offload | 8 GPUs on a single machine (80GB+ each) |
### Configure Passwordless SSH Login Between Machines (Multi-Machine Training)
> If you only use single-machine training, you can skip this section.
The following instructions use two machines as an example, with their IPs denoted as `${ip1}` and `${ip2}`. All steps should be performed inside the Docker container.
First, configure passwordless SSH for each container on every machine:
```sh
ssh-keygen # Generate id_rsa and id_rsa.pub for passwordless login
ssh-keygen -t rsa -A # Generate /etc/ssh/ssh_host_rsa_key and ssh_host_ecdsa_key for SSH listening
/usr/sbin/sshd -p 36005 -o ListenAddress=0.0.0.0 # Start SSH listening
echo "Port 36005" > ~/.ssh/config # Set SSH connection port to 36005
passwd root # Set the root password to avoid monitoring platform alerts
```
Note: `36005` is an example port. You may use any available port, but ensure it is **open** and **not occupied by other processes**.
Next, in each machine's container, execute:
```sh
cat ~/.ssh/id_rsa.pub
```
**Copy the output SSH public key and paste it into the `~/.ssh/authorized_keys` file, one key per line. This must be done on every machine.** In the end, the `~/.ssh/authorized_keys` file on each machine should be identical and contain the public keys of all machines.
Please note that for multi-node training, the code executed on each node must be identical. It is recommended to mount a shared network drive. If this is not possible, you must manually copy the dataset, scripts, and code to the same directory on each machine.
### Launch Methods
This project provides two training methods. You can choose based on your needs:
- **DeepSpeed Native Training** (based on HuggingFace Transformers Trainer): Located in the `train/deepspeed_support` directory
- **LLaMA-Factory Training**: Located in the `train/llama_factory_support` directory
#### DeepSpeed Native Training
Reference: [HuggingFace Transformers Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer)
##### Training Scripts
In the `train/deepspeed_support` directory, the training scripts for each model are as follows:
| Model | Full Fine-tuning | LoRA Fine-tuning |
|-------|-----------------|-----------------|
| Hy-MT2-1.8B (Dense) | `bash train_dense.sh 1.8B` | `bash train_dense_lora.sh 1.8B` |
| Hy-MT2-7B (Dense) | `bash train_dense.sh 7B` | `bash train_dense_lora.sh 7B` |
| Hy-MT2-30B-A3B (MoE) | `bash train.sh` | `bash train_lora.sh` |
##### Single-Machine Training
In the `train/deepspeed_support` directory, install dependencies and execute the corresponding script:
```sh
pip install -r requirements.txt
# Example: Dense 1.8B full fine-tuning
bash train_dense.sh 1.8B
```
##### Multi-Machine Training
To launch training across multiple machines, please first complete the configuration in [Configure Passwordless SSH Login Between Machines](#configure-passwordless-ssh-login-between-machines-multi-machine-training), and ensure all machines are within the same cluster.
Confirm that dependencies are installed (if not, run `pip install -r requirements.txt`), then set the `IP_LIST` environment variable in the corresponding training script:
```shell
export HOST_GPU_NUM=8
# IP list, comma separated. e.g. "192.168.1.1,192.168.1.2" or single node "192.168.1.1"
IP_LIST=${IP_LIST:-"127.0.0.1"}
```
Note: If the `IP_LIST` environment variable is not set, replace `IP_LIST` with the IP list! The format is:
```
For a single IP:
IP_LIST=${ip_1}
For multiple IPs:
IP_LIST=${ip_1},${ip_2}
```
Replace `${ip_1}` and `${ip_2}` with the actual IP addresses.
Then, on the machine with `${ip1}`, execute the corresponding training script in the `train/deepspeed_support/` directory. On first launch, you may see the following output:
```ssh
The authenticity of host '[ip]:36005 ([ip]:36005)' can't be established.
ECDSA key fingerprint is xxxxxx.
ECDSA key fingerprint is MD5:xxxxxx.
Are you sure you want to continue connecting (yes/no)?
```
Type `yes` to continue.
##### Key Parameters
The key parameters in the script are as follows:
- `--deepspeed`: Path to the DeepSpeed configuration file. Three default DeepSpeed configuration files are provided in the `train/deepspeed_support` folder: `ds_zero2_no_offload.json`, `ds_zero3_no_offload.json`, and `ds_zero3_offload.json`, with decreasing memory requirements in that order.
- `--model_name_or_path`: Path to the Hy3 preview HF pre-trained model weights to load.
- `--tokenizer_name_or_path`: Path to the tokenizer folder.
- `--train_data_file`: Path to the training file, which should be a jsonl file.
- `--output_dir`: Output directory where logs, tensorboard files, and model weights will be stored.
- `--per_device_train_batch_size`: Batch size per GPU.
- `--gradient_accumulation_steps`: Number of gradient accumulation steps. The global batch size is `per_device_train_batch_size * gradient_accumulation_steps * dp_size`.
- `--max_steps`: Total number of training steps.
- `--save_steps`: Number of steps between saving checkpoints.
- `--use_lora`: Whether to use LoRA training. Also accepts `--lora_rank`, `--lora_alpha`, and `--lora_dropout` parameters. By default, LoRA is applied to "q_proj", "k_proj", "v_proj", and "o_proj". To change this, modify the code. Note: **When using LoRA training, only the LoRA weights are saved, not the base model weights.** To merge LoRA weights, see the "LoRA Weight Merging" section below.
- `--make_moe_param_leaf_module`: When using ZeRO-3 with MoE training, treat the MoE module as a leaf module, i.e., its parameters are not partitioned by ZeRO-3. This option is expected to significantly increase memory usage.
- `--gradient_checkpointing`: Enable gradient checkpointing.
- `--train_attention_params_only`: Whether to train only attention parameters.
- `--learning_rate`: Maximum learning rate during training.
- `--min_lr`: Minimum learning rate during training.
- `--use_flash_attn`: Enable flash-attention for accelerated training.
**Notes:**
- To resume training from a previously saved checkpoint rather than loading pre-trained weights, specify `--resume_from_checkpoint` with the path to the checkpoint. Do not specify `--model_name_or_path`; this will load only the weights without the training state.
- When resuming from a checkpoint, there may be minor differences in loss due to the randomness of some non-deterministic algorithms. This is normal. See: [HuggingFace Transformers Trainer Randomness](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#randomness)
- When `--model_name_or_path` is specified, all model-related parameters will be ignored.
- Samples within a batch are padded to the length of the longest sample in the batch, but the maximum length of each sample is `max_seq_length`. Any excess will be truncated.
- If you see a warning about bias weights not being loaded, you can ignore it. Hunyuan-Large does not use bias.
##### What if GPU Memory is Insufficient?
Reference: [DeepSpeed Configuration](https://www.deepspeed.ai/docs/config-json/)
You can try modifying the DeepSpeed configuration by removing the `auto` attribute from the following parameters and reducing their values:
- `stage3_param_persistence_threshold`
- `stage3_prefetch_bucket_size`
- `stage3_max_reuse_distance`
##### LoRA Weight Merging
LoRA weights saved during training cannot be merged into the ZeRO-3 model at runtime, as ZeRO-3 partitions model weights across data parallel ranks. To merge LoRA weights into the base model, you can do so offline to obtain a merged weight file. Run `merge_lora_weight.sh` to merge the LoRA and base model weights. The parameters are:
- `--base_model_path`: Directory of the base model weights
- `--adapter_model_path`: Directory of the LoRA weights
- `--output_path`: Directory to save the merged weights
- `--save_dtype`: Data type for saving the merged weights; options are: fp16, bf16, fp32
#### LLaMA-Factory Training
If you are familiar with LLaMA-Factory, you may use it for fine-tuning. All scripts, code, and configuration files are archived in the `train/llama_factory_support` directory. Unless otherwise specified, all files mentioned below are located in this directory.
##### Installation
You can install LLaMA-Factory by downloading the source code from https://github.com/hiyouga/LLaMA-Factory/tree/main and following the instructions on the website.
##### Training Scripts and Configuration Files
The configuration files and launch scripts for each model are as follows:
| Model | Full Fine-tuning Config | LoRA Fine-tuning Config | Launch Script |
|-------|------------------------|------------------------|---------------|
| Hy-MT2-1.8B (Dense) | `hy_dense_1_8b_full_sft.yaml` | `hy_dense_1_8b_lora_sft.yaml` | `bash train_lf_dense.sh` |
| Hy-MT2-7B (Dense) | `hy_dense_7b_full_sft.yaml` | `hy_dense_7b_lora_sft.yaml` | `YAML_FILE=hy_dense_7b_full_sft.yaml bash train_lf_dense.sh` |
| Hy-MT2-30B-A3B (MoE) | `hy_v3_full_sft.yaml` | `hy_v3_lora_sft.yaml` | `bash train_lf.sh` |
> **Tip**: The Dense model launch script `train_lf_dense.sh` uses `hy_dense_1_8b_full_sft.yaml` by default. You can specify other configuration files via the `YAML_FILE` environment variable.
Key parameters in the configuration files are as follows:
**Model:**
- `model_name_or_path`: Path to the Hy-MT HF format pre-trained model weights
- `trust_remote_code`: Whether to trust remote code; Hy-MT requires this to be set to `true`
**Training Method:**
- `stage`: Training stage, currently `sft` (supervised fine-tuning)
- `finetuning_type`: Fine-tuning type, either `full` (full fine-tuning) or `lora` (LoRA fine-tuning)
- `deepspeed`: DeepSpeed configuration file path; `ds_zero3_offload.json` is recommended for full fine-tuning, `ds_zero2_offload_lora.json` for LoRA fine-tuning
**LoRA Parameters (only effective during LoRA fine-tuning):**
- `lora_rank`: LoRA rank, default `64`
- `lora_alpha`: LoRA alpha coefficient, default `128`
- `lora_dropout`: LoRA dropout ratio, default `0.05`
- `lora_target`: Target modules for LoRA, default `q_proj,k_proj,v_proj,o_proj`
**Dataset:**
- `dataset_dir`: Dataset directory path
- `dataset`: Dataset name, must be registered in `dataset_info.json` under `dataset_dir`
- `template`: Chat template; Hy-MT2-1.8B uses `hy_dense_1_8b`, Hy-MT2-7B uses `hy_dense_7b`, Hy-MT2-30B-A3B uses `hy_v3`
- `cutoff_len`: Maximum sequence length; sequences exceeding this will be truncated. For full fine-tuning, can be set to `262144` (262K); for LoRA fine-tuning, `8192` is recommended to save memory
- `max_samples`: Maximum number of samples per dataset
- `overwrite_cache`: Whether to overwrite cached preprocessed datasets
**Output:**
- `output_dir`: Output directory where logs, TensorBoard files, and weights will be stored
- `logging_steps`: Number of steps between logging
- `save_steps`: Number of steps between saving checkpoints
- `plot_loss`: Whether to plot the training loss curve
- `overwrite_output_dir`: Whether to overwrite the existing output directory
- `save_only_model`: Whether to save only model weights (excluding optimizer states, etc.)
- `report_to`: Logging tool, options: `none`, `wandb`, `tensorboard`, `swanlab`, `mlflow`
**Training Hyperparameters:**
- `per_device_train_batch_size`: Batch size per GPU
- `gradient_accumulation_steps`: Gradient accumulation steps; `per_device_train_batch_size * gradient_accumulation_steps * dp_size` equals the global batch size
- `learning_rate`: Maximum learning rate; `1.0e-5` recommended for full fine-tuning, `2.0e-4` for LoRA fine-tuning
- `num_train_epochs`: Number of training epochs
- `lr_scheduler_type`: Learning rate scheduler type; `cosine_with_min_lr` is recommended
- `lr_scheduler_kwargs.min_lr_rate`: Ratio of minimum to maximum learning rate; e.g., `0.1` means the minimum learning rate is 10% of the maximum
- `warmup_ratio`: Proportion of total training steps used for warmup
- `bf16`: Whether to use BFloat16 mixed precision training
- `gradient_checkpointing`: Whether to enable gradient checkpointing to save memory
- `ddp_timeout`: Distributed training timeout (milliseconds)
- `flash_attn`: Attention implementation; `fa2` (FlashAttention-2) is recommended, `sdpa` is also available; using `fa2` requires the flash-attn package
- `resume_from_checkpoint`: Resume training from a specified checkpoint path; set to `null` to start from scratch
##### Launch Training
For multi-machine training, please first complete the configuration in [Configure Passwordless SSH Login Between Machines](#configure-passwordless-ssh-login-between-machines-multi-machine-training) (single-machine training can skip this step).
Modify the following configuration at the beginning of the corresponding launch script:
```shell
export HOST_GPU_NUM=8
# IP list, comma separated. e.g. "192.168.1.1,192.168.1.2" or single node "192.168.1.1"
export IP_LIST=${IP_LIST:-"127.0.0.1"}
```
Note: If the `IP_LIST` environment variable is not set, replace `IP_LIST` with the IP list! The format is:
```
For a single IP:
IP_LIST=${ip_1}
For multiple IPs:
IP_LIST=${ip_1},${ip_2}
```
Replace `${ip_1}` and `${ip_2}` with the actual IP addresses.
Then, on each machine, run the corresponding launch script in the `train/llama_factory_support/` directory. For example:
```sh
# Dense 1.8B full fine-tuning
bash train_lf_dense.sh
# Dense 7B LoRA fine-tuning
YAML_FILE=hy_dense_7b_lora_sft.yaml bash train_lf_dense.sh
# MoE 30B-A3B full fine-tuning
bash train_lf.sh
``` |