| W1018 06:43:53.034832 2996178 site-packages/torch/distributed/run.py:792] |
| W1018 06:43:53.034832 2996178 site-packages/torch/distributed/run.py:792] ***************************************** |
| W1018 06:43:53.034832 2996178 site-packages/torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
| W1018 06:43:53.034832 2996178 site-packages/torch/distributed/run.py:792] ***************************************** |
| [2025-10-18 06:44:00,629] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| [2025-10-18 06:44:00,969] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| [2025-10-18 06:44:01,009] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| [2025-10-18 06:44:01,026] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| [2025-10-18 06:44:01,031] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| [2025-10-18 06:44:01,033] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| [2025-10-18 06:44:01,039] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| [2025-10-18 06:44:01,040] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
| [2025-10-18 06:44:04,890] [INFO] [comm.py:669:init_distributed] cdb=None |
| [2025-10-18 06:44:05,492] [INFO] [comm.py:669:init_distributed] cdb=None |
| [2025-10-18 06:44:05,518] [INFO] [comm.py:669:init_distributed] cdb=None |
| [2025-10-18 06:44:05,602] [INFO] [comm.py:669:init_distributed] cdb=None |
| [2025-10-18 06:44:05,625] [INFO] [comm.py:669:init_distributed] cdb=None |
| [2025-10-18 06:44:05,666] [INFO] [comm.py:669:init_distributed] cdb=None |
| [2025-10-18 06:44:05,666] [INFO] [comm.py:700:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl |
| [2025-10-18 06:44:05,673] [INFO] [comm.py:669:init_distributed] cdb=None |
| [2025-10-18 06:44:05,829] [INFO] [comm.py:669:init_distributed] cdb=None |
| [INFO|2025-10-18 06:44:06] llamafactory.hparams.parser:406 >> Process rank: 5, world size: 8, device: cuda:5, distributed training: True, compute dtype: torch.bfloat16 |
| [INFO|2025-10-18 06:44:06] llamafactory.hparams.parser:406 >> Process rank: 0, world size: 8, device: cuda:0, distributed training: True, compute dtype: torch.bfloat16 |
| [INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file vocab.json |
| [INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file merges.txt |
| [INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file tokenizer.json |
| [INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file added_tokens.json |
| [INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file special_tokens_map.json |
| [INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file tokenizer_config.json |
| [INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:06,954 >> loading file chat_template.jinja |
| [INFO|tokenization_utils_base.py:2323] 2025-10-18 06:44:07,355 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. |
| [INFO|configuration_utils.py:691] 2025-10-18 06:44:07,357 >> loading configuration file /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/base_model/method6_qwen2.5-7b_qwen3-4b_distill_qwen2.5-7b-it_difficulty-scale_method17/config.json |
| [INFO|configuration_utils.py:765] 2025-10-18 06:44:07,359 >> Model config Qwen2Config { |
| "architectures": [ |
| "Qwen2ForCausalLM" |
| ], |
| "attention_dropout": 0.0, |
| "bos_token_id": 151643, |
| "eos_token_id": 151645, |
| "hidden_act": "silu", |
| "hidden_size": 3584, |
| "initializer_range": 0.02, |
| "intermediate_size": 18944, |
| "max_position_embeddings": 32768, |
| "max_window_layers": 28, |
| "model_type": "qwen2", |
| "num_attention_heads": 28, |
| "num_hidden_layers": 28, |
| "num_key_value_heads": 4, |
| "rms_norm_eps": 1e-06, |
| "rope_scaling": null, |
| "rope_theta": 1000000.0, |
| "sliding_window": 131072, |
| "tie_word_embeddings": false, |
| "torch_dtype": "bfloat16", |
| "transformers_version": "4.51.3", |
| "use_cache": false, |
| "use_sliding_window": false, |
| "vocab_size": 152064 |
| } |
|
|
| [INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:07,360 >> loading file vocab.json |
| [INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:07,360 >> loading file merges.txt |
| [INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:07,360 >> loading file tokenizer.json |
| [INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:07,360 >> loading file added_tokens.json |
| [INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:07,360 >> loading file special_tokens_map.json |
| [INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:07,360 >> loading file tokenizer_config.json |
| [INFO|tokenization_utils_base.py:2058] 2025-10-18 06:44:07,360 >> loading file chat_template.jinja |
| [rank5]:[W1018 06:44:07.811198098 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
| [INFO|tokenization_utils_base.py:2323] 2025-10-18 06:44:07,759 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. |
| [INFO|2025-10-18 06:44:07] llamafactory.data.loader:143 >> Loading dataset /mmu_nlp_ssd/dongguanting/tool_light_data/method7-qwen2.5-7b-instruct-llama-factory-sft-edition17.json... |
| [INFO|2025-10-18 06:44:07] llamafactory.hparams.parser:406 >> Process rank: 4, world size: 8, device: cuda:4, distributed training: True, compute dtype: torch.bfloat16 |
| [INFO|2025-10-18 06:44:07] llamafactory.hparams.parser:406 >> Process rank: 1, world size: 8, device: cuda:1, distributed training: True, compute dtype: torch.bfloat16 |
| [INFO|2025-10-18 06:44:08] llamafactory.hparams.parser:406 >> Process rank: 7, world size: 8, device: cuda:7, distributed training: True, compute dtype: torch.bfloat16 |
| [INFO|2025-10-18 06:44:08] llamafactory.hparams.parser:406 >> Process rank: 6, world size: 8, device: cuda:6, distributed training: True, compute dtype: torch.bfloat16 |
| [INFO|2025-10-18 06:44:08] llamafactory.hparams.parser:406 >> Process rank: 2, world size: 8, device: cuda:2, distributed training: True, compute dtype: torch.bfloat16 |
| [INFO|2025-10-18 06:44:08] llamafactory.hparams.parser:406 >> Process rank: 3, world size: 8, device: cuda:3, distributed training: True, compute dtype: torch.bfloat16 |
| [rank3]:[W1018 06:44:08.022060437 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
| [rank4]:[W1018 06:44:08.046568749 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
| [rank1]:[W1018 06:44:08.058006051 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
| [rank6]:[W1018 06:44:08.087474291 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
| [rank7]:[W1018 06:44:08.088871679 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
| [rank2]:[W1018 06:44:08.124219695 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
| Setting num_proc from 16 back to 1 for the train split to disable multiprocessing as it only contains one shard. |
|
Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 15077 examples [00:01, 11581.17 examples/s]
Generating train split: 15077 examples [00:01, 11559.35 examples/s] |
|
Converting format of dataset (num_proc=16): 0%| | 0/15077 [00:00<?, ? examples/s]
Converting format of dataset (num_proc=16):  81%|████████  | 12142/15077 [00:00<00:00, 120710.70 examples/s]
Converting format of dataset (num_proc=16): 100%|██████████| 15077/15077 [00:00<00:00, 64696.89 examples/s] |
| [rank0]:[W1018 06:44:33.061247415 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. |
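Each rank prints the `ProcessGroupNCCL` barrier warning above because `init_process_group()` was called without binding the process to a GPU. A minimal sketch of the fix the warning suggests, assuming a torchrun launch (torchrun sets `LOCAL_RANK`; the helper `nccl_init_kwargs` is our own name):

```python
import os
import torch
import torch.distributed as dist

# Sketch: bind each rank to its GPU up front so barrier() does not have to
# guess the device. LOCAL_RANK is provided by torchrun.
def nccl_init_kwargs() -> dict:
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return {
        "backend": "nccl",
        # device_id tells NCCL which GPU this rank owns, silencing the
        # "devices used by this process are currently unknown" warning.
        "device_id": torch.device(f"cuda:{local_rank}"),
    }

# In the training entrypoint (requires GPUs and a rendezvous):
# dist.init_process_group(**nccl_init_kwargs())
```

Calling `torch.cuda.set_device(local_rank)` before the first collective achieves the same rank-to-GPU binding.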
|
Running tokenizer on dataset (num_proc=16): 0%| | 0/15077 [00:00<?, ? examples/s]
Running tokenizer on dataset (num_proc=16):   6%|█         | 943/15077 [00:04<01:06, 213.18 examples/s]
Running tokenizer on dataset (num_proc=16):  19%|██        | 2829/15077 [00:04<00:16, 737.08 examples/s]
Running tokenizer on dataset (num_proc=16):  31%|████      | 4715/15077 [00:04<00:07, 1420.42 examples/s]
Running tokenizer on dataset (num_proc=16):  50%|█████     | 7541/15077 [00:05<00:03, 2290.17 examples/s]
Running tokenizer on dataset (num_proc=16):  69%|███████   | 10367/15077 [00:05<00:01, 3602.20 examples/s]
Running tokenizer on dataset (num_proc=16):  81%|█████████ | 12251/15077 [00:05<00:00, 4645.95 examples/s]
Running tokenizer on dataset (num_proc=16):  94%|██████████| 14135/15077 [00:06<00:00, 4702.45 examples/s]
Running tokenizer on dataset (num_proc=16): 100%|██████████| 15077/15077 [00:06<00:00, 2277.50 examples/s] |
| training example: |
| input_ids: |
| [151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465, 553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 2610, 525, 264, 10950, 17847, 429, 646, 11625, 279, 2661, 3405, 3019, 553, 3019, 448, 279, 1492, 315, 279, 58218, 2711, 5392, 323, 10135, 39299, 5392, 13, 16246, 264, 3405, 11, 498, 1184, 311, 1156, 1744, 911, 279, 32711, 1882, 304, 279, 3971, 323, 1221, 3410, 279, 4226, 13, 11954, 7274, 11, 498, 646, 19873, 279, 58218, 2711, 5392, 311, 2711, 323, 10135, 39299, 5392, 311, 11047, 279, 6888, 3491, 369, 2097, 1995, 911, 3151, 13347, 421, 4362, 13, 576, 32711, 1882, 323, 4226, 525, 43810, 2878, 366, 26865, 29, 690, 26865, 29, 323, 366, 9217, 29, 690, 9217, 29, 9492, 15576, 11, 323, 279, 2711, 3239, 323, 1102, 525, 43810, 2878, 366, 1836, 29, 690, 1836, 29, 323, 366, 1382, 29, 690, 1382, 29, 9492, 15576, 13, 1752, 3110, 11, 366, 26865, 29, 1096, 374, 279, 32711, 1882, 13, 690, 26865, 29, 366, 1836, 29, 2711, 3239, 1588, 690, 1836, 29, 366, 1382, 29, 2711, 1102, 1588, 690, 1382, 29, 366, 26865, 29, 1096, 374, 279, 32711, 1882, 13, 690, 26865, 29, 366, 12669, 29, 10135, 2038, 1588, 690, 12669, 29, 366, 1382, 29, 10135, 39299, 1102, 1588, 690, 1382, 29, 366, 26865, 29, 1096, 374, 279, 32711, 1882, 13, 690, 26865, 29, 366, 9217, 29, 576, 1590, 4226, 374, 1124, 58, 1124, 79075, 90, 9217, 1588, 92, 1124, 60, 690, 9217, 14276, 758, 279, 1537, 949, 315, 279, 4226, 11, 279, 1590, 4734, 4226, 374, 43810, 2878, 1124, 79075, 6257, 448, 44536, 3561, 624, 4498, 1521, 11038, 468, 13, 16390, 594, 6981, 2746, 30, 151645, 198, 151644, 77091, 198, 13708, 766, 93376, 8253, 979, 11038, 468, 13, 16390, 594, 6981, 8469, 11, 358, 686, 1401, 705, 6032, 31177, 1995, 389, 1435, 3918, 26865, 1784, 1836, 43960, 2253, 468, 13, 16390, 6981, 4545, 2400, 522, 1836, 1784, 1382, 29, 334, 19357, 8085, 56177, 28715, 389, 279, 3897, 9293, 11, 11038, 468, 13, 16390, 748, 6981, 11, 11038, 506, 13, 16390, 11, 5823, 3123, 389, 6652, 220, 17, 15, 11, 220, 16, 24, 22, 16, 
11, 4152, 311, 264, 12654, 13, 1096, 1995, 4041, 5961, 504, 279, 13656, 3565, 911, 11038, 506, 13, 16390, 2878, 1493, 8173, 3918, 1382, 1784, 9217, 16357, 1590, 4226, 374, 1124, 79075, 90, 32146, 220, 17, 15, 11, 220, 16, 24, 22, 16, 92, 7110, 522, 9217, 29, 151645, 198] |
| inputs: |
| <|im_start|>system |
| You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|> |
| <|im_start|>user |
| You are a helpful assistant that can solve the given question step by step with the help of the wikipedia search tool and python interpreter tool. Given a question, you need to first think about the reasoning process in the mind and then provide the answer. During thinking, you can invoke the wikipedia search tool to search and python interpreter tool to calculate the math problem for fact information about specific topics if needed. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags respectively, and the search query and result are enclosed within <search> </search> and <result> </result> tags respectively. For example, <think> This is the reasoning process. </think> <search> search query here </search> <result> search result here </result> <think> This is the reasoning process. </think> <python> python code here </python> <result> python interpreter result here </result> <think> This is the reasoning process. </think> <answer> The final answer is \[ \boxed{answer here} \] </answer>. In the last part of the answer, the final exact answer is enclosed within \boxed{} with latex format. |
| When did Roy E. Disney's father die?<|im_end|> |
| <|im_start|>assistant |
| <think>To determine when Roy E. Disney's father died, I will look up biographical information on him.</think><search>Roy E. Disney father death date</search><result>**Final Information** |
|
|
| Based on the provided documents, Roy E. Disney’s father, Roy O. Disney, passed away on December 20, 1971, due to a stroke. This information comes directly from the historical details about Roy O. Disney within these sources.</result><answer>The final answer is \boxed{December 20, 1971}.\</answer><|im_end|> |
|
|
| label_ids: |
| [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 13708, 766, 93376, 8253, 979, 11038, 468, 13, 16390, 594, 6981, 8469, 11, 358, 686, 1401, 705, 6032, 31177, 1995, 389, 1435, 3918, 26865, 1784, 1836, 43960, 2253, 468, 13, 16390, 6981, 4545, 2400, 522, 1836, 1784, 1382, 29, 334, 19357, 8085, 56177, 28715, 389, 279, 3897, 9293, 11, 
11038, 468, 13, 16390, 748, 6981, 11, 11038, 506, 13, 16390, 11, 5823, 3123, 389, 6652, 220, 17, 15, 11, 220, 16, 24, 22, 16, 11, 4152, 311, 264, 12654, 13, 1096, 1995, 4041, 5961, 504, 279, 13656, 3565, 911, 11038, 506, 13, 16390, 2878, 1493, 8173, 3918, 1382, 1784, 9217, 16357, 1590, 4226, 374, 1124, 79075, 90, 32146, 220, 17, 15, 11, 220, 16, 24, 22, 16, 92, 7110, 522, 9217, 29, 151645, 198] |
| labels: |
| <think>To determine when Roy E. Disney's father died, I will look up biographical information on him.</think><search>Roy E. Disney father death date</search><result>**Final Information** |
| |
| Based on the provided documents, Roy E. Disney’s father, Roy O. Disney, passed away on December 20, 1971, due to a stroke. This information comes directly from the historical details about Roy O. Disney within these sources.</result><answer>The final answer is \boxed{December 20, 1971}.\</answer><|im_end|> |
| |
| [INFO|configuration_utils.py:691] 2025-10-18 06:45:06,145 >> loading configuration file /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/base_model/method6_qwen2.5-7b_qwen3-4b_distill_qwen2.5-7b-it_difficulty-scale_method17/config.json |
| [INFO|configuration_utils.py:765] 2025-10-18 06:45:06,147 >> Model config Qwen2Config { |
| "architectures": [ |
| "Qwen2ForCausalLM" |
| ], |
| "attention_dropout": 0.0, |
| "bos_token_id": 151643, |
| "eos_token_id": 151645, |
| "hidden_act": "silu", |
| "hidden_size": 3584, |
| "initializer_range": 0.02, |
| "intermediate_size": 18944, |
| "max_position_embeddings": 32768, |
| "max_window_layers": 28, |
| "model_type": "qwen2", |
| "num_attention_heads": 28, |
| "num_hidden_layers": 28, |
| "num_key_value_heads": 4, |
| "rms_norm_eps": 1e-06, |
| "rope_scaling": null, |
| "rope_theta": 1000000.0, |
| "sliding_window": 131072, |
| "tie_word_embeddings": false, |
| "torch_dtype": "bfloat16", |
| "transformers_version": "4.51.3", |
| "use_cache": false, |
| "use_sliding_window": false, |
| "vocab_size": 152064 |
| } |
| |
| [INFO|2025-10-18 06:45:06] llamafactory.model.model_utils.kv_cache:143 >> KV cache is disabled during training. |
| Applied Liger kernels to Qwen2 |
| Applied Liger kernels to Qwen2 |
| Applied Liger kernels to Qwen2 |
| Applied Liger kernels to Qwen2 |
| Applied Liger kernels to Qwen2 |
| Applied Liger kernels to Qwen2 |
| Applied Liger kernels to Qwen2 |
| Applied Liger kernels to Qwen2 |
| [INFO|2025-10-18 06:45:06] llamafactory.model.model_utils.liger_kernel:143 >> Liger kernel has been applied to the model. |
| [INFO|modeling_utils.py:1121] 2025-10-18 06:45:07,293 >> loading weights file /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/base_model/method6_qwen2.5-7b_qwen3-4b_distill_qwen2.5-7b-it_difficulty-scale_method17/model.safetensors.index.json |
| [INFO|modeling_utils.py:3726] 2025-10-18 06:45:07,308 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model |
| [2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8 |
| [2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8 |
| [2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8 |
| [2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8 |
| [2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8 |
| [2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8 |
| [2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8 |
| [2025-10-18 06:45:07,308] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8 |
| [INFO|configuration_utils.py:1142] 2025-10-18 06:45:07,321 >> Generate config GenerationConfig { |
| "bos_token_id": 151643, |
| "eos_token_id": 151645, |
| "use_cache": false |
| } |
| |
| Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered. |
| Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered. |
| Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered. |
| [WARNING|logging.py:328] 2025-10-18 06:45:07,616 >> Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered. |
| Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered. |
| Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered. |
| Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered. |
| Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered. |
| [2025-10-18 06:45:09,887] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 339, num_elems = 7.62B |
|
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]
Loading checkpoint shards:  25%|███       | 1/4 [00:02<00:07,  2.42s/it]
Loading checkpoint shards:  25%|███       | 1/4 [00:02<00:07,  2.42s/it]
Loading checkpoint shards:  25%|███       | 1/4 [00:02<00:07,  2.42s/it]
Loading checkpoint shards:  25%|███       | 1/4 [00:02<00:07,  2.42s/it]
Loading checkpoint shards:  25%|███       | 1/4 [00:02<00:07,  2.42s/it]
Loading checkpoint shards:  25%|███       | 1/4 [00:02<00:07,  2.41s/it]
Loading checkpoint shards:  25%|███       | 1/4 [00:02<00:06,  2.26s/it]
Loading checkpoint shards:  25%|███       | 1/4 [00:02<00:07,  2.57s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:17<00:19,  9.81s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:17<00:19,  9.81s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:17<00:19,  9.74s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:17<00:19,  9.81s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:17<00:19,  9.81s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:17<00:19,  9.81s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:17<00:19,  9.81s/it]
Loading checkpoint shards:  50%|█████     | 2/4 [00:17<00:19,  9.84s/it]
Loading checkpoint shards:  75%|████████  | 3/4 [00:30<00:11, 11.14s/it]
Loading checkpoint shards:  75%|████████  | 3/4 [00:30<00:11, 11.14s/it]
Loading checkpoint shards:  75%|████████  | 3/4 [00:30<00:11, 11.14s/it]
Loading checkpoint shards:  75%|████████  | 3/4 [00:30<00:11, 11.14s/it]
Loading checkpoint shards:  75%|████████  | 3/4 [00:30<00:11, 11.14s/it]
Loading checkpoint shards:  75%|████████  | 3/4 [00:30<00:11, 11.14s/it]
Loading checkpoint shards:  75%|████████  | 3/4 [00:29<00:11, 11.11s/it]
Loading checkpoint shards:  75%|████████  | 3/4 [00:30<00:11, 11.15s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  7.73s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  8.13s/it] |
|
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  7.75s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  8.17s/it] |
|
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  7.75s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  7.75s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  8.17s/it] |
|
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  7.75s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  7.75s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  8.17s/it] |
|
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  8.17s/it] |
|
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  8.17s/it] |
|
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  7.75s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  8.17s/it] |
|
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  7.73s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:32<00:00,  8.17s/it] |
| [INFO|modeling_utils.py:4930] 2025-10-18 06:45:42,606 >> All model checkpoint weights were used when initializing Qwen2ForCausalLM. |
| |
| [INFO|modeling_utils.py:4938] 2025-10-18 06:45:42,606 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/base_model/method6_qwen2.5-7b_qwen3-4b_distill_qwen2.5-7b-it_difficulty-scale_method17. |
| If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training. |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. |
| def forward( |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. |
| def forward( |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. |
| def forward( |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. |
| def forward( |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. |
| def forward( |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. |
| def forward( |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:64: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. |
| def backward(ctx: "torch.autograd.Function", grad_output: "torch.Tensor") -> "torch.Tensor": |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:64: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. |
| def backward(ctx: "torch.autograd.Function", grad_output: "torch.Tensor") -> "torch.Tensor": |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. |
| def forward( |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:64: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. |
| def backward(ctx: "torch.autograd.Function", grad_output: "torch.Tensor") -> "torch.Tensor": |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:64: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. |
| def backward(ctx: "torch.autograd.Function", grad_output: "torch.Tensor") -> "torch.Tensor": |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:64: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. |
| def backward(ctx: "torch.autograd.Function", grad_output: "torch.Tensor") -> "torch.Tensor": |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:64: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. |
| def backward(ctx: "torch.autograd.Function", grad_output: "torch.Tensor") -> "torch.Tensor": |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:64: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. |
| def backward(ctx: "torch.autograd.Function", grad_output: "torch.Tensor") -> "torch.Tensor": |
| [INFO|configuration_utils.py:1095] 2025-10-18 06:45:42,608 >> loading configuration file /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/base_model/method6_qwen2.5-7b_qwen3-4b_distill_qwen2.5-7b-it_difficulty-scale_method17/generation_config.json |
| [INFO|configuration_utils.py:1142] 2025-10-18 06:45:42,608 >> Generate config GenerationConfig { |
| "bos_token_id": 151643, |
| "do_sample": true, |
| "eos_token_id": [ |
| 151645, |
| 151643 |
| ], |
| "pad_token_id": 151643, |
| "repetition_penalty": 1.05, |
| "temperature": 0.7, |
| "top_k": 20, |
| "top_p": 0.8 |
| } |
| |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. |
| def forward( |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/llamafactory/model/model_utils/checkpointing.py:64: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. |
| def backward(ctx: "torch.autograd.Function", grad_output: "torch.Tensor") -> "torch.Tensor": |
| [INFO|2025-10-18 06:45:42] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled. |
| [INFO|2025-10-18 06:45:42] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference. |
| [INFO|2025-10-18 06:45:42] llamafactory.model.adapter:143 >> DeepSpeed ZeRO3 detected, remaining trainable params in float32. |
| [INFO|2025-10-18 06:45:42] llamafactory.model.adapter:143 >> Fine-tuning method: Full |
| [INFO|2025-10-18 06:45:42] llamafactory.model.loader:143 >> trainable params: 7,615,616,512 || all params: 7,615,616,512 || trainable%: 100.0000 |
| [INFO|trainer.py:748] 2025-10-18 06:45:42,648 >> Using auto half precision backend |
| [INFO|deepspeed.py:380] 2025-10-18 06:45:43,067 >> Detected ZeRO Offload and non-DeepSpeed optimizers: This combination should work as long as the custom optimizer has both CPU and GPU implementation (except LAMB) |
| Installed CUDA version 12.3 does not match the version torch was compiled with 12.4 but since the APIs are compatible, accepting this combination |
| Using /root/.cache/torch_extensions/py310_cu124 as PyTorch extensions root... |
| Detected CUDA files, patching ldflags |
| Emitting ninja build file /root/.cache/torch_extensions/py310_cu124/cpu_adam/build.ninja... |
| /mmu_nlp_ssd/makai05/miniconda3/envs/verl/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. |
| If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. |
| warnings.warn( |
| Building extension module cpu_adam... |
| Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) |
| ninja: no work to do. |
| Loading extension module cpu_adam... |
| Time to load cpu_adam op: 2.425875186920166 seconds |
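The repeated `TORCH_CUDA_ARCH_LIST is not set` UserWarning during the `cpu_adam` build can be addressed exactly as the message suggests, by pinning the target architectures before launching. A sketch; the values are examples (8.0 = A100, 9.0 = H100) and must match your actual GPUs:

```shell
# Restrict extension compilation to the archs you actually run on:
export TORCH_CUDA_ARCH_LIST="8.0;9.0"
# Optionally cap ninja's worker count for the build (see the
# "overridable by setting the environment variable MAX_JOBS=N" line above):
export MAX_JOBS=8
```

Set both in the job script before `torchrun`; subsequent runs reuse the cached build under `~/.cache/torch_extensions`, which is why ninja reports "no work to do" here.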
| Adam Optimizer #0 is created with AVX512 arithmetic capability. |
| Config: alpha=0.000005, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1 |
| [2025-10-18 06:45:47,795] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed info: version=0.16.7, git-hash=unknown, git-branch=unknown |
| [2025-10-18 06:45:47,795] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 8 |
| [2025-10-18 06:45:47,804] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False |
| [2025-10-18 06:45:47,805] [INFO] [logging.py:107:log_dist] [Rank 0] Using client Optimizer as basic optimizer |
| [2025-10-18 06:45:47,805] [INFO] [logging.py:107:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer |
| [2025-10-18 06:45:47,818] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam |
| [2025-10-18 06:45:47,818] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'> |
| [2025-10-18 06:45:47,818] [INFO] [logging.py:107:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False |
| [2025-10-18 06:45:47,818] [INFO] [logging.py:107:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer |
| [2025-10-18 06:45:48,100] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning |
| [2025-10-18 06:45:48,101] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 3.05 GB CA 0.0 GB Max_CA 3 GB |
| [2025-10-18 06:45:48,101] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 80.18 GB, percent = 4.0% |
| [2025-10-18 06:45:48,103] [INFO] [stage3.py:170:__init__] Reduce bucket size 12845056 |
| [2025-10-18 06:45:48,103] [INFO] [stage3.py:171:__init__] Prefetch bucket size 11560550 |
| [2025-10-18 06:45:48,355] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin] |
| [2025-10-18 06:45:48,356] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB |
| [2025-10-18 06:45:48,356] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 80.18 GB, percent = 4.0% |
| Parameter Offload: Total persistent parameters: 333312 in 141 params |
| [2025-10-18 06:45:48,621] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end] |
| [2025-10-18 06:45:48,622] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB |
| [2025-10-18 06:45:48,622] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 80.18 GB, percent = 4.0% |
| [2025-10-18 06:45:48,836] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions |
| [2025-10-18 06:45:48,837] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB |
| [2025-10-18 06:45:48,837] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 80.18 GB, percent = 4.0% |
| [2025-10-18 06:45:51,184] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 2 |
| [2025-10-18 06:45:51,185] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB |
| [2025-10-18 06:45:51,186] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 102.07 GB, percent = 5.1% |
| [2025-10-18 06:45:51,455] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions |
| [2025-10-18 06:45:51,456] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB |
| [2025-10-18 06:45:51,456] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 105.93 GB, percent = 5.3% |
| [2025-10-18 06:45:54,718] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions |
| [2025-10-18 06:45:54,719] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB |
| [2025-10-18 06:45:54,719] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 124.92 GB, percent = 6.2% |
| [2025-10-18 06:45:54,956] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states |
| [2025-10-18 06:45:54,956] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB |
| [2025-10-18 06:45:54,957] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 128.81 GB, percent = 6.4% |
| [2025-10-18 06:46:01,399] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states |
| [2025-10-18 06:46:01,400] [INFO] [utils.py:782:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB |
| [2025-10-18 06:46:01,400] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 157.56 GB, percent = 7.8% |
| [2025-10-18 06:46:01,401] [INFO] [stage3.py:534:_setup_for_real_optimizer] optimizer state initialized |
| [2025-10-18 06:46:04,410] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer |
| [2025-10-18 06:46:04,411] [INFO] [utils.py:782:see_memory_usage] MA 0.02 GB Max_MA 2.06 GB CA 2.06 GB Max_CA 2 GB |
| [2025-10-18 06:46:04,411] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 174.14 GB, percent = 8.6% |
| [2025-10-18 06:46:04,411] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3 |
| [2025-10-18 06:46:04,411] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None |
| [2025-10-18 06:46:04,411] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed LR Scheduler = None |
| [2025-10-18 06:46:04,411] [INFO] [logging.py:107:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)] |
| [2025-10-18 06:46:04,412] [INFO] [config.py:1003:print] DeepSpeedEngine configuration: |
| [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] activation_checkpointing_config { |
| "partition_activations": false, |
| "contiguous_memory_optimization": false, |
| "cpu_checkpointing": false, |
| "number_checkpoints": null, |
| "synchronize_checkpoint_boundary": false, |
| "profile": false |
| } |
| [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'intra_op_parallelism': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} |
| [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] amp_enabled .................. False |
| [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] amp_params ................... False |
| [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] autotuning_config ............ { |
| "enabled": false, |
| "start_step": null, |
| "end_step": null, |
| "metric_path": null, |
| "arg_mappings": null, |
| "metric": "throughput", |
| "model_info": null, |
| "results_dir": "autotuning_results", |
| "exps_dir": "autotuning_exps", |
| "overwrite": true, |
| "fast": true, |
| "start_profile_step": 3, |
| "end_profile_step": 5, |
| "tuner_type": "gridsearch", |
| "tuner_early_stopping": 5, |
| "tuner_num_trials": 50, |
| "model_info_path": null, |
| "mp_size": 1, |
| "max_train_batch_size": null, |
| "min_train_batch_size": 1, |
| "max_train_micro_batch_size_per_gpu": 1.024000e+03, |
| "min_train_micro_batch_size_per_gpu": 1, |
| "num_tuning_micro_batch_sizes": 3 |
| } |
| [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] bfloat16_enabled ............. True |
| [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] bfloat16_immediate_grad_update True |
| [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] checkpoint_parallel_write_pipeline False |
| [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] checkpoint_tag_validation_enabled True |
| [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] checkpoint_tag_validation_fail False |
| [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f1037640af0> |
| [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] communication_data_type ...... None |
| [2025-10-18 06:46:04,413] [INFO] [config.py:1007:print] compile_config ............... deepcompile=False free_activation=False offload_activation=False offload_opt_states=False double_buffer=True symmetric_memory=False debug_log=False offload_parameters=False sync_before_reduce=False sync_after_reduce=False sync_before_allgather=False sync_after_allgather=False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] curriculum_enabled_legacy .... False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] curriculum_params_legacy ..... False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'pin_memory': False, 'curriculum_learning': {'enabled': False}, 'dynamic_batching': {'enabled': False, 'lr_scaling_method': 'linear', 'min_batch_size': 1, 'max_batch_size': None, 'sequence_picking_order': 'dataloader', 'verbose': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] data_efficiency_enabled ...... False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] dataloader_drop_last ......... False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] disable_allgather ............ False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] dump_state ................... False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] dynamic_loss_scale_args ...... None |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_enabled ........... False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_gas_boundary_resolution 1 |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_layer_name ........ bert.encoder.layer |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_layer_num ......... 0 |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_max_iter .......... 100 |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_stability ......... 1e-06 |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_tol ............... 0.01 |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] eigenvalue_verbose ........... False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] elasticity_enabled ........... False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] flops_profiler_config ........ { |
| "enabled": false, |
| "recompute_fwd_factor": 0.0, |
| "profile_step": 1, |
| "module_depth": -1, |
| "top_modules": 1, |
| "detailed": true, |
| "output_file": null |
| } |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] fp16_auto_cast ............... None |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] fp16_enabled ................. False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] fp16_master_weights_and_gradients False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] global_rank .................. 0 |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] grad_accum_dtype ............. None |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] gradient_accumulation_steps .. 2 |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] gradient_clipping ............ 1.0 |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] gradient_predivide_factor .... 1.0 |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] graph_harvesting ............. False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] initial_dynamic_scale ........ 1 |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] load_universal_checkpoint .... False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] loss_scale ................... 1.0 |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] memory_breakdown ............. False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] mics_hierarchial_params_gather False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] mics_shard_size .............. -1 |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] nebula_config ................ { |
| "enabled": false, |
| "persistent_storage_path": null, |
| "persistent_time_interval": 100, |
| "num_of_version_in_retention": 2, |
| "enable_nebula_load": true, |
| "load_path": null |
| } |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] optimizer_legacy_fusion ...... False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] optimizer_name ............... None |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] optimizer_params ............. None |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] pld_enabled .................. False |
| [2025-10-18 06:46:04,414] [INFO] [config.py:1007:print] pld_params ................... False |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] prescale_gradients ........... False |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] scheduler_name ............... None |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] scheduler_params ............. None |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] seq_parallel_communication_data_type torch.float32 |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] sparse_attention ............. None |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] sparse_gradients_enabled ..... False |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] steps_per_print .............. inf |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] tensor_parallel_config ....... dtype=torch.float16 autotp_size=0 tp_overlap_comm=False tensor_parallel=TPConfig(tp_size=1, tp_grain_size=1, mpu=None, tp_group=None) injection_policy_tuple=None keep_module_on_host=False replace_with_kernel_inject=False |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] timers_config ................ enabled=True synchronized=True |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] train_batch_size ............. 16 |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] train_micro_batch_size_per_gpu 1 |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] use_data_before_expert_parallel_ False |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] use_node_local_storage ....... False |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] wall_clock_breakdown ......... False |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] weight_quantization_config ... None |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] world_size ................... 8 |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] zero_allow_untested_optimizer True |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=12845056 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=11560550 param_persistence_threshold=35840 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] zero_enabled ................. True |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] zero_force_ds_cpu_optimizer .. True |
| [2025-10-18 06:46:04,415] [INFO] [config.py:1007:print] zero_optimization_stage ...... 3 |
| [2025-10-18 06:46:04,415] [INFO] [config.py:993:print_user_config] json = { |
| "train_batch_size": 16, |
| "train_micro_batch_size_per_gpu": 1, |
| "gradient_accumulation_steps": 2, |
| "gradient_clipping": 1.0, |
| "zero_allow_untested_optimizer": true, |
| "fp16": { |
| "enabled": false, |
| "loss_scale": 0, |
| "loss_scale_window": 1000, |
| "initial_scale_power": 16, |
| "hysteresis": 2, |
| "min_loss_scale": 1 |
| }, |
| "bf16": { |
| "enabled": true |
| }, |
| "zero_optimization": { |
| "stage": 3, |
| "offload_optimizer": { |
| "device": "cpu", |
| "pin_memory": true |
| }, |
| "offload_param": { |
| "device": "cpu", |
| "pin_memory": true |
| }, |
| "overlap_comm": true, |
| "contiguous_gradients": true, |
| "sub_group_size": 1.000000e+09, |
| "reduce_bucket_size": 1.284506e+07, |
| "stage3_prefetch_bucket_size": 1.156055e+07, |
| "stage3_param_persistence_threshold": 3.584000e+04, |
| "stage3_max_live_parameters": 1.000000e+09, |
| "stage3_max_reuse_distance": 1.000000e+09, |
| "stage3_gather_16bit_weights_on_model_save": true |
| }, |
| "steps_per_print": inf |
| } |
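The user config printed above ties three numbers together, and DeepSpeed validates the relationship at startup: the global batch must equal the per-GPU micro batch times gradient accumulation times world size. A quick check with the values from this run:

```python
# Values from the printed DeepSpeed config / trainer banner for this run.
micro_batch_per_gpu = 1          # train_micro_batch_size_per_gpu
gradient_accumulation_steps = 2  # gradient_accumulation_steps
world_size = 8                   # 8 ranks (one per GPU)

# DeepSpeed requires:
#   train_batch_size == micro_batch_per_gpu * grad_accum * world_size
train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * world_size
print(train_batch_size)  # 16, matching "train_batch_size": 16 above
```

If the three factors do not multiply out to `train_batch_size`, DeepSpeed raises a config error before training starts, so this arithmetic is worth verifying whenever the GPU count changes.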
| [INFO|trainer.py:2414] 2025-10-18 06:46:04,417 >> ***** Running training ***** |
| [INFO|trainer.py:2415] 2025-10-18 06:46:04,417 >> Num examples = 15,077 |
| [INFO|trainer.py:2416] 2025-10-18 06:46:04,417 >> Num Epochs = 3 |
| [INFO|trainer.py:2417] 2025-10-18 06:46:04,417 >> Instantaneous batch size per device = 1 |
| [INFO|trainer.py:2420] 2025-10-18 06:46:04,417 >> Total train batch size (w. parallel, distributed & accumulation) = 16 |
| [INFO|trainer.py:2421] 2025-10-18 06:46:04,417 >> Gradient Accumulation steps = 2 |
| [INFO|trainer.py:2422] 2025-10-18 06:46:04,417 >> Total optimization steps = 2,826 |
| [INFO|trainer.py:2423] 2025-10-18 06:46:04,418 >> Number of trainable parameters = 7,615,616,512 |
|
0%| | 0/2826 [00:00<?, ?it/s]
0%| | 1/2826 [00:23<18:42:02, 23.83s/it]
0%| | 2/2826 [00:29<10:23:07, 13.24s/it]
0%| | 3/2826 [00:37<8:33:29, 10.91s/it]
0%| | 4/2826 [00:45<7:37:53, 9.74s/it]
0%| | 5/2826 [00:50<6:19:17, 8.07s/it]
0%| | 6/2826 [00:56<5:40:58, 7.25s/it]
0%| | 7/2826 [01:04<5:48:55, 7.43s/it]
0%| | 8/2826 [01:09<5:21:16, 6.84s/it]
0%| | 9/2826 [01:19<5:55:46, 7.58s/it]
0%| | 10/2826 [01:24<5:30:40, 7.05s/it]
{'loss': 0.741, 'grad_norm': 4.634474754333496, 'learning_rate': 1.5901060070671379e-07, 'epoch': 0.01} |
|
0%| | 10/2826 [01:24<5:30:40, 7.05s/it]
0%| | 11/2826 [01:32<5:41:30, 7.28s/it]
0%| | 12/2826 [01:41<6:00:21, 7.68s/it]
0%| | 13/2826 [01:46<5:26:21, 6.96s/it]
0%| | 14/2826 [01:53<5:18:26, 6.79s/it]
1%| | 15/2826 [01:59<5:20:01, 6.83s/it]
1%| | 16/2826 [02:09<5:53:30, 7.55s/it]
1%| | 17/2826 [02:14<5:18:52, 6.81s/it]
1%| | 18/2826 [02:22<5:39:16, 7.25s/it]
1%| | 19/2826 [02:29<5:35:59, 7.18s/it]
1%| | 20/2826 [02:35<5:21:22, 6.87s/it]
{'loss': 0.5551, 'grad_norm': 2.9002726078033447, 'learning_rate': 3.356890459363958e-07, 'epoch': 0.02} |
|
1%| | 20/2826 [02:35<5:21:22, 6.87s/it]
1%| | 21/2826 [02:41<5:11:04, 6.65s/it]
1%| | 22/2826 [02:46<4:49:05, 6.19s/it]
1%| | 23/2826 [02:52<4:37:04, 5.93s/it]
1%| | 24/2826 [02:58<4:43:15, 6.07s/it]
1%| | 25/2826 [03:03<4:28:32, 5.75s/it]
1%| | 26/2826 [03:09<4:28:21, 5.75s/it]
1%| | 27/2826 [03:17<5:01:43, 6.47s/it]
1%| | 28/2826 [03:22<4:42:11, 6.05s/it]
1%| | 29/2826 [03:29<4:48:31, 6.19s/it]
1%| | 30/2826 [03:35<4:48:33, 6.19s/it]
{'loss': 0.6185, 'grad_norm': 4.242003917694092, 'learning_rate': 5.123674911660778e-07, 'epoch': 0.03} |
|
1%| | 30/2826 [03:35<4:48:33, 6.19s/it]
1%| | 31/2826 [03:40<4:40:17, 6.02s/it]
1%| | 32/2826 [03:48<5:03:57, 6.53s/it]
1%| | 33/2826 [03:54<4:52:24, 6.28s/it]
1%| | 34/2826 [03:59<4:34:31, 5.90s/it]
1%| | 35/2826 [04:05<4:29:57, 5.80s/it]
1%|β | 36/2826 [04:11<4:44:59, 6.13s/it]
1%|β | 37/2826 [04:18<4:45:05, 6.13s/it]
1%|β | 38/2826 [04:23<4:31:56, 5.85s/it]
1%|β | 39/2826 [04:30<4:48:11, 6.20s/it]
1%|β | 40/2826 [04:37<4:56:46, 6.39s/it]
{'loss': 0.6358, 'grad_norm': 3.8156638145446777, 'learning_rate': 6.890459363957598e-07, 'epoch': 0.04} |
|
1%|β | 40/2826 [04:37<4:56:46, 6.39s/it]
1%|β | 41/2826 [04:42<4:41:51, 6.07s/it]
1%|β | 42/2826 [04:47<4:26:53, 5.75s/it]
2%|β | 43/2826 [04:52<4:19:12, 5.59s/it]
2%|β | 44/2826 [05:04<5:44:37, 7.43s/it]
2%|β | 45/2826 [05:10<5:24:05, 6.99s/it]
2%|β | 46/2826 [05:17<5:27:17, 7.06s/it]
2%|β | 47/2826 [05:23<5:09:38, 6.69s/it]
2%|β | 48/2826 [05:29<5:06:01, 6.61s/it]
2%|β | 49/2826 [05:36<5:09:26, 6.69s/it]
2%|β | 50/2826 [05:42<4:51:43, 6.31s/it]
{'loss': 0.5922, 'grad_norm': 3.047624349594116, 'learning_rate': 8.657243816254418e-07, 'epoch': 0.05} |
|
2%|β | 50/2826 [05:42<4:51:43, 6.31s/it]
2%|β | 51/2826 [05:47<4:42:50, 6.12s/it]
2%|β | 52/2826 [05:53<4:35:09, 5.95s/it]
2%|β | 53/2826 [06:00<4:46:37, 6.20s/it]
2%|β | 54/2826 [06:05<4:31:14, 5.87s/it]
2%|β | 55/2826 [06:10<4:20:00, 5.63s/it]
2%|β | 56/2826 [06:15<4:12:35, 5.47s/it]
2%|β | 57/2826 [06:20<4:13:32, 5.49s/it]
2%|β | 58/2826 [06:26<4:07:59, 5.38s/it]
2%|β | 59/2826 [06:32<4:19:03, 5.62s/it]
2%|β | 60/2826 [06:37<4:16:47, 5.57s/it]
{'loss': 0.6282, 'grad_norm': 2.2943954467773438, 'learning_rate': 1.0424028268551239e-06, 'epoch': 0.06} |
|
2%|β | 60/2826 [06:37<4:16:47, 5.57s/it]
2%|β | 61/2826 [06:42<4:09:52, 5.42s/it]
2%|β | 62/2826 [06:50<4:48:21, 6.26s/it]
2%|β | 63/2826 [06:56<4:32:02, 5.91s/it]
2%|β | 64/2826 [07:01<4:19:34, 5.64s/it]
2%|β | 65/2826 [07:06<4:11:27, 5.46s/it]
2%|β | 66/2826 [07:12<4:26:42, 5.80s/it]
2%|β | 67/2826 [07:17<4:17:27, 5.60s/it]
2%|β | 68/2826 [07:23<4:15:42, 5.56s/it]
2%|β | 69/2826 [07:28<4:08:49, 5.42s/it]
2%|β | 70/2826 [07:33<4:10:42, 5.46s/it]
{'loss': 0.5836, 'grad_norm': 2.831937551498413, 'learning_rate': 1.2190812720848057e-06, 'epoch': 0.07} |
|
2%|β | 70/2826 [07:33<4:10:42, 5.46s/it]
3%|β | 71/2826 [07:40<4:21:11, 5.69s/it]
3%|β | 72/2826 [07:45<4:12:07, 5.49s/it]
3%|β | 73/2826 [07:51<4:20:25, 5.68s/it]
3%|β | 74/2826 [07:57<4:30:14, 5.89s/it]
3%|β | 75/2826 [08:02<4:18:55, 5.65s/it]
3%|β | 76/2826 [08:09<4:30:07, 5.89s/it]
3%|β | 77/2826 [08:14<4:21:27, 5.71s/it]
3%|β | 78/2826 [08:19<4:15:34, 5.58s/it]
3%|β | 79/2826 [08:25<4:21:19, 5.71s/it]
3%|β | 80/2826 [08:31<4:25:37, 5.80s/it]
{'loss': 0.5836, 'grad_norm': 3.941297769546509, 'learning_rate': 1.3957597173144876e-06, 'epoch': 0.08}
3%|β | 81/2826 [08:37<4:24:11, 5.77s/it]
3%|β | 82/2826 [08:44<4:36:17, 6.04s/it]
3%|β | 83/2826 [08:49<4:28:40, 5.88s/it]
3%|β | 84/2826 [08:55<4:24:56, 5.80s/it]
3%|β | 85/2826 [09:00<4:15:18, 5.59s/it]
3%|β | 86/2826 [09:06<4:17:26, 5.64s/it]
3%|β | 87/2826 [09:12<4:27:09, 5.85s/it]
3%|β | 88/2826 [09:18<4:30:40, 5.93s/it]
3%|β | 89/2826 [09:24<4:35:16, 6.03s/it]
3%|β | 90/2826 [09:31<4:42:51, 6.20s/it]
{'loss': 0.4983, 'grad_norm': 2.4598379135131836, 'learning_rate': 1.5724381625441699e-06, 'epoch': 0.1}
3%|β | 91/2826 [09:38<4:53:12, 6.43s/it]
3%|β | 92/2826 [09:43<4:35:32, 6.05s/it]
3%|β | 93/2826 [09:48<4:25:22, 5.83s/it]
3%|β | 94/2826 [09:53<4:13:26, 5.57s/it]
3%|β | 95/2826 [09:58<4:06:08, 5.41s/it]
3%|β | 96/2826 [10:05<4:23:20, 5.79s/it]
3%|β | 97/2826 [10:10<4:14:19, 5.59s/it]
3%|β | 98/2826 [10:15<4:07:18, 5.44s/it]
4%|β | 99/2826 [10:21<4:14:18, 5.60s/it]
4%|β | 100/2826 [10:27<4:12:08, 5.55s/it]
{'loss': 0.6057, 'grad_norm': 2.533829927444458, 'learning_rate': 1.7491166077738517e-06, 'epoch': 0.11}
4%|β | 101/2826 [10:33<4:27:27, 5.89s/it]
4%|β | 102/2826 [10:39<4:20:10, 5.73s/it]
4%|β | 103/2826 [10:44<4:18:40, 5.70s/it]
4%|β | 104/2826 [10:49<4:09:49, 5.51s/it]
4%|β | 105/2826 [10:56<4:26:30, 5.88s/it]
4%|β | 106/2826 [11:01<4:15:18, 5.63s/it]
4%|β | 107/2826 [11:07<4:11:56, 5.56s/it]
4%|β | 108/2826 [11:13<4:21:10, 5.77s/it]
4%|β | 109/2826 [11:19<4:21:41, 5.78s/it]
4%|β | 110/2826 [11:26<4:45:02, 6.30s/it]
{'loss': 0.5135, 'grad_norm': 2.412334442138672, 'learning_rate': 1.925795053003534e-06, 'epoch': 0.12}
4%|β | 111/2826 [11:32<4:43:03, 6.26s/it]
4%|β | 112/2826 [11:39<4:41:34, 6.22s/it]
4%|β | 113/2826 [11:44<4:30:27, 5.98s/it]
4%|β | 114/2826 [11:50<4:32:46, 6.03s/it]
4%|β | 115/2826 [11:57<4:48:37, 6.39s/it]
4%|β | 116/2826 [12:03<4:33:35, 6.06s/it]
4%|β | 117/2826 [12:09<4:38:15, 6.16s/it]
4%|β | 118/2826 [12:14<4:24:36, 5.86s/it]
4%|β | 119/2826 [12:20<4:30:32, 6.00s/it]
4%|β | 120/2826 [12:26<4:24:18, 5.86s/it]
{'loss': 0.4844, 'grad_norm': 2.7505877017974854, 'learning_rate': 2.1024734982332157e-06, 'epoch': 0.13}
4%|β | 121/2826 [12:32<4:31:46, 6.03s/it]
4%|β | 122/2826 [12:39<4:33:42, 6.07s/it]
4%|β | 123/2826 [12:47<5:01:24, 6.69s/it]
4%|β | 124/2826 [12:52<4:38:39, 6.19s/it]
4%|β | 125/2826 [12:57<4:22:03, 5.82s/it]
4%|β | 126/2826 [13:02<4:16:06, 5.69s/it]
4%|β | 127/2826 [13:07<4:09:54, 5.56s/it]
5%|β | 128/2826 [13:13<4:08:13, 5.52s/it]
5%|β | 129/2826 [13:18<4:09:08, 5.54s/it]
5%|β | 130/2826 [13:25<4:24:04, 5.88s/it]
{'loss': 0.5386, 'grad_norm': 2.701307535171509, 'learning_rate': 2.279151943462898e-06, 'epoch': 0.14}
5%|β | 131/2826 [13:33<4:48:13, 6.42s/it]
5%|β | 132/2826 [13:39<4:47:26, 6.40s/it]
5%|β | 133/2826 [13:45<4:45:01, 6.35s/it]
5%|β | 134/2826 [13:51<4:39:17, 6.23s/it]
5%|β | 135/2826 [13:56<4:23:01, 5.86s/it]
5%|β | 136/2826 [14:01<4:12:40, 5.64s/it]
5%|β | 137/2826 [14:06<4:05:00, 5.47s/it]
5%|β | 138/2826 [14:12<4:09:47, 5.58s/it]
5%|β | 139/2826 [14:18<4:08:09, 5.54s/it]
5%|β | 140/2826 [14:23<4:02:49, 5.42s/it]
{'loss': 0.4774, 'grad_norm': 2.8261961936950684, 'learning_rate': 2.45583038869258e-06, 'epoch': 0.15}
5%|β | 141/2826 [14:30<4:26:39, 5.96s/it]
5%|β | 142/2826 [14:36<4:27:10, 5.97s/it]
5%|β | 143/2826 [14:42<4:25:40, 5.94s/it]
5%|β | 144/2826 [14:47<4:12:36, 5.65s/it]
5%|β | 145/2826 [14:56<4:54:58, 6.60s/it]
5%|β | 146/2826 [15:02<4:48:52, 6.47s/it]
5%|β | 147/2826 [15:08<4:49:25, 6.48s/it]
5%|β | 148/2826 [15:15<4:48:40, 6.47s/it]
5%|β | 149/2826 [15:20<4:35:42, 6.18s/it]
5%|β | 150/2826 [15:26<4:28:33, 6.02s/it]
{'loss': 0.5035, 'grad_norm': 2.4490256309509277, 'learning_rate': 2.6325088339222617e-06, 'epoch': 0.16}
5%|β | 151/2826 [15:32<4:26:40, 5.98s/it]
5%|β | 152/2826 [15:38<4:34:25, 6.16s/it]
5%|β | 153/2826 [15:44<4:22:26, 5.89s/it]
5%|β | 154/2826 [15:49<4:14:32, 5.72s/it]
5%|β | 155/2826 [15:54<4:06:51, 5.55s/it]
6%|β | 156/2826 [16:00<4:03:41, 5.48s/it]
6%|β | 157/2826 [16:05<3:58:43, 5.37s/it]
6%|β | 158/2826 [16:11<4:06:59, 5.55s/it]
6%|β | 159/2826 [16:16<4:02:55, 5.47s/it]
6%|β | 160/2826 [16:22<4:14:53, 5.74s/it]
{'loss': 0.4897, 'grad_norm': 2.418158769607544, 'learning_rate': 2.8091872791519436e-06, 'epoch': 0.17}
6%|β | 161/2826 [16:28<4:17:35, 5.80s/it]
6%|β | 162/2826 [16:33<4:06:56, 5.56s/it]
6%|β | 163/2826 [16:39<4:11:43, 5.67s/it]
6%|β | 164/2826 [16:45<4:12:17, 5.69s/it]
6%|β | 165/2826 [16:50<4:04:39, 5.52s/it]
6%|β | 166/2826 [16:55<4:02:03, 5.46s/it]
6%|β | 167/2826 [17:01<4:03:45, 5.50s/it]
6%|β | 168/2826 [17:07<4:17:27, 5.81s/it]
6%|β | 169/2826 [17:13<4:20:06, 5.87s/it]
6%|β | 170/2826 [17:19<4:09:41, 5.64s/it]
{'loss': 0.5196, 'grad_norm': 3.5972161293029785, 'learning_rate': 2.985865724381626e-06, 'epoch': 0.18}
6%|β | 171/2826 [17:24<4:02:06, 5.47s/it]
6%|β | 172/2826 [17:30<4:13:47, 5.74s/it]
6%|β | 173/2826 [17:36<4:19:58, 5.88s/it]
6%|β | 174/2826 [17:42<4:17:08, 5.82s/it]
6%|β | 175/2826 [17:48<4:24:57, 6.00s/it]
6%|β | 176/2826 [17:54<4:19:20, 5.87s/it]
6%|β | 177/2826 [18:00<4:28:56, 6.09s/it]
6%|β | 178/2826 [18:07<4:33:44, 6.20s/it]
6%|β | 179/2826 [18:12<4:21:13, 5.92s/it]
6%|β | 180/2826 [18:18<4:13:45, 5.75s/it]
{'loss': 0.4791, 'grad_norm': 2.814927577972412, 'learning_rate': 3.162544169611308e-06, 'epoch': 0.19}
6%|β | 181/2826 [18:23<4:14:05, 5.76s/it]
6%|β | 182/2826 [18:29<4:07:06, 5.61s/it]
6%|β | 183/2826 [18:34<4:02:13, 5.50s/it]
7%|β | 184/2826 [18:40<4:05:21, 5.57s/it]
7%|β | 185/2826 [18:45<4:04:42, 5.56s/it]
7%|β | 186/2826 [18:50<4:01:25, 5.49s/it]
7%|β | 187/2826 [18:56<4:08:28, 5.65s/it]
7%|β | 188/2826 [19:02<4:06:17, 5.60s/it]
7%|β | 189/2826 [19:08<4:06:21, 5.61s/it]
7%|β | 190/2826 [19:13<4:00:02, 5.46s/it]
{'loss': 0.5024, 'grad_norm': 2.6151270866394043, 'learning_rate': 3.3392226148409896e-06, 'epoch': 0.2}
7%|β | 191/2826 [19:18<3:54:38, 5.34s/it]
7%|β | 192/2826 [19:23<3:58:13, 5.43s/it]
7%|β | 193/2826 [19:29<4:00:43, 5.49s/it]
7%|β | 194/2826 [19:34<3:56:46, 5.40s/it]
7%|β | 195/2826 [19:40<4:07:26, 5.64s/it]
7%|β | 196/2826 [19:46<4:08:45, 5.68s/it]
7%|β | 197/2826 [19:53<4:17:45, 5.88s/it]
7%|β | 198/2826 [19:58<4:10:06, 5.71s/it]
7%|β | 199/2826 [20:04<4:20:17, 5.95s/it]
7%|β | 200/2826 [20:10<4:23:09, 6.01s/it]
{'loss': 0.5781, 'grad_norm': 2.8331387042999268, 'learning_rate': 3.5159010600706715e-06, 'epoch': 0.21}
7%|β | 201/2826 [20:17<4:31:53, 6.21s/it]
7%|β | 202/2826 [20:22<4:16:04, 5.86s/it]
7%|β | 203/2826 [20:28<4:20:29, 5.96s/it]
7%|β | 204/2826 [20:34<4:13:59, 5.81s/it]
7%|β | 205/2826 [20:39<4:09:38, 5.71s/it]
7%|β | 206/2826 [20:45<4:13:49, 5.81s/it]
7%|β | 207/2826 [20:52<4:17:59, 5.91s/it]
7%|β | 208/2826 [20:59<4:33:55, 6.28s/it]
7%|β | 209/2826 [21:04<4:20:06, 5.96s/it]
7%|β | 210/2826 [21:09<4:09:27, 5.72s/it]
{'loss': 0.4186, 'grad_norm': 2.433027744293213, 'learning_rate': 3.6925795053003538e-06, 'epoch': 0.22}
7%|β | 211/2826 [21:15<4:09:03, 5.71s/it]
8%|β | 212/2826 [21:21<4:21:42, 6.01s/it]
8%|β | 213/2826 [21:27<4:14:17, 5.84s/it]
8%|β | 214/2826 [21:32<4:06:21, 5.66s/it]
8%|β | 215/2826 [21:37<3:58:03, 5.47s/it]
8%|β | 216/2826 [21:43<3:59:05, 5.50s/it]
8%|β | 217/2826 [21:48<4:00:39, 5.53s/it]
8%|β | 218/2826 [21:54<3:59:22, 5.51s/it]
8%|β | 219/2826 [21:59<3:54:55, 5.41s/it]
8%|β | 220/2826 [22:04<3:55:50, 5.43s/it]
{'loss': 0.4819, 'grad_norm': 2.671696186065674, 'learning_rate': 3.869257950530036e-06, 'epoch': 0.23}
8%|β | 221/2826 [22:10<4:03:25, 5.61s/it]
8%|β | 222/2826 [22:16<4:02:32, 5.59s/it]
8%|β | 223/2826 [22:24<4:37:17, 6.39s/it]
8%|β | 224/2826 [22:30<4:29:46, 6.22s/it]
8%|β | 225/2826 [22:35<4:16:51, 5.93s/it]
8%|β | 226/2826 [22:41<4:17:50, 5.95s/it]
8%|β | 227/2826 [22:46<4:06:36, 5.69s/it]
8%|β | 228/2826 [22:52<3:59:26, 5.53s/it]
8%|β | 229/2826 [22:57<3:55:52, 5.45s/it]
8%|β | 230/2826 [23:02<3:58:37, 5.52s/it]
{'loss': 0.547, 'grad_norm': 2.5337982177734375, 'learning_rate': 4.045936395759718e-06, 'epoch': 0.24}
8%|β | 231/2826 [23:10<4:23:10, 6.09s/it]
8%|β | 232/2826 [23:16<4:17:01, 5.94s/it]
8%|β | 233/2826 [23:21<4:09:29, 5.77s/it]
8%|β | 234/2826 [23:28<4:22:36, 6.08s/it]
8%|β | 235/2826 [23:33<4:11:19, 5.82s/it]
8%|β | 236/2826 [23:39<4:17:55, 5.98s/it]
8%|β | 237/2826 [23:45<4:11:50, 5.84s/it]
8%|β | 238/2826 [23:50<4:08:01, 5.75s/it]
8%|β | 239/2826 [23:56<4:13:11, 5.87s/it]
8%|β | 240/2826 [24:02<4:14:32, 5.91s/it]
{'loss': 0.5603, 'grad_norm': 2.2034990787506104, 'learning_rate': 4.222614840989399e-06, 'epoch': 0.25}
9%|β | 241/2826 [24:08<4:04:06, 5.67s/it]
9%|β | 242/2826 [24:14<4:16:47, 5.96s/it]
9%|β | 243/2826 [24:20<4:17:00, 5.97s/it]
9%|β | 244/2826 [24:27<4:21:32, 6.08s/it]
9%|β | 245/2826 [24:32<4:13:30, 5.89s/it]
9%|β | 246/2826 [24:37<4:01:58, 5.63s/it]
9%|β | 247/2826 [24:42<3:55:58, 5.49s/it]
9%|β | 248/2826 [24:47<3:50:30, 5.36s/it]
9%|β | 249/2826 [24:54<4:14:51, 5.93s/it]
9%|β | 250/2826 [25:00<4:07:52, 5.77s/it]
{'loss': 0.4483, 'grad_norm': 2.2893121242523193, 'learning_rate': 4.399293286219082e-06, 'epoch': 0.27}
9%|β | 251/2826 [25:05<3:56:30, 5.51s/it]
9%|β | 252/2826 [25:10<3:49:51, 5.36s/it]
9%|β | 253/2826 [25:15<3:46:14, 5.28s/it]
9%|β | 254/2826 [25:21<3:56:34, 5.52s/it]
9%|β | 255/2826 [25:27<4:04:17, 5.70s/it]
9%|β | 256/2826 [25:33<4:06:25, 5.75s/it]
9%|β | 257/2826 [25:39<4:08:51, 5.81s/it]
9%|β | 258/2826 [25:44<3:59:43, 5.60s/it]
9%|β | 259/2826 [25:49<3:55:48, 5.51s/it]
9%|β | 260/2826 [25:55<3:53:48, 5.47s/it]
{'loss': 0.5178, 'grad_norm': 1.8757219314575195, 'learning_rate': 4.575971731448763e-06, 'epoch': 0.28}
9%|β | 261/2826 [26:00<3:55:24, 5.51s/it]
9%|β | 262/2826 [26:05<3:49:49, 5.38s/it]
9%|β | 263/2826 [26:11<3:56:35, 5.54s/it]
9%|β | 264/2826 [26:17<3:55:06, 5.51s/it]
9%|β | 265/2826 [26:22<3:50:02, 5.39s/it]
9%|β | 266/2826 [26:27<3:49:46, 5.39s/it]
9%|β | 267/2826 [26:33<4:00:44, 5.64s/it]
9%|β | 268/2826 [26:39<3:54:51, 5.51s/it]
10%|β | 269/2826 [26:44<3:52:03, 5.45s/it]
10%|β | 270/2826 [26:51<4:12:07, 5.92s/it]
{'loss': 0.5264, 'grad_norm': 2.3748602867126465, 'learning_rate': 4.752650176678445e-06, 'epoch': 0.29}
10%|β | 271/2826 [26:56<4:06:02, 5.78s/it]
10%|β | 272/2826 [27:03<4:11:26, 5.91s/it]
10%|β | 273/2826 [27:09<4:15:58, 6.02s/it]
10%|β | 274/2826 [27:15<4:13:23, 5.96s/it]
10%|β | 275/2826 [27:20<4:02:08, 5.70s/it]
10%|β | 276/2826 [27:27<4:16:38, 6.04s/it]
10%|β | 277/2826 [27:32<4:05:47, 5.79s/it]
10%|β | 278/2826 [27:38<4:04:57, 5.77s/it]
10%|β | 279/2826 [27:44<4:13:22, 5.97s/it]
10%|β | 280/2826 [27:50<4:10:21, 5.90s/it]
{'loss': 0.5124, 'grad_norm': 3.0481033325195312, 'learning_rate': 4.929328621908128e-06, 'epoch': 0.3}
10%|β | 281/2826 [27:56<4:16:44, 6.05s/it]
10%|β | 282/2826 [28:02<4:15:20, 6.02s/it]
10%|β | 283/2826 [28:09<4:24:00, 6.23s/it]
10%|β | 284/2826 [28:14<4:14:50, 6.02s/it]
10%|β | 285/2826 [28:20<4:06:38, 5.82s/it]
10%|β | 286/2826 [28:25<4:00:40, 5.69s/it]
10%|β | 287/2826 [28:30<3:56:00, 5.58s/it]
10%|β | 288/2826 [28:37<4:08:47, 5.88s/it]
10%|β | 289/2826 [28:42<4:00:37, 5.69s/it]
10%|β | 290/2826 [28:47<3:51:46, 5.48s/it]
{'loss': 0.4977, 'grad_norm': 2.682847023010254, 'learning_rate': 4.99993132201408e-06, 'epoch': 0.31}
10%|β | 291/2826 [28:53<3:51:06, 5.47s/it]
10%|β | 292/2826 [28:58<3:51:50, 5.49s/it]
10%|β | 293/2826 [29:05<4:07:39, 5.87s/it]
10%|β | 294/2826 [29:11<4:07:28, 5.86s/it]
10%|β | 295/2826 [29:16<4:02:30, 5.75s/it]
10%|β | 296/2826 [29:21<3:51:53, 5.50s/it]
11%|β | 297/2826 [29:28<4:03:10, 5.77s/it]
11%|β | 298/2826 [29:33<3:55:54, 5.60s/it]
11%|β | 299/2826 [29:38<3:51:48, 5.50s/it]
11%|β | 300/2826 [29:45<4:03:29, 5.78s/it]
{'loss': 0.5005, 'grad_norm': 2.472842216491699, 'learning_rate': 4.9995116368759e-06, 'epoch': 0.32}
11%|β | 301/2826 [29:51<4:09:56, 5.94s/it]
11%|β | 302/2826 [29:57<4:16:26, 6.10s/it]
11%|β | 303/2826 [30:04<4:20:03, 6.18s/it]
11%|β | 304/2826 [30:10<4:19:04, 6.16s/it]
11%|β | 305/2826 [30:16<4:16:51, 6.11s/it]
11%|β | 306/2826 [30:22<4:12:04, 6.00s/it]
11%|β | 307/2826 [30:31<4:49:36, 6.90s/it]
11%|β | 308/2826 [30:36<4:37:29, 6.61s/it]
11%|β | 309/2826 [30:42<4:18:02, 6.15s/it]
11%|β | 310/2826 [30:47<4:15:14, 6.09s/it]
{'loss': 0.4857, 'grad_norm': 2.582815647125244, 'learning_rate': 4.998710485009401e-06, 'epoch': 0.33}
11%|β | 311/2826 [30:53<4:12:00, 6.01s/it]
11%|β | 312/2826 [30:59<4:05:28, 5.86s/it]
11%|β | 313/2826 [31:04<3:56:02, 5.64s/it]
11%|β | 314/2826 [31:09<3:50:40, 5.51s/it]
11%|β | 315/2826 [31:15<3:55:37, 5.63s/it]
11%|β | 316/2826 [31:20<3:49:21, 5.48s/it]
11%|β | 317/2826 [31:27<4:05:01, 5.86s/it]
11%|ββ | 318/2826 [31:33<4:05:35, 5.88s/it]
11%|ββ | 319/2826 [31:38<3:58:38, 5.71s/it]
11%|ββ | 320/2826 [31:44<4:02:17, 5.80s/it]
{'loss': 0.4637, 'grad_norm': 2.3572824001312256, 'learning_rate': 4.99752798868358e-06, 'epoch': 0.34}
11%|ββ | 321/2826 [31:50<4:01:26, 5.78s/it]
11%|ββ | 322/2826 [31:56<4:05:42, 5.89s/it]
11%|ββ | 323/2826 [32:03<4:19:27, 6.22s/it]
11%|ββ | 324/2826 [32:09<4:20:20, 6.24s/it]
12%|ββ | 325/2826 [32:15<4:07:38, 5.94s/it]
12%|ββ | 326/2826 [32:21<4:14:54, 6.12s/it]
12%|ββ | 327/2826 [32:26<4:02:33, 5.82s/it]
12%|ββ | 328/2826 [32:31<3:54:28, 5.63s/it]
12%|ββ | 329/2826 [32:37<3:59:40, 5.76s/it]
12%|ββ | 330/2826 [32:43<3:54:50, 5.65s/it]
{'loss': 0.4775, 'grad_norm': 2.3432295322418213, 'learning_rate': 4.99596432836689e-06, 'epoch': 0.35}
12%|ββ | 331/2826 [32:48<3:48:56, 5.51s/it]
12%|ββ | 332/2826 [32:53<3:43:28, 5.38s/it]
12%|ββ | 333/2826 [33:00<3:59:21, 5.76s/it]
12%|ββ | 334/2826 [33:05<3:50:32, 5.55s/it]
12%|ββ | 335/2826 [33:10<3:50:33, 5.55s/it]
12%|ββ | 336/2826 [33:16<3:56:10, 5.69s/it]
12%|ββ | 337/2826 [33:21<3:48:12, 5.50s/it]
12%|ββ | 338/2826 [33:29<4:08:49, 6.00s/it]
12%|ββ | 339/2826 [33:34<4:04:46, 5.91s/it]
12%|ββ | 340/2826 [33:39<3:53:35, 5.64s/it]
{'loss': 0.5779, 'grad_norm': 2.7486777305603027, 'learning_rate': 4.994019742699705e-06, 'epoch': 0.36}
12%|ββ | 341/2826 [33:45<3:53:09, 5.63s/it]
12%|ββ | 342/2826 [33:52<4:12:17, 6.09s/it]
12%|ββ | 343/2826 [33:58<4:08:33, 6.01s/it]
12%|ββ | 344/2826 [34:03<3:55:55, 5.70s/it]
12%|ββ | 345/2826 [34:10<4:11:33, 6.08s/it]
12%|ββ | 346/2826 [34:15<4:02:03, 5.86s/it]
12%|ββ | 347/2826 [34:21<3:57:32, 5.75s/it]
12%|ββ | 348/2826 [34:26<3:51:23, 5.60s/it]
12%|ββ | 349/2826 [34:32<3:59:55, 5.81s/it]
12%|ββ | 350/2826 [34:37<3:52:14, 5.63s/it]
{'loss': 0.5057, 'grad_norm': 2.3831562995910645, 'learning_rate': 4.991694528457891e-06, 'epoch': 0.37}
12%|ββ | 351/2826 [34:43<3:48:23, 5.54s/it]
12%|ββ | 352/2826 [34:49<4:01:40, 5.86s/it]
12%|ββ | 353/2826 [34:55<3:53:33, 5.67s/it]
13%|ββ | 354/2826 [35:01<3:57:47, 5.77s/it]
13%|ββ | 355/2826 [35:08<4:21:59, 6.36s/it]
13%|ββ | 356/2826 [35:14<4:18:34, 6.28s/it]
13%|ββ | 357/2826 [35:20<4:03:13, 5.91s/it]
13%|ββ | 358/2826 [35:25<3:53:06, 5.67s/it]
13%|ββ | 359/2826 [35:30<3:50:19, 5.60s/it]
13%|ββ | 360/2826 [35:37<4:01:45, 5.88s/it]
{'loss': 0.5313, 'grad_norm': 2.5414721965789795, 'learning_rate': 4.988989040507518e-06, 'epoch': 0.38}
13%|ββ | 361/2826 [35:42<3:54:48, 5.72s/it]
13%|ββ | 362/2826 [35:49<4:13:39, 6.18s/it]
13%|ββ | 363/2826 [35:55<4:05:20, 5.98s/it]
13%|ββ | 364/2826 [36:01<4:03:14, 5.93s/it]
13%|ββ | 365/2826 [36:06<3:56:21, 5.76s/it]
13%|ββ | 366/2826 [36:12<4:02:19, 5.91s/it]
13%|ββ | 367/2826 [36:18<4:03:11, 5.93s/it]
13%|ββ | 368/2826 [36:23<3:52:32, 5.68s/it]
13%|ββ | 369/2826 [36:29<3:56:29, 5.78s/it]
13%|ββ | 370/2826 [36:35<3:54:38, 5.73s/it]
{'loss': 0.4441, 'grad_norm': 2.4140472412109375, 'learning_rate': 4.985903691750697e-06, 'epoch': 0.39}
13%|ββ | 371/2826 [36:41<3:58:25, 5.83s/it]
13%|ββ | 372/2826 [36:46<3:49:17, 5.61s/it]
13%|ββ | 373/2826 [36:52<3:58:00, 5.82s/it]
13%|ββ | 374/2826 [36:58<3:55:29, 5.76s/it]
13%|ββ | 375/2826 [37:06<4:26:54, 6.53s/it]
13%|ββ | 376/2826 [37:11<4:09:42, 6.12s/it]
13%|ββ | 377/2826 [37:18<4:10:37, 6.14s/it]
13%|ββ | 378/2826 [37:23<4:01:09, 5.91s/it]
13%|ββ | 379/2826 [37:28<3:51:10, 5.67s/it]
13%|ββ | 380/2826 [37:33<3:43:35, 5.48s/it]
{'loss': 0.4778, 'grad_norm': 2.4907593727111816, 'learning_rate': 4.982438953062572e-06, 'epoch': 0.4}
13%|ββ | 381/2826 [37:39<3:50:14, 5.65s/it]
14%|ββ | 382/2826 [37:44<3:42:51, 5.47s/it]
14%|ββ | 383/2826 [37:52<4:05:57, 6.04s/it]
14%|ββ | 384/2826 [37:59<4:27:26, 6.57s/it]
14%|ββ | 385/2826 [38:05<4:14:52, 6.26s/it]
14%|ββ | 386/2826 [38:11<4:09:40, 6.14s/it]
14%|ββ | 387/2826 [38:17<4:14:10, 6.25s/it]
14%|ββ | 388/2826 [38:23<4:05:39, 6.05s/it]
14%|ββ | 389/2826 [38:29<4:11:46, 6.20s/it]
14%|ββ | 390/2826 [38:35<4:00:58, 5.94s/it]
{'loss': 0.4848, 'grad_norm': 2.579932928085327, 'learning_rate': 4.978595353219449e-06, 'epoch': 0.41}
14%|ββ | 391/2826 [38:40<3:54:40, 5.78s/it]
14%|ββ | 392/2826 [38:46<3:49:40, 5.66s/it]
14%|ββ | 393/2826 [38:52<4:03:09, 6.00s/it]
14%|ββ | 394/2826 [38:57<3:50:05, 5.68s/it]
14%|ββ | 395/2826 [39:04<4:01:48, 5.97s/it]
14%|ββ | 396/2826 [39:11<4:12:46, 6.24s/it]
14%|ββ | 397/2826 [39:16<4:03:52, 6.02s/it]
14%|ββ | 398/2826 [39:23<4:07:49, 6.12s/it]
14%|ββ | 399/2826 [39:28<3:55:41, 5.83s/it]
14%|ββ | 400/2826 [39:33<3:51:53, 5.74s/it]
{'loss': 0.4891, 'grad_norm': 2.5512266159057617, 'learning_rate': 4.974373478818098e-06, 'epoch': 0.42}
14%|ββ | 401/2826 [39:39<3:46:08, 5.60s/it]
14%|ββ | 402/2826 [39:44<3:40:59, 5.47s/it]
14%|ββ | 403/2826 [39:49<3:42:37, 5.51s/it]
14%|ββ | 404/2826 [39:54<3:37:15, 5.38s/it]
14%|ββ | 405/2826 [40:00<3:42:19, 5.51s/it]
14%|ββ | 406/2826 [40:06<3:50:04, 5.70s/it]
14%|ββ | 407/2826 [40:12<3:45:21, 5.59s/it]
14%|ββ | 408/2826 [40:20<4:14:48, 6.32s/it]
14%|ββ | 409/2826 [40:26<4:18:53, 6.43s/it]
15%|ββ | 410/2826 [40:32<4:12:26, 6.27s/it]
{'loss': 0.4954, 'grad_norm': 2.3293063640594482, 'learning_rate': 4.969773974186235e-06, 'epoch': 0.44}
15%|ββ | 411/2826 [40:37<3:57:24, 5.90s/it]
15%|ββ | 412/2826 [40:42<3:46:52, 5.64s/it]
15%|ββ | 413/2826 [40:48<3:44:12, 5.58s/it]
15%|ββ | 414/2826 [40:53<3:44:44, 5.59s/it]
15%|ββ | 415/2826 [41:00<3:53:09, 5.80s/it]
15%|ββ | 416/2826 [41:06<3:56:39, 5.89s/it]
15%|ββ | 417/2826 [41:11<3:47:26, 5.66s/it]
15%|ββ | 418/2826 [41:17<3:54:30, 5.84s/it]
15%|ββ | 419/2826 [41:22<3:45:20, 5.62s/it]
15%|ββ | 420/2826 [41:28<3:45:17, 5.62s/it]
{'loss': 0.5353, 'grad_norm': 2.6347479820251465, 'learning_rate': 4.964797541284175e-06, 'epoch': 0.45}
15%|ββ | 421/2826 [41:35<4:03:45, 6.08s/it]
15%|ββ | 422/2826 [41:40<3:50:48, 5.76s/it]
15%|ββ | 423/2826 [41:47<4:04:43, 6.11s/it]
15%|ββ | 424/2826 [41:52<3:53:13, 5.83s/it]
15%|ββ | 425/2826 [41:57<3:45:27, 5.63s/it]
15%|ββ | 426/2826 [42:02<3:38:24, 5.46s/it]
15%|ββ | 427/2826 [42:08<3:43:24, 5.59s/it]
15%|ββ | 428/2826 [42:14<3:39:38, 5.50s/it]
15%|ββ | 429/2826 [42:19<3:35:08, 5.39s/it]
15%|ββ | 430/2826 [42:25<3:44:05, 5.61s/it]
{'loss': 0.5726, 'grad_norm': 2.7719151973724365, 'learning_rate': 4.959444939597712e-06, 'epoch': 0.46}
15%|ββ | 431/2826 [42:31<3:48:34, 5.73s/it]
15%|ββ | 432/2826 [42:37<3:51:21, 5.80s/it]
15%|ββ | 433/2826 [42:42<3:47:31, 5.70s/it]
15%|ββ | 434/2826 [42:49<3:53:20, 5.85s/it]
15%|ββ | 435/2826 [42:56<4:07:34, 6.21s/it]
15%|ββ | 436/2826 [43:01<3:54:23, 5.88s/it]
15%|ββ | 437/2826 [43:08<4:05:38, 6.17s/it]
15%|ββ | 438/2826 [43:13<3:52:01, 5.83s/it]
16%|ββ | 439/2826 [43:18<3:47:20, 5.71s/it]
16%|ββ | 440/2826 [43:24<3:48:23, 5.74s/it]
{'loss': 0.5642, 'grad_norm': 2.1757211685180664, 'learning_rate': 4.953716986022204e-06, 'epoch': 0.47}
16%|ββ | 441/2826 [43:31<4:04:01, 6.14s/it]
16%|ββ | 442/2826 [43:37<3:58:23, 6.00s/it]
16%|ββ | 443/2826 [43:42<3:53:30, 5.88s/it]
16%|ββ | 444/2826 [43:48<3:52:52, 5.87s/it]
16%|ββ | 445/2826 [43:54<3:52:04, 5.85s/it]
16%|ββ | 446/2826 [44:01<4:04:00, 6.15s/it]
16%|ββ | 447/2826 [44:07<4:03:58, 6.15s/it]
16%|ββ | 448/2826 [44:12<3:56:22, 5.96s/it]
16%|ββ | 449/2826 [44:18<3:48:29, 5.77s/it]
16%|ββ | 450/2826 [44:23<3:43:41, 5.65s/it]
{'loss': 0.4429, 'grad_norm': 2.432244300842285, 'learning_rate': 4.947614554737904e-06, 'epoch': 0.48}
16%|ββ | 451/2826 [44:29<3:41:26, 5.59s/it]
16%|ββ | 452/2826 [44:35<3:55:20, 5.95s/it]
16%|ββ | 453/2826 [44:41<3:54:21, 5.93s/it]
16%|ββ | 454/2826 [44:47<3:55:45, 5.96s/it]
16%|ββ | 455/2826 [44:52<3:44:48, 5.69s/it]
16%|ββ | 456/2826 [44:57<3:38:37, 5.53s/it]
16%|ββ | 457/2826 [45:05<4:00:57, 6.10s/it]
16%|ββ | 458/2826 [45:11<3:55:45, 5.97s/it]
16%|ββ | 459/2826 [45:16<3:45:13, 5.71s/it]
16%|ββ | 460/2826 [45:22<3:48:47, 5.80s/it]
{'loss': 0.4683, 'grad_norm': 1.972844123840332, 'learning_rate': 4.941138577076538e-06, 'epoch': 0.49}
16%|ββ | 461/2826 [45:29<4:08:32, 6.31s/it]
16%|ββ | 462/2826 [45:35<4:00:35, 6.11s/it]
16%|ββ | 463/2826 [45:42<4:18:41, 6.57s/it]
16%|ββ | 464/2826 [45:49<4:20:29, 6.62s/it]
16%|ββ | 465/2826 [45:54<4:03:24, 6.19s/it]
16%|ββ | 466/2826 [46:01<4:05:41, 6.25s/it]
17%|ββ | 467/2826 [46:07<4:01:02, 6.13s/it]
17%|ββ | 468/2826 [46:13<4:09:24, 6.35s/it]
17%|ββ | 469/2826 [46:19<3:54:58, 5.98s/it]
17%|ββ | 470/2826 [46:24<3:46:25, 5.77s/it]
{'loss': 0.4385, 'grad_norm': 2.484992742538452, 'learning_rate': 4.934290041379182e-06, 'epoch': 0.5}
17%|ββ | 471/2826 [46:31<4:00:41, 6.13s/it]
17%|ββ | 472/2826 [46:37<3:57:47, 6.06s/it]
17%|ββ | 473/2826 [46:42<3:50:59, 5.89s/it]
17%|ββ | 474/2826 [46:48<3:44:01, 5.72s/it]
17%|ββ | 475/2826 [46:53<3:42:44, 5.68s/it]
17%|ββ | 476/2826 [46:59<3:48:07, 5.82s/it]
17%|ββ | 477/2826 [47:04<3:40:30, 5.63s/it]
17%|ββ | 478/2826 [47:10<3:36:48, 5.54s/it]
17%|ββ | 479/2826 [47:15<3:31:28, 5.41s/it]
17%|ββ | 480/2826 [47:22<3:50:51, 5.90s/it]
{'loss': 0.4935, 'grad_norm': 2.0424418449401855, 'learning_rate': 4.92706999284541e-06, 'epoch': 0.51}
17%|ββ | 481/2826 [47:27<3:44:52, 5.75s/it]
17%|ββ | 482/2826 [47:34<3:58:19, 6.10s/it]
17%|ββ | 483/2826 [47:40<3:58:41, 6.11s/it]
17%|ββ | 484/2826 [47:46<3:51:25, 5.93s/it]
17%|ββ | 485/2826 [47:51<3:41:43, 5.68s/it]
17%|ββ | 486/2826 [47:57<3:44:23, 5.75s/it]
17%|ββ | 487/2826 [48:03<3:52:34, 5.97s/it]
17%|ββ | 488/2826 [48:10<4:00:52, 6.18s/it]
17%|ββ | 489/2826 [48:15<3:51:08, 5.93s/it]
17%|ββ | 490/2826 [48:21<3:46:56, 5.83s/it]
{'loss': 0.4548, 'grad_norm': 2.3754308223724365, 'learning_rate': 4.9194795333737925e-06, 'epoch': 0.52}
17%|ββ | 491/2826 [48:30<4:17:57, 6.63s/it]
17%|ββ | 492/2826 [48:36<4:13:59, 6.53s/it]
17%|ββ | 493/2826 [48:42<4:10:37, 6.45s/it]
17%|ββ | 494/2826 [48:48<4:01:09, 6.20s/it]
18%|ββ | 495/2826 [48:53<3:49:27, 5.91s/it]
18%|ββ | 496/2826 [48:58<3:40:05, 5.67s/it]
18%|ββ | 497/2826 [49:05<3:50:25, 5.94s/it]
18%|ββ | 498/2826 [49:10<3:41:06, 5.70s/it]
18%|ββ | 499/2826 [49:16<3:45:16, 5.81s/it]
18%|ββ | 500/2826 [49:21<3:36:56, 5.60s/it]
{'loss': 0.5486, 'grad_norm': 3.0801432132720947, 'learning_rate': 4.911519821393718e-06, 'epoch': 0.53}
18%|ββ | 501/2826 [49:26<3:32:25, 5.48s/it]
18%|ββ | 502/2826 [49:32<3:32:22, 5.48s/it]
18%|ββ | 503/2826 [49:38<3:45:29, 5.82s/it]
18%|ββ | 504/2826 [49:45<3:54:36, 6.06s/it]
18%|ββ | 505/2826 [49:51<3:51:05, 5.97s/it]
18%|ββ | 506/2826 [49:56<3:41:08, 5.72s/it]
18%|ββ | 507/2826 [50:02<3:43:18, 5.78s/it]
18%|ββ | 508/2826 [50:07<3:37:37, 5.63s/it]
18%|ββ | 509/2826 [50:12<3:31:29, 5.48s/it]
18%|ββ | 510/2826 [50:17<3:27:07, 5.37s/it]
{'loss': 0.5121, 'grad_norm': 2.2712507247924805, 'learning_rate': 4.9031920716886035e-06, 'epoch': 0.54}
18%|ββ | 511/2826 [50:22<3:26:13, 5.35s/it]
18%|ββ | 512/2826 [50:28<3:30:05, 5.45s/it]
18%|ββ | 513/2826 [50:33<3:25:15, 5.32s/it]
18%|ββ | 514/2826 [50:38<3:23:00, 5.27s/it]
18%|ββ | 515/2826 [50:45<3:41:37, 5.75s/it]
18%|ββ | 516/2826 [50:51<3:46:24, 5.88s/it]
18%|ββ | 517/2826 [50:57<3:38:45, 5.68s/it]
18%|ββ | 518/2826 [51:04<3:57:15, 6.17s/it]
18%|ββ | 519/2826 [51:10<4:00:33, 6.26s/it]
18%|ββ | 520/2826 [51:16<3:56:14, 6.15s/it]
{'loss': 0.4495, 'grad_norm': 2.0000548362731934, 'learning_rate': 4.894497555210499e-06, 'epoch': 0.55}
18%|ββ | 521/2826 [51:21<3:43:53, 5.83s/it]
18%|ββ | 522/2826 [51:27<3:39:53, 5.73s/it]
19%|ββ | 523/2826 [51:32<3:35:51, 5.62s/it]
19%|ββ | 524/2826 [51:39<3:44:31, 5.85s/it]
19%|ββ | 525/2826 [51:44<3:35:09, 5.61s/it]
19%|ββ | 526/2826 [51:50<3:48:00, 5.95s/it]
19%|ββ | 527/2826 [51:55<3:37:42, 5.68s/it]
19%|ββ | 528/2826 [52:01<3:30:43, 5.50s/it]
19%|ββ | 529/2826 [52:07<3:38:48, 5.72s/it]
19%|ββ | 530/2826 [52:13<3:43:21, 5.84s/it]
{'loss': 0.5028, 'grad_norm': 2.590303897857666, 'learning_rate': 4.8854375988861134e-06, 'epoch': 0.56}
19%|ββ | 531/2826 [52:18<3:35:36, 5.64s/it]
19%|ββ | 532/2826 [52:23<3:29:42, 5.49s/it]
19%|ββ | 533/2826 [52:29<3:37:36, 5.69s/it]
19%|ββ | 534/2826 [52:35<3:36:27, 5.67s/it]
19%|ββ | 535/2826 [52:40<3:29:50, 5.50s/it]
19%|ββ | 536/2826 [52:47<3:48:28, 5.99s/it]
19%|ββ | 537/2826 [52:52<3:38:04, 5.72s/it]
19%|ββ | 538/2826 [52:57<3:30:39, 5.52s/it]
19%|ββ | 539/2826 [53:04<3:39:14, 5.75s/it]
19%|ββ | 540/2826 [53:09<3:32:30, 5.58s/it]
{'loss': 0.5193, 'grad_norm': 2.377298355102539, 'learning_rate': 4.87601358541431e-06, 'epoch': 0.57}
19%|ββ | 541/2826 [53:15<3:37:09, 5.70s/it]
19%|ββ | 542/2826 [53:21<3:38:14, 5.73s/it]
19%|ββ | 543/2826 [53:26<3:38:50, 5.75s/it]
19%|ββ | 544/2826 [53:32<3:40:01, 5.79s/it]
19%|ββ | 545/2826 [53:38<3:35:35, 5.67s/it]
19%|ββ | 546/2826 [53:44<3:38:27, 5.75s/it]
19%|ββ | 547/2826 [53:49<3:35:02, 5.66s/it]
19%|ββ | 548/2826 [53:56<3:49:37, 6.05s/it]
19%|ββ | 549/2826 [54:02<3:46:33, 5.97s/it]
19%|ββ | 550/2826 [54:07<3:35:42, 5.69s/it]
{'loss': 0.545, 'grad_norm': 2.966008186340332, 'learning_rate': 4.8662269530550825e-06, 'epoch': 0.58}
19%|ββ | 551/2826 [54:13<3:44:14, 5.91s/it]
20%|ββ | 552/2826 [54:19<3:47:00, 5.99s/it]
20%|ββ | 553/2826 [54:26<3:54:01, 6.18s/it]
20%|ββ | 554/2826 [54:31<3:45:16, 5.95s/it]
20%|ββ | 555/2826 [54:36<3:34:54, 5.68s/it]
20%|ββ | 556/2826 [54:42<3:28:14, 5.50s/it]
20%|ββ | 557/2826 [54:47<3:22:34, 5.36s/it]
20%|ββ | 558/2826 [54:52<3:26:50, 5.47s/it]
20%|ββ | 559/2826 [54:57<3:22:48, 5.37s/it]
20%|ββ | 560/2826 [55:03<3:27:45, 5.50s/it]
{'loss': 0.5219, 'grad_norm': 2.250293254852295, 'learning_rate': 4.856079195410046e-06, 'epoch': 0.59}
20%|ββ | 561/2826 [55:09<3:25:15, 5.44s/it]
20%|ββ | 562/2826 [55:14<3:21:54, 5.35s/it]
20%|ββ | 563/2826 [55:21<3:38:55, 5.80s/it]
20%|ββ | 564/2826 [55:27<3:45:49, 5.99s/it]
20%|ββ | 565/2826 [55:32<3:38:34, 5.80s/it]
20%|ββ | 566/2826 [55:39<3:49:49, 6.10s/it]
20%|ββ | 567/2826 [55:46<3:57:56, 6.32s/it]
20%|ββ | 568/2826 [55:53<4:01:09, 6.41s/it]
20%|ββ | 569/2826 [55:59<4:02:37, 6.45s/it]
20%|ββ | 570/2826 [56:04<3:48:57, 6.09s/it]
{'loss': 0.4725, 'grad_norm': 2.437361240386963, 'learning_rate': 4.845571861194501e-06, 'epoch': 0.6}
20%|ββ | 571/2826 [56:10<3:43:02, 5.93s/it]
20%|ββ | 572/2826 [56:15<3:32:34, 5.66s/it]
20%|ββ | 573/2826 [56:21<3:32:07, 5.65s/it]
20%|ββ | 574/2826 [56:27<3:41:44, 5.91s/it]
20%|ββ | 575/2826 [56:34<3:49:25, 6.12s/it]
20%|ββ | 576/2826 [56:39<3:44:49, 6.00s/it]
20%|ββ | 577/2826 [56:45<3:40:14, 5.88s/it]
20%|ββ | 578/2826 [56:52<3:51:27, 6.18s/it]
20%|ββ | 579/2826 [56:59<4:02:06, 6.46s/it]
21%|ββ | 580/2826 [57:04<3:47:09, 6.07s/it]
{'loss': 0.4232, 'grad_norm': 2.435994863510132, 'learning_rate': 4.834706554001065e-06, 'epoch': 0.62}
21%|ββ | 581/2826 [57:10<3:46:22, 6.05s/it]
21%|ββ | 582/2826 [57:16<3:44:56, 6.01s/it]
21%|ββ | 583/2826 [57:21<3:36:49, 5.80s/it]
21%|ββ | 584/2826 [57:27<3:28:57, 5.59s/it]
21%|ββ | 585/2826 [57:32<3:30:43, 5.64s/it]
21%|ββ | 586/2826 [57:40<3:55:05, 6.30s/it]
21%|ββ | 587/2826 [57:45<3:43:04, 5.98s/it]
21%|ββ | 588/2826 [57:50<3:33:39, 5.73s/it]
21%|ββ | 589/2826 [57:56<3:31:59, 5.69s/it]
21%|ββ | 590/2826 [58:02<3:35:32, 5.78s/it]
{'loss': 0.4834, 'grad_norm': 2.705902099609375, 'learning_rate': 4.823484932054937e-06, 'epoch': 0.63}
21%|ββ | 591/2826 [58:09<3:44:11, 6.02s/it]
21%|ββ | 592/2826 [58:14<3:35:03, 5.78s/it]
21%|ββ | 593/2826 [58:20<3:39:36, 5.90s/it]
21%|ββ | 594/2826 [58:26<3:36:20, 5.82s/it]
21%|ββ | 595/2826 [58:31<3:28:41, 5.61s/it]
21%|ββ | 596/2826 [58:36<3:25:41, 5.53s/it]
21%|ββ | 597/2826 [58:41<3:20:33, 5.40s/it]
21%|ββ | 598/2826 [58:47<3:18:54, 5.36s/it]
21%|ββ | 599/2826 [58:52<3:25:20, 5.53s/it]
21%|ββ | 600/2826 [58:58<3:20:36, 5.41s/it]
{'loss': 0.5302, 'grad_norm': 2.1471517086029053, 'learning_rate': 4.811908707960832e-06, 'epoch': 0.64}
21%|βββ | 601/2826 [59:03<3:18:26, 5.35s/it]
21%|βββ | 602/2826 [59:08<3:14:59, 5.26s/it]
21%|βββ | 603/2826 [59:15<3:32:37, 5.74s/it]
21%|βββ | 604/2826 [59:20<3:29:18, 5.65s/it]
21%|βββ | 605/2826 [59:26<3:29:32, 5.66s/it]
21%|βββ | 606/2826 [59:31<3:22:07, 5.46s/it]
21%|βββ | 607/2826 [59:36<3:18:09, 5.36s/it]
22%|βββ | 608/2826 [59:41<3:14:28, 5.26s/it]
22%|βββ | 609/2826 [59:47<3:22:29, 5.48s/it]
22%|βββ | 610/2826 [59:55<3:45:57, 6.12s/it]
{'loss': 0.494, 'grad_norm': 2.0760443210601807, 'learning_rate': 4.799979648441602e-06, 'epoch': 0.65}
22%|βββ | 611/2826 [1:00:00<3:40:40, 5.98s/it]
22%|βββ | 612/2826 [1:00:07<3:45:01, 6.10s/it]
22%|βββ | 613/2826 [1:00:12<3:34:41, 5.82s/it]
22%|βββ | 614/2826 [1:00:18<3:37:59, 5.91s/it]
22%|βββ | 615/2826 [1:00:23<3:34:09, 5.81s/it]
22%|βββ | 616/2826 [1:00:30<3:38:30, 5.93s/it]
22%|βββ | 617/2826 [1:00:36<3:44:02, 6.09s/it]
22%|βββ | 618/2826 [1:00:42<3:36:24, 5.88s/it]
22%|βββ | 619/2826 [1:00:47<3:33:58, 5.82s/it]
22%|βββ | 620/2826 [1:00:52<3:25:26, 5.59s/it]
{'loss': 0.487, 'grad_norm': 2.334944009780884, 'learning_rate': 4.787699574068611e-06, 'epoch': 0.66}
22%|βββ | 621/2826 [1:00:57<3:20:49, 5.46s/it]
22%|βββ | 622/2826 [1:01:03<3:16:54, 5.36s/it]
22%|βββ | 623/2826 [1:01:08<3:23:09, 5.53s/it]
22%|βββ | 624/2826 [1:01:14<3:27:42, 5.66s/it]
22%|βββ | 625/2826 [1:01:22<3:51:48, 6.32s/it]
22%|βββ | 626/2826 [1:01:28<3:47:35, 6.21s/it]
22%|βββ | 627/2826 [1:01:34<3:40:19, 6.01s/it]
22%|βββ | 628/2826 [1:01:39<3:28:45, 5.70s/it]
22%|βββ | 629/2826 [1:01:47<3:53:46, 6.38s/it]
22%|βββ | 630/2826 [1:01:52<3:45:36, 6.16s/it]
{'loss': 0.4911, 'grad_norm': 2.3444855213165283, 'learning_rate': 4.775070358983881e-06, 'epoch': 0.67} |
|
22%|βββ | 630/2826 [1:01:52<3:45:36, 6.16s/it]
22%|███ | 631/2826 [1:01:59<3:46:16, 6.19s/it]
22%|███ | 632/2826 [1:02:04<3:38:43, 5.98s/it]
22%|███ | 633/2826 [1:02:11<3:42:53, 6.10s/it]
22%|███ | 634/2826 [1:02:16<3:39:35, 6.01s/it]
22%|███ | 635/2826 [1:02:21<3:29:22, 5.73s/it]
23%|███ | 636/2826 [1:02:27<3:23:13, 5.57s/it]
23%|███ | 637/2826 [1:02:32<3:23:05, 5.57s/it]
23%|███ | 638/2826 [1:02:39<3:34:31, 5.88s/it]
23%|███ | 639/2826 [1:02:44<3:30:56, 5.79s/it]
23%|███ | 640/2826 [1:02:49<3:23:19, 5.58s/it]
{'loss': 0.4744, 'grad_norm': 2.127737045288086, 'learning_rate': 4.7620939306140696e-06, 'epoch': 0.68}

23%|███ | 641/2826 [1:02:55<3:27:12, 5.69s/it]
23%|███ | 642/2826 [1:03:00<3:20:26, 5.51s/it]
23%|███ | 643/2826 [1:03:08<3:47:34, 6.25s/it]
23%|███ | 644/2826 [1:03:14<3:34:50, 5.91s/it]
23%|███ | 645/2826 [1:03:19<3:27:00, 5.69s/it]
23%|███ | 646/2826 [1:03:26<3:41:49, 6.11s/it]
23%|███ | 647/2826 [1:03:32<3:43:35, 6.16s/it]
23%|███ | 648/2826 [1:03:37<3:32:05, 5.84s/it]
23%|███ | 649/2826 [1:03:42<3:23:47, 5.62s/it]
23%|███ | 650/2826 [1:03:47<3:17:11, 5.44s/it]
{'loss': 0.4789, 'grad_norm': 2.2132568359375, 'learning_rate': 4.748772269376312e-06, 'epoch': 0.69}

23%|███ | 651/2826 [1:03:53<3:16:57, 5.43s/it]
23%|███ | 652/2826 [1:03:59<3:23:10, 5.61s/it]
23%|███ | 653/2826 [1:04:05<3:34:07, 5.91s/it]
23%|███ | 654/2826 [1:04:12<3:36:23, 5.98s/it]
23%|███ | 655/2826 [1:04:17<3:29:18, 5.78s/it]
23%|███ | 656/2826 [1:04:23<3:31:03, 5.84s/it]
23%|███ | 657/2826 [1:04:28<3:23:09, 5.62s/it]
23%|███ | 658/2826 [1:04:34<3:27:43, 5.75s/it]
23%|███ | 659/2826 [1:04:39<3:21:46, 5.59s/it]
23%|███ | 660/2826 [1:04:46<3:33:07, 5.90s/it]
{'loss': 0.488, 'grad_norm': 1.9452372789382935, 'learning_rate': 4.735107408375977e-06, 'epoch': 0.7}

23%|███ | 661/2826 [1:04:53<3:43:28, 6.19s/it]
23%|███ | 662/2826 [1:04:58<3:31:37, 5.87s/it]
23%|███ | 663/2826 [1:05:03<3:22:27, 5.62s/it]
23%|███ | 664/2826 [1:05:08<3:16:46, 5.46s/it]
24%|███ | 665/2826 [1:05:14<3:18:12, 5.50s/it]
24%|███ | 666/2826 [1:05:19<3:13:54, 5.39s/it]
24%|███ | 667/2826 [1:05:25<3:23:58, 5.67s/it]
24%|███ | 668/2826 [1:05:30<3:17:42, 5.50s/it]
24%|███ | 669/2826 [1:05:35<3:13:39, 5.39s/it]
24%|███ | 670/2826 [1:05:40<3:12:34, 5.36s/it]
{'loss': 0.4462, 'grad_norm': 2.7268893718719482, 'learning_rate': 4.721101433096381e-06, 'epoch': 0.71}
24%|███ | 671/2826 [1:05:47<3:20:19, 5.58s/it]
24%|███ | 672/2826 [1:05:52<3:15:03, 5.43s/it]
24%|███ | 673/2826 [1:05:58<3:24:07, 5.69s/it]
24%|███ | 674/2826 [1:06:04<3:31:42, 5.90s/it]
24%|███ | 675/2826 [1:06:09<3:23:25, 5.67s/it]
24%|███ | 676/2826 [1:06:15<3:26:38, 5.77s/it]
24%|███ | 677/2826 [1:06:22<3:30:54, 5.89s/it]
24%|███ | 678/2826 [1:06:28<3:36:49, 6.06s/it]
24%|███ | 679/2826 [1:06:34<3:33:14, 5.96s/it]
24%|███ | 680/2826 [1:06:41<3:43:49, 6.26s/it]
{'loss': 0.5087, 'grad_norm': 2.1095452308654785, 'learning_rate': 4.706756481080511e-06, 'epoch': 0.72}

24%|███ | 681/2826 [1:06:46<3:35:57, 6.04s/it]
24%|███ | 682/2826 [1:06:52<3:28:07, 5.82s/it]
24%|███ | 683/2826 [1:06:57<3:27:49, 5.82s/it]
24%|███ | 684/2826 [1:07:05<3:41:54, 6.22s/it]
24%|███ | 685/2826 [1:07:10<3:28:59, 5.86s/it]
24%|███ | 686/2826 [1:07:15<3:25:57, 5.77s/it]
24%|███ | 687/2826 [1:07:21<3:23:20, 5.70s/it]
24%|███ | 688/2826 [1:07:26<3:16:02, 5.50s/it]
24%|███ | 689/2826 [1:07:31<3:09:25, 5.32s/it]
24%|███ | 690/2826 [1:07:36<3:12:46, 5.41s/it]
{'loss': 0.5304, 'grad_norm': 2.278555154800415, 'learning_rate': 4.692074741604795e-06, 'epoch': 0.73}

24%|███ | 691/2826 [1:07:42<3:18:40, 5.58s/it]
24%|███ | 692/2826 [1:07:48<3:15:39, 5.50s/it]
25%|███ | 693/2826 [1:07:54<3:24:44, 5.76s/it]
25%|███ | 694/2826 [1:08:00<3:29:48, 5.90s/it]
25%|███ | 695/2826 [1:08:06<3:23:32, 5.73s/it]
25%|███ | 696/2826 [1:08:11<3:17:25, 5.56s/it]
25%|███ | 697/2826 [1:08:17<3:30:17, 5.93s/it]
25%|███ | 698/2826 [1:08:23<3:23:51, 5.75s/it]
25%|███ | 699/2826 [1:08:28<3:16:07, 5.53s/it]
25%|███ | 700/2826 [1:08:33<3:11:26, 5.40s/it]
{'loss': 0.5177, 'grad_norm': 2.455960512161255, 'learning_rate': 4.677058455344989e-06, 'epoch': 0.74}

25%|███ | 701/2826 [1:08:38<3:11:11, 5.40s/it]
25%|███ | 702/2826 [1:08:46<3:34:46, 6.07s/it]
25%|███ | 703/2826 [1:08:52<3:31:48, 5.99s/it]
25%|███ | 704/2826 [1:08:58<3:38:31, 6.18s/it]
25%|███ | 705/2826 [1:09:04<3:29:38, 5.93s/it]
25%|███ | 706/2826 [1:09:11<3:39:23, 6.21s/it]
25%|███ | 707/2826 [1:09:17<3:41:51, 6.28s/it]
25%|███ | 708/2826 [1:09:23<3:39:06, 6.21s/it]
25%|███ | 709/2826 [1:09:30<3:44:42, 6.37s/it]
25%|███ | 710/2826 [1:09:37<3:52:46, 6.60s/it]
{'loss': 0.4841, 'grad_norm': 2.1136856079101562, 'learning_rate': 4.661709914034209e-06, 'epoch': 0.75}
25%|███ | 711/2826 [1:09:44<3:57:13, 6.73s/it]
25%|███ | 712/2826 [1:09:51<3:56:08, 6.70s/it]
25%|███ | 713/2826 [1:09:57<3:53:06, 6.62s/it]
25%|███ | 714/2826 [1:10:02<3:38:14, 6.20s/it]
25%|███ | 715/2826 [1:10:07<3:27:22, 5.89s/it]
25%|███ | 716/2826 [1:10:13<3:22:05, 5.75s/it]
25%|███ | 717/2826 [1:10:19<3:29:24, 5.96s/it]
25%|███ | 718/2826 [1:10:25<3:25:07, 5.84s/it]
25%|███ | 719/2826 [1:10:30<3:21:22, 5.73s/it]
25%|███ | 720/2826 [1:10:38<3:37:30, 6.20s/it]
{'loss': 0.4544, 'grad_norm': 2.296614646911621, 'learning_rate': 4.646031460113175e-06, 'epoch': 0.76}

26%|███ | 721/2826 [1:10:43<3:27:51, 5.92s/it]
26%|███ | 722/2826 [1:10:49<3:29:29, 5.97s/it]
26%|███ | 723/2826 [1:10:54<3:20:48, 5.73s/it]
26%|███ | 724/2826 [1:10:59<3:13:28, 5.52s/it]
26%|███ | 725/2826 [1:11:05<3:19:32, 5.70s/it]
26%|███ | 726/2826 [1:11:10<3:14:08, 5.55s/it]
26%|███ | 727/2826 [1:11:17<3:20:28, 5.73s/it]
26%|███ | 728/2826 [1:11:22<3:13:59, 5.55s/it]
26%|███ | 729/2826 [1:11:27<3:08:52, 5.40s/it]
26%|███ | 730/2826 [1:11:32<3:08:15, 5.39s/it]
{'loss': 0.4715, 'grad_norm': 1.8733782768249512, 'learning_rate': 4.630025486372715e-06, 'epoch': 0.77}

26%|███ | 731/2826 [1:11:38<3:16:06, 5.62s/it]
26%|███ | 732/2826 [1:11:44<3:16:38, 5.63s/it]
26%|███ | 733/2826 [1:11:49<3:10:19, 5.46s/it]
26%|███ | 734/2826 [1:11:54<3:06:23, 5.35s/it]
26%|███ | 735/2826 [1:11:59<3:03:47, 5.27s/it]
26%|███ | 736/2826 [1:12:05<3:06:16, 5.35s/it]
26%|███ | 737/2826 [1:12:11<3:18:30, 5.70s/it]
26%|███ | 738/2826 [1:12:17<3:14:16, 5.58s/it]
26%|███ | 739/2826 [1:12:22<3:10:19, 5.47s/it]
26%|███ | 740/2826 [1:12:28<3:13:57, 5.58s/it]
{'loss': 0.4824, 'grad_norm': 2.526837110519409, 'learning_rate': 4.613694435588589e-06, 'epoch': 0.79}

26%|███ | 741/2826 [1:12:33<3:10:35, 5.48s/it]
26%|███ | 742/2826 [1:12:39<3:11:40, 5.52s/it]
26%|███ | 743/2826 [1:12:45<3:19:04, 5.73s/it]
26%|███ | 744/2826 [1:12:52<3:33:39, 6.16s/it]
26%|███ | 745/2826 [1:12:57<3:26:52, 5.96s/it]
26%|███ | 746/2826 [1:13:03<3:20:29, 5.78s/it]
26%|███ | 747/2826 [1:13:08<3:18:28, 5.73s/it]
26%|███ | 748/2826 [1:13:14<3:12:42, 5.56s/it]
27%|███ | 749/2826 [1:13:19<3:12:00, 5.55s/it]
27%|███ | 750/2826 [1:13:25<3:15:38, 5.65s/it]
{'loss': 0.4852, 'grad_norm': 2.2026150226593018, 'learning_rate': 4.597040800148679e-06, 'epoch': 0.8}
27%|███ | 751/2826 [1:13:30<3:09:26, 5.48s/it]
27%|███ | 752/2826 [1:13:36<3:14:51, 5.64s/it]
27%|███ | 753/2826 [1:13:41<3:08:40, 5.46s/it]
27%|███ | 754/2826 [1:13:46<3:04:28, 5.34s/it]
27%|███ | 755/2826 [1:13:51<3:03:20, 5.31s/it]
27%|███ | 756/2826 [1:13:58<3:15:18, 5.66s/it]
27%|███ | 757/2826 [1:14:03<3:11:08, 5.54s/it]
27%|███ | 758/2826 [1:14:10<3:19:44, 5.80s/it]
27%|███ | 759/2826 [1:14:16<3:30:47, 6.12s/it]
27%|███ | 760/2826 [1:14:21<3:19:45, 5.80s/it]
{'loss': 0.4134, 'grad_norm': 2.214277744293213, 'learning_rate': 4.580067121672607e-06, 'epoch': 0.81}

27%|███ | 761/2826 [1:14:27<3:16:01, 5.70s/it]
27%|███ | 762/2826 [1:14:32<3:12:29, 5.60s/it]
27%|███ | 763/2826 [1:14:37<3:08:07, 5.47s/it]
27%|███ | 764/2826 [1:14:43<3:08:46, 5.49s/it]
27%|███ | 765/2826 [1:14:50<3:27:24, 6.04s/it]
27%|███ | 766/2826 [1:14:55<3:17:52, 5.76s/it]
27%|███ | 767/2826 [1:15:02<3:25:25, 5.99s/it]
27%|███ | 768/2826 [1:15:08<3:26:57, 6.03s/it]
27%|███ | 769/2826 [1:15:14<3:22:23, 5.90s/it]
27%|███ | 770/2826 [1:15:20<3:29:34, 6.12s/it]
{'loss': 0.4493, 'grad_norm': 2.623305559158325, 'learning_rate': 4.562775990623847e-06, 'epoch': 0.82}

27%|███ | 771/2826 [1:15:26<3:23:43, 5.95s/it]
27%|███ | 772/2826 [1:15:32<3:22:26, 5.91s/it]
27%|███ | 773/2826 [1:15:37<3:17:11, 5.76s/it]
27%|███ | 774/2826 [1:15:44<3:25:45, 6.02s/it]
27%|███ | 775/2826 [1:15:49<3:22:25, 5.92s/it]
27%|███ | 776/2826 [1:15:55<3:15:14, 5.71s/it]
27%|███ | 777/2826 [1:16:01<3:17:52, 5.79s/it]
28%|███ | 778/2826 [1:16:06<3:14:38, 5.70s/it]
28%|███ | 779/2826 [1:16:12<3:19:28, 5.85s/it]
28%|███ | 780/2826 [1:16:18<3:20:27, 5.88s/it]
{'loss': 0.5255, 'grad_norm': 2.9433794021606445, 'learning_rate': 4.5451700459143735e-06, 'epoch': 0.83}

28%|███ | 781/2826 [1:16:25<3:25:19, 6.02s/it]
28%|███ | 782/2826 [1:16:31<3:33:16, 6.26s/it]
28%|███ | 783/2826 [1:16:37<3:30:04, 6.17s/it]
28%|███ | 784/2826 [1:16:43<3:20:48, 5.90s/it]
28%|███ | 785/2826 [1:16:48<3:13:36, 5.69s/it]
28%|███ | 786/2826 [1:16:54<3:16:01, 5.77s/it]
28%|███ | 787/2826 [1:17:00<3:17:12, 5.80s/it]
28%|███ | 788/2826 [1:17:05<3:11:27, 5.64s/it]
28%|███ | 789/2826 [1:17:10<3:09:15, 5.57s/it]
28%|███ | 790/2826 [1:17:17<3:21:24, 5.94s/it]
{'loss': 0.4503, 'grad_norm': 2.143739938735962, 'learning_rate': 4.527251974501923e-06, 'epoch': 0.84}
28%|███ | 791/2826 [1:17:23<3:25:32, 6.06s/it]
28%|███ | 792/2826 [1:17:32<3:49:08, 6.76s/it]
28%|███ | 793/2826 [1:17:37<3:32:09, 6.26s/it]
28%|███ | 794/2826 [1:17:42<3:20:51, 5.93s/it]
28%|███ | 795/2826 [1:17:48<3:16:48, 5.81s/it]
28%|███ | 796/2826 [1:17:53<3:10:20, 5.63s/it]
28%|███ | 797/2826 [1:17:59<3:15:07, 5.77s/it]
28%|███ | 798/2826 [1:18:05<3:19:41, 5.91s/it]
28%|███ | 799/2826 [1:18:11<3:13:41, 5.73s/it]
28%|███ | 800/2826 [1:18:17<3:26:14, 6.11s/it]
{'loss': 0.4636, 'grad_norm': 2.1592986583709717, 'learning_rate': 4.509024510979917e-06, 'epoch': 0.85}

28%|███ | 801/2826 [1:18:24<3:29:49, 6.22s/it]
28%|███ | 802/2826 [1:18:30<3:26:00, 6.11s/it]
28%|███ | 803/2826 [1:18:35<3:19:53, 5.93s/it]
28%|███ | 804/2826 [1:18:42<3:25:27, 6.10s/it]
28%|███ | 805/2826 [1:18:47<3:20:20, 5.95s/it]
29%|███ | 806/2826 [1:18:53<3:14:27, 5.78s/it]
29%|███ | 807/2826 [1:19:00<3:26:15, 6.13s/it]
29%|███ | 808/2826 [1:19:05<3:20:04, 5.95s/it]
29%|███ | 809/2826 [1:19:13<3:36:29, 6.44s/it]
29%|███ | 810/2826 [1:19:18<3:24:30, 6.09s/it]
{'loss': 0.4685, 'grad_norm': 2.2622759342193604, 'learning_rate': 4.4904904371601176e-06, 'epoch': 0.86}

29%|███ | 811/2826 [1:19:24<3:24:46, 6.10s/it]
29%|███ | 812/2826 [1:19:29<3:14:40, 5.80s/it]
29%|███ | 813/2826 [1:19:36<3:25:28, 6.12s/it]
29%|███ | 814/2826 [1:19:43<3:28:16, 6.21s/it]
29%|███ | 815/2826 [1:19:48<3:17:00, 5.88s/it]
29%|███ | 816/2826 [1:19:55<3:27:35, 6.20s/it]
29%|███ | 817/2826 [1:20:02<3:40:45, 6.59s/it]
29%|███ | 818/2826 [1:20:10<3:50:48, 6.90s/it]
29%|███ | 819/2826 [1:20:15<3:32:37, 6.36s/it]
29%|███ | 820/2826 [1:20:20<3:22:27, 6.06s/it]
{'loss': 0.5248, 'grad_norm': 2.3408522605895996, 'learning_rate': 4.4716525816480816e-06, 'epoch': 0.87}

29%|███ | 821/2826 [1:20:26<3:16:31, 5.88s/it]
29%|███ | 822/2826 [1:20:33<3:30:28, 6.30s/it]
29%|███ | 823/2826 [1:20:39<3:29:36, 6.28s/it]
29%|███ | 824/2826 [1:20:44<3:16:30, 5.89s/it]
29%|███ | 825/2826 [1:20:50<3:10:33, 5.71s/it]
29%|███ | 826/2826 [1:20:56<3:15:50, 5.88s/it]
29%|███ | 827/2826 [1:21:01<3:11:04, 5.74s/it]
29%|███ | 828/2826 [1:21:08<3:24:38, 6.15s/it]
29%|███ | 829/2826 [1:21:14<3:21:40, 6.06s/it]
29%|███ | 830/2826 [1:21:20<3:23:18, 6.11s/it]
{'loss': 0.4747, 'grad_norm': 2.5351459980010986, 'learning_rate': 4.4525138194114644e-06, 'epoch': 0.88}
29%|███ | 831/2826 [1:21:28<3:33:50, 6.43s/it]
29%|███ | 832/2826 [1:21:35<3:48:02, 6.86s/it]
29%|███ | 833/2826 [1:21:42<3:45:13, 6.78s/it]
30%|███ | 834/2826 [1:21:47<3:27:54, 6.26s/it]
30%|███ | 835/2826 [1:21:54<3:36:10, 6.51s/it]
30%|███ | 836/2826 [1:22:01<3:35:27, 6.50s/it]
30%|███ | 837/2826 [1:22:09<3:56:47, 7.14s/it]
30%|███ | 838/2826 [1:22:15<3:42:13, 6.71s/it]
30%|███ | 839/2826 [1:22:21<3:39:52, 6.64s/it]
30%|███ | 840/2826 [1:22:27<3:25:15, 6.20s/it]
{'loss': 0.4198, 'grad_norm': 2.4038591384887695, 'learning_rate': 4.4330770713412555e-06, 'epoch': 0.89}

30%|███ | 841/2826 [1:22:33<3:23:29, 6.15s/it]
30%|███ | 842/2826 [1:22:41<3:41:23, 6.70s/it]
30%|███ | 843/2826 [1:22:47<3:38:46, 6.62s/it]
30%|███ | 844/2826 [1:22:54<3:36:54, 6.57s/it]
30%|███ | 845/2826 [1:22:59<3:22:22, 6.13s/it]
30%|███ | 846/2826 [1:23:04<3:15:28, 5.92s/it]
30%|███ | 847/2826 [1:23:09<3:09:06, 5.73s/it]
30%|███ | 848/2826 [1:23:15<3:12:47, 5.85s/it]
30%|███ | 849/2826 [1:23:22<3:14:44, 5.91s/it]
30%|███ | 850/2826 [1:23:27<3:15:11, 5.93s/it]
{'loss': 0.4545, 'grad_norm': 2.2719292640686035, 'learning_rate': 4.413345303805996e-06, 'epoch': 0.9}

30%|███ | 851/2826 [1:23:33<3:09:49, 5.77s/it]
30%|███ | 852/2826 [1:23:40<3:18:21, 6.03s/it]
30%|███ | 853/2826 [1:23:45<3:16:26, 5.97s/it]
30%|███ | 854/2826 [1:23:51<3:08:49, 5.75s/it]
30%|███ | 855/2826 [1:23:57<3:12:56, 5.87s/it]
30%|███ | 856/2826 [1:24:02<3:04:42, 5.63s/it]
30%|███ | 857/2826 [1:24:08<3:10:29, 5.80s/it]
30%|███ | 858/2826 [1:24:13<3:03:55, 5.61s/it]
30%|███ | 859/2826 [1:24:18<3:00:37, 5.51s/it]
30%|███ | 860/2826 [1:24:24<2:57:40, 5.42s/it]
{'loss': 0.5003, 'grad_norm': 3.1209301948547363, 'learning_rate': 4.393321528199072e-06, 'epoch': 0.91}

30%|███ | 861/2826 [1:24:29<2:59:49, 5.49s/it]
31%|███ | 862/2826 [1:24:36<3:06:48, 5.71s/it]
31%|███ | 863/2826 [1:24:42<3:09:17, 5.79s/it]
31%|███ | 864/2826 [1:24:47<3:05:16, 5.67s/it]
31%|███ | 865/2826 [1:24:53<3:06:12, 5.70s/it]
31%|███ | 866/2826 [1:24:59<3:07:58, 5.75s/it]
31%|███ | 867/2826 [1:25:04<3:08:17, 5.77s/it]
31%|███ | 868/2826 [1:25:09<3:02:02, 5.58s/it]
31%|███ | 869/2826 [1:25:15<2:57:41, 5.45s/it]
31%|███ | 870/2826 [1:25:20<2:53:58, 5.34s/it]
{'loss': 0.472, 'grad_norm': 2.414945125579834, 'learning_rate': 4.373008800479118e-06, 'epoch': 0.92}
31%|███ | 871/2826 [1:25:25<2:57:13, 5.44s/it]
31%|███ | 872/2826 [1:25:31<2:54:06, 5.35s/it]
31%|███ | 873/2826 [1:25:36<2:51:52, 5.28s/it]
31%|███ | 874/2826 [1:25:42<3:02:38, 5.61s/it]
31%|███ | 875/2826 [1:25:47<2:57:43, 5.47s/it]
31%|███ | 876/2826 [1:25:52<2:54:15, 5.36s/it]
31%|███ | 877/2826 [1:25:59<3:08:52, 5.81s/it]
31%|███ | 878/2826 [1:26:05<3:04:33, 5.68s/it]
31%|███ | 879/2826 [1:26:11<3:12:47, 5.94s/it]
31%|███ | 880/2826 [1:26:17<3:16:48, 6.07s/it]
{'loss': 0.4661, 'grad_norm': 2.21144437789917, 'learning_rate': 4.352410220703629e-06, 'epoch': 0.93}

31%|███ | 881/2826 [1:26:25<3:26:56, 6.38s/it]
31%|███ | 882/2826 [1:26:30<3:20:38, 6.19s/it]
31%|███ | 883/2826 [1:26:36<3:17:58, 6.11s/it]
31%|████ | 884/2826 [1:26:41<3:08:38, 5.83s/it]
31%|████ | 885/2826 [1:26:48<3:13:22, 5.98s/it]
31%|████ | 886/2826 [1:26:53<3:04:43, 5.71s/it]
31%|████ | 887/2826 [1:27:00<3:16:46, 6.09s/it]
31%|████ | 888/2826 [1:27:05<3:06:51, 5.79s/it]
31%|████ | 889/2826 [1:27:12<3:15:55, 6.07s/it]
31%|████ | 890/2826 [1:27:17<3:11:36, 5.94s/it]
{'loss': 0.4614, 'grad_norm': 2.210827589035034, 'learning_rate': 4.331528932555844e-06, 'epoch': 0.94}

32%|████ | 891/2826 [1:27:22<3:04:34, 5.72s/it]
32%|████ | 892/2826 [1:27:28<2:58:48, 5.55s/it]
32%|████ | 893/2826 [1:27:35<3:15:39, 6.07s/it]
32%|████ | 894/2826 [1:27:40<3:05:23, 5.76s/it]
32%|████ | 895/2826 [1:27:46<3:06:43, 5.80s/it]
32%|████ | 896/2826 [1:27:52<3:11:42, 5.96s/it]
32%|████ | 897/2826 [1:27:58<3:07:25, 5.83s/it]
32%|████ | 898/2826 [1:28:04<3:07:33, 5.84s/it]
32%|████ | 899/2826 [1:28:10<3:12:55, 6.01s/it]
32%|████ | 900/2826 [1:28:15<3:04:44, 5.76s/it]
{'loss': 0.4623, 'grad_norm': 2.403038740158081, 'learning_rate': 4.3103681228649626e-06, 'epoch': 0.95}

32%|████ | 901/2826 [1:28:20<2:59:30, 5.59s/it]
32%|████ | 902/2826 [1:28:25<2:55:26, 5.47s/it]
32%|████ | 903/2826 [1:28:31<2:58:45, 5.58s/it]
32%|████ | 904/2826 [1:28:38<3:10:55, 5.96s/it]
32%|████ | 905/2826 [1:28:44<3:10:35, 5.95s/it]
32%|████ | 906/2826 [1:28:50<3:06:40, 5.83s/it]
32%|████ | 907/2826 [1:28:56<3:15:02, 6.10s/it]
32%|████ | 908/2826 [1:29:02<3:14:04, 6.07s/it]
32%|████ | 909/2826 [1:29:08<3:10:21, 5.96s/it]
32%|████ | 910/2826 [1:29:13<3:02:19, 5.71s/it]
{'loss': 0.4902, 'grad_norm': 2.588114023208618, 'learning_rate': 4.288931021119788e-06, 'epoch': 0.97}
32%|████ | 911/2826 [1:29:18<2:58:08, 5.58s/it]
32%|████ | 912/2826 [1:29:24<2:57:23, 5.56s/it]
32%|████ | 913/2826 [1:29:29<2:55:27, 5.50s/it]
32%|████ | 914/2826 [1:29:34<2:51:24, 5.38s/it]
32%|████ | 915/2826 [1:29:40<2:52:58, 5.43s/it]
32%|████ | 916/2826 [1:29:45<2:49:43, 5.33s/it]
32%|████ | 917/2826 [1:29:51<2:53:38, 5.46s/it]
32%|████ | 918/2826 [1:29:59<3:15:46, 6.16s/it]
33%|████ | 919/2826 [1:30:04<3:05:44, 5.84s/it]
33%|████ | 920/2826 [1:30:10<3:05:48, 5.85s/it]
{'loss': 0.5047, 'grad_norm': 2.288691997528076, 'learning_rate': 4.267220898975848e-06, 'epoch': 0.98}

33%|████ | 921/2826 [1:30:16<3:12:02, 6.05s/it]
33%|████ | 922/2826 [1:30:22<3:10:46, 6.01s/it]
33%|████ | 923/2826 [1:30:28<3:08:25, 5.94s/it]
33%|████ | 924/2826 [1:30:35<3:16:32, 6.20s/it]
33%|████ | 925/2826 [1:30:40<3:08:33, 5.95s/it]
33%|████ | 926/2826 [1:30:47<3:14:26, 6.14s/it]
33%|████ | 927/2826 [1:30:54<3:21:57, 6.38s/it]
33%|████ | 928/2826 [1:30:59<3:12:01, 6.07s/it]
33%|████ | 929/2826 [1:31:04<3:02:48, 5.78s/it]
33%|████ | 930/2826 [1:31:10<3:07:58, 5.95s/it]
{'loss': 0.5358, 'grad_norm': 2.2487804889678955, 'learning_rate': 4.245241069756092e-06, 'epoch': 0.99}

33%|████ | 931/2826 [1:31:19<3:29:39, 6.64s/it]
33%|████ | 932/2826 [1:31:24<3:22:17, 6.41s/it]
33%|████ | 933/2826 [1:31:30<3:15:02, 6.18s/it]
33%|████ | 934/2826 [1:31:35<3:04:54, 5.86s/it]
33%|████ | 935/2826 [1:31:41<3:03:18, 5.82s/it]
33%|████ | 936/2826 [1:31:46<3:00:09, 5.72s/it]
33%|████ | 937/2826 [1:31:52<3:02:43, 5.80s/it]
33%|████ | 938/2826 [1:31:58<3:03:28, 5.83s/it]
33%|████ | 939/2826 [1:32:05<3:14:38, 6.19s/it]
33%|████ | 940/2826 [1:32:11<3:10:32, 6.06s/it]
{'loss': 0.4928, 'grad_norm': 2.5266008377075195, 'learning_rate': 4.222994887945219e-06, 'epoch': 1.0}
33%|████ | 941/2826 [1:32:17<3:07:09, 5.96s/it]
33%|████ | 942/2826 [1:32:22<3:02:07, 5.80s/it]
33%|████ | 943/2826 [1:32:25<2:31:12, 4.82s/it]
[INFO|trainer.py:3984] 2025-10-18 08:18:38,867 >> Saving model checkpoint to /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943
[INFO|configuration_utils.py:419] 2025-10-18 08:18:38,877 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/config.json
[INFO|configuration_utils.py:911] 2025-10-18 08:18:38,879 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/generation_config.json
[INFO|modeling_utils.py:3580] 2025-10-18 08:18:54,649 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2025-10-18 08:18:54,651 >> tokenizer config file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-10-18 08:18:54,652 >> Special tokens file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/special_tokens_map.json
[2025-10-18 08:18:55,344] [INFO] [logging.py:107:log_dist] [Rank 0] [Torch] Checkpoint global_step942 is about to be saved!
[2025-10-18 08:18:55,355] [INFO] [logging.py:107:log_dist] [Rank 0] Saving model checkpoint: /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/zero_pp_rank_0_mp_rank_00_model_states.pt
[2025-10-18 08:18:55,355] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2025-10-18 08:18:55,372] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2025-10-18 08:18:55,384] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-10-18 08:19:06,711] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-10-18 08:19:06,716] [INFO] [engine.py:3701:_save_zero_checkpoint] zero checkpoint saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-943/global_step942/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-10-18 08:19:07,451] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step942 is ready now!
33%|████ | 944/2826 [1:33:09<8:41:27, 16.62s/it]
33%|████ | 945/2826 [1:33:15<7:01:57, 13.46s/it]
33%|████ | 946/2826 [1:33:20<5:46:28, 11.06s/it]
34%|████ | 947/2826 [1:33:27<5:04:11, 9.71s/it]
34%|████ | 948/2826 [1:33:32<4:23:27, 8.42s/it]
34%|████ | 949/2826 [1:33:39<4:01:52, 7.73s/it]
34%|████ | 950/2826 [1:33:45<3:49:22, 7.34s/it]
{'loss': 0.3963, 'grad_norm': 2.5962352752685547, 'learning_rate': 4.20048574867773e-06, 'epoch': 1.01}

34%|████ | 951/2826 [1:33:52<3:43:24, 7.15s/it]
34%|████ | 952/2826 [1:33:57<3:24:25, 6.55s/it]
34%|████ | 953/2826 [1:34:04<3:28:22, 6.68s/it]
34%|████ | 954/2826 [1:34:10<3:23:13, 6.51s/it]
34%|████ | 955/2826 [1:34:15<3:11:55, 6.15s/it]
34%|████ | 956/2826 [1:34:21<3:03:10, 5.88s/it]
34%|████ | 957/2826 [1:34:26<3:02:49, 5.87s/it]
34%|████ | 958/2826 [1:34:32<3:00:15, 5.79s/it]
34%|████ | 959/2826 [1:34:38<3:01:04, 5.82s/it]
34%|████ | 960/2826 [1:34:43<2:53:58, 5.59s/it]
{'loss': 0.3125, 'grad_norm': 2.707613229751587, 'learning_rate': 4.1777170872197725e-06, 'epoch': 1.02}

34%|████ | 961/2826 [1:34:48<2:49:24, 5.45s/it]
34%|████ | 962/2826 [1:34:55<2:59:06, 5.77s/it]
34%|████ | 963/2826 [1:35:00<2:52:13, 5.55s/it]
34%|████ | 964/2826 [1:35:06<2:57:05, 5.71s/it]
34%|████ | 965/2826 [1:35:12<3:00:15, 5.81s/it]
34%|████ | 966/2826 [1:35:17<2:54:05, 5.62s/it]
34%|████ | 967/2826 [1:35:22<2:50:43, 5.51s/it]
34%|████ | 968/2826 [1:35:28<2:51:12, 5.53s/it]
34%|████ | 969/2826 [1:35:35<3:06:10, 6.02s/it]
34%|████ | 970/2826 [1:35:40<3:00:38, 5.84s/it]
{'loss': 0.3457, 'grad_norm': 2.4237964153289795, 'learning_rate': 4.1546923784448646e-06, 'epoch': 1.03}

34%|████ | 971/2826 [1:35:46<2:56:18, 5.70s/it]
34%|████ | 972/2826 [1:35:52<3:02:30, 5.91s/it]
34%|████ | 973/2826 [1:35:57<2:56:13, 5.71s/it]
34%|████ | 974/2826 [1:36:02<2:51:26, 5.55s/it]
35%|████ | 975/2826 [1:36:09<2:58:15, 5.78s/it]
35%|████ | 976/2826 [1:36:15<3:01:07, 5.87s/it]
35%|████ | 977/2826 [1:36:21<2:59:05, 5.81s/it]
35%|████ | 978/2826 [1:36:27<3:05:22, 6.02s/it]
35%|████ | 979/2826 [1:36:33<3:02:09, 5.92s/it]
35%|████ | 980/2826 [1:36:38<2:55:14, 5.70s/it]
{'loss': 0.3029, 'grad_norm': 1.6531928777694702, 'learning_rate': 4.1314151363035705e-06, 'epoch': 1.04}
35%|████ | 981/2826 [1:36:43<2:49:17, 5.51s/it]
35%|████ | 982/2826 [1:36:48<2:46:14, 5.41s/it]
35%|████ | 983/2826 [1:36:54<2:50:11, 5.54s/it]
35%|████ | 984/2826 [1:37:00<2:56:04, 5.74s/it]
35%|████ | 985/2826 [1:37:06<2:56:10, 5.74s/it]
35%|████ | 986/2826 [1:37:12<3:02:53, 5.96s/it]
35%|████ | 987/2826 [1:37:19<3:06:18, 6.08s/it]
35%|████ | 988/2826 [1:37:25<3:05:34, 6.06s/it]
35%|████ | 989/2826 [1:37:30<2:58:32, 5.83s/it]
35%|████ | 990/2826 [1:37:36<3:02:40, 5.97s/it]
{'loss': 0.3289, 'grad_norm': 2.1669981479644775, 'learning_rate': 4.1078889132872145e-06, 'epoch': 1.05}

35%|████ | 991/2826 [1:37:42<2:57:27, 5.80s/it]
35%|████ | 992/2826 [1:37:48<2:57:00, 5.79s/it]
35%|████ | 993/2826 [1:37:54<2:58:57, 5.86s/it]
35%|████ | 994/2826 [1:37:59<2:54:16, 5.71s/it]
35%|████ | 995/2826 [1:38:05<3:02:00, 5.96s/it]
35%|████ | 996/2826 [1:38:12<3:09:19, 6.21s/it]
35%|████ | 997/2826 [1:38:17<2:58:04, 5.84s/it]
35%|████ | 998/2826 [1:38:23<2:52:42, 5.67s/it]
35%|████ | 999/2826 [1:38:30<3:08:35, 6.19s/it]
35%|████ | 1000/2826 [1:38:36<3:09:47, 6.24s/it]
{'loss': 0.3234, 'grad_norm': 2.445012092590332, 'learning_rate': 4.084117299885712e-06, 'epoch': 1.06}

35%|████ | 1001/2826 [1:38:42<3:04:16, 6.06s/it]
35%|████ | 1002/2826 [1:38:48<3:01:52, 5.98s/it]
35%|████ | 1003/2826 [1:38:54<3:00:59, 5.96s/it]
36%|████ | 1004/2826 [1:39:00<3:09:10, 6.23s/it]
36%|████ | 1005/2826 [1:39:07<3:07:42, 6.18s/it]
36%|████ | 1006/2826 [1:39:13<3:11:31, 6.31s/it]
36%|████ | 1007/2826 [1:39:18<2:59:50, 5.93s/it]
36%|████ | 1008/2826 [1:39:23<2:51:32, 5.66s/it]
36%|████ | 1009/2826 [1:39:29<2:49:19, 5.59s/it]
36%|████ | 1010/2826 [1:39:35<2:59:22, 5.93s/it]
{'loss': 0.3139, 'grad_norm': 2.0615527629852295, 'learning_rate': 4.060103924039599e-06, 'epoch': 1.07}

36%|████ | 1011/2826 [1:39:41<2:59:26, 5.93s/it]
36%|████ | 1012/2826 [1:39:47<2:56:02, 5.82s/it]
36%|████ | 1013/2826 [1:39:52<2:50:05, 5.63s/it]
36%|████ | 1014/2826 [1:39:57<2:47:28, 5.55s/it]
36%|████ | 1015/2826 [1:40:03<2:48:04, 5.57s/it]
36%|████ | 1016/2826 [1:40:09<2:47:44, 5.56s/it]
36%|████ | 1017/2826 [1:40:14<2:46:35, 5.53s/it]
36%|████ | 1018/2826 [1:40:20<2:48:22, 5.59s/it]
36%|████ | 1019/2826 [1:40:25<2:49:06, 5.62s/it]
36%|████ | 1020/2826 [1:40:32<2:55:54, 5.84s/it]
{'loss': 0.3144, 'grad_norm': 1.990400791168213, 'learning_rate': 4.035852450586352e-06, 'epoch': 1.08}
36%|████ | 1021/2826 [1:40:37<2:52:24, 5.73s/it]
36%|████ | 1022/2826 [1:40:43<2:52:07, 5.73s/it]
36%|████ | 1023/2826 [1:40:48<2:49:56, 5.66s/it]
36%|████ | 1024/2826 [1:40:54<2:49:48, 5.65s/it]
36%|████ | 1025/2826 [1:40:59<2:44:53, 5.49s/it]
36%|████ | 1026/2826 [1:41:05<2:46:54, 5.56s/it]
36%|████ | 1027/2826 [1:41:10<2:42:46, 5.43s/it]
36%|████ | 1028/2826 [1:41:17<2:58:46, 5.97s/it]
36%|████ | 1029/2826 [1:41:23<2:59:35, 6.00s/it]
36%|████ | 1030/2826 [1:41:29<2:58:41, 5.97s/it]
{'loss': 0.323, 'grad_norm': 2.5510122776031494, 'learning_rate': 4.011366580701073e-06, 'epoch': 1.09}

36%|████ | 1031/2826 [1:41:36<3:04:08, 6.16s/it]
37%|████ | 1032/2826 [1:41:42<3:07:15, 6.26s/it]
37%|████ | 1033/2826 [1:41:48<3:00:04, 6.03s/it]
37%|████ | 1034/2826 [1:41:54<3:03:10, 6.13s/it]
37%|████ | 1035/2826 [1:42:01<3:10:27, 6.38s/it]
37%|████ | 1036/2826 [1:42:07<3:05:23, 6.21s/it]
37%|████ | 1037/2826 [1:42:12<2:58:14, 5.98s/it]
37%|████ | 1038/2826 [1:42:18<2:50:23, 5.72s/it]
37%|████ | 1039/2826 [1:42:23<2:47:49, 5.64s/it]
37%|████ | 1040/2826 [1:42:28<2:42:38, 5.46s/it]
{'loss': 0.3694, 'grad_norm': 2.462083101272583, 'learning_rate': 3.9866500513316274e-06, 'epoch': 1.1}

37%|████ | 1041/2826 [1:42:34<2:50:41, 5.74s/it]
37%|████ | 1042/2826 [1:42:40<2:45:18, 5.56s/it]
37%|████ | 1043/2826 [1:42:45<2:45:43, 5.58s/it]
37%|████ | 1044/2826 [1:42:53<3:00:56, 6.09s/it]
37%|████ | 1045/2826 [1:42:58<2:59:06, 6.03s/it]
37%|████ | 1046/2826 [1:43:04<2:58:46, 6.03s/it]
37%|████ | 1047/2826 [1:43:10<2:51:13, 5.78s/it]
37%|████ | 1048/2826 [1:43:15<2:46:01, 5.60s/it]
37%|████ | 1049/2826 [1:43:20<2:45:51, 5.60s/it]
37%|████ | 1050/2826 [1:43:26<2:41:47, 5.47s/it]
{'loss': 0.3351, 'grad_norm': 2.4385085105895996, 'learning_rate': 3.961706634628323e-06, 'epoch': 1.11}

37%|████ | 1051/2826 [1:43:32<2:46:15, 5.62s/it]
37%|████ | 1052/2826 [1:43:37<2:41:15, 5.45s/it]
37%|████ | 1053/2826 [1:43:42<2:42:28, 5.50s/it]
37%|████ | 1054/2826 [1:43:49<2:51:56, 5.82s/it]
37%|████ | 1055/2826 [1:43:55<2:55:01, 5.93s/it]
37%|████ | 1056/2826 [1:44:01<2:52:50, 5.86s/it]
37%|████ | 1057/2826 [1:44:07<2:58:38, 6.06s/it]
37%|████ | 1058/2826 [1:44:14<3:03:47, 6.24s/it]
37%|████ | 1059/2826 [1:44:21<3:11:08, 6.49s/it]
38%|████ | 1060/2826 [1:44:26<3:00:03, 6.12s/it]
{'loss': 0.3459, 'grad_norm': 1.7553578615188599, 'learning_rate': 3.936540137368222e-06, 'epoch': 1.12}
38%|████ | 1061/2826 [1:44:32<3:01:32, 6.17s/it]
38%|████ | 1062/2826 [1:44:38<2:54:03, 5.92s/it]
38%|████ | 1063/2826 [1:44:45<3:05:06, 6.30s/it]
38%|████ | 1064/2826 [1:44:51<2:59:14, 6.10s/it]
38%|████ | 1065/2826 [1:44:56<2:50:04, 5.79s/it]
38%|████ | 1066/2826 [1:45:01<2:44:29, 5.61s/it]
38%|████ | 1067/2826 [1:45:07<2:44:30, 5.61s/it]
38%|████ | 1068/2826 [1:45:12<2:40:10, 5.47s/it]
38%|████ | 1069/2826 [1:45:18<2:50:12, 5.81s/it]
38%|████ | 1070/2826 [1:45:24<2:45:11, 5.64s/it]
{'loss': 0.3186, 'grad_norm': 2.513950824737549, 'learning_rate': 3.911154400374159e-06, 'epoch': 1.13}

38%|████ | 1071/2826 [1:45:29<2:44:36, 5.63s/it]
38%|████ | 1072/2826 [1:45:35<2:43:08, 5.58s/it]
38%|████ | 1073/2826 [1:45:40<2:38:13, 5.42s/it]
38%|████ | 1074/2826 [1:45:45<2:39:13, 5.45s/it]
38%|████ | 1075/2826 [1:45:50<2:37:30, 5.40s/it]
38%|████ | 1076/2826 [1:45:57<2:51:32, 5.88s/it]
38%|████ | 1077/2826 [1:46:04<2:56:12, 6.04s/it]
38%|████ | 1078/2826 [1:46:09<2:52:41, 5.93s/it]
38%|████ | 1079/2826 [1:46:14<2:44:03, 5.63s/it]
38%|████ | 1080/2826 [1:46:21<2:48:28, 5.79s/it]
{'loss': 0.3333, 'grad_norm': 2.6273515224456787, 'learning_rate': 3.885553297928573e-06, 'epoch': 1.15}

38%|████ | 1081/2826 [1:46:27<2:53:52, 5.98s/it]
38%|████ | 1082/2826 [1:46:34<3:02:20, 6.27s/it]
38%|████ | 1083/2826 [1:46:40<2:57:07, 6.10s/it]
38%|████ | 1084/2826 [1:46:45<2:53:59, 5.99s/it]
38%|████ | 1085/2826 [1:46:51<2:46:29, 5.74s/it]
38%|████ | 1086/2826 [1:46:56<2:43:57, 5.65s/it]
38%|████ | 1087/2826 [1:47:01<2:39:38, 5.51s/it]
38%|████ | 1088/2826 [1:47:08<2:50:11, 5.88s/it]
39%|████ | 1089/2826 [1:47:13<2:45:55, 5.73s/it]
39%|████ | 1090/2826 [1:47:19<2:42:49, 5.63s/it]
{'loss': 0.3137, 'grad_norm': 2.4155592918395996, 'learning_rate': 3.859740737182222e-06, 'epoch': 1.16}

39%|████ | 1091/2826 [1:47:26<2:54:06, 6.02s/it]
39%|████ | 1092/2826 [1:47:31<2:52:35, 5.97s/it]
39%|████ | 1093/2826 [1:47:37<2:45:28, 5.73s/it]
39%|████ | 1094/2826 [1:47:42<2:39:51, 5.54s/it]
39%|████ | 1095/2826 [1:47:49<2:50:25, 5.91s/it]
39%|████ | 1096/2826 [1:47:54<2:47:13, 5.80s/it]
39%|████ | 1097/2826 [1:48:00<2:47:03, 5.80s/it]
39%|████ | 1098/2826 [1:48:06<2:50:12, 5.91s/it]
39%|████ | 1099/2826 [1:48:12<2:52:30, 5.99s/it]
39%|████ | 1100/2826 [1:48:18<2:49:45, 5.90s/it]
{'loss': 0.3426, 'grad_norm': 2.719611644744873, 'learning_rate': 3.833720657557894e-06, 'epoch': 1.17}

39%|████ | 1101/2826 [1:48:23<2:42:16, 5.64s/it]
39%|████ | 1102/2826 [1:48:28<2:38:20, 5.51s/it]
39%|████ | 1103/2826 [1:48:33<2:36:09, 5.44s/it]
39%|████ | 1104/2826 [1:48:39<2:39:40, 5.56s/it]
39%|████ | 1105/2826 [1:48:44<2:36:35, 5.46s/it]
39%|████ | 1106/2826 [1:48:50<2:35:30, 5.42s/it]
39%|████ | 1107/2826 [1:48:56<2:39:28, 5.57s/it]
39%|████ | 1108/2826 [1:49:01<2:36:21, 5.46s/it]
39%|████ | 1109/2826 [1:49:07<2:44:47, 5.76s/it]
39%|████ | 1110/2826 [1:49:13<2:43:13, 5.71s/it]
{'loss': 0.3709, 'grad_norm': 2.5729358196258545, 'learning_rate': 3.807497030149181e-06, 'epoch': 1.18} |
|
39%|ββββ | 1110/2826 [1:49:13<2:43:13, 5.71s/it]
39%|ββββ | 1111/2826 [1:49:18<2:37:52, 5.52s/it]
39%|ββββ | 1112/2826 [1:49:25<2:51:38, 6.01s/it]
39%|ββββ | 1113/2826 [1:49:31<2:48:24, 5.90s/it]
39%|ββββ | 1114/2826 [1:49:36<2:44:09, 5.75s/it]
39%|ββββ | 1115/2826 [1:49:42<2:41:00, 5.65s/it]
39%|ββββ | 1116/2826 [1:49:47<2:36:20, 5.49s/it]
40%|ββββ | 1117/2826 [1:49:52<2:33:20, 5.38s/it]
40%|ββββ | 1118/2826 [1:49:59<2:43:55, 5.76s/it]
40%|ββββ | 1119/2826 [1:50:05<2:50:31, 5.99s/it]
40%|ββββ | 1120/2826 [1:50:12<2:59:47, 6.32s/it]
{'loss': 0.329, 'grad_norm': 1.9626141786575317, 'learning_rate': 3.7810738571144257e-06, 'epoch': 1.19} |
|
40%|ββββ | 1120/2826 [1:50:12<2:59:47, 6.32s/it]
40%|ββββ | 1121/2826 [1:50:19<3:00:26, 6.35s/it]
40%|ββββ | 1122/2826 [1:50:25<2:58:29, 6.28s/it]
40%|ββββ | 1123/2826 [1:50:31<3:00:30, 6.36s/it]
40%|ββββ | 1124/2826 [1:50:37<2:54:26, 6.15s/it]
40%|ββββ | 1125/2826 [1:50:43<2:49:42, 5.99s/it]
40%|ββββ | 1126/2826 [1:50:48<2:47:34, 5.91s/it]
40%|ββββ | 1127/2826 [1:50:53<2:40:54, 5.68s/it]
40%|ββββ | 1128/2826 [1:51:00<2:49:57, 6.01s/it]
40%|ββββ | 1129/2826 [1:51:06<2:44:59, 5.83s/it]
40%|ββββ | 1130/2826 [1:51:11<2:43:19, 5.78s/it]
{'loss': 0.305, 'grad_norm': 2.601951837539673, 'learning_rate': 3.7544551710659296e-06, 'epoch': 1.2} |
|
40%|ββββ | 1130/2826 [1:51:11<2:43:19, 5.78s/it]
40%|ββββ | 1131/2826 [1:51:17<2:38:51, 5.62s/it]
40%|ββββ | 1132/2826 [1:51:22<2:38:00, 5.60s/it]
40%|ββββ | 1133/2826 [1:51:27<2:33:40, 5.45s/it]
40%|ββββ | 1134/2826 [1:51:32<2:31:17, 5.37s/it]
40%|ββββ | 1135/2826 [1:51:38<2:31:27, 5.37s/it]
40%|ββββ | 1136/2826 [1:51:43<2:32:33, 5.42s/it]
40%|ββββ | 1137/2826 [1:51:49<2:34:00, 5.47s/it]
40%|ββββ | 1138/2826 [1:51:55<2:38:03, 5.62s/it]
40%|ββββ | 1139/2826 [1:52:02<2:54:30, 6.21s/it]
40%|ββββ | 1140/2826 [1:52:08<2:48:10, 5.98s/it]
{'loss': 0.3449, 'grad_norm': 2.4118540287017822, 'learning_rate': 3.7276450344545024e-06, 'epoch': 1.21} |
|
40%|ββββ | 1140/2826 [1:52:08<2:48:10, 5.98s/it]
40%|ββββ | 1141/2826 [1:52:14<2:49:24, 6.03s/it]
40%|ββββ | 1142/2826 [1:52:19<2:44:28, 5.86s/it]
40%|ββββ | 1143/2826 [1:52:26<2:46:48, 5.95s/it]
40%|ββββ | 1144/2826 [1:52:32<2:49:26, 6.04s/it]
41%|ββββ | 1145/2826 [1:52:39<2:54:45, 6.24s/it]
41%|ββββ | 1146/2826 [1:52:44<2:49:05, 6.04s/it]
41%|ββββ | 1147/2826 [1:52:50<2:44:31, 5.88s/it]
41%|ββββ | 1148/2826 [1:52:56<2:44:26, 5.88s/it]
41%|ββββ | 1149/2826 [1:53:01<2:38:38, 5.68s/it]
41%|ββββ | 1150/2826 [1:53:07<2:41:21, 5.78s/it]
{'loss': 0.3403, 'grad_norm': 2.5080604553222656, 'learning_rate': 3.7006475389494723e-06, 'epoch': 1.22} |
|
41%|ββββ | 1150/2826 [1:53:07<2:41:21, 5.78s/it]
41%|████      | 1160/2826 [1:54:06<2:46:48, 6.01s/it]
{'loss': 0.3342, 'grad_norm': 2.6882951259613037, 'learning_rate': 3.6734668048142273e-06, 'epoch': 1.23}
41%|█████     | 1170/2826 [1:55:04<2:46:16, 6.02s/it]
{'loss': 0.3589, 'grad_norm': 2.3755247592926025, 'learning_rate': 3.646106980277394e-06, 'epoch': 1.24}
42%|█████     | 1180/2826 [1:56:02<2:35:59, 5.69s/it]
{'loss': 0.3447, 'grad_norm': 2.4138166904449463, 'learning_rate': 3.618572240899748e-06, 'epoch': 1.25}
42%|█████     | 1190/2826 [1:56:58<2:33:41, 5.64s/it]
{'loss': 0.3787, 'grad_norm': 2.6930105686187744, 'learning_rate': 3.5908667889369603e-06, 'epoch': 1.26}
42%|█████     | 1200/2826 [1:57:58<2:39:54, 5.90s/it]
{'loss': 0.3376, 'grad_norm': 2.732795476913452, 'learning_rate': 3.5629948526982563e-06, 'epoch': 1.27}
43%|█████     | 1210/2826 [1:58:55<2:33:47, 5.71s/it]
{'loss': 0.3461, 'grad_norm': 1.8468087911605835, 'learning_rate': 3.534960685901111e-06, 'epoch': 1.28}
43%|█████     | 1220/2826 [1:59:49<2:24:10, 5.39s/it]
{'loss': 0.3396, 'grad_norm': 2.3408284187316895, 'learning_rate': 3.506768567022062e-06, 'epoch': 1.29}
44%|█████     | 1230/2826 [2:00:49<2:38:19, 5.95s/it]
{'loss': 0.3364, 'grad_norm': 2.7420434951782227, 'learning_rate': 3.478422798643737e-06, 'epoch': 1.3}
44%|█████     | 1240/2826 [2:01:46<2:29:55, 5.67s/it]
{'loss': 0.3126, 'grad_norm': 2.634403705596924, 'learning_rate': 3.4499277067982177e-06, 'epoch': 1.32}
44%|█████     | 1250/2826 [2:02:44<2:22:53, 5.44s/it]
{'loss': 0.3092, 'grad_norm': 2.4217336177825928, 'learning_rate': 3.421287640306809e-06, 'epoch': 1.33}
45%|█████     | 1260/2826 [2:03:48<2:57:03, 6.78s/it]
{'loss': 0.3374, 'grad_norm': 1.7107937335968018, 'learning_rate': 3.3925069701163406e-06, 'epoch': 1.34}
45%|█████     | 1270/2826 [2:04:49<2:26:33, 5.65s/it]
{'loss': 0.3436, 'grad_norm': 2.1515822410583496, 'learning_rate': 3.363590088632085e-06, 'epoch': 1.35}
45%|█████     | 1280/2826 [2:05:49<2:28:39, 5.77s/it]
{'loss': 0.3283, 'grad_norm': 2.0105717182159424, 'learning_rate': 3.334541409047408e-06, 'epoch': 1.36}
46%|█████     | 1290/2826 [2:06:47<2:39:01, 6.21s/it]
{'loss': 0.358, 'grad_norm': 1.8952791690826416, 'learning_rate': 3.3053653646702422e-06, 'epoch': 1.37}
46%|█████     | 1300/2826 [2:07:45<2:21:48, 5.58s/it]
{'loss': 0.3084, 'grad_norm': 1.8639928102493286, 'learning_rate': 3.276066408246487e-06, 'epoch': 1.38}
46%|█████     | 1310/2826 [2:08:45<2:43:16, 6.46s/it]
{'loss': 0.3508, 'grad_norm': 2.563251256942749, 'learning_rate': 3.2466490112804484e-06, 'epoch': 1.39}
47%|█████     | 1320/2826 [2:09:41<2:17:45, 5.49s/it]
{'loss': 0.3215, 'grad_norm': 2.214616060256958, 'learning_rate': 3.217117663352417e-06, 'epoch': 1.4}
47%|█████     | 1330/2826 [2:10:40<2:29:08, 5.98s/it]
{'loss': 0.3193, 'grad_norm': 1.793468952178955, 'learning_rate': 3.187476871433478e-06, 'epoch': 1.41}
47%|█████     | 1340/2826 [2:11:37<2:18:12, 5.58s/it]
{'loss': 0.3019, 'grad_norm': 2.204789638519287, 'learning_rate': 3.1577311591976766e-06, 'epoch': 1.42}
48%|█████     | 1350/2826 [2:12:36<2:37:53, 6.42s/it]
{'loss': 0.3099, 'grad_norm': 2.307568311691284, 'learning_rate': 3.1278850663316307e-06, 'epoch': 1.43}
48%|█████     | 1360/2826 [2:13:35<2:27:52, 6.05s/it]
{'loss': 0.3085, 'grad_norm': 2.485848903656006, 'learning_rate': 3.0979431478416987e-06, 'epoch': 1.44}
48%|█████     | 1370/2826 [2:14:32<2:14:16, 5.53s/it]
{'loss': 0.3211, 'grad_norm': 1.953053593635559, 'learning_rate': 3.067909973358811e-06, 'epoch': 1.45}
49%|█████     | 1380/2826 [2:15:37<2:31:28, 6.29s/it]
{'loss': 0.3329, 'grad_norm': 2.2350101470947266, 'learning_rate': 3.0377901264410673e-06, 'epoch': 1.46}
49%|█████     | 1390/2826 [2:16:35<2:14:07, 5.60s/it]
{'loss': 0.3376, 'grad_norm': 2.542452335357666, 'learning_rate': 3.0075882038742133e-06, 'epoch': 1.47}
50%|█████     | 1400/2826 [2:17:35<2:17:12, 5.77s/it]
{'loss': 0.2896, 'grad_norm': 2.3203530311584473, 'learning_rate': 2.9773088149700923e-06, 'epoch': 1.48}
50%|█████     | 1410/2826 [2:18:32<2:10:58, 5.55s/it]
{'loss': 0.299, 'grad_norm': 1.9708584547042847, 'learning_rate': 2.9469565808631888e-06, 'epoch': 1.5}
50%|█████     | 1420/2826 [2:19:29<2:08:46, 5.50s/it]
{'loss': 0.3484, 'grad_norm': 2.63698148727417, 'learning_rate': 2.9165361338053683e-06, 'epoch': 1.51}
51%|█████     | 1430/2826 [2:20:28<2:28:01, 6.36s/it]
{'loss': 0.3316, 'grad_norm': 2.091648578643799, 'learning_rate': 2.886052116458918e-06, 'epoch': 1.52}
51%|█████     | 1440/2826 [2:21:25<2:12:38, 5.74s/it]
{'loss': 0.328, 'grad_norm': 1.955355167388916, 'learning_rate': 2.8555091811880004e-06, 'epoch': 1.53}
51%|██████    | 1450/2826 [2:22:28<2:24:52, 6.32s/it]
{'loss': 0.3215, 'grad_norm': 1.6724951267242432, 'learning_rate': 2.8249119893486252e-06, 'epoch': 1.54}
52%|██████    | 1460/2826 [2:23:28<2:13:18, 5.86s/it]
{'loss': 0.3118, 'grad_norm': 2.1872570514678955, 'learning_rate': 2.7942652105772516e-06, 'epoch': 1.55}
52%|██████    | 1470/2826 [2:24:31<2:19:27, 6.17s/it]
{'loss': 0.2973, 'grad_norm': 3.0710208415985107, 'learning_rate': 2.7635735220781214e-06, 'epoch': 1.56}
52%|██████    | 1480/2826 [2:25:31<2:16:18, 6.08s/it]
{'loss': 0.3423, 'grad_norm': 2.357663631439209, 'learning_rate': 2.7328416079094412e-06, 'epoch': 1.57}
53%|██████    | 1490/2826 [2:26:27<2:11:41, 5.91s/it]
{'loss': 0.3211, 'grad_norm': 2.2559144496917725, 'learning_rate': 2.7020741582685217e-06, 'epoch': 1.58}
53%|██████    | 1500/2826 [2:27:26<2:08:59, 5.84s/it]
{'loss': 0.2733, 'grad_norm': 2.0730817317962646, 'learning_rate': 2.6712758687759706e-06, 'epoch': 1.59}
53%|██████    | 1510/2826 [2:28:24<2:08:26, 5.86s/it]
{'loss': 0.338, 'grad_norm': 2.6119141578674316, 'learning_rate': 2.6404514397590657e-06, 'epoch': 1.6}
54%|██████    | 1520/2826 [2:29:24<2:02:12, 5.61s/it]
{'loss': 0.3124, 'grad_norm': 2.315875768661499, 'learning_rate': 2.6096055755344113e-06, 'epoch': 1.61}
54%|██████    | 1530/2826 [2:30:18<1:59:13, 5.52s/it]
{'loss': 0.3538, 'grad_norm': 2.2880892753601074, 'learning_rate': 2.578742983689973e-06, 'epoch': 1.62}
54%|██████    | 1540/2826 [2:31:20<2:15:35, 6.33s/it]
{'loss': 0.3353, 'grad_norm': 2.2615041732788086, 'learning_rate': 2.547868374366631e-06, 'epoch': 1.63}
55%|██████    | 1550/2826 [2:32:18<2:02:13, 5.75s/it]
{'loss': 0.302, 'grad_norm': 1.9062315225601196, 'learning_rate': 2.5169864595393295e-06, 'epoch': 1.64}
55%|██████    | 1560/2826 [2:33:16<1:57:39, 5.58s/it]
{'loss': 0.3124, 'grad_norm': 2.7016942501068115, 'learning_rate': 2.4861019522979537e-06, 'epoch': 1.65}
56%|██████    | 1570/2826 [2:34:13<1:53:32, 5.42s/it]
{'loss': 0.3497, 'grad_norm': 2.4618184566497803, 'learning_rate': 2.455219566128034e-06, 'epoch': 1.67}
56%|██████    | 1580/2826 [2:35:12<2:07:54, 6.16s/it]
{'loss': 0.3233, 'grad_norm': 2.8924951553344727, 'learning_rate': 2.4243440141913905e-06, 'epoch': 1.68}
56%|██████    | 1590/2826 [2:36:10<1:55:40, 5.62s/it]
{'loss': 0.3067, 'grad_norm': 2.32255482673645, 'learning_rate': 2.393480008606825e-06, 'epoch': 1.69}
56%|██████    | 1592/2826 [2:36:20<1:50:24, 5.37s/it]
56%|ββββββ | 1593/2826 [2:36:26<1:55:33, 5.62s/it]
56%|ββββββ | 1594/2826 [2:36:32<1:57:34, 5.73s/it]
56%|ββββββ | 1595/2826 [2:36:38<1:55:04, 5.61s/it]
56%|ββββββ | 1596/2826 [2:36:44<1:59:10, 5.81s/it]
57%|ββββββ | 1597/2826 [2:36:49<1:55:38, 5.65s/it]
57%|ββββββ | 1598/2826 [2:36:55<1:58:18, 5.78s/it]
57%|ββββββ | 1599/2826 [2:37:02<2:04:41, 6.10s/it]
57%|ββββββ | 1600/2826 [2:37:09<2:11:51, 6.45s/it]
{'loss': 0.2893, 'grad_norm': 1.8984359502792358, 'learning_rate': 2.3626322597309774e-06, 'epoch': 1.7} |
|
57%|ββββββ | 1600/2826 [2:37:09<2:11:51, 6.45s/it]
57%|█████▋    | 1610/2826 [2:38:07<1:59:12, 5.88s/it]
{'loss': 0.2825, 'grad_norm': 1.8360289335250854, 'learning_rate': 2.331805475439445e-06, 'epoch': 1.71}
57%|█████▋    | 1620/2826 [2:39:01<1:44:54, 5.22s/it]
{'loss': 0.3379, 'grad_norm': 2.331998109817505, 'learning_rate': 2.3010043604082824e-06, 'epoch': 1.72}
58%|█████▊    | 1630/2826 [2:40:01<1:59:03, 5.97s/it]
{'loss': 0.301, 'grad_norm': 2.3304574489593506, 'learning_rate': 2.2702336153959925e-06, 'epoch': 1.73}
58%|█████▊    | 1640/2826 [2:40:58<2:02:47, 6.21s/it]
{'loss': 0.404, 'grad_norm': 2.534090518951416, 'learning_rate': 2.2394979365261134e-06, 'epoch': 1.74}
58%|█████▊    | 1650/2826 [2:41:56<1:52:02, 5.72s/it]
{'loss': 0.3242, 'grad_norm': 2.273122549057007, 'learning_rate': 2.208802014570507e-06, 'epoch': 1.75}
59%|█████▊    | 1660/2826 [2:42:55<2:00:40, 6.21s/it]
{'loss': 0.3152, 'grad_norm': 1.8859643936157227, 'learning_rate': 2.1781505342334775e-06, 'epoch': 1.76}
59%|█████▉    | 1670/2826 [2:43:56<1:51:35, 5.79s/it]
{'loss': 0.3302, 'grad_norm': 2.567715644836426, 'learning_rate': 2.147548173436805e-06, 'epoch': 1.77}
59%|█████▉    | 1680/2826 [2:44:55<1:49:36, 5.74s/it]
{'loss': 0.293, 'grad_norm': 2.7930519580841064, 'learning_rate': 2.116999602605814e-06, 'epoch': 1.78}
60%|█████▉    | 1690/2826 [2:45:54<1:46:55, 5.65s/it]
{'loss': 0.2683, 'grad_norm': 2.646296262741089, 'learning_rate': 2.086509483956594e-06, 'epoch': 1.79}
60%|██████    | 1700/2826 [2:46:50<1:48:47, 5.80s/it]
{'loss': 0.313, 'grad_norm': 2.3010053634643555, 'learning_rate': 2.056082470784469e-06, 'epoch': 1.8}
61%|██████    | 1710/2826 [2:47:47<1:42:47, 5.53s/it]
{'loss': 0.262, 'grad_norm': 2.3864669799804688, 'learning_rate': 2.0257232067538213e-06, 'epoch': 1.81}
61%|██████    | 1720/2826 [2:48:43<1:42:34, 5.56s/it]
{'loss': 0.3457, 'grad_norm': 2.63028883934021, 'learning_rate': 1.9954363251894007e-06, 'epoch': 1.82}
61%|██████    | 1730/2826 [2:49:43<1:50:19, 6.04s/it]
{'loss': 0.2739, 'grad_norm': 2.0011484622955322, 'learning_rate': 1.9652264483691933e-06, 'epoch': 1.84}
62%|██████▏   | 1740/2826 [2:50:40<1:38:18, 5.43s/it]
{'loss': 0.3109, 'grad_norm': 2.6818690299987793, 'learning_rate': 1.9350981868189944e-06, 'epoch': 1.85}
62%|██████▏   | 1750/2826 [2:51:36<1:47:42, 6.01s/it]
{'loss': 0.3269, 'grad_norm': 2.6978225708007812, 'learning_rate': 1.9050561386087618e-06, 'epoch': 1.86}
62%|██████▏   | 1760/2826 [2:52:34<1:44:20, 5.87s/it]
{'loss': 0.3617, 'grad_norm': 2.578031301498413, 'learning_rate': 1.8751048886508711e-06, 'epoch': 1.87}
63%|██████▎   | 1770/2826 [2:53:33<1:44:39, 5.95s/it]
{'loss': 0.3228, 'grad_norm': 2.5525052547454834, 'learning_rate': 1.8452490080003888e-06, 'epoch': 1.88}
63%|██████▎   | 1780/2826 [2:54:32<1:42:29, 5.88s/it]
{'loss': 0.2857, 'grad_norm': 2.1095635890960693, 'learning_rate': 1.8154930531574521e-06, 'epoch': 1.89}
63%|██████▎   | 1790/2826 [2:55:29<1:42:31, 5.94s/it]
{'loss': 0.3622, 'grad_norm': 2.3965845108032227, 'learning_rate': 1.785841565371868e-06, 'epoch': 1.9}
64%|██████▎   | 1800/2826 [2:56:27<1:38:07, 5.74s/it]
{'loss': 0.3031, 'grad_norm': 2.293715238571167, 'learning_rate': 1.7562990699500482e-06, 'epoch': 1.91}
64%|██████▍   | 1810/2826 [2:57:25<1:36:22, 5.69s/it]
{'loss': 0.3019, 'grad_norm': 2.026015281677246, 'learning_rate': 1.7268700755643708e-06, 'epoch': 1.92}
64%|██████▍   | 1820/2826 [2:58:25<1:39:09, 5.91s/it]
{'loss': 0.3047, 'grad_norm': 1.7175791263580322, 'learning_rate': 1.6975590735650812e-06, 'epoch': 1.93}
65%|██████▍   | 1830/2826 [2:59:22<1:36:23, 5.81s/it]
{'loss': 0.3048, 'grad_norm': 2.0024490356445312, 'learning_rate': 1.668370537294841e-06, 'epoch': 1.94}
65%|██████▌   | 1840/2826 [3:00:20<1:38:25, 5.99s/it]
{'loss': 0.3205, 'grad_norm': 2.8226239681243896, 'learning_rate': 1.6393089214060204e-06, 'epoch': 1.95}
65%|██████▌   | 1850/2826 [3:01:23<1:52:02, 6.89s/it]
{'loss': 0.321, 'grad_norm': 1.9452221393585205, 'learning_rate': 1.6103786611808414e-06, 'epoch': 1.96}
66%|██████▌   | 1860/2826 [3:02:19<1:27:40, 5.45s/it]
{'loss': 0.2954, 'grad_norm': 2.304274320602417, 'learning_rate': 1.5815841718544884e-06, 'epoch': 1.97}
66%|██████▌   | 1870/2826 [3:03:16<1:32:21, 5.80s/it]
{'loss': 0.2945, 'grad_norm': 2.502206802368164, 'learning_rate': 1.5529298479412636e-06, 'epoch': 1.98}
67%|██████▋   | 1880/2826 [3:04:13<1:24:42, 5.37s/it]
{'loss': 0.3291, 'grad_norm': 2.5796189308166504, 'learning_rate': 1.524420062563912e-06, 'epoch': 1.99}
67%|██████▋   | 1886/2826 [3:04:48<1:26:53, 5.55s/it]
[INFO|trainer.py:3984] 2025-10-18 09:51:01,296 >> Saving model checkpoint to /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886
[INFO|configuration_utils.py:419] 2025-10-18 09:51:01,303 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/config.json
[INFO|configuration_utils.py:911] 2025-10-18 09:51:01,304 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/generation_config.json
[INFO|modeling_utils.py:3580] 2025-10-18 09:51:16,354 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2025-10-18 09:51:16,359 >> tokenizer config file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-10-18 09:51:16,360 >> Special tokens file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/special_tokens_map.json
[2025-10-18 09:51:16,892] [INFO] [logging.py:107:log_dist] [Rank 0] [Torch] Checkpoint global_step1885 is about to be saved!
[2025-10-18 09:51:16,903] [INFO] [logging.py:107:log_dist] [Rank 0] Saving model checkpoint: /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/zero_pp_rank_0_mp_rank_00_model_states.pt
[2025-10-18 09:51:16,903] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2025-10-18 09:51:16,923] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2025-10-18 09:51:16,936] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-10-18 09:51:34,927] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-10-18 09:51:34,929] [INFO] [engine.py:3701:_save_zero_checkpoint] zero checkpoint saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-1886/global_step1885/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-10-18 09:51:35,161] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step1885 is ready now!
67%|██████▋   | 1890/2826 [3:05:53<2:37:08, 10.07s/it]
{'loss': 0.234, 'grad_norm': 1.9198871850967407, 'learning_rate': 1.4960591667862163e-06, 'epoch': 2.0}
67%|██████▋   | 1900/2826 [3:06:54<1:40:35, 6.52s/it]
{'loss': 0.1943, 'grad_norm': 1.7082706689834595, 'learning_rate': 1.4678514889489464e-06, 'epoch': 2.01}
68%|██████▊   | 1910/2826 [3:07:56<1:35:49, 6.28s/it]
{'loss': 0.1911, 'grad_norm': 1.8571817874908447, 'learning_rate': 1.4398013340092864e-06, 'epoch': 2.03}
68%|██████▊   | 1920/2826 [3:08:56<1:33:49, 6.21s/it]
{'loss': 0.1895, 'grad_norm': 2.454561233520508, 'learning_rate': 1.4119129828838275e-06, 'epoch': 2.04}
68%|██████▊   | 1930/2826 [3:09:56<1:24:37, 5.67s/it]
{'loss': 0.2177, 'grad_norm': 2.3714683055877686, 'learning_rate': 1.384190691795226e-06, 'epoch': 2.05}
69%|██████▊   | 1940/2826 [3:10:53<1:25:20, 5.78s/it]
{'loss': 0.2252, 'grad_norm': 2.1356313228607178, 'learning_rate': 1.3566386916226373e-06, 'epoch': 2.06}
69%|██████▉   | 1950/2826 [3:11:57<1:28:57, 6.09s/it]
{'loss': 0.1982, 'grad_norm': 2.446906089782715, 'learning_rate': 1.3292611872560134e-06, 'epoch': 2.07}
69%|██████▉   | 1960/2826 [3:12:58<1:32:42, 6.42s/it]
{'loss': 0.1696, 'grad_norm': 2.1040875911712646, 'learning_rate': 1.302062356954365e-06, 'epoch': 2.08}
69%|βββββββ | 1961/2826 [3:13:06<1:40:15, 6.95s/it]
69%|βββββββ | 1962/2826 [3:13:11<1:33:04, 6.46s/it]
69%|βββββββ | 1963/2826 [3:13:16<1:27:18, 6.07s/it]
69%|βββββββ | 1964/2826 [3:13:22<1:24:00, 5.85s/it]
70%|βββββββ | 1965/2826 [3:13:29<1:28:56, 6.20s/it]
70%|βββββββ | 1966/2826 [3:13:34<1:26:09, 6.01s/it]
70%|βββββββ | 1967/2826 [3:13:41<1:31:34, 6.40s/it]
70%|βββββββ | 1968/2826 [3:13:47<1:27:26, 6.12s/it]
70%|βββββββ | 1969/2826 [3:13:53<1:26:53, 6.08s/it]
70%|βββββββ | 1970/2826 [3:14:00<1:29:35, 6.28s/it]
{'loss': 0.1936, 'grad_norm': 2.220742702484131, 'learning_rate': 1.2750463517080922e-06, 'epoch': 2.09} |
|
70%|βββββββ | 1970/2826 [3:14:00<1:29:35, 6.28s/it]
70%|βββββββ | 1971/2826 [3:14:05<1:24:17, 5.91s/it]
70%|βββββββ | 1972/2826 [3:14:10<1:22:25, 5.79s/it]
70%|βββββββ | 1973/2826 [3:14:17<1:26:58, 6.12s/it]
70%|βββββββ | 1974/2826 [3:14:23<1:26:07, 6.07s/it]
70%|βββββββ | 1975/2826 [3:14:28<1:22:19, 5.80s/it]
70%|βββββββ | 1976/2826 [3:14:33<1:19:45, 5.63s/it]
70%|βββββββ | 1977/2826 [3:14:39<1:17:36, 5.48s/it]
70%|βββββββ | 1978/2826 [3:14:45<1:22:04, 5.81s/it]
70%|βββββββ | 1979/2826 [3:14:50<1:20:02, 5.67s/it]
70%|βββββββ | 1980/2826 [3:14:56<1:20:16, 5.69s/it]
{'loss': 0.1604, 'grad_norm': 2.7784054279327393, 'learning_rate': 1.2482172946054753e-06, 'epoch': 2.1} |
|
70%|βββββββ | 1980/2826 [3:14:56<1:20:16, 5.69s/it]
70%|███████ | 1990/2826 [3:15:54<1:22:43, 5.94s/it]
{'loss': 0.2069, 'grad_norm': 2.0539498329162598, 'learning_rate': 1.2215792802034187e-06, 'epoch': 2.11}
71%|███████ | 2000/2826 [3:16:54<1:18:18, 5.69s/it]
{'loss': 0.1964, 'grad_norm': 1.8337138891220093, 'learning_rate': 1.1951363739025618e-06, 'epoch': 2.12}
71%|███████ | 2010/2826 [3:17:53<1:14:54, 5.51s/it]
{'loss': 0.1871, 'grad_norm': 1.7631642818450928, 'learning_rate': 1.168892611326827e-06, 'epoch': 2.13}
71%|████████ | 2020/2826 [3:18:51<1:16:58, 5.73s/it]
{'loss': 0.2595, 'grad_norm': 2.386589527130127, 'learning_rate': 1.1428519977075136e-06, 'epoch': 2.14}
72%|████████ | 2030/2826 [3:19:49<1:13:51, 5.57s/it]
{'loss': 0.185, 'grad_norm': 2.553382635116577, 'learning_rate': 1.1170185072720434e-06, 'epoch': 2.15}
72%|████████ | 2040/2826 [3:20:47<1:20:00, 6.11s/it]
{'loss': 0.228, 'grad_norm': 2.870973825454712, 'learning_rate': 1.091396082637419e-06, 'epoch': 2.16}
73%|████████ | 2050/2826 [3:21:48<1:21:13, 6.28s/it]
{'loss': 0.2098, 'grad_norm': 2.643745183944702, 'learning_rate': 1.065988634208516e-06, 'epoch': 2.17}
73%|████████ | 2060/2826 [3:22:48<1:15:33, 5.92s/it]
{'loss': 0.1982, 'grad_norm': 2.369596481323242, 'learning_rate': 1.0408000395812961e-06, 'epoch': 2.18}
73%|████████ | 2070/2826 [3:23:45<1:11:27, 5.67s/it]
{'loss': 0.1844, 'grad_norm': 2.1093883514404297, 'learning_rate': 1.0158341429510194e-06, 'epoch': 2.2}
74%|████████ | 2080/2826 [3:24:45<1:12:01, 5.79s/it]
{'loss': 0.1654, 'grad_norm': 1.951935052871704, 'learning_rate': 9.910947545255523e-07, 'epoch': 2.21}
74%|████████ | 2090/2826 [3:25:43<1:11:40, 5.84s/it]
{'loss': 0.2037, 'grad_norm': 2.230781078338623, 'learning_rate': 9.665856499438744e-07, 'epoch': 2.22}
74%|████████ | 2100/2826 [3:26:40<1:05:13, 5.39s/it]
{'loss': 0.2087, 'grad_norm': 2.6240904331207275, 'learning_rate': 9.423105696998491e-07, 'epoch': 2.23}
75%|████████ | 2110/2826 [3:27:38<1:10:22, 5.90s/it]
{'loss': 0.2105, 'grad_norm': 1.712857723236084, 'learning_rate': 9.182732185713633e-07, 'epoch': 2.24}
75%|████████ | 2120/2826 [3:28:36<1:04:27, 5.48s/it]
{'loss': 0.2186, 'grad_norm': 2.036086082458496, 'learning_rate': 8.94477265054918e-07, 'epoch': 2.25}
75%|████████ | 2130/2826 [3:29:34<1:05:57, 5.69s/it]
{'loss': 0.1879, 'grad_norm': 2.3545398712158203, 'learning_rate': 8.709263408057522e-07, 'epoch': 2.26}
76%|████████ | 2140/2826 [3:30:30<1:07:36, 5.91s/it]
{'loss': 0.2177, 'grad_norm': 1.9098992347717285, 'learning_rate': 8.476240400835972e-07, 'epoch': 2.27}
76%|████████ | 2150/2826 [3:31:30<1:07:01, 5.95s/it]
{'loss': 0.165, 'grad_norm': 2.107959270477295, 'learning_rate': 8.245739192041311e-07, 'epoch': 2.28}
76%|████████ | 2160/2826 [3:32:27<1:04:54, 5.85s/it]
{'loss': 0.2018, 'grad_norm': 2.550719976425171, 'learning_rate': 8.017794959962225e-07, 'epoch': 2.29}
77%|████████ | 2170/2826 [3:33:28<1:00:57, 5.58s/it]
{'loss': 0.1955, 'grad_norm': 2.354701280593872, 'learning_rate': 7.792442492650587e-07, 'epoch': 2.3}
77%|████████ | 2180/2826 [3:34:25<58:22, 5.42s/it]
{'loss': 0.1976, 'grad_norm': 2.3547091484069824, 'learning_rate': 7.569716182612177e-07, 'epoch': 2.31}
77%|████████ | 2190/2826 [3:35:25<1:04:03, 6.04s/it]
{'loss': 0.1685, 'grad_norm': 1.4048022031784058, 'learning_rate': 7.349650021557839e-07, 'epoch': 2.32}
78%|████████ | 2200/2826 [3:36:19<54:53, 5.26s/it]
{'loss': 0.1519, 'grad_norm': 2.568500280380249, 'learning_rate': 7.132277595215773e-07, 'epoch': 2.33}
78%|████████ | 2210/2826 [3:37:18<1:01:06, 5.95s/it]
{'loss': 0.1573, 'grad_norm': 2.205993413925171, 'learning_rate': 6.917632078205805e-07, 'epoch': 2.34}
79%|████████ | 2220/2826 [3:38:14<54:23, 5.39s/it]
{'loss': 0.184, 'grad_norm': 2.067505121231079, 'learning_rate': 6.705746228976387e-07, 'epoch': 2.35}
79%|████████ | 2230/2826 [3:39:11<54:31, 5.49s/it]
{'loss': 0.1968, 'grad_norm': 2.4360201358795166, 'learning_rate': 6.496652384805125e-07, 'epoch': 2.36}
79%|████████ | 2240/2826 [3:40:12<1:01:36, 6.31s/it]
{'loss': 0.1846, 'grad_norm': 2.042179584503174, 'learning_rate': 6.290382456863584e-07, 'epoch': 2.38}
80%|████████ | 2250/2826 [3:41:13<55:52, 5.82s/it]
{'loss': 0.1858, 'grad_norm': 2.849271535873413, 'learning_rate': 6.086967925347075e-07, 'epoch': 2.39}
80%|████████ | 2260/2826 [3:42:12<55:51, 5.92s/it]
{'loss': 0.1837, 'grad_norm': 2.0765082836151123, 'learning_rate': 5.88643983467033e-07, 'epoch': 2.4}
80%|████████ | 2270/2826 [3:43:09<51:56, 5.60s/it]
{'loss': 0.1659, 'grad_norm': 1.9958840608596802, 'learning_rate': 5.688828788729547e-07, 'epoch': 2.41}
81%|████████ | 2280/2826 [3:44:04<50:05, 5.51s/it]
{'loss': 0.2095, 'grad_norm': 2.253602981567383, 'learning_rate': 5.494164946231747e-07, 'epoch': 2.42}
81%|████████ | 2290/2826 [3:45:03<53:50, 6.03s/it]
{'loss': 0.1862, 'grad_norm': 1.5552992820739746, 'learning_rate': 5.302478016092075e-07, 'epoch': 2.43}
81%|█████████ | 2300/2826 [3:46:01<53:44, 6.13s/it]
{'loss': 0.2085, 'grad_norm': 2.721445322036743, 'learning_rate': 5.113797252899728e-07, 'epoch': 2.44}
82%|█████████ | 2310/2826 [3:46:58<51:06, 5.94s/it]
{'loss': 0.1914, 'grad_norm': 2.3488707542419434, 'learning_rate': 4.928151452453184e-07, 'epoch': 2.45}
82%|█████████ | 2320/2826 [3:47:59<55:20, 6.56s/it]
{'loss': 0.1718, 'grad_norm': 2.49068021774292, 'learning_rate': 4.745568947365542e-07, 'epoch': 2.46}
82%|█████████ | 2330/2826 [3:48:58<50:41, 6.13s/it]
{'loss': 0.1669, 'grad_norm': 1.4638549089431763, 'learning_rate': 4.5660776027404654e-07, 'epoch': 2.47}
83%|█████████ | 2340/2826 [3:50:00<48:21, 5.97s/it]
{'loss': 0.1731, 'grad_norm': 2.288776159286499, 'learning_rate': 4.389704811919507e-07, 'epoch': 2.48}
83%|█████████ | 2350/2826 [3:50:57<45:41, 5.76s/it]
{'loss': 0.1802, 'grad_norm': 2.385162115097046, 'learning_rate': 4.216477492301455e-07, 'epoch': 2.49}
84%|█████████ | 2360/2826 [3:51:53<44:19, 5.71s/it]
{'loss': 0.2232, 'grad_norm': 2.0100815296173096, 'learning_rate': 4.0464220812342526e-07, 'epoch': 2.5}
84%|█████████ | 2370/2826 [3:52:52<45:14, 5.95s/it]
{'loss': 0.1432, 'grad_norm': 1.8439091444015503, 'learning_rate': 3.87956453198027e-07, 'epoch': 2.51}
84%|█████████ | 2380/2826 [3:53:51<44:09, 5.94s/it]
{'loss': 0.1834, 'grad_norm': 2.3093338012695312, 'learning_rate': 3.715930309755389e-07, 'epoch': 2.52}
85%|█████████ | 2390/2826 [3:54:47<40:25, 5.56s/it]
{'loss': 0.2123, 'grad_norm': 2.3250088691711426, 'learning_rate': 3.5555443878425635e-07, 'epoch': 2.53}
85%|█████████ | 2400/2826 [3:55:41<38:55, 5.48s/it]
{'loss': 0.2034, 'grad_norm': 1.8003133535385132, 'learning_rate': 3.398431243780531e-07, 'epoch': 2.55}
85%|█████████ | 2410/2826 [3:56:40<40:33, 5.85s/it]
{'loss': 0.1778, 'grad_norm': 2.8948135375976562, 'learning_rate': 3.2446148556281117e-07, 'epoch': 2.56}
86%|█████████ | 2420/2826 [3:57:36<39:02, 5.77s/it]
{'loss': 0.1892, 'grad_norm': 1.8556360006332397, 'learning_rate': 3.0941186983047543e-07, 'epoch': 2.57}
86%|█████████ | 2430/2826 [3:58:33<36:23, 5.51s/it]
{'loss': 0.1935, 'grad_norm': 2.771932363510132, 'learning_rate': 2.9469657400078925e-07, 'epoch': 2.58}
86%|█████████ | 2440/2826 [3:59:29<35:36, 5.54s/it]
{'loss': 0.1858, 'grad_norm': 2.5325114727020264, 'learning_rate': 2.8031784387076186e-07, 'epoch': 2.59}
87%|█████████ | 2450/2826 [4:00:25<35:57, 5.74s/it]
{'loss': 0.2118, 'grad_norm': 2.4069302082061768, 'learning_rate': 2.6627787387191934e-07, 'epoch': 2.6}
87%|█████████ | 2460/2826 [4:01:21<34:37, 5.68s/it]
{'loss': 0.1929, 'grad_norm': 2.053656816482544, 'learning_rate': 2.5257880673540376e-07, 'epoch': 2.61}
87%|βββββββββ | 2461/2826 [4:01:27<35:43, 5.87s/it]
87%|βββββββββ | 2462/2826 [4:01:32<34:20, 5.66s/it]
87%|βββββββββ | 2463/2826 [4:01:38<33:20, 5.51s/it]
87%|βββββββββ | 2464/2826 [4:01:43<33:13, 5.51s/it]
87%|βββββββββ | 2465/2826 [4:01:48<32:47, 5.45s/it]
87%|βββββββββ | 2466/2826 [4:01:55<34:01, 5.67s/it]
87%|βββββββββ | 2467/2826 [4:02:01<35:15, 5.89s/it]
87%|βββββββββ | 2468/2826 [4:02:06<33:49, 5.67s/it]
87%|βββββββββ | 2469/2826 [4:02:12<33:38, 5.66s/it]
87%|βββββββββ | 2470/2826 [4:02:18<34:17, 5.78s/it]
{'loss': 0.1745, 'grad_norm': 1.8820626735687256, 'learning_rate': 2.392227331649527e-07, 'epoch': 2.62} |
|
87%|βββββββββ | 2470/2826 [4:02:18<34:17, 5.78s/it]
87%|βββββββββ | 2471/2826 [4:02:25<36:53, 6.23s/it]
87%|βββββββββ | 2472/2826 [4:02:31<36:49, 6.24s/it]
88%|βββββββββ | 2473/2826 [4:02:37<34:55, 5.94s/it]
88%|βββββββββ | 2474/2826 [4:02:42<33:14, 5.67s/it]
88%|βββββββββ | 2475/2826 [4:02:47<33:20, 5.70s/it]
88%|βββββββββ | 2476/2826 [4:02:52<32:06, 5.50s/it]
88%|βββββββββ | 2477/2826 [4:02:58<32:24, 5.57s/it]
88%|βββββββββ | 2478/2826 [4:03:04<32:45, 5.65s/it]
88%|βββββββββ | 2479/2826 [4:03:09<32:06, 5.55s/it]
88%|βββββββββ | 2480/2826 [4:03:15<32:24, 5.62s/it]
{'loss': 0.1823, 'grad_norm': 1.9418586492538452, 'learning_rate': 2.2621169151782417e-07, 'epoch': 2.63} |
|
88%|βββββββββ | 2480/2826 [4:03:15<32:24, 5.62s/it]
88%|βββββββββ | 2481/2826 [4:03:21<32:21, 5.63s/it]
88%|βββββββββ | 2482/2826 [4:03:27<32:35, 5.69s/it]
88%|βββββββββ | 2483/2826 [4:03:32<31:38, 5.53s/it]
88%|βββββββββ | 2484/2826 [4:03:37<30:42, 5.39s/it]
88%|βββββββββ | 2485/2826 [4:03:42<30:08, 5.30s/it]
88%|βββββββββ | 2486/2826 [4:03:48<31:41, 5.59s/it]
88%|βββββββββ | 2487/2826 [4:03:54<31:32, 5.58s/it]
88%|βββββββββ | 2488/2826 [4:04:00<31:43, 5.63s/it]
88%|βββββββββ | 2489/2826 [4:04:05<31:34, 5.62s/it]
88%|βββββββββ | 2490/2826 [4:04:11<31:21, 5.60s/it]
{'loss': 0.2037, 'grad_norm': 2.519037961959839, 'learning_rate': 2.1354766749371093e-07, 'epoch': 2.64} |
|
88%|βββββββββ | 2490/2826 [4:04:11<31:21, 5.60s/it]
88%|βββββββββ | 2491/2826 [4:04:17<31:48, 5.70s/it]
88%|βββββββββ | 2492/2826 [4:04:22<30:45, 5.52s/it]
88%|βββββββββ | 2493/2826 [4:04:27<29:59, 5.40s/it]
88%|βββββββββ | 2494/2826 [4:04:33<31:57, 5.78s/it]
88%|βββββββββ | 2495/2826 [4:04:39<30:45, 5.58s/it]
88%|βββββββββ | 2496/2826 [4:04:45<32:42, 5.95s/it]
88%|βββββββββ | 2497/2826 [4:04:52<33:57, 6.19s/it]
88%|βββββββββ | 2498/2826 [4:04:58<32:49, 6.00s/it]
88%|βββββββββ | 2499/2826 [4:05:03<31:25, 5.77s/it]
88%|βββββββββ | 2500/2826 [4:05:10<34:02, 6.27s/it]
{'loss': 0.2196, 'grad_norm': 2.010211944580078, 'learning_rate': 2.0123259383169031e-07, 'epoch': 2.65} |
|
88%|βββββββββ | 2500/2826 [4:05:10<34:02, 6.27s/it]
88%|βββββββββ | 2501/2826 [4:05:17<33:53, 6.26s/it]
89%|βββββββββ | 2502/2826 [4:05:22<31:51, 5.90s/it]
89%|βββββββββ | 2503/2826 [4:05:27<30:32, 5.67s/it]
89%|βββββββββ | 2504/2826 [4:05:32<30:05, 5.61s/it]
89%|βββββββββ | 2505/2826 [4:05:38<29:39, 5.55s/it]
89%|βββββββββ | 2506/2826 [4:05:43<29:51, 5.60s/it]
89%|βββββββββ | 2507/2826 [4:05:50<30:35, 5.76s/it]
89%|βββββββββ | 2508/2826 [4:05:55<29:29, 5.56s/it]
89%|βββββββββ | 2509/2826 [4:06:00<29:27, 5.58s/it]
89%|βββββββββ | 2510/2826 [4:06:05<28:49, 5.47s/it]
{'loss': 0.1848, 'grad_norm': 1.9838532209396362, 'learning_rate': 1.8926835001525257e-07, 'epoch': 2.66} |
|
89%|βββββββββ | 2510/2826 [4:06:05<28:49, 5.47s/it]
89%|βββββββββ | 2511/2826 [4:06:11<29:01, 5.53s/it]
89%|βββββββββ | 2512/2826 [4:06:16<28:19, 5.41s/it]
89%|βββββββββ | 2513/2826 [4:06:23<30:10, 5.78s/it]
89%|βββββββββ | 2514/2826 [4:06:28<29:03, 5.59s/it]
89%|βββββββββ | 2515/2826 [4:06:33<28:38, 5.53s/it]
89%|βββββββββ | 2516/2826 [4:06:39<27:51, 5.39s/it]
89%|βββββββββ | 2517/2826 [4:06:44<27:15, 5.29s/it]
89%|βββββββββ | 2518/2826 [4:06:49<26:48, 5.22s/it]
89%|βββββββββ | 2519/2826 [4:06:55<28:33, 5.58s/it]
89%|βββββββββ | 2520/2826 [4:07:00<27:43, 5.44s/it]
{'loss': 0.1823, 'grad_norm': 2.3488149642944336, 'learning_rate': 1.776567619854655e-07, 'epoch': 2.67} |
|
89%|βββββββββ | 2520/2826 [4:07:00<27:43, 5.44s/it]
89%|βββββββββ | 2521/2826 [4:07:05<27:05, 5.33s/it]
89%|βββββββββ | 2522/2826 [4:07:11<28:08, 5.55s/it]
89%|βββββββββ | 2523/2826 [4:07:16<27:21, 5.42s/it]
89%|βββββββββ | 2524/2826 [4:07:22<26:52, 5.34s/it]
89%|βββββββββ | 2525/2826 [4:07:29<29:32, 5.89s/it]
89%|βββββββββ | 2526/2826 [4:07:35<29:17, 5.86s/it]
89%|βββββββββ | 2527/2826 [4:07:41<30:01, 6.02s/it]
89%|βββββββββ | 2528/2826 [4:07:48<31:50, 6.41s/it]
89%|βββββββββ | 2529/2826 [4:07:54<30:25, 6.15s/it]
90%|βββββββββ | 2530/2826 [4:08:00<30:51, 6.26s/it]
{'loss': 0.2039, 'grad_norm': 2.839651584625244, 'learning_rate': 1.6639960186230293e-07, 'epoch': 2.68} |
|
90%|βββββββββ | 2530/2826 [4:08:00<30:51, 6.26s/it]
90%|βββββββββ | 2531/2826 [4:08:05<29:05, 5.92s/it]
90%|βββββββββ | 2532/2826 [4:08:12<29:48, 6.08s/it]
90%|βββββββββ | 2533/2826 [4:08:19<31:31, 6.45s/it]
90%|βββββββββ | 2534/2826 [4:08:24<29:25, 6.05s/it]
90%|βββββββββ | 2535/2826 [4:08:29<27:43, 5.72s/it]
90%|βββββββββ | 2536/2826 [4:08:34<26:45, 5.54s/it]
90%|βββββββββ | 2537/2826 [4:08:40<26:13, 5.44s/it]
90%|βββββββββ | 2538/2826 [4:08:45<25:45, 5.37s/it]
90%|βββββββββ | 2539/2826 [4:08:51<26:14, 5.49s/it]
90%|βββββββββ | 2540/2826 [4:08:57<26:51, 5.63s/it]
{'loss': 0.1796, 'grad_norm': 2.050480842590332, 'learning_rate': 1.5549858767419018e-07, 'epoch': 2.69} |
|
90%|βββββββββ | 2540/2826 [4:08:57<26:51, 5.63s/it]
90%|βββββββββ | 2541/2826 [4:09:03<28:40, 6.04s/it]
90%|βββββββββ | 2542/2826 [4:09:09<27:17, 5.77s/it]
90%|βββββββββ | 2543/2826 [4:09:15<28:02, 5.95s/it]
90%|βββββββββ | 2544/2826 [4:09:22<29:02, 6.18s/it]
90%|βββββββββ | 2545/2826 [4:09:28<28:26, 6.07s/it]
90%|βββββββββ | 2546/2826 [4:09:33<26:58, 5.78s/it]
90%|βββββββββ | 2547/2826 [4:09:39<27:19, 5.88s/it]
90%|βββββββββ | 2548/2826 [4:09:44<26:22, 5.69s/it]
90%|βββββββββ | 2549/2826 [4:09:51<27:42, 6.00s/it]
90%|βββββββββ | 2550/2826 [4:09:57<27:41, 6.02s/it]
{'loss': 0.1893, 'grad_norm': 1.2738044261932373, 'learning_rate': 1.449553830958053e-07, 'epoch': 2.7} |
|
90%|βββββββββ | 2550/2826 [4:09:57<27:41, 6.02s/it]
90%|βββββββββ | 2551/2826 [4:10:02<26:16, 5.73s/it]
90%|βββββββββ | 2552/2826 [4:10:07<25:27, 5.58s/it]
90%|βββββββββ | 2553/2826 [4:10:14<26:58, 5.93s/it]
90%|βββββββββ | 2554/2826 [4:10:20<26:41, 5.89s/it]
90%|βββββββββ | 2555/2826 [4:10:26<26:46, 5.93s/it]
90%|βββββββββ | 2556/2826 [4:10:33<28:20, 6.30s/it]
90%|βββββββββ | 2557/2826 [4:10:38<27:09, 6.06s/it]
91%|βββββββββ | 2558/2826 [4:10:44<27:07, 6.07s/it]
91%|βββββββββ | 2559/2826 [4:10:50<26:45, 6.01s/it]
91%|βββββββββ | 2560/2826 [4:10:56<26:51, 6.06s/it]
{'loss': 0.1947, 'grad_norm': 1.8912787437438965, 'learning_rate': 1.347715971941746e-07, 'epoch': 2.72} |
|
91%|βββββββββ | 2560/2826 [4:10:56<26:51, 6.06s/it]
91%|βββββββββ | 2561/2826 [4:11:02<25:46, 5.83s/it]
91%|βββββββββ | 2562/2826 [4:11:08<26:50, 6.10s/it]
91%|βββββββββ | 2563/2826 [4:11:14<25:27, 5.81s/it]
91%|βββββββββ | 2564/2826 [4:11:19<25:23, 5.82s/it]
91%|βββββββββ | 2565/2826 [4:11:25<24:52, 5.72s/it]
91%|βββββββββ | 2566/2826 [4:11:30<23:57, 5.53s/it]
91%|βββββββββ | 2567/2826 [4:11:36<24:41, 5.72s/it]
91%|βββββββββ | 2568/2826 [4:11:43<26:07, 6.07s/it]
91%|βββββββββ | 2569/2826 [4:11:49<26:27, 6.18s/it]
91%|βββββββββ | 2570/2826 [4:11:56<26:21, 6.18s/it]
{'loss': 0.1744, 'grad_norm': 1.8385730981826782, 'learning_rate': 1.2494878418310234e-07, 'epoch': 2.73} |
|
91%|βββββββββ | 2570/2826 [4:11:56<26:21, 6.18s/it]
91%|βββββββββ | 2571/2826 [4:12:01<24:59, 5.88s/it]
91%|βββββββββ | 2572/2826 [4:12:06<24:31, 5.79s/it]
91%|βββββββββ | 2573/2826 [4:12:13<24:48, 5.89s/it]
91%|βββββββββ | 2574/2826 [4:12:18<24:31, 5.84s/it]
91%|βββββββββ | 2575/2826 [4:12:24<23:53, 5.71s/it]
91%|βββββββββ | 2576/2826 [4:12:30<24:08, 5.79s/it]
91%|βββββββββ | 2577/2826 [4:12:36<24:20, 5.86s/it]
91%|βββββββββ | 2578/2826 [4:12:42<24:27, 5.92s/it]
91%|ββββββββββ| 2579/2826 [4:12:48<24:21, 5.92s/it]
91%|ββββββββββ| 2580/2826 [4:12:55<25:23, 6.19s/it]
{'loss': 0.2351, 'grad_norm': 2.1071712970733643, 'learning_rate': 1.1548844318597208e-07, 'epoch': 2.74} |
|
91%|ββββββββββ| 2580/2826 [4:12:55<25:23, 6.19s/it]
91%|ββββββββββ| 2581/2826 [4:13:01<25:12, 6.17s/it]
91%|ββββββββββ| 2582/2826 [4:13:06<23:43, 5.83s/it]
91%|ββββββββββ| 2583/2826 [4:13:12<23:43, 5.86s/it]
91%|ββββββββββ| 2584/2826 [4:13:17<22:43, 5.64s/it]
91%|ββββββββββ| 2585/2826 [4:13:22<22:01, 5.48s/it]
92%|ββββββββββ| 2586/2826 [4:13:29<23:29, 5.87s/it]
92%|ββββββββββ| 2587/2826 [4:13:34<22:38, 5.68s/it]
92%|ββββββββββ| 2588/2826 [4:13:39<21:40, 5.46s/it]
92%|ββββββββββ| 2589/2826 [4:13:46<23:18, 5.90s/it]
92%|ββββββββββ| 2590/2826 [4:13:51<22:32, 5.73s/it]
{'loss': 0.2245, 'grad_norm': 2.054392099380493, 'learning_rate': 1.0639201800695553e-07, 'epoch': 2.75} |
|
92%|ββββββββββ| 2590/2826 [4:13:51<22:32, 5.73s/it]
92%|ββββββββββ| 2591/2826 [4:13:56<22:01, 5.62s/it]
92%|ββββββββββ| 2592/2826 [4:14:03<23:37, 6.06s/it]
92%|ββββββββββ| 2593/2826 [4:14:10<23:44, 6.11s/it]
92%|ββββββββββ| 2594/2826 [4:14:15<22:38, 5.86s/it]
92%|ββββββββββ| 2595/2826 [4:14:20<21:35, 5.61s/it]
92%|ββββββββββ| 2596/2826 [4:14:26<21:55, 5.72s/it]
92%|ββββββββββ| 2597/2826 [4:14:31<21:10, 5.55s/it]
92%|ββββββββββ| 2598/2826 [4:14:37<21:32, 5.67s/it]
92%|ββββββββββ| 2599/2826 [4:14:43<21:36, 5.71s/it]
92%|ββββββββββ| 2600/2826 [4:14:49<21:48, 5.79s/it]
{'loss': 0.2014, 'grad_norm': 1.656562328338623, 'learning_rate': 9.76608969106646e-08, 'epoch': 2.76} |
|
92%|ββββββββββ| 2600/2826 [4:14:49<21:48, 5.79s/it]
92%|ββββββββββ| 2601/2826 [4:14:55<22:14, 5.93s/it]
92%|ββββββββββ| 2602/2826 [4:15:01<21:51, 5.85s/it]
92%|ββββββββββ| 2603/2826 [4:15:06<21:26, 5.77s/it]
92%|ββββββββββ| 2604/2826 [4:15:14<22:49, 6.17s/it]
92%|ββββββββββ| 2605/2826 [4:15:19<21:31, 5.84s/it]
92%|ββββββββββ| 2606/2826 [4:15:24<21:04, 5.75s/it]
92%|ββββββββββ| 2607/2826 [4:15:30<20:49, 5.71s/it]
92%|ββββββββββ| 2608/2826 [4:15:35<20:36, 5.67s/it]
92%|ββββββββββ| 2609/2826 [4:15:42<21:47, 6.02s/it]
92%|ββββββββββ| 2610/2826 [4:15:47<20:42, 5.75s/it]
{'loss': 0.1824, 'grad_norm': 2.6887638568878174, 'learning_rate': 8.929641241027937e-08, 'epoch': 2.77} |
|
92%|ββββββββββ| 2610/2826 [4:15:47<20:42, 5.75s/it]
92%|ββββββββββ| 2611/2826 [4:15:53<20:26, 5.71s/it]
92%|ββββββββββ| 2612/2826 [4:15:58<20:04, 5.63s/it]
92%|ββββββββββ| 2613/2826 [4:16:05<21:18, 6.00s/it]
92%|ββββββββββ| 2614/2826 [4:16:10<20:12, 5.72s/it]
93%|ββββββββββ| 2615/2826 [4:16:15<19:34, 5.57s/it]
93%|ββββββββββ| 2616/2826 [4:16:22<20:17, 5.80s/it]
93%|ββββββββββ| 2617/2826 [4:16:29<21:27, 6.16s/it]
93%|ββββββββββ| 2618/2826 [4:16:34<20:18, 5.86s/it]
93%|ββββββββββ| 2619/2826 [4:16:39<19:32, 5.66s/it]
93%|ββββββββββ| 2620/2826 [4:16:45<20:03, 5.84s/it]
{'loss': 0.1706, 'grad_norm': 2.4606659412384033, 'learning_rate': 8.129984106418354e-08, 'epoch': 2.78} |
|
93%|ββββββββββ| 2620/2826 [4:16:45<20:03, 5.84s/it]
93%|ββββββββββ| 2621/2826 [4:16:51<19:10, 5.61s/it]
93%|ββββββββββ| 2622/2826 [4:16:56<18:29, 5.44s/it]
93%|ββββββββββ| 2623/2826 [4:17:02<19:40, 5.81s/it]
93%|ββββββββββ| 2624/2826 [4:17:08<19:46, 5.88s/it]
93%|ββββββββββ| 2625/2826 [4:17:14<19:08, 5.72s/it]
93%|ββββββββββ| 2626/2826 [4:17:19<19:02, 5.71s/it]
93%|ββββββββββ| 2627/2826 [4:17:25<19:14, 5.80s/it]
93%|ββββββββββ| 2628/2826 [4:17:33<20:31, 6.22s/it]
93%|ββββββββββ| 2629/2826 [4:17:39<20:35, 6.27s/it]
93%|ββββββββββ| 2630/2826 [4:17:44<19:21, 5.93s/it]
{'loss': 0.2195, 'grad_norm': 2.5548455715179443, 'learning_rate': 7.3672403281142e-08, 'epoch': 2.79} |
|
93%|ββββββββββ| 2630/2826 [4:17:44<19:21, 5.93s/it]
93%|ββββββββββ| 2631/2826 [4:17:50<19:21, 5.96s/it]
93%|ββββββββββ| 2632/2826 [4:17:56<19:21, 5.99s/it]
93%|ββββββββββ| 2633/2826 [4:18:03<20:08, 6.26s/it]
93%|ββββββββββ| 2634/2826 [4:18:09<19:32, 6.11s/it]
93%|ββββββββββ| 2635/2826 [4:18:14<18:40, 5.87s/it]
93%|ββββββββββ| 2636/2826 [4:18:20<18:22, 5.80s/it]
93%|ββββββββββ| 2637/2826 [4:18:25<17:29, 5.55s/it]
93%|ββββββββββ| 2638/2826 [4:18:30<17:29, 5.58s/it]
93%|ββββββββββ| 2639/2826 [4:18:36<17:08, 5.50s/it]
93%|ββββββββββ| 2640/2826 [4:18:42<18:07, 5.85s/it]
{'loss': 0.1748, 'grad_norm': 1.7952167987823486, 'learning_rate': 6.641526313404534e-08, 'epoch': 2.8} |
|
93%|ββββββββββ| 2640/2826 [4:18:42<18:07, 5.85s/it]
93%|ββββββββββ| 2641/2826 [4:18:47<17:17, 5.61s/it]
93%|ββββββββββ| 2642/2826 [4:18:53<17:28, 5.70s/it]
94%|ββββββββββ| 2643/2826 [4:18:59<17:45, 5.82s/it]
94%|ββββββββββ| 2644/2826 [4:19:06<18:19, 6.04s/it]
94%|ββββββββββ| 2645/2826 [4:19:11<17:25, 5.78s/it]
94%|ββββββββββ| 2646/2826 [4:19:16<16:43, 5.57s/it]
94%|ββββββββββ| 2647/2826 [4:19:21<16:19, 5.47s/it]
94%|ββββββββββ| 2648/2826 [4:19:27<16:34, 5.59s/it]
94%|ββββββββββ| 2649/2826 [4:19:33<16:24, 5.56s/it]
94%|ββββββββββ| 2650/2826 [4:19:40<17:37, 6.01s/it]
{'loss': 0.2061, 'grad_norm': 2.376830816268921, 'learning_rate': 5.952952818225416e-08, 'epoch': 2.81} |
|
94%|ββββββββββ| 2650/2826 [4:19:40<17:37, 6.01s/it]
94%|ββββββββββ| 2651/2826 [4:19:45<16:43, 5.73s/it]
94%|ββββββββββ| 2652/2826 [4:19:51<16:39, 5.74s/it]
94%|ββββββββββ| 2653/2826 [4:19:56<16:36, 5.76s/it]
94%|ββββββββββ| 2654/2826 [4:20:02<16:01, 5.59s/it]
94%|ββββββββββ| 2655/2826 [4:20:07<15:57, 5.60s/it]
94%|ββββββββββ| 2656/2826 [4:20:14<16:59, 5.99s/it]
94%|ββββββββββ| 2657/2826 [4:20:20<16:48, 5.97s/it]
94%|ββββββββββ| 2658/2826 [4:20:26<16:15, 5.81s/it]
94%|ββββββββββ| 2659/2826 [4:20:31<15:27, 5.55s/it]
94%|ββββββββββ| 2660/2826 [4:20:36<15:22, 5.56s/it]
{'loss': 0.1742, 'grad_norm': 1.7183632850646973, 'learning_rate': 5.3016249302565436e-08, 'epoch': 2.82} |
|
94%|ββββββββββ| 2660/2826 [4:20:36<15:22, 5.56s/it]
94%|ββββββββββ| 2661/2826 [4:20:41<14:54, 5.42s/it]
94%|ββββββββββ| 2662/2826 [4:20:48<15:59, 5.85s/it]
94%|ββββββββββ| 2663/2826 [4:20:54<15:47, 5.81s/it]
94%|ββββββββββ| 2664/2826 [4:21:00<16:05, 5.96s/it]
94%|ββββββββββ| 2665/2826 [4:21:06<15:54, 5.93s/it]
94%|ββββββββββ| 2666/2826 [4:21:12<15:59, 6.00s/it]
94%|ββββββββββ| 2667/2826 [4:21:18<15:26, 5.83s/it]
94%|ββββββββββ| 2668/2826 [4:21:24<16:07, 6.13s/it]
94%|ββββββββββ| 2669/2826 [4:21:30<15:40, 5.99s/it]
94%|ββββββββββ| 2670/2826 [4:21:35<14:55, 5.74s/it]
{'loss': 0.2082, 'grad_norm': 2.11011004447937, 'learning_rate': 4.6876420528833014e-08, 'epoch': 2.83} |
|
94%|ββββββββββ| 2670/2826 [4:21:35<14:55, 5.74s/it]
95%|ββββββββββ| 2671/2826 [4:21:41<14:59, 5.80s/it]
95%|ββββββββββ| 2672/2826 [4:21:46<14:25, 5.62s/it]
95%|ββββββββββ| 2673/2826 [4:21:52<14:39, 5.75s/it]
95%|ββββββββββ| 2674/2826 [4:21:59<15:02, 5.94s/it]
95%|ββββββββββ| 2675/2826 [4:22:05<15:10, 6.03s/it]
95%|ββββββββββ| 2676/2826 [4:22:13<16:34, 6.63s/it]
95%|ββββββββββ| 2677/2826 [4:22:18<15:22, 6.19s/it]
95%|ββββββββββ| 2678/2826 [4:22:24<14:48, 6.00s/it]
95%|ββββββββββ| 2679/2826 [4:22:30<15:03, 6.14s/it]
95%|ββββββββββ| 2680/2826 [4:22:35<14:11, 5.83s/it]
{'loss': 0.1805, 'grad_norm': 1.8799868822097778, 'learning_rate': 4.111097890026089e-08, 'epoch': 2.84} |
|
95%|ββββββββββ| 2680/2826 [4:22:35<14:11, 5.83s/it]
95%|ββββββββββ| 2681/2826 [4:22:40<13:30, 5.59s/it]
95%|ββββββββββ| 2682/2826 [4:22:46<13:19, 5.55s/it]
95%|ββββββββββ| 2683/2826 [4:22:51<13:01, 5.46s/it]
95%|ββββββββββ| 2684/2826 [4:22:57<13:24, 5.66s/it]
95%|ββββββββββ| 2685/2826 [4:23:04<14:13, 6.06s/it]
95%|ββββββββββ| 2686/2826 [4:23:10<13:57, 5.98s/it]
95%|ββββββββββ| 2687/2826 [4:23:17<14:17, 6.17s/it]
95%|ββββββββββ| 2688/2826 [4:23:22<13:59, 6.08s/it]
95%|ββββββββββ| 2689/2826 [4:23:29<13:54, 6.09s/it]
95%|ββββββββββ| 2690/2826 [4:23:34<13:18, 5.87s/it]
{'loss': 0.2058, 'grad_norm': 2.5171291828155518, 'learning_rate': 3.5720804318395976e-08, 'epoch': 2.85} |
|
95%|ββββββββββ| 2690/2826 [4:23:34<13:18, 5.87s/it]
95%|ββββββββββ| 2691/2826 [4:23:39<12:41, 5.64s/it]
95%|ββββββββββ| 2692/2826 [4:23:45<12:54, 5.78s/it]
95%|ββββββββββ| 2693/2826 [4:23:51<13:09, 5.93s/it]
95%|ββββββββββ| 2694/2826 [4:23:59<13:56, 6.34s/it]
95%|ββββββββββ| 2695/2826 [4:24:04<13:13, 6.05s/it]
95%|ββββββββββ| 2696/2826 [4:24:10<12:43, 5.88s/it]
95%|ββββββββββ| 2697/2826 [4:24:16<12:52, 5.99s/it]
95%|ββββββββββ| 2698/2826 [4:24:21<12:31, 5.87s/it]
96%|ββββββββββ| 2699/2826 [4:24:27<12:23, 5.86s/it]
96%|ββββββββββ| 2700/2826 [4:24:32<11:51, 5.65s/it]
{'loss': 0.2027, 'grad_norm': 2.142263650894165, 'learning_rate': 3.0706719412839926e-08, 'epoch': 2.86} |
|
96%|ββββββββββ| 2700/2826 [4:24:32<11:51, 5.65s/it]
96%|ββββββββββ| 2701/2826 [4:24:39<12:09, 5.84s/it]
96%|ββββββββββ| 2702/2826 [4:24:44<11:38, 5.64s/it]
96%|ββββββββββ| 2703/2826 [4:24:49<11:25, 5.58s/it]
96%|ββββββββββ| 2704/2826 [4:24:55<11:23, 5.60s/it]
96%|ββββββββββ| 2705/2826 [4:25:02<11:59, 5.94s/it]
96%|ββββββββββ| 2706/2826 [4:25:08<12:19, 6.17s/it]
96%|ββββββββββ| 2707/2826 [4:25:14<11:53, 6.00s/it]
96%|ββββββββββ| 2708/2826 [4:25:19<11:30, 5.86s/it]
96%|ββββββββββ| 2709/2826 [4:25:26<11:55, 6.12s/it]
96%|ββββββββββ| 2710/2826 [4:25:32<11:20, 5.87s/it]
{'loss': 0.1941, 'grad_norm': 2.2124040126800537, 'learning_rate': 2.6069489415703197e-08, 'epoch': 2.87} |
|
96%|ββββββββββ| 2710/2826 [4:25:32<11:20, 5.87s/it]
96%|ββββββββββ| 2711/2826 [4:25:38<11:24, 5.95s/it]
96%|ββββββββββ| 2712/2826 [4:25:43<10:59, 5.78s/it]
96%|ββββββββββ| 2713/2826 [4:25:49<10:52, 5.78s/it]
96%|ββββββββββ| 2714/2826 [4:25:56<11:48, 6.32s/it]
96%|ββββββββββ| 2715/2826 [4:26:03<11:52, 6.42s/it]
96%|ββββββββββ| 2716/2826 [4:26:09<11:15, 6.14s/it]
96%|ββββββββββ| 2717/2826 [4:26:14<10:35, 5.83s/it]
96%|ββββββββββ| 2718/2826 [4:26:19<10:11, 5.66s/it]
96%|ββββββββββ| 2719/2826 [4:26:24<09:45, 5.48s/it]
96%|ββββββββββ| 2720/2826 [4:26:30<10:07, 5.73s/it]
{'loss': 0.2029, 'grad_norm': 2.033259153366089, 'learning_rate': 2.18098220448168e-08, 'epoch': 2.88} |
|
96%|ββββββββββ| 2720/2826 [4:26:30<10:07, 5.73s/it]
96%|ββββββββββ| 2721/2826 [4:26:35<09:43, 5.56s/it]
96%|ββββββββββ| 2722/2826 [4:26:43<10:46, 6.22s/it]
96%|ββββββββββ| 2723/2826 [4:26:48<10:03, 5.86s/it]
96%|ββββββββββ| 2724/2826 [4:26:54<09:49, 5.78s/it]
96%|ββββββββββ| 2725/2826 [4:26:59<09:22, 5.57s/it]
96%|ββββββββββ| 2726/2826 [4:27:04<09:15, 5.56s/it]
96%|ββββββββββ| 2727/2826 [4:27:10<09:25, 5.71s/it]
97%|ββββββββββ| 2728/2826 [4:27:17<09:38, 5.90s/it]
97%|ββββββββββ| 2729/2826 [4:27:22<09:10, 5.67s/it]
97%|ββββββββββ| 2730/2826 [4:27:27<08:54, 5.56s/it]
{'loss': 0.2062, 'grad_norm': 2.416912794113159, 'learning_rate': 1.7928367395725066e-08, 'epoch': 2.9} |
|
97%|ββββββββββ| 2730/2826 [4:27:27<08:54, 5.56s/it]
97%|ββββββββββ| 2731/2826 [4:27:33<08:47, 5.55s/it]
97%|ββββββββββ| 2732/2826 [4:27:39<08:47, 5.61s/it]
97%|ββββββββββ| 2733/2826 [4:27:46<09:34, 6.17s/it]
97%|ββββββββββ| 2734/2826 [4:27:53<09:58, 6.50s/it]
97%|ββββββββββ| 2735/2826 [4:27:58<09:12, 6.08s/it]
97%|ββββββββββ| 2736/2826 [4:28:05<09:17, 6.19s/it]
97%|ββββββββββ| 2737/2826 [4:28:11<09:02, 6.10s/it]
97%|ββββββββββ| 2738/2826 [4:28:16<08:45, 5.97s/it]
97%|ββββββββββ| 2739/2826 [4:28:21<08:14, 5.69s/it]
97%|ββββββββββ| 2740/2826 [4:28:27<08:14, 5.75s/it]
{'loss': 0.1873, 'grad_norm': 2.193751096725464, 'learning_rate': 1.442571784246699e-08, 'epoch': 2.91} |
|
97%|ββββββββββ| 2740/2826 [4:28:27<08:14, 5.75s/it]
97%|ββββββββββ| 2741/2826 [4:28:35<08:48, 6.22s/it]
97%|ββββββββββ| 2742/2826 [4:28:42<09:04, 6.48s/it]
97%|ββββββββββ| 2743/2826 [4:28:47<08:25, 6.09s/it]
97%|ββββββββββ| 2744/2826 [4:28:53<08:24, 6.16s/it]
97%|ββββββββββ| 2745/2826 [4:28:59<07:59, 5.92s/it]
97%|ββββββββββ| 2746/2826 [4:29:04<07:34, 5.68s/it]
97%|ββββββββββ| 2747/2826 [4:29:10<07:53, 5.99s/it]
97%|ββββββββββ| 2748/2826 [4:29:16<07:27, 5.74s/it]
97%|ββββββββββ| 2749/2826 [4:29:21<07:14, 5.65s/it]
97%|ββββββββββ| 2750/2826 [4:29:27<07:19, 5.78s/it]
{'loss': 0.1653, 'grad_norm': 1.5729731321334839, 'learning_rate': 1.1302407947173522e-08, 'epoch': 2.92} |
|
97%|ββββββββββ| 2750/2826 [4:29:27<07:19, 5.78s/it]
97%|ββββββββββ| 2751/2826 [4:29:33<07:26, 5.96s/it]
97%|ββββββββββ| 2752/2826 [4:29:40<07:35, 6.16s/it]
97%|ββββββββββ| 2753/2826 [4:29:47<07:56, 6.53s/it]
97%|ββββββββββ| 2754/2826 [4:29:55<08:02, 6.70s/it]
97%|ββββββββββ| 2755/2826 [4:30:00<07:26, 6.28s/it]
98%|ββββββββββ| 2756/2826 [4:30:06<07:10, 6.15s/it]
98%|ββββββββββ| 2757/2826 [4:30:11<06:44, 5.86s/it]
98%|ββββββββββ| 2758/2826 [4:30:16<06:21, 5.62s/it]
98%|ββββββββββ| 2759/2826 [4:30:21<06:13, 5.57s/it]
98%|ββββββββββ| 2760/2826 [4:30:28<06:28, 5.89s/it]
{'loss': 0.1743, 'grad_norm': 1.7562044858932495, 'learning_rate': 8.558914378481996e-09, 'epoch': 2.93} |
|
98%|ββββββββββ| 2760/2826 [4:30:28<06:28, 5.89s/it]
98%|ββββββββββ| 2761/2826 [4:30:34<06:20, 5.86s/it]
98%|ββββββββββ| 2762/2826 [4:30:39<06:00, 5.63s/it]
98%|ββββββββββ| 2763/2826 [4:30:44<05:47, 5.52s/it]
98%|ββββββββββ| 2764/2826 [4:30:49<05:32, 5.36s/it]
98%|ββββββββββ| 2765/2826 [4:30:56<06:00, 5.91s/it]
98%|ββββββββββ| 2766/2826 [4:31:03<06:07, 6.12s/it]
98%|ββββββββββ| 2767/2826 [4:31:09<06:06, 6.21s/it]
98%|ββββββββββ| 2768/2826 [4:31:15<05:54, 6.12s/it]
98%|ββββββββββ| 2769/2826 [4:31:21<05:36, 5.90s/it]
98%|ββββββββββ| 2770/2826 [4:31:26<05:24, 5.80s/it]
{'loss': 0.1821, 'grad_norm': 2.183967351913452, 'learning_rate': 6.195655838790726e-09, 'epoch': 2.94} |
|
98%|ββββββββββ| 2770/2826 [4:31:26<05:24, 5.80s/it]
98%|ββββββββββ| 2771/2826 [4:31:32<05:14, 5.73s/it]
98%|ββββββββββ| 2772/2826 [4:31:37<05:00, 5.56s/it]
98%|ββββββββββ| 2773/2826 [4:31:44<05:12, 5.90s/it]
98%|ββββββββββ| 2774/2826 [4:31:49<05:02, 5.81s/it]
98%|ββββββββββ| 2775/2826 [4:31:55<04:49, 5.67s/it]
98%|ββββββββββ| 2776/2826 [4:32:00<04:35, 5.50s/it]
98%|ββββββββββ| 2777/2826 [4:32:07<04:50, 5.92s/it]
98%|ββββββββββ| 2778/2826 [4:32:14<05:03, 6.33s/it]
98%|ββββββββββ| 2779/2826 [4:32:20<04:51, 6.20s/it]
98%|ββββββββββ| 2780/2826 [4:32:26<04:42, 6.13s/it]
{'loss': 0.1954, 'grad_norm': 1.9312433004379272, 'learning_rate': 4.212993000356491e-09, 'epoch': 2.95} |
|
98%|ββββββββββ| 2780/2826 [4:32:26<04:42, 6.13s/it]
98%|ββββββββββ| 2781/2826 [4:32:31<04:22, 5.84s/it]
98%|ββββββββββ| 2782/2826 [4:32:36<04:06, 5.61s/it]
98%|ββββββββββ| 2783/2826 [4:32:42<04:02, 5.64s/it]
99%|ββββββββββ| 2784/2826 [4:32:49<04:13, 6.03s/it]
99%|ββββββββββ| 2785/2826 [4:32:54<03:57, 5.79s/it]
99%|ββββββββββ| 2786/2826 [4:32:59<03:43, 5.58s/it]
99%|ββββββββββ| 2787/2826 [4:33:04<03:31, 5.42s/it]
99%|ββββββββββ| 2788/2826 [4:33:09<03:24, 5.38s/it]
99%|ββββββββββ| 2789/2826 [4:33:15<03:23, 5.50s/it]
99%|ββββββββββ| 2790/2826 [4:33:21<03:21, 5.61s/it]
{'loss': 0.1925, 'grad_norm': 2.2055087089538574, 'learning_rate': 2.611228450250802e-09, 'epoch': 2.96} |
|
99%|ββββββββββ| 2790/2826 [4:33:21<03:21, 5.61s/it]
99%|ββββββββββ| 2791/2826 [4:33:28<03:29, 6.00s/it]
99%|ββββββββββ| 2792/2826 [4:33:35<03:30, 6.19s/it]
99%|ββββββββββ| 2793/2826 [4:33:40<03:19, 6.06s/it]
99%|ββββββββββ| 2794/2826 [4:33:46<03:12, 6.01s/it]
99%|ββββββββββ| 2795/2826 [4:33:53<03:14, 6.28s/it]
99%|ββββββββββ| 2796/2826 [4:33:59<03:01, 6.06s/it]
99%|ββββββββββ| 2797/2826 [4:34:04<02:46, 5.75s/it]
99%|ββββββββββ| 2798/2826 [4:34:09<02:37, 5.62s/it]
99%|ββββββββββ| 2799/2826 [4:34:16<02:41, 5.97s/it]
99%|ββββββββββ| 2800/2826 [4:34:23<02:48, 6.49s/it]
{'loss': 0.1805, 'grad_norm': 1.6606404781341553, 'learning_rate': 1.3906066441798927e-09, 'epoch': 2.97} |
|
99%|ββββββββββ| 2800/2826 [4:34:23<02:48, 6.49s/it]
99%|ββββββββββ| 2801/2826 [4:34:30<02:45, 6.62s/it]
99%|ββββββββββ| 2802/2826 [4:34:36<02:29, 6.22s/it]
99%|ββββββββββ| 2803/2826 [4:34:41<02:19, 6.05s/it]
99%|ββββββββββ| 2804/2826 [4:34:47<02:10, 5.92s/it]
99%|ββββββββββ| 2805/2826 [4:34:53<02:04, 5.93s/it]
99%|ββββββββββ| 2806/2826 [4:34:59<02:02, 6.11s/it]
99%|ββββββββββ| 2807/2826 [4:35:05<01:51, 5.86s/it]
99%|ββββββββββ| 2808/2826 [4:35:10<01:41, 5.66s/it]
99%|ββββββββββ| 2809/2826 [4:35:15<01:34, 5.54s/it]
99%|ββββββββββ| 2810/2826 [4:35:21<01:30, 5.63s/it]
{'loss': 0.2084, 'grad_norm': 2.594404458999634, 'learning_rate': 5.513138691767839e-10, 'epoch': 2.98} |
|
99%|ββββββββββ| 2810/2826 [4:35:21<01:30, 5.63s/it]
99%|ββββββββββ| 2811/2826 [4:35:28<01:28, 5.93s/it]
100%|ββββββββββ| 2812/2826 [4:35:35<01:27, 6.26s/it]
100%|ββββββββββ| 2813/2826 [4:35:42<01:25, 6.55s/it]
100%|ββββββββββ| 2814/2826 [4:35:47<01:13, 6.16s/it]
100%|ββββββββββ| 2815/2826 [4:35:53<01:06, 6.01s/it]
100%|ββββββββββ| 2816/2826 [4:35:58<00:57, 5.75s/it]
100%|ββββββββββ| 2817/2826 [4:36:05<00:54, 6.09s/it]
100%|ββββββββββ| 2818/2826 [4:36:12<00:50, 6.37s/it]
100%|ββββββββββ| 2819/2826 [4:36:17<00:42, 6.13s/it]
100%|ββββββββββ| 2820/2826 [4:36:24<00:38, 6.37s/it]
{'loss': 0.2115, 'grad_norm': 2.007861375808716, 'learning_rate': 9.347821517069477e-11, 'epoch': 2.99} |
|
100%|ββββββββββ| 2820/2826 [4:36:24<00:38, 6.37s/it]
100%|ββββββββββ| 2821/2826 [4:36:31<00:33, 6.60s/it]
100%|ββββββββββ| 2822/2826 [4:36:37<00:24, 6.17s/it]
100%|ββββββββββ| 2823/2826 [4:36:42<00:17, 5.84s/it]
100%|ββββββββββ| 2824/2826 [4:36:48<00:11, 5.91s/it]
100%|ββββββββββ| 2825/2826 [4:36:54<00:05, 5.88s/it]
100%|██████████| 2826/2826 [4:36:59<00:00, 5.66s/it]
[INFO|trainer.py:3984] 2025-10-18 11:23:15,155 >> Saving model checkpoint to /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826
[INFO|configuration_utils.py:419] 2025-10-18 11:23:15,160 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/config.json
[INFO|configuration_utils.py:911] 2025-10-18 11:23:15,162 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/generation_config.json
[INFO|modeling_utils.py:3580] 2025-10-18 11:23:35,979 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2025-10-18 11:23:35,982 >> tokenizer config file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-10-18 11:23:35,983 >> Special tokens file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/special_tokens_map.json
[2025-10-18 11:23:36,183] [INFO] [logging.py:107:log_dist] [Rank 0] [Torch] Checkpoint global_step2825 is about to be saved!
[2025-10-18 11:23:36,615] [INFO] [logging.py:107:log_dist] [Rank 0] Saving model checkpoint: /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/zero_pp_rank_0_mp_rank_00_model_states.pt
[2025-10-18 11:23:36,615] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2025-10-18 11:23:36,633] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2025-10-18 11:23:36,637] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2025-10-18 11:23:55,201] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2025-10-18 11:23:55,202] [INFO] [engine.py:3701:_save_zero_checkpoint] zero checkpoint saved /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/checkpoint-2826/global_step2825/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2025-10-18 11:23:55,663] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step2825 is ready now!
[INFO|trainer.py:2681] 2025-10-18 11:23:55,685 >>

Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 16671.2674, 'train_samples_per_second': 2.713, 'train_steps_per_second': 0.17, 'train_loss': 0.34044326600333263, 'epoch': 3.0} |
|
100%|██████████| 2826/2826 [4:37:51<00:00, 5.90s/it]
[INFO|trainer.py:3984] 2025-10-18 11:24:06,471 >> Saving model checkpoint to /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17
[INFO|configuration_utils.py:419] 2025-10-18 11:24:06,477 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/config.json
[INFO|configuration_utils.py:911] 2025-10-18 11:24:06,480 >> Configuration saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/generation_config.json
[INFO|modeling_utils.py:3580] 2025-10-18 11:24:26,439 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2510] 2025-10-18 11:24:26,442 >> tokenizer config file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-10-18 11:24:26,443 >> Special tokens file saved in /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/special_tokens_map.json
***** train metrics *****
  epoch                    =     2.9973
  total_flos               = 101656586GF
  train_loss               =     0.3404
  train_runtime            = 4:37:51.26
  train_samples_per_second =      2.713
  train_steps_per_second   =       0.17
Figure saved at: /mmu_nlp_hdd/dongguanting/checkpoint_tool_light/method7_qwen2.5-7b-instruct-llama-factory-sft-edition17/training_loss.png
[WARNING|2025-10-18 11:24:27] llamafactory.extras.ploting:148 >> No metric eval_loss to plot.
[WARNING|2025-10-18 11:24:27] llamafactory.extras.ploting:148 >> No metric eval_accuracy to plot.
[INFO|modelcard.py:450] 2025-10-18 11:24:27,224 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
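The per-interval metrics dicts above (and the final `train_runtime` summary) are printed by the HF Trainer as Python-literal dicts between the tqdm progress lines, so the loss curve can be recovered from a saved log. A minimal sketch, assuming the log text follows the exact `{'loss': ...}` format shown in this run (the regex and helper name are illustrative, not part of any library API):

```python
import ast
import re

# Matches the flat metrics dicts the Trainer logs, e.g.
# {'loss': 0.1796, 'grad_norm': 2.05..., 'learning_rate': 1.55e-07, 'epoch': 2.69}
LOG_DICT = re.compile(r"\{'loss':[^}]*\}")

def parse_loss_records(log_text):
    """Collect every logged metrics dict from raw trainer console output."""
    return [ast.literal_eval(m.group(0)) for m in LOG_DICT.finditer(log_text)]

sample = (
    "90%|#########  | 2540/2826 [4:08:57<26:51, 5.63s/it]\n"
    "{'loss': 0.1796, 'grad_norm': 2.050480842590332, "
    "'learning_rate': 1.5549858767419018e-07, 'epoch': 2.69}\n"
)
records = parse_loss_records(sample)
print(records[0]["loss"])  # 0.1796
```

The `(step, loss)` pairs extracted this way are what a plot like the `training_loss.png` mentioned above would be built from.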