diff --git "a/logs/20250526_191827/train.log" "b/logs/20250526_191827/train.log"
deleted file mode 100644
--- "a/logs/20250526_191827/train.log"
+++ /dev/null
@@ -1,2057 +0,0 @@
-2025-05-26 19:18:47,642 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_321e0871e56ca1df.zip.
-2025-05-26 19:18:47,643 INFO packaging.py:575 -- Creating a file package for local module '/mnt/petrelfs/luyiting/MultiAgentEval/lmm-r1'.
-2025-05-26 19:18:46,714 INFO cli.py:39 -- Job submission server address: http://127.0.0.1:2983
-2025-05-26 19:18:53,403 SUCC cli.py:63 -- -------------------------------------------------------
-2025-05-26 19:18:53,403 SUCC cli.py:64 -- Job 'raysubmit_YRVyrdpJQsux5E4C' submitted successfully
-2025-05-26 19:18:53,403 SUCC cli.py:65 -- -------------------------------------------------------
-2025-05-26 19:18:53,403 INFO cli.py:289 -- Next steps
-2025-05-26 19:18:53,403 INFO cli.py:290 -- Query the logs of the job:
-2025-05-26 19:18:53,403 INFO cli.py:292 -- ray job logs raysubmit_YRVyrdpJQsux5E4C
-2025-05-26 19:18:53,403 INFO cli.py:294 -- Query the status of the job:
-2025-05-26 19:18:53,403 INFO cli.py:296 -- ray job status raysubmit_YRVyrdpJQsux5E4C
-2025-05-26 19:18:53,403 INFO cli.py:298 -- Request the job to be stopped:
-2025-05-26 19:18:53,404 INFO cli.py:300 -- ray job stop raysubmit_YRVyrdpJQsux5E4C
-2025-05-26 19:18:53,406 INFO cli.py:307 -- Tailing logs until the job exits (disable with --no-wait):
-2025-05-26 19:18:52,847 INFO job_manager.py:531 -- Runtime env is setting up.
-[2025-05-26 19:19:13,190] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
-INFO 05-26 19:19:17 [__init__.py:239] Automatically detected platform cuda.
-2025-05-26 19:19:18,557 INFO worker.py:1520 -- Using address 10.140.1.87:6231 set in the environment variable RAY_ADDRESS
-2025-05-26 19:19:18,559 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.140.1.87:6231...
-2025-05-26 19:19:18,580 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at 10.140.1.87:2983
-(pid=89991) INFO 05-26 19:19:38 [__init__.py:239] Automatically detected platform cuda.
-(LLMRayActor pid=89992) INFO 05-26 19:20:05 [config.py:585] This model supports multiple tasks: {'score', 'reward', 'classify', 'embed', 'generate'}. Defaulting to 'generate'.
-(pid=89985) INFO 05-26 19:19:38 [__init__.py:239] Automatically detected platform cuda. [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
-(LLMRayActor pid=89991) INFO 05-26 19:20:05 [config.py:585] This model supports multiple tasks: {'reward', 'generate', 'embed', 'classify', 'score'}. Defaulting to 'generate'.
-(LLMRayActor pid=89991) WARNING 05-26 19:20:05 [arg_utils.py:1846] VLLM_ATTENTION_BACKEND=triton is not supported by the V1 Engine. Falling back to V0. We recommend to remove VLLM_ATTENTION_BACKEND=triton from your config in favor of the V1 Engine.
-(LLMRayActor pid=89991) WARNING 05-26 19:20:05 [arg_utils.py:1745] --enable-prefix-caching is not supported for multimodal models in V0 and has been disabled.
-(LLMRayActor pid=89991) INFO 05-26 19:20:05 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2.dev76+gf68cce8) with config: model='/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/', speculative_config=None, tokenizer='/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=42, served_model_name=/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
-(LLMRayActor pid=89988) INFO 05-26 19:20:05 [config.py:585] This model supports multiple tasks: {'embed', 'generate', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
-(LLMRayActor pid=89993) INFO 05-26 19:20:05 [config.py:585] This model supports multiple tasks: {'score', 'reward', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
-(LLMRayActor pid=89986) INFO 05-26 19:20:05 [config.py:585] This model supports multiple tasks: {'classify', 'generate', 'score', 'embed', 'reward'}. Defaulting to 'generate'.
-(LLMRayActor pid=89989) INFO 05-26 19:20:05 [config.py:585] This model supports multiple tasks: {'generate', 'reward', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
-(LLMRayActor pid=89990) INFO 05-26 19:20:05 [config.py:585] This model supports multiple tasks: {'reward', 'classify', 'embed', 'generate', 'score'}. Defaulting to 'generate'.
-(LLMRayActor pid=89985) INFO 05-26 19:20:05 [config.py:585] This model supports multiple tasks: {'classify', 'embed', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
-(LLMRayActor pid=89991) [2025-05-26 19:20:08,722] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
-(LLMRayActor pid=89991) INFO 05-26 19:20:13 [cuda.py:293] Using Flash Attention backend.
-(LLMRayActor pid=89985) WARNING 05-26 19:20:05 [arg_utils.py:1846] VLLM_ATTENTION_BACKEND=triton is not supported by the V1 Engine. Falling back to V0. We recommend to remove VLLM_ATTENTION_BACKEND=triton from your config in favor of the V1 Engine. [repeated 7x across cluster]
-(LLMRayActor pid=89985) WARNING 05-26 19:20:05 [arg_utils.py:1745] --enable-prefix-caching is not supported for multimodal models in V0 and has been disabled. [repeated 7x across cluster]
-(LLMRayActor pid=89985) INFO 05-26 19:20:05 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2.dev76+gf68cce8) with config: model='/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/', speculative_config=None, tokenizer='/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=49, served_model_name=/mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, [repeated 7x across cluster]
-(LLMRayActor pid=89988) INFO 05-26 19:20:16 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
-(LLMRayActor pid=89988) INFO 05-26 19:20:16 [model_runner.py:1110] Starting to load model /mnt/petrelfs/luyiting/ckt/Qwen2.5-VL-7B-Instruct/Qwen2.5-VL-7B-Instruct/...
-(LLMRayActor pid=89985) [2025-05-26 19:20:08,723] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 7x across cluster]
-(LLMRayActor pid=89988) INFO 05-26 19:20:16 [config.py:3229] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
-(LLMRayActor pid=89990)
-Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,377] [INFO] [logging.py:128:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,377] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,607] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,607] [INFO] [utils.py:782:see_memory_usage] MA 1.94 GB Max_MA 3.98 GB CA 4.04 GB Max_CA 4 GB
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,608] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 469.96 GB, percent = 46.7%
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,611] [INFO] [stage3.py:170:__init__] Reduce bucket size 500000000
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,611] [INFO] [stage3.py:171:__init__] Prefetch bucket size 50000000
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,839] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,840] [INFO] [utils.py:782:see_memory_usage] MA 1.94 GB Max_MA 1.94 GB CA 4.04 GB Max_CA 4 GB
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,841] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 469.95 GB, percent = 46.7%
-(ActorModelRayActor pid=100959) Parameter Offload: Total persistent parameters: 848896 in 368 params
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:15,090] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:15,090] [INFO] [utils.py:782:see_memory_usage] MA 1.94 GB Max_MA 1.94 GB CA 4.04 GB Max_CA 4 GB
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:15,091] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 469.96 GB, percent = 46.7%
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:15,310] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:15,311] [INFO] [utils.py:782:see_memory_usage] MA 1.94 GB Max_MA 1.94 GB CA 4.04 GB Max_CA 4 GB
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:15,312] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 469.96 GB, percent = 46.7%
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:17,718] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 2
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:17,719] [INFO] [utils.py:782:see_memory_usage] MA 1.93 GB Max_MA 1.94 GB CA 1.94 GB Max_CA 4 GB
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:17,719] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 473.35 GB, percent = 47.0%
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:17,939] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:17,940] [INFO] [utils.py:782:see_memory_usage] MA 1.93 GB Max_MA 1.93 GB CA 1.94 GB Max_CA 2 GB
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:17,941] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 476.15 GB, percent = 47.3%
-(ReferenceModelRayActor pid=101495)
-Loading checkpoint shards: 60%|██████ | 3/5 [00:20<00:12, 6.39s/it] [repeated 16x across cluster]
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:22,363] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:22,364] [INFO] [utils.py:782:see_memory_usage] MA 1.93 GB Max_MA 1.93 GB CA 1.94 GB Max_CA 2 GB
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:22,364] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 500.86 GB, percent = 49.7%
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:22,595] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:22,596] [INFO] [utils.py:782:see_memory_usage] MA 1.93 GB Max_MA 1.93 GB CA 1.94 GB Max_CA 2 GB
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:22,597] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 505.19 GB, percent = 50.2%
-(ActorModelRayActor pid=100959) in preprocess_data None False [repeated 26000x across cluster]
-(ActorModelRayActor pid=100959) [2025-05-26 19:23:14,318] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8 [repeated 7x across cluster]
-(ReferenceModelRayActor pid=103677)
-Loading checkpoint shards: 80%|████████ | 4/5 [00:25<00:06, 6.04s/it]
-(ReferenceModelRayActor pid=103679)
-Loading checkpoint shards: 80%|████████ | 4/5 [00:25<00:06, 6.04s/it]
-(ReferenceModelRayActor pid=103677)
-Loading checkpoint shards: 100%|██████████| 5/5 [00:25<00:00, 3.96s/it]
-Loading checkpoint shards: 100%|██████████| 5/5 [00:25<00:00, 5.16s/it]
-(ReferenceModelRayActor pid=101495) Actor(
-(ReferenceModelRayActor pid=101495) (model): Qwen2_5_VLForConditionalGeneration(
-(ReferenceModelRayActor pid=101495) (visual): Qwen2_5_VisionTransformerPretrainedModel(
-(ReferenceModelRayActor pid=101495) (patch_embed): Qwen2_5_VisionPatchEmbed(
-(ReferenceModelRayActor pid=101495) (proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
-(ReferenceModelRayActor pid=101495) )
-(ReferenceModelRayActor pid=101495) (rotary_pos_emb): Qwen2_5_VisionRotaryEmbedding()
-(ReferenceModelRayActor pid=101495) (blocks): ModuleList(
-(ReferenceModelRayActor pid=101495) (0-31): 32 x Qwen2_5_VLVisionBlock(
-(ReferenceModelRayActor pid=101495) (norm1): Qwen2RMSNorm((0,), eps=1e-06)
-(ReferenceModelRayActor pid=101495) (norm2): Qwen2RMSNorm((0,), eps=1e-06)
-(ReferenceModelRayActor pid=101495) (attn): Qwen2_5_VLVisionFlashAttention2(
-(ReferenceModelRayActor pid=101495) (qkv): Linear(in_features=1280, out_features=3840, bias=True)
-(ReferenceModelRayActor pid=101495) (proj): Linear(in_features=1280, out_features=1280, bias=True)
-(ReferenceModelRayActor pid=101495) )
-(ReferenceModelRayActor pid=101495) (mlp): Qwen2_5_VLMLP(
-(ReferenceModelRayActor pid=101495) (gate_proj): Linear(in_features=1280, out_features=3420, bias=True)
-(ReferenceModelRayActor pid=101495) (up_proj): Linear(in_features=1280, out_features=3420, bias=True)
-(ReferenceModelRayActor pid=101495) (down_proj): Linear(in_features=3420, out_features=1280, bias=True)
-(ReferenceModelRayActor pid=101495) (act_fn): SiLU()
-(ReferenceModelRayActor pid=101495) )
-(ReferenceModelRayActor pid=101495) )
-(ReferenceModelRayActor pid=101495) )
-(ReferenceModelRayActor pid=101495) (merger): Qwen2_5_VLPatchMerger(
-(ReferenceModelRayActor pid=101495) (ln_q): Qwen2RMSNorm((0,), eps=1e-06)
-(ReferenceModelRayActor pid=101495) (mlp): Sequential(
-(ReferenceModelRayActor pid=101495) (0): Linear(in_features=5120, out_features=5120, bias=True)
-(ReferenceModelRayActor pid=101495) (1): GELU(approximate='none')
-(ReferenceModelRayActor pid=101495) (2): Linear(in_features=5120, out_features=3584, bias=True)
-(ReferenceModelRayActor pid=101495) )
-(ReferenceModelRayActor pid=101495) )
-(ReferenceModelRayActor pid=101495) )
-(ReferenceModelRayActor pid=101495) (model): Qwen2_5_VLModel(
-(ReferenceModelRayActor pid=101495) (embed_tokens): Embedding(152064, 3584)
-(ReferenceModelRayActor pid=101495) (layers): ModuleList(
-(ReferenceModelRayActor pid=101495) (0-27): 28 x Qwen2_5_VLDecoderLayer(
-(ReferenceModelRayActor pid=101495) (self_attn): Qwen2_5_VLFlashAttention2(
-(ReferenceModelRayActor pid=101495) (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
-(ReferenceModelRayActor pid=101495) (k_proj): Linear(in_features=3584, out_features=512, bias=True)
-(ReferenceModelRayActor pid=101495) (v_proj): Linear(in_features=3584, out_features=512, bias=True)
-(ReferenceModelRayActor pid=101495) (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
-(ReferenceModelRayActor pid=101495) (rotary_emb): Qwen2_5_VLRotaryEmbedding()
-(ReferenceModelRayActor pid=101495) )
-(ReferenceModelRayActor pid=101495) (mlp): Qwen2MLP(
-(ReferenceModelRayActor pid=101495) (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
-(ReferenceModelRayActor pid=101495) (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
-(ReferenceModelRayActor pid=101495) (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
-(ReferenceModelRayActor pid=101495) (act_fn): SiLU()
-(ReferenceModelRayActor pid=101495) )
-(ReferenceModelRayActor pid=101495) (input_layernorm): Qwen2RMSNorm((0,), eps=1e-06)
-(ReferenceModelRayActor pid=101495) (post_attention_layernorm): Qwen2RMSNorm((0,), eps=1e-06)
-(ReferenceModelRayActor pid=101495) )
-(ReferenceModelRayActor pid=101495) )
-(ReferenceModelRayActor pid=101495) (norm): Qwen2RMSNorm((0,), eps=1e-06)
-(ReferenceModelRayActor pid=101495) (rotary_emb): Qwen2_5_VLRotaryEmbedding()
-(ReferenceModelRayActor pid=101495) )
-(ReferenceModelRayActor pid=101495) (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
-(ReferenceModelRayActor pid=101495) )
-(ReferenceModelRayActor pid=101495) )
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:26,658] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.4, git-hash=unknown, git-branch=unknown
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:26,658] [INFO] [comm.py:683:init_distributed] Distributed backend already initialized
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:26,677] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:26,679] [INFO] [logging.py:128:log_dist] [Rank 0] Creating ZeRO Offload
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:26,901] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:26,902] [INFO] [utils.py:782:see_memory_usage] MA 1.94 GB Max_MA 3.98 GB CA 4.04 GB Max_CA 4 GB
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:26,903] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 582.59 GB, percent = 57.8%
-(ReferenceModelRayActor pid=101495) Parameter Offload: Total persistent parameters: 848896 in 368 params
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,125] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,125] [INFO] [utils.py:782:see_memory_usage] MA 1.94 GB Max_MA 1.94 GB CA 4.04 GB Max_CA 4 GB
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,126] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 586.3 GB, percent = 58.2%
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,128] [INFO] [config.py:1001:print] DeepSpeedEngine configuration:
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] activation_checkpointing_config {
-(ReferenceModelRayActor pid=101495) "partition_activations": false,
-(ReferenceModelRayActor pid=101495) "contiguous_memory_optimization": false,
-(ReferenceModelRayActor pid=101495) "cpu_checkpointing": false,
-(ReferenceModelRayActor pid=101495) "number_checkpoints": null,
-(ReferenceModelRayActor pid=101495) "synchronize_checkpoint_boundary": false,
-(ReferenceModelRayActor pid=101495) "profile": false
-(ReferenceModelRayActor pid=101495) }
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'intra_op_parallelism': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] amp_enabled .................. False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] amp_params ................... False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] autotuning_config ............ {
-(ReferenceModelRayActor pid=101495) "enabled": false,
-(ReferenceModelRayActor pid=101495) "start_step": null,
-(ReferenceModelRayActor pid=101495) "end_step": null,
-(ReferenceModelRayActor pid=101495) "metric_path": null,
-(ReferenceModelRayActor pid=101495) "arg_mappings": null,
-(ReferenceModelRayActor pid=101495) "metric": "throughput",
-(ReferenceModelRayActor pid=101495) "model_info": null,
-(ReferenceModelRayActor pid=101495) "results_dir": "autotuning_results",
-(ReferenceModelRayActor pid=101495) "exps_dir": "autotuning_exps",
-(ReferenceModelRayActor pid=101495) "overwrite": true,
-(ReferenceModelRayActor pid=101495) "fast": true,
-(ReferenceModelRayActor pid=101495) "start_profile_step": 3,
-(ReferenceModelRayActor pid=101495) "end_profile_step": 5,
-(ReferenceModelRayActor pid=101495) "tuner_type": "gridsearch",
-(ReferenceModelRayActor pid=101495) "tuner_early_stopping": 5,
-(ReferenceModelRayActor pid=101495) "tuner_num_trials": 50,
-(ReferenceModelRayActor pid=101495) "model_info_path": null,
-(ReferenceModelRayActor pid=101495) "mp_size": 1,
-(ReferenceModelRayActor pid=101495) "max_train_batch_size": null,
-(ReferenceModelRayActor pid=101495) "min_train_batch_size": 1,
-(ReferenceModelRayActor pid=101495) "max_train_micro_batch_size_per_gpu": 1.024000e+03,
-(ReferenceModelRayActor pid=101495) "min_train_micro_batch_size_per_gpu": 1,
-(ReferenceModelRayActor pid=101495) "num_tuning_micro_batch_sizes": 3
-(ReferenceModelRayActor pid=101495) }
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] bfloat16_enabled ............. True
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] bfloat16_immediate_grad_update False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] checkpoint_parallel_write_pipeline False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] checkpoint_tag_validation_enabled True
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] checkpoint_tag_validation_fail False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] comms_config .................
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] communication_data_type ...... None
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] curriculum_enabled_legacy .... False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] curriculum_params_legacy ..... False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,129] [INFO] [config.py:1005:print] data_efficiency_enabled ...... False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] dataloader_drop_last ......... False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] disable_allgather ............ False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] dump_state ................... False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] dynamic_loss_scale_args ...... None
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] eigenvalue_enabled ........... False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] eigenvalue_gas_boundary_resolution 1
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] eigenvalue_layer_name ........ bert.encoder.layer
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] eigenvalue_layer_num ......... 0
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] eigenvalue_max_iter .......... 100
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] eigenvalue_stability ......... 1e-06
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] eigenvalue_tol ............... 0.01
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] eigenvalue_verbose ........... False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] elasticity_enabled ........... False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] flops_profiler_config ........ {
-(ReferenceModelRayActor pid=101495) "enabled": false,
-(ReferenceModelRayActor pid=101495) "recompute_fwd_factor": 0.0,
-(ReferenceModelRayActor pid=101495) "profile_step": 1,
-(ReferenceModelRayActor pid=101495) "module_depth": -1,
-(ReferenceModelRayActor pid=101495) "top_modules": 1,
-(ReferenceModelRayActor pid=101495) "detailed": true,
-(ReferenceModelRayActor pid=101495) "output_file": null
-(ReferenceModelRayActor pid=101495) }
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] fp16_auto_cast ............... None
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] fp16_enabled ................. False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] fp16_master_weights_and_gradients False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] global_rank .................. 0
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] grad_accum_dtype ............. None
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] gradient_accumulation_steps .. 8
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] gradient_clipping ............ 1.0
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,130] [INFO] [config.py:1005:print] gradient_predivide_factor .... 1.0
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] graph_harvesting ............. False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] initial_dynamic_scale ........ 1
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] load_universal_checkpoint .... False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] loss_scale ................... 1.0
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] memory_breakdown ............. False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] mics_hierarchial_params_gather False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] mics_shard_size .............. -1
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] nebula_config ................ {
-(ReferenceModelRayActor pid=101495) "enabled": false,
-(ReferenceModelRayActor pid=101495) "persistent_storage_path": null,
-(ReferenceModelRayActor pid=101495) "persistent_time_interval": 100,
-(ReferenceModelRayActor pid=101495) "num_of_version_in_retention": 2,
-(ReferenceModelRayActor pid=101495) "enable_nebula_load": true,
-(ReferenceModelRayActor pid=101495) "load_path": null
-(ReferenceModelRayActor pid=101495) }
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] optimizer_legacy_fusion ...... False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] optimizer_name ............... None
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] optimizer_params ............. None
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] pld_enabled .................. False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] pld_params ................... False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] prescale_gradients ........... False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] scheduler_name ............... None
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] scheduler_params ............. None
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] seq_parallel_communication_data_type torch.float32
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,131] [INFO] [config.py:1005:print] sparse_attention ............. None
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print] sparse_gradients_enabled ..... False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print] steps_per_print .............. 100
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print] tensor_parallel_config ....... dtype=torch.float16 autotp_size=0 tensor_parallel=TPConfig(tp_size=1, tp_grain_size=1, mpu=None, tp_group=None) injection_policy_tuple=None keep_module_on_host=False replace_with_kernel_inject=False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print] timers_config ................ enabled=True synchronized=True
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print] train_batch_size ............. 128
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print] train_micro_batch_size_per_gpu 2
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print] use_data_before_expert_parallel_ False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print] use_node_local_storage ....... False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print] wall_clock_breakdown ......... False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print] weight_quantization_config ... None
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print] world_size ................... 8
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print] zero_allow_untested_optimizer False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=True) offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print] zero_enabled ................. True
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print] zero_force_ds_cpu_optimizer .. True
-(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:1005:print] zero_optimization_stage ......
3 -(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:27,132] [INFO] [config.py:991:print_user_config] json = { -(ReferenceModelRayActor pid=101495) "steps_per_print": 100, -(ReferenceModelRayActor pid=101495) "zero_optimization": { -(ReferenceModelRayActor pid=101495) "stage": 3, -(ReferenceModelRayActor pid=101495) "stage3_max_live_parameters": "auto", -(ReferenceModelRayActor pid=101495) "stage3_max_reuse_distance": "auto", -(ReferenceModelRayActor pid=101495) "stage3_param_persistence_threshold": "auto", -(ReferenceModelRayActor pid=101495) "stage3_prefetch_bucket_size": "auto", -(ReferenceModelRayActor pid=101495) "offload_param": { -(ReferenceModelRayActor pid=101495) "device": "none", -(ReferenceModelRayActor pid=101495) "pin_memory": true -(ReferenceModelRayActor pid=101495) } -(ReferenceModelRayActor pid=101495) }, -(ReferenceModelRayActor pid=101495) "bf16": { -(ReferenceModelRayActor pid=101495) "enabled": true -(ReferenceModelRayActor pid=101495) }, -(ReferenceModelRayActor pid=101495) "gradient_clipping": 1.0, -(ReferenceModelRayActor pid=101495) "prescale_gradients": false, -(ReferenceModelRayActor pid=101495) "wall_clock_breakdown": false, -(ReferenceModelRayActor pid=101495) "train_micro_batch_size_per_gpu": 2, -(ReferenceModelRayActor pid=101495) "train_batch_size": 128 -(ReferenceModelRayActor pid=101495) } -(ActorModelRayActor pid=100959) [2025-05-26 19:23:30,762] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states -(ActorModelRayActor pid=100959) [2025-05-26 19:23:30,764] [INFO] [stage3.py:534:_setup_for_real_optimizer] optimizer state initialized -(ReferenceModelRayActor pid=101495) [2025-05-26 19:23:26,658] [INFO] [config.py:734:__init__] Config mesh_device None world_size = 8 [repeated 8x across cluster] -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,511] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,513] [INFO] 
[logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,513] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client LR scheduler -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,513] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler = -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,513] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95)] -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,519] [INFO] [config.py:1005:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True 
pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False -(ActorModelRayActor pid=100959) "device": "none" -(ActorModelRayActor pid=100959) "offload_optimizer": { -(ActorModelRayActor pid=100959) "device": "cpu", -(ActorModelRayActor pid=100959) "sub_group_size": "auto", -(ActorModelRayActor pid=100959) "reduce_bucket_size": "auto", -(ActorModelRayActor pid=100959) "zero_hpz_partition_size": 1, -(ActorModelRayActor pid=100959) "zero_quantized_weights": false, -(ActorModelRayActor pid=100959) "zero_quantized_gradients": false, -(ActorModelRayActor pid=100959) "reduce_scatter": true -(ActorModelRayActor pid=100959) "data_types": { -(ActorModelRayActor pid=100959) "grad_accum_dtype": null -(ActorModelRayActor pid=100959) "checkpoint": { -(ActorModelRayActor pid=100959) "load_universal": false -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,512] [INFO] [utils.py:782:see_memory_usage] MA 2.86 GB Max_MA 4.89 GB CA 5.02 GB Max_CA 5 GB  [repeated 2x across cluster] -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,513] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 549.69 GB, percent = 54.6% [repeated 2x across cluster] -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,515] [INFO] [config.py:1001:print] DeepSpeedEngine configuration: -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,515] [INFO] [config.py:1005:print] activation_checkpointing_config { -(ActorModelRayActor pid=100959) "partition_activations": false, -(ActorModelRayActor pid=100959) "contiguous_memory_optimization": false, -(ActorModelRayActor pid=100959) "cpu_checkpointing": false, -(ActorModelRayActor pid=100959) "number_checkpoints": null, -(ActorModelRayActor pid=100959) "synchronize_checkpoint_boundary": false, -(ActorModelRayActor pid=100959) "profile": false -(ActorModelRayActor pid=100959) } [repeated 5x across cluster] -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,515] [INFO] [config.py:1005:print] aio_config 
................... {'block_size': 1048576, 'queue_depth': 8, 'intra_op_parallelism': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,515] [INFO] [config.py:1005:print] amp_enabled .................. False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,515] [INFO] [config.py:1005:print] amp_params ................... False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,515] [INFO] [config.py:1005:print] autotuning_config ............ { -(ActorModelRayActor pid=100959) "enabled": false,  [repeated 3x across cluster] -(ActorModelRayActor pid=100959) "start_step": null, -(ActorModelRayActor pid=100959) "end_step": null, -(ActorModelRayActor pid=100959) "metric_path": null, -(ActorModelRayActor pid=100959) "arg_mappings": null, -(ActorModelRayActor pid=100959) "metric": "throughput", -(ActorModelRayActor pid=100959) "model_info": null, -(ActorModelRayActor pid=100959) "results_dir": "autotuning_results", -(ActorModelRayActor pid=100959) "exps_dir": "autotuning_exps", -(ActorModelRayActor pid=100959) "overwrite": true, -(ActorModelRayActor pid=100959) "fast": true, -(ActorModelRayActor pid=100959) "start_profile_step": 3, -(ActorModelRayActor pid=100959) "end_profile_step": 5, -(ActorModelRayActor pid=100959) "tuner_type": "gridsearch", -(ActorModelRayActor pid=100959) "tuner_early_stopping": 5, -(ActorModelRayActor pid=100959) "tuner_num_trials": 50, -(ActorModelRayActor pid=100959) "model_info_path": null, -(ActorModelRayActor pid=100959) "mp_size": 1, -(ActorModelRayActor pid=100959) "max_train_batch_size": null, -(ActorModelRayActor pid=100959) "min_train_batch_size": 1, -(ActorModelRayActor pid=100959) "max_train_micro_batch_size_per_gpu": 1.024000e+03, -(ActorModelRayActor pid=100959) "min_train_micro_batch_size_per_gpu": 1, -(ActorModelRayActor pid=100959) "num_tuning_micro_batch_sizes": 3 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,515] [INFO] 
[config.py:1005:print] bfloat16_enabled ............. True -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] fp16_master_weights_and_gradients False [repeated 2x across cluster] -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] checkpoint_parallel_write_pipeline False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] checkpoint_tag_validation_enabled True -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] checkpoint_tag_validation_fail False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] comms_config ................. -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] communication_data_type ...... None -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 
'layer_reduction': {'enabled': False}} -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] curriculum_enabled_legacy .... False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] curriculum_params_legacy ..... False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] data_efficiency_enabled ...... False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] dataloader_drop_last ......... False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] disable_allgather ............ False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] dump_state ................... False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] dynamic_loss_scale_args ...... None -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] eigenvalue_enabled ........... False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] eigenvalue_gas_boundary_resolution 1 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] eigenvalue_layer_name ........ bert.encoder.layer -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] eigenvalue_layer_num ......... 0 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] eigenvalue_max_iter .......... 
100 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,516] [INFO] [config.py:1005:print] eigenvalue_stability ......... 1e-06 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] eigenvalue_tol ............... 0.01 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] eigenvalue_verbose ........... False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] elasticity_enabled ........... False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] flops_profiler_config ........ { -(ActorModelRayActor pid=100959) "recompute_fwd_factor": 0.0, -(ActorModelRayActor pid=100959) "profile_step": 1, -(ActorModelRayActor pid=100959) "module_depth": -1, -(ActorModelRayActor pid=100959) "top_modules": 1, -(ActorModelRayActor pid=100959) "detailed": true, -(ActorModelRayActor pid=100959) "output_file": null -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] fp16_auto_cast ............... None -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] fp16_enabled ................. False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] global_rank .................. 0 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] grad_accum_dtype ............. None -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] gradient_accumulation_steps .. 8 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] gradient_clipping ............ 1.0 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] gradient_predivide_factor .... 1.0 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] graph_harvesting ............. 
False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] initial_dynamic_scale ........ 1 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] load_universal_checkpoint .... False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] loss_scale ................... 1.0 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] memory_breakdown ............. False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] mics_hierarchial_params_gather False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] mics_shard_size .............. -1 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,517] [INFO] [config.py:1005:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] nebula_config ................ 
{ -(ActorModelRayActor pid=100959) "persistent_storage_path": null, -(ActorModelRayActor pid=100959) "persistent_time_interval": 100, -(ActorModelRayActor pid=100959) "num_of_version_in_retention": 2, -(ActorModelRayActor pid=100959) "enable_nebula_load": true, -(ActorModelRayActor pid=100959) "load_path": null -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] optimizer_legacy_fusion ...... False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] optimizer_name ............... None -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] optimizer_params ............. None -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] pld_enabled .................. False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] pld_params ................... False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] prescale_gradients ........... False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] scheduler_name ............... None -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] scheduler_params ............. None -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] seq_parallel_communication_data_type torch.float32 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] sparse_attention ............. None -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] sparse_gradients_enabled ..... 
False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] steps_per_print .............. 100 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] tensor_parallel_config ....... dtype=torch.float16 autotp_size=0 tensor_parallel=TPConfig(tp_size=1, tp_grain_size=1, mpu=None, tp_group=None) injection_policy_tuple=None keep_module_on_host=False replace_with_kernel_inject=False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] timers_config ................ enabled=True synchronized=True -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] train_batch_size ............. 128 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] train_micro_batch_size_per_gpu 2 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] use_data_before_expert_parallel_ False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] use_node_local_storage ....... False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] wall_clock_breakdown ......... False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,518] [INFO] [config.py:1005:print] weight_quantization_config ... None -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,519] [INFO] [config.py:1005:print] world_size ................... 8 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,519] [INFO] [config.py:1005:print] zero_allow_untested_optimizer False -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,519] [INFO] [config.py:1005:print] zero_enabled ................. True -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,519] [INFO] [config.py:1005:print] zero_force_ds_cpu_optimizer .. True -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,519] [INFO] [config.py:1005:print] zero_optimization_stage ...... 
3 -(ActorModelRayActor pid=100959) [2025-05-26 19:23:33,519] [INFO] [config.py:991:print_user_config] json = { -(ActorModelRayActor pid=100959) "steps_per_print": 100, -(ActorModelRayActor pid=100959) "zero_optimization": { -(ActorModelRayActor pid=100959) "stage": 3, -(ActorModelRayActor pid=100959) "stage3_prefetch_bucket_size": "auto",  [repeated 4x across cluster] -(ActorModelRayActor pid=100959) "offload_param": { -(ActorModelRayActor pid=100959) "pin_memory": true -(ActorModelRayActor pid=100959) },  [repeated 6x across cluster] -(ActorModelRayActor pid=100959) "bf16": { -(ActorModelRayActor pid=100959) "enabled": true -(ActorModelRayActor pid=100959) "gradient_clipping": 1.0, -(ActorModelRayActor pid=100959) "prescale_gradients": false, -(ActorModelRayActor pid=100959) "wall_clock_breakdown": false, -(ActorModelRayActor pid=100959) "train_micro_batch_size_per_gpu": 2, -(ActorModelRayActor pid=100959) "train_batch_size": 128 -(ActorModelRayActor pid=100959) wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information. -(ReferenceModelRayActor pid=101495) -Loading checkpoint shards: 80%|████████ | 4/5 [00:26<00:06, 6.26s/it] [repeated 6x across cluster] -(ReferenceModelRayActor pid=101495) -Loading checkpoint shards: 100%|██████████| 5/5 [00:27<00:00, 4.45s/it] -Loading checkpoint shards: 100%|██████████| 5/5 [00:27<00:00, 5.51s/it] [repeated 7x across cluster] -(ActorModelRayActor pid=100959) wandb: Tracking run with wandb version 0.19.8 -(ActorModelRayActor pid=100959) wandb: W&B syncing is set to `offline` in this directory. -(ActorModelRayActor pid=100959) wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing. -(LLMRayActor pid=89986) init_process_group: master_address=10.140.1.87, master_port=1652, rank=6, world_size=9, group_name=openrlhf -(ActorModelRayActor pid=100959) -Episode [1/2]: 0%| | 0/187 [00:00