Junyi42 committed
Commit bbc51e2 · verified · 1 Parent(s): 65e4489

Upload checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins
checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/wandb/offline-run-20260129_223514-checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins-run0/files/config.yaml CHANGED
@@ -0,0 +1,459 @@
+ wandb_version: 1
+
+ _wandb:
+   desc: null
+   value:
+     python_version: 3.11.10
+     cli_version: 0.23.1
+     framework: huggingface
+     huggingface_version: 4.49.0
+     is_jupyter_run: false
+     is_kaggle_kernel: false
+     start_time: 1769726114
+     t:
+       1:
+       - 1
+       - 5
+       - 11
+       - 41
+       - 49
+       - 53
+       - 71
+       - 105
+       2:
+       - 1
+       - 5
+       - 11
+       - 41
+       - 49
+       - 53
+       - 71
+       - 105
+       3:
+       - 2
+       - 4
+       - 13
+       - 14
+       - 37
+       - 42
+       - 61
+       4: 3.11.10
+       5: 0.23.1
+       6: 4.49.0
+       13: linux-x86_64
+     e:
+       cf0m5et0f6ffnx56j9dnkbvf0iq03h00:
+         os: Linux-6.6.93+-x86_64-with-glibc2.35
+         python: CPython 3.11.10
+         started_at: '2026-01-29T22:35:14.394399Z'
+         args:
+         - --dataset_config_file
+         - ./data/configs/vlm_gym_toy_maze_2d_train_celoss_no_mse.yaml
+         - --eval_dataset_config_file
+         - ./data/configs/vlm_gym_toy_maze_2d_eval_celoss_no_mse.yaml
+         - --viz_dataset_config_file
+         - ./data/configs/vlm_gym_toy_maze_2d_eval_celoss_no_mse.yaml
+         - --train_data_dir
+         - /home/clouduser/Code/data/gym/toy_maze_2d_v5/train/
+         - --train_jsonl_path
+         - /home/clouduser/Code/data/gym/toy_maze_2d_v5/train/
+         - --eval_data_dir
+         - /home/clouduser/Code/data/gym/toy_maze_2d_v5/val/
+         - --eval_jsonl_path
+         - /home/clouduser/Code/data/gym/toy_maze_2d_v5/val/
+         - --inference_hash_file
+         - /home/clouduser/Code/Github/launch_new/hashes_test_set_v10.json
+         - --task_name
+         - toy_maze_2d_v5
+         - --instructions_dir
+         - ./data/instructions
+         - --model_path
+         - /home/clouduser/Code/Models/BAGEL-7B-MoT
+         - --layer_module
+         - Qwen2MoTDecoderLayer
+         - --max_latent_size
+         - '64'
+         - --resume-from
+         - /home/clouduser/Code/Models/BAGEL-7B-MoT
+         - --finetune_from_hf
+         - 'True'
+         - --auto_resume
+         - 'False'
+         - --resume-model-only
+         - 'True'
+         - --finetune-from-ema
+         - 'True'
+         - --log_every
+         - '1'
+         - --lr
+         - 2e-5
+         - --warmup_steps
+         - '300'
+         - --lr_scheduler
+         - cosine
+         - --num_worker
+         - '1'
+         - --expected_num_tokens
+         - '30000'
+         - --max_num_tokens
+         - '30000'
+         - --max_num_tokens_per_sample
+         - '30000'
+         - --visual_und
+         - 'True'
+         - --visual_gen
+         - 'False'
+         - --save_every
+         - '5000'
+         - --total_steps
+         - '5000'
+         - --text_cond_dropout_prob
+         - '0.0'
+         - --vae_cond_dropout_prob
+         - '0.3'
+         - --vit_cond_dropout_prob
+         - '0.0'
+         - --ema
+         - '0.993'
+         - --checkpoint_dir
+         - /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins
+         - --wandb_project
+         - bagel
+         - --wandb_name
+         - checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins
+         - --wandb_dir
+         - /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins
+         - --wandb_offline
+         - 'True'
+         program: /home/clouduser/Code/Github/unified_world_model/train/pretrain_unified_navit.py
+         code_path: train/pretrain_unified_navit.py
+         code_path_local: train/pretrain_unified_navit.py
+         git:
+           remote_url: https://github.com/para-lost/unified_world_model
+           commit: 8d7b26b7e552fc87b592cf3be94d93be7aeca2a9
+         root: /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins
+         host: junyizhang-launch-new-226786579-1-0
+         executable: /opt/conda/bin/python3.11
+         cpu_count: 48
+         cpu_count_logical: 96
+         gpu_type: NVIDIA A100-SXM4-80GB
+         gpu_count: 8
+         disk:
+           /:
+             total: '1052461830144'
+             used: '187289595904'
+         memory:
+           total: '1437332606976'
+         gpu_nvidia:
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-7fb62b69-ecda-78ca-6a78-e9242da41257
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-63537668-be67-064d-0339-a4807fac2e17
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-3d52c06a-5ceb-fc52-708a-f5c75328664b
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-befa8eef-f6ca-6ce5-9b17-6d284a071265
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-87f27493-86c3-9a50-4e27-a7100ec26ac4
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-2d4912d0-bf74-6ded-3c18-922df6d46383
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-2eb784d3-6f27-8d12-081c-a0de47b17206
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-ac067d32-e7bf-7556-dcab-e5a49410eb3d
+         cuda_version: '12.2'
+     writer_id: cf0m5et0f6ffnx56j9dnkbvf0iq03h00
+ visual_gen:
+   desc: null
+   value: false
+ visual_und:
+   desc: null
+   value: true
+ results_dir:
+   desc: null
+   value: results
+ checkpoint_dir:
+   desc: null
+   value: /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins
+ wandb_project:
+   desc: null
+   value: bagel
+ wandb_name:
+   desc: null
+   value: checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins
+ wandb_runid:
+   desc: null
+   value: '0'
+ wandb_resume:
+   desc: null
+   value: allow
+ wandb_offline:
+   desc: null
+   value: true
+ wandb_dir:
+   desc: null
+   value: /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins
+ global_seed:
+   desc: null
+   value: 4396
+ auto_resume:
+   desc: null
+   value: false
+ resume_from:
+   desc: null
+   value: /home/clouduser/Code/Models/BAGEL-7B-MoT
+ resume_model_only:
+   desc: null
+   value: true
+ finetune_from_ema:
+   desc: null
+   value: true
+ finetune_from_hf:
+   desc: null
+   value: true
+ log_every:
+   desc: null
+   value: 1
+ save_every:
+   desc: null
+   value: 5000
+ total_steps:
+   desc: null
+   value: 5000
+ warmup_steps:
+   desc: null
+   value: 300
+ lr_scheduler:
+   desc: null
+   value: cosine
+ lr:
+   desc: null
+   value: 2.0e-05
+ min_lr:
+   desc: null
+   value: 1.0e-07
+ beta1:
+   desc: null
+   value: 0.9
+ beta2:
+   desc: null
+   value: 0.95
+ eps:
+   desc: null
+   value: 1.0e-15
+ ema:
+   desc: null
+   value: 0.993
+ max_grad_norm:
+   desc: null
+   value: 1.0
+ timestep_shift:
+   desc: null
+   value: 1.0
+ mse_weight:
+   desc: null
+   value: 1.0
+ ce_weight:
+   desc: null
+   value: 1.0
+ ce_loss_reweighting:
+   desc: null
+   value: false
+ expected_num_tokens:
+   desc: null
+   value: 30000
+ num_replicate:
+   desc: null
+   value: 1
+ num_shard:
+   desc: null
+   value: 8
+ sharding_strategy:
+   desc: null
+   value: HYBRID_SHARD
+ backward_prefetch:
+   desc: null
+   value: BACKWARD_PRE
+ cpu_offload:
+   desc: null
+   value: false
+ freeze_llm:
+   desc: null
+   value: false
+ freeze_vit:
+   desc: null
+   value: false
+ freeze_vae:
+   desc: null
+   value: true
+ freeze_und:
+   desc: null
+   value: false
+ copy_init_moe:
+   desc: null
+   value: true
+ use_flex:
+   desc: null
+   value: false
+ eval_every:
+   desc: null
+   value: 500
+ num_eval_batches:
+   desc: null
+   value: 20
+ use_ema_for_eval:
+   desc: null
+   value: true
+ eval_log_dir:
+   desc: null
+   value: null
+ eval_run_tag:
+   desc: null
+   value: ''
+ viz_every:
+   desc: null
+   value: 500
+ viz_n:
+   desc: null
+   value: 8
+ viz_outdir:
+   desc: null
+   value: results/viz
+ eval_dataset_config_file:
+   desc: null
+   value: ./data/configs/vlm_gym_toy_maze_2d_eval_celoss_no_mse.yaml
+ viz_dataset_config_file:
+   desc: null
+   value: ./data/configs/vlm_gym_toy_maze_2d_eval_celoss_no_mse.yaml
+ eval_print_n:
+   desc: null
+   value: 3
+ save_ema_only:
+   desc: null
+   value: true
+ save_optimizer:
+   desc: null
+   value: false
+ model_path:
+   desc: null
+   value: /home/clouduser/Code/Models/BAGEL-7B-MoT
+ llm_path:
+   desc: null
+   value: hf/Qwen2.5-0.5B-Instruct/
+ llm_qk_norm:
+   desc: null
+   value: true
+ tie_word_embeddings:
+   desc: null
+   value: false
+ layer_module:
+   desc: null
+   value: Qwen2MoTDecoderLayer
+ vae_path:
+   desc: null
+   value: flux/vae/ae.safetensors
+ vit_path:
+   desc: null
+   value: hf/siglip-so400m-14-980-flash-attn2-navit/
+ max_latent_size:
+   desc: null
+   value: 64
+ latent_patch_size:
+   desc: null
+   value: 2
+ vit_patch_size:
+   desc: null
+   value: 14
+ vit_max_num_patch_per_side:
+   desc: null
+   value: 70
+ connector_act:
+   desc: null
+   value: gelu_pytorch_tanh
+ interpolate_pos:
+   desc: null
+   value: false
+ vit_select_layer:
+   desc: null
+   value: -2
+ vit_rope:
+   desc: null
+   value: false
+ text_cond_dropout_prob:
+   desc: null
+   value: 0.0
+ vae_cond_dropout_prob:
+   desc: null
+   value: 0.3
+ vit_cond_dropout_prob:
+   desc: null
+   value: 0.0
+ dataset_config_file:
+   desc: null
+   value: ./data/configs/vlm_gym_toy_maze_2d_train_celoss_no_mse.yaml
+ train_data_dir:
+   desc: null
+   value: /home/clouduser/Code/data/gym/toy_maze_2d_v5/train/
+ train_jsonl_path:
+   desc: null
+   value: /home/clouduser/Code/data/gym/toy_maze_2d_v5/train/
+ eval_data_dir:
+   desc: null
+   value: /home/clouduser/Code/data/gym/toy_maze_2d_v5/val/
+ eval_jsonl_path:
+   desc: null
+   value: /home/clouduser/Code/data/gym/toy_maze_2d_v5/val/
+ inference_hash_file:
+   desc: null
+   value: /home/clouduser/Code/Github/launch_new/hashes_test_set_v10.json
+ task_name:
+   desc: null
+   value: toy_maze_2d_v5
+ instructions_dir:
+   desc: null
+   value: ./data/instructions
+ prefetch_factor:
+   desc: null
+   value: 2
+ num_workers:
+   desc: null
+   value: 1
+ max_num_tokens_per_sample:
+   desc: null
+   value: 30000
+ max_num_tokens:
+   desc: null
+   value: 30000
+ prefer_buffer_before:
+   desc: null
+   value: 16384
+ max_buffer_size:
+   desc: null
+   value: 50
+ data_seed:
+   desc: null
+   value: 42
checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/wandb/offline-run-20260129_223514-checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins-run0/files/output.log CHANGED
@@ -1,3 +1,173 @@
  wandb: Detected [huggingface_hub.inference] in use.
  wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
  wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
@@ -917,176 +1087,6 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
  [2026-01-29 23:51:13] (step=0000906) Train Loss mse: 0.0000, Train Loss ce: 0.0422, Train Steps/Sec: 0.21,
  [2026-01-29 23:51:17] (step=0000907) Train Loss mse: 0.0000, Train Loss ce: 0.0435, Train Steps/Sec: 0.24,
  [2026-01-29 23:51:22] (step=0000908) Train Loss mse: 0.0000, Train Loss ce: 0.0414, Train Steps/Sec: 0.21,
- FullyShardedDataParallel(
-   (_fsdp_wrapped_module): Bagel(
-     (language_model): Qwen2ForCausalLM(
-       (model): Qwen2Model(
-         (embed_tokens): Embedding(152064, 3584)
-         (layers): ModuleList(
-           (0-27): 28 x FullyShardedDataParallel(
-             (_fsdp_wrapped_module): CheckpointWrapper(
-               (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
-                 (self_attn): PackedAttentionMoT(
-                   (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
-                   (k_proj): Linear(in_features=3584, out_features=512, bias=True)
-                   (v_proj): Linear(in_features=3584, out_features=512, bias=True)
-                   (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
-                   (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
-                   (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
-                   (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
-                   (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
-                   (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
-                   (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
-                   (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
-                   (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
-                 )
-                 (mlp): Qwen2MLP(
-                   (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
-                   (act_fn): SiLU()
-                 )
-                 (mlp_moe_gen): Qwen2MLP(
-                   (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
-                   (act_fn): SiLU()
-                 )
-                 (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
-                 (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
-                 (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
-                 (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
-               )
-             )
-           )
-         )
-         (norm): Qwen2RMSNorm((3584,), eps=1e-06)
-         (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
-         (rotary_emb): Qwen2RotaryEmbedding()
-       )
-       (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
-     )
-     (vit_model): SiglipVisionModel(
-       (vision_model): FullyShardedDataParallel(
-         (_fsdp_wrapped_module): SiglipVisionTransformer(
-           (embeddings): SiglipVisionEmbeddings(
-             (position_embedding): Embedding(4900, 1152)
-             (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
-           )
-           (encoder): SiglipEncoder(
-             (layers): ModuleList(
-               (0-25): 26 x FullyShardedDataParallel(
-                 (_fsdp_wrapped_module): CheckpointWrapper(
-                   (_checkpoint_wrapped_module): SiglipEncoderLayer(
-                     (self_attn): SiglipFlashAttention2(
-                       (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                       (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                       (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                       (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                     )
-                     (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
-                     (mlp): SiglipMLP(
-                       (activation_fn): PytorchGELUTanh()
-                       (fc1): Linear(in_features=1152, out_features=4304, bias=True)
-                       (fc2): Linear(in_features=4304, out_features=1152, bias=True)
-                     )
-                     (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
-                   )
-                 )
-               )
-             )
-           )
-           (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
-         )
-       )
-     )
-     (connector): FullyShardedDataParallel(
-       (_fsdp_wrapped_module): CheckpointWrapper(
-         (_checkpoint_wrapped_module): MLPconnector(
-           (activation_fn): PytorchGELUTanh()
-           (fc1): Linear(in_features=1152, out_features=3584, bias=True)
-           (fc2): Linear(in_features=3584, out_features=3584, bias=True)
-         )
-       )
-     )
-     (vit_pos_embed): FullyShardedDataParallel(
-       (_fsdp_wrapped_module): PositionEmbedding()
-     )
-   )
- )
- _flat_param True
- language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_pos_embed._fsdp_wrapped_module._flat_param False
- Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse/vlm_gym_toy_maze_2d_train
- base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step0
- Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- ce_avg: 0.33020690083503723, mse_avg: 0.0
- base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step500
- Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- ce_avg: 0.04093625023961067, mse_avg: 0.0
  [2026-01-29 23:51:26] (step=0000909) Train Loss mse: 0.0000, Train Loss ce: 0.0391, Train Steps/Sec: 0.22,
  [2026-01-29 23:51:31] (step=0000910) Train Loss mse: 0.0000, Train Loss ce: 0.0372, Train Steps/Sec: 0.21,
  [2026-01-29 23:51:36] (step=0000911) Train Loss mse: 0.0000, Train Loss ce: 0.0413, Train Steps/Sec: 0.20,
@@ -1301,6 +1301,27 @@ ce_avg: 0.04093625023961067, mse_avg: 0.0
  [2026-01-30 00:07:51] (step=0001120) Train Loss mse: 0.0000, Train Loss ce: 0.0360, Train Steps/Sec: 0.21,
  [2026-01-30 00:07:55] (step=0001121) Train Loss mse: 0.0000, Train Loss ce: 0.0358, Train Steps/Sec: 0.24,
  [2026-01-30 00:07:59] (step=0001122) Train Loss mse: 0.0000, Train Loss ce: 0.0416, Train Steps/Sec: 0.22,
  [2026-01-30 00:08:04] (step=0001123) Train Loss mse: 0.0000, Train Loss ce: 0.0416, Train Steps/Sec: 0.23,
  [2026-01-30 00:08:09] (step=0001124) Train Loss mse: 0.0000, Train Loss ce: 0.0348, Train Steps/Sec: 0.20,
  [2026-01-30 00:08:13] (step=0001125) Train Loss mse: 0.0000, Train Loss ce: 0.0423, Train Steps/Sec: 0.21,
@@ -2467,41 +2488,6 @@ ce_avg: 0.04093625023961067, mse_avg: 0.0
  [2026-01-30 01:37:24] (step=0002286) Train Loss mse: 0.0000, Train Loss ce: 0.0391, Train Steps/Sec: 0.22,
  [2026-01-30 01:37:29] (step=0002287) Train Loss mse: 0.0000, Train Loss ce: 0.0358, Train Steps/Sec: 0.21,
  [2026-01-30 01:37:33] (step=0002288) Train Loss mse: 0.0000, Train Loss ce: 0.0337, Train Steps/Sec: 0.22,
- base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step1000
- Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- ce_avg: 0.0632656142115593, mse_avg: 0.0
- base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step1500
- Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- ce_avg: 0.07227745652198792, mse_avg: 0.0
- base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step2000
- Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- ce_avg: 0.09361912310123444, mse_avg: 0.0
- base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step2500
- Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- ce_avg: 0.11742664873600006, mse_avg: 0.0
- base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step3000
- Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- ce_avg: 0.15227733552455902, mse_avg: 0.0
  [2026-01-30 01:37:38] (step=0002289) Train Loss mse: 0.0000, Train Loss ce: 0.0288, Train Steps/Sec: 0.21,
  [2026-01-30 01:37:42] (step=0002290) Train Loss mse: 0.0000, Train Loss ce: 0.0374, Train Steps/Sec: 0.22,
  [2026-01-30 01:37:47] (step=0002291) Train Loss mse: 0.0000, Train Loss ce: 0.0294, Train Steps/Sec: 0.22,
@@ -2834,6 +2820,27 @@ ce_avg: 0.15227733552455902, mse_avg: 0.0
  [2026-01-30 02:03:01] (step=0002618) Train Loss mse: 0.0000, Train Loss ce: 0.0383, Train Steps/Sec: 0.22,
  [2026-01-30 02:03:05] (step=0002619) Train Loss mse: 0.0000, Train Loss ce: 0.0307, Train Steps/Sec: 0.22,
  [2026-01-30 02:03:10] (step=0002620) Train Loss mse: 0.0000, Train Loss ce: 0.0402, Train Steps/Sec: 0.22,
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2837
  [2026-01-30 02:03:14] (step=0002621) Train Loss mse: 0.0000, Train Loss ce: 0.0373, Train Steps/Sec: 0.23,
2838
  [2026-01-30 02:03:19] (step=0002622) Train Loss mse: 0.0000, Train Loss ce: 0.0383, Train Steps/Sec: 0.22,
2839
  [2026-01-30 02:03:23] (step=0002623) Train Loss mse: 0.0000, Train Loss ce: 0.0315, Train Steps/Sec: 0.26,
@@ -3453,20 +3460,6 @@ ce_avg: 0.15227733552455902, mse_avg: 0.0
  [2026-01-30 02:50:41] (step=0003237) Train Loss mse: 0.0000, Train Loss ce: 0.0404, Train Steps/Sec: 0.21,
  [2026-01-30 02:50:45] (step=0003238) Train Loss mse: 0.0000, Train Loss ce: 0.0343, Train Steps/Sec: 0.22,
  [2026-01-30 02:50:50] (step=0003239) Train Loss mse: 0.0000, Train Loss ce: 0.0265, Train Steps/Sec: 0.21,
- base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step3500
- Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- ce_avg: 0.19604933261871338, mse_avg: 0.0
- base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step4000
- Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- ce_avg: 0.19754347205162048, mse_avg: 0.0
  [2026-01-30 02:50:54] (step=0003240) Train Loss mse: 0.0000, Train Loss ce: 0.0303, Train Steps/Sec: 0.24,
  [2026-01-30 02:50:59] (step=0003241) Train Loss mse: 0.0000, Train Loss ce: 0.0399, Train Steps/Sec: 0.20,
  [2026-01-30 02:51:03] (step=0003242) Train Loss mse: 0.0000, Train Loss ce: 0.0342, Train Steps/Sec: 0.23,
@@ -3910,6 +3903,20 @@ ce_avg: 0.19754347205162048, mse_avg: 0.0
  [2026-01-30 03:25:05] (step=0003680) Train Loss mse: 0.0000, Train Loss ce: 0.0270, Train Steps/Sec: 0.22,
  [2026-01-30 03:25:09] (step=0003681) Train Loss mse: 0.0000, Train Loss ce: 0.0386, Train Steps/Sec: 0.22,
  [2026-01-30 03:25:14] (step=0003682) Train Loss mse: 0.0000, Train Loss ce: 0.0292, Train Steps/Sec: 0.20,
  [2026-01-30 03:25:18] (step=0003683) Train Loss mse: 0.0000, Train Loss ce: 0.0377, Train Steps/Sec: 0.24,
  [2026-01-30 03:25:23] (step=0003684) Train Loss mse: 0.0000, Train Loss ce: 0.0461, Train Steps/Sec: 0.21,
  [2026-01-30 03:25:28] (step=0003685) Train Loss mse: 0.0000, Train Loss ce: 0.0395, Train Steps/Sec: 0.21,
@@ -4889,20 +4896,6 @@ ce_avg: 0.19754347205162048, mse_avg: 0.0
  [2026-01-30 04:40:52] (step=0004659) Train Loss mse: 0.0000, Train Loss ce: 0.0359, Train Steps/Sec: 0.20,
  [2026-01-30 04:40:56] (step=0004660) Train Loss mse: 0.0000, Train Loss ce: 0.0324, Train Steps/Sec: 0.22,
  [2026-01-30 04:41:01] (step=0004661) Train Loss mse: 0.0000, Train Loss ce: 0.0409, Train Steps/Sec: 0.22,
- base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step4500
- Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- ce_avg: 0.20177732408046722, mse_avg: 0.0
- base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step5000
- Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
- ce_avg: 0.20619651675224304, mse_avg: 0.0
  [2026-01-30 04:41:05] (step=0004662) Train Loss mse: 0.0000, Train Loss ce: 0.0324, Train Steps/Sec: 0.21,
  [2026-01-30 04:41:10] (step=0004663) Train Loss mse: 0.0000, Train Loss ce: 0.0382, Train Steps/Sec: 0.22,
  [2026-01-30 04:41:15] (step=0004664) Train Loss mse: 0.0000, Train Loss ce: 0.0338, Train Steps/Sec: 0.21,
@@ -5245,4 +5238,11 @@ ce_avg: 0.20619651675224304, mse_avg: 0.0
  [2026-01-30 05:07:25] Saving checkpoint to /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/0005000.
  /opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
- [2026-01-30 05:09:58] Done!
+ FullyShardedDataParallel(
+ (_fsdp_wrapped_module): Bagel(
+ (language_model): Qwen2ForCausalLM(
+ (model): Qwen2Model(
+ (embed_tokens): Embedding(152064, 3584)
+ (layers): ModuleList(
+ (0-27): 28 x FullyShardedDataParallel(
+ (_fsdp_wrapped_module): CheckpointWrapper(
+ (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
+ (self_attn): PackedAttentionMoT(
+ (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
+ (k_proj): Linear(in_features=3584, out_features=512, bias=True)
+ (v_proj): Linear(in_features=3584, out_features=512, bias=True)
+ (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
+ (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
+ (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
+ (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
+ (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
+ (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
+ (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
+ (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
+ (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
+ )
+ (mlp): Qwen2MLP(
+ (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
+ (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
+ (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
+ (act_fn): SiLU()
+ )
+ (mlp_moe_gen): Qwen2MLP(
+ (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
+ (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
+ (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
+ (act_fn): SiLU()
+ )
+ (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
+ (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+ (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
+ (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+ )
+ )
+ )
+ )
+ (norm): Qwen2RMSNorm((3584,), eps=1e-06)
+ (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+ (rotary_emb): Qwen2RotaryEmbedding()
+ )
+ (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
+ )
+ (vit_model): SiglipVisionModel(
+ (vision_model): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): SiglipVisionTransformer(
+ (embeddings): SiglipVisionEmbeddings(
+ (position_embedding): Embedding(4900, 1152)
+ (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
+ )
+ (encoder): SiglipEncoder(
+ (layers): ModuleList(
+ (0-25): 26 x FullyShardedDataParallel(
+ (_fsdp_wrapped_module): CheckpointWrapper(
+ (_checkpoint_wrapped_module): SiglipEncoderLayer(
+ (self_attn): SiglipFlashAttention2(
+ (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
+ (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
+ (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
+ (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
+ )
+ (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+ (mlp): SiglipMLP(
+ (activation_fn): PytorchGELUTanh()
+ (fc1): Linear(in_features=1152, out_features=4304, bias=True)
+ (fc2): Linear(in_features=4304, out_features=1152, bias=True)
+ )
+ (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+ )
+ )
+ )
+ )
+ )
+ (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+ )
+ )
+ )
+ (connector): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): CheckpointWrapper(
+ (_checkpoint_wrapped_module): MLPconnector(
+ (activation_fn): PytorchGELUTanh()
+ (fc1): Linear(in_features=1152, out_features=3584, bias=True)
+ (fc2): Linear(in_features=3584, out_features=3584, bias=True)
+ )
+ )
+ )
+ (vit_pos_embed): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): PositionEmbedding()
+ )
+ )
+ )
+ _flat_param True
+ language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_pos_embed._fsdp_wrapped_module._flat_param False
+ Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse/vlm_gym_toy_maze_2d_train
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step0
+ Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ ce_avg: 0.33020690083503723, mse_avg: 0.0
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step500
+ Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ ce_avg: 0.04093625023961067, mse_avg: 0.0
  wandb: Detected [huggingface_hub.inference] in use.
  wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
  wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
 
  [2026-01-29 23:51:13] (step=0000906) Train Loss mse: 0.0000, Train Loss ce: 0.0422, Train Steps/Sec: 0.21,
  [2026-01-29 23:51:17] (step=0000907) Train Loss mse: 0.0000, Train Loss ce: 0.0435, Train Steps/Sec: 0.24,
  [2026-01-29 23:51:22] (step=0000908) Train Loss mse: 0.0000, Train Loss ce: 0.0414, Train Steps/Sec: 0.21,
  [2026-01-29 23:51:26] (step=0000909) Train Loss mse: 0.0000, Train Loss ce: 0.0391, Train Steps/Sec: 0.22,
  [2026-01-29 23:51:31] (step=0000910) Train Loss mse: 0.0000, Train Loss ce: 0.0372, Train Steps/Sec: 0.21,
  [2026-01-29 23:51:36] (step=0000911) Train Loss mse: 0.0000, Train Loss ce: 0.0413, Train Steps/Sec: 0.20,
 
  [2026-01-30 00:07:51] (step=0001120) Train Loss mse: 0.0000, Train Loss ce: 0.0360, Train Steps/Sec: 0.21,
  [2026-01-30 00:07:55] (step=0001121) Train Loss mse: 0.0000, Train Loss ce: 0.0358, Train Steps/Sec: 0.24,
  [2026-01-30 00:07:59] (step=0001122) Train Loss mse: 0.0000, Train Loss ce: 0.0416, Train Steps/Sec: 0.22,
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step1000
+ Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ ce_avg: 0.0632656142115593, mse_avg: 0.0
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step1500
+ Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ ce_avg: 0.07227745652198792, mse_avg: 0.0
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step2000
+ Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ ce_avg: 0.09361912310123444, mse_avg: 0.0
  [2026-01-30 00:08:04] (step=0001123) Train Loss mse: 0.0000, Train Loss ce: 0.0416, Train Steps/Sec: 0.23,
  [2026-01-30 00:08:09] (step=0001124) Train Loss mse: 0.0000, Train Loss ce: 0.0348, Train Steps/Sec: 0.20,
  [2026-01-30 00:08:13] (step=0001125) Train Loss mse: 0.0000, Train Loss ce: 0.0423, Train Steps/Sec: 0.21,
 
  [2026-01-30 01:37:24] (step=0002286) Train Loss mse: 0.0000, Train Loss ce: 0.0391, Train Steps/Sec: 0.22,
  [2026-01-30 01:37:29] (step=0002287) Train Loss mse: 0.0000, Train Loss ce: 0.0358, Train Steps/Sec: 0.21,
  [2026-01-30 01:37:33] (step=0002288) Train Loss mse: 0.0000, Train Loss ce: 0.0337, Train Steps/Sec: 0.22,
  [2026-01-30 01:37:38] (step=0002289) Train Loss mse: 0.0000, Train Loss ce: 0.0288, Train Steps/Sec: 0.21,
  [2026-01-30 01:37:42] (step=0002290) Train Loss mse: 0.0000, Train Loss ce: 0.0374, Train Steps/Sec: 0.22,
  [2026-01-30 01:37:47] (step=0002291) Train Loss mse: 0.0000, Train Loss ce: 0.0294, Train Steps/Sec: 0.22,
 
  [2026-01-30 02:03:01] (step=0002618) Train Loss mse: 0.0000, Train Loss ce: 0.0383, Train Steps/Sec: 0.22,
  [2026-01-30 02:03:05] (step=0002619) Train Loss mse: 0.0000, Train Loss ce: 0.0307, Train Steps/Sec: 0.22,
  [2026-01-30 02:03:10] (step=0002620) Train Loss mse: 0.0000, Train Loss ce: 0.0402, Train Steps/Sec: 0.22,
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step2500
+ Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ ce_avg: 0.11742664873600006, mse_avg: 0.0
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step3000
+ Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ ce_avg: 0.15227733552455902, mse_avg: 0.0
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step3500
+ Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ ce_avg: 0.19604933261871338, mse_avg: 0.0
  [2026-01-30 02:03:14] (step=0002621) Train Loss mse: 0.0000, Train Loss ce: 0.0373, Train Steps/Sec: 0.23,
  [2026-01-30 02:03:19] (step=0002622) Train Loss mse: 0.0000, Train Loss ce: 0.0383, Train Steps/Sec: 0.22,
  [2026-01-30 02:03:23] (step=0002623) Train Loss mse: 0.0000, Train Loss ce: 0.0315, Train Steps/Sec: 0.26,
 
  [2026-01-30 02:50:41] (step=0003237) Train Loss mse: 0.0000, Train Loss ce: 0.0404, Train Steps/Sec: 0.21,
  [2026-01-30 02:50:45] (step=0003238) Train Loss mse: 0.0000, Train Loss ce: 0.0343, Train Steps/Sec: 0.22,
  [2026-01-30 02:50:50] (step=0003239) Train Loss mse: 0.0000, Train Loss ce: 0.0265, Train Steps/Sec: 0.21,
  [2026-01-30 02:50:54] (step=0003240) Train Loss mse: 0.0000, Train Loss ce: 0.0303, Train Steps/Sec: 0.24,
  [2026-01-30 02:50:59] (step=0003241) Train Loss mse: 0.0000, Train Loss ce: 0.0399, Train Steps/Sec: 0.20,
  [2026-01-30 02:51:03] (step=0003242) Train Loss mse: 0.0000, Train Loss ce: 0.0342, Train Steps/Sec: 0.23,
 
  [2026-01-30 03:25:05] (step=0003680) Train Loss mse: 0.0000, Train Loss ce: 0.0270, Train Steps/Sec: 0.22,
  [2026-01-30 03:25:09] (step=0003681) Train Loss mse: 0.0000, Train Loss ce: 0.0386, Train Steps/Sec: 0.22,
  [2026-01-30 03:25:14] (step=0003682) Train Loss mse: 0.0000, Train Loss ce: 0.0292, Train Steps/Sec: 0.20,
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step4000
+ Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ ce_avg: 0.19754347205162048, mse_avg: 0.0
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step4500
+ Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ ce_avg: 0.20177732408046722, mse_avg: 0.0
  [2026-01-30 03:25:18] (step=0003683) Train Loss mse: 0.0000, Train Loss ce: 0.0377, Train Steps/Sec: 0.24,
  [2026-01-30 03:25:23] (step=0003684) Train Loss mse: 0.0000, Train Loss ce: 0.0461, Train Steps/Sec: 0.21,
  [2026-01-30 03:25:28] (step=0003685) Train Loss mse: 0.0000, Train Loss ce: 0.0395, Train Steps/Sec: 0.21,
 
  [2026-01-30 04:40:52] (step=0004659) Train Loss mse: 0.0000, Train Loss ce: 0.0359, Train Steps/Sec: 0.20,
  [2026-01-30 04:40:56] (step=0004660) Train Loss mse: 0.0000, Train Loss ce: 0.0324, Train Steps/Sec: 0.22,
  [2026-01-30 04:41:01] (step=0004661) Train Loss mse: 0.0000, Train Loss ce: 0.0409, Train Steps/Sec: 0.22,
  [2026-01-30 04:41:05] (step=0004662) Train Loss mse: 0.0000, Train Loss ce: 0.0324, Train Steps/Sec: 0.21,
  [2026-01-30 04:41:10] (step=0004663) Train Loss mse: 0.0000, Train Loss ce: 0.0382, Train Steps/Sec: 0.22,
  [2026-01-30 04:41:15] (step=0004664) Train Loss mse: 0.0000, Train Loss ce: 0.0338, Train Steps/Sec: 0.21,
 
  [2026-01-30 05:07:25] Saving checkpoint to /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/0005000.
  /opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
+ [2026-01-30 05:09:58] Done!
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_toy_maze_2d_one_image_lr2e_5_ce_no_mse_ins_step5000
+ Preparing Dataset vlm_gym_toy_maze_2d_celoss_no_mse_evalonce/vlm_gym_toy_maze_2d_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_toy_maze_2d_celoss_no_mse_evalonce'}]
+ ce_avg: 0.20619651675224304, mse_avg: 0.0