Junyi42 committed · Commit e0cdf99 · verified · 1 Parent(s): f09196e

Upload checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins
checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/wandb/offline-run-20260129_223732-checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins-run0/files/config.yaml CHANGED
@@ -0,0 +1,459 @@
+ wandb_version: 1
+
+ _wandb:
+   desc: null
+   value:
+     python_version: 3.11.10
+     cli_version: 0.23.1
+     framework: huggingface
+     huggingface_version: 4.49.0
+     is_jupyter_run: false
+     is_kaggle_kernel: false
+     start_time: 1769726252
+     t:
+       1:
+       - 1
+       - 5
+       - 11
+       - 41
+       - 49
+       - 53
+       - 71
+       - 105
+       2:
+       - 1
+       - 5
+       - 11
+       - 41
+       - 49
+       - 53
+       - 71
+       - 105
+       3:
+       - 2
+       - 4
+       - 13
+       - 14
+       - 37
+       - 42
+       - 61
+       4: 3.11.10
+       5: 0.23.1
+       6: 4.49.0
+       13: linux-x86_64
+     e:
+       5yuouhq3thf4bogl7q9medqf1fa1fnec:
+         os: Linux-6.6.93+-x86_64-with-glibc2.35
+         python: CPython 3.11.10
+         started_at: '2026-01-29T22:37:32.112548Z'
+         args:
+         - --dataset_config_file
+         - ./data/configs/vlm_gym_sliding_block_train_celoss_no_mse.yaml
+         - --eval_dataset_config_file
+         - ./data/configs/vlm_gym_sliding_block_eval_celoss_no_mse.yaml
+         - --viz_dataset_config_file
+         - ./data/configs/vlm_gym_sliding_block_eval_celoss_no_mse.yaml
+         - --train_data_dir
+         - /home/clouduser/Code/data/gym/sliding_block_v5/train/
+         - --train_jsonl_path
+         - /home/clouduser/Code/data/gym/sliding_block_v5/train/
+         - --eval_data_dir
+         - /home/clouduser/Code/data/gym/sliding_block_v5/val/
+         - --eval_jsonl_path
+         - /home/clouduser/Code/data/gym/sliding_block_v5/val/
+         - --inference_hash_file
+         - /home/clouduser/Code/Github/launch_new/hashes_test_set_v10.json
+         - --task_name
+         - sliding_block_v5
+         - --instructions_dir
+         - ./data/instructions
+         - --model_path
+         - /home/clouduser/Code/Models/BAGEL-7B-MoT
+         - --layer_module
+         - Qwen2MoTDecoderLayer
+         - --max_latent_size
+         - '64'
+         - --resume-from
+         - /home/clouduser/Code/Models/BAGEL-7B-MoT
+         - --finetune_from_hf
+         - 'True'
+         - --auto_resume
+         - 'False'
+         - --resume-model-only
+         - 'True'
+         - --finetune-from-ema
+         - 'True'
+         - --log_every
+         - '1'
+         - --lr
+         - 2e-5
+         - --warmup_steps
+         - '300'
+         - --lr_scheduler
+         - cosine
+         - --num_worker
+         - '1'
+         - --expected_num_tokens
+         - '30000'
+         - --max_num_tokens
+         - '30000'
+         - --max_num_tokens_per_sample
+         - '30000'
+         - --visual_und
+         - 'True'
+         - --visual_gen
+         - 'False'
+         - --save_every
+         - '5000'
+         - --total_steps
+         - '5000'
+         - --text_cond_dropout_prob
+         - '0.0'
+         - --vae_cond_dropout_prob
+         - '0.3'
+         - --vit_cond_dropout_prob
+         - '0.0'
+         - --ema
+         - '0.993'
+         - --checkpoint_dir
+         - /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins
+         - --wandb_project
+         - bagel
+         - --wandb_name
+         - checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins
+         - --wandb_dir
+         - /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins
+         - --wandb_offline
+         - 'True'
+         program: /home/clouduser/Code/Github/unified_world_model/train/pretrain_unified_navit.py
+         code_path: train/pretrain_unified_navit.py
+         code_path_local: train/pretrain_unified_navit.py
+         git:
+           remote_url: https://github.com/para-lost/unified_world_model
+           commit: 8d7b26b7e552fc87b592cf3be94d93be7aeca2a9
+         root: /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins
+         host: junyizhang-launch-new-226786293-1-0
+         executable: /opt/conda/bin/python3.11
+         cpu_count: 48
+         cpu_count_logical: 96
+         gpu_type: NVIDIA A100-SXM4-80GB
+         gpu_count: 8
+         disk:
+           /:
+             total: '1052461830144'
+             used: '262257246208'
+         memory:
+           total: '1437332611072'
+         gpu_nvidia:
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-85328465-4be9-5572-aae1-2054ee3a5504
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-7fc6fccb-5444-180b-da82-07623178479f
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-1f343832-64cd-ceae-5a05-d216e3227750
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-a044c281-4d09-5ac7-7eb1-6680c537b74a
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-0de1843c-8ae8-08a7-700e-5c5fa83603f2
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-abd24a4f-3423-4a2d-86ac-9a0e7f7a5420
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-0f33aea7-420e-8f0d-31cc-b7bbb27e4c69
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-a4427a2c-07b2-31e3-9098-f7fd5f48368f
+         cuda_version: '12.2'
+     writer_id: 5yuouhq3thf4bogl7q9medqf1fa1fnec
+ visual_gen:
+   desc: null
+   value: false
+ visual_und:
+   desc: null
+   value: true
+ results_dir:
+   desc: null
+   value: results
+ checkpoint_dir:
+   desc: null
+   value: /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins
+ wandb_project:
+   desc: null
+   value: bagel
+ wandb_name:
+   desc: null
+   value: checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins
+ wandb_runid:
+   desc: null
+   value: '0'
+ wandb_resume:
+   desc: null
+   value: allow
+ wandb_offline:
+   desc: null
+   value: true
+ wandb_dir:
+   desc: null
+   value: /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins
+ global_seed:
+   desc: null
+   value: 4396
+ auto_resume:
+   desc: null
+   value: false
+ resume_from:
+   desc: null
+   value: /home/clouduser/Code/Models/BAGEL-7B-MoT
+ resume_model_only:
+   desc: null
+   value: true
+ finetune_from_ema:
+   desc: null
+   value: true
+ finetune_from_hf:
+   desc: null
+   value: true
+ log_every:
+   desc: null
+   value: 1
+ save_every:
+   desc: null
+   value: 5000
+ total_steps:
+   desc: null
+   value: 5000
+ warmup_steps:
+   desc: null
+   value: 300
+ lr_scheduler:
+   desc: null
+   value: cosine
+ lr:
+   desc: null
+   value: 2.0e-05
+ min_lr:
+   desc: null
+   value: 1.0e-07
+ beta1:
+   desc: null
+   value: 0.9
+ beta2:
+   desc: null
+   value: 0.95
+ eps:
+   desc: null
+   value: 1.0e-15
+ ema:
+   desc: null
+   value: 0.993
+ max_grad_norm:
+   desc: null
+   value: 1.0
+ timestep_shift:
+   desc: null
+   value: 1.0
+ mse_weight:
+   desc: null
+   value: 1.0
+ ce_weight:
+   desc: null
+   value: 1.0
+ ce_loss_reweighting:
+   desc: null
+   value: false
+ expected_num_tokens:
+   desc: null
+   value: 30000
+ num_replicate:
+   desc: null
+   value: 1
+ num_shard:
+   desc: null
+   value: 8
+ sharding_strategy:
+   desc: null
+   value: HYBRID_SHARD
+ backward_prefetch:
+   desc: null
+   value: BACKWARD_PRE
+ cpu_offload:
+   desc: null
+   value: false
+ freeze_llm:
+   desc: null
+   value: false
+ freeze_vit:
+   desc: null
+   value: false
+ freeze_vae:
+   desc: null
+   value: true
+ freeze_und:
+   desc: null
+   value: false
+ copy_init_moe:
+   desc: null
+   value: true
+ use_flex:
+   desc: null
+   value: false
+ eval_every:
+   desc: null
+   value: 500
+ num_eval_batches:
+   desc: null
+   value: 20
+ use_ema_for_eval:
+   desc: null
+   value: true
+ eval_log_dir:
+   desc: null
+   value: null
+ eval_run_tag:
+   desc: null
+   value: ''
+ viz_every:
+   desc: null
+   value: 500
+ viz_n:
+   desc: null
+   value: 8
+ viz_outdir:
+   desc: null
+   value: results/viz
+ eval_dataset_config_file:
+   desc: null
+   value: ./data/configs/vlm_gym_sliding_block_eval_celoss_no_mse.yaml
+ viz_dataset_config_file:
+   desc: null
+   value: ./data/configs/vlm_gym_sliding_block_eval_celoss_no_mse.yaml
+ eval_print_n:
+   desc: null
+   value: 3
+ save_ema_only:
+   desc: null
+   value: true
+ save_optimizer:
+   desc: null
+   value: false
+ model_path:
+   desc: null
+   value: /home/clouduser/Code/Models/BAGEL-7B-MoT
+ llm_path:
+   desc: null
+   value: hf/Qwen2.5-0.5B-Instruct/
+ llm_qk_norm:
+   desc: null
+   value: true
+ tie_word_embeddings:
+   desc: null
+   value: false
+ layer_module:
+   desc: null
+   value: Qwen2MoTDecoderLayer
+ vae_path:
+   desc: null
+   value: flux/vae/ae.safetensors
+ vit_path:
+   desc: null
+   value: hf/siglip-so400m-14-980-flash-attn2-navit/
+ max_latent_size:
+   desc: null
+   value: 64
+ latent_patch_size:
+   desc: null
+   value: 2
+ vit_patch_size:
+   desc: null
+   value: 14
+ vit_max_num_patch_per_side:
+   desc: null
+   value: 70
+ connector_act:
+   desc: null
+   value: gelu_pytorch_tanh
+ interpolate_pos:
+   desc: null
+   value: false
+ vit_select_layer:
+   desc: null
+   value: -2
+ vit_rope:
+   desc: null
+   value: false
+ text_cond_dropout_prob:
+   desc: null
+   value: 0.0
+ vae_cond_dropout_prob:
+   desc: null
+   value: 0.3
+ vit_cond_dropout_prob:
+   desc: null
+   value: 0.0
+ dataset_config_file:
+   desc: null
+   value: ./data/configs/vlm_gym_sliding_block_train_celoss_no_mse.yaml
+ train_data_dir:
+   desc: null
+   value: /home/clouduser/Code/data/gym/sliding_block_v5/train/
+ train_jsonl_path:
+   desc: null
+   value: /home/clouduser/Code/data/gym/sliding_block_v5/train/
+ eval_data_dir:
+   desc: null
+   value: /home/clouduser/Code/data/gym/sliding_block_v5/val/
+ eval_jsonl_path:
+   desc: null
+   value: /home/clouduser/Code/data/gym/sliding_block_v5/val/
+ inference_hash_file:
+   desc: null
+   value: /home/clouduser/Code/Github/launch_new/hashes_test_set_v10.json
+ task_name:
+   desc: null
+   value: sliding_block_v5
+ instructions_dir:
+   desc: null
+   value: ./data/instructions
+ prefetch_factor:
+   desc: null
+   value: 2
+ num_workers:
+   desc: null
+   value: 1
+ max_num_tokens_per_sample:
+   desc: null
+   value: 30000
+ max_num_tokens:
+   desc: null
+   value: 30000
+ prefer_buffer_before:
+   desc: null
+   value: 16384
+ max_buffer_size:
+   desc: null
+   value: 50
+ data_seed:
+   desc: null
+   value: 42
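The config above pins the optimization schedule: lr 2.0e-05, min_lr 1.0e-07, warmup_steps 300, total_steps 5000, lr_scheduler cosine. As a rough illustration only (the actual scheduler lives in train/pretrain_unified_navit.py and may differ in detail), a linear-warmup plus cosine-decay schedule over those values looks like this:

import math

# Values taken from the config.yaml above; the warmup/decay shape is the
# conventional one and is only a sketch, not the repo's exact code.
LR, MIN_LR = 2.0e-05, 1.0e-07
WARMUP_STEPS, TOTAL_STEPS = 300, 5000

def lr_at(step: int) -> float:
    """Learning rate implied by the run's settings at a given step."""
    if step < WARMUP_STEPS:
        return LR * step / WARMUP_STEPS  # linear warmup from 0 to LR
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(lr_at(300))   # peak: 2e-05
print(lr_at(5000))  # floor: 1e-07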
checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/wandb/offline-run-20260129_223732-checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins-run0/files/output.log CHANGED
@@ -1,173 +1,3 @@
- FullyShardedDataParallel(
-   (_fsdp_wrapped_module): Bagel(
-     (language_model): Qwen2ForCausalLM(
-       (model): Qwen2Model(
-         (embed_tokens): Embedding(152064, 3584)
-         (layers): ModuleList(
-           (0-27): 28 x FullyShardedDataParallel(
-             (_fsdp_wrapped_module): CheckpointWrapper(
-               (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
-                 (self_attn): PackedAttentionMoT(
-                   (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
-                   (k_proj): Linear(in_features=3584, out_features=512, bias=True)
-                   (v_proj): Linear(in_features=3584, out_features=512, bias=True)
-                   (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
-                   (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
-                   (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
-                   (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
-                   (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
-                   (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
-                   (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
-                   (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
-                   (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
-                 )
-                 (mlp): Qwen2MLP(
-                   (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
-                   (act_fn): SiLU()
-                 )
-                 (mlp_moe_gen): Qwen2MLP(
-                   (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
-                   (act_fn): SiLU()
-                 )
-                 (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
-                 (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
-                 (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
-                 (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
-               )
-             )
-           )
-         )
-         (norm): Qwen2RMSNorm((3584,), eps=1e-06)
-         (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
-         (rotary_emb): Qwen2RotaryEmbedding()
-       )
-       (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
-     )
-     (vit_model): SiglipVisionModel(
-       (vision_model): FullyShardedDataParallel(
-         (_fsdp_wrapped_module): SiglipVisionTransformer(
-           (embeddings): SiglipVisionEmbeddings(
-             (position_embedding): Embedding(4900, 1152)
-             (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
-           )
-           (encoder): SiglipEncoder(
-             (layers): ModuleList(
-               (0-25): 26 x FullyShardedDataParallel(
-                 (_fsdp_wrapped_module): CheckpointWrapper(
-                   (_checkpoint_wrapped_module): SiglipEncoderLayer(
-                     (self_attn): SiglipFlashAttention2(
-                       (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                       (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                       (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                       (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                     )
-                     (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
-                     (mlp): SiglipMLP(
-                       (activation_fn): PytorchGELUTanh()
-                       (fc1): Linear(in_features=1152, out_features=4304, bias=True)
-                       (fc2): Linear(in_features=4304, out_features=1152, bias=True)
-                     )
-                     (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
-                   )
-                 )
-               )
-             )
-           )
-           (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
-         )
-       )
-     )
-     (connector): FullyShardedDataParallel(
-       (_fsdp_wrapped_module): CheckpointWrapper(
-         (_checkpoint_wrapped_module): MLPconnector(
-           (activation_fn): PytorchGELUTanh()
-           (fc1): Linear(in_features=1152, out_features=3584, bias=True)
-           (fc2): Linear(in_features=3584, out_features=3584, bias=True)
-         )
-       )
-     )
-     (vit_pos_embed): FullyShardedDataParallel(
-       (_fsdp_wrapped_module): PositionEmbedding()
-     )
-   )
- )
- _flat_param True
- language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_pos_embed._fsdp_wrapped_module._flat_param False
- Preparing Dataset vlm_gym_sliding_block_celoss_no_mse/vlm_gym_sliding_block_train
- base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step0
- Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- ce_avg: 0.444692462682724, mse_avg: 0.0
- base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step500
- Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- ce_avg: 0.053060635924339294, mse_avg: 0.0
wandb: Detected [huggingface_hub.inference] in use.
wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
@@ -1232,27 +1062,6 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
[2026-01-29 23:38:35] (step=0001051) Train Loss mse: 0.0000, Train Loss ce: 0.0434, Train Steps/Sec: 0.31,
[2026-01-29 23:38:38] (step=0001052) Train Loss mse: 0.0000, Train Loss ce: 0.0524, Train Steps/Sec: 0.32,
[2026-01-29 23:38:41] (step=0001053) Train Loss mse: 0.0000, Train Loss ce: 0.0374, Train Steps/Sec: 0.37,
- base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step1000
- Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- ce_avg: 0.06684262305498123, mse_avg: 0.0
- base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step1500
- Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- ce_avg: 0.07035679370164871, mse_avg: 0.0
- base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step2000
- Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- ce_avg: 0.08155465126037598, mse_avg: 0.0
[2026-01-29 23:38:43] (step=0001054) Train Loss mse: 0.0000, Train Loss ce: 0.0506, Train Steps/Sec: 0.34,
[2026-01-29 23:38:46] (step=0001055) Train Loss mse: 0.0000, Train Loss ce: 0.0255, Train Steps/Sec: 0.35,
[2026-01-29 23:38:50] (step=0001056) Train Loss mse: 0.0000, Train Loss ce: 0.0475, Train Steps/Sec: 0.32,
@@ -1279,6 +1088,204 @@ ce_avg: 0.08155465126037598, mse_avg: 0.0
[2026-01-29 23:39:51] (step=0001077) Train Loss mse: 0.0000, Train Loss ce: 0.0367, Train Steps/Sec: 0.35,
[2026-01-29 23:39:54] (step=0001078) Train Loss mse: 0.0000, Train Loss ce: 0.0507, Train Steps/Sec: 0.37,
[2026-01-29 23:39:57] (step=0001079) Train Loss mse: 0.0000, Train Loss ce: 0.0492, Train Steps/Sec: 0.31,
+ FullyShardedDataParallel(
+   (_fsdp_wrapped_module): Bagel(
+     (language_model): Qwen2ForCausalLM(
+       (model): Qwen2Model(
+         (embed_tokens): Embedding(152064, 3584)
+         (layers): ModuleList(
+           (0-27): 28 x FullyShardedDataParallel(
+             (_fsdp_wrapped_module): CheckpointWrapper(
+               (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
+                 (self_attn): PackedAttentionMoT(
+                   (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
+                   (k_proj): Linear(in_features=3584, out_features=512, bias=True)
+                   (v_proj): Linear(in_features=3584, out_features=512, bias=True)
+                   (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
+                   (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
+                   (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
+                   (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
+                   (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
+                   (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
+                   (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
+                   (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
+                   (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
+                 )
+                 (mlp): Qwen2MLP(
+                   (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
+                   (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
+                   (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
+                   (act_fn): SiLU()
+                 )
+                 (mlp_moe_gen): Qwen2MLP(
+                   (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
+                   (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
+                   (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
+                   (act_fn): SiLU()
+                 )
+                 (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
+                 (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+                 (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
+                 (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+               )
+             )
+           )
+         )
+         (norm): Qwen2RMSNorm((3584,), eps=1e-06)
+         (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+         (rotary_emb): Qwen2RotaryEmbedding()
+       )
+       (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
+     )
+     (vit_model): SiglipVisionModel(
+       (vision_model): FullyShardedDataParallel(
+         (_fsdp_wrapped_module): SiglipVisionTransformer(
+           (embeddings): SiglipVisionEmbeddings(
+             (position_embedding): Embedding(4900, 1152)
+             (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
+           )
+           (encoder): SiglipEncoder(
+             (layers): ModuleList(
+               (0-25): 26 x FullyShardedDataParallel(
+                 (_fsdp_wrapped_module): CheckpointWrapper(
+                   (_checkpoint_wrapped_module): SiglipEncoderLayer(
+                     (self_attn): SiglipFlashAttention2(
+                       (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
+                       (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
+                       (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
+                       (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
+                     )
+                     (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+                     (mlp): SiglipMLP(
+                       (activation_fn): PytorchGELUTanh()
+                       (fc1): Linear(in_features=1152, out_features=4304, bias=True)
+                       (fc2): Linear(in_features=4304, out_features=1152, bias=True)
+                     )
+                     (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+                   )
+                 )
+               )
+             )
+           )
+           (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+         )
+       )
+     )
+     (connector): FullyShardedDataParallel(
+       (_fsdp_wrapped_module): CheckpointWrapper(
+         (_checkpoint_wrapped_module): MLPconnector(
+           (activation_fn): PytorchGELUTanh()
+           (fc1): Linear(in_features=1152, out_features=3584, bias=True)
+           (fc2): Linear(in_features=3584, out_features=3584, bias=True)
+         )
+       )
+     )
+     (vit_pos_embed): FullyShardedDataParallel(
+       (_fsdp_wrapped_module): PositionEmbedding()
+     )
+   )
+ )
+ _flat_param True
+ language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_pos_embed._fsdp_wrapped_module._flat_param False
+ Preparing Dataset vlm_gym_sliding_block_celoss_no_mse/vlm_gym_sliding_block_train
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step0
+ Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ ce_avg: 0.444692462682724, mse_avg: 0.0
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step500
+ Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ ce_avg: 0.053060635924339294, mse_avg: 0.0
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step1000
+ Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ ce_avg: 0.06684262305498123, mse_avg: 0.0
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step1500
+ Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ ce_avg: 0.07035679370164871, mse_avg: 0.0
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step2000
+ Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ ce_avg: 0.08155465126037598, mse_avg: 0.0
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step2500
+ Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ ce_avg: 0.07872410863637924, mse_avg: 0.0
[2026-01-29 23:40:00] (step=0001080) Train Loss mse: 0.0000, Train Loss ce: 0.0459, Train Steps/Sec: 0.30,
[2026-01-29 23:40:04] (step=0001081) Train Loss mse: 0.0000, Train Loss ce: 0.0491, Train Steps/Sec: 0.33,
[2026-01-29 23:40:06] (step=0001082) Train Loss mse: 0.0000, Train Loss ce: 0.0408, Train Steps/Sec: 0.35,
@@ -2763,13 +2770,6 @@ ce_avg: 0.08155465126037598, mse_avg: 0.0
[2026-01-30 00:54:29] (step=0002561) Train Loss mse: 0.0000, Train Loss ce: 0.0370, Train Steps/Sec: 0.38,
[2026-01-30 00:54:31] (step=0002562) Train Loss mse: 0.0000, Train Loss ce: 0.0446, Train Steps/Sec: 0.35,
[2026-01-30 00:54:34] (step=0002563) Train Loss mse: 0.0000, Train Loss ce: 0.0322, Train Steps/Sec: 0.33,
- base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step2500
- Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- ce_avg: 0.07872410863637924, mse_avg: 0.0
base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step3000
Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
[eval debug] first 3 batch fingerprints:
@@ -3822,6 +3822,27 @@ ce_avg: 0.0932040885090828, mse_avg: 0.0
[2026-01-30 01:46:49] (step=0003599) Train Loss mse: 0.0000, Train Loss ce: 0.0327, Train Steps/Sec: 0.41,
[2026-01-30 01:46:52] (step=0003600) Train Loss mse: 0.0000, Train Loss ce: 0.0397, Train Steps/Sec: 0.34,
[2026-01-30 01:46:55] (step=0003601) Train Loss mse: 0.0000, Train Loss ce: 0.0383, Train Steps/Sec: 0.33,
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step4000
+ Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ ce_avg: 0.087554931640625, mse_avg: 0.0
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step4500
+ Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ ce_avg: 0.08504438400268555, mse_avg: 0.0
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step5000
+ Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
+ ce_avg: 0.08504004031419754, mse_avg: 0.0
[2026-01-30 01:46:58] (step=0003602) Train Loss mse: 0.0000, Train Loss ce: 0.0428, Train Steps/Sec: 0.30,
[2026-01-30 01:47:02] (step=0003603) Train Loss mse: 0.0000, Train Loss ce: 0.0263, Train Steps/Sec: 0.30,
[2026-01-30 01:47:05] (step=0003604) Train Loss mse: 0.0000, Train Loss ce: 0.0435, Train Steps/Sec: 0.32,
@@ -3867,20 +3888,6 @@ ce_avg: 0.0932040885090828, mse_avg: 0.0
[2026-01-30 01:49:04] (step=0003644) Train Loss mse: 0.0000, Train Loss ce: 0.0392, Train Steps/Sec: 0.38,
[2026-01-30 01:49:07] (step=0003645) Train Loss mse: 0.0000, Train Loss ce: 0.0400, Train Steps/Sec: 0.32,
[2026-01-30 01:49:10] (step=0003646) Train Loss mse: 0.0000, Train Loss ce: 0.0462, Train Steps/Sec: 0.30,
- base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step4000
- Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- ce_avg: 0.087554931640625, mse_avg: 0.0
- base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step4500
- Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- ce_avg: 0.08504438400268555, mse_avg: 0.0
[2026-01-30 01:49:14] (step=0003647) Train Loss mse: 0.0000, Train Loss ce: 0.0434, Train Steps/Sec: 0.31,
[2026-01-30 01:49:17] (step=0003648) Train Loss mse: 0.0000, Train Loss ce: 0.0381, Train Steps/Sec: 0.32,
[2026-01-30 01:49:20] (step=0003649) Train Loss mse: 0.0000, Train Loss ce: 0.0434, Train Steps/Sec: 0.34,
@@ -5238,11 +5245,4 @@ ce_avg: 0.08504438400268555, mse_avg: 0.0
[2026-01-30 02:57:26] Saving checkpoint to /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/0005000.
/opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
warnings.warn(
- [2026-01-30 03:00:45] Done!
- base_dir is /dev/shm/models/checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins_step5000
- Preparing Dataset vlm_gym_sliding_block_celoss_no_mse_evalonce/vlm_gym_sliding_block_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_sliding_block_celoss_no_mse_evalonce'}]
- ce_avg: 0.08504004031419754, mse_avg: 0.0
+ [2026-01-30 03:00:45] Done!
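The eval blocks above report a ce_avg at every 500-step checkpoint (0.4447 at step 0, then roughly 0.05-0.09 thereafter). A minimal sketch for pulling that curve out of an output.log of this shape; the regexes assume only the "step_tag is ..._stepN" and "ce_avg: ..., mse_avg: ..." line formats visible in the log and may need adjusting if the real format differs:

import re

step_re = re.compile(r"step_tag is \S*_step(\d+)")
ce_re = re.compile(r"ce_avg: ([0-9.]+), mse_avg:")

def eval_curve(log_text: str) -> list[tuple[int, float]]:
    """Return (step, ce_avg) pairs in the order they appear in the log."""
    pairs, step = [], None
    for line in log_text.splitlines():
        if (m := step_re.search(line)):
            step = int(m.group(1))  # remember the most recent eval step tag
        elif (m := ce_re.search(line)) and step is not None:
            pairs.append((step, float(m.group(1))))
    return pairs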
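Since the run was logged with wandb_offline: 'True', the offline-run directory uploaded in this commit can later be pushed to a W&B server with the standard `wandb sync` CLI. A sketch, assuming the wandb CLI is installed and authenticated (the path is the run directory from this commit):

import subprocess

# Offline run directory as uploaded in this commit.
run_dir = (
    "checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/"
    "checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins/wandb/"
    "offline-run-20260129_223732-"
    "checkpoints_vlm_gym_sliding_block_one_image_lr2e_5_ce_no_mse_ins-run0"
)
# Equivalent to running `wandb sync <run_dir>` in a shell.
subprocess.run(["wandb", "sync", run_dir], check=True)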