Junyi42 committed on
Commit ced6c91 · verified · 1 Parent(s): c1a7829

Upload checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins

checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/wandb/offline-run-20260129_223311-checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins-run0/files/config.yaml CHANGED
@@ -0,0 +1,459 @@
+ wandb_version: 1
+
+ _wandb:
+   desc: null
+   value:
+     python_version: 3.11.10
+     cli_version: 0.23.1
+     framework: huggingface
+     huggingface_version: 4.49.0
+     is_jupyter_run: false
+     is_kaggle_kernel: false
+     start_time: 1769725992
+     t:
+       1:
+       - 1
+       - 5
+       - 11
+       - 41
+       - 49
+       - 53
+       - 71
+       - 105
+       2:
+       - 1
+       - 5
+       - 11
+       - 41
+       - 49
+       - 53
+       - 71
+       - 105
+       3:
+       - 2
+       - 4
+       - 13
+       - 14
+       - 37
+       - 42
+       - 61
+       4: 3.11.10
+       5: 0.23.1
+       6: 4.49.0
+       13: linux-x86_64
+     e:
+       wfvoftkdc4um9761m8ix7oiachlmdvuv:
+         os: Linux-6.6.93+-x86_64-with-glibc2.35
+         python: CPython 3.11.10
+         started_at: '2026-01-29T22:33:11.859047Z'
+         args:
+         - --dataset_config_file
+         - ./data/configs/vlm_gym_reference_dot_train_celoss_no_mse.yaml
+         - --eval_dataset_config_file
+         - ./data/configs/vlm_gym_reference_dot_eval_celoss_no_mse.yaml
+         - --viz_dataset_config_file
+         - ./data/configs/vlm_gym_reference_dot_eval_celoss_no_mse.yaml
+         - --train_data_dir
+         - /home/clouduser/Code/data/gym/reference_dot_v5/train/
+         - --train_jsonl_path
+         - /home/clouduser/Code/data/gym/reference_dot_v5/train/
+         - --eval_data_dir
+         - /home/clouduser/Code/data/gym/reference_dot_v5/val/
+         - --eval_jsonl_path
+         - /home/clouduser/Code/data/gym/reference_dot_v5/val/
+         - --inference_hash_file
+         - /home/clouduser/Code/Github/launch_new/hashes_test_set_v10.json
+         - --task_name
+         - reference_dot_v5
+         - --instructions_dir
+         - ./data/instructions
+         - --model_path
+         - /home/clouduser/Code/Models/BAGEL-7B-MoT
+         - --layer_module
+         - Qwen2MoTDecoderLayer
+         - --max_latent_size
+         - '64'
+         - --resume-from
+         - /home/clouduser/Code/Models/BAGEL-7B-MoT
+         - --finetune_from_hf
+         - 'True'
+         - --auto_resume
+         - 'False'
+         - --resume-model-only
+         - 'True'
+         - --finetune-from-ema
+         - 'True'
+         - --log_every
+         - '1'
+         - --lr
+         - 2e-5
+         - --warmup_steps
+         - '300'
+         - --lr_scheduler
+         - cosine
+         - --num_worker
+         - '1'
+         - --expected_num_tokens
+         - '30000'
+         - --max_num_tokens
+         - '30000'
+         - --max_num_tokens_per_sample
+         - '30000'
+         - --visual_und
+         - 'True'
+         - --visual_gen
+         - 'False'
+         - --save_every
+         - '5000'
+         - --total_steps
+         - '5000'
+         - --text_cond_dropout_prob
+         - '0.0'
+         - --vae_cond_dropout_prob
+         - '0.3'
+         - --vit_cond_dropout_prob
+         - '0.0'
+         - --ema
+         - '0.993'
+         - --checkpoint_dir
+         - /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins
+         - --wandb_project
+         - bagel
+         - --wandb_name
+         - checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins
+         - --wandb_dir
+         - /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins
+         - --wandb_offline
+         - 'True'
+         program: /home/clouduser/Code/Github/unified_world_model/train/pretrain_unified_navit.py
+         code_path: train/pretrain_unified_navit.py
+         code_path_local: train/pretrain_unified_navit.py
+         git:
+           remote_url: https://github.com/para-lost/unified_world_model
+           commit: 8d7b26b7e552fc87b592cf3be94d93be7aeca2a9
+         root: /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins
+         host: junyizhang-launch-new-226785958-1-0
+         executable: /opt/conda/bin/python3.11
+         cpu_count: 48
+         cpu_count_logical: 96
+         gpu_type: NVIDIA A100-SXM4-80GB
+         gpu_count: 8
+         disk:
+           /:
+             total: '1052461830144'
+             used: '164471345152'
+         memory:
+           total: '1437332606976'
+         gpu_nvidia:
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-27013fed-9784-d445-a1eb-01629cf403cc
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-c4922cf6-bc87-9458-c12f-23210cb43686
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-1af9405a-c062-486e-383f-7ea6c6ef5158
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-793b7211-7436-7429-8bd7-cc05be70cc75
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-5eb44009-8d7d-911d-0730-f219cb50498c
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-62c85054-47c8-b915-18e9-e4433fc0f9bb
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-c3b59f2c-b6b6-7730-54ff-8cf5fee4ea9c
+         - name: NVIDIA A100-SXM4-80GB
+           memory_total: '85899345920'
+           cuda_cores: 6912
+           architecture: Ampere
+           uuid: GPU-e988baaf-6bc5-3bb9-91fb-ab2cb214233d
+         cuda_version: '12.2'
+         writer_id: wfvoftkdc4um9761m8ix7oiachlmdvuv
+ visual_gen:
+   desc: null
+   value: false
+ visual_und:
+   desc: null
+   value: true
+ results_dir:
+   desc: null
+   value: results
+ checkpoint_dir:
+   desc: null
+   value: /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins
+ wandb_project:
+   desc: null
+   value: bagel
+ wandb_name:
+   desc: null
+   value: checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins
+ wandb_runid:
+   desc: null
+   value: '0'
+ wandb_resume:
+   desc: null
+   value: allow
+ wandb_offline:
+   desc: null
+   value: true
+ wandb_dir:
+   desc: null
+   value: /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins
+ global_seed:
+   desc: null
+   value: 4396
+ auto_resume:
+   desc: null
+   value: false
+ resume_from:
+   desc: null
+   value: /home/clouduser/Code/Models/BAGEL-7B-MoT
+ resume_model_only:
+   desc: null
+   value: true
+ finetune_from_ema:
+   desc: null
+   value: true
+ finetune_from_hf:
+   desc: null
+   value: true
+ log_every:
+   desc: null
+   value: 1
+ save_every:
+   desc: null
+   value: 5000
+ total_steps:
+   desc: null
+   value: 5000
+ warmup_steps:
+   desc: null
+   value: 300
+ lr_scheduler:
+   desc: null
+   value: cosine
+ lr:
+   desc: null
+   value: 2.0e-05
+ min_lr:
+   desc: null
+   value: 1.0e-07
+ beta1:
+   desc: null
+   value: 0.9
+ beta2:
+   desc: null
+   value: 0.95
+ eps:
+   desc: null
+   value: 1.0e-15
+ ema:
+   desc: null
+   value: 0.993
+ max_grad_norm:
+   desc: null
+   value: 1.0
+ timestep_shift:
+   desc: null
+   value: 1.0
+ mse_weight:
+   desc: null
+   value: 1.0
+ ce_weight:
+   desc: null
+   value: 1.0
+ ce_loss_reweighting:
+   desc: null
+   value: false
+ expected_num_tokens:
+   desc: null
+   value: 30000
+ num_replicate:
+   desc: null
+   value: 1
+ num_shard:
+   desc: null
+   value: 8
+ sharding_strategy:
+   desc: null
+   value: HYBRID_SHARD
+ backward_prefetch:
+   desc: null
+   value: BACKWARD_PRE
+ cpu_offload:
+   desc: null
+   value: false
+ freeze_llm:
+   desc: null
+   value: false
+ freeze_vit:
+   desc: null
+   value: false
+ freeze_vae:
+   desc: null
+   value: true
+ freeze_und:
+   desc: null
+   value: false
+ copy_init_moe:
+   desc: null
+   value: true
+ use_flex:
+   desc: null
+   value: false
+ eval_every:
+   desc: null
+   value: 500
+ num_eval_batches:
+   desc: null
+   value: 20
+ use_ema_for_eval:
+   desc: null
+   value: true
+ eval_log_dir:
+   desc: null
+   value: null
+ eval_run_tag:
+   desc: null
+   value: ''
+ viz_every:
+   desc: null
+   value: 500
+ viz_n:
+   desc: null
+   value: 8
+ viz_outdir:
+   desc: null
+   value: results/viz
+ eval_dataset_config_file:
+   desc: null
+   value: ./data/configs/vlm_gym_reference_dot_eval_celoss_no_mse.yaml
+ viz_dataset_config_file:
+   desc: null
+   value: ./data/configs/vlm_gym_reference_dot_eval_celoss_no_mse.yaml
+ eval_print_n:
+   desc: null
+   value: 3
+ save_ema_only:
+   desc: null
+   value: true
+ save_optimizer:
+   desc: null
+   value: false
+ model_path:
+   desc: null
+   value: /home/clouduser/Code/Models/BAGEL-7B-MoT
+ llm_path:
+   desc: null
+   value: hf/Qwen2.5-0.5B-Instruct/
+ llm_qk_norm:
+   desc: null
+   value: true
+ tie_word_embeddings:
+   desc: null
+   value: false
+ layer_module:
+   desc: null
+   value: Qwen2MoTDecoderLayer
+ vae_path:
+   desc: null
+   value: flux/vae/ae.safetensors
+ vit_path:
+   desc: null
+   value: hf/siglip-so400m-14-980-flash-attn2-navit/
+ max_latent_size:
+   desc: null
+   value: 64
+ latent_patch_size:
+   desc: null
+   value: 2
+ vit_patch_size:
+   desc: null
+   value: 14
+ vit_max_num_patch_per_side:
+   desc: null
+   value: 70
+ connector_act:
+   desc: null
+   value: gelu_pytorch_tanh
+ interpolate_pos:
+   desc: null
+   value: false
+ vit_select_layer:
+   desc: null
+   value: -2
+ vit_rope:
+   desc: null
+   value: false
+ text_cond_dropout_prob:
+   desc: null
+   value: 0.0
+ vae_cond_dropout_prob:
+   desc: null
+   value: 0.3
+ vit_cond_dropout_prob:
+   desc: null
+   value: 0.0
+ dataset_config_file:
+   desc: null
+   value: ./data/configs/vlm_gym_reference_dot_train_celoss_no_mse.yaml
+ train_data_dir:
+   desc: null
+   value: /home/clouduser/Code/data/gym/reference_dot_v5/train/
+ train_jsonl_path:
+   desc: null
+   value: /home/clouduser/Code/data/gym/reference_dot_v5/train/
+ eval_data_dir:
+   desc: null
+   value: /home/clouduser/Code/data/gym/reference_dot_v5/val/
+ eval_jsonl_path:
+   desc: null
+   value: /home/clouduser/Code/data/gym/reference_dot_v5/val/
+ inference_hash_file:
+   desc: null
+   value: /home/clouduser/Code/Github/launch_new/hashes_test_set_v10.json
+ task_name:
+   desc: null
+   value: reference_dot_v5
+ instructions_dir:
+   desc: null
+   value: ./data/instructions
+ prefetch_factor:
+   desc: null
+   value: 2
+ num_workers:
+   desc: null
+   value: 1
+ max_num_tokens_per_sample:
+   desc: null
+   value: 30000
+ max_num_tokens:
+   desc: null
+   value: 30000
+ prefer_buffer_before:
+   desc: null
+   value: 16384
+ max_buffer_size:
+   desc: null
+   value: 50
+ data_seed:
+   desc: null
+   value: 42
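The scheduler settings recorded in this config (`lr: 2.0e-05`, `min_lr: 1.0e-07`, `warmup_steps: 300`, `total_steps: 5000`, `lr_scheduler: cosine`) determine a learning rate for every step. The sketch below shows one common way to compute it, assuming linear warmup from zero followed by cosine decay down to `min_lr`. The function name `lr_at_step` is hypothetical, and the exact formula used by `train/pretrain_unified_navit.py` is not shown in this commit.

```python
import math

# Values copied from the logged config.yaml.
LR = 2.0e-05
MIN_LR = 1.0e-07
WARMUP_STEPS = 300
TOTAL_STEPS = 5000


def lr_at_step(step: int) -> float:
    """Learning rate at `step` under an assumed warmup-then-cosine schedule."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 up to the peak learning rate.
        return LR * step / WARMUP_STEPS
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    # Cosine decay from LR down to MIN_LR.
    return MIN_LR + 0.5 * (LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 150, 300, 2650, 5000):
        print(f"step={step:5d} lr={lr_at_step(step):.3e}")
```

Under these assumptions the rate peaks at step 300 and reaches `min_lr` exactly at step 5000, which is also the only checkpoint step (`save_every: 5000`).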
checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/wandb/offline-run-20260129_223311-checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins-run0/files/output.log CHANGED
@@ -1,180 +1,3 @@
- FullyShardedDataParallel(
-   (_fsdp_wrapped_module): Bagel(
-     (language_model): Qwen2ForCausalLM(
-       (model): Qwen2Model(
-         (embed_tokens): Embedding(152064, 3584)
-         (layers): ModuleList(
-           (0-27): 28 x FullyShardedDataParallel(
-             (_fsdp_wrapped_module): CheckpointWrapper(
-               (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
-                 (self_attn): PackedAttentionMoT(
-                   (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
-                   (k_proj): Linear(in_features=3584, out_features=512, bias=True)
-                   (v_proj): Linear(in_features=3584, out_features=512, bias=True)
-                   (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
-                   (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
-                   (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
-                   (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
-                   (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
-                   (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
-                   (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
-                   (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
-                   (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
-                 )
-                 (mlp): Qwen2MLP(
-                   (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
-                   (act_fn): SiLU()
-                 )
-                 (mlp_moe_gen): Qwen2MLP(
-                   (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
-                   (act_fn): SiLU()
-                 )
-                 (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
-                 (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
-                 (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
-                 (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
-               )
-             )
-           )
-         )
-         (norm): Qwen2RMSNorm((3584,), eps=1e-06)
-         (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
-         (rotary_emb): Qwen2RotaryEmbedding()
-       )
-       (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
-     )
-     (vit_model): SiglipVisionModel(
-       (vision_model): FullyShardedDataParallel(
-         (_fsdp_wrapped_module): SiglipVisionTransformer(
-           (embeddings): SiglipVisionEmbeddings(
-             (position_embedding): Embedding(4900, 1152)
-             (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
-           )
-           (encoder): SiglipEncoder(
-             (layers): ModuleList(
-               (0-25): 26 x FullyShardedDataParallel(
-                 (_fsdp_wrapped_module): CheckpointWrapper(
-                   (_checkpoint_wrapped_module): SiglipEncoderLayer(
-                     (self_attn): SiglipFlashAttention2(
-                       (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                       (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                       (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                       (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                     )
-                     (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
-                     (mlp): SiglipMLP(
-                       (activation_fn): PytorchGELUTanh()
-                       (fc1): Linear(in_features=1152, out_features=4304, bias=True)
-                       (fc2): Linear(in_features=4304, out_features=1152, bias=True)
-                     )
-                     (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
-                   )
-                 )
-               )
-             )
-           )
-           (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
-         )
-       )
-     )
-     (connector): FullyShardedDataParallel(
-       (_fsdp_wrapped_module): CheckpointWrapper(
-         (_checkpoint_wrapped_module): MLPconnector(
-           (activation_fn): PytorchGELUTanh()
-           (fc1): Linear(in_features=1152, out_features=3584, bias=True)
-           (fc2): Linear(in_features=3584, out_features=3584, bias=True)
-         )
-       )
-     )
-     (vit_pos_embed): FullyShardedDataParallel(
-       (_fsdp_wrapped_module): PositionEmbedding()
-     )
-   )
- )
- _flat_param True
- language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_pos_embed._fsdp_wrapped_module._flat_param False
- Preparing Dataset vlm_gym_reference_dot_celoss_no_mse/vlm_gym_reference_dot_train
- base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step0
- Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- ce_avg: 1.1923778057098389, mse_avg: 0.0
- base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step500
- Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- ce_avg: 0.3785431981086731, mse_avg: 0.0
- base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step1000
- Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- ce_avg: 0.40663474798202515, mse_avg: 0.0
  wandb: Detected [huggingface_hub.inference] in use.
  wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
  wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
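The shapes printed in the removed module dump are enough to count the parameters of one `Qwen2MoTDecoderLayer`. The sketch below does that arithmetic. The helper names (`linear_params`, `attention_branch`, `mlp_branch`) are hypothetical, and it assumes each `Qwen2RMSNorm` holds a single weight vector of the printed size.

```python
def linear_params(in_features: int, out_features: int, bias: bool) -> int:
    """Weight matrix plus optional bias vector of a Linear layer."""
    return in_features * out_features + (out_features if bias else 0)


HIDDEN = 3584    # hidden size from the dump
KV_DIM = 512     # k_proj / v_proj output size
HEAD_DIM = 128   # size of q_norm / k_norm
MLP_DIM = 18944  # gate_proj / up_proj output size


def attention_branch() -> int:
    """q/k/v/o projections plus the two per-head RMSNorm weights."""
    return (
        linear_params(HIDDEN, HIDDEN, bias=True)     # q_proj
        + linear_params(HIDDEN, KV_DIM, bias=True)   # k_proj
        + linear_params(HIDDEN, KV_DIM, bias=True)   # v_proj
        + linear_params(HIDDEN, HIDDEN, bias=False)  # o_proj
        + 2 * HEAD_DIM                               # q_norm, k_norm
    )


def mlp_branch() -> int:
    """gate_proj, up_proj and down_proj, all without bias."""
    return 3 * linear_params(HIDDEN, MLP_DIM, bias=False)


# Each layer holds a default branch and a `_moe_gen` branch,
# plus four layer norms of size HIDDEN.
layer_params = 2 * attention_branch() + 2 * mlp_branch() + 4 * HIDDEN
total_decoder_params = 28 * layer_params  # 28 layers in the ModuleList

print(f"per layer: {layer_params:,}")
print(f"28 layers: {total_decoder_params:,}")
```

The count covers only the decoder layers; the embeddings, `lm_head`, final norms, SigLIP tower, and connector are left out.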
@@ -1137,6 +960,197 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
  [2026-01-29 22:58:42] (step=0000949) Train Loss mse: 0.0000, Train Loss ce: 0.3724, Train Steps/Sec: 0.90,
  [2026-01-29 22:58:44] (step=0000950) Train Loss mse: 0.0000, Train Loss ce: 0.3805, Train Steps/Sec: 0.89,
  [2026-01-29 22:58:45] (step=0000951) Train Loss mse: 0.0000, Train Loss ce: 0.3712, Train Steps/Sec: 0.84,
  [2026-01-29 22:58:46] (step=0000952) Train Loss mse: 0.0000, Train Loss ce: 0.3810, Train Steps/Sec: 0.75,
  [2026-01-29 22:58:47] (step=0000953) Train Loss mse: 0.0000, Train Loss ce: 0.3781, Train Steps/Sec: 0.84,
  [2026-01-29 22:58:48] (step=0000954) Train Loss mse: 0.0000, Train Loss ce: 0.3897, Train Steps/Sec: 0.87,
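The per-step training lines in this log follow a fixed layout (timestamp, step, `mse`, `ce`, steps per second), so they can be pulled into records for plotting or averaging. The sketch below is one way to do that with the standard library. The names `LINE_RE` and `parse_train_lines` are hypothetical, and the pattern is inferred only from the lines visible in this diff.

```python
import re
from statistics import mean

# Pattern inferred from lines such as:
# [2026-01-29 22:58:42] (step=0000949) Train Loss mse: 0.0000, Train Loss ce: 0.3724, Train Steps/Sec: 0.90,
LINE_RE = re.compile(
    r"\[(?P<time>[^\]]+)\] \(step=(?P<step>\d+)\) "
    r"Train Loss mse: (?P<mse>[\d.]+), "
    r"Train Loss ce: (?P<ce>[\d.]+), "
    r"Train Steps/Sec: (?P<sps>[\d.]+)"
)


def parse_train_lines(text: str) -> list[dict]:
    """Return one record per training line; other lines are skipped."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # eval output, wandb notices, truncated lines
        records.append({
            "time": match["time"],
            "step": int(match["step"]),
            "mse": float(match["mse"]),
            "ce": float(match["ce"]),
            "steps_per_sec": float(match["sps"]),
        })
    return records


SAMPLE = """\
[2026-01-29 22:58:42] (step=0000949) Train Loss mse: 0.0000, Train Loss ce: 0.3724, Train Steps/Sec: 0.90,
[2026-01-29 22:58:44] (step=0000950) Train Loss mse: 0.0000, Train Loss ce: 0.3805, Train Steps/Sec: 0.89,
[2026-01-29 22:58:45] (step=0000951) Train Loss mse: 0.0000, Train Loss ce: 0.3712, Train Steps/Sec: 0.84,
"""

records = parse_train_lines(SAMPLE)
print(len(records), "records, mean ce =", round(mean(r["ce"] for r in records), 4))
```

Using `search` rather than `match` lets the same pattern work on lines that still carry a leading diff marker or indentation.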
@@ -1390,20 +1404,6 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
  [2026-01-29 23:04:00] (step=0001202) Train Loss mse: 0.0000, Train Loss ce: 0.3769, Train Steps/Sec: 0.80,
  [2026-01-29 23:04:01] (step=0001203) Train Loss mse: 0.0000, Train Loss ce: 0.3707, Train Steps/Sec: 0.89,
  [2026-01-29 23:04:03] (step=0001204) Train Loss mse: 0.0000, Train Loss ce: 0.3861, Train Steps/Sec: 0.57,
- base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step1500
- Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- ce_avg: 0.5510271191596985, mse_avg: 0.0
- base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step2000
- Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- ce_avg: 0.5553194880485535, mse_avg: 0.0
  [2026-01-29 23:04:04] (step=0001205) Train Loss mse: 0.0000, Train Loss ce: 0.3762, Train Steps/Sec: 0.71,
  [2026-01-29 23:04:06] (step=0001206) Train Loss mse: 0.0000, Train Loss ce: 0.3861, Train Steps/Sec: 0.84,
  [2026-01-29 23:04:07] (step=0001207) Train Loss mse: 0.0000, Train Loss ce: 0.3744, Train Steps/Sec: 0.90,
@@ -2477,6 +2477,27 @@ ce_avg: 0.5553194880485535, mse_avg: 0.0
  [2026-01-29 23:26:13] (step=0002275) Train Loss mse: 0.0000, Train Loss ce: 0.3819, Train Steps/Sec: 0.70,
  [2026-01-29 23:26:14] (step=0002276) Train Loss mse: 0.0000, Train Loss ce: 0.3862, Train Steps/Sec: 0.85,
  [2026-01-29 23:26:15] (step=0002277) Train Loss mse: 0.0000, Train Loss ce: 0.3689, Train Steps/Sec: 0.86,
  [2026-01-29 23:26:17] (step=0002278) Train Loss mse: 0.0000, Train Loss ce: 0.3637, Train Steps/Sec: 0.88,
  [2026-01-29 23:26:18] (step=0002279) Train Loss mse: 0.0000, Train Loss ce: 0.3588, Train Steps/Sec: 0.84,
  [2026-01-29 23:26:19] (step=0002280) Train Loss mse: 0.0000, Train Loss ce: 0.3704, Train Steps/Sec: 0.70,
@@ -2915,27 +2936,6 @@ ce_avg: 0.5553194880485535, mse_avg: 0.0
  [2026-01-29 23:35:26] (step=0002713) Train Loss mse: 0.0000, Train Loss ce: 0.3485, Train Steps/Sec: 0.84,
  [2026-01-29 23:35:27] (step=0002714) Train Loss mse: 0.0000, Train Loss ce: 0.3643, Train Steps/Sec: 0.90,
  [2026-01-29 23:35:29] (step=0002715) Train Loss mse: 0.0000, Train Loss ce: 0.3573, Train Steps/Sec: 0.87,
- [2026-01-29 23:35:30
- base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step2500
- Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- ce_avg: 0.9706690311431885, mse_avg: 0.0
- base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step3000
- Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- ce_avg: 1.6775857210159302, mse_avg: 0.0
- base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step3500
- Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
  [2026-01-29 23:35:30] (step=0002716) Train Loss mse: 0.0000, Train Loss ce: 0.3701, Train Steps/Sec: 0.85,
  [2026-01-29 23:35:31] (step=0002717) Train Loss mse: 0.0000, Train Loss ce: 0.3544, Train Steps/Sec: 0.86,
  [2026-01-29 23:35:32] (step=0002718) Train Loss mse: 0.0000, Train Loss ce: 0.3516, Train Steps/Sec: 0.89,
@@ -3522,6 +3522,20 @@ Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference
3522
  [2026-01-29 23:47:48] (step=0003299) Train Loss mse: 0.0000, Train Loss ce: 0.3641, Train Steps/Sec: 0.67,
3523
  [2026-01-29 23:47:49] (step=0003300) Train Loss mse: 0.0000, Train Loss ce: 0.3764, Train Steps/Sec: 0.89,
3524
  [2026-01-29 23:47:51] (step=0003301) Train Loss mse: 0.0000, Train Loss ce: 0.3725, Train Steps/Sec: 0.72,
3525
  [2026-01-29 23:47:52] (step=0003302) Train Loss mse: 0.0000, Train Loss ce: 0.3544, Train Steps/Sec: 0.90,
3526
  [2026-01-29 23:47:53] (step=0003303) Train Loss mse: 0.0000, Train Loss ce: 0.3668, Train Steps/Sec: 0.84,
3527
  [2026-01-29 23:47:54] (step=0003304) Train Loss mse: 0.0000, Train Loss ce: 0.3708, Train Steps/Sec: 0.71,
@@ -4278,27 +4292,105 @@ Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference
4278
  [2026-01-30 00:03:28] (step=0004055) Train Loss mse: 0.0000, Train Loss ce: 0.3722, Train Steps/Sec: 0.89,
4279
  [2026-01-30 00:03:29] (step=0004056) Train Loss mse: 0.0000, Train Loss ce: 0.3767, Train Steps/Sec: 0.84,
4280
  [2026-01-30 00:03:31] (step=0004057) Train Loss mse: 0.0000, Train Loss ce: 0.3793, Train Steps/Sec: 0.67,
4281
- [2026-01-29 23:35:30] (step=0002716) Train Loss mse: 0.0000, Train Loss ce: 0.3701, Train Steps/Sec: 0.85,
4282
- base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step4000
4283
- Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
4284
- [eval debug] first 3 batch fingerprints:
4285
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
4286
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
4287
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
4288
- ce_avg: 2.886735677719116, mse_avg: 0.0
4289
- base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step4500
4290
- Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
4291
- [eval debug] first 3 batch fingerprints:
4292
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
4293
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
4294
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
4295
- ce_avg: 3.8481361865997314, mse_avg: 0.0
4296
- base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step5000
4297
- Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
4298
- [eval debug] first 3 batch fingerprints:
4299
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
4300
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
4301
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
4302
  [2026-01-30 00:05:39] (step=0004157) Train Loss mse: 0.0000, Train Loss ce: 0.3557, Train Steps/Sec: 0.71,
4303
  [2026-01-30 00:05:40] (step=0004158) Train Loss mse: 0.0000, Train Loss ce: 0.3751, Train Steps/Sec: 0.68,
4304
  [2026-01-30 00:05:41] (step=0004159) Train Loss mse: 0.0000, Train Loss ce: 0.3665, Train Steps/Sec: 0.91,
@@ -4874,6 +4966,13 @@ Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference
4874
  [2026-01-30 00:17:44] (step=0004729) Train Loss mse: 0.0000, Train Loss ce: 0.3545, Train Steps/Sec: 0.71,
4875
  [2026-01-30 00:17:45] (step=0004730) Train Loss mse: 0.0000, Train Loss ce: 0.3697, Train Steps/Sec: 0.83,
4876
  [2026-01-30 00:17:47] (step=0004731) Train Loss mse: 0.0000, Train Loss ce: 0.3464, Train Steps/Sec: 0.58,
4877
  [2026-01-30 00:17:48] (step=0004732) Train Loss mse: 0.0000, Train Loss ce: 0.3669, Train Steps/Sec: 0.84,
4878
  [2026-01-30 00:17:49] (step=0004733) Train Loss mse: 0.0000, Train Loss ce: 0.3566, Train Steps/Sec: 0.89,
4879
  [2026-01-30 00:17:50] (step=0004734) Train Loss mse: 0.0000, Train Loss ce: 0.3595, Train Steps/Sec: 0.84,
@@ -5146,5 +5245,4 @@ Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference
5146
  [2026-01-30 00:23:24] Saving checkpoint to /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/0005000.
5147
  /opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
5148
  warnings.warn(
5149
- [2026-01-30 00:25:59] Done!
5150
- [2026-01-30 00:05:39] (step=0004157) Train Loss mse: 0.0000, Train Loss ce: 0.3557, Train Steps/Sec: 0.71,
 
1
  wandb: Detected [huggingface_hub.inference] in use.
2
  wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
3
  wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
 
960
  [2026-01-29 22:58:42] (step=0000949) Train Loss mse: 0.0000, Train Loss ce: 0.3724, Train Steps/Sec: 0.90,
961
  [2026-01-29 22:58:44] (step=0000950) Train Loss mse: 0.0000, Train Loss ce: 0.3805, Train Steps/Sec: 0.89,
962
  [2026-01-29 22:58:45] (step=0000951) Train Loss mse: 0.0000, Train Loss ce: 0.3712, Train Steps/Sec: 0.84,
963
+ FullyShardedDataParallel(
964
+ (_fsdp_wrapped_module): Bagel(
965
+ (language_model): Qwen2ForCausalLM(
966
+ (model): Qwen2Model(
967
+ (embed_tokens): Embedding(152064, 3584)
968
+ (layers): ModuleList(
969
+ (0-27): 28 x FullyShardedDataParallel(
970
+ (_fsdp_wrapped_module): CheckpointWrapper(
971
+ (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
972
+ (self_attn): PackedAttentionMoT(
973
+ (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
974
+ (k_proj): Linear(in_features=3584, out_features=512, bias=True)
975
+ (v_proj): Linear(in_features=3584, out_features=512, bias=True)
976
+ (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
977
+ (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
978
+ (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
979
+ (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
980
+ (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
981
+ (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
982
+ (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
983
+ (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
984
+ (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
985
+ )
986
+ (mlp): Qwen2MLP(
987
+ (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
988
+ (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
989
+ (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
990
+ (act_fn): SiLU()
991
+ )
992
+ (mlp_moe_gen): Qwen2MLP(
993
+ (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
994
+ (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
995
+ (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
996
+ (act_fn): SiLU()
997
+ )
998
+ (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
999
+ (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
1000
+ (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
1001
+ (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
1002
+ )
1003
+ )
1004
+ )
1005
+ )
1006
+ (norm): Qwen2RMSNorm((3584,), eps=1e-06)
1007
+ (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
1008
+ (rotary_emb): Qwen2RotaryEmbedding()
1009
+ )
1010
+ (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
1011
+ )
1012
+ (vit_model): SiglipVisionModel(
1013
+ (vision_model): FullyShardedDataParallel(
1014
+ (_fsdp_wrapped_module): SiglipVisionTransformer(
1015
+ (embeddings): SiglipVisionEmbeddings(
1016
+ (position_embedding): Embedding(4900, 1152)
1017
+ (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
1018
+ )
1019
+ (encoder): SiglipEncoder(
1020
+ (layers): ModuleList(
1021
+ (0-25): 26 x FullyShardedDataParallel(
1022
+ (_fsdp_wrapped_module): CheckpointWrapper(
1023
+ (_checkpoint_wrapped_module): SiglipEncoderLayer(
1024
+ (self_attn): SiglipFlashAttention2(
1025
+ (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
1026
+ (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
1027
+ (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
1028
+ (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
1029
+ )
1030
+ (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
1031
+ (mlp): SiglipMLP(
1032
+ (activation_fn): PytorchGELUTanh()
1033
+ (fc1): Linear(in_features=1152, out_features=4304, bias=True)
1034
+ (fc2): Linear(in_features=4304, out_features=1152, bias=True)
1035
+ )
1036
+ (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
1037
+ )
1038
+ )
1039
+ )
1040
+ )
1041
+ )
1042
+ (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
1043
+ )
1044
+ )
1045
+ )
1046
+ (connector): FullyShardedDataParallel(
1047
+ (_fsdp_wrapped_module): CheckpointWrapper(
1048
+ (_checkpoint_wrapped_module): MLPconnector(
1049
+ (activation_fn): PytorchGELUTanh()
1050
+ (fc1): Linear(in_features=1152, out_features=3584, bias=True)
1051
+ (fc2): Linear(in_features=3584, out_features=3584, bias=True)
1052
+ )
1053
+ )
1054
+ )
1055
+ (vit_pos_embed): FullyShardedDataParallel(
1056
+ (_fsdp_wrapped_module): PositionEmbedding()
1057
+ )
1058
+ )
1059
+ )
1060
+ _flat_param True
1061
+ language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1062
+ language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1063
+ language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1064
+ language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1065
+ language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1066
+ language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1067
+ language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1068
+ language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1069
+ language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1070
+ language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1071
+ language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1072
+ language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1073
+ language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1074
+ language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1075
+ language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1076
+ language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1077
+ language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1078
+ language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1079
+ language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1080
+ language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1081
+ language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1082
+ language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1083
+ language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1084
+ language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1085
+ language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1086
+ language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1087
+ language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1088
+ language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1089
+ vit_model.vision_model._fsdp_wrapped_module._flat_param True
1090
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1091
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1092
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1093
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1094
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1095
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1096
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1097
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1098
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1099
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1100
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1101
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1102
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1103
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1104
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1105
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1106
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1107
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1108
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1109
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1110
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1111
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1112
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1113
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1114
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1115
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1116
+ connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1117
+ vit_pos_embed._fsdp_wrapped_module._flat_param False
1118
+ Preparing Dataset vlm_gym_reference_dot_celoss_no_mse/vlm_gym_reference_dot_train
1119
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step0
1120
+ Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
1121
+ [eval debug] first 3 batch fingerprints:
1122
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
1123
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
1124
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
1125
+ ce_avg: 1.1923778057098389, mse_avg: 0.0
1126
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step500
1127
+ Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
1128
+ [eval debug] first 3 batch fingerprints:
1129
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
1130
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
1131
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
1132
+ ce_avg: 0.3785431981086731, mse_avg: 0.0
1133
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step1000
1134
+ Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
1135
+ [eval debug] first 3 batch fingerprints:
1136
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
1137
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
1138
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
1139
+ ce_avg: 0.40663474798202515, mse_avg: 0.0
1140
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step1500
1141
+ Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
1142
+ [eval debug] first 3 batch fingerprints:
1143
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
1144
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
1145
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
1146
+ ce_avg: 0.5510271191596985, mse_avg: 0.0
1147
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step2000
1148
+ Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
1149
+ [eval debug] first 3 batch fingerprints:
1150
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
1151
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
1152
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
1153
+ ce_avg: 0.5553194880485535, mse_avg: 0.0
1154
  [2026-01-29 22:58:46] (step=0000952) Train Loss mse: 0.0000, Train Loss ce: 0.3810, Train Steps/Sec: 0.75,
1155
  [2026-01-29 22:58:47] (step=0000953) Train Loss mse: 0.0000, Train Loss ce: 0.3781, Train Steps/Sec: 0.84,
1156
  [2026-01-29 22:58:48] (step=0000954) Train Loss mse: 0.0000, Train Loss ce: 0.3897, Train Steps/Sec: 0.87,
 
1404
  [2026-01-29 23:04:00] (step=0001202) Train Loss mse: 0.0000, Train Loss ce: 0.3769, Train Steps/Sec: 0.80,
1405
  [2026-01-29 23:04:01] (step=0001203) Train Loss mse: 0.0000, Train Loss ce: 0.3707, Train Steps/Sec: 0.89,
1406
  [2026-01-29 23:04:03] (step=0001204) Train Loss mse: 0.0000, Train Loss ce: 0.3861, Train Steps/Sec: 0.57,
1407
  [2026-01-29 23:04:04] (step=0001205) Train Loss mse: 0.0000, Train Loss ce: 0.3762, Train Steps/Sec: 0.71,
1408
  [2026-01-29 23:04:06] (step=0001206) Train Loss mse: 0.0000, Train Loss ce: 0.3861, Train Steps/Sec: 0.84,
1409
  [2026-01-29 23:04:07] (step=0001207) Train Loss mse: 0.0000, Train Loss ce: 0.3744, Train Steps/Sec: 0.90,
 
2477
  [2026-01-29 23:26:13] (step=0002275) Train Loss mse: 0.0000, Train Loss ce: 0.3819, Train Steps/Sec: 0.70,
2478
  [2026-01-29 23:26:14] (step=0002276) Train Loss mse: 0.0000, Train Loss ce: 0.3862, Train Steps/Sec: 0.85,
2479
  [2026-01-29 23:26:15] (step=0002277) Train Loss mse: 0.0000, Train Loss ce: 0.3689, Train Steps/Sec: 0.86,
2480
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step2500
2481
+ Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
2482
+ [eval debug] first 3 batch fingerprints:
2483
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
2484
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
2485
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
2486
+ ce_avg: 0.9706690311431885, mse_avg: 0.0
2487
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step3000
2488
+ Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
2489
+ [eval debug] first 3 batch fingerprints:
2490
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
2491
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
2492
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
2493
+ ce_avg: 1.6775857210159302, mse_avg: 0.0
2494
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step3500
2495
+ Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
2496
+ [eval debug] first 3 batch fingerprints:
2497
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
2498
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
2499
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
2500
+ ce_avg: 2.2602760791778564, mse_avg: 0.0
2501
  [2026-01-29 23:26:17] (step=0002278) Train Loss mse: 0.0000, Train Loss ce: 0.3637, Train Steps/Sec: 0.88,
2502
  [2026-01-29 23:26:18] (step=0002279) Train Loss mse: 0.0000, Train Loss ce: 0.3588, Train Steps/Sec: 0.84,
2503
  [2026-01-29 23:26:19] (step=0002280) Train Loss mse: 0.0000, Train Loss ce: 0.3704, Train Steps/Sec: 0.70,
 
2936
  [2026-01-29 23:35:26] (step=0002713) Train Loss mse: 0.0000, Train Loss ce: 0.3485, Train Steps/Sec: 0.84,
2937
  [2026-01-29 23:35:27] (step=0002714) Train Loss mse: 0.0000, Train Loss ce: 0.3643, Train Steps/Sec: 0.90,
2938
  [2026-01-29 23:35:29] (step=0002715) Train Loss mse: 0.0000, Train Loss ce: 0.3573, Train Steps/Sec: 0.87,
2939
  [2026-01-29 23:35:30] (step=0002716) Train Loss mse: 0.0000, Train Loss ce: 0.3701, Train Steps/Sec: 0.85,
2940
  [2026-01-29 23:35:31] (step=0002717) Train Loss mse: 0.0000, Train Loss ce: 0.3544, Train Steps/Sec: 0.86,
2941
  [2026-01-29 23:35:32] (step=0002718) Train Loss mse: 0.0000, Train Loss ce: 0.3516, Train Steps/Sec: 0.89,
 
3522
  [2026-01-29 23:47:48] (step=0003299) Train Loss mse: 0.0000, Train Loss ce: 0.3641, Train Steps/Sec: 0.67,
3523
  [2026-01-29 23:47:49] (step=0003300) Train Loss mse: 0.0000, Train Loss ce: 0.3764, Train Steps/Sec: 0.89,
3524
  [2026-01-29 23:47:51] (step=0003301) Train Loss mse: 0.0000, Train Loss ce: 0.3725, Train Steps/Sec: 0.72,
3525
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step4000
3526
+ Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
3527
+ [eval debug] first 3 batch fingerprints:
3528
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
3529
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
3530
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
3531
+ ce_avg: 2.886735677719116, mse_avg: 0.0
3532
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step4500
3533
+ Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
3534
+ [eval debug] first 3 batch fingerprints:
3535
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
3536
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
3537
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
3538
+ ce_avg: 3.8481361865997314, mse_avg: 0.0
3539
  [2026-01-29 23:47:52] (step=0003302) Train Loss mse: 0.0000, Train Loss ce: 0.3544, Train Steps/Sec: 0.90,
3540
  [2026-01-29 23:47:53] (step=0003303) Train Loss mse: 0.0000, Train Loss ce: 0.3668, Train Steps/Sec: 0.84,
3541
  [2026-01-29 23:47:54] (step=0003304) Train Loss mse: 0.0000, Train Loss ce: 0.3708, Train Steps/Sec: 0.71,
 
4292
  [2026-01-30 00:03:28] (step=0004055) Train Loss mse: 0.0000, Train Loss ce: 0.3722, Train Steps/Sec: 0.89,
4293
  [2026-01-30 00:03:29] (step=0004056) Train Loss mse: 0.0000, Train Loss ce: 0.3767, Train Steps/Sec: 0.84,
4294
  [2026-01-30 00:03:31] (step=0004057) Train Loss mse: 0.0000, Train Loss ce: 0.3793, Train Steps/Sec: 0.67,
4295
+ [2026-01-30 00:03:32] (step=0004058) Train Loss mse: 0.0000, Train Loss ce: 0.3543, Train Steps/Sec: 0.90,
4296
+ [2026-01-30 00:03:34] (step=0004059) Train Loss mse: 0.0000, Train Loss ce: 0.3605, Train Steps/Sec: 0.56,
4297
+ [2026-01-30 00:03:35] (step=0004060) Train Loss mse: 0.0000, Train Loss ce: 0.3574, Train Steps/Sec: 0.85,
4298
+ [2026-01-30 00:03:36] (step=0004061) Train Loss mse: 0.0000, Train Loss ce: 0.3738, Train Steps/Sec: 0.74,
4299
+ [2026-01-30 00:03:38] (step=0004062) Train Loss mse: 0.0000, Train Loss ce: 0.3642, Train Steps/Sec: 0.68,
4300
+ [2026-01-30 00:03:39] (step=0004063) Train Loss mse: 0.0000, Train Loss ce: 0.3514, Train Steps/Sec: 0.85,
4301
+ [2026-01-30 00:03:40] (step=0004064) Train Loss mse: 0.0000, Train Loss ce: 0.3550, Train Steps/Sec: 0.90,
4302
+ [2026-01-30 00:03:41] (step=0004065) Train Loss mse: 0.0000, Train Loss ce: 0.3646, Train Steps/Sec: 0.90,
4303
+ [2026-01-30 00:03:43] (step=0004066) Train Loss mse: 0.0000, Train Loss ce: 0.3483, Train Steps/Sec: 0.72,
4304
+ [2026-01-30 00:03:44] (step=0004067) Train Loss mse: 0.0000, Train Loss ce: 0.3649, Train Steps/Sec: 0.67,
4305
+ [2026-01-30 00:03:45] (step=0004068) Train Loss mse: 0.0000, Train Loss ce: 0.3743, Train Steps/Sec: 0.69,
4306
+ [2026-01-30 00:03:47] (step=0004069) Train Loss mse: 0.0000, Train Loss ce: 0.3579, Train Steps/Sec: 0.90,
4307
+ [2026-01-30 00:03:48] (step=0004070) Train Loss mse: 0.0000, Train Loss ce: 0.3509, Train Steps/Sec: 0.75,
4308
+ [2026-01-30 00:03:49] (step=0004071) Train Loss mse: 0.0000, Train Loss ce: 0.3552, Train Steps/Sec: 0.69,
4309
+ [2026-01-30 00:03:50] (step=0004072) Train Loss mse: 0.0000, Train Loss ce: 0.3556, Train Steps/Sec: 0.89,
4310
+ [2026-01-30 00:03:52] (step=0004073) Train Loss mse: 0.0000, Train Loss ce: 0.3654, Train Steps/Sec: 0.90,
4311
+ [2026-01-30 00:03:53] (step=0004074) Train Loss mse: 0.0000, Train Loss ce: 0.3773, Train Steps/Sec: 0.71,
4312
+ [2026-01-30 00:03:54] (step=0004075) Train Loss mse: 0.0000, Train Loss ce: 0.3589, Train Steps/Sec: 0.69,
4313
+ [2026-01-30 00:03:56] (step=0004076) Train Loss mse: 0.0000, Train Loss ce: 0.3530, Train Steps/Sec: 0.72,
4314
+ [2026-01-30 00:03:57] (step=0004077) Train Loss mse: 0.0000, Train Loss ce: 0.3523, Train Steps/Sec: 0.90,
4315
+ [2026-01-30 00:03:58] (step=0004078) Train Loss mse: 0.0000, Train Loss ce: 0.3796, Train Steps/Sec: 0.72,
4316
+ [2026-01-30 00:03:59] (step=0004079) Train Loss mse: 0.0000, Train Loss ce: 0.3589, Train Steps/Sec: 0.90,
4317
+ [2026-01-30 00:04:01] (step=0004080) Train Loss mse: 0.0000, Train Loss ce: 0.3491, Train Steps/Sec: 0.69,
4318
+ [2026-01-30 00:04:02] (step=0004081) Train Loss mse: 0.0000, Train Loss ce: 0.3521, Train Steps/Sec: 0.84,
4319
+ [2026-01-30 00:04:04] (step=0004082) Train Loss mse: 0.0000, Train Loss ce: 0.3533, Train Steps/Sec: 0.68,
4320
+ [2026-01-30 00:04:05] (step=0004083) Train Loss mse: 0.0000, Train Loss ce: 0.3622, Train Steps/Sec: 0.71,
4321
+ [2026-01-30 00:04:06] (step=0004084) Train Loss mse: 0.0000, Train Loss ce: 0.3492, Train Steps/Sec: 0.91,
4322
+ [2026-01-30 00:04:07] (step=0004085) Train Loss mse: 0.0000, Train Loss ce: 0.3543, Train Steps/Sec: 0.90,
4323
+ [2026-01-30 00:04:09] (step=0004086) Train Loss mse: 0.0000, Train Loss ce: 0.3772, Train Steps/Sec: 0.72,
4324
+ [2026-01-30 00:04:10] (step=0004087) Train Loss mse: 0.0000, Train Loss ce: 0.3568, Train Steps/Sec: 0.88,
4325
+ [2026-01-30 00:04:11] (step=0004088) Train Loss mse: 0.0000, Train Loss ce: 0.3771, Train Steps/Sec: 0.70,
4326
+ [2026-01-30 00:04:12] (step=0004089) Train Loss mse: 0.0000, Train Loss ce: 0.3694, Train Steps/Sec: 0.89,
4327
+ [2026-01-30 00:04:14] (step=0004090) Train Loss mse: 0.0000, Train Loss ce: 0.3473, Train Steps/Sec: 0.69,
4328
+ [2026-01-30 00:04:15] (step=0004091) Train Loss mse: 0.0000, Train Loss ce: 0.3516, Train Steps/Sec: 0.71,
4329
+ [2026-01-30 00:04:16] (step=0004092) Train Loss mse: 0.0000, Train Loss ce: 0.3711, Train Steps/Sec: 0.90,
4330
+ [2026-01-30 00:04:17] (step=0004093) Train Loss mse: 0.0000, Train Loss ce: 0.3651, Train Steps/Sec: 0.90,
4331
+ [2026-01-30 00:04:19] (step=0004094) Train Loss mse: 0.0000, Train Loss ce: 0.3730, Train Steps/Sec: 0.75,
4332
+ [2026-01-30 00:04:20] (step=0004095) Train Loss mse: 0.0000, Train Loss ce: 0.3569, Train Steps/Sec: 0.86,
4333
+ [2026-01-30 00:04:21] (step=0004096) Train Loss mse: 0.0000, Train Loss ce: 0.3653, Train Steps/Sec: 0.66,
4334
+ [2026-01-30 00:04:23] (step=0004097) Train Loss mse: 0.0000, Train Loss ce: 0.3717, Train Steps/Sec: 0.89,
4335
+ [2026-01-30 00:04:24] (step=0004098) Train Loss mse: 0.0000, Train Loss ce: 0.3593, Train Steps/Sec: 0.57,
4336
+ [2026-01-30 00:04:25] (step=0004099) Train Loss mse: 0.0000, Train Loss ce: 0.3588, Train Steps/Sec: 0.85,
4337
+ [2026-01-30 00:04:27] (step=0004100) Train Loss mse: 0.0000, Train Loss ce: 0.3626, Train Steps/Sec: 0.91,
4338
+ [2026-01-30 00:04:28] (step=0004101) Train Loss mse: 0.0000, Train Loss ce: 0.3545, Train Steps/Sec: 0.85,
4339
+ [2026-01-30 00:04:29] (step=0004102) Train Loss mse: 0.0000, Train Loss ce: 0.3589, Train Steps/Sec: 0.76,
4340
+ [2026-01-30 00:04:30] (step=0004103) Train Loss mse: 0.0000, Train Loss ce: 0.3637, Train Steps/Sec: 0.90,
4341
+ [2026-01-30 00:04:32] (step=0004104) Train Loss mse: 0.0000, Train Loss ce: 0.3636, Train Steps/Sec: 0.70,
4342
+ [2026-01-30 00:04:33] (step=0004105) Train Loss mse: 0.0000, Train Loss ce: 0.3388, Train Steps/Sec: 0.71,
4343
+ [2026-01-30 00:04:34] (step=0004106) Train Loss mse: 0.0000, Train Loss ce: 0.3552, Train Steps/Sec: 0.72,
4344
+ [2026-01-30 00:04:36] (step=0004107) Train Loss mse: 0.0000, Train Loss ce: 0.3586, Train Steps/Sec: 0.88,
4345
+ [2026-01-30 00:04:37] (step=0004108) Train Loss mse: 0.0000, Train Loss ce: 0.3677, Train Steps/Sec: 0.86,
4346
+ [2026-01-30 00:04:38] (step=0004109) Train Loss mse: 0.0000, Train Loss ce: 0.3573, Train Steps/Sec: 0.76,
4347
+ [2026-01-30 00:04:39] (step=0004110) Train Loss mse: 0.0000, Train Loss ce: 0.3628, Train Steps/Sec: 0.90,
4348
+ [2026-01-30 00:04:40] (step=0004111) Train Loss mse: 0.0000, Train Loss ce: 0.3559, Train Steps/Sec: 0.89,
4349
+ [2026-01-30 00:04:42] (step=0004112) Train Loss mse: 0.0000, Train Loss ce: 0.3957, Train Steps/Sec: 0.67,
4350
+ [2026-01-30 00:04:43] (step=0004113) Train Loss mse: 0.0000, Train Loss ce: 0.3690, Train Steps/Sec: 0.69,
4351
+ [2026-01-30 00:04:44] (step=0004114) Train Loss mse: 0.0000, Train Loss ce: 0.3488, Train Steps/Sec: 0.84,
4352
+ [2026-01-30 00:04:46] (step=0004115) Train Loss mse: 0.0000, Train Loss ce: 0.3580, Train Steps/Sec: 0.90,
4353
+ [2026-01-30 00:04:47] (step=0004116) Train Loss mse: 0.0000, Train Loss ce: 0.3638, Train Steps/Sec: 0.84,
4354
+ [2026-01-30 00:04:48] (step=0004117) Train Loss mse: 0.0000, Train Loss ce: 0.3688, Train Steps/Sec: 0.76,
4355
+ [2026-01-30 00:04:49] (step=0004118) Train Loss mse: 0.0000, Train Loss ce: 0.3791, Train Steps/Sec: 0.84,
4356
+ [2026-01-30 00:04:51] (step=0004119) Train Loss mse: 0.0000, Train Loss ce: 0.3575, Train Steps/Sec: 0.69,
4357
+ [2026-01-30 00:04:52] (step=0004120) Train Loss mse: 0.0000, Train Loss ce: 0.3538, Train Steps/Sec: 0.71,
4358
+ [2026-01-30 00:04:53] (step=0004121) Train Loss mse: 0.0000, Train Loss ce: 0.3558, Train Steps/Sec: 0.84,
4359
+ [2026-01-30 00:04:55] (step=0004122) Train Loss mse: 0.0000, Train Loss ce: 0.3825, Train Steps/Sec: 0.69,
4360
+ [2026-01-30 00:04:56] (step=0004123) Train Loss mse: 0.0000, Train Loss ce: 0.3577, Train Steps/Sec: 0.90,
4361
+ [2026-01-30 00:04:57] (step=0004124) Train Loss mse: 0.0000, Train Loss ce: 0.3586, Train Steps/Sec: 0.86,
4362
+ [2026-01-30 00:04:58] (step=0004125) Train Loss mse: 0.0000, Train Loss ce: 0.3512, Train Steps/Sec: 0.84,
4363
+ [2026-01-30 00:04:59] (step=0004126) Train Loss mse: 0.0000, Train Loss ce: 0.3589, Train Steps/Sec: 0.91,
4364
+ [2026-01-30 00:05:01] (step=0004127) Train Loss mse: 0.0000, Train Loss ce: 0.3581, Train Steps/Sec: 0.68,
4365
+ [2026-01-30 00:05:02] (step=0004128) Train Loss mse: 0.0000, Train Loss ce: 0.3660, Train Steps/Sec: 0.91,
4366
+ [2026-01-30 00:05:03] (step=0004129) Train Loss mse: 0.0000, Train Loss ce: 0.3729, Train Steps/Sec: 0.67,
4367
+ [2026-01-30 00:05:05] (step=0004130) Train Loss mse: 0.0000, Train Loss ce: 0.3638, Train Steps/Sec: 0.88,
4368
+ [2026-01-30 00:05:06] (step=0004131) Train Loss mse: 0.0000, Train Loss ce: 0.3535, Train Steps/Sec: 0.84,
4369
+ [2026-01-30 00:05:07] (step=0004132) Train Loss mse: 0.0000, Train Loss ce: 0.3656, Train Steps/Sec: 0.71,
4370
+ [2026-01-30 00:05:08] (step=0004133) Train Loss mse: 0.0000, Train Loss ce: 0.3856, Train Steps/Sec: 0.91,
4371
+ [2026-01-30 00:05:10] (step=0004134) Train Loss mse: 0.0000, Train Loss ce: 0.3409, Train Steps/Sec: 0.70,
4372
+ [2026-01-30 00:05:11] (step=0004135) Train Loss mse: 0.0000, Train Loss ce: 0.3629, Train Steps/Sec: 0.84,
4373
+ [2026-01-30 00:05:12] (step=0004136) Train Loss mse: 0.0000, Train Loss ce: 0.3530, Train Steps/Sec: 0.76,
4374
+ [2026-01-30 00:05:13] (step=0004137) Train Loss mse: 0.0000, Train Loss ce: 0.3742, Train Steps/Sec: 0.90,
4375
+ [2026-01-30 00:05:14] (step=0004138) Train Loss mse: 0.0000, Train Loss ce: 0.3588, Train Steps/Sec: 0.90,
4376
+ [2026-01-30 00:05:16] (step=0004139) Train Loss mse: 0.0000, Train Loss ce: 0.3755, Train Steps/Sec: 0.71,
4377
+ [2026-01-30 00:05:17] (step=0004140) Train Loss mse: 0.0000, Train Loss ce: 0.3742, Train Steps/Sec: 0.86,
4378
+ [2026-01-30 00:05:18] (step=0004141) Train Loss mse: 0.0000, Train Loss ce: 0.3802, Train Steps/Sec: 0.70,
4379
+ [2026-01-30 00:05:20] (step=0004142) Train Loss mse: 0.0000, Train Loss ce: 0.3632, Train Steps/Sec: 0.84,
4380
+ [2026-01-30 00:05:21] (step=0004143) Train Loss mse: 0.0000, Train Loss ce: 0.3615, Train Steps/Sec: 0.70,
4381
+ [2026-01-30 00:05:22] (step=0004144) Train Loss mse: 0.0000, Train Loss ce: 0.3584, Train Steps/Sec: 0.87,
4382
+ [2026-01-30 00:05:23] (step=0004145) Train Loss mse: 0.0000, Train Loss ce: 0.3450, Train Steps/Sec: 0.90,
4383
+ [2026-01-30 00:05:25] (step=0004146) Train Loss mse: 0.0000, Train Loss ce: 0.3440, Train Steps/Sec: 0.71,
4384
+ [2026-01-30 00:05:26] (step=0004147) Train Loss mse: 0.0000, Train Loss ce: 0.3615, Train Steps/Sec: 0.83,
4385
+ [2026-01-30 00:05:27] (step=0004148) Train Loss mse: 0.0000, Train Loss ce: 0.3581, Train Steps/Sec: 0.71,
4386
+ [2026-01-30 00:05:29] (step=0004149) Train Loss mse: 0.0000, Train Loss ce: 0.3668, Train Steps/Sec: 0.84,
4387
+ [2026-01-30 00:05:30] (step=0004150) Train Loss mse: 0.0000, Train Loss ce: 0.3482, Train Steps/Sec: 0.69,
4388
+ [2026-01-30 00:05:31] (step=0004151) Train Loss mse: 0.0000, Train Loss ce: 0.3644, Train Steps/Sec: 0.86,
4389
+ [2026-01-30 00:05:32] (step=0004152) Train Loss mse: 0.0000, Train Loss ce: 0.3465, Train Steps/Sec: 0.83,
4390
+ [2026-01-30 00:05:34] (step=0004153) Train Loss mse: 0.0000, Train Loss ce: 0.3505, Train Steps/Sec: 0.70,
4391
+ [2026-01-30 00:05:35] (step=0004154) Train Loss mse: 0.0000, Train Loss ce: 0.3737, Train Steps/Sec: 0.83,
4392
+ [2026-01-30 00:05:36] (step=0004155) Train Loss mse: 0.0000, Train Loss ce: 0.3745, Train Steps/Sec: 0.73,
4393
+ [2026-01-30 00:05:37] (step=0004156) Train Loss mse: 0.0000, Train Loss ce: 0.3861, Train Steps/Sec: 0.90,
4394
  [2026-01-30 00:05:39] (step=0004157) Train Loss mse: 0.0000, Train Loss ce: 0.3557, Train Steps/Sec: 0.71,
4395
  [2026-01-30 00:05:40] (step=0004158) Train Loss mse: 0.0000, Train Loss ce: 0.3751, Train Steps/Sec: 0.68,
4396
  [2026-01-30 00:05:41] (step=0004159) Train Loss mse: 0.0000, Train Loss ce: 0.3665, Train Steps/Sec: 0.91,
 
4966
  [2026-01-30 00:17:44] (step=0004729) Train Loss mse: 0.0000, Train Loss ce: 0.3545, Train Steps/Sec: 0.71,
4967
  [2026-01-30 00:17:45] (step=0004730) Train Loss mse: 0.0000, Train Loss ce: 0.3697, Train Steps/Sec: 0.83,
4968
  [2026-01-30 00:17:47] (step=0004731) Train Loss mse: 0.0000, Train Loss ce: 0.3464, Train Steps/Sec: 0.58,
4969
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins_step5000
4970
+ Preparing Dataset vlm_gym_reference_dot_celoss_no_mse_evalonce/vlm_gym_reference_dot_val
4971
+ [eval debug] first 3 batch fingerprints:
4972
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
4973
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
4974
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_reference_dot_celoss_no_mse_evalonce'}]
4975
+ ce_avg: 6.3412766456604, mse_avg: 0.0
4976
  [2026-01-30 00:17:48] (step=0004732) Train Loss mse: 0.0000, Train Loss ce: 0.3669, Train Steps/Sec: 0.84,
4977
  [2026-01-30 00:17:49] (step=0004733) Train Loss mse: 0.0000, Train Loss ce: 0.3566, Train Steps/Sec: 0.89,
4978
  [2026-01-30 00:17:50] (step=0004734) Train Loss mse: 0.0000, Train Loss ce: 0.3595, Train Steps/Sec: 0.84,
 
5245
  [2026-01-30 00:23:24] Saving checkpoint to /dev/shm/models/checkpoints_vlm_gym_reference_dot_one_image_lr2e_5_ce_no_mse_ins/0005000.
5246
  /opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
5247
  warnings.warn(
5248
+ [2026-01-30 00:25:59] Done!
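
The per-step lines above share one fixed layout (timestamp, `step=`, `Train Loss mse`, `Train Loss ce`, `Train Steps/Sec`), so they can be pulled into records for plotting or averaging. The following is a minimal sketch, not part of the training code: the names `TRAIN_LINE` and `parse_train_lines` are hypothetical, and the regular expression assumes every value is a plain decimal as in the lines shown here (no `nan` or scientific notation).

```python
import re
from statistics import mean

# Matches lines such as:
# [2026-01-30 00:04:40] (step=0004111) Train Loss mse: 0.0000, Train Loss ce: 0.3559, Train Steps/Sec: 0.89,
# A leading "+ " diff marker is tolerated because search() is used instead of match().
TRAIN_LINE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s+"
    r"\(step=(?P<step>\d+)\)\s+"
    r"Train Loss mse: (?P<mse>[\d.]+), "
    r"Train Loss ce: (?P<ce>[\d.]+), "
    r"Train Steps/Sec: (?P<sps>[\d.]+)"
)


def parse_train_lines(lines):
    """Return one dict per training-step line; all other lines are skipped."""
    records = []
    for line in lines:
        m = TRAIN_LINE.search(line)
        if m is None:
            continue  # gutter numbers, eval output, warnings, etc.
        records.append(
            {
                "timestamp": m.group("ts"),
                "step": int(m.group("step")),
                "mse": float(m.group("mse")),
                "ce": float(m.group("ce")),
                "steps_per_sec": float(m.group("sps")),
            }
        )
    return records


# Sample input copied from the log above, including non-matching lines.
sample = [
    "+ [2026-01-30 00:04:40] (step=0004111) Train Loss mse: 0.0000, Train Loss ce: 0.3559, Train Steps/Sec: 0.89,",
    "4349",
    "+ [2026-01-30 00:04:42] (step=0004112) Train Loss mse: 0.0000, Train Loss ce: 0.3957, Train Steps/Sec: 0.67,",
    "+ ce_avg: 6.3412766456604, mse_avg: 0.0",
]

records = parse_train_lines(sample)
mean_ce = mean(r["ce"] for r in records)
print(len(records), "training steps parsed, mean ce =", round(mean_ce, 4))
```

Lines that do not match (diff gutter numbers, the `ce_avg` evaluation line, the FSDP warning) are ignored rather than treated as errors, so the whole log can be fed in unfiltered.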