Junyi42 committed · Commit 24ceec0 · verified · 1 Parent(s): 39dd173

Upload checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins

checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/wandb/offline-run-20260127_014845-vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins-run0/files/output.log CHANGED
@@ -1,3 +1,189 @@
1
  wandb: Detected [huggingface_hub.inference] in use.
2
  wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
3
  wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
@@ -1100,192 +1286,11 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
1100
  [2026-01-27 02:23:48] (step=0001089) Train Loss mse: 0.0100, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
1101
  [2026-01-27 02:23:50] (step=0001090) Train Loss mse: 0.0099, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
1102
  [2026-01-27 02:23:51] (step=0001091) Train Loss mse: 0.0082, Train Loss ce: 0.0000, Train Steps/Sec: 0.58,
1103
- FullyShardedDataParallel(
1104
- (_fsdp_wrapped_module): Bagel(
1105
- (language_model): Qwen2ForCausalLM(
1106
- (model): Qwen2Model(
1107
- (embed_tokens): Embedding(152064, 3584)
1108
- (layers): ModuleList(
1109
- (0-27): 28 x FullyShardedDataParallel(
1110
- (_fsdp_wrapped_module): CheckpointWrapper(
1111
- (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
1112
- (self_attn): PackedAttentionMoT(
1113
- (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
1114
- (k_proj): Linear(in_features=3584, out_features=512, bias=True)
1115
- (v_proj): Linear(in_features=3584, out_features=512, bias=True)
1116
- (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
1117
- (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
1118
- (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
1119
- (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
1120
- (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
1121
- (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
1122
- (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
1123
- (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
1124
- (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
1125
- )
1126
- (mlp): Qwen2MLP(
1127
- (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
1128
- (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
1129
- (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
1130
- (act_fn): SiLU()
1131
- )
1132
- (mlp_moe_gen): Qwen2MLP(
1133
- (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
1134
- (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
1135
- (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
1136
- (act_fn): SiLU()
1137
- )
1138
- (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
1139
- (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
1140
- (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
1141
- (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
1142
- )
1143
- )
1144
- )
1145
- )
1146
- (norm): Qwen2RMSNorm((3584,), eps=1e-06)
1147
- (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
1148
- (rotary_emb): Qwen2RotaryEmbedding()
1149
- )
1150
- (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
1151
- )
1152
- (time_embedder): FullyShardedDataParallel(
1153
- (_fsdp_wrapped_module): TimestepEmbedder(
1154
- (mlp): Sequential(
1155
- (0): Linear(in_features=256, out_features=3584, bias=True)
1156
- (1): SiLU()
1157
- (2): Linear(in_features=3584, out_features=3584, bias=True)
1158
- )
1159
- )
1160
- )
1161
- (vae2llm): Linear(in_features=64, out_features=3584, bias=True)
1162
- (llm2vae): Linear(in_features=3584, out_features=64, bias=True)
1163
- (latent_pos_embed): FullyShardedDataParallel(
1164
- (_fsdp_wrapped_module): PositionEmbedding()
1165
- )
1166
- (vit_model): SiglipVisionModel(
1167
- (vision_model): FullyShardedDataParallel(
1168
- (_fsdp_wrapped_module): SiglipVisionTransformer(
1169
- (embeddings): SiglipVisionEmbeddings(
1170
- (position_embedding): Embedding(4900, 1152)
1171
- (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
1172
- )
1173
- (encoder): SiglipEncoder(
1174
- (layers): ModuleList(
1175
- (0-25): 26 x FullyShardedDataParallel(
1176
- (_fsdp_wrapped_module): CheckpointWrapper(
1177
- (_checkpoint_wrapped_module): SiglipEncoderLayer(
1178
- (self_attn): SiglipFlashAttention2(
1179
- (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
1180
- (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
1181
- (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
1182
- (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
1183
- )
1184
- (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
1185
- (mlp): SiglipMLP(
1186
- (activation_fn): PytorchGELUTanh()
1187
- (fc1): Linear(in_features=1152, out_features=4304, bias=True)
1188
- (fc2): Linear(in_features=4304, out_features=1152, bias=True)
1189
- )
1190
- (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
1191
- )
1192
- )
1193
- )
1194
- )
1195
- )
1196
- (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
1197
- )
1198
- )
1199
- )
1200
- (connector): FullyShardedDataParallel(
1201
- (_fsdp_wrapped_module): CheckpointWrapper(
1202
- (_checkpoint_wrapped_module): MLPconnector(
1203
- (activation_fn): PytorchGELUTanh()
1204
- (fc1): Linear(in_features=1152, out_features=3584, bias=True)
1205
- (fc2): Linear(in_features=3584, out_features=3584, bias=True)
1206
- )
1207
- )
1208
- )
1209
- (vit_pos_embed): FullyShardedDataParallel(
1210
- (_fsdp_wrapped_module): PositionEmbedding()
1211
- )
1212
- )
1213
- )
1214
- _flat_param True
1215
- language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1216
- language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1217
- language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1218
- language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1219
- language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1220
- language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1221
- language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1222
- language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1223
- language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1224
- language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1225
- language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1226
- language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1227
- language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1228
- language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1229
- language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1230
- language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1231
- language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1232
- language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1233
- language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1234
- language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1235
- language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1236
- language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1237
- language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1238
- language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1239
- language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1240
- language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1241
- language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1242
- language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1243
- time_embedder._fsdp_wrapped_module._flat_param True
1244
- latent_pos_embed._fsdp_wrapped_module._flat_param False
1245
- vit_model.vision_model._fsdp_wrapped_module._flat_param True
1246
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1247
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1248
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1249
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1250
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1251
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1252
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1253
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1254
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1255
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1256
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1257
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1258
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1259
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1260
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1261
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1262
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1263
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1264
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1265
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1266
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1267
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1268
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1269
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1270
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1271
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1272
- connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1273
- vit_pos_embed._fsdp_wrapped_module._flat_param False
1274
- Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only/vlm_gym_match_equation_sos_train
1275
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step0
1276
- Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
1277
- [eval debug] first 3 batch fingerprints:
1278
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
1279
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
1280
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
1281
- ce_avg: 0.0, mse_avg: 0.6639004349708557
1282
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step500
1283
- Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
1284
- [eval debug] first 3 batch fingerprints:
1285
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
1286
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
1287
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
1288
- ce_avg: 0.0, mse_avg: 0.013862529769539833
1289
  base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step1000
1290
  Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
1291
  [eval debug] first 3 batch fingerprints:
@@ -1307,18 +1312,6 @@ Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_matc
1307
  fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
1308
  fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
1309
  ce_avg: 0.0, mse_avg: 0.010432669892907143
1310
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step2500
1311
- Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
1312
- [eval debug] first 3 batch fingerprints:
1313
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
1314
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
1315
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
1316
- ce_avg: 0.0, mse_avg: 0.01046404242515564
1317
- [2026-01-27 02:23:53] (step=0001092) Train Loss mse: 0.0114, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
1318
- [2026-01-27 02:23:55] (step=0001093) Train Loss mse: 0.0082, Train Loss ce: 0.0000, Train Steps/Sec: 0.57,
1319
- [2026-01-27 02:23:56] (step=0001094) Train Loss mse: 0.0097, Train Loss ce: 0.0000, Train Steps/Sec: 0.59,
1320
- [2026-01-27 02:23:58] (step=0001095) Train Loss mse: 0.0102, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
1321
- [2026-01-27 02:23:59] (step=0001096) Train Loss mse: 0.0125, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
1322
  [2026-01-27 02:24:01] (step=0001097) Train Loss mse: 0.0103, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
1323
  [2026-01-27 02:24:02] (step=0001098) Train Loss mse: 0.0065, Train Loss ce: 0.0000, Train Steps/Sec: 0.58,
1324
  [2026-01-27 02:24:04] (step=0001099) Train Loss mse: 0.0102, Train Loss ce: 0.0000, Train Steps/Sec: 0.67,
@@ -2805,20 +2798,6 @@ ce_avg: 0.0, mse_avg: 0.01046404242515564
2805
  [2026-01-27 03:03:51] (step=0002580) Train Loss mse: 0.0050, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
2806
  [2026-01-27 03:03:53] (step=0002581) Train Loss mse: 0.0042, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
2807
  [2026-01-27 03:03:54] (step=0002582) Train Loss mse: 0.0058, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
2808
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step3000
2809
- Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
2810
- [eval debug] first 3 batch fingerprints:
2811
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
2812
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
2813
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
2814
- ce_avg: 0.0, mse_avg: 0.01004165131598711
2815
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step3500
2816
- Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
2817
- [eval debug] first 3 batch fingerprints:
2818
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
2819
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
2820
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
2821
- ce_avg: 0.0, mse_avg: 0.009692845866084099
2822
  [2026-01-27 03:03:56] (step=0002583) Train Loss mse: 0.0039, Train Loss ce: 0.0000, Train Steps/Sec: 0.59,
2823
  [2026-01-27 03:03:58] (step=0002584) Train Loss mse: 0.0052, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
2824
  [2026-01-27 03:03:59] (step=0002585) Train Loss mse: 0.0062, Train Loss ce: 0.0000, Train Steps/Sec: 0.55,
@@ -2872,6 +2851,27 @@ ce_avg: 0.0, mse_avg: 0.009692845866084099
2872
  [2026-01-27 03:05:15] (step=0002633) Train Loss mse: 0.0049, Train Loss ce: 0.0000, Train Steps/Sec: 0.55,
2873
  [2026-01-27 03:05:17] (step=0002634) Train Loss mse: 0.0038, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
2874
  [2026-01-27 03:05:18] (step=0002635) Train Loss mse: 0.0037, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
2875
  [2026-01-27 03:05:20] (step=0002636) Train Loss mse: 0.0061, Train Loss ce: 0.0000, Train Steps/Sec: 0.59,
2876
  [2026-01-27 03:05:21] (step=0002637) Train Loss mse: 0.0056, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
2877
  [2026-01-27 03:05:23] (step=0002638) Train Loss mse: 0.0071, Train Loss ce: 0.0000, Train Steps/Sec: 0.67,
@@ -3875,27 +3875,6 @@ ce_avg: 0.0, mse_avg: 0.009692845866084099
3875
  [2026-01-27 03:32:17] (step=0003636) Train Loss mse: 0.0050, Train Loss ce: 0.0000, Train Steps/Sec: 0.59,
3876
  [2026-01-27 03:32:18] (step=0003637) Train Loss mse: 0.0052, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
3877
  [2026-01-27 03:32:20] (step=0003638) Train Loss mse: 0.0060, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
3878
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step4000
3879
- Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
3880
- [eval debug] first 3 batch fingerprints:
3881
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
3882
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
3883
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
3884
- ce_avg: 0.0, mse_avg: 0.010126573964953423
3885
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step4500
3886
- Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
3887
- [eval debug] first 3 batch fingerprints:
3888
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
3889
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
3890
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
3891
- ce_avg: 0.0, mse_avg: 0.00975842121988535
3892
- base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step5000
3893
- Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
3894
- [eval debug] first 3 batch fingerprints:
3895
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
3896
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
3897
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
3898
- ce_avg: 0.0, mse_avg: 0.009475299157202244
3899
  [2026-01-27 03:32:21] (step=0003639) Train Loss mse: 0.0054, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
3900
  [2026-01-27 03:32:23] (step=0003640) Train Loss mse: 0.0082, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
3901
  [2026-01-27 03:32:24] (step=0003641) Train Loss mse: 0.0051, Train Loss ce: 0.0000, Train Steps/Sec: 0.55,
@@ -3976,6 +3955,20 @@ ce_avg: 0.0, mse_avg: 0.009475299157202244
3976
  [2026-01-27 03:34:23] (step=0003716) Train Loss mse: 0.0047, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
3977
  [2026-01-27 03:34:24] (step=0003717) Train Loss mse: 0.0041, Train Loss ce: 0.0000, Train Steps/Sec: 0.59,
3978
  [2026-01-27 03:34:26] (step=0003718) Train Loss mse: 0.0050, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
3979
  [2026-01-27 03:34:28] (step=0003719) Train Loss mse: 0.0034, Train Loss ce: 0.0000, Train Steps/Sec: 0.58,
3980
  [2026-01-27 03:34:29] (step=0003720) Train Loss mse: 0.0042, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
3981
  [2026-01-27 03:34:31] (step=0003721) Train Loss mse: 0.0074, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
@@ -5261,4 +5254,11 @@ ce_avg: 0.0, mse_avg: 0.009475299157202244
5261
  [2026-01-27 04:08:28] Saving checkpoint to /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/0005000.
5262
  /opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
5263
  warnings.warn(
5264
- [2026-01-27 04:11:00] Done!
1
+ FullyShardedDataParallel(
2
+ (_fsdp_wrapped_module): Bagel(
3
+ (language_model): Qwen2ForCausalLM(
4
+ (model): Qwen2Model(
5
+ (embed_tokens): Embedding(152064, 3584)
6
+ (layers): ModuleList(
7
+ (0-27): 28 x FullyShardedDataParallel(
8
+ (_fsdp_wrapped_module): CheckpointWrapper(
9
+ (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
10
+ (self_attn): PackedAttentionMoT(
11
+ (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
12
+ (k_proj): Linear(in_features=3584, out_features=512, bias=True)
13
+ (v_proj): Linear(in_features=3584, out_features=512, bias=True)
14
+ (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
15
+ (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
16
+ (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
17
+ (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
18
+ (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
19
+ (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
20
+ (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
21
+ (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
22
+ (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
23
+ )
24
+ (mlp): Qwen2MLP(
25
+ (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
26
+ (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
27
+ (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
28
+ (act_fn): SiLU()
29
+ )
30
+ (mlp_moe_gen): Qwen2MLP(
31
+ (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
32
+ (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
33
+ (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
34
+ (act_fn): SiLU()
35
+ )
36
+ (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
37
+ (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
38
+ (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
39
+ (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
40
+ )
41
+ )
42
+ )
43
+ )
44
+ (norm): Qwen2RMSNorm((3584,), eps=1e-06)
45
+ (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
46
+ (rotary_emb): Qwen2RotaryEmbedding()
47
+ )
48
+ (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
49
+ )
50
+ (time_embedder): FullyShardedDataParallel(
51
+ (_fsdp_wrapped_module): TimestepEmbedder(
52
+ (mlp): Sequential(
53
+ (0): Linear(in_features=256, out_features=3584, bias=True)
54
+ (1): SiLU()
55
+ (2): Linear(in_features=3584, out_features=3584, bias=True)
56
+ )
57
+ )
58
+ )
59
+ (vae2llm): Linear(in_features=64, out_features=3584, bias=True)
60
+ (llm2vae): Linear(in_features=3584, out_features=64, bias=True)
61
+ (latent_pos_embed): FullyShardedDataParallel(
62
+ (_fsdp_wrapped_module): PositionEmbedding()
63
+ )
64
+ (vit_model): SiglipVisionModel(
65
+ (vision_model): FullyShardedDataParallel(
66
+ (_fsdp_wrapped_module): SiglipVisionTransformer(
67
+ (embeddings): SiglipVisionEmbeddings(
68
+ (position_embedding): Embedding(4900, 1152)
69
+ (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
70
+ )
71
+ (encoder): SiglipEncoder(
72
+ (layers): ModuleList(
73
+ (0-25): 26 x FullyShardedDataParallel(
74
+ (_fsdp_wrapped_module): CheckpointWrapper(
75
+ (_checkpoint_wrapped_module): SiglipEncoderLayer(
76
+ (self_attn): SiglipFlashAttention2(
77
+ (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
78
+ (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
79
+ (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
80
+ (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
81
+ )
82
+ (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
83
+ (mlp): SiglipMLP(
84
+ (activation_fn): PytorchGELUTanh()
85
+ (fc1): Linear(in_features=1152, out_features=4304, bias=True)
86
+ (fc2): Linear(in_features=4304, out_features=1152, bias=True)
87
+ )
88
+ (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
89
+ )
90
+ )
91
+ )
92
+ )
93
+ )
94
+ (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
95
+ )
96
+ )
97
+ )
98
+ (connector): FullyShardedDataParallel(
99
+ (_fsdp_wrapped_module): CheckpointWrapper(
100
+ (_checkpoint_wrapped_module): MLPconnector(
101
+ (activation_fn): PytorchGELUTanh()
102
+ (fc1): Linear(in_features=1152, out_features=3584, bias=True)
103
+ (fc2): Linear(in_features=3584, out_features=3584, bias=True)
104
+ )
105
+ )
106
+ )
107
+ (vit_pos_embed): FullyShardedDataParallel(
108
+ (_fsdp_wrapped_module): PositionEmbedding()
109
+ )
110
+ )
111
+ )
112
+ _flat_param True
113
+ language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
114
+ language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
115
+ language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
116
+ language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
117
+ language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
118
+ language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
119
+ language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
120
+ language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
121
+ language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
122
+ language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
123
+ language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
124
+ language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
125
+ language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
126
+ language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
127
+ language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
128
+ language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
129
+ language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
130
+ language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
131
+ language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
132
+ language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
133
+ language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
134
+ language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
135
+ language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
136
+ language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
137
+ language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
138
+ language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
139
+ language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
140
+ language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
141
+ time_embedder._fsdp_wrapped_module._flat_param True
142
+ latent_pos_embed._fsdp_wrapped_module._flat_param False
143
+ vit_model.vision_model._fsdp_wrapped_module._flat_param True
144
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
145
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
146
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
147
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
148
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
149
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
150
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
151
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
152
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
153
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
154
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
155
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
156
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
157
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
158
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
159
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
160
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
161
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
162
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
163
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
164
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
165
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
166
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
167
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
168
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
169
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
170
+ connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
171
+ vit_pos_embed._fsdp_wrapped_module._flat_param False
172
+ Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only/vlm_gym_match_equation_sos_train
173
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step0
174
+ Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
175
+ [eval debug] first 3 batch fingerprints:
176
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
177
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
178
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
179
+ ce_avg: 0.0, mse_avg: 0.6639004349708557
180
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step500
181
+ Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
182
+ [eval debug] first 3 batch fingerprints:
183
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
184
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
185
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
186
+ ce_avg: 0.0, mse_avg: 0.013862529769539833
187
  wandb: Detected [huggingface_hub.inference] in use.
188
  wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
189
  wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
 
1286
  [2026-01-27 02:23:48] (step=0001089) Train Loss mse: 0.0100, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
1287
  [2026-01-27 02:23:50] (step=0001090) Train Loss mse: 0.0099, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
1288
  [2026-01-27 02:23:51] (step=0001091) Train Loss mse: 0.0082, Train Loss ce: 0.0000, Train Steps/Sec: 0.58,
1289
+ [2026-01-27 02:23:53] (step=0001092) Train Loss mse: 0.0114, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
1290
+ [2026-01-27 02:23:55] (step=0001093) Train Loss mse: 0.0082, Train Loss ce: 0.0000, Train Steps/Sec: 0.57,
1291
+ [2026-01-27 02:23:56] (step=0001094) Train Loss mse: 0.0097, Train Loss ce: 0.0000, Train Steps/Sec: 0.59,
1292
+ [2026-01-27 02:23:58] (step=0001095) Train Loss mse: 0.0102, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
1293
+ [2026-01-27 02:23:59] (step=0001096) Train Loss mse: 0.0125, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
1294
  base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step1000
1295
  Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
1296
  [eval debug] first 3 batch fingerprints:
 
1312
  fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
1313
  fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
1314
  ce_avg: 0.0, mse_avg: 0.010432669892907143
1315
  [2026-01-27 02:24:01] (step=0001097) Train Loss mse: 0.0103, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
1316
  [2026-01-27 02:24:02] (step=0001098) Train Loss mse: 0.0065, Train Loss ce: 0.0000, Train Steps/Sec: 0.58,
1317
  [2026-01-27 02:24:04] (step=0001099) Train Loss mse: 0.0102, Train Loss ce: 0.0000, Train Steps/Sec: 0.67,
 
2798
  [2026-01-27 03:03:51] (step=0002580) Train Loss mse: 0.0050, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
2799
  [2026-01-27 03:03:53] (step=0002581) Train Loss mse: 0.0042, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
2800
  [2026-01-27 03:03:54] (step=0002582) Train Loss mse: 0.0058, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
2801
  [2026-01-27 03:03:56] (step=0002583) Train Loss mse: 0.0039, Train Loss ce: 0.0000, Train Steps/Sec: 0.59,
2802
  [2026-01-27 03:03:58] (step=0002584) Train Loss mse: 0.0052, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
2803
  [2026-01-27 03:03:59] (step=0002585) Train Loss mse: 0.0062, Train Loss ce: 0.0000, Train Steps/Sec: 0.55,
 
2851
  [2026-01-27 03:05:15] (step=0002633) Train Loss mse: 0.0049, Train Loss ce: 0.0000, Train Steps/Sec: 0.55,
2852
  [2026-01-27 03:05:17] (step=0002634) Train Loss mse: 0.0038, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
2853
  [2026-01-27 03:05:18] (step=0002635) Train Loss mse: 0.0037, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
2854
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step2500
2855
+ Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
2856
+ [eval debug] first 3 batch fingerprints:
2857
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
2858
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
2859
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
2860
+ ce_avg: 0.0, mse_avg: 0.01046404242515564
2861
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step3000
2862
+ Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
2863
+ [eval debug] first 3 batch fingerprints:
2864
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
2865
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
2866
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
2867
+ ce_avg: 0.0, mse_avg: 0.01004165131598711
2868
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step3500
2869
+ Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
2870
+ [eval debug] first 3 batch fingerprints:
2871
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
2872
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
2873
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
2874
+ ce_avg: 0.0, mse_avg: 0.009692845866084099
2875
  [2026-01-27 03:05:20] (step=0002636) Train Loss mse: 0.0061, Train Loss ce: 0.0000, Train Steps/Sec: 0.59,
2876
  [2026-01-27 03:05:21] (step=0002637) Train Loss mse: 0.0056, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
2877
  [2026-01-27 03:05:23] (step=0002638) Train Loss mse: 0.0071, Train Loss ce: 0.0000, Train Steps/Sec: 0.67,
 
3875
  [2026-01-27 03:32:17] (step=0003636) Train Loss mse: 0.0050, Train Loss ce: 0.0000, Train Steps/Sec: 0.59,
3876
  [2026-01-27 03:32:18] (step=0003637) Train Loss mse: 0.0052, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
3877
  [2026-01-27 03:32:20] (step=0003638) Train Loss mse: 0.0060, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
3878
  [2026-01-27 03:32:21] (step=0003639) Train Loss mse: 0.0054, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
3879
  [2026-01-27 03:32:23] (step=0003640) Train Loss mse: 0.0082, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
3880
  [2026-01-27 03:32:24] (step=0003641) Train Loss mse: 0.0051, Train Loss ce: 0.0000, Train Steps/Sec: 0.55,
 
3955
  [2026-01-27 03:34:23] (step=0003716) Train Loss mse: 0.0047, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
3956
  [2026-01-27 03:34:24] (step=0003717) Train Loss mse: 0.0041, Train Loss ce: 0.0000, Train Steps/Sec: 0.59,
3957
  [2026-01-27 03:34:26] (step=0003718) Train Loss mse: 0.0050, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
3958
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step4000
3959
+ Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
3960
+ [eval debug] first 3 batch fingerprints:
3961
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
3962
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
3963
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
3964
+ ce_avg: 0.0, mse_avg: 0.010126573964953423
3965
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step4500
3966
+ Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
3967
+ [eval debug] first 3 batch fingerprints:
3968
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
3969
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
3970
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
3971
+ ce_avg: 0.0, mse_avg: 0.00975842121988535
3972
  [2026-01-27 03:34:28] (step=0003719) Train Loss mse: 0.0034, Train Loss ce: 0.0000, Train Steps/Sec: 0.58,
3973
  [2026-01-27 03:34:29] (step=0003720) Train Loss mse: 0.0042, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
3974
  [2026-01-27 03:34:31] (step=0003721) Train Loss mse: 0.0074, Train Loss ce: 0.0000, Train Steps/Sec: 0.68,
 
5254
  [2026-01-27 04:08:28] Saving checkpoint to /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/0005000.
5255
  /opt/conda/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
5256
  warnings.warn(
5257
+ [2026-01-27 04:11:00] Done!
5258
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_match_equation_sos_one_image_lr2e_5_mse_only_ins/eval_used_rows, step_tag is vlm_gym_match_equation_sos_one_img_lr2e_5_mse_only_ins_step5000
5259
+ Preparing Dataset vlm_gym_match_equation_sos_mse_loss_only_evalonce/vlm_gym_match_equation_sos_val
5260
+ [eval debug] first 3 batch fingerprints:
5261
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
5262
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
5263
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_match_equation_sos_mse_loss_only_evalonce'}]
5264
+ ce_avg: 0.0, mse_avg: 0.009475299157202244