Junyi42 committed
Commit 3074b08 · verified · 1 Parent(s): 93f2e0b

Upload checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce

Files changed (7)
  1. checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/wandb/offline-run-20251229_044705-checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce-run0/files/output.log +175 -175
  2. checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/wandb/offline-run-20251230_023052-checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993-run0/files/output.log +94 -94
  3. checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/wandb/offline-run-20251230_024051-checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed-run0/files/output.log +18 -18
  4. checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/wandb/offline-run-20260104_093254-checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed-run0/files/output.log +181 -181
  5. checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/wandb/offline-run-20260105_090343-checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed-run0/files/output.log +21 -21
  6. checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/wandb/offline-run-20260107_093905-checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed-run0/files/output.log +200 -200
  7. checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/wandb/offline-run-20260107_184933-checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed-run0/files/output.log +107 -107
checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/wandb/offline-run-20251229_044705-checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce-run0/files/output.log CHANGED
@@ -1,3 +1,178 @@
  wandb: Detected [huggingface_hub.inference] in use.
  wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
  wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
@@ -936,181 +1111,6 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
  [2025-12-29 08:26:29] (step=0000925) Train Loss mse: 0.0482, Train Loss ce: 0.0633, Train Steps/Sec: 0.06,
  [2025-12-29 08:26:42] (step=0000926) Train Loss mse: 0.0486, Train Loss ce: 0.0633, Train Steps/Sec: 0.08,
  [2025-12-29 08:26:54] (step=0000927) Train Loss mse: 0.0519, Train Loss ce: 0.0643, Train Steps/Sec: 0.08,
- FullyShardedDataParallel(
- (_fsdp_wrapped_module): Bagel(
- (language_model): Qwen2ForCausalLM(
- (model): Qwen2Model(
- (embed_tokens): Embedding(152064, 3584)
- (layers): ModuleList(
- (0-27): 28 x FullyShardedDataParallel(
- (_fsdp_wrapped_module): CheckpointWrapper(
- (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
- (self_attn): PackedAttentionMoT(
- (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
- (k_proj): Linear(in_features=3584, out_features=512, bias=True)
- (v_proj): Linear(in_features=3584, out_features=512, bias=True)
- (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
- (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
- (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
- (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
- (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
- (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
- (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
- (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
- (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
- )
- (mlp): Qwen2MLP(
- (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
- (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
- (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
- (act_fn): SiLU()
- )
- (mlp_moe_gen): Qwen2MLP(
- (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
- (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
- (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
- (act_fn): SiLU()
- )
- (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
- (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
- (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
- (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
- )
- )
- )
- )
- (norm): Qwen2RMSNorm((3584,), eps=1e-06)
- (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
- (rotary_emb): Qwen2RotaryEmbedding()
- )
- (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
- )
- (time_embedder): FullyShardedDataParallel(
- (_fsdp_wrapped_module): TimestepEmbedder(
- (mlp): Sequential(
- (0): Linear(in_features=256, out_features=3584, bias=True)
- (1): SiLU()
- (2): Linear(in_features=3584, out_features=3584, bias=True)
- )
- )
- )
- (vae2llm): Linear(in_features=64, out_features=3584, bias=True)
- (llm2vae): Linear(in_features=3584, out_features=64, bias=True)
- (latent_pos_embed): FullyShardedDataParallel(
- (_fsdp_wrapped_module): PositionEmbedding()
- )
- (vit_model): SiglipVisionModel(
- (vision_model): FullyShardedDataParallel(
- (_fsdp_wrapped_module): SiglipVisionTransformer(
- (embeddings): SiglipVisionEmbeddings(
- (position_embedding): Embedding(4900, 1152)
- (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
- )
- (encoder): SiglipEncoder(
- (layers): ModuleList(
- (0-25): 26 x FullyShardedDataParallel(
- (_fsdp_wrapped_module): CheckpointWrapper(
- (_checkpoint_wrapped_module): SiglipEncoderLayer(
- (self_attn): SiglipFlashAttention2(
- (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
- (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
- (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
- (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
- )
- (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
- (mlp): SiglipMLP(
- (activation_fn): PytorchGELUTanh()
- (fc1): Linear(in_features=1152, out_features=4304, bias=True)
- (fc2): Linear(in_features=4304, out_features=1152, bias=True)
- )
- (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
- )
- )
- )
- )
- )
- (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
- )
- )
- )
- (connector): FullyShardedDataParallel(
- (_fsdp_wrapped_module): CheckpointWrapper(
- (_checkpoint_wrapped_module): MLPconnector(
- (activation_fn): PytorchGELUTanh()
- (fc1): Linear(in_features=1152, out_features=3584, bias=True)
- (fc2): Linear(in_features=3584, out_features=3584, bias=True)
- )
- )
- )
- (vit_pos_embed): FullyShardedDataParallel(
- (_fsdp_wrapped_module): PositionEmbedding()
- )
- )
- )
- _flat_param True
- language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- time_embedder._fsdp_wrapped_module._flat_param True
- latent_pos_embed._fsdp_wrapped_module._flat_param False
- vit_model.vision_model._fsdp_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_pos_embed._fsdp_wrapped_module._flat_param False
- Preparing Dataset vlm_gym_jigsaw_celoss/vlm_gym_jigsaw_train
- Warning: failed loading hashes from /home/jiaxin/bagel_train/hashes_test_set_v10.json: [Errno 2] No such file or directory: '/home/jiaxin/bagel_train/hashes_test_set_v10.json'
- Preparing Dataset vlm_gym_jigsaw_celoss/vlm_gym_jigsaw_val
- Warning: failed loading hashes from /home/jiaxin/bagel_train/hashes_test_set_v10.json: [Errno 2] No such file or directory: '/home/jiaxin/bagel_train/hashes_test_set_v10.json'
  [2025-12-29 08:27:07] (step=0000928) Train Loss mse: 0.0324, Train Loss ce: 0.0643, Train Steps/Sec: 0.08,
  [2025-12-29 08:27:20] (step=0000929) Train Loss mse: 0.0449, Train Loss ce: 0.0605, Train Steps/Sec: 0.07,
  [2025-12-29 08:27:34] (step=0000930) Train Loss mse: 0.0395, Train Loss ce: 0.0646, Train Steps/Sec: 0.07,
checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/wandb/offline-run-20251230_023052-checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993-run0/files/output.log CHANGED
@@ -799,6 +799,100 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
  [2025-12-30 05:38:29] (step=0000788) Train Loss mse: 0.0482, Train Loss ce: 0.0750, Train Steps/Sec: 0.09,
  [2025-12-30 05:38:45] (step=0000789) Train Loss mse: 0.0491, Train Loss ce: 0.0644, Train Steps/Sec: 0.06,
  [2025-12-30 05:38:58] (step=0000790) Train Loss mse: 0.0478, Train Loss ce: 0.0628, Train Steps/Sec: 0.08,
  FullyShardedDataParallel(
  (_fsdp_wrapped_module): Bagel(
  (language_model): Qwen2ForCausalLM(
@@ -974,100 +1068,6 @@ Preparing Dataset vlm_gym_jigsaw_celoss/vlm_gym_jigsaw_train
  Warning: failed loading hashes from /home/jiaxin/bagel_train/hashes_test_set_v10.json: [Errno 2] No such file or directory: '/home/jiaxin/bagel_train/hashes_test_set_v10.json'
  Preparing Dataset vlm_gym_jigsaw_celoss/vlm_gym_jigsaw_val
  Warning: failed loading hashes from /home/jiaxin/bagel_train/hashes_test_set_v10.json: [Errno 2] No such file or directory: '/home/jiaxin/bagel_train/hashes_test_set_v10.json'
- [2025-12-30 05:39:11] (step=0000791) Train Loss mse: 0.0397, Train Loss ce: 0.0636, Train Steps/Sec: 0.07,
- [2025-12-30 05:39:23] (step=0000792) Train Loss mse: 0.0506, Train Loss ce: 0.0581, Train Steps/Sec: 0.09,
- [2025-12-30 05:39:35] (step=0000793) Train Loss mse: 0.0476, Train Loss ce: 0.0624, Train Steps/Sec: 0.08,
- [2025-12-30 05:39:48] (step=0000794) Train Loss mse: 0.0308, Train Loss ce: 0.0605, Train Steps/Sec: 0.07,
- [2025-12-30 05:40:04] (step=0000795) Train Loss mse: 0.0383, Train Loss ce: 0.0697, Train Steps/Sec: 0.06,
- [2025-12-30 05:40:19] (step=0000796) Train Loss mse: 0.0339, Train Loss ce: 0.0695, Train Steps/Sec: 0.07,
- [2025-12-30 05:40:30] (step=0000797) Train Loss mse: 0.0377, Train Loss ce: 0.0688, Train Steps/Sec: 0.09,
- [2025-12-30 05:40:46] (step=0000798) Train Loss mse: 0.0481, Train Loss ce: 0.0675, Train Steps/Sec: 0.06,
- [2025-12-30 05:41:03] (step=0000799) Train Loss mse: 0.0411, Train Loss ce: 0.0655, Train Steps/Sec: 0.06,
- [2025-12-30 05:41:16] (step=0000800) Train Loss mse: 0.0402, Train Loss ce: 0.0631, Train Steps/Sec: 0.07,
- [2025-12-30 05:41:30] (step=0000801) Train Loss mse: 0.0409, Train Loss ce: 0.0616, Train Steps/Sec: 0.07,
- [2025-12-30 05:41:42] (step=0000802) Train Loss mse: 0.0484, Train Loss ce: 0.0610, Train Steps/Sec: 0.08,
- [2025-12-30 05:41:55] (step=0000803) Train Loss mse: 0.0327, Train Loss ce: 0.0618, Train Steps/Sec: 0.08,
- [2025-12-30 05:42:06] (step=0000804) Train Loss mse: 0.0724, Train Loss ce: 0.0654, Train Steps/Sec: 0.10,
- [2025-12-30 05:42:19] (step=0000805) Train Loss mse: 0.0422, Train Loss ce: 0.0623, Train Steps/Sec: 0.07,
- [2025-12-30 05:42:31] (step=0000806) Train Loss mse: 0.0480, Train Loss ce: 0.0572, Train Steps/Sec: 0.08,
- [2025-12-30 05:42:43] (step=0000807) Train Loss mse: 0.0494, Train Loss ce: 0.0594, Train Steps/Sec: 0.08,
- [2025-12-30 05:42:54] (step=0000808) Train Loss mse: 0.0453, Train Loss ce: 0.0614, Train Steps/Sec: 0.09,
- [2025-12-30 05:43:11] (step=0000809) Train Loss mse: 0.0326, Train Loss ce: 0.0636, Train Steps/Sec: 0.06,
- [2025-12-30 05:43:23] (step=0000810) Train Loss mse: 0.0434, Train Loss ce: 0.0585, Train Steps/Sec: 0.08,
- [2025-12-30 05:43:35] (step=0000811) Train Loss mse: 0.0605, Train Loss ce: 0.0614, Train Steps/Sec: 0.08,
- [2025-12-30 05:43:47] (step=0000812) Train Loss mse: 0.0396, Train Loss ce: 0.0593, Train Steps/Sec: 0.09,
- [2025-12-30 05:44:00] (step=0000813) Train Loss mse: 0.0491, Train Loss ce: 0.0637, Train Steps/Sec: 0.07,
- [2025-12-30 05:44:13] (step=0000814) Train Loss mse: 0.0546, Train Loss ce: 0.0675, Train Steps/Sec: 0.08,
- [2025-12-30 05:44:29] (step=0000815) Train Loss mse: 0.0476, Train Loss ce: 0.0708, Train Steps/Sec: 0.06,
- [2025-12-30 05:44:45] (step=0000816) Train Loss mse: 0.0340, Train Loss ce: 0.0607, Train Steps/Sec: 0.06,
- [2025-12-30 05:45:01] (step=0000817) Train Loss mse: 0.0456, Train Loss ce: 0.0649, Train Steps/Sec: 0.06,
- [2025-12-30 05:45:15] (step=0000818) Train Loss mse: 0.0350, Train Loss ce: 0.0628, Train Steps/Sec: 0.07,
- [2025-12-30 05:45:26] (step=0000819) Train Loss mse: 0.0484, Train Loss ce: 0.0635, Train Steps/Sec: 0.09,
- [2025-12-30 05:45:37] (step=0000820) Train Loss mse: 0.0427, Train Loss ce: 0.0604, Train Steps/Sec: 0.09,
- [2025-12-30 05:45:50] (step=0000821) Train Loss mse: 0.0280, Train Loss ce: 0.0599, Train Steps/Sec: 0.08,
- [2025-12-30 05:46:01] (step=0000822) Train Loss mse: 0.0395, Train Loss ce: 0.0650, Train Steps/Sec: 0.09,
- [2025-12-30 05:46:13] (step=0000823) Train Loss mse: 0.0509, Train Loss ce: 0.0648, Train Steps/Sec: 0.08,
- [2025-12-30 05:46:29] (step=0000824) Train Loss mse: 0.0542, Train Loss ce: 0.0660, Train Steps/Sec: 0.06,
- [2025-12-30 05:46:43] (step=0000825) Train Loss mse: 0.0471, Train Loss ce: 0.0679, Train Steps/Sec: 0.07,
- [2025-12-30 05:46:55] (step=0000826) Train Loss mse: 0.0600, Train Loss ce: 0.0622, Train Steps/Sec: 0.08,
- [2025-12-30 05:47:05] (step=0000827) Train Loss mse: 0.0479, Train Loss ce: 0.0622, Train Steps/Sec: 0.10,
1014
- [2025-12-30 05:47:19] (step=0000828) Train Loss mse: 0.0395, Train Loss ce: 0.0603, Train Steps/Sec: 0.07,
1015
- [2025-12-30 05:47:35] (step=0000829) Train Loss mse: 0.0340, Train Loss ce: 0.0620, Train Steps/Sec: 0.06,
1016
- [2025-12-30 05:47:46] (step=0000830) Train Loss mse: 0.0480, Train Loss ce: 0.0592, Train Steps/Sec: 0.09,
1017
- [2025-12-30 05:47:58] (step=0000831) Train Loss mse: 0.0487, Train Loss ce: 0.0607, Train Steps/Sec: 0.08,
1018
- [2025-12-30 05:48:10] (step=0000832) Train Loss mse: 0.0518, Train Loss ce: 0.0637, Train Steps/Sec: 0.08,
1019
- [2025-12-30 05:48:26] (step=0000833) Train Loss mse: 0.0428, Train Loss ce: 0.0640, Train Steps/Sec: 0.06,
1020
- [2025-12-30 05:48:38] (step=0000834) Train Loss mse: 0.0478, Train Loss ce: 0.0733, Train Steps/Sec: 0.09,
1021
- [2025-12-30 05:48:54] (step=0000835) Train Loss mse: 0.0390, Train Loss ce: 0.0679, Train Steps/Sec: 0.06,
1022
- [2025-12-30 05:49:07] (step=0000836) Train Loss mse: 0.0372, Train Loss ce: 0.0580, Train Steps/Sec: 0.07,
1023
- [2025-12-30 05:49:20] (step=0000837) Train Loss mse: 0.0565, Train Loss ce: 0.0645, Train Steps/Sec: 0.08,
1024
- [2025-12-30 05:49:35] (step=0000838) Train Loss mse: 0.0368, Train Loss ce: 0.0601, Train Steps/Sec: 0.06,
1025
- [2025-12-30 05:49:49] (step=0000839) Train Loss mse: 0.0374, Train Loss ce: 0.0657, Train Steps/Sec: 0.08,
1026
- [2025-12-30 05:50:04] (step=0000840) Train Loss mse: 0.0467, Train Loss ce: 0.0665, Train Steps/Sec: 0.06,
1027
- [2025-12-30 05:50:18] (step=0000841) Train Loss mse: 0.0450, Train Loss ce: 0.0677, Train Steps/Sec: 0.08,
1028
- [2025-12-30 05:50:30] (step=0000842) Train Loss mse: 0.0410, Train Loss ce: 0.0632, Train Steps/Sec: 0.08,
1029
- [2025-12-30 05:50:46] (step=0000843) Train Loss mse: 0.0386, Train Loss ce: 0.0618, Train Steps/Sec: 0.06,
1030
- [2025-12-30 05:51:00] (step=0000844) Train Loss mse: 0.0496, Train Loss ce: 0.0641, Train Steps/Sec: 0.08,
1031
- [2025-12-30 05:51:14] (step=0000845) Train Loss mse: 0.0314, Train Loss ce: 0.0647, Train Steps/Sec: 0.07,
1032
- [2025-12-30 05:51:26] (step=0000846) Train Loss mse: 0.0343, Train Loss ce: 0.0591, Train Steps/Sec: 0.08,
1033
- [2025-12-30 05:51:42] (step=0000847) Train Loss mse: 0.0252, Train Loss ce: 0.0611, Train Steps/Sec: 0.06,
1034
- [2025-12-30 05:51:55] (step=0000848) Train Loss mse: 0.0322, Train Loss ce: 0.0662, Train Steps/Sec: 0.08,
1035
- [2025-12-30 05:52:11] (step=0000849) Train Loss mse: 0.0505, Train Loss ce: 0.0663, Train Steps/Sec: 0.06,
1036
- [2025-12-30 05:52:23] (step=0000850) Train Loss mse: 0.0455, Train Loss ce: 0.0656, Train Steps/Sec: 0.08,
1037
- [2025-12-30 05:52:36] (step=0000851) Train Loss mse: 0.0462, Train Loss ce: 0.0642, Train Steps/Sec: 0.08,
1038
- [2025-12-30 05:52:49] (step=0000852) Train Loss mse: 0.0527, Train Loss ce: 0.0714, Train Steps/Sec: 0.08,
1039
- [2025-12-30 05:53:01] (step=0000853) Train Loss mse: 0.0497, Train Loss ce: 0.0633, Train Steps/Sec: 0.08,
1040
- [2025-12-30 05:53:17] (step=0000854) Train Loss mse: 0.0413, Train Loss ce: 0.0632, Train Steps/Sec: 0.06,
1041
- [2025-12-30 05:53:30] (step=0000855) Train Loss mse: 0.0484, Train Loss ce: 0.0659, Train Steps/Sec: 0.07,
1042
- [2025-12-30 05:53:43] (step=0000856) Train Loss mse: 0.0571, Train Loss ce: 0.0692, Train Steps/Sec: 0.07,
1043
- [2025-12-30 05:53:57] (step=0000857) Train Loss mse: 0.0312, Train Loss ce: 0.0582, Train Steps/Sec: 0.07,
1044
- [2025-12-30 05:54:09] (step=0000858) Train Loss mse: 0.0596, Train Loss ce: 0.0589, Train Steps/Sec: 0.08,
1045
- [2025-12-30 05:54:22] (step=0000859) Train Loss mse: 0.0381, Train Loss ce: 0.0585, Train Steps/Sec: 0.08,
1046
- [2025-12-30 05:54:36] (step=0000860) Train Loss mse: 0.0362, Train Loss ce: 0.0659, Train Steps/Sec: 0.08,
1047
- [2025-12-30 05:54:52] (step=0000861) Train Loss mse: 0.0378, Train Loss ce: 0.0605, Train Steps/Sec: 0.06,
1048
- [2025-12-30 05:55:05] (step=0000862) Train Loss mse: 0.0405, Train Loss ce: 0.0605, Train Steps/Sec: 0.07,
1049
- [2025-12-30 05:55:18] (step=0000863) Train Loss mse: 0.0501, Train Loss ce: 0.0731, Train Steps/Sec: 0.08,
1050
- [2025-12-30 05:55:32] (step=0000864) Train Loss mse: 0.0554, Train Loss ce: 0.0656, Train Steps/Sec: 0.07,
1051
- [2025-12-30 05:55:45] (step=0000865) Train Loss mse: 0.0447, Train Loss ce: 0.0668, Train Steps/Sec: 0.08,
1052
- [2025-12-30 05:55:58] (step=0000866) Train Loss mse: 0.0281, Train Loss ce: 0.0655, Train Steps/Sec: 0.08,
1053
- [2025-12-30 05:56:14] (step=0000867) Train Loss mse: 0.0498, Train Loss ce: 0.0637, Train Steps/Sec: 0.06,
1054
- [2025-12-30 05:56:28] (step=0000868) Train Loss mse: 0.0501, Train Loss ce: 0.0628, Train Steps/Sec: 0.07,
1055
- [2025-12-30 05:56:41] (step=0000869) Train Loss mse: 0.0460, Train Loss ce: 0.0721, Train Steps/Sec: 0.08,
1056
- [2025-12-30 05:56:54] (step=0000870) Train Loss mse: 0.0598, Train Loss ce: 0.0687, Train Steps/Sec: 0.08,
1057
- [2025-12-30 05:57:09] (step=0000871) Train Loss mse: 0.0381, Train Loss ce: 0.0636, Train Steps/Sec: 0.06,
1058
- [2025-12-30 05:57:26] (step=0000872) Train Loss mse: 0.0296, Train Loss ce: 0.0603, Train Steps/Sec: 0.06,
1059
- [2025-12-30 05:57:39] (step=0000873) Train Loss mse: 0.0587, Train Loss ce: 0.0701, Train Steps/Sec: 0.07,
1060
- [2025-12-30 05:57:55] (step=0000874) Train Loss mse: 0.0443, Train Loss ce: 0.0640, Train Steps/Sec: 0.06,
1061
- [2025-12-30 05:58:11] (step=0000875) Train Loss mse: 0.0375, Train Loss ce: 0.0569, Train Steps/Sec: 0.06,
1062
- [2025-12-30 05:58:27] (step=0000876) Train Loss mse: 0.0279, Train Loss ce: 0.0632, Train Steps/Sec: 0.06,
1063
- [2025-12-30 05:58:43] (step=0000877) Train Loss mse: 0.0385, Train Loss ce: 0.0582, Train Steps/Sec: 0.06,
1064
- [2025-12-30 05:58:59] (step=0000878) Train Loss mse: 0.0368, Train Loss ce: 0.0641, Train Steps/Sec: 0.06,
1065
- [2025-12-30 05:59:12] (step=0000879) Train Loss mse: 0.0451, Train Loss ce: 0.0623, Train Steps/Sec: 0.07,
1066
- [2025-12-30 05:59:26] (step=0000880) Train Loss mse: 0.0397, Train Loss ce: 0.0639, Train Steps/Sec: 0.07,
1067
- [2025-12-30 05:59:40] (step=0000881) Train Loss mse: 0.0532, Train Loss ce: 0.0666, Train Steps/Sec: 0.07,
1068
- [2025-12-30 05:59:53] (step=0000882) Train Loss mse: 0.0452, Train Loss ce: 0.0636, Train Steps/Sec: 0.08,
1069
- [2025-12-30 06:00:07] (step=0000883) Train Loss mse: 0.0423, Train Loss ce: 0.0629, Train Steps/Sec: 0.08,
1070
- [2025-12-30 06:00:19] (step=0000884) Train Loss mse: 0.0447, Train Loss ce: 0.0672, Train Steps/Sec: 0.09,
1071
  [2025-12-30 06:00:30] (step=0000885) Train Loss mse: 0.0446, Train Loss ce: 0.0648, Train Steps/Sec: 0.09,
1072
  [2025-12-30 06:00:43] (step=0000886) Train Loss mse: 0.0273, Train Loss ce: 0.0610, Train Steps/Sec: 0.08,
1073
  [2025-12-30 06:00:56] (step=0000887) Train Loss mse: 0.0444, Train Loss ce: 0.0651, Train Steps/Sec: 0.08,
 
799
  [2025-12-30 05:38:29] (step=0000788) Train Loss mse: 0.0482, Train Loss ce: 0.0750, Train Steps/Sec: 0.09,
800
  [2025-12-30 05:38:45] (step=0000789) Train Loss mse: 0.0491, Train Loss ce: 0.0644, Train Steps/Sec: 0.06,
801
  [2025-12-30 05:38:58] (step=0000790) Train Loss mse: 0.0478, Train Loss ce: 0.0628, Train Steps/Sec: 0.08,
802
+ [2025-12-30 05:39:11] (step=0000791) Train Loss mse: 0.0397, Train Loss ce: 0.0636, Train Steps/Sec: 0.07,
803
+ [2025-12-30 05:39:23] (step=0000792) Train Loss mse: 0.0506, Train Loss ce: 0.0581, Train Steps/Sec: 0.09,
804
+ [2025-12-30 05:39:35] (step=0000793) Train Loss mse: 0.0476, Train Loss ce: 0.0624, Train Steps/Sec: 0.08,
805
+ [2025-12-30 05:39:48] (step=0000794) Train Loss mse: 0.0308, Train Loss ce: 0.0605, Train Steps/Sec: 0.07,
806
+ [2025-12-30 05:40:04] (step=0000795) Train Loss mse: 0.0383, Train Loss ce: 0.0697, Train Steps/Sec: 0.06,
807
+ [2025-12-30 05:40:19] (step=0000796) Train Loss mse: 0.0339, Train Loss ce: 0.0695, Train Steps/Sec: 0.07,
808
+ [2025-12-30 05:40:30] (step=0000797) Train Loss mse: 0.0377, Train Loss ce: 0.0688, Train Steps/Sec: 0.09,
809
+ [2025-12-30 05:40:46] (step=0000798) Train Loss mse: 0.0481, Train Loss ce: 0.0675, Train Steps/Sec: 0.06,
810
+ [2025-12-30 05:41:03] (step=0000799) Train Loss mse: 0.0411, Train Loss ce: 0.0655, Train Steps/Sec: 0.06,
811
+ [2025-12-30 05:41:16] (step=0000800) Train Loss mse: 0.0402, Train Loss ce: 0.0631, Train Steps/Sec: 0.07,
812
+ [2025-12-30 05:41:30] (step=0000801) Train Loss mse: 0.0409, Train Loss ce: 0.0616, Train Steps/Sec: 0.07,
813
+ [2025-12-30 05:41:42] (step=0000802) Train Loss mse: 0.0484, Train Loss ce: 0.0610, Train Steps/Sec: 0.08,
814
+ [2025-12-30 05:41:55] (step=0000803) Train Loss mse: 0.0327, Train Loss ce: 0.0618, Train Steps/Sec: 0.08,
815
+ [2025-12-30 05:42:06] (step=0000804) Train Loss mse: 0.0724, Train Loss ce: 0.0654, Train Steps/Sec: 0.10,
816
+ [2025-12-30 05:42:19] (step=0000805) Train Loss mse: 0.0422, Train Loss ce: 0.0623, Train Steps/Sec: 0.07,
817
+ [2025-12-30 05:42:31] (step=0000806) Train Loss mse: 0.0480, Train Loss ce: 0.0572, Train Steps/Sec: 0.08,
818
+ [2025-12-30 05:42:43] (step=0000807) Train Loss mse: 0.0494, Train Loss ce: 0.0594, Train Steps/Sec: 0.08,
819
+ [2025-12-30 05:42:54] (step=0000808) Train Loss mse: 0.0453, Train Loss ce: 0.0614, Train Steps/Sec: 0.09,
820
+ [2025-12-30 05:43:11] (step=0000809) Train Loss mse: 0.0326, Train Loss ce: 0.0636, Train Steps/Sec: 0.06,
821
+ [2025-12-30 05:43:23] (step=0000810) Train Loss mse: 0.0434, Train Loss ce: 0.0585, Train Steps/Sec: 0.08,
822
+ [2025-12-30 05:43:35] (step=0000811) Train Loss mse: 0.0605, Train Loss ce: 0.0614, Train Steps/Sec: 0.08,
823
+ [2025-12-30 05:43:47] (step=0000812) Train Loss mse: 0.0396, Train Loss ce: 0.0593, Train Steps/Sec: 0.09,
824
+ [2025-12-30 05:44:00] (step=0000813) Train Loss mse: 0.0491, Train Loss ce: 0.0637, Train Steps/Sec: 0.07,
825
+ [2025-12-30 05:44:13] (step=0000814) Train Loss mse: 0.0546, Train Loss ce: 0.0675, Train Steps/Sec: 0.08,
826
+ [2025-12-30 05:44:29] (step=0000815) Train Loss mse: 0.0476, Train Loss ce: 0.0708, Train Steps/Sec: 0.06,
827
+ [2025-12-30 05:44:45] (step=0000816) Train Loss mse: 0.0340, Train Loss ce: 0.0607, Train Steps/Sec: 0.06,
828
+ [2025-12-30 05:45:01] (step=0000817) Train Loss mse: 0.0456, Train Loss ce: 0.0649, Train Steps/Sec: 0.06,
829
+ [2025-12-30 05:45:15] (step=0000818) Train Loss mse: 0.0350, Train Loss ce: 0.0628, Train Steps/Sec: 0.07,
830
+ [2025-12-30 05:45:26] (step=0000819) Train Loss mse: 0.0484, Train Loss ce: 0.0635, Train Steps/Sec: 0.09,
831
+ [2025-12-30 05:45:37] (step=0000820) Train Loss mse: 0.0427, Train Loss ce: 0.0604, Train Steps/Sec: 0.09,
832
+ [2025-12-30 05:45:50] (step=0000821) Train Loss mse: 0.0280, Train Loss ce: 0.0599, Train Steps/Sec: 0.08,
833
+ [2025-12-30 05:46:01] (step=0000822) Train Loss mse: 0.0395, Train Loss ce: 0.0650, Train Steps/Sec: 0.09,
834
+ [2025-12-30 05:46:13] (step=0000823) Train Loss mse: 0.0509, Train Loss ce: 0.0648, Train Steps/Sec: 0.08,
835
+ [2025-12-30 05:46:29] (step=0000824) Train Loss mse: 0.0542, Train Loss ce: 0.0660, Train Steps/Sec: 0.06,
836
+ [2025-12-30 05:46:43] (step=0000825) Train Loss mse: 0.0471, Train Loss ce: 0.0679, Train Steps/Sec: 0.07,
837
+ [2025-12-30 05:46:55] (step=0000826) Train Loss mse: 0.0600, Train Loss ce: 0.0622, Train Steps/Sec: 0.08,
838
+ [2025-12-30 05:47:05] (step=0000827) Train Loss mse: 0.0479, Train Loss ce: 0.0622, Train Steps/Sec: 0.10,
839
+ [2025-12-30 05:47:19] (step=0000828) Train Loss mse: 0.0395, Train Loss ce: 0.0603, Train Steps/Sec: 0.07,
840
+ [2025-12-30 05:47:35] (step=0000829) Train Loss mse: 0.0340, Train Loss ce: 0.0620, Train Steps/Sec: 0.06,
841
+ [2025-12-30 05:47:46] (step=0000830) Train Loss mse: 0.0480, Train Loss ce: 0.0592, Train Steps/Sec: 0.09,
842
+ [2025-12-30 05:47:58] (step=0000831) Train Loss mse: 0.0487, Train Loss ce: 0.0607, Train Steps/Sec: 0.08,
843
+ [2025-12-30 05:48:10] (step=0000832) Train Loss mse: 0.0518, Train Loss ce: 0.0637, Train Steps/Sec: 0.08,
844
+ [2025-12-30 05:48:26] (step=0000833) Train Loss mse: 0.0428, Train Loss ce: 0.0640, Train Steps/Sec: 0.06,
845
+ [2025-12-30 05:48:38] (step=0000834) Train Loss mse: 0.0478, Train Loss ce: 0.0733, Train Steps/Sec: 0.09,
846
+ [2025-12-30 05:48:54] (step=0000835) Train Loss mse: 0.0390, Train Loss ce: 0.0679, Train Steps/Sec: 0.06,
847
+ [2025-12-30 05:49:07] (step=0000836) Train Loss mse: 0.0372, Train Loss ce: 0.0580, Train Steps/Sec: 0.07,
848
+ [2025-12-30 05:49:20] (step=0000837) Train Loss mse: 0.0565, Train Loss ce: 0.0645, Train Steps/Sec: 0.08,
849
+ [2025-12-30 05:49:35] (step=0000838) Train Loss mse: 0.0368, Train Loss ce: 0.0601, Train Steps/Sec: 0.06,
850
+ [2025-12-30 05:49:49] (step=0000839) Train Loss mse: 0.0374, Train Loss ce: 0.0657, Train Steps/Sec: 0.08,
851
+ [2025-12-30 05:50:04] (step=0000840) Train Loss mse: 0.0467, Train Loss ce: 0.0665, Train Steps/Sec: 0.06,
852
+ [2025-12-30 05:50:18] (step=0000841) Train Loss mse: 0.0450, Train Loss ce: 0.0677, Train Steps/Sec: 0.08,
853
+ [2025-12-30 05:50:30] (step=0000842) Train Loss mse: 0.0410, Train Loss ce: 0.0632, Train Steps/Sec: 0.08,
854
+ [2025-12-30 05:50:46] (step=0000843) Train Loss mse: 0.0386, Train Loss ce: 0.0618, Train Steps/Sec: 0.06,
855
+ [2025-12-30 05:51:00] (step=0000844) Train Loss mse: 0.0496, Train Loss ce: 0.0641, Train Steps/Sec: 0.08,
856
+ [2025-12-30 05:51:14] (step=0000845) Train Loss mse: 0.0314, Train Loss ce: 0.0647, Train Steps/Sec: 0.07,
857
+ [2025-12-30 05:51:26] (step=0000846) Train Loss mse: 0.0343, Train Loss ce: 0.0591, Train Steps/Sec: 0.08,
858
+ [2025-12-30 05:51:42] (step=0000847) Train Loss mse: 0.0252, Train Loss ce: 0.0611, Train Steps/Sec: 0.06,
859
+ [2025-12-30 05:51:55] (step=0000848) Train Loss mse: 0.0322, Train Loss ce: 0.0662, Train Steps/Sec: 0.08,
860
+ [2025-12-30 05:52:11] (step=0000849) Train Loss mse: 0.0505, Train Loss ce: 0.0663, Train Steps/Sec: 0.06,
861
+ [2025-12-30 05:52:23] (step=0000850) Train Loss mse: 0.0455, Train Loss ce: 0.0656, Train Steps/Sec: 0.08,
862
+ [2025-12-30 05:52:36] (step=0000851) Train Loss mse: 0.0462, Train Loss ce: 0.0642, Train Steps/Sec: 0.08,
863
+ [2025-12-30 05:52:49] (step=0000852) Train Loss mse: 0.0527, Train Loss ce: 0.0714, Train Steps/Sec: 0.08,
864
+ [2025-12-30 05:53:01] (step=0000853) Train Loss mse: 0.0497, Train Loss ce: 0.0633, Train Steps/Sec: 0.08,
865
+ [2025-12-30 05:53:17] (step=0000854) Train Loss mse: 0.0413, Train Loss ce: 0.0632, Train Steps/Sec: 0.06,
866
+ [2025-12-30 05:53:30] (step=0000855) Train Loss mse: 0.0484, Train Loss ce: 0.0659, Train Steps/Sec: 0.07,
867
+ [2025-12-30 05:53:43] (step=0000856) Train Loss mse: 0.0571, Train Loss ce: 0.0692, Train Steps/Sec: 0.07,
868
+ [2025-12-30 05:53:57] (step=0000857) Train Loss mse: 0.0312, Train Loss ce: 0.0582, Train Steps/Sec: 0.07,
869
+ [2025-12-30 05:54:09] (step=0000858) Train Loss mse: 0.0596, Train Loss ce: 0.0589, Train Steps/Sec: 0.08,
870
+ [2025-12-30 05:54:22] (step=0000859) Train Loss mse: 0.0381, Train Loss ce: 0.0585, Train Steps/Sec: 0.08,
871
+ [2025-12-30 05:54:36] (step=0000860) Train Loss mse: 0.0362, Train Loss ce: 0.0659, Train Steps/Sec: 0.08,
872
+ [2025-12-30 05:54:52] (step=0000861) Train Loss mse: 0.0378, Train Loss ce: 0.0605, Train Steps/Sec: 0.06,
873
+ [2025-12-30 05:55:05] (step=0000862) Train Loss mse: 0.0405, Train Loss ce: 0.0605, Train Steps/Sec: 0.07,
874
+ [2025-12-30 05:55:18] (step=0000863) Train Loss mse: 0.0501, Train Loss ce: 0.0731, Train Steps/Sec: 0.08,
875
+ [2025-12-30 05:55:32] (step=0000864) Train Loss mse: 0.0554, Train Loss ce: 0.0656, Train Steps/Sec: 0.07,
876
+ [2025-12-30 05:55:45] (step=0000865) Train Loss mse: 0.0447, Train Loss ce: 0.0668, Train Steps/Sec: 0.08,
877
+ [2025-12-30 05:55:58] (step=0000866) Train Loss mse: 0.0281, Train Loss ce: 0.0655, Train Steps/Sec: 0.08,
878
+ [2025-12-30 05:56:14] (step=0000867) Train Loss mse: 0.0498, Train Loss ce: 0.0637, Train Steps/Sec: 0.06,
879
+ [2025-12-30 05:56:28] (step=0000868) Train Loss mse: 0.0501, Train Loss ce: 0.0628, Train Steps/Sec: 0.07,
880
+ [2025-12-30 05:56:41] (step=0000869) Train Loss mse: 0.0460, Train Loss ce: 0.0721, Train Steps/Sec: 0.08,
881
+ [2025-12-30 05:56:54] (step=0000870) Train Loss mse: 0.0598, Train Loss ce: 0.0687, Train Steps/Sec: 0.08,
882
+ [2025-12-30 05:57:09] (step=0000871) Train Loss mse: 0.0381, Train Loss ce: 0.0636, Train Steps/Sec: 0.06,
883
+ [2025-12-30 05:57:26] (step=0000872) Train Loss mse: 0.0296, Train Loss ce: 0.0603, Train Steps/Sec: 0.06,
884
+ [2025-12-30 05:57:39] (step=0000873) Train Loss mse: 0.0587, Train Loss ce: 0.0701, Train Steps/Sec: 0.07,
885
+ [2025-12-30 05:57:55] (step=0000874) Train Loss mse: 0.0443, Train Loss ce: 0.0640, Train Steps/Sec: 0.06,
886
+ [2025-12-30 05:58:11] (step=0000875) Train Loss mse: 0.0375, Train Loss ce: 0.0569, Train Steps/Sec: 0.06,
887
+ [2025-12-30 05:58:27] (step=0000876) Train Loss mse: 0.0279, Train Loss ce: 0.0632, Train Steps/Sec: 0.06,
888
+ [2025-12-30 05:58:43] (step=0000877) Train Loss mse: 0.0385, Train Loss ce: 0.0582, Train Steps/Sec: 0.06,
889
+ [2025-12-30 05:58:59] (step=0000878) Train Loss mse: 0.0368, Train Loss ce: 0.0641, Train Steps/Sec: 0.06,
890
+ [2025-12-30 05:59:12] (step=0000879) Train Loss mse: 0.0451, Train Loss ce: 0.0623, Train Steps/Sec: 0.07,
891
+ [2025-12-30 05:59:26] (step=0000880) Train Loss mse: 0.0397, Train Loss ce: 0.0639, Train Steps/Sec: 0.07,
892
+ [2025-12-30 05:59:40] (step=0000881) Train Loss mse: 0.0532, Train Loss ce: 0.0666, Train Steps/Sec: 0.07,
893
+ [2025-12-30 05:59:53] (step=0000882) Train Loss mse: 0.0452, Train Loss ce: 0.0636, Train Steps/Sec: 0.08,
894
+ [2025-12-30 06:00:07] (step=0000883) Train Loss mse: 0.0423, Train Loss ce: 0.0629, Train Steps/Sec: 0.08,
895
+ [2025-12-30 06:00:19] (step=0000884) Train Loss mse: 0.0447, Train Loss ce: 0.0672, Train Steps/Sec: 0.09,
896
  FullyShardedDataParallel(
897
  (_fsdp_wrapped_module): Bagel(
898
  (language_model): Qwen2ForCausalLM(
 
1068
  Warning: failed loading hashes from /home/jiaxin/bagel_train/hashes_test_set_v10.json: [Errno 2] No such file or directory: '/home/jiaxin/bagel_train/hashes_test_set_v10.json'
1069
  Preparing Dataset vlm_gym_jigsaw_celoss/vlm_gym_jigsaw_val
1070
  Warning: failed loading hashes from /home/jiaxin/bagel_train/hashes_test_set_v10.json: [Errno 2] No such file or directory: '/home/jiaxin/bagel_train/hashes_test_set_v10.json'
1071
  [2025-12-30 06:00:30] (step=0000885) Train Loss mse: 0.0446, Train Loss ce: 0.0648, Train Steps/Sec: 0.09,
1072
  [2025-12-30 06:00:43] (step=0000886) Train Loss mse: 0.0273, Train Loss ce: 0.0610, Train Steps/Sec: 0.08,
1073
  [2025-12-30 06:00:56] (step=0000887) Train Loss mse: 0.0444, Train Loss ce: 0.0651, Train Steps/Sec: 0.08,
checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/wandb/offline-run-20251230_024051-checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed-run0/files/output.log CHANGED
@@ -872,6 +872,24 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
872
  [2025-12-30 06:05:11] (step=0000861) Train Loss mse: 0.0378, Train Loss ce: 0.0571, Train Steps/Sec: 0.06,
873
  [2025-12-30 06:05:25] (step=0000862) Train Loss mse: 0.0404, Train Loss ce: 0.0582, Train Steps/Sec: 0.07,
874
  [2025-12-30 06:05:37] (step=0000863) Train Loss mse: 0.0498, Train Loss ce: 0.0737, Train Steps/Sec: 0.08,
875
  FullyShardedDataParallel(
876
  (_fsdp_wrapped_module): Bagel(
877
  (language_model): Qwen2ForCausalLM(
@@ -1045,24 +1063,6 @@ connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
1045
  vit_pos_embed._fsdp_wrapped_module._flat_param False
1046
  Preparing Dataset vlm_gym_jigsaw_celoss/vlm_gym_jigsaw_train
1047
  Preparing Dataset vlm_gym_jigsaw_celoss/vlm_gym_jigsaw_val
1048
- [2025-12-30 06:05:51] (step=0000864) Train Loss mse: 0.0551, Train Loss ce: 0.0653, Train Steps/Sec: 0.07,
1049
- [2025-12-30 06:06:04] (step=0000865) Train Loss mse: 0.0448, Train Loss ce: 0.0661, Train Steps/Sec: 0.08,
1050
- [2025-12-30 06:06:17] (step=0000866) Train Loss mse: 0.0281, Train Loss ce: 0.0686, Train Steps/Sec: 0.08,
1051
- [2025-12-30 06:06:34] (step=0000867) Train Loss mse: 0.0497, Train Loss ce: 0.0639, Train Steps/Sec: 0.06,
1052
- [2025-12-30 06:06:47] (step=0000868) Train Loss mse: 0.0499, Train Loss ce: 0.0637, Train Steps/Sec: 0.07,
1053
- [2025-12-30 06:07:00] (step=0000869) Train Loss mse: 0.0456, Train Loss ce: 0.0727, Train Steps/Sec: 0.08,
1054
- [2025-12-30 06:07:13] (step=0000870) Train Loss mse: 0.0586, Train Loss ce: 0.0668, Train Steps/Sec: 0.08,
1055
- [2025-12-30 06:07:29] (step=0000871) Train Loss mse: 0.0382, Train Loss ce: 0.0622, Train Steps/Sec: 0.06,
1056
- [2025-12-30 06:07:45] (step=0000872) Train Loss mse: 0.0296, Train Loss ce: 0.0591, Train Steps/Sec: 0.06,
1057
- [2025-12-30 06:07:59] (step=0000873) Train Loss mse: 0.0595, Train Loss ce: 0.0675, Train Steps/Sec: 0.07,
1058
- [2025-12-30 06:08:15] (step=0000874) Train Loss mse: 0.0444, Train Loss ce: 0.0658, Train Steps/Sec: 0.06,
1059
- [2025-12-30 06:08:31] (step=0000875) Train Loss mse: 0.0382, Train Loss ce: 0.0611, Train Steps/Sec: 0.06,
1060
- [2025-12-30 06:08:47] (step=0000876) Train Loss mse: 0.0280, Train Loss ce: 0.0634, Train Steps/Sec: 0.06,
1061
- [2025-12-30 06:09:03] (step=0000877) Train Loss mse: 0.0387, Train Loss ce: 0.0618, Train Steps/Sec: 0.06,
1062
- [2025-12-30 06:09:19] (step=0000878) Train Loss mse: 0.0368, Train Loss ce: 0.0598, Train Steps/Sec: 0.06,
1063
- [2025-12-30 06:09:32] (step=0000879) Train Loss mse: 0.0453, Train Loss ce: 0.0626, Train Steps/Sec: 0.08,
1064
- [2025-12-30 06:09:45] (step=0000880) Train Loss mse: 0.0401, Train Loss ce: 0.0609, Train Steps/Sec: 0.08,
1065
- [2025-12-30 06:10:00] (step=0000881) Train Loss mse: 0.0515, Train Loss ce: 0.0631, Train Steps/Sec: 0.07,
1066
  [2025-12-30 06:10:13] (step=0000882) Train Loss mse: 0.0459, Train Loss ce: 0.0686, Train Steps/Sec: 0.07,
1067
  [2025-12-30 06:10:27] (step=0000883) Train Loss mse: 0.0424, Train Loss ce: 0.0605, Train Steps/Sec: 0.08,
1068
  [2025-12-30 06:10:38] (step=0000884) Train Loss mse: 0.0447, Train Loss ce: 0.0630, Train Steps/Sec: 0.09,
 
872
  [2025-12-30 06:05:11] (step=0000861) Train Loss mse: 0.0378, Train Loss ce: 0.0571, Train Steps/Sec: 0.06,
873
  [2025-12-30 06:05:25] (step=0000862) Train Loss mse: 0.0404, Train Loss ce: 0.0582, Train Steps/Sec: 0.07,
874
  [2025-12-30 06:05:37] (step=0000863) Train Loss mse: 0.0498, Train Loss ce: 0.0737, Train Steps/Sec: 0.08,
875
+ [2025-12-30 06:05:51] (step=0000864) Train Loss mse: 0.0551, Train Loss ce: 0.0653, Train Steps/Sec: 0.07,
876
+ [2025-12-30 06:06:04] (step=0000865) Train Loss mse: 0.0448, Train Loss ce: 0.0661, Train Steps/Sec: 0.08,
877
+ [2025-12-30 06:06:17] (step=0000866) Train Loss mse: 0.0281, Train Loss ce: 0.0686, Train Steps/Sec: 0.08,
878
+ [2025-12-30 06:06:34] (step=0000867) Train Loss mse: 0.0497, Train Loss ce: 0.0639, Train Steps/Sec: 0.06,
879
+ [2025-12-30 06:06:47] (step=0000868) Train Loss mse: 0.0499, Train Loss ce: 0.0637, Train Steps/Sec: 0.07,
880
+ [2025-12-30 06:07:00] (step=0000869) Train Loss mse: 0.0456, Train Loss ce: 0.0727, Train Steps/Sec: 0.08,
881
+ [2025-12-30 06:07:13] (step=0000870) Train Loss mse: 0.0586, Train Loss ce: 0.0668, Train Steps/Sec: 0.08,
882
+ [2025-12-30 06:07:29] (step=0000871) Train Loss mse: 0.0382, Train Loss ce: 0.0622, Train Steps/Sec: 0.06,
883
+ [2025-12-30 06:07:45] (step=0000872) Train Loss mse: 0.0296, Train Loss ce: 0.0591, Train Steps/Sec: 0.06,
884
+ [2025-12-30 06:07:59] (step=0000873) Train Loss mse: 0.0595, Train Loss ce: 0.0675, Train Steps/Sec: 0.07,
885
+ [2025-12-30 06:08:15] (step=0000874) Train Loss mse: 0.0444, Train Loss ce: 0.0658, Train Steps/Sec: 0.06,
886
+ [2025-12-30 06:08:31] (step=0000875) Train Loss mse: 0.0382, Train Loss ce: 0.0611, Train Steps/Sec: 0.06,
887
+ [2025-12-30 06:08:47] (step=0000876) Train Loss mse: 0.0280, Train Loss ce: 0.0634, Train Steps/Sec: 0.06,
888
+ [2025-12-30 06:09:03] (step=0000877) Train Loss mse: 0.0387, Train Loss ce: 0.0618, Train Steps/Sec: 0.06,
889
+ [2025-12-30 06:09:19] (step=0000878) Train Loss mse: 0.0368, Train Loss ce: 0.0598, Train Steps/Sec: 0.06,
890
+ [2025-12-30 06:09:32] (step=0000879) Train Loss mse: 0.0453, Train Loss ce: 0.0626, Train Steps/Sec: 0.08,
891
+ [2025-12-30 06:09:45] (step=0000880) Train Loss mse: 0.0401, Train Loss ce: 0.0609, Train Steps/Sec: 0.08,
892
+ [2025-12-30 06:10:00] (step=0000881) Train Loss mse: 0.0515, Train Loss ce: 0.0631, Train Steps/Sec: 0.07,
893
  FullyShardedDataParallel(
894
  (_fsdp_wrapped_module): Bagel(
895
  (language_model): Qwen2ForCausalLM(
 
1063
  vit_pos_embed._fsdp_wrapped_module._flat_param False
1064
  Preparing Dataset vlm_gym_jigsaw_celoss/vlm_gym_jigsaw_train
1065
  Preparing Dataset vlm_gym_jigsaw_celoss/vlm_gym_jigsaw_val
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1066
  [2025-12-30 06:10:13] (step=0000882) Train Loss mse: 0.0459, Train Loss ce: 0.0686, Train Steps/Sec: 0.07,
1067
  [2025-12-30 06:10:27] (step=0000883) Train Loss mse: 0.0424, Train Loss ce: 0.0605, Train Steps/Sec: 0.08,
1068
  [2025-12-30 06:10:38] (step=0000884) Train Loss mse: 0.0447, Train Loss ce: 0.0630, Train Steps/Sec: 0.09,
checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/wandb/offline-run-20260104_093254-checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed-run0/files/output.log CHANGED
@@ -1,183 +1,3 @@
1
- FullyShardedDataParallel(
2
- (_fsdp_wrapped_module): Bagel(
3
- (language_model): Qwen2ForCausalLM(
4
- (model): Qwen2Model(
5
- (embed_tokens): Embedding(152064, 3584)
6
- (layers): ModuleList(
7
- (0-27): 28 x FullyShardedDataParallel(
8
- (_fsdp_wrapped_module): CheckpointWrapper(
9
- (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
10
- (self_attn): PackedAttentionMoT(
11
- (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
12
- (k_proj): Linear(in_features=3584, out_features=512, bias=True)
13
- (v_proj): Linear(in_features=3584, out_features=512, bias=True)
14
- (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
15
- (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
16
- (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
17
- (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
18
- (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
19
- (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
20
- (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
21
- (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
22
- (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
23
- )
24
- (mlp): Qwen2MLP(
25
- (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
26
- (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
27
- (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
28
- (act_fn): SiLU()
29
- )
30
- (mlp_moe_gen): Qwen2MLP(
31
- (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
32
- (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
33
- (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
34
- (act_fn): SiLU()
35
- )
36
- (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
37
- (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
38
- (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
39
- (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
40
- )
41
- )
42
- )
43
- )
44
- (norm): Qwen2RMSNorm((3584,), eps=1e-06)
45
- (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
46
- (rotary_emb): Qwen2RotaryEmbedding()
47
- )
48
- (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
49
- )
50
- (time_embedder): FullyShardedDataParallel(
51
- (_fsdp_wrapped_module): TimestepEmbedder(
- (mlp): Sequential(
- (0): Linear(in_features=256, out_features=3584, bias=True)
- (1): SiLU()
- (2): Linear(in_features=3584, out_features=3584, bias=True)
- )
- )
- )
- (vae2llm): Linear(in_features=64, out_features=3584, bias=True)
- (llm2vae): Linear(in_features=3584, out_features=64, bias=True)
- (latent_pos_embed): FullyShardedDataParallel(
- (_fsdp_wrapped_module): PositionEmbedding()
- )
- (vit_model): SiglipVisionModel(
- (vision_model): FullyShardedDataParallel(
- (_fsdp_wrapped_module): SiglipVisionTransformer(
- (embeddings): SiglipVisionEmbeddings(
- (position_embedding): Embedding(4900, 1152)
- (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
- )
- (encoder): SiglipEncoder(
- (layers): ModuleList(
- (0-25): 26 x FullyShardedDataParallel(
- (_fsdp_wrapped_module): CheckpointWrapper(
- (_checkpoint_wrapped_module): SiglipEncoderLayer(
- (self_attn): SiglipFlashAttention2(
- (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
- (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
- (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
- (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
- )
- (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
- (mlp): SiglipMLP(
- (activation_fn): PytorchGELUTanh()
- (fc1): Linear(in_features=1152, out_features=4304, bias=True)
- (fc2): Linear(in_features=4304, out_features=1152, bias=True)
- )
- (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
- )
- )
- )
- )
- )
- )
- (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
- )
- )
- )
- (connector): FullyShardedDataParallel(
- (_fsdp_wrapped_module): CheckpointWrapper(
- (_checkpoint_wrapped_module): MLPconnector(
- (activation_fn): PytorchGELUTanh()
- (fc1): Linear(in_features=1152, out_features=3584, bias=True)
- (fc2): Linear(in_features=3584, out_features=3584, bias=True)
- )
- )
- )
- (vit_pos_embed): FullyShardedDataParallel(
- (_fsdp_wrapped_module): PositionEmbedding()
- )
- )
- )
- _flat_param True
- language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- time_embedder._fsdp_wrapped_module._flat_param True
- latent_pos_embed._fsdp_wrapped_module._flat_param False
- vit_model.vision_model._fsdp_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_pos_embed._fsdp_wrapped_module._flat_param False
- Preparing Dataset vlm_gym_jigsaw_celoss/vlm_gym_jigsaw_train
- base_dir is results/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step0
- Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
- ce_avg: 0.40467292070388794, mse_avg: 0.06533115357160568
- base_dir is results/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step500
  wandb: Detected [huggingface_hub.inference] in use.
  wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
  wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
@@ -719,4 +539,184 @@ AssertionError
  [rank0]: File "/home/clouduser/Code/Github/unified_world_model/data/dataset_base.py", line 115, in build_datasets
  [rank0]: assert 'dataset_names' in dataset_args.keys()
  [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- [rank0]: AssertionError
  wandb: Detected [huggingface_hub.inference] in use.
  wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
  wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/

  [rank0]: File "/home/clouduser/Code/Github/unified_world_model/data/dataset_base.py", line 115, in build_datasets
  [rank0]: assert 'dataset_names' in dataset_args.keys()
  [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: AssertionError
+ FullyShardedDataParallel(
+ (_fsdp_wrapped_module): Bagel(
+ (language_model): Qwen2ForCausalLM(
+ (model): Qwen2Model(
+ (embed_tokens): Embedding(152064, 3584)
+ (layers): ModuleList(
+ (0-27): 28 x FullyShardedDataParallel(
+ (_fsdp_wrapped_module): CheckpointWrapper(
+ (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
+ (self_attn): PackedAttentionMoT(
+ (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
+ (k_proj): Linear(in_features=3584, out_features=512, bias=True)
+ (v_proj): Linear(in_features=3584, out_features=512, bias=True)
+ (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
+ (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
+ (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
+ (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
+ (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
+ (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
+ (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
+ (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
+ (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
+ )
+ (mlp): Qwen2MLP(
+ (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
+ (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
+ (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
+ (act_fn): SiLU()
+ )
+ (mlp_moe_gen): Qwen2MLP(
+ (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
+ (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
+ (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
+ (act_fn): SiLU()
+ )
+ (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
+ (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+ (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
+ (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+ )
+ )
+ )
+ )
+ (norm): Qwen2RMSNorm((3584,), eps=1e-06)
+ (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+ (rotary_emb): Qwen2RotaryEmbedding()
+ )
+ (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
+ )
+ (time_embedder): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): TimestepEmbedder(
+ (mlp): Sequential(
+ (0): Linear(in_features=256, out_features=3584, bias=True)
+ (1): SiLU()
+ (2): Linear(in_features=3584, out_features=3584, bias=True)
+ )
+ )
+ )
+ (vae2llm): Linear(in_features=64, out_features=3584, bias=True)
+ (llm2vae): Linear(in_features=3584, out_features=64, bias=True)
+ (latent_pos_embed): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): PositionEmbedding()
+ )
+ (vit_model): SiglipVisionModel(
+ (vision_model): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): SiglipVisionTransformer(
+ (embeddings): SiglipVisionEmbeddings(
+ (position_embedding): Embedding(4900, 1152)
+ (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
+ )
+ (encoder): SiglipEncoder(
+ (layers): ModuleList(
+ (0-25): 26 x FullyShardedDataParallel(
+ (_fsdp_wrapped_module): CheckpointWrapper(
+ (_checkpoint_wrapped_module): SiglipEncoderLayer(
+ (self_attn): SiglipFlashAttention2(
+ (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
+ (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
+ (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
+ (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
+ )
+ (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+ (mlp): SiglipMLP(
+ (activation_fn): PytorchGELUTanh()
+ (fc1): Linear(in_features=1152, out_features=4304, bias=True)
+ (fc2): Linear(in_features=4304, out_features=1152, bias=True)
+ )
+ (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+ )
+ )
+ )
+ )
+ )
+ (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+ )
+ )
+ )
+ (connector): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): CheckpointWrapper(
+ (_checkpoint_wrapped_module): MLPconnector(
+ (activation_fn): PytorchGELUTanh()
+ (fc1): Linear(in_features=1152, out_features=3584, bias=True)
+ (fc2): Linear(in_features=3584, out_features=3584, bias=True)
+ )
+ )
+ )
+ (vit_pos_embed): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): PositionEmbedding()
+ )
+ )
+ )
+ _flat_param True
+ language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ time_embedder._fsdp_wrapped_module._flat_param True
+ latent_pos_embed._fsdp_wrapped_module._flat_param False
+ vit_model.vision_model._fsdp_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_pos_embed._fsdp_wrapped_module._flat_param False
+ Preparing Dataset vlm_gym_jigsaw_celoss/vlm_gym_jigsaw_train
+ base_dir is results/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step0
+ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ ce_avg: 0.40467292070388794, mse_avg: 0.06533115357160568
+ base_dir is results/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step500
checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/wandb/offline-run-20260105_090343-checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed-run0/files/output.log CHANGED
@@ -849,6 +849,12 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
  [2026-01-05 12:24:55] (step=0000838) Train Loss mse: 0.0370, Train Loss ce: 0.0676, Train Steps/Sec: 0.06,
  [2026-01-05 12:25:08] (step=0000839) Train Loss mse: 0.0387, Train Loss ce: 0.0647, Train Steps/Sec: 0.07,
  [2026-01-05 12:25:24] (step=0000840) Train Loss mse: 0.0463, Train Loss ce: 0.0704, Train Steps/Sec: 0.06,

  FullyShardedDataParallel(
  (_fsdp_wrapped_module): Bagel(
  (language_model): Qwen2ForCausalLM(
@@ -1035,12 +1041,20 @@ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
  fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
  fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
  ce_avg: 0.07113238424062729, mse_avg: 0.043496306985616684
- [2026-01-05 12:25:37] (step=0000841) Train Loss mse: 0.0448, Train Loss ce: 0.0663, Train Steps/Sec: 0.08,
- [2026-01-05 12:25:50] (step=0000842) Train Loss mse: 0.0398, Train Loss ce: 0.0621, Train Steps/Sec: 0.08,
- [2026-01-05 12:26:06] (step=0000843) Train Loss mse: 0.0387, Train Loss ce: 0.0604, Train Steps/Sec: 0.06,
- [2026-01-05 12:26:20] (step=0000844) Train Loss mse: 0.0499, Train Loss ce: 0.0661, Train Steps/Sec: 0.07,
- [2026-01-05 12:26:34] (step=0000845) Train Loss mse: 0.0324, Train Loss ce: 0.0699, Train Steps/Sec: 0.07,
- [2026-01-05 12:26:46] (step=0000846) Train Loss mse: 0.0344, Train Loss ce: 0.0596, Train Steps/Sec: 0.08,

  [2026-01-05 12:27:02] (step=0000847) Train Loss mse: 0.0251, Train Loss ce: 0.0612, Train Steps/Sec: 0.06,
  [2026-01-05 12:27:15] (step=0000848) Train Loss mse: 0.0327, Train Loss ce: 0.0664, Train Steps/Sec: 0.08,
  [2026-01-05 12:27:31] (step=0000849) Train Loss mse: 0.0508, Train Loss ce: 0.0639, Train Steps/Sec: 0.06,
@@ -2179,18 +2193,4 @@ ce_avg: 0.07113238424062729, mse_avg: 0.043496306985616684
  [2026-01-05 16:48:49] (step=0001982) Train Loss mse: 0.0501, Train Loss ce: 0.0636, Train Steps/Sec: 0.09,
  [2026-01-05 16:49:02] (step=0001983) Train Loss mse: 0.0559, Train Loss ce: 0.0639, Train Steps/Sec: 0.08,
  [2026-01-05 16:49:14] (step=0001984) Train Loss mse: 0.0539, Train Loss ce: 0.0606, Train Steps/Sec: 0.08,
- [2026-01-05 16:49:30] (step=0001985) Train Loss mse: 0.0312, Train Loss ce: 0.0595, Train Steps/Sec: 0.06,
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step1000
- Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
- ce_avg: 0.09860303997993469, mse_avg: 0.04136444255709648
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step1500
- Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
- ce_avg: 0.13110212981700897, mse_avg: 0.041821885854005814

  [2026-01-05 12:24:55] (step=0000838) Train Loss mse: 0.0370, Train Loss ce: 0.0676, Train Steps/Sec: 0.06,
  [2026-01-05 12:25:08] (step=0000839) Train Loss mse: 0.0387, Train Loss ce: 0.0647, Train Steps/Sec: 0.07,
  [2026-01-05 12:25:24] (step=0000840) Train Loss mse: 0.0463, Train Loss ce: 0.0704, Train Steps/Sec: 0.06,
+ [2026-01-05 12:25:37] (step=0000841) Train Loss mse: 0.0448, Train Loss ce: 0.0663, Train Steps/Sec: 0.08,
+ [2026-01-05 12:25:50] (step=0000842) Train Loss mse: 0.0398, Train Loss ce: 0.0621, Train Steps/Sec: 0.08,
+ [2026-01-05 12:26:06] (step=0000843) Train Loss mse: 0.0387, Train Loss ce: 0.0604, Train Steps/Sec: 0.06,
+ [2026-01-05 12:26:20] (step=0000844) Train Loss mse: 0.0499, Train Loss ce: 0.0661, Train Steps/Sec: 0.07,
+ [2026-01-05 12:26:34] (step=0000845) Train Loss mse: 0.0324, Train Loss ce: 0.0699, Train Steps/Sec: 0.07,
+ [2026-01-05 12:26:46] (step=0000846) Train Loss mse: 0.0344, Train Loss ce: 0.0596, Train Steps/Sec: 0.08,
  FullyShardedDataParallel(
  (_fsdp_wrapped_module): Bagel(
  (language_model): Qwen2ForCausalLM(

  fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
  fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
  ce_avg: 0.07113238424062729, mse_avg: 0.043496306985616684
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step1000
+ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ ce_avg: 0.09860303997993469, mse_avg: 0.04136444255709648
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step1500
+ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ ce_avg: 0.13110212981700897, mse_avg: 0.041821885854005814
  [2026-01-05 12:27:02] (step=0000847) Train Loss mse: 0.0251, Train Loss ce: 0.0612, Train Steps/Sec: 0.06,
  [2026-01-05 12:27:15] (step=0000848) Train Loss mse: 0.0327, Train Loss ce: 0.0664, Train Steps/Sec: 0.08,
  [2026-01-05 12:27:31] (step=0000849) Train Loss mse: 0.0508, Train Loss ce: 0.0639, Train Steps/Sec: 0.06,

  [2026-01-05 16:48:49] (step=0001982) Train Loss mse: 0.0501, Train Loss ce: 0.0636, Train Steps/Sec: 0.09,
  [2026-01-05 16:49:02] (step=0001983) Train Loss mse: 0.0559, Train Loss ce: 0.0639, Train Steps/Sec: 0.08,
  [2026-01-05 16:49:14] (step=0001984) Train Loss mse: 0.0539, Train Loss ce: 0.0606, Train Steps/Sec: 0.08,
+ [2026-01-05 16:49:30] (step=0001985) Train Loss mse: 0.0312, Train Loss ce: 0.0595, Train Steps/Sec: 0.06,
checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/wandb/offline-run-20260107_093905-checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed-run0/files/output.log CHANGED
@@ -1,189 +1,3 @@
- FullyShardedDataParallel(
- (_fsdp_wrapped_module): Bagel(
- (language_model): Qwen2ForCausalLM(
- (model): Qwen2Model(
- (embed_tokens): Embedding(152064, 3584)
- (layers): ModuleList(
- (0-27): 28 x FullyShardedDataParallel(
- (_fsdp_wrapped_module): CheckpointWrapper(
- (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
- (self_attn): PackedAttentionMoT(
- (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
- (k_proj): Linear(in_features=3584, out_features=512, bias=True)
- (v_proj): Linear(in_features=3584, out_features=512, bias=True)
- (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
- (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
- (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
- (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
- (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
- (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
- (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
- (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
- (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
- )
- (mlp): Qwen2MLP(
- (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
- (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
- (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
- (act_fn): SiLU()
- )
- (mlp_moe_gen): Qwen2MLP(
- (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
- (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
- (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
- (act_fn): SiLU()
- )
- (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
- (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
- (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
- (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
- )
- )
- )
- )
- (norm): Qwen2RMSNorm((3584,), eps=1e-06)
- (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
- (rotary_emb): Qwen2RotaryEmbedding()
- )
- (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
- )
- (time_embedder): FullyShardedDataParallel(
- (_fsdp_wrapped_module): TimestepEmbedder(
- (mlp): Sequential(
- (0): Linear(in_features=256, out_features=3584, bias=True)
- (1): SiLU()
- (2): Linear(in_features=3584, out_features=3584, bias=True)
- )
- )
- )
- (vae2llm): Linear(in_features=64, out_features=3584, bias=True)
- (llm2vae): Linear(in_features=3584, out_features=64, bias=True)
- (latent_pos_embed): FullyShardedDataParallel(
- (_fsdp_wrapped_module): PositionEmbedding()
- )
- (vit_model): SiglipVisionModel(
- (vision_model): FullyShardedDataParallel(
- (_fsdp_wrapped_module): SiglipVisionTransformer(
- (embeddings): SiglipVisionEmbeddings(
- (position_embedding): Embedding(4900, 1152)
- (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
- )
- (encoder): SiglipEncoder(
- (layers): ModuleList(
- (0-25): 26 x FullyShardedDataParallel(
- (_fsdp_wrapped_module): CheckpointWrapper(
- (_checkpoint_wrapped_module): SiglipEncoderLayer(
- (self_attn): SiglipFlashAttention2(
- (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
- (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
- (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
- (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
- )
- (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
- (mlp): SiglipMLP(
- (activation_fn): PytorchGELUTanh()
- (fc1): Linear(in_features=1152, out_features=4304, bias=True)
- (fc2): Linear(in_features=4304, out_features=1152, bias=True)
- )
- (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
- )
- )
- )
- )
- )
- (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
- )
- )
- )
- (connector): FullyShardedDataParallel(
- (_fsdp_wrapped_module): CheckpointWrapper(
- (_checkpoint_wrapped_module): MLPconnector(
- (activation_fn): PytorchGELUTanh()
- (fc1): Linear(in_features=1152, out_features=3584, bias=True)
103
- (fc2): Linear(in_features=3584, out_features=3584, bias=True)
104
- )
105
- )
106
- )
107
- (vit_pos_embed): FullyShardedDataParallel(
108
- (_fsdp_wrapped_module): PositionEmbedding()
109
- )
110
- )
111
- )
112
- _flat_param True
113
- language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
114
- language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
115
- language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
116
- language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
117
- language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
118
- language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
119
- language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
120
- language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
121
- language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
122
- language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
123
- language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
124
- language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
125
- language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
126
- language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
127
- language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
128
- language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
129
- language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
130
- language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
131
- language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
132
- language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
133
- language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
134
- language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
135
- language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
136
- language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
137
- language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
138
- language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
139
- language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
140
- language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
141
- time_embedder._fsdp_wrapped_module._flat_param True
142
- latent_pos_embed._fsdp_wrapped_module._flat_param False
143
- vit_model.vision_model._fsdp_wrapped_module._flat_param True
144
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
145
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
146
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
147
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
148
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
149
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
150
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
151
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
152
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
153
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
154
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
155
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
156
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
157
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
158
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
159
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
160
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
161
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
162
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
163
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
164
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
165
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
166
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
167
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
168
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
169
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
170
- connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
171
- vit_pos_embed._fsdp_wrapped_module._flat_param False
172
- Preparing Dataset vlm_gym_jigsaw_celoss/vlm_gym_jigsaw_train
173
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step0
174
- Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
175
- [eval debug] first 3 batch fingerprints:
176
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
177
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
178
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
179
- ce_avg: 0.40467292070388794, mse_avg: 0.06533115357160568
180
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step500
181
- Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
182
- [eval debug] first 3 batch fingerprints:
183
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
184
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
185
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
186
- ce_avg: 0.07139278203248978, mse_avg: 0.04338355362415314
187
  wandb: Detected [huggingface_hub.inference] in use.
188
  wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
189
  wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
@@ -1014,6 +828,206 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
  [2026-01-07 12:55:35] (step=0000817) Train Loss mse: 0.0457, Train Loss ce: 0.0721, Train Steps/Sec: 0.06,
  [2026-01-07 12:55:48] (step=0000818) Train Loss mse: 0.0351, Train Loss ce: 0.0605, Train Steps/Sec: 0.08,
  [2026-01-07 12:55:59] (step=0000819) Train Loss mse: 0.0485, Train Loss ce: 0.0618, Train Steps/Sec: 0.09,
  [2026-01-07 12:56:10] (step=0000820) Train Loss mse: 0.0426, Train Loss ce: 0.0636, Train Steps/Sec: 0.09,
  [2026-01-07 12:56:23] (step=0000821) Train Loss mse: 0.0290, Train Loss ce: 0.0615, Train Steps/Sec: 0.07,
  [2026-01-07 12:56:34] (step=0000822) Train Loss mse: 0.0403, Train Loss ce: 0.0646, Train Steps/Sec: 0.09,
@@ -1057,20 +1071,6 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
  [2026-01-07 13:05:13] (step=0000860) Train Loss mse: 0.0359, Train Loss ce: 0.0613, Train Steps/Sec: 0.08,
  [2026-01-07 13:05:29] (step=0000861) Train Loss mse: 0.0381, Train Loss ce: 0.0595, Train Steps/Sec: 0.06,
  [2026-01-07 13:05:42] (step=0000862) Train Loss mse: 0.0404, Train Loss ce: 0.0619, Train Steps/Sec: 0.07,
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step1000
- Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
- ce_avg: 0.09475355595350266, mse_avg: 0.04140673205256462
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step1500
- Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
- ce_avg: 0.1198282465338707, mse_avg: 0.04170006886124611
  [2026-01-07 13:05:55] (step=0000863) Train Loss mse: 0.0506, Train Loss ce: 0.0734, Train Steps/Sec: 0.08,
  [2026-01-07 13:06:08] (step=0000864) Train Loss mse: 0.0548, Train Loss ce: 0.0637, Train Steps/Sec: 0.07,
  [2026-01-07 13:06:22] (step=0000865) Train Loss mse: 0.0454, Train Loss ce: 0.0644, Train Steps/Sec: 0.07,
  [2026-01-07 12:55:35] (step=0000817) Train Loss mse: 0.0457, Train Loss ce: 0.0721, Train Steps/Sec: 0.06,
  [2026-01-07 12:55:48] (step=0000818) Train Loss mse: 0.0351, Train Loss ce: 0.0605, Train Steps/Sec: 0.08,
  [2026-01-07 12:55:59] (step=0000819) Train Loss mse: 0.0485, Train Loss ce: 0.0618, Train Steps/Sec: 0.09,
+ FullyShardedDataParallel(
+ (_fsdp_wrapped_module): Bagel(
+ (language_model): Qwen2ForCausalLM(
+ (model): Qwen2Model(
+ (embed_tokens): Embedding(152064, 3584)
+ (layers): ModuleList(
+ (0-27): 28 x FullyShardedDataParallel(
+ (_fsdp_wrapped_module): CheckpointWrapper(
+ (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
+ (self_attn): PackedAttentionMoT(
+ (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
+ (k_proj): Linear(in_features=3584, out_features=512, bias=True)
+ (v_proj): Linear(in_features=3584, out_features=512, bias=True)
+ (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
+ (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
+ (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
+ (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
+ (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
+ (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
+ (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
+ (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
+ (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
+ )
+ (mlp): Qwen2MLP(
+ (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
+ (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
+ (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
+ (act_fn): SiLU()
+ )
+ (mlp_moe_gen): Qwen2MLP(
+ (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
+ (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
+ (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
+ (act_fn): SiLU()
+ )
+ (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
+ (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+ (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
+ (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+ )
+ )
+ )
+ )
+ (norm): Qwen2RMSNorm((3584,), eps=1e-06)
+ (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+ (rotary_emb): Qwen2RotaryEmbedding()
+ )
+ (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
+ )
+ (time_embedder): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): TimestepEmbedder(
+ (mlp): Sequential(
+ (0): Linear(in_features=256, out_features=3584, bias=True)
+ (1): SiLU()
+ (2): Linear(in_features=3584, out_features=3584, bias=True)
+ )
+ )
+ )
+ (vae2llm): Linear(in_features=64, out_features=3584, bias=True)
+ (llm2vae): Linear(in_features=3584, out_features=64, bias=True)
+ (latent_pos_embed): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): PositionEmbedding()
+ )
+ (vit_model): SiglipVisionModel(
+ (vision_model): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): SiglipVisionTransformer(
+ (embeddings): SiglipVisionEmbeddings(
+ (position_embedding): Embedding(4900, 1152)
+ (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
+ )
+ (encoder): SiglipEncoder(
+ (layers): ModuleList(
+ (0-25): 26 x FullyShardedDataParallel(
+ (_fsdp_wrapped_module): CheckpointWrapper(
+ (_checkpoint_wrapped_module): SiglipEncoderLayer(
+ (self_attn): SiglipFlashAttention2(
+ (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
+ (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
+ (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
+ (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
+ )
+ (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+ (mlp): SiglipMLP(
+ (activation_fn): PytorchGELUTanh()
+ (fc1): Linear(in_features=1152, out_features=4304, bias=True)
+ (fc2): Linear(in_features=4304, out_features=1152, bias=True)
+ )
+ (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+ )
+ )
+ )
+ )
+ )
+ (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+ )
+ )
+ )
+ (connector): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): CheckpointWrapper(
+ (_checkpoint_wrapped_module): MLPconnector(
+ (activation_fn): PytorchGELUTanh()
+ (fc1): Linear(in_features=1152, out_features=3584, bias=True)
+ (fc2): Linear(in_features=3584, out_features=3584, bias=True)
+ )
+ )
+ )
+ (vit_pos_embed): FullyShardedDataParallel(
+ (_fsdp_wrapped_module): PositionEmbedding()
+ )
+ )
+ )
+ _flat_param True
+ language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ time_embedder._fsdp_wrapped_module._flat_param True
+ latent_pos_embed._fsdp_wrapped_module._flat_param False
+ vit_model.vision_model._fsdp_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ vit_pos_embed._fsdp_wrapped_module._flat_param False
+ Preparing Dataset vlm_gym_jigsaw_celoss/vlm_gym_jigsaw_train
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step0
+ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ ce_avg: 0.40467292070388794, mse_avg: 0.06533115357160568
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step500
+ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ ce_avg: 0.07139278203248978, mse_avg: 0.04338355362415314
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step1000
+ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ ce_avg: 0.09475355595350266, mse_avg: 0.04140673205256462
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step1500
+ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
+ [eval debug] first 3 batch fingerprints:
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
+ ce_avg: 0.1198282465338707, mse_avg: 0.04170006886124611
  [2026-01-07 12:56:10] (step=0000820) Train Loss mse: 0.0426, Train Loss ce: 0.0636, Train Steps/Sec: 0.09,
  [2026-01-07 12:56:23] (step=0000821) Train Loss mse: 0.0290, Train Loss ce: 0.0615, Train Steps/Sec: 0.07,
  [2026-01-07 12:56:34] (step=0000822) Train Loss mse: 0.0403, Train Loss ce: 0.0646, Train Steps/Sec: 0.09,
 
  [2026-01-07 13:05:13] (step=0000860) Train Loss mse: 0.0359, Train Loss ce: 0.0613, Train Steps/Sec: 0.08,
  [2026-01-07 13:05:29] (step=0000861) Train Loss mse: 0.0381, Train Loss ce: 0.0595, Train Steps/Sec: 0.06,
  [2026-01-07 13:05:42] (step=0000862) Train Loss mse: 0.0404, Train Loss ce: 0.0619, Train Steps/Sec: 0.07,
  [2026-01-07 13:05:55] (step=0000863) Train Loss mse: 0.0506, Train Loss ce: 0.0734, Train Steps/Sec: 0.08,
  [2026-01-07 13:06:08] (step=0000864) Train Loss mse: 0.0548, Train Loss ce: 0.0637, Train Steps/Sec: 0.07,
  [2026-01-07 13:06:22] (step=0000865) Train Loss mse: 0.0454, Train Loss ce: 0.0644, Train Steps/Sec: 0.07,
checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/wandb/offline-run-20260107_184933-checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed-run0/files/output.log CHANGED
@@ -817,63 +817,6 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
  [2026-01-07 22:03:25] (step=0000806) Train Loss mse: 0.0474, Train Loss ce: 0.0616, Train Steps/Sec: 0.08,
  [2026-01-07 22:03:37] (step=0000807) Train Loss mse: 0.0525, Train Loss ce: 0.0598, Train Steps/Sec: 0.08,
  [2026-01-07 22:03:49] (step=0000808) Train Loss mse: 0.0454, Train Loss ce: 0.0603, Train Steps/Sec: 0.09,
- [2026-01-07 22:04:05] (step=0000809) Train Loss mse: 0.0326, Train Loss ce: 0.0647, Train Steps/Sec: 0.06,
- [2026-01-07 22:04:17] (step=0000810) Train Loss mse: 0.0437, Train Loss ce: 0.0584, Train Steps/Sec: 0.08,
- [2026-01-07 22:04:29] (step=0000811) Train Loss mse: 0.0597, Train Loss ce: 0.0633, Train Steps/Sec: 0.08,
- [2026-01-07 22:04:41] (step=0000812) Train Loss mse: 0.0387, Train Loss ce: 0.0599, Train Steps/Sec: 0.09,
- [2026-01-07 22:04:55] (step=0000813) Train Loss mse: 0.0491, Train Loss ce: 0.0613, Train Steps/Sec: 0.07,
- [2026-01-07 22:05:08] (step=0000814) Train Loss mse: 0.0546, Train Loss ce: 0.0671, Train Steps/Sec: 0.08,
- [2026-01-07 22:05:24] (step=0000815) Train Loss mse: 0.0472, Train Loss ce: 0.0726, Train Steps/Sec: 0.06,
- [2026-01-07 22:05:41] (step=0000816) Train Loss mse: 0.0341, Train Loss ce: 0.0607, Train Steps/Sec: 0.06,
- [2026-01-07 22:05:56] (step=0000817) Train Loss mse: 0.0452, Train Loss ce: 0.0611, Train Steps/Sec: 0.06,
- [2026-01-07 22:06:10] (step=0000818) Train Loss mse: 0.0349, Train Loss ce: 0.0640, Train Steps/Sec: 0.08,
- [2026-01-07 22:06:21] (step=0000819) Train Loss mse: 0.0482, Train Loss ce: 0.0636, Train Steps/Sec: 0.09,
- [2026-01-07 22:06:32] (step=0000820) Train Loss mse: 0.0424, Train Loss ce: 0.0601, Train Steps/Sec: 0.09,
- [2026-01-07 22:06:45] (step=0000821) Train Loss mse: 0.0306, Train Loss ce: 0.0608, Train Steps/Sec: 0.08,
- [2026-01-07 22:06:56] (step=0000822) Train Loss mse: 0.0402, Train Loss ce: 0.0640, Train Steps/Sec: 0.09,
- [2026-01-07 22:07:10] (step=0000823) Train Loss mse: 0.0510, Train Loss ce: 0.0636, Train Steps/Sec: 0.08,
- [2026-01-07 22:07:26] (step=0000824) Train Loss mse: 0.0551, Train Loss ce: 0.0690, Train Steps/Sec: 0.06,
- [2026-01-07 22:07:39] (step=0000825) Train Loss mse: 0.0469, Train Loss ce: 0.0644, Train Steps/Sec: 0.07,
- [2026-01-07 22:07:52] (step=0000826) Train Loss mse: 0.0598, Train Loss ce: 0.0676, Train Steps/Sec: 0.08,
- [2026-01-07 22:08:02] (step=0000827) Train Loss mse: 0.0474, Train Loss ce: 0.0626, Train Steps/Sec: 0.09,
- [2026-01-07 22:08:16] (step=0000828) Train Loss mse: 0.0396, Train Loss ce: 0.0633, Train Steps/Sec: 0.07,
- [2026-01-07 22:08:32] (step=0000829) Train Loss mse: 0.0341, Train Loss ce: 0.0630, Train Steps/Sec: 0.06,
- [2026-01-07 22:08:44] (step=0000830) Train Loss mse: 0.0481, Train Loss ce: 0.0613, Train Steps/Sec: 0.08,
- [2026-01-07 22:08:56] (step=0000831) Train Loss mse: 0.0491, Train Loss ce: 0.0609, Train Steps/Sec: 0.08,
- [2026-01-07 22:09:08] (step=0000832) Train Loss mse: 0.0519, Train Loss ce: 0.0643, Train Steps/Sec: 0.08,
- [2026-01-07 22:09:24] (step=0000833) Train Loss mse: 0.0422, Train Loss ce: 0.0661, Train Steps/Sec: 0.06,
- [2026-01-07 22:09:36] (step=0000834) Train Loss mse: 0.0502, Train Loss ce: 0.0766, Train Steps/Sec: 0.09,
- [2026-01-07 22:09:51] (step=0000835) Train Loss mse: 0.0386, Train Loss ce: 0.0603, Train Steps/Sec: 0.06,
- [2026-01-07 22:10:05] (step=0000836) Train Loss mse: 0.0366, Train Loss ce: 0.0586, Train Steps/Sec: 0.07,
- [2026-01-07 22:10:17] (step=0000837) Train Loss mse: 0.0561, Train Loss ce: 0.0694, Train Steps/Sec: 0.08,
849
- [2026-01-07 22:10:33] (step=0000838) Train Loss mse: 0.0368, Train Loss ce: 0.0644, Train Steps/Sec: 0.06,
850
- [2026-01-07 22:10:46] (step=0000839) Train Loss mse: 0.0375, Train Loss ce: 0.0635, Train Steps/Sec: 0.07,
851
- [2026-01-07 22:11:02] (step=0000840) Train Loss mse: 0.0462, Train Loss ce: 0.0678, Train Steps/Sec: 0.06,
852
- [2026-01-07 22:11:15] (step=0000841) Train Loss mse: 0.0450, Train Loss ce: 0.0676, Train Steps/Sec: 0.08,
853
- [2026-01-07 22:11:28] (step=0000842) Train Loss mse: 0.0397, Train Loss ce: 0.0628, Train Steps/Sec: 0.08,
854
- [2026-01-07 22:11:44] (step=0000843) Train Loss mse: 0.0385, Train Loss ce: 0.0599, Train Steps/Sec: 0.06,
855
- [2026-01-07 22:11:57] (step=0000844) Train Loss mse: 0.0495, Train Loss ce: 0.0653, Train Steps/Sec: 0.07,
856
- [2026-01-07 22:12:11] (step=0000845) Train Loss mse: 0.0314, Train Loss ce: 0.0680, Train Steps/Sec: 0.07,
857
- [2026-01-07 22:12:24] (step=0000846) Train Loss mse: 0.0336, Train Loss ce: 0.0606, Train Steps/Sec: 0.08,
858
- [2026-01-07 22:12:40] (step=0000847) Train Loss mse: 0.0251, Train Loss ce: 0.0687, Train Steps/Sec: 0.06,
859
- [2026-01-07 22:12:53] (step=0000848) Train Loss mse: 0.0325, Train Loss ce: 0.0635, Train Steps/Sec: 0.08,
860
- [2026-01-07 22:13:09] (step=0000849) Train Loss mse: 0.0502, Train Loss ce: 0.0647, Train Steps/Sec: 0.06,
861
- [2026-01-07 22:13:21] (step=0000850) Train Loss mse: 0.0455, Train Loss ce: 0.0658, Train Steps/Sec: 0.08,
862
- [2026-01-07 22:13:34] (step=0000851) Train Loss mse: 0.0463, Train Loss ce: 0.0628, Train Steps/Sec: 0.07,
863
- [2026-01-07 22:13:47] (step=0000852) Train Loss mse: 0.0526, Train Loss ce: 0.0703, Train Steps/Sec: 0.08,
864
- [2026-01-07 22:13:59] (step=0000853) Train Loss mse: 0.0485, Train Loss ce: 0.0628, Train Steps/Sec: 0.08,
865
- [2026-01-07 22:14:15] (step=0000854) Train Loss mse: 0.0407, Train Loss ce: 0.0654, Train Steps/Sec: 0.06,
866
- [2026-01-07 22:14:28] (step=0000855) Train Loss mse: 0.0481, Train Loss ce: 0.0616, Train Steps/Sec: 0.08,
867
- [2026-01-07 22:14:42] (step=0000856) Train Loss mse: 0.0525, Train Loss ce: 0.0702, Train Steps/Sec: 0.07,
868
- [2026-01-07 22:14:55] (step=0000857) Train Loss mse: 0.0311, Train Loss ce: 0.0600, Train Steps/Sec: 0.07,
869
- [2026-01-07 22:15:07] (step=0000858) Train Loss mse: 0.0595, Train Loss ce: 0.0597, Train Steps/Sec: 0.08,
870
- [2026-01-07 22:15:21] (step=0000859) Train Loss mse: 0.0382, Train Loss ce: 0.0590, Train Steps/Sec: 0.07,
871
- [2026-01-07 22:15:34] (step=0000860) Train Loss mse: 0.0360, Train Loss ce: 0.0655, Train Steps/Sec: 0.08,
872
- [2026-01-07 22:15:50] (step=0000861) Train Loss mse: 0.0379, Train Loss ce: 0.0579, Train Steps/Sec: 0.06,
873
- [2026-01-07 22:16:04] (step=0000862) Train Loss mse: 0.0402, Train Loss ce: 0.0612, Train Steps/Sec: 0.07,
874
- [2026-01-07 22:16:16] (step=0000863) Train Loss mse: 0.0490, Train Loss ce: 0.0752, Train Steps/Sec: 0.08,
875
- [2026-01-07 22:16:30] (step=0000864) Train Loss mse: 0.0547, Train Loss ce: 0.0663, Train Steps/Sec: 0.07,
876
- [2026-01-07 22:16:43] (step=0000865) Train Loss mse: 0.0452, Train Loss ce: 0.0632, Train Steps/Sec: 0.07,
877
  FullyShardedDataParallel(
878
  (_fsdp_wrapped_module): Bagel(
879
  (language_model): Qwen2ForCausalLM(
@@ -1074,13 +1017,63 @@ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
1074
  fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
1075
  fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
1076
  ce_avg: 0.1378343403339386, mse_avg: 0.041579172015190125
1077
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step2000
1078
- Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
1079
- [eval debug] first 3 batch fingerprints:
1080
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
1081
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
1082
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
1083
- ce_avg: 0.14916321635246277, mse_avg: 0.03999507427215576
1084
  [2026-01-07 22:16:57] (step=0000866) Train Loss mse: 0.0280, Train Loss ce: 0.0637, Train Steps/Sec: 0.08,
1085
  [2026-01-07 22:17:13] (step=0000867) Train Loss mse: 0.0495, Train Loss ce: 0.0617, Train Steps/Sec: 0.06,
1086
  [2026-01-07 22:17:27] (step=0000868) Train Loss mse: 0.0502, Train Loss ce: 0.0616, Train Steps/Sec: 0.07,
@@ -2221,6 +2214,20 @@ ce_avg: 0.14916321635246277, mse_avg: 0.03999507427215576
2221
  [2026-01-08 02:40:09] (step=0002003) Train Loss mse: 0.0433, Train Loss ce: 0.0551, Train Steps/Sec: 0.08,
2222
  [2026-01-08 02:40:20] (step=0002004) Train Loss mse: 0.0496, Train Loss ce: 0.0568, Train Steps/Sec: 0.09,
2223
  [2026-01-08 02:40:34] (step=0002005) Train Loss mse: 0.0419, Train Loss ce: 0.0568, Train Steps/Sec: 0.07,
2224
  [2026-01-08 02:40:46] (step=0002006) Train Loss mse: 0.0387, Train Loss ce: 0.0622, Train Steps/Sec: 0.08,
2225
  [2026-01-08 02:40:57] (step=0002007) Train Loss mse: 0.0473, Train Loss ce: 0.0671, Train Steps/Sec: 0.09,
2226
  [2026-01-08 02:41:11] (step=0002008) Train Loss mse: 0.0478, Train Loss ce: 0.0585, Train Steps/Sec: 0.07,
@@ -2257,20 +2264,6 @@ ce_avg: 0.14916321635246277, mse_avg: 0.03999507427215576
2257
  [2026-01-08 02:48:21] (step=0002039) Train Loss mse: 0.0500, Train Loss ce: 0.0585, Train Steps/Sec: 0.08,
2258
  [2026-01-08 02:48:34] (step=0002040) Train Loss mse: 0.0508, Train Loss ce: 0.0584, Train Steps/Sec: 0.07,
2259
  [2026-01-08 02:48:47] (step=0002041) Train Loss mse: 0.0540, Train Loss ce: 0.0626, Train Steps/Sec: 0.08,
2260
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step2500
2261
- Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
2262
- [eval debug] first 3 batch fingerprints:
2263
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
2264
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
2265
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
2266
- ce_avg: 0.15878577530384064, mse_avg: 0.04078574851155281
2267
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step3000
2268
- Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
2269
- [eval debug] first 3 batch fingerprints:
2270
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
2271
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
2272
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
2273
- ce_avg: 0.05950051546096802, mse_avg: 0.04290447011590004
2274
  [2026-01-08 02:49:01] (step=0002042) Train Loss mse: 0.0372, Train Loss ce: 0.0575, Train Steps/Sec: 0.08,
2275
  [2026-01-08 02:49:13] (step=0002043) Train Loss mse: 0.0501, Train Loss ce: 0.0586, Train Steps/Sec: 0.08,
2276
  [2026-01-08 02:49:24] (step=0002044) Train Loss mse: 0.0542, Train Loss ce: 0.0584, Train Steps/Sec: 0.09,
@@ -3129,6 +3122,20 @@ ce_avg: 0.05950051546096802, mse_avg: 0.04290447011590004
3129
  [2026-01-08 06:07:35] (step=0002894) Train Loss mse: 0.0419, Train Loss ce: 0.0592, Train Steps/Sec: 0.08,
3130
  [2026-01-08 06:07:49] (step=0002895) Train Loss mse: 0.0422, Train Loss ce: 0.0577, Train Steps/Sec: 0.07,
3131
  [2026-01-08 06:08:02] (step=0002896) Train Loss mse: 0.0327, Train Loss ce: 0.0590, Train Steps/Sec: 0.08,
3132
  [2026-01-08 06:08:15] (step=0002897) Train Loss mse: 0.0434, Train Loss ce: 0.0559, Train Steps/Sec: 0.07,
3133
  [2026-01-08 06:08:29] (step=0002898) Train Loss mse: 0.0478, Train Loss ce: 0.0610, Train Steps/Sec: 0.07,
3134
  [2026-01-08 06:08:45] (step=0002899) Train Loss mse: 0.0369, Train Loss ce: 0.0575, Train Steps/Sec: 0.06,
@@ -3219,20 +3226,6 @@ ce_avg: 0.05950051546096802, mse_avg: 0.04290447011590004
3219
  [2026-01-08 06:28:08] (step=0002984) Train Loss mse: 0.0387, Train Loss ce: 0.0602, Train Steps/Sec: 0.06,
3220
  [2026-01-08 06:28:21] (step=0002985) Train Loss mse: 0.0372, Train Loss ce: 0.0530, Train Steps/Sec: 0.08,
3221
  [2026-01-08 06:28:33] (step=0002986) Train Loss mse: 0.0326, Train Loss ce: 0.0561, Train Steps/Sec: 0.08,
3222
- [2026-01-08 06:28:49
3223
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step3500
3224
- Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
3225
- [eval debug] first 3 batch fingerprints:
3226
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
3227
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
3228
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
3229
- ce_avg: 0.05882769450545311, mse_avg: 0.039856743067502975
3230
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step4000
3231
- Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
3232
- [eval debug] first 3 batch fingerprints:
3233
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
3234
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
3235
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
3236
  [2026-01-08 06:28:49] (step=0002987) Train Loss mse: 0.0406, Train Loss ce: 0.0596, Train Steps/Sec: 0.06,
3237
  [2026-01-08 06:29:01] (step=0002988) Train Loss mse: 0.0462, Train Loss ce: 0.0583, Train Steps/Sec: 0.09,
3238
  [2026-01-08 06:29:14] (step=0002989) Train Loss mse: 0.0422, Train Loss ce: 0.0606, Train Steps/Sec: 0.07,
@@ -4378,6 +4371,20 @@ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
4378
  [2026-01-08 10:53:01] (step=0004129) Train Loss mse: 0.0533, Train Loss ce: 0.0561, Train Steps/Sec: 0.09,
4379
  [2026-01-08 10:53:13] (step=0004130) Train Loss mse: 0.0327, Train Loss ce: 0.0578, Train Steps/Sec: 0.09,
4380
  [2026-01-08 10:53:29] (step=0004131) Train Loss mse: 0.0476, Train Loss ce: 0.0549, Train Steps/Sec: 0.06,
4381
  [2026-01-08 10:53:41] (step=0004132) Train Loss mse: 0.0427, Train Loss ce: 0.0588, Train Steps/Sec: 0.08,
4382
  [2026-01-08 10:53:57] (step=0004133) Train Loss mse: 0.0461, Train Loss ce: 0.0568, Train Steps/Sec: 0.06,
4383
  [2026-01-08 10:54:10] (step=0004134) Train Loss mse: 0.0357, Train Loss ce: 0.0562, Train Steps/Sec: 0.08,
@@ -4530,20 +4537,6 @@ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
4530
  [2026-01-08 11:27:44] (step=0004281) Train Loss mse: 0.0305, Train Loss ce: 0.0599, Train Steps/Sec: 0.08,
4531
  [2026-01-08 11:27:57] (step=0004282) Train Loss mse: 0.0300, Train Loss ce: 0.0572, Train Steps/Sec: 0.07,
4532
  [2026-01-08 11:28:12] (step=0004283) Train Loss mse: 0.0388, Train Loss ce: 0.0600, Train Steps/Sec: 0.07,
4533
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step4500
4534
- Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
4535
- [eval debug] first 3 batch fingerprints:
4536
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
4537
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
4538
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
4539
- ce_avg: 0.05841600522398949, mse_avg: 0.040690697729587555
4540
- base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step5000
4541
- Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
4542
- [eval debug] first 3 batch fingerprints:
4543
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
4544
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
4545
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
4546
- ce_avg: 0.058565959334373474, mse_avg: 0.04069836065173149
4547
  [2026-01-08 11:28:26] (step=0004284) Train Loss mse: 0.0509, Train Loss ce: 0.0572, Train Steps/Sec: 0.07,
4548
  [2026-01-08 11:28:39] (step=0004285) Train Loss mse: 0.0355, Train Loss ce: 0.0584, Train Steps/Sec: 0.07,
4549
  [2026-01-08 11:28:52] (step=0004286) Train Loss mse: 0.0272, Train Loss ce: 0.0561, Train Steps/Sec: 0.08,
@@ -5262,4 +5255,11 @@ ce_avg: 0.058565959334373474, mse_avg: 0.04069836065173149
5262
  [2026-01-08 14:13:25] (step=0004999) Train Loss mse: 0.0486, Train Loss ce: 0.0565, Train Steps/Sec: 0.08,
5263
  [2026-01-08 14:14:43] (step=0005000) Train Loss mse: 0.0408, Train Loss ce: 0.0569, Train Steps/Sec: 0.01,
5264
  [2026-01-08 14:14:43] Saving checkpoint to /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/0005000.
5265
- [2026-01-08 14:17:28] Done!
817
  [2026-01-07 22:03:25] (step=0000806) Train Loss mse: 0.0474, Train Loss ce: 0.0616, Train Steps/Sec: 0.08,
818
  [2026-01-07 22:03:37] (step=0000807) Train Loss mse: 0.0525, Train Loss ce: 0.0598, Train Steps/Sec: 0.08,
819
  [2026-01-07 22:03:49] (step=0000808) Train Loss mse: 0.0454, Train Loss ce: 0.0603, Train Steps/Sec: 0.09,
820
  FullyShardedDataParallel(
821
  (_fsdp_wrapped_module): Bagel(
822
  (language_model): Qwen2ForCausalLM(
 
1017
  fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
1018
  fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
1019
  ce_avg: 0.1378343403339386, mse_avg: 0.041579172015190125
1020
+ [2026-01-07 22:04:05] (step=0000809) Train Loss mse: 0.0326, Train Loss ce: 0.0647, Train Steps/Sec: 0.06,
1021
+ [2026-01-07 22:04:17] (step=0000810) Train Loss mse: 0.0437, Train Loss ce: 0.0584, Train Steps/Sec: 0.08,
1022
+ [2026-01-07 22:04:29] (step=0000811) Train Loss mse: 0.0597, Train Loss ce: 0.0633, Train Steps/Sec: 0.08,
1023
+ [2026-01-07 22:04:41] (step=0000812) Train Loss mse: 0.0387, Train Loss ce: 0.0599, Train Steps/Sec: 0.09,
1024
+ [2026-01-07 22:04:55] (step=0000813) Train Loss mse: 0.0491, Train Loss ce: 0.0613, Train Steps/Sec: 0.07,
1025
+ [2026-01-07 22:05:08] (step=0000814) Train Loss mse: 0.0546, Train Loss ce: 0.0671, Train Steps/Sec: 0.08,
1026
+ [2026-01-07 22:05:24] (step=0000815) Train Loss mse: 0.0472, Train Loss ce: 0.0726, Train Steps/Sec: 0.06,
1027
+ [2026-01-07 22:05:41] (step=0000816) Train Loss mse: 0.0341, Train Loss ce: 0.0607, Train Steps/Sec: 0.06,
1028
+ [2026-01-07 22:05:56] (step=0000817) Train Loss mse: 0.0452, Train Loss ce: 0.0611, Train Steps/Sec: 0.06,
1029
+ [2026-01-07 22:06:10] (step=0000818) Train Loss mse: 0.0349, Train Loss ce: 0.0640, Train Steps/Sec: 0.08,
1030
+ [2026-01-07 22:06:21] (step=0000819) Train Loss mse: 0.0482, Train Loss ce: 0.0636, Train Steps/Sec: 0.09,
1031
+ [2026-01-07 22:06:32] (step=0000820) Train Loss mse: 0.0424, Train Loss ce: 0.0601, Train Steps/Sec: 0.09,
1032
+ [2026-01-07 22:06:45] (step=0000821) Train Loss mse: 0.0306, Train Loss ce: 0.0608, Train Steps/Sec: 0.08,
1033
+ [2026-01-07 22:06:56] (step=0000822) Train Loss mse: 0.0402, Train Loss ce: 0.0640, Train Steps/Sec: 0.09,
1034
+ [2026-01-07 22:07:10] (step=0000823) Train Loss mse: 0.0510, Train Loss ce: 0.0636, Train Steps/Sec: 0.08,
1035
+ [2026-01-07 22:07:26] (step=0000824) Train Loss mse: 0.0551, Train Loss ce: 0.0690, Train Steps/Sec: 0.06,
1036
+ [2026-01-07 22:07:39] (step=0000825) Train Loss mse: 0.0469, Train Loss ce: 0.0644, Train Steps/Sec: 0.07,
1037
+ [2026-01-07 22:07:52] (step=0000826) Train Loss mse: 0.0598, Train Loss ce: 0.0676, Train Steps/Sec: 0.08,
1038
+ [2026-01-07 22:08:02] (step=0000827) Train Loss mse: 0.0474, Train Loss ce: 0.0626, Train Steps/Sec: 0.09,
1039
+ [2026-01-07 22:08:16] (step=0000828) Train Loss mse: 0.0396, Train Loss ce: 0.0633, Train Steps/Sec: 0.07,
1040
+ [2026-01-07 22:08:32] (step=0000829) Train Loss mse: 0.0341, Train Loss ce: 0.0630, Train Steps/Sec: 0.06,
1041
+ [2026-01-07 22:08:44] (step=0000830) Train Loss mse: 0.0481, Train Loss ce: 0.0613, Train Steps/Sec: 0.08,
1042
+ [2026-01-07 22:08:56] (step=0000831) Train Loss mse: 0.0491, Train Loss ce: 0.0609, Train Steps/Sec: 0.08,
1043
+ [2026-01-07 22:09:08] (step=0000832) Train Loss mse: 0.0519, Train Loss ce: 0.0643, Train Steps/Sec: 0.08,
1044
+ [2026-01-07 22:09:24] (step=0000833) Train Loss mse: 0.0422, Train Loss ce: 0.0661, Train Steps/Sec: 0.06,
1045
+ [2026-01-07 22:09:36] (step=0000834) Train Loss mse: 0.0502, Train Loss ce: 0.0766, Train Steps/Sec: 0.09,
1046
+ [2026-01-07 22:09:51] (step=0000835) Train Loss mse: 0.0386, Train Loss ce: 0.0603, Train Steps/Sec: 0.06,
1047
+ [2026-01-07 22:10:05] (step=0000836) Train Loss mse: 0.0366, Train Loss ce: 0.0586, Train Steps/Sec: 0.07,
1048
+ [2026-01-07 22:10:17] (step=0000837) Train Loss mse: 0.0561, Train Loss ce: 0.0694, Train Steps/Sec: 0.08,
1049
+ [2026-01-07 22:10:33] (step=0000838) Train Loss mse: 0.0368, Train Loss ce: 0.0644, Train Steps/Sec: 0.06,
1050
+ [2026-01-07 22:10:46] (step=0000839) Train Loss mse: 0.0375, Train Loss ce: 0.0635, Train Steps/Sec: 0.07,
1051
+ [2026-01-07 22:11:02] (step=0000840) Train Loss mse: 0.0462, Train Loss ce: 0.0678, Train Steps/Sec: 0.06,
1052
+ [2026-01-07 22:11:15] (step=0000841) Train Loss mse: 0.0450, Train Loss ce: 0.0676, Train Steps/Sec: 0.08,
1053
+ [2026-01-07 22:11:28] (step=0000842) Train Loss mse: 0.0397, Train Loss ce: 0.0628, Train Steps/Sec: 0.08,
1054
+ [2026-01-07 22:11:44] (step=0000843) Train Loss mse: 0.0385, Train Loss ce: 0.0599, Train Steps/Sec: 0.06,
1055
+ [2026-01-07 22:11:57] (step=0000844) Train Loss mse: 0.0495, Train Loss ce: 0.0653, Train Steps/Sec: 0.07,
1056
+ [2026-01-07 22:12:11] (step=0000845) Train Loss mse: 0.0314, Train Loss ce: 0.0680, Train Steps/Sec: 0.07,
1057
+ [2026-01-07 22:12:24] (step=0000846) Train Loss mse: 0.0336, Train Loss ce: 0.0606, Train Steps/Sec: 0.08,
1058
+ [2026-01-07 22:12:40] (step=0000847) Train Loss mse: 0.0251, Train Loss ce: 0.0687, Train Steps/Sec: 0.06,
1059
+ [2026-01-07 22:12:53] (step=0000848) Train Loss mse: 0.0325, Train Loss ce: 0.0635, Train Steps/Sec: 0.08,
1060
+ [2026-01-07 22:13:09] (step=0000849) Train Loss mse: 0.0502, Train Loss ce: 0.0647, Train Steps/Sec: 0.06,
1061
+ [2026-01-07 22:13:21] (step=0000850) Train Loss mse: 0.0455, Train Loss ce: 0.0658, Train Steps/Sec: 0.08,
1062
+ [2026-01-07 22:13:34] (step=0000851) Train Loss mse: 0.0463, Train Loss ce: 0.0628, Train Steps/Sec: 0.07,
1063
+ [2026-01-07 22:13:47] (step=0000852) Train Loss mse: 0.0526, Train Loss ce: 0.0703, Train Steps/Sec: 0.08,
1064
+ [2026-01-07 22:13:59] (step=0000853) Train Loss mse: 0.0485, Train Loss ce: 0.0628, Train Steps/Sec: 0.08,
1065
+ [2026-01-07 22:14:15] (step=0000854) Train Loss mse: 0.0407, Train Loss ce: 0.0654, Train Steps/Sec: 0.06,
1066
+ [2026-01-07 22:14:28] (step=0000855) Train Loss mse: 0.0481, Train Loss ce: 0.0616, Train Steps/Sec: 0.08,
1067
+ [2026-01-07 22:14:42] (step=0000856) Train Loss mse: 0.0525, Train Loss ce: 0.0702, Train Steps/Sec: 0.07,
1068
+ [2026-01-07 22:14:55] (step=0000857) Train Loss mse: 0.0311, Train Loss ce: 0.0600, Train Steps/Sec: 0.07,
1069
+ [2026-01-07 22:15:07] (step=0000858) Train Loss mse: 0.0595, Train Loss ce: 0.0597, Train Steps/Sec: 0.08,
1070
+ [2026-01-07 22:15:21] (step=0000859) Train Loss mse: 0.0382, Train Loss ce: 0.0590, Train Steps/Sec: 0.07,
1071
+ [2026-01-07 22:15:34] (step=0000860) Train Loss mse: 0.0360, Train Loss ce: 0.0655, Train Steps/Sec: 0.08,
1072
+ [2026-01-07 22:15:50] (step=0000861) Train Loss mse: 0.0379, Train Loss ce: 0.0579, Train Steps/Sec: 0.06,
1073
+ [2026-01-07 22:16:04] (step=0000862) Train Loss mse: 0.0402, Train Loss ce: 0.0612, Train Steps/Sec: 0.07,
1074
+ [2026-01-07 22:16:16] (step=0000863) Train Loss mse: 0.0490, Train Loss ce: 0.0752, Train Steps/Sec: 0.08,
1075
+ [2026-01-07 22:16:30] (step=0000864) Train Loss mse: 0.0547, Train Loss ce: 0.0663, Train Steps/Sec: 0.07,
1076
+ [2026-01-07 22:16:43] (step=0000865) Train Loss mse: 0.0452, Train Loss ce: 0.0632, Train Steps/Sec: 0.07,
1077
  [2026-01-07 22:16:57] (step=0000866) Train Loss mse: 0.0280, Train Loss ce: 0.0637, Train Steps/Sec: 0.08,
1078
  [2026-01-07 22:17:13] (step=0000867) Train Loss mse: 0.0495, Train Loss ce: 0.0617, Train Steps/Sec: 0.06,
1079
  [2026-01-07 22:17:27] (step=0000868) Train Loss mse: 0.0502, Train Loss ce: 0.0616, Train Steps/Sec: 0.07,
 
2214
  [2026-01-08 02:40:09] (step=0002003) Train Loss mse: 0.0433, Train Loss ce: 0.0551, Train Steps/Sec: 0.08,
2215
  [2026-01-08 02:40:20] (step=0002004) Train Loss mse: 0.0496, Train Loss ce: 0.0568, Train Steps/Sec: 0.09,
2216
  [2026-01-08 02:40:34] (step=0002005) Train Loss mse: 0.0419, Train Loss ce: 0.0568, Train Steps/Sec: 0.07,
2217
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step2000
2218
+ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
2219
+ [eval debug] first 3 batch fingerprints:
2220
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
2221
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
2222
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
2223
+ ce_avg: 0.14916321635246277, mse_avg: 0.03999507427215576
2224
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step2500
2225
+ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
2226
+ [eval debug] first 3 batch fingerprints:
2227
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
2228
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
2229
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
2230
+ ce_avg: 0.15878577530384064, mse_avg: 0.04078574851155281
2231
  [2026-01-08 02:40:46] (step=0002006) Train Loss mse: 0.0387, Train Loss ce: 0.0622, Train Steps/Sec: 0.08,
2232
  [2026-01-08 02:40:57] (step=0002007) Train Loss mse: 0.0473, Train Loss ce: 0.0671, Train Steps/Sec: 0.09,
2233
  [2026-01-08 02:41:11] (step=0002008) Train Loss mse: 0.0478, Train Loss ce: 0.0585, Train Steps/Sec: 0.07,
 
2264
  [2026-01-08 02:48:21] (step=0002039) Train Loss mse: 0.0500, Train Loss ce: 0.0585, Train Steps/Sec: 0.08,
2265
  [2026-01-08 02:48:34] (step=0002040) Train Loss mse: 0.0508, Train Loss ce: 0.0584, Train Steps/Sec: 0.07,
2266
  [2026-01-08 02:48:47] (step=0002041) Train Loss mse: 0.0540, Train Loss ce: 0.0626, Train Steps/Sec: 0.08,
2267
  [2026-01-08 02:49:01] (step=0002042) Train Loss mse: 0.0372, Train Loss ce: 0.0575, Train Steps/Sec: 0.08,
2268
  [2026-01-08 02:49:13] (step=0002043) Train Loss mse: 0.0501, Train Loss ce: 0.0586, Train Steps/Sec: 0.08,
2269
  [2026-01-08 02:49:24] (step=0002044) Train Loss mse: 0.0542, Train Loss ce: 0.0584, Train Steps/Sec: 0.09,
 
3122
  [2026-01-08 06:07:35] (step=0002894) Train Loss mse: 0.0419, Train Loss ce: 0.0592, Train Steps/Sec: 0.08,
3123
  [2026-01-08 06:07:49] (step=0002895) Train Loss mse: 0.0422, Train Loss ce: 0.0577, Train Steps/Sec: 0.07,
3124
  [2026-01-08 06:08:02] (step=0002896) Train Loss mse: 0.0327, Train Loss ce: 0.0590, Train Steps/Sec: 0.08,
3125
+ [2026-01-08 06:08:15
3126
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step3000
3127
+ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
3128
+ [eval debug] first 3 batch fingerprints:
3129
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
3130
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
3131
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
3132
+ ce_avg: 0.05950051546096802, mse_avg: 0.04290447011590004
3133
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step3500
3134
+ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
3135
+ [eval debug] first 3 batch fingerprints:
3136
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
3137
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
3138
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
3139
  [2026-01-08 06:08:15] (step=0002897) Train Loss mse: 0.0434, Train Loss ce: 0.0559, Train Steps/Sec: 0.07,
3140
  [2026-01-08 06:08:29] (step=0002898) Train Loss mse: 0.0478, Train Loss ce: 0.0610, Train Steps/Sec: 0.07,
3141
  [2026-01-08 06:08:45] (step=0002899) Train Loss mse: 0.0369, Train Loss ce: 0.0575, Train Steps/Sec: 0.06,
 
3226
  [2026-01-08 06:28:08] (step=0002984) Train Loss mse: 0.0387, Train Loss ce: 0.0602, Train Steps/Sec: 0.06,
3227
  [2026-01-08 06:28:21] (step=0002985) Train Loss mse: 0.0372, Train Loss ce: 0.0530, Train Steps/Sec: 0.08,
3228
  [2026-01-08 06:28:33] (step=0002986) Train Loss mse: 0.0326, Train Loss ce: 0.0561, Train Steps/Sec: 0.08,
3229
  [2026-01-08 06:28:49] (step=0002987) Train Loss mse: 0.0406, Train Loss ce: 0.0596, Train Steps/Sec: 0.06,
3230
  [2026-01-08 06:29:01] (step=0002988) Train Loss mse: 0.0462, Train Loss ce: 0.0583, Train Steps/Sec: 0.09,
3231
  [2026-01-08 06:29:14] (step=0002989) Train Loss mse: 0.0422, Train Loss ce: 0.0606, Train Steps/Sec: 0.07,
 
4371
  [2026-01-08 10:53:01] (step=0004129) Train Loss mse: 0.0533, Train Loss ce: 0.0561, Train Steps/Sec: 0.09,
4372
  [2026-01-08 10:53:13] (step=0004130) Train Loss mse: 0.0327, Train Loss ce: 0.0578, Train Steps/Sec: 0.09,
4373
  [2026-01-08 10:53:29] (step=0004131) Train Loss mse: 0.0476, Train Loss ce: 0.0549, Train Steps/Sec: 0.06,
4374
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step4000
4375
+ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
4376
+ [eval debug] first 3 batch fingerprints:
4377
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
4378
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
4379
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
4380
+ ce_avg: 0.05835578218102455, mse_avg: 0.04091694951057434
4381
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step4500
4382
+ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
4383
+ [eval debug] first 3 batch fingerprints:
4384
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
4385
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
4386
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
4387
+ ce_avg: 0.05841600522398949, mse_avg: 0.040690697729587555
4388
  [2026-01-08 10:53:41] (step=0004132) Train Loss mse: 0.0427, Train Loss ce: 0.0588, Train Steps/Sec: 0.08,
4389
  [2026-01-08 10:53:57] (step=0004133) Train Loss mse: 0.0461, Train Loss ce: 0.0568, Train Steps/Sec: 0.06,
4390
  [2026-01-08 10:54:10] (step=0004134) Train Loss mse: 0.0357, Train Loss ce: 0.0562, Train Steps/Sec: 0.08,
 
4537
  [2026-01-08 11:27:44] (step=0004281) Train Loss mse: 0.0305, Train Loss ce: 0.0599, Train Steps/Sec: 0.08,
4538
  [2026-01-08 11:27:57] (step=0004282) Train Loss mse: 0.0300, Train Loss ce: 0.0572, Train Steps/Sec: 0.07,
4539
  [2026-01-08 11:28:12] (step=0004283) Train Loss mse: 0.0388, Train Loss ce: 0.0600, Train Steps/Sec: 0.07,
4540
  [2026-01-08 11:28:26] (step=0004284) Train Loss mse: 0.0509, Train Loss ce: 0.0572, Train Steps/Sec: 0.07,
4541
  [2026-01-08 11:28:39] (step=0004285) Train Loss mse: 0.0355, Train Loss ce: 0.0584, Train Steps/Sec: 0.07,
4542
  [2026-01-08 11:28:52] (step=0004286) Train Loss mse: 0.0272, Train Loss ce: 0.0561, Train Steps/Sec: 0.08,
 
5255
  [2026-01-08 14:13:25] (step=0004999) Train Loss mse: 0.0486, Train Loss ce: 0.0565, Train Steps/Sec: 0.08,
5256
  [2026-01-08 14:14:43] (step=0005000) Train Loss mse: 0.0408, Train Loss ce: 0.0569, Train Steps/Sec: 0.01,
5257
  [2026-01-08 14:14:43] Saving checkpoint to /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/0005000.
5258
+ [2026-01-08 14:17:28] Done!
5259
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce/eval_used_rows, step_tag is checkpoints_vlm_gym_jigsaw_one_image_lr2e_5_ce_ema993_hashed_step5000
5260
+ Preparing Dataset vlm_gym_jigsaw_celoss_evalonce/vlm_gym_jigsaw_val
5261
+ [eval debug] first 3 batch fingerprints:
5262
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
5263
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
5264
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_jigsaw_celoss_evalonce'}]
5265
+ ce_avg: 0.058565959334373474, mse_avg: 0.04069836065173149