Commit 32a25e1 (verified) by Junyi42 · 1 parent: b844ee1

Upload checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins

checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/wandb/offline-run-20260125_170425-checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins-run0/files/output.log CHANGED
@@ -1,3 +1,189 @@
  wandb: Detected [huggingface_hub.inference] in use.
  wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
  wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
@@ -729,192 +915,6 @@ wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
  [2026-01-25 21:03:00] (step=0000718) Train Loss mse: 0.0111, Train Loss ce: 0.2815, Train Steps/Sec: 0.05,
  [2026-01-25 21:03:14] (step=0000719) Train Loss mse: 0.0212, Train Loss ce: 0.3005, Train Steps/Sec: 0.07,
  [2026-01-25 21:03:34] (step=0000720) Train Loss mse: 0.0175, Train Loss ce: 0.2827, Train Steps/Sec: 0.05,
- FullyShardedDataParallel(
-   (_fsdp_wrapped_module): Bagel(
-     (language_model): Qwen2ForCausalLM(
-       (model): Qwen2Model(
-         (embed_tokens): Embedding(152064, 3584)
-         (layers): ModuleList(
-           (0-27): 28 x FullyShardedDataParallel(
-             (_fsdp_wrapped_module): CheckpointWrapper(
-               (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
-                 (self_attn): PackedAttentionMoT(
-                   (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
-                   (k_proj): Linear(in_features=3584, out_features=512, bias=True)
-                   (v_proj): Linear(in_features=3584, out_features=512, bias=True)
-                   (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
-                   (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
-                   (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
-                   (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
-                   (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
-                   (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
-                   (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
-                   (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
-                   (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
-                 )
-                 (mlp): Qwen2MLP(
-                   (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
-                   (act_fn): SiLU()
-                 )
-                 (mlp_moe_gen): Qwen2MLP(
-                   (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
-                   (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
-                   (act_fn): SiLU()
-                 )
-                 (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
-                 (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
-                 (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
-                 (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
-               )
-             )
-           )
-         )
-         (norm): Qwen2RMSNorm((3584,), eps=1e-06)
-         (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
-         (rotary_emb): Qwen2RotaryEmbedding()
-       )
-       (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
-     )
-     (time_embedder): FullyShardedDataParallel(
-       (_fsdp_wrapped_module): TimestepEmbedder(
-         (mlp): Sequential(
-           (0): Linear(in_features=256, out_features=3584, bias=True)
-           (1): SiLU()
-           (2): Linear(in_features=3584, out_features=3584, bias=True)
-         )
-       )
-     )
-     (vae2llm): Linear(in_features=64, out_features=3584, bias=True)
-     (llm2vae): Linear(in_features=3584, out_features=64, bias=True)
-     (latent_pos_embed): FullyShardedDataParallel(
-       (_fsdp_wrapped_module): PositionEmbedding()
-     )
-     (vit_model): SiglipVisionModel(
-       (vision_model): FullyShardedDataParallel(
-         (_fsdp_wrapped_module): SiglipVisionTransformer(
-           (embeddings): SiglipVisionEmbeddings(
-             (position_embedding): Embedding(4900, 1152)
-             (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
-           )
-           (encoder): SiglipEncoder(
-             (layers): ModuleList(
-               (0-25): 26 x FullyShardedDataParallel(
-                 (_fsdp_wrapped_module): CheckpointWrapper(
-                   (_checkpoint_wrapped_module): SiglipEncoderLayer(
-                     (self_attn): SiglipFlashAttention2(
-                       (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                       (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                       (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                       (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
-                     )
-                     (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
-                     (mlp): SiglipMLP(
-                       (activation_fn): PytorchGELUTanh()
-                       (fc1): Linear(in_features=1152, out_features=4304, bias=True)
-                       (fc2): Linear(in_features=4304, out_features=1152, bias=True)
-                     )
-                     (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
-                   )
-                 )
-               )
-             )
-           )
-           (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
-         )
-       )
-     )
-     (connector): FullyShardedDataParallel(
-       (_fsdp_wrapped_module): CheckpointWrapper(
-         (_checkpoint_wrapped_module): MLPconnector(
-           (activation_fn): PytorchGELUTanh()
-           (fc1): Linear(in_features=1152, out_features=3584, bias=True)
-           (fc2): Linear(in_features=3584, out_features=3584, bias=True)
-         )
-       )
-     )
-     (vit_pos_embed): FullyShardedDataParallel(
-       (_fsdp_wrapped_module): PositionEmbedding()
-     )
-   )
- )
- _flat_param True
- language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- time_embedder._fsdp_wrapped_module._flat_param True
- latent_pos_embed._fsdp_wrapped_module._flat_param False
- vit_model.vision_model._fsdp_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
- vit_pos_embed._fsdp_wrapped_module._flat_param False
- Preparing Dataset vlm_gym_colorization_celoss/vlm_gym_colorization_train
- base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step0
- Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- ce_avg: 1.3898617029190063, mse_avg: 0.05326032266020775
- base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step500
- Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- ce_avg: 0.2838270962238312, mse_avg: 0.00850633904337883
  [2026-01-25 21:03:53] (step=0000721) Train Loss mse: 0.0203, Train Loss ce: 0.2965, Train Steps/Sec: 0.05,
  [2026-01-25 21:04:11] (step=0000722) Train Loss mse: 0.0150, Train Loss ce: 0.2638, Train Steps/Sec: 0.05,
  [2026-01-25 21:04:33] (step=0000723) Train Loss mse: 0.0155, Train Loss ce: 0.2844, Train Steps/Sec: 0.05,
@@ -1008,7 +1008,28 @@ ce_avg: 0.2838270962238312, mse_avg: 0.00850633904337883
  [2026-01-25 21:32:44] (step=0000811) Train Loss mse: 0.0192, Train Loss ce: 0.2767, Train Steps/Sec: 0.06,
  [2026-01-25 21:33:04] (step=0000812) Train Loss mse: 0.0235, Train Loss ce: 0.2813, Train Steps/Sec: 0.05,
  [2026-01-25 21:33:19] (step=0000813) Train Loss mse: 0.0263, Train Loss ce: 0.2622, Train Steps/Sec: 0.06,
- [2026-01-25 21:33:41] (step=0000814) Train Loss mse: 0.0219, Train Loss ce: 0.2812, Train Steps/Sec: 0.05,
  [2026-01-25 21:33:57] (step=0000815) Train Loss mse: 0.0234, Train Loss ce: 0.2941, Train Steps/Sec: 0.07,
  [2026-01-25 21:34:16] (step=0000816) Train Loss mse: 0.0220, Train Loss ce: 0.2732, Train Steps/Sec: 0.05,
  [2026-01-25 21:34:34] (step=0000817) Train Loss mse: 0.0237, Train Loss ce: 0.2544, Train Steps/Sec: 0.05,
@@ -2082,41 +2103,6 @@ ce_avg: 0.2838270962238312, mse_avg: 0.00850633904337883
  [2026-01-26 03:14:59] (step=0001885) Train Loss mse: 0.0153, Train Loss ce: 0.2766, Train Steps/Sec: 0.05,
  [2026-01-26 03:15:19] (step=0001886) Train Loss mse: 0.0292, Train Loss ce: 0.2450, Train Steps/Sec: 0.05,
  [2026-01-26 03:15:36] (step=0001887) Train Loss mse: 0.0145, Train Loss ce: 0.2653, Train Steps/Sec: 0.06,
- base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step1000
- Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- ce_avg: 0.4483165144920349, mse_avg: 0.008000156842172146
- base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step1500
- Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- ce_avg: 0.5879027843475342, mse_avg: 0.008525446057319641
- base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step2000
- Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- ce_avg: 1.2331268787384033, mse_avg: 0.008829578757286072
- base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step2500
- Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- ce_avg: 1.185328722000122, mse_avg: 0.008346919901669025
- base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step3000
- Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- ce_avg: 0.24374419450759888, mse_avg: 0.007726446725428104
  [2026-01-26 03:15:56] (step=0001888) Train Loss mse: 0.0274, Train Loss ce: 0.2608, Train Steps/Sec: 0.05,
  [2026-01-26 03:16:09] (step=0001889) Train Loss mse: 0.0220, Train Loss ce: 0.2631, Train Steps/Sec: 0.08,
  [2026-01-26 03:16:27] (step=0001890) Train Loss mse: 0.0193, Train Loss ce: 0.2902, Train Steps/Sec: 0.06,
@@ -2201,6 +2187,20 @@ ce_avg: 0.24374419450759888, mse_avg: 0.007726446725428104
  [2026-01-26 03:41:34] (step=0001969) Train Loss mse: 0.0250, Train Loss ce: 0.2561, Train Steps/Sec: 0.06,
  [2026-01-26 03:41:53] (step=0001970) Train Loss mse: 0.0162, Train Loss ce: 0.2314, Train Steps/Sec: 0.05,
  [2026-01-26 03:42:15] (step=0001971) Train Loss mse: 0.0232, Train Loss ce: 0.2660, Train Steps/Sec: 0.04,
  [2026-01-26 03:42:35] (step=0001972) Train Loss mse: 0.0179, Train Loss ce: 0.2603, Train Steps/Sec: 0.05,
  [2026-01-26 03:42:54] (step=0001973) Train Loss mse: 0.0221, Train Loss ce: 0.2504, Train Steps/Sec: 0.05,
  [2026-01-26 03:43:14] (step=0001974) Train Loss mse: 0.0300, Train Loss ce: 0.2581, Train Steps/Sec: 0.05,
@@ -3231,6 +3231,13 @@ ce_avg: 0.24374419450759888, mse_avg: 0.007726446725428104
  [2026-01-26 09:12:03] (step=0002996) Train Loss mse: 0.0169, Train Loss ce: 0.2552, Train Steps/Sec: 0.06,
  [2026-01-26 09:12:21] (step=0002997) Train Loss mse: 0.0187, Train Loss ce: 0.2464, Train Steps/Sec: 0.06,
  [2026-01-26 09:12:42] (step=0002998) Train Loss mse: 0.0267, Train Loss ce: 0.2342, Train Steps/Sec: 0.05,
  [2026-01-26 09:13:01] (step=0002999) Train Loss mse: 0.0298, Train Loss ce: 0.2517, Train Steps/Sec: 0.05,
  [2026-01-26 09:14:50] (step=0003000) Train Loss mse: 0.0192, Train Loss ce: 0.2577, Train Steps/Sec: 0.01,
  [2026-01-26 09:15:09] (step=0003001) Train Loss mse: 0.0150, Train Loss ce: 0.2598, Train Steps/Sec: 0.05,
@@ -3256,13 +3263,20 @@ ce_avg: 0.24374419450759888, mse_avg: 0.007726446725428104
  [2026-01-26 09:21:26] (step=0003021) Train Loss mse: 0.0134, Train Loss ce: 0.2563, Train Steps/Sec: 0.05,
  [2026-01-26 09:21:45] (step=0003022) Train Loss mse: 0.0198, Train Loss ce: 0.2363, Train Steps/Sec: 0.05,
  [2026-01-26 09:22:06] (step=0003023) Train Loss mse: 0.0328, Train Loss ce: 0.2532, Train Steps/Sec: 0.05,
- base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step3500
- Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- ce_avg: 0.2417636662721634, mse_avg: 0.007533130701631308
  [2026-01-26 09:26:47] (step=0003038) Train Loss mse: 0.0324, Train Loss ce: 0.2545, Train Steps/Sec: 0.04,
  [2026-01-26 09:27:08] (step=0003039) Train Loss mse: 0.0328, Train Loss ce: 0.2373, Train Steps/Sec: 0.05,
  [2026-01-26 09:27:29] (step=0003040) Train Loss mse: 0.0207, Train Loss ce: 0.2709, Train Steps/Sec: 0.05,
@@ -4151,20 +4165,6 @@ ce_avg: 0.2417636662721634, mse_avg: 0.007533130701631308
  [2026-01-26 14:10:34] (step=0003923) Train Loss mse: 0.0186, Train Loss ce: 0.2431, Train Steps/Sec: 0.06,
  [2026-01-26 14:10:49] (step=0003924) Train Loss mse: 0.0123, Train Loss ce: 0.2697, Train Steps/Sec: 0.06,
  [2026-01-26 14:11:05] (step=0003925) Train Loss mse: 0.0176, Train Loss ce: 0.2535, Train Steps/Sec: 0.06,
- base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step4000
- Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- ce_avg: 0.24075058102607727, mse_avg: 0.007322242017835379
- base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step4500
- Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- ce_avg: 0.23980583250522614, mse_avg: 0.007650755811482668
  [2026-01-26 14:11:24] (step=0003926) Train Loss mse: 0.0236, Train Loss ce: 0.2457, Train Steps/Sec: 0.05,
  [2026-01-26 14:11:47] (step=0003927) Train Loss mse: 0.0151, Train Loss ce: 0.2381, Train Steps/Sec: 0.04,
  [2026-01-26 14:12:08] (step=0003928) Train Loss mse: 0.0155, Train Loss ce: 0.2358, Train Steps/Sec: 0.05,
@@ -4196,6 +4196,27 @@ ce_avg: 0.23980583250522614, mse_avg: 0.007650755811482668
  [2026-01-26 14:20:02] (step=0003954) Train Loss mse: 0.0229, Train Loss ce: 0.2695, Train Steps/Sec: 0.06,
  [2026-01-26 14:20:22] (step=0003955) Train Loss mse: 0.0199, Train Loss ce: 0.2492, Train Steps/Sec: 0.05,
  [2026-01-26 14:20:43] (step=0003956) Train Loss mse: 0.0114, Train Loss ce: 0.2503, Train Steps/Sec: 0.05,
  [2026-01-26 14:21:06] (step=0003957) Train Loss mse: 0.0343, Train Loss ce: 0.2657, Train Steps/Sec: 0.04,
  [2026-01-26 14:21:25] (step=0003958) Train Loss mse: 0.0271, Train Loss ce: 0.2690, Train Steps/Sec: 0.05,
  [2026-01-26 14:21:43] (step=0003959) Train Loss mse: 0.0122, Train Loss ce: 0.2418, Train Steps/Sec: 0.05,
@@ -5241,11 +5262,4 @@ ce_avg: 0.23980583250522614, mse_avg: 0.007650755811482668
  [2026-01-26 19:56:56] (step=0004999) Train Loss mse: 0.0202, Train Loss ce: 0.2401, Train Steps/Sec: 0.04,
  [2026-01-26 19:58:44] (step=0005000) Train Loss mse: 0.0209, Train Loss ce: 0.2473, Train Steps/Sec: 0.01,
  [2026-01-26 19:58:44] Saving checkpoint to /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/0005000.
- [2026-01-26 20:01:32] Done!
- base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step5000
- Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
- [eval debug] first 3 batch fingerprints:
- fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
- ce_avg: 0.23948809504508972, mse_avg: 0.007586085703223944
 
+ [2026-01-25 21:33:41] (step=0000814) Train Loss mse: 0.0219, Train Loss ce: 0.2812, Train Steps/Sec: 0.05,
+ FullyShardedDataParallel(
+   (_fsdp_wrapped_module): Bagel(
+     (language_model): Qwen2ForCausalLM(
+       (model): Qwen2Model(
+         (embed_tokens): Embedding(152064, 3584)
+         (layers): ModuleList(
+           (0-27): 28 x FullyShardedDataParallel(
+             (_fsdp_wrapped_module): CheckpointWrapper(
+               (_checkpoint_wrapped_module): Qwen2MoTDecoderLayer(
+                 (self_attn): PackedAttentionMoT(
+                   (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
+                   (k_proj): Linear(in_features=3584, out_features=512, bias=True)
+                   (v_proj): Linear(in_features=3584, out_features=512, bias=True)
+                   (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
+                   (q_norm): Qwen2RMSNorm((128,), eps=1e-06)
+                   (k_norm): Qwen2RMSNorm((128,), eps=1e-06)
+                   (q_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
+                   (k_norm_moe_gen): Qwen2RMSNorm((128,), eps=1e-06)
+                   (q_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=True)
+                   (k_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
+                   (v_proj_moe_gen): Linear(in_features=3584, out_features=512, bias=True)
+                   (o_proj_moe_gen): Linear(in_features=3584, out_features=3584, bias=False)
+                 )
+                 (mlp): Qwen2MLP(
+                   (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
+                   (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
+                   (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
+                   (act_fn): SiLU()
+                 )
+                 (mlp_moe_gen): Qwen2MLP(
+                   (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
+                   (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
+                   (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
+                   (act_fn): SiLU()
+                 )
+                 (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
+                 (input_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+                 (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
+                 (post_attention_layernorm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+               )
+             )
+           )
+         )
+         (norm): Qwen2RMSNorm((3584,), eps=1e-06)
+         (norm_moe_gen): Qwen2RMSNorm((3584,), eps=1e-06)
+         (rotary_emb): Qwen2RotaryEmbedding()
+       )
+       (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
+     )
+     (time_embedder): FullyShardedDataParallel(
+       (_fsdp_wrapped_module): TimestepEmbedder(
+         (mlp): Sequential(
+           (0): Linear(in_features=256, out_features=3584, bias=True)
+           (1): SiLU()
+           (2): Linear(in_features=3584, out_features=3584, bias=True)
+         )
+       )
+     )
+     (vae2llm): Linear(in_features=64, out_features=3584, bias=True)
+     (llm2vae): Linear(in_features=3584, out_features=64, bias=True)
+     (latent_pos_embed): FullyShardedDataParallel(
+       (_fsdp_wrapped_module): PositionEmbedding()
+     )
+     (vit_model): SiglipVisionModel(
+       (vision_model): FullyShardedDataParallel(
+         (_fsdp_wrapped_module): SiglipVisionTransformer(
+           (embeddings): SiglipVisionEmbeddings(
+             (position_embedding): Embedding(4900, 1152)
+             (patch_embedding): Linear(in_features=588, out_features=1152, bias=True)
+           )
+           (encoder): SiglipEncoder(
+             (layers): ModuleList(
+               (0-25): 26 x FullyShardedDataParallel(
+                 (_fsdp_wrapped_module): CheckpointWrapper(
+                   (_checkpoint_wrapped_module): SiglipEncoderLayer(
+                     (self_attn): SiglipFlashAttention2(
+                       (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
+                       (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
+                       (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
+                       (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
+                     )
+                     (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+                     (mlp): SiglipMLP(
+                       (activation_fn): PytorchGELUTanh()
+                       (fc1): Linear(in_features=1152, out_features=4304, bias=True)
+                       (fc2): Linear(in_features=4304, out_features=1152, bias=True)
+                     )
+                     (layer_norm2): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+                   )
+                 )
+               )
+             )
+           )
+           (post_layernorm): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
+         )
+       )
+     )
+     (connector): FullyShardedDataParallel(
+       (_fsdp_wrapped_module): CheckpointWrapper(
+         (_checkpoint_wrapped_module): MLPconnector(
+           (activation_fn): PytorchGELUTanh()
+           (fc1): Linear(in_features=1152, out_features=3584, bias=True)
+           (fc2): Linear(in_features=3584, out_features=3584, bias=True)
+         )
+       )
+     )
+     (vit_pos_embed): FullyShardedDataParallel(
+       (_fsdp_wrapped_module): PositionEmbedding()
+     )
+   )
+ )
+ _flat_param True
+ language_model.model.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
+ language_model.model.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
134
+ language_model.model.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
135
+ language_model.model.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
136
+ language_model.model.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
137
+ language_model.model.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
138
+ language_model.model.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
139
+ language_model.model.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
140
+ language_model.model.layers.26._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
141
+ language_model.model.layers.27._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
142
+ time_embedder._fsdp_wrapped_module._flat_param True
143
+ latent_pos_embed._fsdp_wrapped_module._flat_param False
144
+ vit_model.vision_model._fsdp_wrapped_module._flat_param True
145
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.0._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
146
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.1._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
147
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.2._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
148
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.3._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
149
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.4._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
150
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.5._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
151
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.6._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
152
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.7._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
153
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.8._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
154
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.9._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
155
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.10._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
156
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.11._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
157
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.12._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
158
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.13._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
159
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.14._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
160
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.15._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
161
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.16._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
162
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.17._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
163
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.18._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
164
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.19._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
165
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.20._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
166
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.21._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
167
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.22._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
168
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.23._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
169
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.24._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
170
+ vit_model.vision_model._fsdp_wrapped_module.encoder.layers.25._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
171
+ connector._fsdp_wrapped_module._checkpoint_wrapped_module._flat_param True
172
+ vit_pos_embed._fsdp_wrapped_module._flat_param False
173
+ Preparing Dataset vlm_gym_colorization_celoss/vlm_gym_colorization_train
174
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step0
175
+ Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
176
+ [eval debug] first 3 batch fingerprints:
177
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
178
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
179
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
180
+ ce_avg: 1.3898617029190063, mse_avg: 0.05326032266020775
181
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step500
182
+ Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
183
+ [eval debug] first 3 batch fingerprints:
184
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
185
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
186
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
187
  wandb: Detected [huggingface_hub.inference] in use.
188
  wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
189
  wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
 
915
  [2026-01-25 21:03:00] (step=0000718) Train Loss mse: 0.0111, Train Loss ce: 0.2815, Train Steps/Sec: 0.05,
916
  [2026-01-25 21:03:14] (step=0000719) Train Loss mse: 0.0212, Train Loss ce: 0.3005, Train Steps/Sec: 0.07,
917
  [2026-01-25 21:03:34] (step=0000720) Train Loss mse: 0.0175, Train Loss ce: 0.2827, Train Steps/Sec: 0.05,
918
  [2026-01-25 21:03:53] (step=0000721) Train Loss mse: 0.0203, Train Loss ce: 0.2965, Train Steps/Sec: 0.05,
919
  [2026-01-25 21:04:11] (step=0000722) Train Loss mse: 0.0150, Train Loss ce: 0.2638, Train Steps/Sec: 0.05,
920
  [2026-01-25 21:04:33] (step=0000723) Train Loss mse: 0.0155, Train Loss ce: 0.2844, Train Steps/Sec: 0.05,
 
1008
  [2026-01-25 21:32:44] (step=0000811) Train Loss mse: 0.0192, Train Loss ce: 0.2767, Train Steps/Sec: 0.06,
1009
  [2026-01-25 21:33:04] (step=0000812) Train Loss mse: 0.0235, Train Loss ce: 0.2813, Train Steps/Sec: 0.05,
1010
  [2026-01-25 21:33:19] (step=0000813) Train Loss mse: 0.0263, Train Loss ce: 0.2622, Train Steps/Sec: 0.06,
1011
+ ce_avg: 0.2838270962238312, mse_avg: 0.00850633904337883
1012
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step1000
1013
+ Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
1014
+ [eval debug] first 3 batch fingerprints:
1015
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
1016
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
1017
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
1018
+ ce_avg: 0.4483165144920349, mse_avg: 0.008000156842172146
1019
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step1500
1020
+ Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
1021
+ [eval debug] first 3 batch fingerprints:
1022
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
1023
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
1024
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
1025
+ ce_avg: 0.5879027843475342, mse_avg: 0.008525446057319641
1026
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step2000
1027
+ Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
1028
+ [eval debug] first 3 batch fingerprints:
1029
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
1030
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
1031
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
1032
+ ce_avg: 1.2331268787384033, mse_avg: 0.008829578757286072
1033
  [2026-01-25 21:33:57] (step=0000815) Train Loss mse: 0.0234, Train Loss ce: 0.2941, Train Steps/Sec: 0.07,
1034
  [2026-01-25 21:34:16] (step=0000816) Train Loss mse: 0.0220, Train Loss ce: 0.2732, Train Steps/Sec: 0.05,
1035
  [2026-01-25 21:34:34] (step=0000817) Train Loss mse: 0.0237, Train Loss ce: 0.2544, Train Steps/Sec: 0.05,
 
2103
  [2026-01-26 03:14:59] (step=0001885) Train Loss mse: 0.0153, Train Loss ce: 0.2766, Train Steps/Sec: 0.05,
2104
  [2026-01-26 03:15:19] (step=0001886) Train Loss mse: 0.0292, Train Loss ce: 0.2450, Train Steps/Sec: 0.05,
2105
  [2026-01-26 03:15:36] (step=0001887) Train Loss mse: 0.0145, Train Loss ce: 0.2653, Train Steps/Sec: 0.06,
2106
  [2026-01-26 03:15:56] (step=0001888) Train Loss mse: 0.0274, Train Loss ce: 0.2608, Train Steps/Sec: 0.05,
2107
  [2026-01-26 03:16:09] (step=0001889) Train Loss mse: 0.0220, Train Loss ce: 0.2631, Train Steps/Sec: 0.08,
2108
  [2026-01-26 03:16:27] (step=0001890) Train Loss mse: 0.0193, Train Loss ce: 0.2902, Train Steps/Sec: 0.06,
 
2187
  [2026-01-26 03:41:34] (step=0001969) Train Loss mse: 0.0250, Train Loss ce: 0.2561, Train Steps/Sec: 0.06,
2188
  [2026-01-26 03:41:53] (step=0001970) Train Loss mse: 0.0162, Train Loss ce: 0.2314, Train Steps/Sec: 0.05,
2189
  [2026-01-26 03:42:15] (step=0001971) Train Loss mse: 0.0232, Train Loss ce: 0.2660, Train Steps/Sec: 0.04,
2190
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step2500
2191
+ Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
2192
+ [eval debug] first 3 batch fingerprints:
2193
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
2194
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
2195
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
2196
+ ce_avg: 1.185328722000122, mse_avg: 0.008346919901669025
2197
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step3000
2198
+ Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
2199
+ [eval debug] first 3 batch fingerprints:
2200
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
2201
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
2202
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
2203
+ ce_avg: 0.24374419450759888, mse_avg: 0.007726446725428104
2204
  [2026-01-26 03:42:35] (step=0001972) Train Loss mse: 0.0179, Train Loss ce: 0.2603, Train Steps/Sec: 0.05,
2205
  [2026-01-26 03:42:54] (step=0001973) Train Loss mse: 0.0221, Train Loss ce: 0.2504, Train Steps/Sec: 0.05,
2206
  [2026-01-26 03:43:14] (step=0001974) Train Loss mse: 0.0300, Train Loss ce: 0.2581, Train Steps/Sec: 0.05,
 
3231
  [2026-01-26 09:12:03] (step=0002996) Train Loss mse: 0.0169, Train Loss ce: 0.2552, Train Steps/Sec: 0.06,
3232
  [2026-01-26 09:12:21] (step=0002997) Train Loss mse: 0.0187, Train Loss ce: 0.2464, Train Steps/Sec: 0.06,
3233
  [2026-01-26 09:12:42] (step=0002998) Train Loss mse: 0.0267, Train Loss ce: 0.2342, Train Steps/Sec: 0.05,
3234
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step3500
3235
+ Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
3236
+ [eval debug] first 3 batch fingerprints:
3237
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
3238
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
3239
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
3240
+ ce_avg: 0.2417636662721634, mse_avg: 0.007533130701631308
3241
  [2026-01-26 09:13:01] (step=0002999) Train Loss mse: 0.0298, Train Loss ce: 0.2517, Train Steps/Sec: 0.05,
3242
  [2026-01-26 09:14:50] (step=0003000) Train Loss mse: 0.0192, Train Loss ce: 0.2577, Train Steps/Sec: 0.01,
3243
  [2026-01-26 09:15:09] (step=0003001) Train Loss mse: 0.0150, Train Loss ce: 0.2598, Train Steps/Sec: 0.05,
 
3263
  [2026-01-26 09:21:26] (step=0003021) Train Loss mse: 0.0134, Train Loss ce: 0.2563, Train Steps/Sec: 0.05,
3264
  [2026-01-26 09:21:45] (step=0003022) Train Loss mse: 0.0198, Train Loss ce: 0.2363, Train Steps/Sec: 0.05,
3265
  [2026-01-26 09:22:06] (step=0003023) Train Loss mse: 0.0328, Train Loss ce: 0.2532, Train Steps/Sec: 0.05,
3266
+ [2026-01-26 09:22:24] (step=0003024) Train Loss mse: 0.0296, Train Loss ce: 0.2491, Train Steps/Sec: 0.06,
3267
+ [2026-01-26 09:22:40] (step=0003025) Train Loss mse: 0.0156, Train Loss ce: 0.2473, Train Steps/Sec: 0.06,
3268
+ [2026-01-26 09:22:57] (step=0003026) Train Loss mse: 0.0173, Train Loss ce: 0.2442, Train Steps/Sec: 0.06,
3269
+ [2026-01-26 09:23:17] (step=0003027) Train Loss mse: 0.0204, Train Loss ce: 0.2472, Train Steps/Sec: 0.05,
3270
+ [2026-01-26 09:23:36] (step=0003028) Train Loss mse: 0.0190, Train Loss ce: 0.2462, Train Steps/Sec: 0.05,
3271
+ [2026-01-26 09:23:57] (step=0003029) Train Loss mse: 0.0203, Train Loss ce: 0.2455, Train Steps/Sec: 0.05,
3272
+ [2026-01-26 09:24:19] (step=0003030) Train Loss mse: 0.0311, Train Loss ce: 0.2568, Train Steps/Sec: 0.05,
3273
+ [2026-01-26 09:24:38] (step=0003031) Train Loss mse: 0.0205, Train Loss ce: 0.2451, Train Steps/Sec: 0.05,
3274
+ [2026-01-26 09:24:55] (step=0003032) Train Loss mse: 0.0161, Train Loss ce: 0.2535, Train Steps/Sec: 0.06,
3275
+ [2026-01-26 09:25:10] (step=0003033) Train Loss mse: 0.0199, Train Loss ce: 0.2518, Train Steps/Sec: 0.07,
3276
+ [2026-01-26 09:25:30] (step=0003034) Train Loss mse: 0.0306, Train Loss ce: 0.2664, Train Steps/Sec: 0.05,
3277
+ [2026-01-26 09:25:49] (step=0003035) Train Loss mse: 0.0209, Train Loss ce: 0.2695, Train Steps/Sec: 0.05,
3278
+ [2026-01-26 09:26:09] (step=0003036) Train Loss mse: 0.0244, Train Loss ce: 0.2414, Train Steps/Sec: 0.05,
3279
+ [2026-01-26 09:26:25] (step=0003037) Train Loss mse: 0.0370, Train Loss ce: 0.2761, Train Steps/Sec: 0.06,
3280
  [2026-01-26 09:26:47] (step=0003038) Train Loss mse: 0.0324, Train Loss ce: 0.2545, Train Steps/Sec: 0.04,
3281
  [2026-01-26 09:27:08] (step=0003039) Train Loss mse: 0.0328, Train Loss ce: 0.2373, Train Steps/Sec: 0.05,
3282
  [2026-01-26 09:27:29] (step=0003040) Train Loss mse: 0.0207, Train Loss ce: 0.2709, Train Steps/Sec: 0.05,
 
4165
  [2026-01-26 14:10:34] (step=0003923) Train Loss mse: 0.0186, Train Loss ce: 0.2431, Train Steps/Sec: 0.06,
4166
  [2026-01-26 14:10:49] (step=0003924) Train Loss mse: 0.0123, Train Loss ce: 0.2697, Train Steps/Sec: 0.06,
4167
  [2026-01-26 14:11:05] (step=0003925) Train Loss mse: 0.0176, Train Loss ce: 0.2535, Train Steps/Sec: 0.06,
4168
  [2026-01-26 14:11:24] (step=0003926) Train Loss mse: 0.0236, Train Loss ce: 0.2457, Train Steps/Sec: 0.05,
4169
  [2026-01-26 14:11:47] (step=0003927) Train Loss mse: 0.0151, Train Loss ce: 0.2381, Train Steps/Sec: 0.04,
4170
  [2026-01-26 14:12:08] (step=0003928) Train Loss mse: 0.0155, Train Loss ce: 0.2358, Train Steps/Sec: 0.05,
 
4196
  [2026-01-26 14:20:02] (step=0003954) Train Loss mse: 0.0229, Train Loss ce: 0.2695, Train Steps/Sec: 0.06,
4197
  [2026-01-26 14:20:22] (step=0003955) Train Loss mse: 0.0199, Train Loss ce: 0.2492, Train Steps/Sec: 0.05,
4198
  [2026-01-26 14:20:43] (step=0003956) Train Loss mse: 0.0114, Train Loss ce: 0.2503, Train Steps/Sec: 0.05,
4199
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step4000
4200
+ Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
4201
+ [eval debug] first 3 batch fingerprints:
4202
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
4203
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
4204
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
4205
+ ce_avg: 0.24075058102607727, mse_avg: 0.007322242017835379
4206
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step4500
4207
+ Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
4208
+ [eval debug] first 3 batch fingerprints:
4209
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
4210
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
4211
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
4212
+ ce_avg: 0.23980583250522614, mse_avg: 0.007650755811482668
4213
+ base_dir is /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/eval_used_rows, step_tag is checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins_step5000
4214
+ Preparing Dataset vlm_gym_colorization_celoss_evalonce/vlm_gym_colorization_val
4215
+ [eval debug] first 3 batch fingerprints:
4216
+ fp[0]: [{'data_indexes': [0], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
4217
+ fp[1]: [{'data_indexes': [8], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
4218
+ fp[2]: [{'data_indexes': [16], 'worker_id': 0, 'dataset_name': 'vlm_gym_colorization_celoss_evalonce'}]
4219
+ ce_avg: 0.23948809504508972, mse_avg: 0.007586085703223944
4220
  [2026-01-26 14:21:06] (step=0003957) Train Loss mse: 0.0343, Train Loss ce: 0.2657, Train Steps/Sec: 0.04,
4221
  [2026-01-26 14:21:25] (step=0003958) Train Loss mse: 0.0271, Train Loss ce: 0.2690, Train Steps/Sec: 0.05,
4222
  [2026-01-26 14:21:43] (step=0003959) Train Loss mse: 0.0122, Train Loss ce: 0.2418, Train Steps/Sec: 0.05,
 
5262
  [2026-01-26 19:56:56] (step=0004999) Train Loss mse: 0.0202, Train Loss ce: 0.2401, Train Steps/Sec: 0.04,
5263
  [2026-01-26 19:58:44] (step=0005000) Train Loss mse: 0.0209, Train Loss ce: 0.2473, Train Steps/Sec: 0.01,
5264
  [2026-01-26 19:58:44] Saving checkpoint to /dev/shm/models/checkpoints_vlm_gym_colorization_one_image_lr2e_5_ce_ins/0005000.
5265
+ [2026-01-26 20:01:32] Done!