[NeMo W 2024-03-18 05:25:14 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(

[NeMo I 2024-03-18 05:25:14 train_gpt_sft:118]

************** Experiment configuration ***********
[NeMo I 2024-03-18 05:25:14 train_gpt_sft:119]
name: gemma-7b-sql-nemo
trainer:
  num_nodes: 1
  devices: 8
  accelerator: gpu
  precision: bf16
  sft:
    max_epochs: 1
    max_steps: -1
    val_check_interval: 1000
    save_interval: ${.val_check_interval}
    limit_val_batches: 40
    gradient_clip_val: 1.0
  logger: false
  enable_checkpointing: false
  use_distributed_sampler: false
  max_time: null
  max_epochs: ${.sft.max_epochs}
  max_steps: ${.sft.max_steps}
exp_manager:
  explicit_log_dir: models/gemma-7b-sql-nemo
  exp_dir: null
  name: ${name}
  create_wandb_logger: false
  wandb_logger_kwargs:
    project: null
    name: null
  resume_if_exists: true
  resume_ignore_no_checkpoint: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    monitor: validation_loss
    save_top_k: 5
    mode: min
    save_nemo_on_train_end: true
    filename: megatron_gpt_sft--{${.monitor}:.3f}-{step}-{consumed_samples}-{epoch}
    model_parallel_size: ${model.tensor_model_parallel_size}
    save_best_model: false
model:
  seed: 1234
  tensor_model_parallel_size: 4
  pipeline_model_parallel_size: 1
  restore_from_path: /workspace/models/pytorch-7b-pt.nemo
  resume_from_checkpoint: null
  save_nemo_on_validation_end: true
  sync_batch_comm: false
  megatron_amp_O2: true
  encoder_seq_length: 4096
  sequence_parallel: false
  activations_checkpoint_granularity: null
  activations_checkpoint_method: null
  activations_checkpoint_num_layers: null
  activations_checkpoint_layers_per_pipeline: null
  answer_only_loss: true
  gradient_as_bucket_view: false
  seq_len_interpolation_factor: null
  use_flash_attention: null
  hidden_dropout: 0.0
  attention_dropout: 0.0
  ffn_dropout: 0.0
  peft:
    peft_scheme: none
    restore_from_path: null
    lora_tuning:
      target_modules:
      - attention_qkv
      adapter_dim: 32
      adapter_dropout: 0.0
      column_init_method: xavier
      row_init_method: zero
      layer_selection: null
      weight_tying: false
      position_embedding_strategy: null
  data:
    chat: false
    chat_prompt_tokens:
      system_turn_start: "\0"
      turn_start: "\x11"
      label_start: "\x12"
      end_of_turn: "\n"
      end_of_name: "\n"
    sample: false
    num_workers: 0
    dataloader_type: single
    train_ds:
      file_path: nsql.jsonl
      global_batch_size: 128
      micro_batch_size: 1
      shuffle: true
      memmap_workers: null
      max_seq_length: 8192
      min_seq_length: 1
      drop_last: true
      label_key: output
      add_eos: true
      add_sep: false
      add_bos: false
      truncation_field: input
      index_mapping_dir: null
      prompt_template: '{input} {output}'
      hf_dataset: false
      truncation_method: right
    validation_ds:
      file_path: nsql.jsonl
      global_batch_size: 128
      micro_batch_size: 1
      shuffle: false
      memmap_workers: ${model.data.train_ds.memmap_workers}
      max_seq_length: ${model.data.train_ds.max_seq_length}
      min_seq_length: 1
      drop_last: true
      label_key: ${model.data.train_ds.label_key}
      add_eos: ${model.data.train_ds.add_eos}
      add_sep: ${model.data.train_ds.add_sep}
      add_bos: ${model.data.train_ds.add_bos}
      truncation_field: ${model.data.train_ds.truncation_field}
      index_mapping_dir: null
      prompt_template: ${model.data.train_ds.prompt_template}
      hf_dataset: false
      truncation_method: right
      output_original_text: true
  optim:
    name: distributed_fused_adam
    lr: 5.0e-06
    weight_decay: 0.01
    betas:
    - 0.9
    - 0.98
    sched:
      name: CosineAnnealing
      warmup_steps: 10
      constant_steps: 1000
      min_lr: 9.0e-07
  bias_activation_fusion: true

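To make the data settings above concrete, here is a minimal sketch, assuming a hypothetical record (the actual contents of nsql.jsonl are not shown in this log), of how `prompt_template`, `label_key`, `answer_only_loss`, and `truncation_field` fit together:

```python
import json

# Hypothetical nsql.jsonl record (illustrative only); the field names follow the
# configured prompt_template '{input} {output}' and label_key 'output'.
record = {
    "input": "CREATE TABLE head (age INTEGER) -- How many heads of departments are older than 56?",
    "output": "SELECT COUNT(*) FROM head WHERE age > 56",
}
print(json.dumps(record))  # one line of nsql.jsonl would look like this

# Each example is rendered with the configured prompt_template before tokenization.
prompt_template = "{input} {output}"
text = prompt_template.format(**record)
print(text)

# answer_only_loss: true  -> loss is computed only on the 'output' (label_key) tokens.
# truncation_field: input -> the 'input' side is truncated first if the rendered
#                            example exceeds max_seq_length.
```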
[NeMo W 2024-03-18 05:25:14 exp_manager:630] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :models/gemma-7b-sql-nemo/checkpoints. Training from scratch.
[NeMo I 2024-03-18 05:25:14 exp_manager:396] Experiments will be logged at models/gemma-7b-sql-nemo
[NeMo I 2024-03-18 05:25:14 exp_manager:856] TensorboardLogger has been set up
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo I 2024-03-18 05:25:57 megatron_init:241] Rank 3 has data parallel group : [3, 7]
[NeMo I 2024-03-18 05:25:57 megatron_init:247] Rank 3 has combined group of data parallel and context parallel : [3, 7]
[NeMo I 2024-03-18 05:25:57 megatron_init:252] All data parallel group ranks with context parallel combined: [[0, 4], [1, 5], [2, 6], [3, 7]]
[NeMo I 2024-03-18 05:25:57 megatron_init:255] Ranks 3 has data parallel rank: 0
[NeMo I 2024-03-18 05:25:57 megatron_init:272] Rank 3 has context parallel group: [3]
[NeMo I 2024-03-18 05:25:57 megatron_init:275] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]
[NeMo I 2024-03-18 05:25:57 megatron_init:276] Ranks 3 has context parallel rank: 0
[NeMo I 2024-03-18 05:25:57 megatron_init:287] Rank 3 has model parallel group: [0, 1, 2, 3]
[NeMo I 2024-03-18 05:25:57 megatron_init:288] All model parallel group ranks: [[0, 1, 2, 3], [4, 5, 6, 7]]
[NeMo I 2024-03-18 05:25:57 megatron_init:298] Rank 3 has tensor model parallel group: [0, 1, 2, 3]
[NeMo I 2024-03-18 05:25:57 megatron_init:302] All tensor model parallel group ranks: [[0, 1, 2, 3], [4, 5, 6, 7]]
[NeMo I 2024-03-18 05:25:57 megatron_init:303] Rank 3 has tensor model parallel rank: 3
[NeMo I 2024-03-18 05:25:57 megatron_init:317] Rank 3 has pipeline model parallel group: [3]
[NeMo I 2024-03-18 05:25:57 megatron_init:329] Rank 3 has embedding group: [3]
[NeMo I 2024-03-18 05:25:57 megatron_init:335] All pipeline model parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]
[NeMo I 2024-03-18 05:25:57 megatron_init:336] Rank 3 has pipeline model parallel rank 0
[NeMo I 2024-03-18 05:25:57 megatron_init:337] All embedding group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]
[NeMo I 2024-03-18 05:25:57 megatron_init:338] Rank 3 has embedding rank: 0
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo I 2024-03-18 05:25:57 tokenizer_utils:191] Getting SentencePiece with model: /tmp/tmpe7phpf8c/c1f49ba929c24b7e95b7219ca958f881_tokenizer-final.model
[NeMo I 2024-03-18 05:25:57 megatron_base_model:520] Padded vocab_size: 256000, original vocab_size: 256000, dummy tokens: 0.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: tp_comm_overlap in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: tp_comm_bulk_dgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:1078] The model: GPTSFTModel() does not have field.name: barrier_with_L1_time in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:492] The model: GPTSFTModel() does not have field.name: num_moe_experts in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:492] The model: GPTSFTModel() does not have field.name: bias_gelu_fusion in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:492] The model: GPTSFTModel() does not have field.name: fp8_wgrad in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 megatron_base_model:492] The model: GPTSFTModel() does not have field.name: clone_scatter_output_in_embedding in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-03-18 05:25:57 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/module/base.py:611: UserWarning: To guarantee overlapping TP and SP collectives with the backward GEMMs, set environment variable CUDA_DEVICE_MAX_CONNECTIONS = 1
      warnings.warn(

[NeMo I 2024-03-18 05:27:29 nlp_overrides:1100] Model GPTSFTModel was successfully restored from /workspace/models/pytorch-7b-pt.nemo.
[NeMo I 2024-03-18 05:27:29 train_script_utils:169] Running full finetuning since no peft scheme is given.
  | Name  | Type          | Params
----------------------------------------
0 | model | Float16Module | 2.1 B
----------------------------------------
2.1 B     Trainable params
0         Non-trainable params
2.1 B     Total params
8,538.206 Total estimated model params size (MB)
[NeMo I 2024-03-18 05:27:29 text_memmap_dataset:116] Building data files
[NeMo I 2024-03-18 05:27:31 text_memmap_dataset:158] Loading data files
[NeMo I 2024-03-18 05:27:31 text_memmap_dataset:249] Loading nsql.jsonl
[NeMo I 2024-03-18 05:27:31 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000700
[NeMo I 2024-03-18 05:27:31 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-03-18 05:27:31 text_memmap_dataset:116] Building data files
[NeMo I 2024-03-18 05:27:34 text_memmap_dataset:158] Loading data files
[NeMo I 2024-03-18 05:27:34 text_memmap_dataset:249] Loading nsql.jsonl
[NeMo I 2024-03-18 05:27:34 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000550
[NeMo I 2024-03-18 05:27:34 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-03-18 05:27:34 builders:327] Building dataloader with consumed samples: 0
[NeMo W 2024-03-18 05:27:34 experimental:26] `<class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'>` is experimental and not ready for production yet. Use at your own risk.
[NeMo I 2024-03-18 05:27:34 builders:327] Building dataloader with consumed samples: 0
[NeMo W 2024-03-18 05:27:34 experimental:26] `<class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'>` is experimental and not ready for production yet. Use at your own risk.
[NeMo I 2024-03-18 05:27:40 megatron_gpt_model:1296] Pipeline model parallel rank: 0, Tensor model parallel rank: 3, Number of model parameters on device: 2.13e+09. Total number of model parameters: 8.54e+09.
[NeMo I 2024-03-18 05:27:40 modelPT:723] Optimizer config = MegatronDistributedFusedAdam (
    Parameter Group 0
        betas: [0.9, 0.98]
        bias_correction: True
        eps: 1e-08
        lr: 5e-06
        weight_decay: 0.01

    Parameter Group 1
        betas: [0.9, 0.98]
        bias_correction: True
        eps: 1e-08
        lr: 5e-06
        weight_decay: 0.0
    )
[NeMo I 2024-03-18 05:27:40 lr_scheduler:915] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7ec8c934d8d0>"
    will be used during training (effective maximum steps = 613) -
    Parameters :
    (warmup_steps: 10
    constant_steps: 1000
    min_lr: 9.0e-07
    max_steps: 613
    )
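For reference, the 613-step budget reported by the scheduler follows from the data configuration rather than from `max_steps` (which is -1): with `max_epochs: 1`, `global_batch_size: 128`, and `drop_last: true`, the effective step count is simply the number of training examples divided by the global batch size. A rough sanity check, assuming a dataset size inferred from the reported step count:

```python
# Back-of-the-envelope check of "effective maximum steps = 613" above.
# The exact size of nsql.jsonl is an assumption here; any value in
# [613 * 128, 614 * 128) reproduces the reported step count.
num_train_examples = 78_500   # assumed, not taken from the log
global_batch_size = 128
max_epochs = 1

steps_per_epoch = num_train_examples // global_batch_size  # drop_last: true
effective_max_steps = steps_per_epoch * max_epochs
print(effective_max_steps)  # -> 613
```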
|
|