SystemAdmin123 committed
Commit 775543e · verified · 1 Parent(s): 66506dc

End of training

Files changed (1): README.md +15 -27
README.md CHANGED
@@ -21,7 +21,7 @@ should probably proofread and complete it, then remove this comment. -->
  axolotl version: `0.6.0`
  ```yaml
  base_model: JackFram/llama-68m
- batch_size: 64
+ batch_size: 128
  bf16: true
  chat_template: tokenizer_default_fallback_alpaca
  datasets:
@@ -37,7 +37,7 @@ datasets:
  system_prompt: ''
  device_map: auto
  eval_sample_packing: false
- eval_steps: 50
+ eval_steps: 200
  flash_attention: true
  gradient_checkpointing: true
  group_by_length: true
@@ -46,7 +46,7 @@ hub_strategy: checkpoint
  learning_rate: 0.0002
  logging_steps: 10
  lr_scheduler: cosine
- max_steps: 5000
+ max_steps: 10000
  micro_batch_size: 32
  model_type: AutoModelForCausalLM
  num_epochs: 100
@@ -55,13 +55,15 @@ output_dir: /root/.sn56/axolotl/tmp/llama-68m
  pad_to_sequence_len: true
  resize_token_embeddings_to_32x: false
  sample_packing: true
- save_steps: 50
- save_total_limit: 2
+ save_steps: 200
+ save_total_limit: 1
  sequence_len: 2048
  special_tokens:
    pad_token: </s>
  tokenizer_type: LlamaTokenizerFast
  torch_dtype: bf16
+ training_args_kwargs:
+   hub_private_repo: true
  trust_remote_code: true
  val_set_size: 0.1
  wandb_entity: ''
@@ -79,8 +81,6 @@ warmup_ratio: 0.05
  # llama-68m

  This model is a fine-tuned version of [JackFram/llama-68m](https://huggingface.co/JackFram/llama-68m) on the argilla/databricks-dolly-15k-curated-en dataset.
- It achieves the following results on the evaluation set:
- - Loss: 4.0103

  ## Model description

@@ -104,31 +104,19 @@ The following hyperparameters were used during training:
  - eval_batch_size: 32
  - seed: 42
  - distributed_type: multi-GPU
- - num_devices: 2
- - total_train_batch_size: 64
- - total_eval_batch_size: 64
+ - num_devices: 4
+ - total_train_batch_size: 128
+ - total_eval_batch_size: 128
  - optimizer: Use OptimizerNames.ADAMW_BNB with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
  - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 30
- - training_steps: 600
+ - lr_scheduler_warmup_steps: 5
+ - training_steps: 100

  ### Training results

- | Training Loss | Epoch   | Step | Validation Loss |
- |:-------------:|:-------:|:----:|:---------------:|
- | No log        | 0.0769  | 1    | 3.9168          |
- | 2.5978        | 3.8462  | 50   | 2.8149          |
- | 2.0808        | 7.6923  | 100  | 2.9664          |
- | 1.6294        | 11.5385 | 150  | 3.2337          |
- | 1.2699        | 15.3846 | 200  | 3.5217          |
- | 1.0092        | 19.2308 | 250  | 3.7262          |
- | 0.8392        | 23.0769 | 300  | 3.8683          |
- | 0.7428        | 26.9231 | 350  | 3.9435          |
- | 0.6952        | 30.7692 | 400  | 3.9860          |
- | 0.6762        | 34.6154 | 450  | 3.9990          |
- | 0.6739        | 38.4615 | 500  | 4.0167          |
- | 0.6691        | 42.3077 | 550  | 4.0208          |
- | 0.6667        | 46.1538 | 600  | 4.0103          |
+ | Training Loss | Epoch  | Step | Validation Loss |
+ |:-------------:|:------:|:----:|:---------------:|
+ | No log        | 0.1667 | 1    | 3.9323          |


  ### Framework versions
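
The changed values in this commit are mutually consistent, and a short sketch makes the arithmetic explicit. This is illustrative only, not part of the training code, and it assumes `gradient_accumulation_steps` is 1, which the reported totals imply (32 × 4 = 128):

```python
import math

# Effective batch size: micro batch per device times device count
# (assumes gradient_accumulation_steps = 1, implied by the card's totals).
micro_batch_size = 32
num_devices = 4                       # was 2 before this commit
total_train_batch_size = micro_batch_size * num_devices
assert total_train_batch_size == 128  # was 32 * 2 = 64

# Warmup steps follow from warmup_ratio * training_steps.
warmup_ratio = 0.05
training_steps = 100                  # was 600
warmup_steps = math.ceil(warmup_ratio * training_steps)
assert warmup_steps == 5              # was 0.05 * 600 = 30

# The new results table pins step 1 at epoch 0.1667, i.e. about 6 optimizer
# steps per epoch at batch size 128 (vs. 13 per epoch at 64: step 1 = 0.0769).
steps_per_epoch = round(1 / 0.1667)
assert steps_per_epoch == 6
```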
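The card itself carries no usage snippet, so here is a minimal inference sketch with `transformers`. The repo id is hypothetical (the config pushes checkpoints to a private repo via `hub_strategy: checkpoint` and `hub_private_repo: true`, and the card never names it), and the Alpaca-style prompt is an assumption based on `chat_template: tokenizer_default_fallback_alpaca`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id: substitute the actual Hub path this run pushed to.
repo_id = "SystemAdmin123/llama-68m"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # matches torch_dtype: bf16 in the config
)

# Alpaca-style prompt, assumed from the chat_template fallback in the config.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nName three primary colors.\n\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```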