opt-babylm1-ntb_seed-42_5e-6

This model was trained from scratch; the training dataset is not recorded in this card. At the last logged checkpoint (step 46000, epoch 19.7004) it reaches a validation loss of 2.9430 on the evaluation set.

Model description

More information needed

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 32
eval_batch_size: 64
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 256
optimizer: AdamW, fused PyTorch implementation (OptimizerNames.ADAMW_TORCH_FUSED), with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.05
num_epochs: 20.0
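
The hyperparameters above can be related to each other with a little arithmetic. The following is a minimal sketch, not the training script itself: it assumes a single device (32 × 8 already equals the reported total batch size of 256), estimates the total step count from the last logged row of the results table (the card does not state it), and uses a hypothetical helper `linear_lr` to illustrate a linear warmup followed by linear decay to zero.

```python
import math

# Values taken from the hyperparameter list above.
BASE_LR = 5e-6
PER_DEVICE_BATCH = 32
GRAD_ACCUM_STEPS = 8
WARMUP_RATIO = 0.05
NUM_EPOCHS = 20

# Assumption: a single device, since 32 * 8 already equals the reported 256.
NUM_DEVICES = 1
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES

# Assumption: the total step count is estimated from the last logged row
# (step 46000 at epoch 19.7004); the card does not state the exact number.
total_steps = round(46000 / 19.7004 * NUM_EPOCHS)
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    print("effective batch size:", effective_batch)
    print("estimated total steps:", total_steps)
    print("estimated warmup steps:", warmup_steps)
    for step in (0, 1000, warmup_steps, total_steps // 2, total_steps):
        lr = linear_lr(step, BASE_LR, warmup_steps, total_steps)
        print(f"step {step:>6}: lr = {lr:.3e}")
```

Under these assumptions the learning rate peaks at 5e-06 after roughly the first epoch and then falls linearly, which is in line with the flattening of the validation loss in the later epochs.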

Training Loss	Epoch	Step	Validation Loss
5.2628	0.4284	1000	5.2240
4.6939	0.8567	2000	4.7039
4.2394	1.2849	3000	4.2252
3.8775	1.7132	4000	3.8713
3.5715	2.1414	5000	3.5832
3.4254	2.5697	6000	3.4170
3.3181	2.9981	7000	3.3120
3.2325	3.4262	8000	3.2545
3.1780	3.8546	9000	3.1972
3.1138	4.2827	10000	3.1649
3.0877	4.7111	11000	3.1289
3.0290	5.1392	12000	3.1055
3.0266	5.5676	13000	3.0898
3.0196	5.9959	14000	3.0696
2.9691	6.4241	15000	3.0585
2.9671	6.8524	16000	3.0440
2.9281	7.2806	17000	3.0326
2.9242	7.7089	18000	3.0206
2.8772	8.1371	19000	3.0156
2.8874	8.5654	20000	3.0068
2.8931	8.9938	21000	2.9946
2.8553	9.4219	22000	2.9949
2.8518	9.8503	23000	2.9862
2.8246	10.2784	24000	2.9834
2.8336	10.7068	25000	2.9747
2.7922	11.1349	26000	2.9783
2.8071	11.5633	27000	2.9705
2.8117	11.9916	28000	2.9613
2.7766	12.4198	29000	2.9631
2.7804	12.8481	30000	2.9598
2.7576	13.2763	31000	2.9605
2.7614	13.7046	32000	2.9543
2.7294	14.1328	33000	2.9561
2.7397	14.5611	34000	2.9536
2.7472	14.9895	35000	2.9491
2.7139	15.4176	36000	2.9510
2.7276	15.8460	37000	2.9466
2.7032	16.2741	38000	2.9491
2.7140	16.7025	39000	2.9455
2.6882	17.1306	40000	2.9476
2.693	17.5590	41000	2.9455
2.6907	17.9874	42000	2.9425
2.6765	18.4155	43000	2.9444
2.6810	18.8439	44000	2.9423
2.6668	19.2720	45000	2.9433
2.6672	19.7004	46000	2.9430
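
To make the table easier to interpret, the validation loss can be converted to perplexity. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for language-model training but is not stated in this card. It copies only the last five rows of the table and also picks out the row with the lowest validation loss.

```python
import math

# (step, validation_loss) pairs copied from the last rows of the table above.
eval_log = [
    (42000, 2.9425),
    (43000, 2.9444),
    (44000, 2.9423),
    (45000, 2.9433),
    (46000, 2.9430),
]


def perplexity(loss):
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss)


# Row with the lowest validation loss among the copied rows.
best_step, best_loss = min(eval_log, key=lambda row: row[1])
# Last logged row.
final_step, final_loss = eval_log[-1]

if __name__ == "__main__":
    print(f"best:  step {best_step}, loss {best_loss}, ppl {perplexity(best_loss):.2f}")
    print(f"final: step {final_step}, loss {final_loss}, ppl {perplexity(final_loss):.2f}")
```

The lowest validation loss in these rows is at step 44000 rather than at the last logged step, although the difference (2.9423 versus 2.9430) is small.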

Safetensors

Model size: 97.8M params
Tensor type: F16
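
The model size and tensor type above give a rough estimate of how much space the weights occupy. This is a back-of-the-envelope sketch: it assumes every parameter is stored as a 2-byte F16 value and ignores file metadata and any non-F16 tensors.

```python
# Figures taken from the card: 97.8M parameters stored as F16.
NUM_PARAMS = 97.8e6
BYTES_PER_PARAM_F16 = 2  # F16 uses 16 bits, i.e. 2 bytes per value


def weight_bytes(num_params, bytes_per_param):
    """Approximate size of the raw weights in bytes."""
    return num_params * bytes_per_param


size_bytes = weight_bytes(NUM_PARAMS, BYTES_PER_PARAM_F16)
size_mb = size_bytes / 1e6       # decimal megabytes
size_mib = size_bytes / 2**20    # binary mebibytes

if __name__ == "__main__":
    print(f"approx. weight size: {size_mb:.1f} MB ({size_mib:.1f} MiB)")
```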