# opt-babylm1-ntb_seed-1024_5e-6

This model was trained from scratch on an unknown dataset. It achieves the following results on the evaluation set:

- Loss: 2.9423
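
Because the reported loss is a mean token-level cross-entropy (in nats), it converts directly to perplexity. The snippet below is a minimal sketch of that conversion together with loading the checkpoint; the repo id is a placeholder, since the card does not state where the weights are hosted.

```python
import math

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id; substitute the actual Hugging Face Hub path.
repo_id = "opt-babylm1-ntb_seed-1024_5e-6"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Cross-entropy in nats -> perplexity via exponentiation.
val_loss = 2.9423
print(f"validation perplexity = exp({val_loss}) ≈ {math.exp(val_loss):.2f}")  # ≈ 18.96
```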

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-06
- train_batch_size: 32
- eval_batch_size: 64
- seed: 1024
- gradient_accumulation_steps: 8
- total_train_batch_size: 256
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 20.0
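
These settings map one-to-one onto the `transformers` `TrainingArguments` API; the sketch below is an assumed reconstruction, not the exact training script, and the output directory is a placeholder. Note that the effective batch size of 256 is train_batch_size × gradient_accumulation_steps = 32 × 8.

```python
from transformers import TrainingArguments

# A minimal sketch of the configuration above; output_dir is a placeholder.
# Effective batch size: 32 (per device) x 8 (accumulation steps) = 256.
training_args = TrainingArguments(
    output_dir="opt-babylm1-ntb_seed-1024_5e-6",
    learning_rate=5e-6,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    seed=1024,
    gradient_accumulation_steps=8,
    optim="adamw_torch_fused",   # AdamW, betas=(0.9, 0.999), eps=1e-08
    lr_scheduler_type="linear",
    warmup_ratio=0.05,
    num_train_epochs=20.0,
)
```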

### Training results

| Training Loss | Epoch   | Step  | Validation Loss |
|:-------------:|:-------:|:-----:|:---------------:|
| 5.2931        | 0.4284  | 1000  | 5.2469          |
| 4.7248        | 0.8567  | 2000  | 4.7222          |
| 4.2333        | 1.2849  | 3000  | 4.2283          |
| 3.8715        | 1.7132  | 4000  | 3.8683          |
| 3.56          | 2.1414  | 5000  | 3.5667          |
| 3.4165        | 2.5697  | 6000  | 3.4090          |
| 3.3154        | 2.9981  | 7000  | 3.3184          |
| 3.2294        | 3.4262  | 8000  | 3.2480          |
| 3.1789        | 3.8546  | 9000  | 3.1970          |
| 3.1107        | 4.2827  | 10000 | 3.1586          |
| 3.0822        | 4.7111  | 11000 | 3.1289          |
| 3.0348        | 5.1392  | 12000 | 3.1093          |
| 3.0284        | 5.5676  | 13000 | 3.0849          |
| 3.0121        | 5.9959  | 14000 | 3.0679          |
| 2.9631        | 6.4241  | 15000 | 3.0565          |
| 2.9636        | 6.8524  | 16000 | 3.0366          |
| 2.9202        | 7.2806  | 17000 | 3.0331          |
| 2.9172        | 7.7089  | 18000 | 3.0174          |
| 2.8756        | 8.1371  | 19000 | 3.0138          |
| 2.88          | 8.5654  | 20000 | 3.0041          |
| 2.8866        | 8.9938  | 21000 | 2.9964          |
| 2.8505        | 9.4219  | 22000 | 2.9931          |
| 2.857         | 9.8503  | 23000 | 2.9814          |
| 2.8132        | 10.2784 | 24000 | 2.9835          |
| 2.8305        | 10.7068 | 25000 | 2.9775          |
| 2.7874        | 11.1349 | 26000 | 2.9745          |
| 2.806         | 11.5633 | 27000 | 2.9664          |
| 2.8069        | 11.9916 | 28000 | 2.9603          |
| 2.776         | 12.4198 | 29000 | 2.9656          |
| 2.7852        | 12.8481 | 30000 | 2.9596          |
| 2.754         | 13.2763 | 31000 | 2.9598          |
| 2.7609        | 13.7046 | 32000 | 2.9539          |
| 2.7254        | 14.1328 | 33000 | 2.9552          |
| 2.74          | 14.5611 | 34000 | 2.9548          |
| 2.7416        | 14.9895 | 35000 | 2.9452          |
| 2.7196        | 15.4176 | 36000 | 2.9497          |
| 2.7254        | 15.8460 | 37000 | 2.9453          |
| 2.6957        | 16.2741 | 38000 | 2.9476          |
| 2.7028        | 16.7025 | 39000 | 2.9434          |
| 2.6772        | 17.1306 | 40000 | 2.9465          |
| 2.6905        | 17.5590 | 41000 | 2.9442          |
| 2.6925        | 17.9874 | 42000 | 2.9403          |
| 2.6761        | 18.4155 | 43000 | 2.9433          |
| 2.6742        | 18.8439 | 44000 | 2.9416          |
| 2.6641        | 19.2720 | 45000 | 2.9428          |
| 2.6606        | 19.7004 | 46000 | 2.9423          |
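
As a rough sanity check on scale (an estimate derived from the log above, not a figure stated in the card), the epoch and step columns imply about 2,334 optimizer steps per epoch, i.e. roughly 600k training sequences per epoch at the effective batch size of 256:

```python
# Approximate dataset scale from the log above (assumes the epoch
# value recorded at step 1000 is exact).
steps_per_epoch = 1000 / 0.4284               # ≈ 2,334 optimizer steps
sequences_per_epoch = steps_per_epoch * 256   # effective batch size
print(f"≈ {sequences_per_epoch:,.0f} sequences per epoch")  # ≈ 597,572
```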

## Framework versions

- Transformers 4.54.0
- PyTorch 2.10.0+cu128
- Datasets 3.2.0
- Tokenizers 0.21.4