opt-babylm1_seed-42_1e-6

This model was trained from scratch; the training dataset is not documented here, though the model name suggests the BabyLM corpus. It achieves the following results on the evaluation set:

  • Loss: 2.9858

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-06
  • train_batch_size: 32
  • eval_batch_size: 64
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 256
  • optimizer: AdamW (torch fused, `ADAMW_TORCH_FUSED`) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.05
  • num_epochs: 20.0
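The hyperparameters above imply an effective batch size of 32 × 8 = 256 and a warmup phase covering 5% of training. A minimal sketch of how these values interact under the Hugging Face "linear" schedule; the total step count (~47,500) is an estimate inferred from the training log below, not a value stated in the card:

```python
# How the effective batch size and linear-warmup LR schedule fit together,
# using the hyperparameters listed above. total_steps is an ASSUMPTION
# (~20 epochs at roughly 2,375 optimizer steps per epoch).

train_batch_size = 32
gradient_accumulation_steps = 8
effective_batch = train_batch_size * gradient_accumulation_steps  # 256

peak_lr = 1e-6
warmup_ratio = 0.05
total_steps = 47_500                             # assumed, see lead-in
warmup_steps = int(total_steps * warmup_ratio)   # 2,375

def lr_at(step: int) -> float:
    """Linear ramp from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(effective_batch)       # 256
print(lr_at(warmup_steps))   # peak learning rate: 1e-06
```

With this schedule the learning rate peaks at 1e-06 around step 2,375 and decays to zero by the end of epoch 20.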

Training results

| Training Loss | Epoch   | Step  | Validation Loss |
|:-------------:|:-------:|:-----:|:---------------:|
| 5.964         | 0.4206  | 1000  | 5.8746          |
| 4.9145        | 0.8413  | 2000  | 4.8984          |
| 4.5344        | 1.2616  | 3000  | 4.5433          |
| 4.2322        | 1.6823  | 4000  | 4.2203          |
| 3.9694        | 2.1026  | 5000  | 3.9752          |
| 3.7958        | 2.5233  | 6000  | 3.7851          |
| 3.6053        | 2.9439  | 7000  | 3.5977          |
| 3.4888        | 3.3643  | 8000  | 3.4823          |
| 3.404         | 3.7849  | 9000  | 3.3924          |
| 3.3238        | 4.2053  | 10000 | 3.3380          |
| 3.2816        | 4.6259  | 11000 | 3.2862          |
| 3.2171        | 5.0463  | 12000 | 3.2522          |
| 3.1938        | 5.4669  | 13000 | 3.2163          |
| 3.1599        | 5.8875  | 14000 | 3.1851          |
| 3.124         | 6.3079  | 15000 | 3.1696          |
| 3.1094        | 6.7285  | 16000 | 3.1485          |
| 3.0677        | 7.1489  | 17000 | 3.1319          |
| 3.0715        | 7.5695  | 18000 | 3.1178          |
| 3.0578        | 7.9902  | 19000 | 3.1009          |
| 3.0319        | 8.4105  | 20000 | 3.0907          |
| 3.0204        | 8.8312  | 21000 | 3.0804          |
| 2.9903        | 9.2515  | 22000 | 3.0694          |
| 2.9874        | 9.6722  | 23000 | 3.0618          |
| 2.9539        | 10.0925 | 24000 | 3.0564          |
| 2.9538        | 10.5132 | 25000 | 3.0468          |
| 2.9552        | 10.9338 | 26000 | 3.0397          |
| 2.9319        | 11.3542 | 27000 | 3.0366          |
| 2.9305        | 11.7748 | 28000 | 3.0280          |
| 2.9145        | 12.1952 | 29000 | 3.0254          |
| 2.9091        | 12.6158 | 30000 | 3.0211          |
| 2.8855        | 13.0362 | 31000 | 3.0164          |
| 2.8941        | 13.4568 | 32000 | 3.0127          |
| 2.886         | 13.8774 | 33000 | 3.0080          |
| 2.8712        | 14.2978 | 34000 | 3.0073          |
| 2.8764        | 14.7184 | 35000 | 3.0029          |
| 2.8622        | 15.1388 | 36000 | 3.0007          |
| 2.865         | 15.5594 | 37000 | 2.9975          |
| 2.862         | 15.9801 | 38000 | 2.9947          |
| 2.8394        | 16.4004 | 39000 | 2.9937          |
| 2.85          | 16.8211 | 40000 | 2.9913          |
| 2.8272        | 17.2414 | 41000 | 2.9909          |
| 2.8336        | 17.6621 | 42000 | 2.9888          |
| 2.8245        | 18.0824 | 43000 | 2.9882          |
| 2.8293        | 18.5031 | 44000 | 2.9874          |
| 2.8366        | 18.9237 | 45000 | 2.9864          |
| 2.8243        | 19.3441 | 46000 | 2.9861          |
| 2.8229        | 19.7647 | 47000 | 2.9858          |
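The validation loss is a mean cross-entropy in nats, so it converts directly to perplexity via exp(loss). A quick check on the final value above:

```python
import math

# Convert the final validation loss (mean cross-entropy in nats)
# to perplexity: ppl = exp(loss).
final_val_loss = 2.9858
perplexity = math.exp(final_val_loss)
print(round(perplexity, 2))  # 19.8
```

A final perplexity of roughly 19.8 is a sanity check on the loss curve, not a standardized benchmark score.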

Framework versions

  • Transformers 4.54.0
  • PyTorch 2.10.0+cu128
  • Datasets 3.2.0
  • Tokenizers 0.21.4
Model size: 97.8M parameters (Safetensors, F16 tensors)