# opt-babylm1-randomremoval_seed-1024_5e-6
This model was trained from scratch on an unknown dataset. It achieves the following results on the evaluation set:
- Loss: 2.9421
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-06
- train_batch_size: 32
- eval_batch_size: 64
- seed: 1024
- gradient_accumulation_steps: 8
- total_train_batch_size: 256
- optimizer: AdamW (torch fused implementation) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 20.0
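The effective batch size and the linear warmup schedule implied by the hyperparameters above can be sketched in plain Python. This is a minimal sketch: the step count is approximated from the results table (~2,361 optimizer steps per epoch × 20 epochs), and the schedule mirrors the shape of a linear-with-warmup scheduler rather than reproducing the exact Trainer internals.

```python
# Effective batch size: per-device batch * gradient accumulation steps
# (a single training device is assumed here).
train_batch_size = 32
gradient_accumulation_steps = 8
total_train_batch_size = train_batch_size * gradient_accumulation_steps  # 256

# Linear LR schedule with warmup (warmup_ratio = 0.05 of total steps).
base_lr = 5e-6
warmup_ratio = 0.05
total_steps = 47_220  # approximate, derived from the results table below

def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step under linear warmup/decay."""
    warmup_steps = int(warmup_ratio * total_steps)  # ~2,361 steps
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 over the remaining steps.
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

After warmup (~step 2,361) the learning rate peaks at 5e-6 and then decays linearly to zero by the final step.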
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 5.2575 | 0.4236 | 1000 | 5.2274 |
| 4.6932 | 0.8472 | 2000 | 4.6903 |
| 4.2391 | 1.2707 | 3000 | 4.2385 |
| 3.872 | 1.6943 | 4000 | 3.8540 |
| 3.5682 | 2.1178 | 5000 | 3.5667 |
| 3.4179 | 2.5414 | 6000 | 3.4043 |
| 3.3127 | 2.9649 | 7000 | 3.3098 |
| 3.2297 | 3.3884 | 8000 | 3.2466 |
| 3.1928 | 3.8120 | 9000 | 3.1968 |
| 3.1198 | 4.2355 | 10000 | 3.1613 |
| 3.102 | 4.6591 | 11000 | 3.1299 |
| 3.0285 | 5.0826 | 12000 | 3.1074 |
| 3.0355 | 5.5062 | 13000 | 3.0867 |
| 3.0242 | 5.9298 | 14000 | 3.0685 |
| 2.9815 | 6.3533 | 15000 | 3.0631 |
| 2.9801 | 6.7769 | 16000 | 3.0397 |
| 2.932 | 7.2004 | 17000 | 3.0326 |
| 2.9292 | 7.6240 | 18000 | 3.0192 |
| 2.8848 | 8.0474 | 19000 | 3.0173 |
| 2.8934 | 8.4710 | 20000 | 3.0084 |
| 2.8941 | 8.8946 | 21000 | 2.9943 |
| 2.8572 | 9.3181 | 22000 | 2.9946 |
| 2.8679 | 9.7417 | 23000 | 2.9851 |
| 2.8201 | 10.1652 | 24000 | 2.9837 |
| 2.8429 | 10.5888 | 25000 | 2.9795 |
| 2.8276 | 11.0123 | 26000 | 2.9780 |
| 2.8103 | 11.4359 | 27000 | 2.9710 |
| 2.8168 | 11.8595 | 28000 | 2.9672 |
| 2.788 | 12.2830 | 29000 | 2.9668 |
| 2.7959 | 12.7066 | 30000 | 2.9622 |
| 2.7593 | 13.1300 | 31000 | 2.9603 |
| 2.7712 | 13.5536 | 32000 | 2.9570 |
| 2.7757 | 13.9772 | 33000 | 2.9500 |
| 2.7449 | 14.4007 | 34000 | 2.9553 |
| 2.7557 | 14.8243 | 35000 | 2.9492 |
| 2.7215 | 15.2478 | 36000 | 2.9536 |
| 2.7413 | 15.6714 | 37000 | 2.9473 |
| 2.7039 | 16.0949 | 38000 | 2.9495 |
| 2.7163 | 16.5185 | 39000 | 2.9461 |
| 2.7198 | 16.9421 | 40000 | 2.9426 |
| 2.6968 | 17.3656 | 41000 | 2.9458 |
| 2.7002 | 17.7892 | 42000 | 2.9421 |
| 2.6841 | 18.2126 | 43000 | 2.9447 |
| 2.6839 | 18.6362 | 44000 | 2.9430 |
| 2.6723 | 19.0597 | 45000 | 2.9432 |
| 2.6754 | 19.4833 | 46000 | 2.9427 |
| 2.6792 | 19.9069 | 47000 | 2.9421 |
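For reference, the validation loss is a mean cross-entropy in nats per token, so the final value of 2.9421 corresponds to a perplexity of roughly exp(2.9421) ≈ 18.96:

```python
import math

# Perplexity is the exponential of the mean cross-entropy loss (in nats).
final_val_loss = 2.9421
perplexity = math.exp(final_val_loss)
print(f"{perplexity:.2f}")  # ~18.96
```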
### Framework versions
- Transformers 4.54.0
- Pytorch 2.10.0+cu128
- Datasets 3.2.0
- Tokenizers 0.21.4