opt-babylm1_seed-1024_5e-6

This model was trained from scratch on an unknown dataset. It achieves the following results on the evaluation set:

  • Loss: 2.9422
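For context, a cross-entropy loss in nats per token converts to perplexity by exponentiation; a quick check on the reported value:

```python
import math

# Validation cross-entropy loss reported above (nats per token).
eval_loss = 2.9422

# Perplexity is the exponential of the cross-entropy loss.
perplexity = math.exp(eval_loss)
print(f"Validation perplexity: {perplexity:.2f}")  # ~18.96
```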

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 32
  • eval_batch_size: 64
  • seed: 1024
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 256
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.05
  • num_epochs: 20.0
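Note that train_batch_size (32) × gradient_accumulation_steps (8) = total_train_batch_size (256). The schedule above (linear decay with a 5% warmup ratio) can be sketched as follows; the step counts are hypothetical, since the actual total depends on dataset size:

```python
# Sketch of a linear schedule with warmup, matching the hyperparameters
# above (lr_scheduler_type=linear, lr_scheduler_warmup_ratio=0.05).
PEAK_LR = 5e-6
WARMUP_RATIO = 0.05

def linear_lr(step: int, total_steps: int) -> float:
    """Linear warmup from 0 to PEAK_LR, then linear decay back to 0."""
    warmup_steps = int(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    return PEAK_LR * (total_steps - step) / (total_steps - warmup_steps)

total = 10_000  # hypothetical run length in optimizer steps
print(linear_lr(500, total))    # end of warmup: peak learning rate
print(linear_lr(total, total))  # end of training: 0.0
```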

Training results

| Training Loss | Epoch   | Step  | Validation Loss |
|:--------------|:--------|:------|:----------------|
| 5.2297        | 0.4206  | 1000  | 5.1826          |
| 4.7637        | 0.8413  | 2000  | 4.7401          |
| 4.299         | 1.2616  | 3000  | 4.2874          |
| 3.9607        | 1.6823  | 4000  | 3.9489          |
| 3.6494        | 2.1026  | 5000  | 3.6517          |
| 3.4719        | 2.5233  | 6000  | 3.4667          |
| 3.3668        | 2.9439  | 7000  | 3.3572          |
| 3.2732        | 3.3643  | 8000  | 3.2821          |
| 3.2214        | 3.7849  | 9000  | 3.2332          |
| 3.1418        | 4.2053  | 10000 | 3.1875          |
| 3.1246        | 4.6259  | 11000 | 3.1543          |
| 3.0548        | 5.0463  | 12000 | 3.1296          |
| 3.051         | 5.4669  | 13000 | 3.1023          |
| 3.0457        | 5.8875  | 14000 | 3.0826          |
| 2.9936        | 6.3079  | 15000 | 3.0715          |
| 2.9888        | 6.7285  | 16000 | 3.0553          |
| 2.9411        | 7.1489  | 17000 | 3.0484          |
| 2.9517        | 7.5695  | 18000 | 3.0344          |
| 2.9439        | 7.9902  | 19000 | 3.0212          |
| 2.9073        | 8.4105  | 20000 | 3.0138          |
| 2.9174        | 8.8312  | 21000 | 3.0022          |
| 2.8815        | 9.2515  | 22000 | 3.0026          |
| 2.8825        | 9.6722  | 23000 | 2.9974          |
| 2.8312        | 10.0925 | 24000 | 2.9940          |
| 2.8472        | 10.5132 | 25000 | 2.9875          |
| 2.8536        | 10.9338 | 26000 | 2.9748          |
| 2.8264        | 11.3542 | 27000 | 2.9771          |
| 2.8321        | 11.7748 | 28000 | 2.9682          |
| 2.7887        | 12.1952 | 29000 | 2.9709          |
| 2.7964        | 12.6158 | 30000 | 2.9657          |
| 2.7693        | 13.0362 | 31000 | 2.9662          |
| 2.7821        | 13.4568 | 32000 | 2.9598          |
| 2.7789        | 13.8774 | 33000 | 2.9547          |
| 2.7499        | 14.2978 | 34000 | 2.9573          |
| 2.7644        | 14.7184 | 35000 | 2.9529          |
| 2.7347        | 15.1388 | 36000 | 2.9533          |
| 2.736         | 15.5594 | 37000 | 2.9505          |
| 2.7476        | 15.9801 | 38000 | 2.9454          |
| 2.7259        | 16.4004 | 39000 | 2.9481          |
| 2.7222        | 16.8211 | 40000 | 2.9446          |
| 2.7054        | 17.2414 | 41000 | 2.9468          |
| 2.7133        | 17.6621 | 42000 | 2.9437          |
| 2.6935        | 18.0824 | 43000 | 2.9455          |
| 2.6976        | 18.5031 | 44000 | 2.9438          |
| 2.7072        | 18.9237 | 45000 | 2.9419          |
| 2.6934        | 19.3441 | 46000 | 2.9426          |
| 2.6919        | 19.7647 | 47000 | 2.9422          |
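A quick sanity check on the last few rows of the table above: the validation loss moves by well under 0.005 per 1000 steps at the end of the run, so training is close to converged at 20 epochs.

```python
# Last few evaluation rows from the table above: (step, validation loss).
last_rows = [
    (44000, 2.9438),
    (45000, 2.9419),
    (46000, 2.9426),
    (47000, 2.9422),
]

# Step-to-step changes in validation loss over the final 3000 steps.
deltas = [b[1] - a[1] for a, b in zip(last_rows, last_rows[1:])]
print(deltas)  # all changes well under 0.005 in magnitude
```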

Framework versions

  • Transformers 4.54.0
  • PyTorch 2.10.0+cu128
  • Datasets 3.2.0
  • Tokenizers 0.21.4
Model details

  • Model size: 97.8M params (Safetensors)
  • Tensor type: F16