opt-babylm1-randomremoval_seed-42_5e-6

This model was trained from scratch on an unknown dataset. It achieves the following results on the evaluation set:

  • Loss: 2.9395
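
Since the evaluation loss is a mean token-level cross-entropy (in nats), it converts to perplexity via exp(loss). A quick sanity check of what the final value above implies:

```python
import math

eval_loss = 2.9395  # final validation loss reported above
perplexity = math.exp(eval_loss)
print(f"perplexity ≈ {perplexity:.1f}")  # perplexity ≈ 18.9
```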

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 32
  • eval_batch_size: 64
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 256
  • optimizer: AdamW (adamw_torch_fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.05
  • num_epochs: 20.0
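
As a sanity check on the values above, the total train batch size is the per-device batch size times the gradient-accumulation steps (a minimal sketch; single-device training is assumed):

```python
train_batch_size = 32
gradient_accumulation_steps = 8

# Effective batch size per optimizer step
total_train_batch_size = train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)  # 256, matching the value listed above
```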

Training results

| Training Loss | Epoch   | Step  | Validation Loss |
|:-------------:|:-------:|:-----:|:---------------:|
| 5.2712        | 0.4236  | 1000  | 5.2297          |
| 4.7125        | 0.8472  | 2000  | 4.7089          |
| 4.2105        | 1.2707  | 3000  | 4.2034          |
| 3.8531        | 1.6943  | 4000  | 3.8440          |
| 3.5679        | 2.1178  | 5000  | 3.5593          |
| 3.408         | 2.5414  | 6000  | 3.4013          |
| 3.3092        | 2.9649  | 7000  | 3.3081          |
| 3.2229        | 3.3884  | 8000  | 3.2421          |
| 3.1825        | 3.8120  | 9000  | 3.1910          |
| 3.1175        | 4.2355  | 10000 | 3.1605          |
| 3.0947        | 4.6591  | 11000 | 3.1276          |
| 3.024         | 5.0826  | 12000 | 3.1013          |
| 3.0297        | 5.5062  | 13000 | 3.0816          |
| 3.0068        | 5.9298  | 14000 | 3.0617          |
| 2.9613        | 6.3533  | 15000 | 3.0500          |
| 2.9614        | 6.7769  | 16000 | 3.0341          |
| 2.9233        | 7.2004  | 17000 | 3.0316          |
| 2.9264        | 7.6240  | 18000 | 3.0157          |
| 2.8709        | 8.0474  | 19000 | 3.0090          |
| 2.8893        | 8.4710  | 20000 | 3.0024          |
| 2.8951        | 8.8946  | 21000 | 2.9894          |
| 2.8582        | 9.3181  | 22000 | 2.9951          |
| 2.8656        | 9.7417  | 23000 | 2.9820          |
| 2.8161        | 10.1652 | 24000 | 2.9869          |
| 2.8393        | 10.5888 | 25000 | 2.9735          |
| 2.8198        | 11.0123 | 26000 | 2.9739          |
| 2.8067        | 11.4359 | 27000 | 2.9667          |
| 2.8062        | 11.8595 | 28000 | 2.9608          |
| 2.7781        | 12.2830 | 29000 | 2.9631          |
| 2.7868        | 12.7066 | 30000 | 2.9567          |
| 2.7511        | 13.1300 | 31000 | 2.9622          |
| 2.7625        | 13.5536 | 32000 | 2.9537          |
| 2.7698        | 13.9772 | 33000 | 2.9473          |
| 2.7375        | 14.4007 | 34000 | 2.9522          |
| 2.7515        | 14.8243 | 35000 | 2.9454          |
| 2.7198        | 15.2478 | 36000 | 2.9490          |
| 2.7254        | 15.6714 | 37000 | 2.9462          |
| 2.7019        | 16.0949 | 38000 | 2.9481          |
| 2.7064        | 16.5185 | 39000 | 2.9441          |
| 2.7156        | 16.9421 | 40000 | 2.9404          |
| 2.6925        | 17.3656 | 41000 | 2.9435          |
| 2.6967        | 17.7892 | 42000 | 2.9396          |
| 2.6798        | 18.2126 | 43000 | 2.9414          |
| 2.6831        | 18.6362 | 44000 | 2.9400          |
| 2.6751        | 19.0597 | 45000 | 2.9399          |
| 2.6752        | 19.4833 | 46000 | 2.9398          |
| 2.6779        | 19.9069 | 47000 | 2.9395          |
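
The linear scheduler with lr_scheduler_warmup_ratio 0.05 configured above ramps the learning rate from 0 up to the peak of 5e-06 over the first 5% of optimizer steps, then decays it linearly to 0. A minimal sketch of that schedule (the step counts are illustrative, not taken from the run):

```python
def linear_lr(step, total_steps, peak_lr=5e-6, warmup_ratio=0.05):
    """Linear warmup to peak_lr over warmup_ratio of training, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)

total = 47000  # illustrative total step count
print(linear_lr(int(total * 0.05), total))  # peak lr at the end of warmup: 5e-06
print(linear_lr(total, total))              # decayed to 0.0 at the final step
```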

Framework versions

  • Transformers 4.54.0
  • PyTorch 2.10.0+cu128
  • Datasets 3.2.0
  • Tokenizers 0.21.4

Model size

  • 97.8M parameters
  • Tensor type: F16 (Safetensors)