# opt-babylm1_seed-42_1e-5

This model was trained from scratch on an unknown dataset. It achieves the following results on the evaluation set:

- Loss: 3.4634

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 1e-05
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 256
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 20.0
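The schedule these settings imply can be sketched in plain Python. This is an illustrative sketch, not the Trainer's implementation: the total step count of 47,500 is an assumption read off the training log below (which ends near step 47,000 at epoch ~19.76); the exact value depends on dataset size.

```python
# Sketch of a linear schedule with 5% warmup, as configured above
# (learning_rate=1e-05, lr_scheduler_warmup_ratio=0.05).
TOTAL_STEPS = 47_500                      # assumption, inferred from the log
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)    # 5% of training
PEAK_LR = 1e-05

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)

# Effective batch size: per-device batch * gradient accumulation steps.
effective_batch = 32 * 8
print(effective_batch)  # 256, matching total_train_batch_size
```

The effective batch size of 256 explains why 20 epochs span only ~47k optimizer steps despite the small per-device batch.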

### Training results

| Training Loss | Epoch   | Step  | Validation Loss |
|:-------------:|:-------:|:-----:|:---------------:|
| 5.3691        | 0.4206  | 1000  | 5.3019          |
| 9.4778        | 0.8413  | 2000  | 7.7662          |
| 11.7536       | 1.2616  | 3000  | 10.9107         |
| 6.2443        | 1.6823  | 4000  | 5.9901          |
| 5.0938        | 2.1026  | 5000  | 5.0813          |
| 4.951         | 2.5233  | 6000  | 4.9423          |
| 4.8344        | 2.9439  | 7000  | 4.8479          |
| 4.7846        | 3.3643  | 8000  | 4.7772          |
| 4.7222        | 3.7849  | 9000  | 4.7218          |
| 4.6755        | 4.2053  | 10000 | 4.6585          |
| 4.621         | 4.6259  | 11000 | 4.5979          |
| 4.5495        | 5.0463  | 12000 | 4.5372          |
| 4.4998        | 5.4669  | 13000 | 4.4716          |
| 4.4136        | 5.8875  | 14000 | 4.4219          |
| 4.3616        | 6.3079  | 15000 | 4.3583          |
| 4.2967        | 6.7285  | 16000 | 4.2875          |
| 4.2367        | 7.1489  | 17000 | 4.2283          |
| 4.1903        | 7.5695  | 18000 | 4.1773          |
| 4.1288        | 7.9902  | 19000 | 4.1198          |
| 4.09          | 8.4105  | 20000 | 4.0740          |
| 4.042         | 8.8312  | 21000 | 4.0270          |
| 3.9811        | 9.2515  | 22000 | 3.9675          |
| 3.924         | 9.6722  | 23000 | 3.9094          |
| 3.8737        | 10.0925 | 24000 | 3.8639          |
| 3.833         | 10.5132 | 25000 | 3.8097          |
| 3.7976        | 10.9338 | 26000 | 3.7833          |
| 3.7641        | 11.3542 | 27000 | 3.7449          |
| 3.7239        | 11.7748 | 28000 | 3.7115          |
| 3.7045        | 12.1952 | 29000 | 3.6885          |
| 3.6655        | 12.6158 | 30000 | 3.6515          |
| 3.637         | 13.0362 | 31000 | 3.6258          |
| 3.6274        | 13.4568 | 32000 | 3.6069          |
| 3.5978        | 13.8774 | 33000 | 3.5860          |
| 3.5794        | 14.2978 | 34000 | 3.5686          |
| 3.5657        | 14.7184 | 35000 | 3.5510          |
| 3.5505        | 15.1388 | 36000 | 3.5370          |
| 3.5407        | 15.5594 | 37000 | 3.5238          |
| 3.5272        | 15.9801 | 38000 | 3.5150          |
| 3.4993        | 16.4004 | 39000 | 3.5013          |
| 3.5021        | 16.8211 | 40000 | 3.4901          |
| 3.48          | 17.2414 | 41000 | 3.4835          |
| 3.4798        | 17.6621 | 42000 | 3.4765          |
| 3.4723        | 18.0824 | 43000 | 3.4723          |
| 3.4706        | 18.5031 | 44000 | 3.4689          |
| 3.4776        | 18.9237 | 45000 | 3.4657          |
| 3.4685        | 19.3441 | 46000 | 3.4641          |
| 3.4625        | 19.7647 | 47000 | 3.4634          |
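Since the validation loss is an average token-level cross-entropy in nats, a perplexity can be derived as exp(loss). This conversion is a standard one, not a figure reported on the card itself:

```python
import math

# Final validation loss from the last row of the table above.
final_val_loss = 3.4634
perplexity = math.exp(final_val_loss)
print(f"{perplexity:.1f}")  # ~31.9
```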

### Framework versions

- Transformers 4.54.0
- PyTorch 2.10.0+cu128
- Datasets 3.2.0
- Tokenizers 0.21.4

Model size: 97.8M parameters (F16, Safetensors)