# opt-babylm1_seed-211_5e-6
This model was trained from scratch on an unknown dataset. It achieves the following results on the evaluation set:
- Loss: 2.9352
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-06
- train_batch_size: 32
- eval_batch_size: 64
- seed: 211
- gradient_accumulation_steps: 8
- total_train_batch_size: 256
- optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 20.0
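A brief sketch (not the training script itself) of how the effective batch size listed above follows from the other hyperparameters: the per-device batch size is multiplied by the number of gradient-accumulation steps before each optimizer update.

```python
# Hyperparameters as listed in this card.
train_batch_size = 32          # per-device batch size
gradient_accumulation_steps = 8

# Gradients are accumulated over 8 micro-batches before each update,
# so the effective (total) training batch size is their product.
total_train_batch_size = train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)  # 256, matching total_train_batch_size above
```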
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 5.2657 | 0.4206 | 1000 | 5.2429 |
| 4.7058 | 0.8413 | 2000 | 4.7120 |
| 4.2262 | 1.2616 | 3000 | 4.2310 |
| 3.8735 | 1.6823 | 4000 | 3.8682 |
| 3.5661 | 2.1026 | 5000 | 3.5738 |
| 3.4174 | 2.5233 | 6000 | 3.4080 |
| 3.3118 | 2.9439 | 7000 | 3.3058 |
| 3.2284 | 3.3643 | 8000 | 3.2433 |
| 3.18 | 3.7849 | 9000 | 3.1947 |
| 3.1179 | 4.2053 | 10000 | 3.1561 |
| 3.0937 | 4.6259 | 11000 | 3.1282 |
| 3.0223 | 5.0463 | 12000 | 3.1030 |
| 3.0283 | 5.4669 | 13000 | 3.0828 |
| 3.0162 | 5.8875 | 14000 | 3.0614 |
| 2.9568 | 6.3079 | 15000 | 3.0483 |
| 2.965 | 6.7285 | 16000 | 3.0331 |
| 2.9084 | 7.1489 | 17000 | 3.0258 |
| 2.9168 | 7.5695 | 18000 | 3.0127 |
| 2.9277 | 7.9902 | 19000 | 3.0023 |
| 2.8789 | 8.4105 | 20000 | 2.9999 |
| 2.887 | 8.8312 | 21000 | 2.9910 |
| 2.85 | 9.2515 | 22000 | 2.9881 |
| 2.8541 | 9.6722 | 23000 | 2.9817 |
| 2.8065 | 10.0925 | 24000 | 2.9763 |
| 2.827 | 10.5132 | 25000 | 2.9726 |
| 2.8302 | 10.9338 | 26000 | 2.9610 |
| 2.7886 | 11.3542 | 27000 | 2.9628 |
| 2.8112 | 11.7748 | 28000 | 2.9584 |
| 2.7724 | 12.1952 | 29000 | 2.9601 |
| 2.7749 | 12.6158 | 30000 | 2.9572 |
| 2.7427 | 13.0362 | 31000 | 2.9548 |
| 2.7541 | 13.4568 | 32000 | 2.9505 |
| 2.763 | 13.8774 | 33000 | 2.9443 |
| 2.7283 | 14.2978 | 34000 | 2.9480 |
| 2.7383 | 14.7184 | 35000 | 2.9427 |
| 2.7111 | 15.1388 | 36000 | 2.9461 |
| 2.7177 | 15.5594 | 37000 | 2.9415 |
| 2.7221 | 15.9801 | 38000 | 2.9372 |
| 2.6956 | 16.4004 | 39000 | 2.9407 |
| 2.7097 | 16.8211 | 40000 | 2.9359 |
| 2.6858 | 17.2414 | 41000 | 2.9396 |
| 2.686 | 17.6621 | 42000 | 2.9358 |
| 2.6683 | 18.0824 | 43000 | 2.9378 |
| 2.6763 | 18.5031 | 44000 | 2.9361 |
| 2.6785 | 18.9237 | 45000 | 2.9346 |
| 2.6581 | 19.3441 | 46000 | 2.9354 |
| 2.6601 | 19.7647 | 47000 | 2.9352 |
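For readers who prefer perplexity, the final validation loss in the table converts directly, assuming (as is the Transformers default for causal LMs) that the reported loss is mean cross-entropy in nats:

```python
import math

# Final validation loss from the table above.
final_loss = 2.9352

# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = math.exp(final_loss)
print(round(perplexity, 1))  # ~18.8
```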
### Framework versions
- Transformers 4.54.0
- Pytorch 2.10.0+cu128
- Datasets 3.2.0
- Tokenizers 0.21.4