Thank you for your contribution
Hi,
I just want to thank you for your excellent work and your valuable contribution to the community.
Thanks! We hope you find them useful. Please don't hesitate to reach out if you have any more questions.
How much time does it take to train a model like this? How many A100s or H100s are needed for continual training, considering 50GB of legal documents?
It will mainly depend on how many tokens you extract from those 50GB (say ~10B tokens) and on your setup (e.g., mlm_probability, data packing, hardware efficiency, pre-training code).
As a reference point: on a single node with 4xH100 of 64GB, we observed a throughput of ~450,000 tokens/sec with mlm_probability=0.3 (lower values will train faster). That translates to roughly ~6 hours per epoch over the dataset. On a single GPU, expect training to take about 4x longer.
In practice, you don't need a large cluster for this; a single A100/H100 is sufficient for that dataset.
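In case it helps, here is a rough back-of-envelope sketch of the estimate above. It only uses the numbers from this thread (~10B tokens, ~450k tokens/sec on a 4xH100 node); the epoch count is a hypothetical placeholder you would set for your own run.

```python
# Back-of-envelope wall-clock estimate for continual pre-training.
# Figures taken from this thread; adjust to your own setup.
tokens = 10e9             # ~10B tokens extracted from the 50GB corpus (assumption)
throughput = 450_000      # tokens/sec observed on 4xH100 with mlm_probability=0.3
epochs = 3                # hypothetical number of passes over the data

seconds_per_epoch = tokens / throughput
hours_per_epoch = seconds_per_epoch / 3600

print(f"~{hours_per_epoch:.1f} h/epoch on 4xH100")          # ~6.2 h/epoch
print(f"~{hours_per_epoch * epochs:.1f} h total for {epochs} epochs")
print(f"~{hours_per_epoch * 4:.1f} h/epoch on a single GPU") # roughly 4x slower
```

Scaling the throughput number to your own hardware and mlm_probability should give you a reasonable first estimate before launching a full run.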