The Synthetic Data Playbook: Generating Trillions of the Finest Tokens. Explore synthetic data experiments on a virtual bookshelf.
🤏 Smol-Data Collection: Tried and tested mixes for strong pretraining. Inspired by https://huggingface.co/blog/codelion/optimal-dataset-mixing • 14 items • Updated Mar 2