Harley-ml committed
Commit aff2749 · verified · 1 Parent(s): dac471b

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -499,10 +499,10 @@ Tenete-8M was trained on an **RTX 2060 6GB** for one epoch with a batch size of

  The dataset encompasses **577M tokens**, and includes **4 sources**:

- 1. **Textbooks**: Web data is too noisy, so we decided to use Tiny-Textbooks, a synthetic dataset generated by
- 2. **Medium Articles**: While web data, especially medium articles, is noisy, we still need human-written examples
- 3. **Books**: Albeit small, books are still needed to instill creativity into the model
- 4. **Q&A**: just sprinkled in, just to add more knowledge and question-answering.
+ 1. **Textbooks** (1.2GB): Web data is too noisy, so we decided to use Tiny-Textbooks, a synthetic dataset generated by
+ 2. **Medium Articles** (960MB): While web data, especially medium articles, is noisy, we still need human-written examples
+ 3. **Books** (284MB): Albeit small, books are still needed to instill creativity into the model
+ 4. **Q&A** (14MB): just sprinkled in, just to add more knowledge and question-answering.

  We chose to not include code, raw webdata (e.g., fineweb, c4, etc.), and more narrow domains (e.g., arxiv, clinical trials, lesswrong, etc.).
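
The per-source sizes added in this change make it possible to estimate how the mix is weighted. The following is a minimal sketch, not part of the commit: it assumes 1 GB = 1000 MB (the README does not state the convention), treats the listed sizes as raw on-disk text, and uses hypothetical names (`SOURCE_SIZES_MB`, `source_shares`). The bytes-per-token figure is only a rough indication, since file sizes may include formatting overhead.

```python
# On-disk sizes per source as listed in the README diff, in megabytes.
# Assumption: 1 GB = 1000 MB; the README does not say which convention it uses.
SOURCE_SIZES_MB = {
    "Textbooks": 1200,
    "Medium Articles": 960,
    "Books": 284,
    "Q&A": 14,
}

# "577M tokens" as stated in the README.
TOTAL_TOKENS = 577_000_000


def source_shares(sizes_mb):
    """Return each source's fraction of the total size."""
    total = sum(sizes_mb.values())
    return {name: size / total for name, size in sizes_mb.items()}


shares = source_shares(SOURCE_SIZES_MB)
total_mb = sum(SOURCE_SIZES_MB.values())

# Rough average bytes of text per token, under the assumptions above.
bytes_per_token = total_mb * 1_000_000 / TOTAL_TOKENS

for name, share in shares.items():
    print(f"{name:16s} {share:6.1%}")
print(f"total: {total_mb} MB, ~{bytes_per_token:.2f} bytes per token")
```

Under these assumptions, Textbooks and Medium Articles together account for most of the data by size, while Q&A is well under one percent.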