Update README.md
README.md
CHANGED
@@ -499,9 +499,9 @@ Tenete-8M was trained on an **RTX 2060 6GB** for one epoch with a batch size of
 
 The dataset encompasses **577M tokens**, and includes **4 sources**:
 
-1.
-2. **Medium Articles** (960MB): While web data, especially medium articles, is noisy, we still need human-written examples
-3. **Books** (284MB): Albeit small, books are still needed to instill creativity into the model
+1. [Textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-textbooks) (1.2GB): Web data is too noisy, so we decided to use Tiny-Textbooks, a synthetic dataset generated by [Nous-Hermes-Llama2-13b](https://huggingface.co/NousResearch/Nous-Hermes-Llama2-13b)
+2. [**Medium Articles**](https://huggingface.co/datasets/fabiochiu/medium-articles) (960MB): While web data, especially medium articles, is noisy, we still need human-written examples
+3. [**Books**](https://huggingface.co/datasets/kmfoda/booksum)** (284MB): Albeit small, books are still needed to instill creativity into the model
 4. **Q&A** (14MB): just sprinkled in, just to add more knowledge and question-answering.
 
 We chose to not include code, raw webdata (e.g., fineweb, c4, etc.), and more narrow domains (e.g., arxiv, clinical trials, lesswrong, etc.).