roshbeed/text8-dataset

roshbeed committed on Jun 21, 2025

Commit 26271a8 (verified), 1 parent: d6eabb7

Upload README.md with huggingface_hub

Files changed (1)

README.md ADDED

+# Text8 Dataset
+This repository contains the Text8 dataset, the first 100 MB of a cleaned English Wikipedia dump, commonly used for training word embeddings and language models.
+## Dataset Information
+- **Source**: http://mattmahoney.net/dc/text8.zip
+- **License**: Public domain
+- **Format**: Plain text
+- **Size**: ~100 MB uncompressed
+## Files
+- `text8_full.txt`: Complete text8 corpus
+- `text8_sentences.json`: Text8 split into sentences for easier processing
+- `dataset_info.json`: Dataset metadata
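Because `text8_full.txt` holds the whole corpus in one file, it can be convenient to walk through it in fixed-size word chunks rather than as one long string. The following is a minimal sketch, assuming the file is whitespace-separated words as in the original text8; the helper name `iter_word_chunks` and its parameters are hypothetical and not part of this repository.

```python
def iter_word_chunks(path, chunk_words=1000, buffer_chars=1 << 20):
    """Yield lists of up to `chunk_words` words from a whitespace-separated text file."""
    pending = []    # complete words waiting to be emitted
    leftover = ""   # partial word cut off at the end of the previous read
    with open(path, "r", encoding="utf-8") as f:
        while True:
            block = f.read(buffer_chars)
            if not block:
                break
            block = leftover + block
            words = block.split()
            # If the block ends mid-word, hold the last piece back for the next read.
            if words and not block[-1].isspace():
                leftover = words.pop()
            else:
                leftover = ""
            pending.extend(words)
            while len(pending) >= chunk_words:
                yield pending[:chunk_words]
                pending = pending[chunk_words:]
    # Flush whatever remains once the file is exhausted.
    if leftover:
        pending.append(leftover)
    if pending:
        yield pending
```

The `path` argument would be the local path of `text8_full.txt`, for example the value returned by `hf_hub_download` for that filename. Only the final chunk can be shorter than `chunk_words`.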
+## Usage
+You can load this dataset in your training scripts using:
+```python
+from huggingface_hub import hf_hub_download
+import json
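+# Download the sentence file (cached locally after the first call)
+sentences_path = hf_hub_download(
+    repo_id="roshbeed/text8-dataset",
+    filename="text8_sentences.json",
+    token="your_token",  # placeholder; only needed for private or gated repos
+)
+# Read the JSON explicitly as UTF-8 so the result does not depend on the platform default
+with open(sentences_path, "r", encoding="utf-8") as f:
+    data = json.load(f)
+sentences = data["sentences"]
+# Use sentences for training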
+```
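As one possible next step for word-embedding training, the loaded sentences can be turned into integer ids. This is a minimal sketch, assuming `sentences` is a list of plain strings with space-separated words; the helpers `build_vocab` and `encode` and the `min_count` threshold are hypothetical, and the toy list stands in for the real data.

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    """Map each word seen at least `min_count` times to an integer id."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # Most frequent words get the smallest ids; ties are broken alphabetically.
    kept = sorted(
        (w for w, c in counts.items() if c >= min_count),
        key=lambda w: (-counts[w], w),
    )
    return {word: idx for idx, word in enumerate(kept)}

def encode(sentence, vocab):
    """Convert a sentence to a list of ids, skipping out-of-vocabulary words."""
    return [vocab[w] for w in sentence.split() if w in vocab]

# Toy stand-in for the real `sentences` list (assumed to hold plain strings).
toy_sentences = ["the cat sat", "the dog sat", "the end"]
vocab = build_vocab(toy_sentences, min_count=2)
encoded = encode("the cat sat", vocab)
```

Dropping rare words keeps the vocabulary small; whether to skip them or map them to a shared unknown token depends on the model being trained.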
+## Citation
+If you use this dataset, please cite the original source listed above (http://mattmahoney.net/dc/text8.zip).