roshbeed committed on
Commit 26271a8 · verified · 1 Parent(s): d6eabb7

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +42 -0
README.md ADDED
@@ -0,0 +1,42 @@
+ # Text8 Dataset
+
+ This repository contains the Text8 dataset, the first 100 MB of a cleaned English Wikipedia dump, commonly used for training word embeddings and language models.
+
+ ## Dataset Information
+
+ - **Source**: http://mattmahoney.net/dc/text8.zip
+ - **License**: Public domain
+ - **Format**: Plain text corpus, plus JSON files for the sentence split and metadata
+ - **Size**: ~100 MB (the original `text8` file is 10^8 bytes)
+
+ ## Files
+
+ - `text8_full.txt`: Complete text8 corpus
+ - `text8_sentences.json`: Text8 split into sentences for easier processing
+ - `dataset_info.json`: Dataset metadata
+
+ ## Usage
+
+ You can load this dataset in your training scripts using:
+
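+ ```python
+ import json
+ import os
+
+ from huggingface_hub import hf_hub_download
+
+ # Download the sentence file (cached locally after the first call)
+ sentences_path = hf_hub_download(
+     repo_id="roshbeed/text8-dataset",
+     filename="text8_sentences.json",
+     # A token is only needed for private or gated repos; read it from the
+     # environment instead of hard-coding it (None falls back to the default)
+     token=os.environ.get("HF_TOKEN"),
+ )
+
+ # Parse the JSON file and pull out the list of sentences
+ with open(sentences_path, "r", encoding="utf-8") as f:
+     data = json.load(f)
+ sentences = data["sentences"]
+
+ # Use sentences for training
+ ```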
+
+ ## Citation
+
+ If you use this dataset, please cite the original source, Matt Mahoney's text8 corpus (http://mattmahoney.net/dc/text8.zip).
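
The usage snippet in the README stops at `# Use sentences for training`. A common first step when training word embeddings is to turn the sentences into integer ids, so here is a minimal sketch of that step. It assumes each entry of `sentences` is either a whitespace-separated string or a list of tokens; the actual layout of `text8_sentences.json` is not documented in this commit. The helper names `tokenize`, `build_vocab`, and `encode`, the `<unk>` placeholder, and the `min_count` threshold are illustrative choices, not part of the dataset or of `huggingface_hub`.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Union

# Assumption: a sentence is a plain string or an already tokenized list.
Sentence = Union[str, Sequence[str]]


def tokenize(sentence: Sentence) -> List[str]:
    """Return the tokens of one sentence, splitting strings on whitespace."""
    if isinstance(sentence, str):
        return sentence.split()
    return list(sentence)


def build_vocab(sentences: Iterable[Sentence], min_count: int = 5) -> Dict[str, int]:
    """Map every word seen at least `min_count` times to an integer id.

    Id 0 is reserved for the hypothetical "<unk>" token; the remaining ids
    follow decreasing frequency, with ties broken alphabetically.
    """
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))

    kept = sorted(
        (word for word, n in counts.items() if n >= min_count and word != "<unk>"),
        key=lambda word: (-counts[word], word),
    )

    vocab = {"<unk>": 0}
    for word in kept:
        vocab[word] = len(vocab)
    return vocab


def encode(sentence: Sentence, vocab: Dict[str, int]) -> List[int]:
    """Convert a sentence to ids, mapping out-of-vocabulary words to "<unk>"."""
    unk = vocab["<unk>"]
    return [vocab.get(token, unk) for token in tokenize(sentence)]


if __name__ == "__main__":
    # Tiny stand-in for the `sentences` list loaded in the README snippet.
    demo = ["the cat sat on the mat", ["the", "dog", "sat"]]
    vocab = build_vocab(demo, min_count=2)
    print(vocab)                          # {'<unk>': 0, 'the': 1, 'sat': 2}
    print(encode("the bird sat", vocab))  # [1, 0, 2]
```

With the real data, `build_vocab(sentences)` would replace the `demo` list. The default `min_count=5` is only a starting point; a lower threshold keeps more rare words at the cost of a larger vocabulary.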