## MegaScale dataset example
Here is a working example from the MegaScale dataset.
Process the data through your scripts and store each split as a separate `.parquet` file:

```
intermediate/<dataset_id>_<split_id>.parquet
```

Use the `datasets` package to load the local dataset into memory. See below for more examples of how to load different types of datasets.
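The naming convention for split files can be sketched with a small helper (the `dataset3` tag is a hypothetical placeholder matching the example below):

```python
from pathlib import Path

def split_path(dataset_id: str, split_id: str) -> Path:
    """Path of one split file under the intermediate/ directory."""
    return Path("intermediate") / f"{dataset_id}_{split_id}.parquet"

# Hypothetical tag; the processing scripts write one file per split.
for split in ("train", "validation", "test"):
    print(split_path("dataset3", split))
```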
```python
import datasets

dataset_tag = "dataset3"
cache_dir = "path/to/scratch"

dataset = datasets.load_dataset(
    "parquet",
    name=dataset_tag,
    data_dir="./intermediate",
    data_files={
        "train": f"{dataset_tag}_train.parquet",
        "validation": f"{dataset_tag}_validation.parquet",
        "test": f"{dataset_tag}_test.parquet",
    },
    cache_dir=cache_dir,
    keep_in_memory=True,
)
```

Set up Personal Access Keys on HuggingFace (see above).
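One way to register the access token locally is via the Hub CLI (a sketch; assumes `huggingface_hub` is installed and `HF_TOKEN` holds a token created under Settings → Access Tokens):

```shell
# Store the token so that push_to_hub can authenticate.
huggingface-cli login --token "$HF_TOKEN"
```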
Use the `datasets` package to push the dataset to the Hub, for example:
```python
repo_id = "RosettaCommons/MegaScale"

dataset.push_to_hub(
    repo_id=repo_id,
    config_name=dataset_tag,
    data_dir=f"{dataset_tag}/data",
    commit_message=f"Upload {dataset_tag}",
)
```

This produces the following layout on HuggingFace:
```
https://huggingface.co/datasets/{repo_id}/tree/main/{dataset_tag}/data/
{split}-<split-chunk-index>-of-<n-split-chunks>.parquet
```
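The shard file names above follow a simple pattern that can be sketched as follows (the five-digit zero padding is an assumption based on typical Hub uploads, not something the example above guarantees):

```python
def shard_name(split: str, index: int, n_shards: int) -> str:
    # e.g. "train-00000-of-00002.parquet"; zero-padded width is assumed.
    return f"{split}-{index:05d}-of-{n_shards:05d}.parquet"

print(shard_name("train", 0, 2))
```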