
MegaScale dataset example

Here is a working example from the MegaScale dataset.

  1. Process the data through scripts and store each split as a separate `.parquet` file:

    intermediate/<dataset_id>_<split_id>.parquet

  2. Use the `datasets` package to load the local dataset into memory. See below for more examples of how to load different types of datasets.

    import datasets

    dataset_tag = "dataset3"
    cache_dir = "path/to/scratch"

    dataset = datasets.load_dataset(
        "parquet",
        name = dataset_tag,
        data_dir = "./intermediate",
        data_files = {
            "train" : f"{dataset_tag}_train.parquet",
            "validation" : f"{dataset_tag}_validation.parquet",
            "test" : f"{dataset_tag}_test.parquet"},
        cache_dir = cache_dir,
        keep_in_memory = True)
    
  3. Set up Personal Access Keys on HuggingFace (see above)
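One way to make the token available to the `datasets` and `huggingface_hub` libraries, assuming a Personal Access Token has already been created in the HuggingFace account settings (the placeholder `hf_xxx` is not a real token; the environment-variable approach is one of several supported options):

```shell
# Export the token so the HuggingFace libraries pick it up automatically.
# Replace hf_xxx with the actual Personal Access Token.
export HF_TOKEN=hf_xxx

# Alternatively, log in interactively and persist the token locally:
huggingface-cli login
```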

  4. Use the `datasets` package to push the dataset to the hub, for example:

    repo_id = "RosettaCommons/MegaScale"

    dataset.push_to_hub(
        repo_id = repo_id,
        config_name = dataset_tag,
        data_dir = f"{dataset_tag}/data",
        commit_message = f"Upload {dataset_tag}")
    
  5. This produces the following layout on HuggingFace:

    https://huggingface.co/datasets/{repo_id}/tree/main/{dataset_tag}/data/
    {split}-<split-chunk-index>-of-<n-split-chunks>.parquet