"""Data preparation and streaming for ARB pre-training.

Each module exposes:
    stream(split, shuffle, ...) -> Iterable[dict]
        Yields batches of pre-processed modality data.
    num_samples(split) -> int
        Total number of samples available in the split, used for epoch scheduling.
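
Example (a minimal sketch, not this package's implementation): the two-call
interface above could be satisfied by something like the toy class below.
`ToyStream`, its constructor arguments, and the batch dict keys are
hypothetical names chosen for illustration only.

```python
# Hypothetical toy stream; every name here is an illustrative assumption
# and none of it is part of this package.
import random
from typing import Dict, Iterator, List


class ToyStream:
    # In-memory stand-in for a real dataset, keyed by split name.
    def __init__(self, data: Dict[str, List[int]], batch_size: int = 2, seed: int = 0):
        self.data = data
        self.batch_size = batch_size
        self.seed = seed

    def num_samples(self, split: str) -> int:
        # Total samples in the split, used to size an epoch.
        return len(self.data[split])

    def stream(self, split: str, shuffle: bool = False) -> Iterator[dict]:
        # Yield dict batches; the final batch may be smaller than batch_size.
        items = list(self.data[split])
        if shuffle:
            # Seeded shuffle so repeated runs see the same order.
            random.Random(self.seed).shuffle(items)
        for start in range(0, len(items), self.batch_size):
            yield {"split": split, "samples": items[start:start + self.batch_size]}


toy = ToyStream({"train": [1, 2, 3, 4, 5], "val": [6, 7]})
batches = list(toy.stream("train"))
# Ceiling division: number of batches needed to cover the split once.
steps_per_epoch = -(-toy.num_samples("train") // toy.batch_size)
```

In this sketch `stream` drives iteration and `num_samples` sizes the epoch:
five training samples at a batch size of 2 give three steps per epoch.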

Dataset references:
    Text:  https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
    Code:  https://huggingface.co/datasets/bigcode/starcoderdata
    Image: https://huggingface.co/datasets/opendiffusionai/cc12m-4mp-realistic
    Audio: https://huggingface.co/datasets/openslr/librispeech_asr
    Video: https://huggingface.co/datasets/TempoFunk/webvid-10M
"""
from .prepare_fineweb import FineWebStream, FineWebConfig
from .prepare_starcoder import StarCoderStream, StarCoderConfig
from .prepare_cc12m import CC12MStream, CC12MConfig
from .prepare_librispeech import LibriSpeechStream, LibriSpeechConfig
from .prepare_webvid import WebVidStream, WebVidConfig