"""Data preparation and streaming for ARB pre-training.

Each module exposes:
    stream(split, shuffle, ...) -> Iterable[dict]
        Yields batches of pre-processed modality data.
    num_samples(split) -> int
        Total available samples for epoch scheduling.
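
Illustrative sketch of the per-module interface described above. This is a
minimal example under stated assumptions, not the package's implementation:
`ToyStream`, its in-memory data, the `batch_size` and `seed` arguments, and
the "input" batch key are all hypothetical, and the real stream classes may
take different arguments.

```python
import random
from typing import Dict, Iterable, List


class ToyStream:
    # Hypothetical stand-in for a modality stream; not one of the real classes.
    def __init__(self, batch_size: int = 2, seed: int = 0) -> None:
        self.batch_size = batch_size
        self.seed = seed
        # Tiny in-memory "dataset", keyed by split name (assumed split names).
        self._data: Dict[str, List[int]] = {
            "train": list(range(6)),
            "validation": list(range(2)),
        }

    def num_samples(self, split: str) -> int:
        # Total available samples, usable for epoch scheduling.
        return len(self._data[split])

    def stream(self, split: str, shuffle: bool = False) -> Iterable[dict]:
        # Yield batches as dicts; the "input" key is an assumption.
        items = list(self._data[split])
        if shuffle:
            # Seeded shuffle so the order is reproducible across runs.
            random.Random(self.seed).shuffle(items)
        for start in range(0, len(items), self.batch_size):
            yield {"input": items[start:start + self.batch_size]}


toy = ToyStream(batch_size=2)
# num_samples() lets a scheduler derive the number of steps in one epoch.
steps_per_epoch = toy.num_samples("train") // toy.batch_size
batches = list(toy.stream("train", shuffle=True))
```

Here `num_samples` is used to compute steps per epoch before iterating, which
is the scheduling role the interface description assigns to it.
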

Dataset references:
    Text:  https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
    Code:  https://huggingface.co/datasets/bigcode/starcoderdata
    Image: https://huggingface.co/datasets/opendiffusionai/cc12m-4mp-realistic
    Audio: https://huggingface.co/datasets/openslr/librispeech_asr
    Video: https://huggingface.co/datasets/TempoFunk/webvid-10M
"""
from .prepare_fineweb import FineWebStream, FineWebConfig
from .prepare_starcoder import StarCoderStream, StarCoderConfig
from .prepare_cc12m import CC12MStream, CC12MConfig
from .prepare_librispeech import LibriSpeechStream, LibriSpeechConfig
from .prepare_webvid import WebVidStream, WebVidConfig