"""Data preparation and streaming for ARB pre-training.

Each module exposes:
    stream(split, shuffle, ...) -> Iterable[dict]
        Yields batches of pre-processed modality data.
    num_samples(split) -> int
        Total available samples for epoch scheduling.

Dataset references:
    Text:  https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
    Code:  https://huggingface.co/datasets/bigcode/starcoderdata
    Image: https://huggingface.co/datasets/opendiffusionai/cc12m-4mp-realistic
    Audio: https://huggingface.co/datasets/openslr/librispeech_asr
    Video: https://huggingface.co/datasets/TempoFunk/webvid-10M
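
Example (a minimal sketch, not the actual implementation): the two entry
points listed above, written as a typing.Protocol plus a toy in-memory
stream. ModalityStream, ToyStream, batch_size, and seed are hypothetical
names used only for illustration, and placing stream() and num_samples()
on a class rather than at module level is an assumption.

```python
import random
from typing import Iterable, Iterator, Protocol, runtime_checkable


@runtime_checkable
class ModalityStream(Protocol):
    # Hypothetical protocol mirroring the interface described above.
    def stream(self, split: str, shuffle: bool = False) -> Iterable[dict]: ...
    def num_samples(self, split: str) -> int: ...


class ToyStream:
    # Hypothetical in-memory stand-in for a real dataset stream.
    def __init__(self, data: dict, batch_size: int = 2, seed: int = 0):
        self.data = data              # maps split name -> list of samples
        self.batch_size = batch_size
        self.seed = seed

    def stream(self, split: str, shuffle: bool = False) -> Iterator[dict]:
        samples = list(self.data[split])
        if shuffle:
            # Seeded shuffle so repeated runs see the same order.
            random.Random(self.seed).shuffle(samples)
        # Yield fixed-size batches; the last batch may be shorter.
        for i in range(0, len(samples), self.batch_size):
            yield {"split": split, "samples": samples[i:i + self.batch_size]}

    def num_samples(self, split: str) -> int:
        return len(self.data[split])


toy = ToyStream({"train": ["a", "b", "c", "d", "e"]})
batches = list(toy.stream("train"))
```

A training loop could use num_samples(split) to work out how many batches
make up one epoch and then iterate stream(split, shuffle=True).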
"""

from .prepare_fineweb import FineWebStream, FineWebConfig
from .prepare_starcoder import StarCoderStream, StarCoderConfig
from .prepare_cc12m import CC12MStream, CC12MConfig
from .prepare_librispeech import LibriSpeechStream, LibriSpeechConfig
from .prepare_webvid import WebVidStream, WebVidConfig