"""Data preparation and streaming for ARB pre-training. Each module exposes: stream(split, shuffle, ...) -> Iterable[dict] Yields batches of pre-processed modality data. num_samples(split) -> int Total available samples for epoch scheduling. Dataset references: Text: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu Code: https://huggingface.co/datasets/bigcode/starcoderdata Image: https://huggingface.co/datasets/opendiffusionai/cc12m-4mp-realistic Audio: https://huggingface.co/datasets/openslr/librispeech_asr Video: https://huggingface.co/datasets/TempoFunk/webvid-10M """ from .prepare_fineweb import FineWebStream, FineWebConfig from .prepare_starcoder import StarCoderStream, StarCoderConfig from .prepare_cc12m import CC12MStream, CC12MConfig from .prepare_librispeech import LibriSpeechStream, LibriSpeechConfig from .prepare_webvid import WebVidStream, WebVidConfig