A collection of datasets processed from CommonCrawl data as part of the BigBanyanTree initiative. Each dataset is extracted from a random 1% sample of the CommonCrawl data.