| --- |
| pretty_name: mC4 |
| annotations_creators: |
| - no-annotation |
| language_creators: |
| - found |
| languages: |
| - af |
| - am |
| - ar |
| - az |
| - be |
| - bg |
| - bg-Latn |
| - bn |
| - ca |
| - ceb |
| - co |
| - cs |
| - cy |
| - da |
| - de |
| - el |
| - el-Latn |
| - en |
| - eo |
| - es |
| - et |
| - eu |
| - fa |
| - fi |
| - fil |
| - fr |
| - fy |
| - ga |
| - gd |
| - gl |
| - gu |
| - ha |
| - haw |
| - hi |
| - hi-Latn |
| - hmn |
| - ht |
| - hu |
| - hy |
| - id |
| - ig |
| - is |
| - it |
| - iw |
| - ja |
| - ja-Latn |
| - jv |
| - ka |
| - kk |
| - km |
| - kn |
| - ko |
| - ku |
| - ky |
| - la |
| - lb |
| - lo |
| - lt |
| - lv |
| - mg |
| - mi |
| - mk |
| - ml |
| - mn |
| - mr |
| - ms |
| - mt |
| - my |
| - ne |
| - nl |
| - "no" |
| - ny |
| - pa |
| - pl |
| - ps |
| - pt |
| - ro |
| - ru |
| - ru-Latn |
| - sd |
| - si |
| - sk |
| - sl |
| - sm |
| - sn |
| - so |
| - sq |
| - sr |
| - st |
| - su |
| - sv |
| - sw |
| - ta |
| - te |
| - tg |
| - th |
| - tr |
| - uk |
| - und |
| - ur |
| - uz |
| - vi |
| - xh |
| - yi |
| - yo |
| - zh |
| - zh-Latn |
| - zu |
| licenses: |
| - odc-by-1.0 |
| multilinguality: |
| - multilingual |
| size_categories: |
| - n<1K |
| - 1K<n<10K |
| - 10K<n<100K |
| - 100K<n<1M |
| - 1M<n<10M |
| - 10M<n<100M |
| - 100M<n<1B |
| - 1B<n<10B |
| source_datasets: |
| - original |
| task_categories: |
| - sequence-modeling |
| task_ids: |
| - language-modeling |
| paperswithcode_id: mc4 |
| --- |
| |
| # Dataset Card for mC4 |
|
|
| ## Table of Contents |
|
|
| - [Dataset Card for mC4](#dataset-card-for-mc4) |
| - [Table of Contents](#table-of-contents) |
| - [Dataset Description](#dataset-description) |
| - [Dataset Summary](#dataset-summary) |
| - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) |
| - [Languages](#languages) |
| - [Dataset Structure](#dataset-structure) |
| - [Data Instances](#data-instances) |
| - [Data Fields](#data-fields) |
| - [Data Splits](#data-splits) |
| - [Dataset Creation](#dataset-creation) |
| - [Curation Rationale](#curation-rationale) |
| - [Source Data](#source-data) |
| - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization) |
| - [Who are the source language producers?](#who-are-the-source-language-producers) |
| - [Annotations](#annotations) |
| - [Annotation process](#annotation-process) |
| - [Who are the annotators?](#who-are-the-annotators) |
| - [Personal and Sensitive Information](#personal-and-sensitive-information) |
| - [Considerations for Using the Data](#considerations-for-using-the-data) |
| - [Social Impact of Dataset](#social-impact-of-dataset) |
| - [Discussion of Biases](#discussion-of-biases) |
| - [Other Known Limitations](#other-known-limitations) |
| - [Additional Information](#additional-information) |
| - [Dataset Curators](#dataset-curators) |
| - [Licensing Information](#licensing-information) |
| - [Citation Information](#citation-information) |
| - [Contributions](#contributions) |
|
|
| ## Dataset Description |
|
|
| - **Homepage:** https://huggingface.co/datasets/allenai/c4 |
| - **Paper:** https://arxiv.org/abs/1910.10683 |
|
|
| ### Dataset Summary |
|
|
| A colossal, cleaned, multilingual version of Common Crawl's web crawl corpus, based on the Common Crawl dataset: https://commoncrawl.org |
|
|
| This is the version prepared by AllenAI, hosted at this address: https://huggingface.co/datasets/allenai/c4 |
|
|
| 108 languages are available and are reported in the table below. |
|
|
| Note that the languages that end with "-Latn" are simply romanized variants, i.e. written using the Latin script. |
|
|
| | language code | language name | |
| |:----------------|:---------------------| |
| | af | Afrikaans | |
| | am | Amharic | |
| | ar | Arabic | |
| | az | Azerbaijani | |
| | be | Belarusian | |
| | bg | Bulgarian | |
| | bg-Latn | Bulgarian (Latin) | |
| | bn | Bangla | |
| | ca | Catalan | |
| | ceb | Cebuano | |
| | co | Corsican | |
| | cs | Czech | |
| | cy | Welsh | |
| | da | Danish | |
| | de | German | |
| | el | Greek | |
| | el-Latn | Greek (Latin) | |
| | en | English | |
| | eo | Esperanto | |
| | es | Spanish | |
| | et | Estonian | |
| | eu | Basque | |
| | fa | Persian | |
| | fi | Finnish | |
| | fil | Filipino | |
| | fr | French | |
| | fy | Western Frisian | |
| | ga | Irish | |
| | gd | Scottish Gaelic | |
| | gl | Galician | |
| | gu | Gujarati | |
| | ha | Hausa | |
| | haw | Hawaiian | |
| | hi | Hindi | |
| | hi-Latn | Hindi (Latin script) | |
| | hmn | Hmong, Mong | |
| | ht | Haitian | |
| | hu | Hungarian | |
| | hy | Armenian | |
| | id | Indonesian | |
| | ig | Igbo | |
| | is | Icelandic | |
| | it | Italian | |
| iw | Hebrew (former code) | |
| | ja | Japanese | |
| | ja-Latn | Japanese (Latin) | |
| | jv | Javanese | |
| | ka | Georgian | |
| | kk | Kazakh | |
| | km | Khmer | |
| | kn | Kannada | |
| | ko | Korean | |
| | ku | Kurdish | |
| | ky | Kyrgyz | |
| | la | Latin | |
| | lb | Luxembourgish | |
| | lo | Lao | |
| | lt | Lithuanian | |
| | lv | Latvian | |
| | mg | Malagasy | |
| | mi | Maori | |
| | mk | Macedonian | |
| | ml | Malayalam | |
| | mn | Mongolian | |
| | mr | Marathi | |
| | ms | Malay | |
| | mt | Maltese | |
| | my | Burmese | |
| | ne | Nepali | |
| | nl | Dutch | |
| | no | Norwegian | |
| | ny | Nyanja | |
| | pa | Punjabi | |
| | pl | Polish | |
| | ps | Pashto | |
| | pt | Portuguese | |
| | ro | Romanian | |
| | ru | Russian | |
| | ru-Latn | Russian (Latin) | |
| | sd | Sindhi | |
| | si | Sinhala | |
| | sk | Slovak | |
| | sl | Slovenian | |
| sm | Samoan | |
| | sn | Shona | |
| | so | Somali | |
| | sq | Albanian | |
| | sr | Serbian | |
| | st | Southern Sotho | |
| | su | Sundanese | |
| | sv | Swedish | |
| | sw | Swahili | |
| | ta | Tamil | |
| | te | Telugu | |
| | tg | Tajik | |
| | th | Thai | |
| | tr | Turkish | |
| | uk | Ukrainian | |
| | und | Unknown language | |
| | ur | Urdu | |
| | uz | Uzbek | |
| | vi | Vietnamese | |
| | xh | Xhosa | |
| | yi | Yiddish | |
| | yo | Yoruba | |
| | zh | Chinese | |
| | zh-Latn | Chinese (Latin) | |
| | zu | Zulu | |
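
To illustrate the "-Latn" naming convention noted above, here is a minimal sketch that splits a config name into its base language code and optional script suffix. The function `split_config_name` is a hypothetical helper, not part of the `datasets` library, and the short list of codes is copied from the table.

```python
def split_config_name(name):
    """Hypothetical helper: split e.g. 'hi-Latn' into ('hi', 'Latn')."""
    base, _, script = name.partition("-")
    # Config names without a suffix carry no explicit script tag.
    return base, (script or None)


# A few config names copied from the language table.
codes = ["bg", "bg-Latn", "hi", "hi-Latn", "zh", "zh-Latn"]

# Keep only the romanized variants.
romanized = [c for c in codes if split_config_name(c)[1] == "Latn"]
```

Filtering this way may be handy when you want to include or exclude the romanized subsets before passing a list of languages to the loader.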
|
|
| You can load the mC4 subset of any language like this: |
|
|
| ```python |
| from datasets import load_dataset |
| |
| en_mc4 = load_dataset("mc4", "en") |
| ``` |
|
|
| You can even specify a list of languages: |
|
|
| ```python |
| from datasets import load_dataset |
| |
| mc4_subset_with_five_languages = load_dataset("mc4", languages=["en", "fr", "es", "de", "zh"]) |
| ``` |
|
|
| ### Supported Tasks and Leaderboards |
|
|
| mC4 is mainly intended for pretraining language models and word representations. |
|
|
| ### Languages |
|
|
| The dataset supports 108 languages. |
|
|
| ## Dataset Structure |
|
|
| ### Data Instances |
|
|
| An example from the `en` config is: |
|
|
| ``` |
| {'timestamp': '2018-06-24T01:32:39Z', |
| 'text': 'Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)\nThere are 304 resources serving Plumas County in the following categories:\nMap of Beginning Farmer Organizations & Professionals serving Plumas County\nVictoria Fisher - Office Manager - Loyalton, CA\nAmy Lynn Rasband - UCCE Plumas-Sierra Administrative Assistant II - Quincy , CA\nShow Farm Income Opportunities Organizations & Professionals (353)\nThere are 353 resources serving Plumas County in the following categories:\nFarm Ranch And Forest Retailers (18)\nMap of Farm Income Opportunities Organizations & Professionals serving Plumas County\nWarner Valley Wildlife Area - Plumas County\nShow Farm Resources Organizations & Professionals (297)\nThere are 297 resources serving Plumas County in the following categories:\nMap of Farm Resources Organizations & Professionals serving Plumas County\nThere are 57 resources serving Plumas County in the following categories:\nMap of Organic Certification Organizations & Professionals serving Plumas County', |
| 'url': 'http://www.californialandcan.org/Plumas/Farm-Resources/'} |
| ``` |
|
|
| ### Data Fields |
|
|
| Each record has three fields: |
|
|
| - `url`: URL of the source page as a string |
| - `text`: text content as a string |
| - `timestamp`: timestamp as a string |
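
As a small illustration of working with these fields, the sketch below takes the `en` example record from Data Instances (with its `text` shortened here for brevity), parses the `timestamp` with the standard library, and extracts the host from `url`. The helper name `summarize_record` is hypothetical and not part of the `datasets` library, and the code assumes every timestamp follows the `YYYY-MM-DDTHH:MM:SSZ` layout seen in that example.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def summarize_record(record):
    """Hypothetical helper: derive a few convenience values from one mC4 record."""
    # Assumes the timestamp layout of the example record, e.g. '2018-06-24T01:32:39Z'.
    crawled_at = datetime.strptime(record["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    crawled_at = crawled_at.replace(tzinfo=timezone.utc)  # trailing 'Z' denotes UTC
    return {
        "host": urlparse(record["url"]).netloc,
        "crawled_at": crawled_at,
        "num_lines": len(record["text"].split("\n")),
    }


# Record copied from the `en` example, with `text` shortened for brevity.
example = {
    "timestamp": "2018-06-24T01:32:39Z",
    "text": "Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)",
    "url": "http://www.californialandcan.org/Plumas/Farm-Resources/",
}

summary = summarize_record(example)
```

All three fields arrive as plain strings, so any date arithmetic or per-domain grouping needs a conversion step along these lines.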
|
|
| ### Data Splits |
|
|
| To build mC4, the authors used [CLD3](https://github.com/google/cld3) to identify over 100 languages. The table below lists the resulting per-language configs; the sizes of the train and validation splits are not yet reported and are shown as `?`: |
|
|
| | config | train | validation | |
| |:---------|:--------|:-------------| |
| | af | ? | ? | |
| | am | ? | ? | |
| | ar | ? | ? | |
| | az | ? | ? | |
| | be | ? | ? | |
| | bg | ? | ? | |
| | bg-Latn | ? | ? | |
| | bn | ? | ? | |
| | ca | ? | ? | |
| | ceb | ? | ? | |
| | co | ? | ? | |
| | cs | ? | ? | |
| | cy | ? | ? | |
| | da | ? | ? | |
| | de | ? | ? | |
| | el | ? | ? | |
| | el-Latn | ? | ? | |
| | en | ? | ? | |
| | eo | ? | ? | |
| | es | ? | ? | |
| | et | ? | ? | |
| | eu | ? | ? | |
| | fa | ? | ? | |
| | fi | ? | ? | |
| | fil | ? | ? | |
| | fr | ? | ? | |
| | fy | ? | ? | |
| | ga | ? | ? | |
| | gd | ? | ? | |
| | gl | ? | ? | |
| | gu | ? | ? | |
| | ha | ? | ? | |
| | haw | ? | ? | |
| | hi | ? | ? | |
| | hi-Latn | ? | ? | |
| | hmn | ? | ? | |
| | ht | ? | ? | |
| | hu | ? | ? | |
| | hy | ? | ? | |
| | id | ? | ? | |
| | ig | ? | ? | |
| | is | ? | ? | |
| | it | ? | ? | |
| | iw | ? | ? | |
| | ja | ? | ? | |
| | ja-Latn | ? | ? | |
| | jv | ? | ? | |
| | ka | ? | ? | |
| | kk | ? | ? | |
| | km | ? | ? | |
| | kn | ? | ? | |
| | ko | ? | ? | |
| | ku | ? | ? | |
| | ky | ? | ? | |
| | la | ? | ? | |
| | lb | ? | ? | |
| | lo | ? | ? | |
| | lt | ? | ? | |
| | lv | ? | ? | |
| | mg | ? | ? | |
| | mi | ? | ? | |
| | mk | ? | ? | |
| | ml | ? | ? | |
| | mn | ? | ? | |
| | mr | ? | ? | |
| | ms | ? | ? | |
| | mt | ? | ? | |
| | my | ? | ? | |
| | ne | ? | ? | |
| | nl | ? | ? | |
| | no | ? | ? | |
| | ny | ? | ? | |
| | pa | ? | ? | |
| | pl | ? | ? | |
| | ps | ? | ? | |
| | pt | ? | ? | |
| | ro | ? | ? | |
| | ru | ? | ? | |
| | ru-Latn | ? | ? | |
| | sd | ? | ? | |
| | si | ? | ? | |
| | sk | ? | ? | |
| | sl | ? | ? | |
| | sm | ? | ? | |
| | sn | ? | ? | |
| | so | ? | ? | |
| | sq | ? | ? | |
| | sr | ? | ? | |
| | st | ? | ? | |
| | su | ? | ? | |
| | sv | ? | ? | |
| | sw | ? | ? | |
| | ta | ? | ? | |
| | te | ? | ? | |
| | tg | ? | ? | |
| | th | ? | ? | |
| | tr | ? | ? | |
| | uk | ? | ? | |
| | und | ? | ? | |
| | ur | ? | ? | |
| | uz | ? | ? | |
| | vi | ? | ? | |
| | xh | ? | ? | |
| | yi | ? | ? | |
| | yo | ? | ? | |
| | zh | ? | ? | |
| | zh-Latn | ? | ? | |
| | zu | ? | ? | |
|
|
| ## Dataset Creation |
|
|
| ### Curation Rationale |
|
|
| [More Information Needed] |
|
|
| ### Source Data |
|
|
| #### Initial Data Collection and Normalization |
|
|
| [More Information Needed] |
|
|
| #### Who are the source language producers? |
|
|
| [More Information Needed] |
|
|
| ### Annotations |
|
|
| #### Annotation process |
|
|
| [More Information Needed] |
|
|
| #### Who are the annotators? |
|
|
| [More Information Needed] |
|
|
| ### Personal and Sensitive Information |
|
|
| [More Information Needed] |
|
|
| ## Considerations for Using the Data |
|
|
| ### Social Impact of Dataset |
|
|
| [More Information Needed] |
|
|
| ### Discussion of Biases |
|
|
| [More Information Needed] |
|
|
| ### Other Known Limitations |
|
|
| [More Information Needed] |
|
|
| ## Additional Information |
|
|
| ### Dataset Curators |
|
|
| [More Information Needed] |
|
|
| ### Licensing Information |
|
|
| AllenAI are releasing this dataset under the terms of ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset. |
|
|
| ### Citation Information |
|
|
| ``` |
| @article{2019t5, |
| author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu}, |
| title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer}, |
| journal = {arXiv e-prints}, |
| year = {2019}, |
| archivePrefix = {arXiv}, |
| eprint = {1910.10683}, |
| } |
| ``` |
|
|
| ### Contributions |
|
|
| Thanks to [@dirkgr](https://github.com/dirkgr) and [@lhoestq](https://github.com/lhoestq) for adding this dataset. |
|
|