plT5 Base

plT5 models are T5-based language models trained on Polish corpora. They were trained with the original T5 denoising objective.
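To make the denoising objective concrete, here is a minimal sketch of how a T5-style input/target pair is built from a sentence. It is an illustration under assumptions, not the training code used for plT5: the helper `corrupt_spans` is hypothetical, the spans are chosen by hand rather than sampled at random, whitespace-split words stand in for subword tokens, and the `<extra_id_N>` sentinel names follow the Hugging Face T5 convention, which may differ from the plT5 vocabulary.

```python
def corrupt_spans(tokens, spans):
    """Build a T5-style (input, target) pair.

    `spans` is a list of (start, end) index pairs, end exclusive,
    non-overlapping and sorted by start. Hypothetical helper.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"  # assumed sentinel naming
        # Keep the text before the span, then replace the span with a sentinel.
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        # The target lists each sentinel followed by the tokens it hides.
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    # A final sentinel closes the target sequence.
    targets.append(f"<extra_id_{len(spans)}>")
    return " ".join(inputs), " ".join(targets)


tokens = "Ala ma kota i psa w domu".split()
corrupted, target = corrupt_spans(tokens, [(1, 2), (4, 6)])
print(corrupted)  # Ala <extra_id_0> kota i <extra_id_1> domu
print(target)     # <extra_id_0> ma <extra_id_1> psa w <extra_id_2>
```

The model reads the corrupted input and learns to generate the target, that is, the missing spans in order.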

Corpus

plT5 was trained on six corpora available for the Polish language:

Corpus	Tokens	Documents
CCNet Middle	3243M	7.9M
CCNet Head	2641M	7.0M
National Corpus of Polish	1357M	3.9M
Open Subtitles	1056M	1.1M
Wikipedia	260M	1.4M
Wolne Lektury	41M	5.5k
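As a quick sanity check on the table, the sketch below sums the reported figures and computes each corpus's share of the total token count. The numbers are copied from the table (tokens and documents in millions, with 5.5k documents written as 0.0055M); the variable names are illustrative only.

```python
# (tokens in millions, documents in millions), copied from the table above
corpora = {
    "CCNet Middle": (3243, 7.9),
    "CCNet Head": (2641, 7.0),
    "National Corpus of Polish": (1357, 3.9),
    "Open Subtitles": (1056, 1.1),
    "Wikipedia": (260, 1.4),
    "Wolne Lektury": (41, 0.0055),
}

total_tokens_m = sum(tokens for tokens, _ in corpora.values())
total_docs_m = sum(docs for _, docs in corpora.values())
token_share = {name: tokens / total_tokens_m for name, (tokens, _) in corpora.items()}

print(f"Total tokens: {total_tokens_m}M")  # Total tokens: 8598M
print(f"Total documents: {total_docs_m:.1f}M")
for name, share in token_share.items():
    print(f"{name}: {share:.1%} of tokens")
```

By these figures, the two CCNet subsets together account for roughly two thirds of all training tokens.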

Tokenizer

The training dataset was tokenized into subwords using a SentencePiece unigram model with a vocabulary size of 50k tokens.

Usage

Example code:

from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the bare encoder-decoder model (no task-specific head)
tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-base")
model = AutoModel.from_pretrained("allegro/plt5-base")
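Once loaded, the model can be used, for example, to obtain contextual representations of a sentence. The sketch below is one possible way to do that and assumes PyTorch is installed; the helper name `encode_sentence` is hypothetical. It runs only the encoder, because the full T5 forward pass also expects decoder inputs.

```python
def encode_sentence(text, model_name="allegro/plt5-base"):
    """Return encoder hidden states for `text` (1 x seq_len x hidden size)."""
    # Imported here so the function can be defined without the libraries installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    batch = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Run only the encoder stack of the encoder-decoder model.
        output = model.get_encoder()(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        )
    return output.last_hidden_state


# Example call (downloads the checkpoint on first use):
# hidden = encode_sentence("Ala ma kota.")
# print(hidden.shape)
```

For text generation or fine-tuning on text-to-text tasks, a class with a language-modeling head such as `AutoModelForSeq2SeqLM` is the usual choice instead of `AutoModel`.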

License

CC BY 4.0

Citation

If you use this model, please cite the following paper:

@article{chrabrowa2022evaluation,
  title={Evaluation of Transfer Learning for Polish with a Text-to-Text Model},
  author={Chrabrowa, Aleksandra and Dragan, {\L}ukasz and Grzegorczyk, Karol and Kajtoch, Dariusz and Koszowski, Miko{\l}aj and Mroczkowski, Robert and Rybak, Piotr},
  journal={arXiv preprint arXiv:2205.08808},
  year={2022}
}

Authors

The model was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.

You can contact us at: klejbenchmark@allegro.pl
