arXiv:2305.06721

Advancing Neural Encoding of Portuguese with Transformer Albertina PT-*

Published on May 11, 2023

AI-generated summary

Albertina PT-*, a Transformer-based foundation model, achieves state-of-the-art performance for European Portuguese (PT-PT) and American Portuguese from Brazil (PT-BR) by pre-training on variant-specific corpora, outperforming existing models on downstream language processing tasks.

Abstract

To advance the neural encoding of Portuguese (PT), and a fortiori the technological preparation of this language for the digital age, we developed a Transformer-based foundation model that sets a new state of the art in this respect for two of its variants, namely European Portuguese from Portugal (PT-PT) and American Portuguese from Brazil (PT-BR). To develop this encoder, which we named Albertina PT-*, we started from a strong model, DeBERTa, and continued its pre-training over Portuguese data sets: one we gathered for PT-PT and the brWaC corpus for PT-BR. The performance of Albertina and competing models was assessed by evaluating them on prominent downstream language processing tasks adapted for Portuguese. Both the Albertina PT-PT and PT-BR versions are distributed free of charge under the most permissive license possible and can be run on consumer-grade hardware, thus contributing to the advancement of research and innovation in language technology for Portuguese.
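
The encoder is a masked language model, so the most direct way to try it is fill-mask inference with the Hugging Face transformers library. The sketch below is illustrative rather than taken from the paper: it assumes the PT-PT checkpoint is published on the Hub under the repository ID PORTULAN/albertina-ptpt (check the model listing below for the exact name).

# Minimal sketch: masked-token prediction with an Albertina checkpoint.
# Assumption: the PT-PT model lives at "PORTULAN/albertina-ptpt" on the
# Hugging Face Hub; verify the exact repository ID before running.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="PORTULAN/albertina-ptpt")

# DeBERTa-style encoders use "[MASK]" as the mask token.
for prediction in fill_mask("A capital de Portugal é [MASK]."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")

Each prediction is a dictionary holding the filled-in token and its score; swapping the model ID for the PT-BR checkpoint works identically.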


Models citing this paper: 3
Datasets citing this paper: 4
Spaces citing this paper: 2
Collections including this paper: 0
