arXiv:2305.06721

Advancing Neural Encoding of Portuguese with Transformer Albertina PT-*

Published on May 11, 2023

AI-generated summary

Albertina PT-*, a Transformer-based foundation model, achieves state-of-the-art performance for European Portuguese (PT-PT) and American Portuguese from Brazil (PT-BR) by pre-training on variant-specific corpora, outperforming existing models on downstream language processing tasks.

Abstract

To advance the neural encoding of Portuguese (PT), and a fortiori the technological preparation of this language for the digital age, we developed a Transformer-based foundation model that sets a new state of the art in this respect for two of its variants, namely European Portuguese from Portugal (PT-PT) and American Portuguese from Brazil (PT-BR). To develop this encoder, which we named Albertina PT-*, we started from a strong model, DeBERTa, and continued its pre-training over Portuguese data sets: one we gathered for PT-PT and the brWaC corpus for PT-BR. The performance of Albertina and competing models was assessed by evaluating them on prominent downstream language processing tasks adapted for Portuguese. Both the Albertina PT-PT and PT-BR versions are distributed free of charge under the most permissive license possible and can be run on consumer-grade hardware, thus contributing to the advancement of research and innovation in language technology for Portuguese.
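
The encoder is a masked language model, so the most direct way to try it is fill-mask inference with the Hugging Face transformers library. The sketch below is illustrative rather than taken from the paper: it assumes the PT-PT checkpoint is published on the Hub under the repository ID PORTULAN/albertina-ptpt (check the model listing below for the exact name).

# Minimal sketch: masked-token prediction with an Albertina checkpoint.
# Assumption: the PT-PT model lives at "PORTULAN/albertina-ptpt" on the
# Hugging Face Hub; verify the exact repository ID before running.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="PORTULAN/albertina-ptpt")

# DeBERTa-style encoders use "[MASK]" as the mask token.
for prediction in fill_mask("A capital de Portugal é [MASK]."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")

Each prediction is a dictionary holding the filled-in token and its score; swapping the model ID for the PT-BR checkpoint works identically.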


Models citing this paper: 3
Datasets citing this paper: 4
Spaces citing this paper: 2
Collections including this paper: 0
