BERTimbau for FOS classification (Title, Keywords, and Abstract)
This is a fine-tuned version of the BERTimbau encoder model for Field of Science and Technology (FOS) classification, developed in the context of the PROPOR 2026 paper "Field of Science and Technology Classification of Academic Documents in Portuguese". The model was fine-tuned on a collection of Portuguese theses from Estudo Geral, the digital repository of the University of Coimbra.
Note that this model was trained to classify theses by their title, keywords, and abstract, using special tokens to mark each field in the input text.
Model Details
Model Description
- Base Encoder: BERTimbau Base (neuralmind/bert-base-portuguese-cased)
- Maximum Sequence Length: 512 tokens
- Training Dataset: PROPOR FOS Classification
- Language: pt
- License: apache-2.0
Model Labels
"Ciências Sociais", "Ciências Médicas e da Saúde", "Humanidades", "Ciências da Engenharia e Tecnologias", "Ciências Exatas e Naturais"
Model Special Tokens
"[TITLE]", "[KEYWORDS]", "[ABSTRACT]"
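The three fields are concatenated into a single input string, each preceded by its special token. A minimal sketch of such a formatting helper (the function name is illustrative, not part of the released code):

```python
def format_input(title: str, keywords: list[str], abstract: str) -> str:
    """Join thesis fields with the model's field-marker special tokens."""
    return f"[TITLE] {title} [KEYWORDS] {', '.join(keywords)} [ABSTRACT] {abstract}"
```

For example, `format_input("Envelhecimento nas dificuldades intelectuais", ["Deficiente mental", "idoso"], "O presente estudo...")` produces a string in the same shape as the inference example below.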
Evaluation
Metrics
| Model | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|
| bertimbau-base-cased-FOS-T | 0.8024 | 0.7729 | 0.7849 | 0.8454 |
| bertimbau-base-cased-FOS-TK | 0.8759 | 0.8418 | 0.8557 | 0.8933 |
| bertimbau-base-cased-FOS-TA | 0.8841 | 0.8707 | 0.8766 | 0.9119 |
| bertimbau-base-cased-FOS-TKA | 0.8860 | 0.8660 | 0.8742 | 0.9098 |
Note: Precision, Recall, and F1 metrics were macro-averaged
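Macro-averaging computes each metric per class and then takes the unweighted mean, so all five FOS classes contribute equally regardless of how many theses each contains. A minimal pure-Python sketch of macro F1 (toy labels; not the evaluation code used in the paper):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return sum(f1_scores) / len(f1_scores)
```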
Uses
Direct Inference
```python
from transformers import pipeline

classifier = pipeline(model="ivosimoes/bertimbau-base-cased-FOS-TKA")
classifier("[TITLE] Envelhecimento nas dificuldades intelectuais [KEYWORDS] Deficiente mental, idoso [ABSTRACT] O presente estudo teve como principal objetivo...")
```
Training Details
Training Hyperparameters
- epochs: 10
- learning_rate: 3e-5
- max_grad_norm: 1.0
- warmup_steps: 0
- optimizer: AdamW
- weight_decay: 0
- adam_betas: (0.9, 0.999)
- adam_epsilon: 1e-8
- seed: 42
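With weight_decay = 0 and warmup_steps = 0, AdamW reduces to standard Adam at a constant base learning rate. A pure-Python sketch of a single scalar AdamW update using the card's hyperparameters as defaults (for illustration only; training used the library's optimizer, not this code):

```python
import math

def adamw_step(param, grad, m, v, t, lr=3e-5, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.0):
    """One scalar AdamW update step at timestep t (1-indexed)."""
    b1, b2 = betas
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    param = param - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v
```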
Training Hardware
Models were trained on a remotely accessed machine with the following specifications:
- CPU: 48x Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz
- GPU: 3x NVIDIA RTX A6000 (48 GB VRAM each)
- RAM: 251 GB
Library Versions
- Python: 3.10.12
- Transformers: 5.2.0
- PyTorch: 2.10.0+cu129
- Datasets: 4.6.1
- Accelerate: 1.12.0
Acknowledgments
This work was partially supported by the AMALIA project, funded by FCT/IP in the context of measure RE-C05-i08 of the Portuguese Recovery and Resilience Program;
by the Portuguese Recovery and Resilience Plan (PRR) through project C645008882-00000055, Center for Responsible AI;
and by national funds through FCT -- Foundation for Science and Technology I.P., in the framework of the Project CISUC (UIDB/00326/2025 and UIDP/00326/2025).
Citation
BibTeX
@inproceedings{simoes-etal-2026-field,
title = "Field of Science and Technology Classification of Academic Documents in {P}ortuguese",
author = "Sim{\~o}es, Ivo and
Oliveira, Hugo Gon{\c{c}}alo and
Correia, Jo{\~a}o",
editor = "Souza, Marlo and
de-Dios-Flores, Iria and
Santos, Diana and
Freitas, Larissa and
Souza, Jackson Wilke da Cruz and
Ribeiro, Eug{\'e}nio",
booktitle = "Proceedings of the 17th International Conference on Computational Processing of {P}ortuguese ({PROPOR} 2026) - Vol. 1",
month = apr,
year = "2026",
address = "Salvador, Brazil",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.propor-1.104/",
pages = "1021--1026",
ISBN = "979-8-89176-387-6",
abstract = "Towards improving metadata in academic repositories, this study evaluates the efficacy of different transformer-based models in the automatic classification of the Field of Science and Technology (FOS) of academic theses written in Portuguese. We compare the performance of four different encoder models, two multilingual and two Portuguese-specific, against five larger decoder-based LLMs, on a dataset of 9,696 theses characterized by their title, keywords, and abstract. Fine-tuned encoder-based models achieved the best scores (F1 = 88{\%}), outperforming general-purpose decoder models prompted for the task. These results suggest that, for localized academic domains, task-specific fine-tuning remains more effective than general-purpose LLM prompting."
}
Base model
neuralmind/bert-base-portuguese-cased