BERTimbau for FOS classification (Title, Keywords, and Abstract)
This is a fine-tuned version of the BERTimbau encoder model for Field of Science and Technology (FOS) classification, developed in the context of the PROPOR 2026 paper "Field of Science and Technology Classification of Academic Documents in Portuguese". The model was fine-tuned on a collection of Portuguese theses from Estudo Geral, the digital repository of the University of Coimbra.
Note that this model was trained to classify theses by their title, keywords, and abstract, using special tokens to mark each field in the input text.
Model Details
Model Description
- Base Encoder: BERTimbau Base (neuralmind/bert-base-portuguese-cased)
- Maximum Sequence Length: 512 tokens
- Training Dataset: PROPOR FOS Classification
- Language: pt
- License: apache-2.0
Model Labels
"Ciências Sociais", "Ciências Médicas e da Saúde", "Humanidades", "Ciências da Engenharia e Tecnologias", "Ciências Exatas e Naturais"
Model Special Tokens
"[TITLE]", "[KEYWORDS]", "[ABSTRACT]"
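The three fields are concatenated into a single input string, each preceded by its special token. A minimal sketch of such a formatting helper (the function name is illustrative, not part of the released code):

```python
def format_input(title: str, keywords: list[str], abstract: str) -> str:
    """Join thesis fields with the model's field-marker special tokens."""
    return f"[TITLE] {title} [KEYWORDS] {', '.join(keywords)} [ABSTRACT] {abstract}"
```

For example, `format_input("Envelhecimento nas dificuldades intelectuais", ["Deficiente mental", "idoso"], "O presente estudo...")` produces a string in the same shape as the inference example below.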
Evaluation
Metrics
| Model | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|
| bertimbau-base-cased-FOS-T | 0.8024 | 0.7729 | 0.7849 | 0.8454 |
| bertimbau-base-cased-FOS-TK | 0.8759 | 0.8418 | 0.8557 | 0.8933 |
| bertimbau-base-cased-FOS-TA | 0.8841 | 0.8707 | 0.8766 | 0.9119 |
| bertimbau-base-cased-FOS-TKA | 0.8860 | 0.8660 | 0.8742 | 0.9098 |
Note: Precision, Recall, and F1 metrics were macro-averaged
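Macro-averaging computes each metric per class and then takes the unweighted mean, so all five FOS classes contribute equally regardless of how many theses each contains. A minimal pure-Python sketch of macro F1 (toy labels; not the evaluation code used in the paper):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return sum(f1_scores) / len(f1_scores)
```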
Uses
Direct Inference
```python
from transformers import pipeline

classifier = pipeline(model="ivosimoes/bertimbau-base-cased-FOS-TKA")
classifier("[TITLE] Envelhecimento nas dificuldades intelectuais [KEYWORDS] Deficiente mental, idoso [ABSTRACT] O presente estudo teve como principal objetivo...")
```
Training Details
Training Hyperparameters
- epochs: 10
- learning_rate: 3e-5
- max_grad_norm: 1.0
- warmup_steps: 0
- optimizer: AdamW
- weight_decay: 0
- adam_betas: (0.9, 0.999)
- adam_epsilon: 1e-8
- seed: 42
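With weight_decay = 0 and warmup_steps = 0, AdamW reduces to standard Adam at a constant base learning rate. A pure-Python sketch of a single scalar AdamW update using the card's hyperparameters as defaults (for illustration only; training used the library's optimizer, not this code):

```python
import math

def adamw_step(param, grad, m, v, t, lr=3e-5, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.0):
    """One scalar AdamW update step at timestep t (1-indexed)."""
    b1, b2 = betas
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    param = param - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v
```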
Training Hardware
Models were trained on a remotely accessed machine with the following specifications:
- CPU: 48x Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz
- GPU: 3x NVIDIA RTX A6000 (48 GB VRAM each)
- RAM: 251 GB
Library Versions
- Python: 3.10.12
- Transformers: 5.2.0
- PyTorch: 2.10.0+cu129
- Datasets: 4.6.1
- Accelerate: 1.12.0
Acknowledgments
This work was partially supported by the AMALIA project, funded by FCT/IP in the context of measure RE-C05-i08 of the Portuguese Recovery and Resilience Program;
by the Portuguese Recovery and Resilience Plan (PRR) through project C645008882-00000055, Center for Responsible AI;
and by national funds through FCT -- Foundation for Science and Technology I.P., in the framework of the Project CISUC (UIDB/00326/2025 and UIDP/00326/2025).
Citation
BibTeX
@inproceedings{simoes-etal-2026-field,
title = "Field of Science and Technology Classification of Academic Documents in {P}ortuguese",
author = "Sim{\~o}es, Ivo and
Oliveira, Hugo Gon{\c{c}}alo and
Correia, Jo{\~a}o",
editor = "Souza, Marlo and
de-Dios-Flores, Iria and
Santos, Diana and
Freitas, Larissa and
Souza, Jackson Wilke da Cruz and
Ribeiro, Eug{\'e}nio",
booktitle = "Proceedings of the 17th International Conference on Computational Processing of {P}ortuguese ({PROPOR} 2026) - Vol. 1",
month = apr,
year = "2026",
address = "Salvador, Brazil",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.propor-1.104/",
pages = "1021--1026",
ISBN = "979-8-89176-387-6",
abstract = "Towards improving metadata in academic repositories, this study evaluates the efficacy of different transformer-based models in the automatic classification of the Field of Science and Technology (FOS) of academic theses written in Portuguese. We compare the performance of four different encoder models, two multilingual and two Portuguese-specific, against five larger decoder-based LLMs, on a dataset of 9,696 theses characterized by their title, keywords, and abstract. Fine-tuned encoder-based models achieved the best scores (F1 = 88{\%}), outperforming general-purpose decoder models prompted for the task. These results suggest that, for localized academic domains, task-specific fine-tuning remains more effective than general-purpose LLM prompting."
}
Base model
neuralmind/bert-base-portuguese-cased