📗 SPECTER2–FAPESP Knowledge Area (Multiclass Classification on FAPESP Área do Conhecimento (Level 1))

This model (SIRIS-Lab/specter2-fapesp-area-multiclass) is a fine-tuned version of allenai/specter2_base on the FAPESP dataset. It achieves the following results on the evaluation set:

  • Loss: 1.3826
  • Accuracy: 0.6926
  • Precision Micro: 0.6926
  • Precision Macro: 0.6941
  • Recall Micro: 0.6926
  • Recall Macro: 0.6794
  • F1 Micro: 0.6926
  • F1 Macro: 0.6765

Model description

This model is a fine-tuned version of SPECTER2 (allenai/specter2_base) adapted for multiclass classification across the 76 Áreas do Conhecimento (Knowledge Areas) of FAPESP.

The model accepts the title, abstract, or title + abstract of a research project and assigns it to exactly one of the 76 Knowledge Areas (e.g., Veterinary Medicine, Dentistry, Physiotherapy and Occupational Therapy, Philosophy).

Key characteristics:

  • Base model: allenai/specter2_base
  • Task: multiclass document classification
  • Labels: 76 Knowledge Areas
  • Activation: softmax
  • Loss: CrossEntropyLoss
  • Output: the single best-matching FAPESP Knowledge Area

FAPESP's Knowledge Areas represent broad disciplinary domains designed for high-level categorization of R&I documents.

Intended uses & limitations

This multiclass model is suitable for:

  • Assigning publications to top-level scientific disciplines
  • Enriching metadata in:
    • repositories
    • research output systems
    • funding and project datasets
    • bibliometric dashboards
  • Supporting scientometric analyses such as:
    • broad-discipline portfolio mapping
    • domain-level clustering
    • modeling research diversification
  • Classifying documents when only title/abstract is available

The model supports inputs such as:

  • title only
  • abstract only
  • title + abstract (recommended)
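A minimal helper for assembling the input text from these components; the function name and the plain-space separator are illustrative assumptions of this sketch, not taken from the released preprocessing code:

```python
from typing import Optional


def build_input(title: Optional[str], abstract: Optional[str]) -> str:
    """Combine whichever of title/abstract is available into one string.

    Title + abstract is the recommended configuration; the space
    separator is an assumption of this sketch.
    """
    parts = [p.strip() for p in (title, abstract) if p and p.strip()]
    if not parts:
        raise ValueError("need at least a title or an abstract")
    return " ".join(parts)


print(build_input("Deep learning for crop yield prediction", None))
```

Truncation to the 512-token limit is left to the tokenizer at inference time.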

Limitations

  • Documents spanning multiple fields must be forced into a single label, an inherent limitation of multiclass classification.
  • The training labels come from FAPESP-funded project records, not manual expert annotation.
  • Not suitable for:
    • downstream tasks requiring multilabel outputs
    • WoS Categories or ASJC Areas (use separate models)
    • clinical or regulatory decision-making

Predictions should be treated as field-level disciplinary metadata.

Training and evaluation data

The training and evaluation dataset was constructed from publicly available FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) research project records. These records cover funded research projects and scholarships across all scientific domains in Brazil.

The dataset was assembled using the following CSV downloads provided by FAPESP:

  • Auxílios em andamento (ongoing research grants)
  • Auxílios concluídos (completed research grants)
  • Bolsas no Brasil em andamento (ongoing domestic scholarships)
  • Bolsas no Brasil concluídas (completed domestic scholarships)
  • Bolsas no exterior em andamento (ongoing international scholarships)
  • Bolsas no exterior concluídas (completed international scholarships)

Each record contains metadata such as project titles, abstracts, funding type, and scientific classifications.
From these files, the following fields were extracted and standardized:

  • Title (English)
  • Abstract (English)
  • Grande รrea do Conhecimento (major scientific domain)
  • รrea do Conhecimento (field of study)

Only entries containing at least one English component (title or abstract) were retained.
Scientific areas were normalized and mapped to a controlled English taxonomy to ensure consistency and comparability across records.

The final dataset consists of labeled scientific text samples distributed across multiple domains, providing a broad corpus for supervised classification.
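The retention rule described above (keep a record only if it has a Knowledge Area label and at least one English component) can be sketched as follows; the dictionary keys are hypothetical stand-ins for the actual FAPESP CSV column names:

```python
def select_records(rows):
    """Keep rows with a Knowledge Area label and at least one English
    component (title or abstract); keys are illustrative stand-ins."""
    kept = []
    for row in rows:
        title = (row.get("title_en") or "").strip()
        abstract = (row.get("abstract_en") or "").strip()
        area = (row.get("area") or "").strip()
        if area and (title or abstract):
            kept.append({
                "text": " ".join(p for p in (title, abstract) if p),
                "label": area,
            })
    return kept


rows = [
    {"title_en": "Soil microbiome dynamics", "abstract_en": "", "area": "Agronomy"},
    {"title_en": "", "abstract_en": "", "area": "Physics"},  # no English text: dropped
]
print(select_records(rows))
```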

Training procedure

Preprocessing

  • Input text constructed from the abstract
  • Tokenization using the SPECTER2 tokenizer
  • Maximum sequence length: 512 tokens

Model

  • Base model: allenai/specter2_base
  • Classification head: linear layer โ†’ softmax
  • Loss: CrossEntropyLoss
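A NumPy sketch of the head described above: a linear layer over the pooled encoder output, a softmax over the 76 Knowledge Areas, and the cross-entropy loss. It assumes the 768-dim pooled embedding of the base encoder; the random values are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, num_labels = 768, 76

# Linear classification head (weights and bias)
W = rng.normal(0, 0.02, (hidden, num_labels))
b = np.zeros(num_labels)

pooled = rng.normal(size=hidden)   # stand-in for the pooled SPECTER2 embedding
logits = pooled @ W + b

# Softmax -> probability distribution over the 76 Knowledge Areas
probs = np.exp(logits - logits.max())
probs /= probs.sum()
pred = int(probs.argmax())         # single best-matching class

# Cross-entropy loss against a gold label index
gold = 3
loss = -np.log(probs[gold])
print(pred, float(loss))
```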

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 10
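These settings map onto a Hugging Face TrainingArguments configuration roughly as follows (a sketch, not the released training script; output_dir and the per-epoch evaluation strategy are assumptions):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="specter2-fapesp-area",   # assumed path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    num_train_epochs=10,
    eval_strategy="epoch",               # assumed: the table below reports per-epoch metrics
)
```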

Training results

| Training Loss | Epoch | Step  | Validation Loss | Accuracy | Precision Micro | Precision Macro | Recall Micro | Recall Macro | F1 Micro | F1 Macro |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:---------------:|:---------------:|:------------:|:------------:|:--------:|:--------:|
| 1.3604        | 1.0   | 3807  | 1.2673          | 0.6468   | 0.6468          | 0.6168          | 0.6468       | 0.6010       | 0.6468   | 0.6006   |
| 1.008         | 2.0   | 7614  | 1.1363          | 0.6881   | 0.6881          | 0.6827          | 0.6881       | 0.6571       | 0.6881   | 0.6577   |
| 0.72          | 3.0   | 11421 | 1.1601          | 0.6938   | 0.6938          | 0.6832          | 0.6938       | 0.6705       | 0.6938   | 0.6686   |
| 0.4764        | 4.0   | 15228 | 1.2849          | 0.6932   | 0.6932          | 0.7208          | 0.6932       | 0.6850       | 0.6932   | 0.6883   |
| 0.315         | 5.0   | 19035 | 1.3826          | 0.6926   | 0.6926          | 0.6941          | 0.6926       | 0.6794       | 0.6926   | 0.6765   |

Evaluation results

| Class | Precision | Recall | F1-score | Support |
|:------|----------:|-------:|---------:|--------:|
| Administration | 0.641509 | 0.68 | 0.660194 | 50 |
| Aerospace Engineering | 0.79661 | 0.783333 | 0.789916 | 60 |
| Agricultural Engineering | 0.586207 | 0.53125 | 0.557377 | 32 |
| Agronomy | 0.596491 | 0.68 | 0.635514 | 50 |
| Animal Husbandry | 0.746269 | 0.847458 | 0.793651 | 59 |
| Anthropology | 0.807692 | 0.688525 | 0.743363 | 61 |
| Archeology | 0.875 | 0.823529 | 0.848485 | 17 |
| Architecture and Town Planning | 0.775862 | 0.703125 | 0.737705 | 64 |
| Arts | 0.772727 | 0.822581 | 0.796875 | 62 |
| Astronomy | 0.78125 | 0.943396 | 0.854701 | 53 |
| Biochemistry | 0.434783 | 0.363636 | 0.39604 | 55 |
| Biology | 0.428571 | 0.230769 | 0.3 | 26 |
| Biomedical Engineering | 0.621212 | 0.759259 | 0.683333 | 54 |
| Biophysics | 0.660377 | 0.714286 | 0.686275 | 49 |
| Botany | 0.72 | 0.692308 | 0.705882 | 52 |
| Chemical Engineering | 0.542373 | 0.64 | 0.587156 | 50 |
| Chemistry | 0.56 | 0.622222 | 0.589474 | 45 |
| Civil Engineering | 0.649123 | 0.660714 | 0.654867 | 56 |
| Collective Health | 0.590164 | 0.5625 | 0.576 | 64 |
| Communications | 0.672727 | 0.72549 | 0.698113 | 51 |
| Computer Science | 0.727273 | 0.816327 | 0.769231 | 49 |
| Demography | 1 | 0.5 | 0.666667 | 2 |
| Dentistry | 0.8 | 0.813559 | 0.806723 | 59 |
| Ecology | 0.555556 | 0.6 | 0.576923 | 50 |
| Economics | 0.678571 | 0.690909 | 0.684685 | 55 |
| Education | 0.75 | 0.688525 | 0.717949 | 61 |
| Electrical Engineering | 0.809524 | 0.618182 | 0.701031 | 55 |
| Fishery Resources and Fishery Engineering | 0.813953 | 0.729167 | 0.769231 | 48 |
| Food Science and Technology | 0.744681 | 0.583333 | 0.654206 | 60 |
| Forestry Resources and Forestry Engineering | 0.808511 | 0.883721 | 0.844444 | 43 |
| Genetics | 0.520833 | 0.5 | 0.510204 | 50 |
| Geography | 0.827586 | 0.8 | 0.813559 | 60 |
| Geosciences | 0.716418 | 0.8 | 0.755906 | 60 |
| History | 0.698113 | 0.770833 | 0.732673 | 48 |
| Home Economics | 0 | 0 | 0 | 0 |
| Immunology | 0.8125 | 0.732394 | 0.77037 | 71 |
| Industrial Design | 0.6 | 0.5 | 0.545455 | 6 |
| Information Science | 0.6875 | 0.733333 | 0.709677 | 15 |
| Law | 0.766667 | 0.621622 | 0.686567 | 37 |
| Linguistics | 0.6875 | 0.88 | 0.77193 | 50 |
| Literature | 0.621212 | 0.854167 | 0.719298 | 48 |
| Materials and Metallurgical Engineering | 0.688889 | 0.596154 | 0.639175 | 52 |
| Mathematics | 0.943396 | 0.847458 | 0.892857 | 59 |
| Mechanical Engineering | 0.666667 | 0.607143 | 0.635514 | 56 |
| Medicine | 0.444444 | 0.47619 | 0.45977 | 42 |
| Microbiology | 0.647059 | 0.5 | 0.564103 | 66 |
| Mining Engineering | 1 | 0.428571 | 0.6 | 7 |
| Morphology | 0.566667 | 0.618182 | 0.591304 | 55 |
| Museology | 0.75 | 1 | 0.857143 | 3 |
| Naval and Oceanic Engineering | 0.444444 | 0.666667 | 0.533333 | 6 |
| Nuclear Engineering | 0.555556 | 0.714286 | 0.625 | 7 |
| Nursing | 0.880952 | 0.72549 | 0.795699 | 51 |
| Nutrition | 0.728814 | 0.781818 | 0.754386 | 55 |
| Oceanography | 0.866667 | 0.764706 | 0.8125 | 34 |
| Parasitology | 0.732143 | 0.759259 | 0.745455 | 54 |
| Pharmacology | 0.636364 | 0.528302 | 0.57732 | 53 |
| Pharmacy | 0.673913 | 0.508197 | 0.579439 | 61 |
| Philosophy | 0.896552 | 0.825397 | 0.859504 | 63 |
| Physical Education | 0.722222 | 0.795918 | 0.757282 | 49 |
| Physics | 0.672414 | 0.661017 | 0.666667 | 59 |
| Physiology | 0.54902 | 0.538462 | 0.543689 | 52 |
| Physiotherapy and Occupational Therapy | 0.746032 | 0.854545 | 0.79661 | 55 |
| Political Science | 0.694444 | 0.769231 | 0.729927 | 65 |
| Probability and Statistics | 0.870968 | 0.870968 | 0.870968 | 31 |
| Production Engineering | 0.734694 | 0.692308 | 0.712871 | 52 |
| Psychology | 0.764706 | 0.590909 | 0.666667 | 44 |
| Sanitary Engineering | 0.746269 | 0.847458 | 0.793651 | 59 |
| Sociology | 0.550725 | 0.622951 | 0.584615 | 61 |
| Speech Therapy | 0.852459 | 0.928571 | 0.888889 | 56 |
| Theology | 1 | 0.25 | 0.4 | 4 |
| Tourism | 1 | 0.5 | 0.666667 | 2 |
| Transportation Engineering | 0.714286 | 0.714286 | 0.714286 | 7 |
| Urban and Regional Planning | 0.466667 | 0.538462 | 0.5 | 13 |
| Veterinary Medicine | 0.655172 | 0.666667 | 0.66087 | 57 |
| Welfare Services | 1 | 0.5 | 0.666667 | 4 |
| Zoology | 0.677419 | 0.792453 | 0.730435 | 53 |
| accuracy |  |  | 0.699764 | 3384 |
| macro avg | 0.702965 | 0.672006 | 0.675986 | 3384 |
| weighted avg | 0.703496 | 0.699764 | 0.697517 | 3384 |
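As the identical Micro columns in the tables above illustrate, micro-averaged precision, recall, and F1 all equal accuracy in single-label multiclass evaluation, while the macro average weights every class equally regardless of support. A small pure-Python check of that identity, on toy labels rather than FAPESP data:

```python
from collections import Counter


def micro_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-F1); in single-label multiclass,
    micro precision = micro recall = micro F1 = accuracy."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = Counter((t, p) for t, p in zip(y_true, y_pred) if t == p)
    f1s = []
    for c in labels:
        tp_c = tp.get((c, c), 0)
        pred_c = sum(1 for p in y_pred if p == c)
        true_c = sum(1 for t in y_true if t == c)
        prec = tp_c / pred_c if pred_c else 0.0
        rec = tp_c / true_c if true_c else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, sum(f1s) / len(f1s)


y_true = ["Physics", "Physics", "Law", "Arts"]
y_pred = ["Physics", "Law", "Law", "Arts"]
micro, macro = micro_and_macro_f1(y_true, y_pred)
print(micro, macro)  # micro equals accuracy (0.75); macro is lower here
```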

Framework versions

  • Transformers 4.57.1
  • Pytorch 2.8.0+cu126
  • Datasets 3.6.0
  • Tokenizers 0.22.1
Model repository: SIRIS-Lab/specter2-fapesp-area-multiclass (safetensors, ~0.1B parameters, F32), finetuned from allenai/specter2_base.