metaphor-cat-roberta-large-weights
This model is a fine-tuned version of projecte-aina/roberta-large-ca-v2 on the Catalan metaphor detection dataset metaphor-catalan.
It achieves the following results on the evaluation set:
- Precision: 0.6897
- Recall: 0.5556
- F1: 0.6154
- Accuracy: 0.9713
Model description
This model is a RoBERTa-large transformer trained for Catalan and fine-tuned for token-level metaphor detection.
The model performs sequence labeling to identify metaphorical expressions using a BIO tagging scheme:
- O – non-metaphorical token
- B-METAPHOR – beginning of a metaphorical expression
- I-METAPHOR – continuation of a metaphorical expression
The base model was pretrained as part of the AINA project for Catalan NLP.
During fine-tuning, class-weighted cross-entropy loss was applied to mitigate the strong class imbalance in the dataset, where metaphor tokens are much less frequent than literal tokens.
This model is suitable for research in figurative language detection, computational linguistics, and Catalan NLP applications.
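The class-weighted cross-entropy described above can be sketched as follows. This is a minimal illustration, not the training code: the weight values and the dummy batch are made up, and only the mechanism (up-weighting rare metaphor labels, skipping padded positions) reflects what the card describes.

```python
import torch
import torch.nn as nn

# Hypothetical class weights that up-weight the rare metaphor classes
# (label order assumed: 0 = O, 1 = B-METAPHOR, 2 = I-METAPHOR).
class_weights = torch.tensor([1.0, 10.0, 10.0])

# ignore_index=-100 skips padding / special-token positions, matching the
# usual Hugging Face token-classification convention.
weighted_loss = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)
plain_loss = nn.CrossEntropyLoss(ignore_index=-100)

# Dummy batch: 4 token positions, 3 labels, last position ignored.
logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 1.5, 0.3],
                       [0.1, 0.2, 1.8],
                       [1.0, 1.0, 1.0]])
labels = torch.tensor([0, 1, 2, -100])

# Errors on metaphor tokens now cost more than errors on literal tokens.
print(weighted_loss(logits, labels).item(), plain_loss(logits, labels).item())
```

In a `Trainer`-based setup this loss would typically replace the default one inside an overridden `compute_loss`.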
Intended uses & limitations
Intended uses:
- Detecting metaphorical expressions in Catalan text.
- Supporting linguistic research on figurative language.
- Assisting annotation workflows for metaphor datasets.
- Integrating metaphor detection into Catalan NLP pipelines.
Limitations:
- The dataset used for training is relatively small and domain-limited.
- The model may not generalize well to:
  - highly informal language
  - social media text
  - poetry or highly creative figurative language
- Predictions are performed at the token level, so additional processing may be required to reconstruct full metaphor spans.
- Metaphor detection is inherently subjective, and annotation inconsistencies may affect predictions.
This model should be used as a support tool rather than a definitive metaphor detection system.
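As a sketch of the post-processing mentioned in the limitations, token-level BIO tags can be merged back into metaphor spans. The function name and output format (token-index spans, end-exclusive) are illustrative choices, not part of this model's API:

```python
def bio_to_spans(tags):
    """Merge B-/I-METAPHOR tags into (start, end) token-index spans, end-exclusive."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B-METAPHOR":
            if start is not None:      # close a span running into a new B- tag
                spans.append((start, i))
            start = i
        elif tag == "I-METAPHOR":
            if start is None:          # tolerate a stray I- tag (annotation noise)
                start = i
        else:                          # "O"
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:              # span running to the end of the sentence
        spans.append((start, len(tags)))
    return spans

tags = ["O", "B-METAPHOR", "I-METAPHOR", "O", "B-METAPHOR", "O"]
print(bio_to_spans(tags))  # → [(1, 3), (4, 5)]
```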
Training and evaluation data
Training dataset:
metaphor-catalan
The dataset contains Catalan sentences annotated for metaphorical language at the token level.
Example dataset structure:
- tokens: tokenized sentence
- tags: BIO labels identifying metaphor spans
Label set used during training:
- O
- B-METAPHOR
- I-METAPHOR
The dataset is highly imbalanced, with many more literal tokens than metaphor tokens.
To address this imbalance, class weights were applied during training.
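One common way to derive such class weights is inverse frequency. The sketch below uses made-up tag counts, since the card does not list the actual dataset statistics:

```python
from collections import Counter

# Hypothetical tag counts; the real dataset statistics are not published here.
counts = Counter({"O": 9000, "B-METAPHOR": 600, "I-METAPHOR": 400})

total = sum(counts.values())
num_classes = len(counts)

# Inverse-frequency weighting: rarer classes get proportionally larger weights.
weights = {tag: total / (num_classes * n) for tag, n in counts.items()}
print(weights)
```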
Training procedure
Hyperparameters
- Learning rate: 3e-5
- Train batch size: 4
- Evaluation batch size: 4
- Gradient accumulation steps: 2
- Weight decay: 0.01
- Warmup steps: 50
- Epochs: 15
- LR scheduler: linear
- Optimizer: AdamW
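The hyperparameters above map onto a Hugging Face `TrainingArguments` configuration roughly as follows. This is a sketch: the output path is a placeholder, and any settings not listed in the card are left at their defaults.

```python
from transformers import TrainingArguments

# Configuration implied by the hyperparameter list above (sketch only).
args = TrainingArguments(
    output_dir="metaphor-cat-roberta-large-weights",  # placeholder path
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    weight_decay=0.01,
    warmup_steps=50,
    num_train_epochs=15,
    lr_scheduler_type="linear",
    optim="adamw_torch",
)
```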
Framework versions
- Transformers: 4.57.3
- PyTorch: 2.9.0
- Datasets: 4.0.0
- Tokenizers: 0.22.1