MorenoLaQuatra committed on
Commit · 37316da
1 Parent(s): 3979eaa

Adding model and first version of the README

Browse files:
- README.md +43 -0
- config.json +36 -0
- pytorch_model.bin +3 -0
- tokenizer_config.json +1 -0
- training_args.bin +3 -0
- vocab.txt +0 -0
README.md CHANGED
@@ -1,3 +1,46 @@
---
license: cc-by-nc-sa-4.0
---

# Inclusively Classification Model

This is an Italian text-classification model, fine-tuned from the [Italian BERT model](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased) to detect inclusive language in Italian.

It has been trained to distinguish three classes (a usage sketch follows the list):

- `inclusive`: the sentence is inclusive (e.g. "Il personale docente e non docente", "The teaching and non-teaching staff")
- `not_inclusive`: the sentence is not inclusive (e.g. "I professori", the masculine-only "The professors")
- `not_pertinent`: the sentence is not pertinent to the task (e.g. "La scuola è chiusa", "The school is closed")
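
A minimal usage sketch with the `transformers` pipeline API. The model path assumes a local clone of this repository; the repository's Hub identifier works the same way.

```python
# Classify a sentence with the fine-tuned checkpoint.
from transformers import pipeline

classifier = pipeline("text-classification", model=".")  # "." = local clone of this repo

print(classifier("Il personale docente e non docente"))
# Expected shape: [{'label': 'inclusive', 'score': ...}]
```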

## Training data

The model has been trained on a dataset containing:

- 8,580 training sentences
- 1,073 validation sentences
- 1,072 test sentences

The collection has been manually annotated by experts in the field of inclusive language (the dataset is not yet publicly available).

## Training procedure

The model has been fine-tuned from the [Italian BERT model](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased) using the following hyperparameters (a training sketch follows the list):

- `max_length`: 128
- `batch_size`: 128
- `learning_rate`: 5e-5
- `warmup_steps`: 500
- `epochs`: 10 (the best checkpoint is selected by validation accuracy)
- `optimizer`: AdamW
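
A minimal fine-tuning sketch reproducing these hyperparameters with the `transformers` Trainer (AdamW is its default optimizer). The two-sentence corpus below is a dummy stand-in, since the annotated dataset is not public; the actual training script is not part of this commit.

```python
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

BASE = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=3)

def tokenize(batch):
    # Truncate/pad to the max_length used during training.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

# Dummy data: the real annotated corpus is not publicly available.
data = Dataset.from_dict({
    "text": ["Il personale docente e non docente", "I professori"],
    "label": [0, 1],  # 0 = inclusive, 1 = not_inclusive (see config.json)
}).map(tokenize, batched=True)

def compute_accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

args = TrainingArguments(
    output_dir="inclusively-classification",
    per_device_train_batch_size=128,
    learning_rate=5e-5,
    warmup_steps=500,
    num_train_epochs=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # keep the checkpoint with the
    metric_for_best_model="accuracy",   # best validation accuracy
)

trainer = Trainer(model=model, args=args, train_dataset=data,
                  eval_dataset=data, compute_metrics=compute_accuracy)
trainer.train()
```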

## Evaluation results

The model has been evaluated on the test set and obtained the following results:

| Model        | Accuracy | Inclusive F1 | Not inclusive F1 | Not pertinent F1 |
|--------------|----------|--------------|------------------|------------------|
| TF-IDF + MLP | 0.68     | 0.63         | 0.69             | 0.66             |
| TF-IDF + SVM | 0.61     | 0.53         | 0.60             | 0.78             |
| TF-IDF + GB  | 0.74     | 0.74         | 0.76             | 0.72             |
| multilingual | 0.86     | 0.88         | 0.89             | 0.83             |
| **This**     | 0.89     | 0.88         | 0.92             | 0.85             |

The model outperforms the TF-IDF baselines on every metric and improves over a multilingual model trained on the same data (0.89 vs. 0.86 accuracy).
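
For reference, a sketch of how the accuracy and per-class F1 scores above can be computed with scikit-learn; `y_true` and `y_pred` are dummy stand-ins for the gold and predicted class indices on the test set.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 0, 1]  # dummy gold labels
y_pred = [0, 1, 2, 0, 2]  # dummy predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
# One F1 score per class: inclusive, not_inclusive, not_pertinent
print("Per-class F1:", f1_score(y_true, y_pred, average=None, labels=[0, 1, 2]))
```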

config.json ADDED
@@ -0,0 +1,36 @@
{
  "_name_or_path": "dbmdz/bert-base-italian-xxl-cased",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "inclusive",
    "1": "not_inclusive",
    "2": "not_pertinent"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "inclusive": 0,
    "not_inclusive": 1,
    "not_pertinent": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.17.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 32102
}
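
The `id2label` / `label2id` maps above are what `transformers` uses to name the predicted classes; a quick check, assuming a local clone of this repository:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(".")  # "." = local clone of this repo
print(config.id2label)  # {0: 'inclusive', 1: 'not_inclusive', 2: 'not_pertinent'}
```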

pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f11f5dfedbc3c7bbb7ca7e3eea0fa23fa32cde1b349449bcbe683cf3dd606874
size 442876973

tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"do_lower_case": false, "max_len": 512, "init_inputs": []}
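
With `do_lower_case` set to `false` the tokenizer is cased, matching the base model; a quick sanity check (the vocabulary is presumably the base model's, re-shipped here as vocab.txt):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
print(tok.tokenize("Il Personale Docente"))  # casing is preserved
```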

training_args.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bc383b5651db0bb264bad9740e8b854a6208441cb65120005c8af42af3496518
size 2991

vocab.txt ADDED
The diff for this file is too large to render; see the raw file.