---
license: mit
language:
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
tags:
- labormarket
- skills
---
# BERT-OJA-SkillLess - Fine-tuned for Online Job Advertisements (OJAs)

## Model Overview
**BERT-OJA-SkillLess** is a fine-tuned BERT-based model designed for the **information filtering task** of identifying sentences related to **skills** in **Online Job Advertisements (OJAs)**. The model automates the extraction of relevant information, reducing noise and processing complexity in scraped job advertisements by classifying each sentence as skill-relevant or not.
In other words, given an unstructured sentence, the model returns whether it is skill-relevant (class 1) or not (class 0).

---

## Background
Online Job Advertisements (OJAs) often include extraneous elements, such as web page descriptions, layout strings, or menu options, introduced during the scraping process. This noise necessitates a **cleaning step**, which we treat as an **information filtering task**.

Given an OJA represented as a set of \(n\) sentences:

OJA = {f_1, f_2, ..., f_n}

the filtering step produces a **filtered set of \(m\) sentences** (\(m < n\)) that are skill-relevant:

FilteredOJA = {c_1, c_2, ..., c_m}

This model uses a fine-tuned BERT to accomplish this filtering, improving efficiency in downstream skill extraction tasks.
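The filtering step amounts to keeping only the sentences a classifier marks as skill-relevant. A minimal sketch of the idea, using a hypothetical keyword predicate as a stand-in for the fine-tuned BERT classifier:

```python
# Sketch of the filtering step: OJA = {f_1, ..., f_n} -> FilteredOJA = {c_1, ..., c_m}.
# is_skill_relevant() below is a keyword-based stand-in, NOT the actual model;
# real usage would call the BERT classifier shown later in this card.
def is_skill_relevant(sentence: str) -> bool:
    keywords = ("proficiency", "experience with", "knowledge of", "skills")
    return any(k in sentence.lower() for k in keywords)

def filter_oja(sentences: list[str]) -> list[str]:
    # Keep only the skill-relevant sentences (m <= n of them).
    return [s for s in sentences if is_skill_relevant(s)]

oja = [
    "Cookie settings | Accept all",        # scraping noise
    "We require proficiency in Python.",   # skill-relevant
    "Apply via our careers portal.",       # not skill-relevant
]
print(filter_oja(oja))  # → ['We require proficiency in Python.']
```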

---

## Training Process

The model was fine-tuned in two stages:

### Stage 1: Initial Fine-Tuning
1. **Dataset:**
   The ESCO taxonomy was used to construct a dataset of ~25,000 sentences, comprising a balanced distribution of:
   - **Skill-related sentences** (class 1)
   - **Occupation-related sentences** (class 0)

   ESCO was chosen because its skill descriptions closely resemble the contexts in which skills appear in OJAs. By training BERT on these descriptions, the model learns to differentiate between skills and occupations based on contextual clues.

2. **Training Details:**
   - **Training Dataset:** 80% of rows
   - **Validation Dataset:** 20% of rows
   - **Loss Function:** Cross-entropy
   - **Batch Size:** 16
   - **Epochs:** 4

3. **Results:**
   - **Training Loss:** 0.0211
   - **Precision:** 89%
   - **Recall:** 94%

4. **Evaluation:**
   On a manually labeled dataset of 400 OJAs (split into sentences):
   - **Precision:** 40%
   - **Recall:** 81%
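The precision and recall figures above are sentence-level metrics for the skill-relevant class. A minimal sketch of how they are computed (the labels below are made-up illustrative values, not the evaluation data):

```python
# Sentence-level precision/recall for the skill-relevant class (class 1).
def precision_recall(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative labels only: tp=2, fp=1, fn=1 -> precision = recall = 2/3.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1]
print(precision_recall(y_true, y_pred))
```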

---

### Stage 2: Second Fine-Tuning
1. **Dataset:**
   To improve recall and precision, we manually labeled **300 OJAs** (split into sentences). Sentences were annotated as:
   - **Skill-relevant (class 1)**
   - **Non-skill-relevant (class 0)**

   To emphasize skill-related sentences, a **cost matrix** was introduced, doubling the weight for class 1.

2. **Training Details:**
   - **Training Dataset:** 75% of manually labeled OJAs
   - **Validation Dataset:** 25% of manually labeled OJAs
   - **Batch Size:** 16
   - **Epochs:** 4

3. **Results:**
   - **Precision:** 71%
   - **Recall:** 93%

4. **Final Evaluation:**
   Evaluated on the remaining 100 manually labeled OJAs, the model demonstrated significant improvements in identifying skill-relevant sentences.
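The cost matrix described above amounts to a class-weighted cross-entropy loss, with class 1 weighted twice as heavily as class 0 (in PyTorch this corresponds to `CrossEntropyLoss(weight=...)`). A minimal pure-Python sketch of the idea:

```python
import math

# Class-weighted cross-entropy: class 1 (skill-relevant) counts double,
# mirroring the cost matrix used in the second fine-tuning stage.
CLASS_WEIGHTS = [1.0, 2.0]  # [class 0, class 1]

def softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_ce(logits: list[float], label: int) -> float:
    probs = softmax(logits)
    return -CLASS_WEIGHTS[label] * math.log(probs[label])

# With symmetric logits, missing a skill-relevant sentence (label 1) is
# penalized exactly twice as hard as missing a non-relevant one (label 0).
print(weighted_ce([2.0, 0.0], label=1))
print(weighted_ce([0.0, 2.0], label=0))
```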

---

## Model Usage

This model is ideal for organizations and researchers working on **labour market analysis**, **skill extraction**, or similar NLP tasks requiring fine-grained sentence filtering. By processing OJAs to identify skill-relevant sentences, downstream tasks like taxonomy mapping or skill prediction can be performed with higher precision and reduced noise.
**Note: the model classifies individual sentences, so you must split your input text into sentences before using it.**
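A minimal sketch of such a split, using a naive regex (a dedicated sentence splitter such as NLTK's `sent_tokenize` or spaCy would be more robust for real job advertisements):

```python
import re

# Naive sentence splitter: break on ., !, or ? followed by whitespace.
def split_sentences(text: str) -> list[str]:
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

oja_text = "We are hiring! The role requires knowledge of SQL. Apply today."
for sentence in split_sentences(oja_text):
    print(sentence)
```

Each resulting sentence can then be classified individually as shown in the example below.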
93
+
94
+ ### How to Use the Model
95
+
96
+ You can load the model using the Hugging Face Transformers library as follows:
97
+
98
+ ```python
99
+ from transformers import BertForSequenceClassification, BertTokenizer
100
+
101
+ # Load the model and tokenizer
102
+ model_name = "serino28/BERT-OJA-SkillLess"
103
+ model = BertForSequenceClassification.from_pretrained(model_name)
104
+ tokenizer = BertTokenizer.from_pretrained(model_name)
105
+
106
+ # Example input: a single sentence
107
+ sentence = "This job requires proficiency in Python programming."
108
+ inputs = tokenizer(sentence, return_tensors="pt")
109
+
110
+ # Get predictions
111
+ outputs = model(**inputs)
112
+ logits = outputs.logits
113
+ predicted_class = logits.argmax().item()
114
+
115
+ # Class 1 = Skill-relevant, Class 0 = Non-skill-relevant
116
+ print(f"Predicted Class: {predicted_class}")