---
license: mit
language:
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
tags:
- labormarket
- skills
---
# BERT-OJA-SkillLess - Fine-tuned for Online Job Advertisements (OJAs)

## Model Overview
**BERT-OJA-SkillLess** is a fine-tuned BERT-based model designed for the **information filtering task** of identifying sentences related to **skills** in **Online Job Advertisements (OJAs)**. The model automates the extraction of relevant information, reducing noise and processing complexity in scraped job advertisements by classifying each sentence as skill-relevant or not.
In other words, given an unstructured sentence, the model returns whether it is skill-relevant (class 1) or not (class 0).

---

## Background
Online Job Advertisements (OJAs) often include extraneous elements, such as web page descriptions, layout strings, or menu options, introduced during the scraping process. This noise necessitates a **cleaning step**, which we treat as an **information filtering task**.

Given an OJA represented as a set of \(n\) sentences:

OJA = {f_1, f_2, ..., f_n}

the filtering step produces a **filtered set of \(m\) sentences** (\(m < n\)) that are skill-relevant:

FilteredOJA = {c_1, c_2, ..., c_m}

This model uses a fine-tuned BERT to accomplish this filtering, improving efficiency in downstream skill extraction tasks.
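The filtering step amounts to keeping only the sentences a classifier marks as skill-relevant. A minimal sketch of the idea, using a hypothetical keyword predicate as a stand-in for the fine-tuned BERT classifier:

```python
# Sketch of the filtering step: OJA = {f_1, ..., f_n} -> FilteredOJA = {c_1, ..., c_m}.
# is_skill_relevant() below is a keyword-based stand-in, NOT the actual model;
# real usage would call the BERT classifier shown later in this card.
def is_skill_relevant(sentence: str) -> bool:
    keywords = ("proficiency", "experience with", "knowledge of", "skills")
    return any(k in sentence.lower() for k in keywords)

def filter_oja(sentences: list[str]) -> list[str]:
    # Keep only the skill-relevant sentences (m <= n of them).
    return [s for s in sentences if is_skill_relevant(s)]

oja = [
    "Cookie settings | Accept all",        # scraping noise
    "We require proficiency in Python.",   # skill-relevant
    "Apply via our careers portal.",       # not skill-relevant
]
print(filter_oja(oja))  # → ['We require proficiency in Python.']
```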

---

## Training Process

The model was fine-tuned in two stages:

### Stage 1: Initial Fine-Tuning
1. **Dataset:**
   The ESCO taxonomy was used to construct a dataset of ~25,000 sentences, comprising a balanced distribution of:
   - **Skill-related sentences** (class 1)
   - **Occupation-related sentences** (class 0)

   ESCO was chosen because its skill descriptions closely resemble the contexts in which skills appear in OJAs. By training BERT on these descriptions, the model learns to differentiate between skills and occupations based on contextual clues.

2. **Training Details:**
   - **Training Dataset:** 80% of rows
   - **Validation Dataset:** 20% of rows
   - **Loss Function:** Cross-entropy
   - **Batch Size:** 16
   - **Epochs:** 4

3. **Results:**
   - **Training Loss:** 0.0211
   - **Precision:** 89%
   - **Recall:** 94%

4. **Evaluation:**
   On a manually labeled dataset of 400 OJAs (split into sentences):
   - **Precision:** 40%
   - **Recall:** 81%
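The precision and recall figures above are sentence-level metrics for the skill-relevant class. A minimal sketch of how they are computed (the labels below are made-up illustrative values, not the evaluation data):

```python
# Sentence-level precision/recall for the skill-relevant class (class 1).
def precision_recall(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative labels only: tp=2, fp=1, fn=1 -> precision = recall = 2/3.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1]
print(precision_recall(y_true, y_pred))
```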

---

### Stage 2: Second Fine-Tuning
1. **Dataset:**
   To improve recall and precision, we manually labeled **300 OJAs** (split into sentences). Sentences were annotated as:
   - **Skill-relevant (class 1)**
   - **Non-skill-relevant (class 0)**

   To emphasize skill-related sentences, a **cost matrix** was introduced, doubling the weight for class 1.

2. **Training Details:**
   - **Training Dataset:** 75% of manually labeled OJAs
   - **Validation Dataset:** 25% of manually labeled OJAs
   - **Batch Size:** 16
   - **Epochs:** 4

3. **Results:**
   - **Precision:** 71%
   - **Recall:** 93%

4. **Final Evaluation:**
   Evaluated on the remaining 100 manually labeled OJAs, the model demonstrated significant improvements in identifying skill-relevant sentences.
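The cost matrix described above amounts to a class-weighted cross-entropy loss, with class 1 weighted twice as heavily as class 0 (in PyTorch this corresponds to `CrossEntropyLoss(weight=...)`). A minimal pure-Python sketch of the idea:

```python
import math

# Class-weighted cross-entropy: class 1 (skill-relevant) counts double,
# mirroring the cost matrix used in the second fine-tuning stage.
CLASS_WEIGHTS = [1.0, 2.0]  # [class 0, class 1]

def softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_ce(logits: list[float], label: int) -> float:
    probs = softmax(logits)
    return -CLASS_WEIGHTS[label] * math.log(probs[label])

# With symmetric logits, missing a skill-relevant sentence (label 1) is
# penalized exactly twice as hard as missing a non-relevant one (label 0).
print(weighted_ce([2.0, 0.0], label=1))
print(weighted_ce([0.0, 2.0], label=0))
```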

---

## Model Usage

This model is ideal for organizations and researchers working on **labour market analysis**, **skill extraction**, or similar NLP tasks requiring fine-grained sentence filtering. By processing OJAs to identify skill-relevant sentences, downstream tasks like taxonomy mapping or skill prediction can be performed with higher precision and reduced noise.
**Note: the model classifies individual sentences, so you must split your input text into sentences before using it.**
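A minimal sketch of such a split, using a naive regex (a dedicated sentence splitter such as NLTK's `sent_tokenize` or spaCy would be more robust for real job advertisements):

```python
import re

# Naive sentence splitter: break on ., !, or ? followed by whitespace.
def split_sentences(text: str) -> list[str]:
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

oja_text = "We are hiring! The role requires knowledge of SQL. Apply today."
for sentence in split_sentences(oja_text):
    print(sentence)
```

Each resulting sentence can then be classified individually as shown in the example below.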
93
+
94
+ ### How to Use the Model
95
+
96
+ You can load the model using the Hugging Face Transformers library as follows:
97
+
98
+ ```python
99
+ from transformers import BertForSequenceClassification, BertTokenizer
100
+
101
+ # Load the model and tokenizer
102
+ model_name = "serino28/BERT-OJA-SkillLess"
103
+ model = BertForSequenceClassification.from_pretrained(model_name)
104
+ tokenizer = BertTokenizer.from_pretrained(model_name)
105
+
106
+ # Example input: a single sentence
107
+ sentence = "This job requires proficiency in Python programming."
108
+ inputs = tokenizer(sentence, return_tensors="pt")
109
+
110
+ # Get predictions
111
+ outputs = model(**inputs)
112
+ logits = outputs.logits
113
+ predicted_class = logits.argmax().item()
114
+
115
+ # Class 1 = Skill-relevant, Class 0 = Non-skill-relevant
116
+ print(f"Predicted Class: {predicted_class}")