ilyankou committed on
Commit 4729d8e · verified · 1 Parent(s): c9bfb81

Update README.md

Files changed (1):
  1. README.md +45 -190
README.md CHANGED
@@ -3,219 +3,74 @@ tags:
  - setfit
  - sentence-transformers
  - text-classification
- - generated_from_setfit_trainer
  widget:
- - text: what is the closest bus station to pittsburgh international airport
- - text: how to register customary land in png
- - text: dissolved definition
- - text: color Nucleic acid
- - text: how long should i water lawn for
  metrics:
  - accuracy
  pipeline_tag: text-classification
  library_name: setfit
  inference: true
  base_model: BAAI/bge-small-en-v1.5
  ---
 
- # SetFit with BAAI/bge-small-en-v1.5
 
- This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.
 
- The model has been trained using an efficient few-shot learning technique that involves:
 
- 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
- 2. Training a classification head with features from the fine-tuned Sentence Transformer.
 
- ## Model Details
 
- ### Model Description
- - **Model Type:** SetFit
- - **Sentence Transformer body:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
- - **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
- - **Maximum Sequence Length:** 512 tokens
- - **Number of Classes:** 2 classes
- <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->
-
- ### Model Sources
-
- - **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
- - **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
- - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
-
- ### Model Labels
- | Label | Examples |
- |:------|:---------|
- | 1 | <ul><li>'what city is hurricane harbor located in texas'</li><li>'where is the hills located in lynchburg va'</li><li>'weather hurricane mexico'</li></ul> |
- | 0 | <ul><li>'how early can you register to vote'</li><li>'us postal services track'</li><li>'how much does it cost to run a iron'</li></ul> |
 
- ## Uses
 
- ### Direct Use for Inference
 
- First install the SetFit library:
 
- ```bash
- pip install setfit
- ```
 
- Then you can load this model and run inference.
 
  ```python
  from setfit import SetFitModel
 
- # Download from the 🤗 Hub
- model = SetFitModel.from_pretrained("setfit_model_id")
- # Run inference
- preds = model("color Nucleic acid")
  ```
 
- <!--
- ### Downstream Use
-
- *List how someone could finetune this model on their own dataset.*
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Training Set Metrics
- | Training set | Min | Median | Max |
- |:-------------|:----|:-------|:----|
- | Word count | 2 | 6.3658 | 17 |
-
- | Label | Training Sample Count |
- |:------|:----------------------|
- | 0 | 632 |
- | 1 | 568 |
-
- ### Training Hyperparameters
- - batch_size: (64, 64)
- - num_epochs: (3, 3)
- - max_steps: -1
- - sampling_strategy: oversampling
- - num_iterations: 20
- - body_learning_rate: (2e-05, 2e-05)
- - head_learning_rate: 0.01
- - loss: CosineSimilarityLoss
- - distance_metric: cosine_distance
- - margin: 0.25
- - end_to_end: False
- - use_amp: False
- - warmup_proportion: 0.1
- - l2_weight: 0.01
- - seed: 42
- - eval_max_steps: -1
- - load_best_model_at_end: False
-
- ### Training Results
- | Epoch | Step | Training Loss | Validation Loss |
- |:------:|:----:|:-------------:|:---------------:|
- | 0.0013 | 1 | 0.2374 | - |
- | 0.0667 | 50 | 0.2383 | - |
- | 0.1333 | 100 | 0.2151 | - |
- | 0.2 | 150 | 0.0982 | - |
- | 0.2667 | 200 | 0.034 | - |
- | 0.3333 | 250 | 0.0104 | - |
- | 0.4 | 300 | 0.0056 | - |
- | 0.4667 | 350 | 0.0035 | - |
- | 0.5333 | 400 | 0.002 | - |
- | 0.6 | 450 | 0.0013 | - |
- | 0.6667 | 500 | 0.0014 | - |
- | 0.7333 | 550 | 0.0013 | - |
- | 0.8 | 600 | 0.0009 | - |
- | 0.8667 | 650 | 0.0011 | - |
- | 0.9333 | 700 | 0.0008 | - |
- | 1.0 | 750 | 0.0009 | - |
- | 1.0667 | 800 | 0.0008 | - |
- | 1.1333 | 850 | 0.0008 | - |
- | 1.2 | 900 | 0.0007 | - |
- | 1.2667 | 950 | 0.0006 | - |
- | 1.3333 | 1000 | 0.0009 | - |
- | 1.4 | 1050 | 0.0006 | - |
- | 1.4667 | 1100 | 0.0005 | - |
- | 1.5333 | 1150 | 0.0005 | - |
- | 1.6 | 1200 | 0.0005 | - |
- | 1.6667 | 1250 | 0.0005 | - |
- | 1.7333 | 1300 | 0.0005 | - |
- | 1.8 | 1350 | 0.0005 | - |
- | 1.8667 | 1400 | 0.0005 | - |
- | 1.9333 | 1450 | 0.0005 | - |
- | 2.0 | 1500 | 0.0005 | - |
- | 2.0667 | 1550 | 0.0005 | - |
- | 2.1333 | 1600 | 0.0005 | - |
- | 2.2 | 1650 | 0.0004 | - |
- | 2.2667 | 1700 | 0.0004 | - |
- | 2.3333 | 1750 | 0.0004 | - |
- | 2.4 | 1800 | 0.0004 | - |
- | 2.4667 | 1850 | 0.0004 | - |
- | 2.5333 | 1900 | 0.0004 | - |
- | 2.6 | 1950 | 0.0004 | - |
- | 2.6667 | 2000 | 0.0004 | - |
- | 2.7333 | 2050 | 0.0004 | - |
- | 2.8 | 2100 | 0.0004 | - |
- | 2.8667 | 2150 | 0.0004 | - |
- | 2.9333 | 2200 | 0.0004 | - |
- | 3.0 | 2250 | 0.0004 | - |
-
- ### Framework Versions
- - Python: 3.11.14
- - SetFit: 1.1.3
- - Sentence Transformers: 4.0.2
- - Transformers: 4.55.2
- - PyTorch: 2.8.0
- - Datasets: 2.15.0
- - Tokenizers: 0.21.1
-
- ## Citation
-
- ### BibTeX
- ```bibtex
- @article{https://doi.org/10.48550/arxiv.2209.11055,
- doi = {10.48550/ARXIV.2209.11055},
- url = {https://arxiv.org/abs/2209.11055},
- author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
- keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
- title = {Efficient Few-Shot Learning Without Prompts},
- publisher = {arXiv},
- year = {2022},
- copyright = {Creative Commons Attribution 4.0 International}
- }
- ```
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
-
- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
 
  - setfit
  - sentence-transformers
  - text-classification
+ - geospatial
+ - spatial-queries
  widget:
+ - text: hotel in geneva airport
+ - text: what payroll deduction is mpp
+ - text: weather in erlanger ky
+ - text: what is the coordinates of point p
+ - text: what's the weather in roseburg
  metrics:
  - accuracy
+ - f1
  pipeline_tag: text-classification
  library_name: setfit
  inference: true
  base_model: BAAI/bge-small-en-v1.5
+ license: mit
  ---
 
+ # Geospatial (Web Search) Query Detector
 
+ A binary [SetFit](https://github.com/huggingface/setfit) classifier that distinguishes geospatial
+ from non-geospatial web search queries. It was trained on 1,200 [MS MARCO](https://microsoft.github.io/msmarco/)
+ web search queries that were weakly labelled by Llama 3.1 and then manually verified into gold labels.
 
+ The classifier achieves **F1 = 0.931** on a held-out test set of 800 samples (421 non-spatial, 379 spatial).
+ For this evaluation the model was trained on 200 samples (105 non-spatial, 95 spatial); the deployed model was trained on the full 1,200.
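The reported F1 can be reproduced with scikit-learn once test-set predictions are in hand. A minimal sketch; the arrays below are illustrative placeholders, not the real 800-sample test set:

```python
from sklearn.metrics import f1_score

# Placeholder labels and predictions; the real test set has
# 421 non-spatial (0) and 379 spatial (1) queries.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

# Binary F1 on the positive (spatial) class, as reported on this card
f1 = f1_score(y_true, y_pred)
print(round(f1, 3))  # → 0.857 on this toy data
```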
 
+ ## What counts as a geospatial query?
 
+ Following [Mai et al. (2021)](https://agile-giss.copernicus.org/articles/2/8/2021/) and
+ [Kefalidis et al. (2024)](https://www.sciencedirect.com/science/article/pii/S1569843224005594),
+ a query is geospatial if answering it requires qualitative or quantitative geographic
+ knowledge of Earth-bound features.
 
+ This is usually the case if the query involves:
+ - A geographic entity (a named place on Earth: a city, country, river, POI, or address)
+ - A geographic concept (a place type: city, lake, mountain, park, building)
+ - A spatial relation (near, within, north of, between, borders, crosses, distance)
 
+ Non-geospatial queries include anatomical, microscopic, astronomical, fictional, or abstract
+ 'where' questions, and any query that needs no geographic knowledge to answer.
 
+ ## Model details
 
+ - **Sentence Transformer body:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
+ - **Classification head:** [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
+ - **Training data:** 1,200 gold-labelled MS MARCO queries (632 non-spatial, 568 spatial), sampled at K-means centroids over the embedding space of all 1M+ queries for representativeness
+ - **Labels:** `1` = geospatial, `0` = non-geospatial
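The centroid-based sampling could look roughly like this. A sketch assuming scikit-learn; `sample_representative` and the synthetic embeddings are illustrative stand-ins, not the card's actual pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def sample_representative(queries, embeddings, n_samples, seed=42):
    """Pick the query nearest each of n_samples K-means centroids,
    so the sample spreads across the whole embedding space."""
    km = KMeans(n_clusters=n_samples, random_state=seed, n_init=10)
    km.fit(embeddings)
    # Index of the embedding closest to each centroid
    idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, embeddings)
    return [queries[i] for i in idx]

# Toy stand-in for real query embeddings; bge-small-en-v1.5
# produces 384-dimensional vectors.
rng = np.random.default_rng(0)
queries = [f"query {i}" for i in range(500)]
embeddings = rng.normal(size=(500, 384))
sample = sample_representative(queries, embeddings, n_samples=12)
```

Picking the real query nearest each centroid (rather than the centroid itself) guarantees every sampled item is an actual query from the corpus.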
 
+ ## Usage
 
  ```python
  from setfit import SetFitModel
 
+ model = SetFitModel.from_pretrained("ilyankou/is-geospatial-query")
+ preds = model([
+     "nearest hospital",
+     "far from the truth",
+     "close to my heart",
+     "flood risk in this area",
+ ])
+ # => [1, 0, 0, 1]
  ```
 
+ ## Training
 
+ Weak labels were generated by running Llama 3.1 five times per query at temperature 0.3,
+ then manually verified. For validation, a model was trained for 3 epochs with batch size 64
+ and learning rate 2e-5 on 200 samples (95 positive, 105 negative); the production model was
+ then retrained on the full gold dataset of 1,200 samples.
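Running the LLM five times per query implies aggregating its votes before manual review. A plausible majority-vote sketch; `ask_llm` is a hypothetical stand-in for the actual Llama 3.1 call, not part of this repo:

```python
from collections import Counter

def weak_label(query, ask_llm, n_runs=5):
    """Query the LLM n_runs times, keep the majority-vote label,
    and flag any disagreement for manual verification."""
    votes = [ask_llm(query) for _ in range(n_runs)]
    label, count = Counter(votes).most_common(1)[0]
    needs_review = count < n_runs  # any disagreement -> human check
    return label, needs_review

# Deterministic stub standing in for Llama 3.1 at temperature 0.3
def fake_llm(query):
    return 1 if "near" in query else 0

label, review = weak_label("nearest hospital", fake_llm)
# => label == 1, review == False
```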