Text Classification
setfit
Safetensors
sentence-transformers
bert
geospatial
spatial-queries
text-embeddings-inference
Instructions to use ilyankou/is-geospatial-query with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- setfit
How to use ilyankou/is-geospatial-query with setfit:
```python
from setfit import SetFitModel

model = SetFitModel.from_pretrained("ilyankou/is-geospatial-query")
```

- sentence-transformers
How to use ilyankou/is-geospatial-query with sentence-transformers:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ilyankou/is-geospatial-query")
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # [3, 3]
```

- Notebooks
- Google Colab
- Kaggle
Update README.md
README.md CHANGED
````diff
@@ -3,219 +3,74 @@ tags:
 - setfit
 - sentence-transformers
 - text-classification
-
 widget:
-- text:
-- text:
-- text:
-- text:
-- text:
 metrics:
 - accuracy
 pipeline_tag: text-classification
 library_name: setfit
 inference: true
 base_model: BAAI/bge-small-en-v1.5
 ---
 
-#
 
-
 
-1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
-2. Training a classification head with features from the fine-tuned Sentence Transformer.
 
-##
 
-
-
-
-- **Maximum Sequence Length:** 512 tokens
-- **Number of Classes:** 2 classes
-<!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
-<!-- - **Language:** Unknown -->
-<!-- - **License:** Unknown -->
-
-### Model Sources
-
-- **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
-- **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
-- **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
-
-### Model Labels
-| Label | Examples |
-|:------|:---------|
-| 1     | <ul><li>'what city is hurricane harbor located in texas'</li><li>'where is the hills located in lynchburg va'</li><li>'weather hurricane mexico'</li></ul> |
-| 0     | <ul><li>'how early can you register to vote'</li><li>'us postal services track'</li><li>'how much does it cost to run a iron'</li></ul> |
 
-
 
-
-
 
-
 
 ```python
 from setfit import SetFitModel
 
-
-
-
-
 ```
 
-
-### Downstream Use
-
-*List how someone could finetune this model on their own dataset.*
--->
-
-<!--
-### Out-of-Scope Use
-
-*List how the model may foreseeably be misused and address what users ought not to do with the model.*
--->
-
-<!--
-## Bias, Risks and Limitations
-
-*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
--->
-
-<!--
-### Recommendations
-
-*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
--->
-
-## Training Details
-
-### Training Set Metrics
-| Training set | Min | Median | Max |
-|:-------------|:----|:-------|:----|
-| Word count   | 2   | 6.3658 | 17  |
-
-| Label | Training Sample Count |
-|:------|:----------------------|
-| 0     | 632                   |
-| 1     | 568                   |
-
-### Training Hyperparameters
-- batch_size: (64, 64)
-- num_epochs: (3, 3)
-- max_steps: -1
-- sampling_strategy: oversampling
-- num_iterations: 20
-- body_learning_rate: (2e-05, 2e-05)
-- head_learning_rate: 0.01
-- loss: CosineSimilarityLoss
-- distance_metric: cosine_distance
-- margin: 0.25
-- end_to_end: False
-- use_amp: False
-- warmup_proportion: 0.1
-- l2_weight: 0.01
-- seed: 42
-- eval_max_steps: -1
-- load_best_model_at_end: False
-
-### Training Results
-| Epoch  | Step | Training Loss | Validation Loss |
-|:------:|:----:|:-------------:|:---------------:|
-| 0.0013 | 1    | 0.2374        | -               |
-| 0.0667 | 50   | 0.2383        | -               |
-| 0.1333 | 100  | 0.2151        | -               |
-| 0.2    | 150  | 0.0982        | -               |
-| 0.2667 | 200  | 0.034         | -               |
-| 0.3333 | 250  | 0.0104        | -               |
-| 0.4    | 300  | 0.0056        | -               |
-| 0.4667 | 350  | 0.0035        | -               |
-| 0.5333 | 400  | 0.002         | -               |
-| 0.6    | 450  | 0.0013        | -               |
-| 0.6667 | 500  | 0.0014        | -               |
-| 0.7333 | 550  | 0.0013        | -               |
-| 0.8    | 600  | 0.0009        | -               |
-| 0.8667 | 650  | 0.0011        | -               |
-| 0.9333 | 700  | 0.0008        | -               |
-| 1.0    | 750  | 0.0009        | -               |
-| 1.0667 | 800  | 0.0008        | -               |
-| 1.1333 | 850  | 0.0008        | -               |
-| 1.2    | 900  | 0.0007        | -               |
-| 1.2667 | 950  | 0.0006        | -               |
-| 1.3333 | 1000 | 0.0009        | -               |
-| 1.4    | 1050 | 0.0006        | -               |
-| 1.4667 | 1100 | 0.0005        | -               |
-| 1.5333 | 1150 | 0.0005        | -               |
-| 1.6    | 1200 | 0.0005        | -               |
-| 1.6667 | 1250 | 0.0005        | -               |
-| 1.7333 | 1300 | 0.0005        | -               |
-| 1.8    | 1350 | 0.0005        | -               |
-| 1.8667 | 1400 | 0.0005        | -               |
-| 1.9333 | 1450 | 0.0005        | -               |
-| 2.0    | 1500 | 0.0005        | -               |
-| 2.0667 | 1550 | 0.0005        | -               |
-| 2.1333 | 1600 | 0.0005        | -               |
-| 2.2    | 1650 | 0.0004        | -               |
-| 2.2667 | 1700 | 0.0004        | -               |
-| 2.3333 | 1750 | 0.0004        | -               |
-| 2.4    | 1800 | 0.0004        | -               |
-| 2.4667 | 1850 | 0.0004        | -               |
-| 2.5333 | 1900 | 0.0004        | -               |
-| 2.6    | 1950 | 0.0004        | -               |
-| 2.6667 | 2000 | 0.0004        | -               |
-| 2.7333 | 2050 | 0.0004        | -               |
-| 2.8    | 2100 | 0.0004        | -               |
-| 2.8667 | 2150 | 0.0004        | -               |
-| 2.9333 | 2200 | 0.0004        | -               |
-| 3.0    | 2250 | 0.0004        | -               |
-
-### Framework Versions
-- Python: 3.11.14
-- SetFit: 1.1.3
-- Sentence Transformers: 4.0.2
-- Transformers: 4.55.2
-- PyTorch: 2.8.0
-- Datasets: 2.15.0
-- Tokenizers: 0.21.1
-
-## Citation
-
-### BibTeX
-```bibtex
-@article{https://doi.org/10.48550/arxiv.2209.11055,
-    doi = {10.48550/ARXIV.2209.11055},
-    url = {https://arxiv.org/abs/2209.11055},
-    author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
-    keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
-    title = {Efficient Few-Shot Learning Without Prompts},
-    publisher = {arXiv},
-    year = {2022},
-    copyright = {Creative Commons Attribution 4.0 International}
-}
-```
-
-<!--
-## Glossary
-
-*Clearly define terms in order to be accessible across audiences.*
--->
-
-<!--
-## Model Card Authors
-
-*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
--->
-
-<!--
-## Model Card Contact
 
-
-
````
---
tags:
- setfit
- sentence-transformers
- text-classification
- geospatial
- spatial-queries
widget:
- text: hotel in geneva airport
- text: what payroll deduction is mpp
- text: weather in erlanger ky
- text: what is the coordinates of point p
- text: what's the weather in roseburg
metrics:
- accuracy
- f1
pipeline_tag: text-classification
library_name: setfit
inference: true
base_model: BAAI/bge-small-en-v1.5
license: mit
---
# Geospatial (Web Search) Query Detector

A binary [SetFit](https://github.com/huggingface/setfit) classifier that distinguishes geospatial from non-geospatial web search queries. Trained on 1,200 gold-labelled [MS MARCO](https://microsoft.github.io/msmarco/) web search queries with weak supervision from Llama 3.1, then manually verified.

Achieves **F1 = 0.931** on a held-out test set of 800 samples (421 non-spatial, 379 spatial), with the evaluation model trained on 200 samples (105 non-spatial, 95 spatial). The deployed model was trained on the full 1,200.
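The reported score is presumably the standard binary F1 on the positive (geospatial) class; a minimal sketch of the arithmetic, using hypothetical confusion counts (not the model's actual ones):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Binary F1: harmonic mean of precision and recall on the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for a 379-positive test set, chosen only to illustrate:
print(round(f1_score(tp=353, fp=26, fn=26), 3))  # 0.931
```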
## What counts as a geospatial query?

As per [Mai et al. (2021)](https://agile-giss.copernicus.org/articles/2/8/2021/) and [Kefalidis et al. (2024)](https://www.sciencedirect.com/science/article/pii/S1569843224005594), a query is geospatial if it requires qualitative or quantitative geographic knowledge of Earth-bound features to be answered.
This is usually the case if the query involves:

- A geographic entity (named place on Earth: city, country, river, POI, address)
- A geographic concept (place type: city, lake, mountain, park, building)
- A spatial relation (near, within, north of, between, borders, crosses, distance)

Non-geospatial: anatomical, microscopic, astronomical, fictional, or abstract 'where' questions; queries needing no geographic knowledge.
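These criteria also show why the task needs more than keyword spotting: relation words like "near" or "close to" occur in non-spatial idioms too. A hypothetical keyword baseline (not part of this model) makes the failure mode concrete:

```python
# Illustrative spatial-relation vocabulary; the set is an assumption, not the model's.
SPATIAL_TERMS = {
    "near", "within", "north of", "between", "borders",
    "crosses", "distance", "close to", "far from",
}

def keyword_baseline(query: str) -> int:
    """Label a query geospatial (1) if it contains any spatial-relation term."""
    q = query.lower()
    return int(any(term in q for term in SPATIAL_TERMS))

print(keyword_baseline("nearest hospital"))        # 1 (correct)
print(keyword_baseline("close to my heart"))       # 1 (wrong: idiom, not spatial)
print(keyword_baseline("weather in erlanger ky"))  # 0 (wrong: spatial, no keyword)
```

A trained classifier over sentence embeddings handles both failure directions, which is what motivates the SetFit approach here.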
## Model details

- **Sentence Transformer body:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
- **Classification head:** LogisticRegression
- **Training data:** 1,200 gold-labelled MS MARCO queries (632 non-spatial, 568 spatial), sampled via K-means centroids across the full embedding space of all 1M+ queries for representativeness
- **Labels:** `1` = geospatial, `0` = non-geospatial
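The centroid-based sampling could be sketched as follows. This is an illustrative reconstruction (the actual sampling code is not shown here), assuming query embeddings in a NumPy array: fit K-means, then keep the query nearest each centroid so the labelled subset covers the embedding space.

```python
import numpy as np

def centroid_sample(embeddings: np.ndarray, k: int, iters: int = 20, seed: int = 42) -> np.ndarray:
    """Pick up to k representative row indices via K-means + nearest-to-centroid."""
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(len(embeddings), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to its cluster mean (keep old centroid if cluster is empty).
        for j in range(k):
            members = embeddings[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    return np.unique(dists.argmin(axis=0))  # closest query to each centroid, deduplicated

# Toy stand-in for query embeddings (bge-small-en-v1.5 produces 384-d vectors):
queries = np.random.default_rng(0).normal(size=(500, 384))
picked = centroid_sample(queries, k=12)
```

In practice one would run this over the real embedding matrix of all queries with k equal to the labelling budget.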
## Usage

```python
from setfit import SetFitModel

model = SetFitModel.from_pretrained("ilyankou/is-geospatial-query")
preds = model([
    "nearest hospital",
    "far from the truth",
    "close to my heart",
    "flood risk in this area",
])
# => [1, 0, 0, 1]
```
## Training

Weak labels were generated by running Llama 3.1 five times per query at temperature 0.3, then manually verified. The SetFit model was trained for 3 epochs with batch size 64 and learning rate 2e-5 on 200 samples (95 positive, 105 negative) for validation, then retrained on the full gold dataset (1,200 samples) for production inference.