---
tags:
- setfit
- sentence-transformers
- text-classification
- geospatial
- spatial-queries
widget:
- text: hotel in geneva airport
- text: what payroll deduction is mpp
- text: weather in erlanger ky
- text: what is the coordinates of point p
- text: what's the weather in roseburg
metrics:
- accuracy
- f1
pipeline_tag: text-classification
library_name: setfit
inference: true
base_model: BAAI/bge-small-en-v1.5
license: mit
---

# Geospatial (Web Search) Query Detector

A binary [SetFit](https://github.com/huggingface/setfit) classifier that distinguishes geospatial
from non-geospatial web search queries. It was trained on 1,200 gold-labelled
[MS MARCO](https://microsoft.github.io/msmarco/) web search queries, weakly labelled by Llama 3.1 and then manually verified. See the COSIT 2026 paper preprint: https://arxiv.org/abs/2605.11336

The model achieves **F1 = 0.931** on a held-out test set of 800 queries (421 non-spatial, 379 spatial).
This score was measured for a model trained on 200 samples (105 non-spatial, 95 spatial); the deployed model was trained on the full 1,200.


## What counts as a geospatial query?

As per [Mai et al. (2021)](https://agile-giss.copernicus.org/articles/2/8/2021/) and
[Kefalidis et al. (2024)](https://www.sciencedirect.com/science/article/pii/S1569843224005594),
a query is geospatial if it requires qualitative or quantitative geographic
knowledge of Earth-bound features to be answered.

This is usually the case if the query involves:
- A geographic entity (named place on Earth: city, country, river, POI, address)
- A geographic concept (place type: city, lake, mountain, park, building)
- A spatial relation (near, within, north of, between, borders, crosses, distance)

Non-geospatial queries include anatomical, microscopic, astronomical, fictional, or abstract
'where' questions, and any query that needs no geographic knowledge to answer.
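These criteria are harder to capture with surface heuristics than they may appear: spatial-relation words such as "near" or "close" also occur in figurative, non-geospatial queries. The toy keyword baseline below (illustrative only, not part of this model) shows the failure mode that motivates an embedding-based classifier:

```python
# Naive keyword baseline for comparison -- NOT the method used by this model.
SPATIAL_KEYWORDS = {"near", "nearest", "close", "within", "north", "between", "in"}

def keyword_is_geospatial(query: str) -> bool:
    """Flag a query as geospatial if it contains any spatial-relation keyword."""
    return any(tok in SPATIAL_KEYWORDS for tok in query.lower().split())

print(keyword_is_geospatial("nearest hospital"))     # True  (correct)
print(keyword_is_geospatial("close to my heart"))    # True  (false positive)
```

The SetFit classifier resolves such figurative uses correctly (see the usage example below), which keyword matching cannot.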

## Model details

- **Sentence Transformer body:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
- **Classification head:** LogisticRegression
- **Training data:** 1,200 gold-labelled MS MARCO queries (632 non-spatial, 568 spatial), sampled via K-means centroids across the full embedding space of all 1M+ queries for representativeness
- **Labels:** `1` = geospatial, `0` = non-geospatial
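The centroid-based sampling mentioned above can be sketched as follows. This is an illustrative reconstruction, not the authors' exact pipeline: embed all queries, cluster the embeddings with K-means, and keep the query nearest to each centroid, so the labelled sample spans the embedding space. The random vectors stand in for real BGE embeddings (dimension 384 for bge-small-en-v1.5), and the cluster count is scaled down from 1,200.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 384))  # stand-in for BGE query embeddings (dim 384)

k = 12  # the real pipeline would use 1,200 clusters for 1,200 labelled samples
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

# For each cluster, keep the query whose embedding is nearest to the centroid.
sample_idx = []
for c in range(k):
    members = np.flatnonzero(km.labels_ == c)
    dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
    sample_idx.append(int(members[np.argmin(dists)]))

print(sample_idx)  # indices of the representative queries to send for labelling
```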

## Usage

```python
from setfit import SetFitModel

model = SetFitModel.from_pretrained("ilyankou/is-geospatial-query")
preds = model.predict([
  "nearest hospital",
  "far from the truth",
  "close to my heart",
  "flood risk in this area"
])
# => [1, 0, 0, 1]
```

## Training

Weak labels were generated by running Llama 3.1 five times per query at temperature 0.3
and then manually verified. For evaluation, a SetFit model was trained for 3 epochs with
batch size 64 and learning rate 2e-5 on 200 samples (95 positive, 105 negative);
the production model was then retrained on the full gold dataset of 1,200 samples.
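The five runs per query can be reduced to one weak label by majority vote. The card does not state how runs were combined, so the aggregation rule below is an assumption for illustration; the agreement score it returns is one natural way to prioritise low-confidence queries during manual verification.

```python
from collections import Counter

def aggregate_weak_label(runs: list[int]) -> tuple[int, float]:
    """Majority-vote a list of binary labels from repeated LLM runs.

    Returns (label, agreement), where agreement is the fraction of runs
    that voted for the winning label. Low agreement flags queries worth
    reviewing first during manual verification. (Assumed aggregation
    rule -- the model card does not specify how the runs were combined.)
    """
    label, votes = Counter(runs).most_common(1)[0]
    return label, votes / len(runs)

print(aggregate_weak_label([1, 1, 1, 0, 1]))  # (1, 0.8)
print(aggregate_weak_label([0, 1, 0, 1, 0]))  # (0, 0.6)
```

With five runs and binary labels a tie is impossible, so no tie-breaking rule is needed.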