Geospatial (Web Search) Query Detector
A binary SetFit classifier that distinguishes geospatial from non-geospatial web search queries. It was trained on 1,200 gold-labelled MS MARCO web search queries; the labels were first generated through weak supervision with Llama 3.1 and then manually verified. A preprint of the COSIT 2026 paper is available at https://arxiv.org/abs/2605.11336
The model achieves F1 = 0.931 on a held-out test set of 800 samples (421 non-spatial, 379 spatial). That score comes from an evaluation model trained on 200 samples (105 non-spatial, 95 spatial); the deployed model was trained on the full 1,200.
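For readers less familiar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it for the positive class from raw counts. The card does not state whether the reported score is positive-class or averaged F1, and the counts here are invented for illustration, so treat this as a reminder of the formula rather than a reproduction of the evaluation. `f1_from_counts` is a hypothetical helper name.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class from true-positive, false-positive and false-negative counts."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up counts for illustration only; these are not the model's results
toy_f1 = f1_from_counts(tp=8, fp=2, fn=2)
print(round(toy_f1, 3))
```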
What counts as a geospatial query?
As per Mai et al. (2021) and Kefalidis et al. (2024), a query is geospatial if it requires qualitative or quantitative geographic knowledge of Earth-bound features to be answered.
This is usually the case if the query involves:
- A geographic entity (named place on Earth: city, country, river, POI, address)
- A geographic concept (place type: city, lake, mountain, park, building)
- A spatial relation (near, within, north of, between, borders, crosses, distance)
Non-geospatial queries include anatomical, microscopic, astronomical, fictional, or abstract 'where' questions, as well as any query that needs no geographic knowledge to answer.
Model details
- Sentence Transformer body: BAAI/bge-small-en-v1.5
- Classification head: LogisticRegression
- Training data: 1,200 gold-labelled MS MARCO queries (632 non-spatial, 568 spatial), sampled via K-means centroids across the full embedding space of all 1M+ queries for representativeness
- Labels: 1 = geospatial, 0 = non-geospatial
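The centroid-based sampling mentioned under training data can be sketched roughly as follows: cluster the query embeddings with K-means, then keep the one query closest to each centroid. This is a minimal sketch under several assumptions. The card does not say how a query was chosen per cluster (nearest-to-centroid is assumed here), which K-means implementation or settings were used, or how many clusters there were. The pure-Python Lloyd's loop and random toy vectors exist only so the example runs on its own; `kmeans` and `sample_by_centroids` are hypothetical names, and a real pipeline would more likely use BAAI/bge-small-en-v1.5 embeddings with a library implementation.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, cluster assignment per point)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def sample_by_centroids(points, k, seed=0):
    """Return the index of the point nearest to each non-empty cluster's centroid."""
    centroids, assign = kmeans(points, k, seed=seed)
    picked = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            picked.append(min(members, key=lambda i: math.dist(points[i], centroids[c])))
    return picked


# Toy stand-in for query embeddings (random vectors, not MS MARCO data)
toy_rng = random.Random(1)
toy_embeddings = [[toy_rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
picked = sample_by_centroids(toy_embeddings, k=10)
print(len(picked))
```

Because each picked index comes from a different cluster, the selected queries are spread across the embedding space instead of concentrating in its densest regions.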
Usage
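from setfit import SetFitModel

# Load the published classifier from the Hugging Face Hub
model = SetFitModel.from_pretrained("ilyankou/is-geospatial-query")

# Calling the model on a list of query strings returns one label per query
preds = model([
    "nearest hospital",
    "far from the truth",
    "close to my heart",
    "flood risk in this area",
])
# => [1, 0, 0, 1]  (1 = geospatial, 0 = non-geospatial)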
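A common use for a detector like this is routing: keep only the queries labelled 1 and send them on to a geospatial pipeline. The sketch below shows one way to do that. `keep_geospatial` is a hypothetical helper, and `stub_predict` is a crude keyword stand-in so the example runs offline without downloading the model; in practice you would pass the loaded SetFit model as `predict`.

```python
from typing import Callable, Iterable, Sequence


def keep_geospatial(queries: Sequence[str], predict: Callable[[Sequence[str]], Iterable]) -> list[str]:
    """Return only the queries that `predict` labels 1 (geospatial)."""
    labels = predict(list(queries))
    # int() copes with plain ints as well as NumPy or tensor scalars
    return [q for q, label in zip(queries, labels) if int(label) == 1]


# Hypothetical stand-in for the classifier, used only to keep this sketch self-contained
def stub_predict(batch):
    spatial_words = ("nearest", "area")
    return [int(any(w in q for w in spatial_words)) for q in batch]


queries = [
    "nearest hospital",
    "far from the truth",
    "close to my heart",
    "flood risk in this area",
]
routed = keep_geospatial(queries, stub_predict)
print(routed)  # ['nearest hospital', 'flood risk in this area']
```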
Training
Weak labels were generated by running Llama 3.1 five times per query at temperature 0.3, then manually verified. For evaluation, the SetFit model was trained for 3 epochs with batch size 64 and learning rate 2e-5 on 200 samples (95 positive, 105 negative). It was then retrained on the full gold dataset (1,200 samples) for production inference.
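One plausible way to combine five LLM runs into a single weak label is a majority vote, with non-unanimous queries flagged for closer attention during manual verification. The card does not say how the five runs were actually aggregated, so both the voting rule and the review flag are assumptions, and `aggregate_votes` and the toy votes are made up for illustration.

```python
from collections import Counter


def aggregate_votes(votes: list[int]) -> tuple[int, bool]:
    """Return (majority label, needs_review) for one query's repeated LLM labels."""
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    # Flag anything short of unanimous agreement for a closer manual look
    return label, top < len(votes)


# Invented votes for two queries; not taken from the actual labelling runs
toy_runs = {
    "nearest hospital": [1, 1, 1, 1, 1],
    "close to my heart": [0, 0, 1, 0, 0],
}
weak_labels = {q: aggregate_votes(v) for q, v in toy_runs.items()}
print(weak_labels)
```

With an odd number of binary votes there is never a tie, which is one reason to run the labeller five times and not four.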