metadata
tags:
  - setfit
  - sentence-transformers
  - text-classification
  - geospatial
  - spatial-queries
widget:
  - text: hotel in geneva airport
  - text: what payroll deduction is mpp
  - text: weather in erlanger ky
  - text: what is the coordinates of point p
  - text: what's the weather in roseburg
metrics:
  - accuracy
  - f1
pipeline_tag: text-classification
library_name: setfit
inference: true
base_model: BAAI/bge-small-en-v1.5
license: mit

Geospatial (Web Search) Query Detector

A binary SetFit classifier that distinguishes geospatial from non-geospatial web search queries. It was trained on 1,200 gold-labelled MS MARCO web search queries, whose labels were first produced by weak supervision from Llama 3.1 and then manually verified. See the COSIT 2026 paper preprint: https://arxiv.org/abs/2605.11336

The model achieves F1 = 0.931 on a held-out test set of 800 samples (421 non-spatial, 379 spatial). For this evaluation, the model was trained on 200 samples (105 non-spatial, 95 spatial). The deployed model was trained on the full 1,200 samples.
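As a reminder of what the metric measures, the sketch below computes F1 from scratch on a toy example. It assumes the reported score is the standard binary F1 for the geospatial class (label 1); the card does not state which averaging was used. `binary_f1` is a hypothetical helper, and the labels are invented for illustration rather than taken from the model's test set.

```python
def binary_f1(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration (1 = geospatial, 0 = non-geospatial).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
score = binary_f1(y_true, y_pred)  # 3 TP, 1 FP, 1 FN
```

In practice a library routine such as scikit-learn's `f1_score` would typically be used instead of a hand-written function.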

What counts as a geospatial query?

As per Mai et al. (2021) and Kefalidis et al. (2024), a query is geospatial if it requires qualitative or quantitative geographic knowledge of Earth-bound features to be answered.

This is usually the case if the query involves:

  • A geographic entity (named place on Earth: city, country, river, POI, address)
  • A geographic concept (place type: city, lake, mountain, park, building)
  • A spatial relation (near, within, north of, between, borders, crosses, distance)

Non-geospatial: anatomical, microscopic, astronomical, fictional, or abstract 'where' questions; queries needing no geographic knowledge.

Model details

  • Sentence Transformer body: BAAI/bge-small-en-v1.5
  • Classification head: LogisticRegression
  • Training data: 1,200 gold-labelled MS MARCO queries (632 non-spatial, 568 spatial), sampled via K-means centroids over the embedding space of all 1M+ queries so that the sample is representative
  • Labels: 1 = geospatial, 0 = non-geospatial
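
The centroid-based sampling mentioned above can be sketched roughly as follows. This is one plausible reading, assuming that the query closest to each K-means centroid is kept; the card does not spell out the exact selection rule. The function name, the 2-D vectors, and the centroids are all hypothetical (real bge-small-en-v1.5 embeddings have 384 dimensions, and the centroids would come from an actual K-means run).

```python
import math


def nearest_to_centroids(embeddings, centroids):
    """Return, for each centroid, the index of the closest embedding."""
    picked = []
    for centroid in centroids:
        best = min(
            range(len(embeddings)),
            key=lambda i: math.dist(embeddings[i], centroid),
        )
        picked.append(best)
    return picked


# Toy data, invented for illustration.
queries = [
    "hotel in geneva airport",
    "weather in erlanger ky",
    "what payroll deduction is mpp",
    "what is the coordinates of point p",
]
embeddings = [(0.9, 0.2), (0.8, 0.4), (0.1, 0.8), (0.3, 0.6)]
centroids = [(0.85, 0.2), (0.15, 0.8)]  # hypothetical K-means output

sample = [queries[i] for i in nearest_to_centroids(embeddings, centroids)]
```

Two centroids can in principle select the same query, so a real pipeline would likely deduplicate the result.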

Usage

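# Requires the setfit package: pip install setfit
from setfit import SetFitModel

# Download the classifier from the Hugging Face Hub
model = SetFitModel.from_pretrained("ilyankou/is-geospatial-query")

# Calling the model on a list of queries returns one label per query
preds = model([
  "nearest hospital",
  "far from the truth",
  "close to my heart",
  "flood risk in this area"
])
# => [1, 0, 0, 1]  (1 = geospatial, 0 = non-geospatial)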
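
A common next step is to route queries by predicted label. The sketch below is a hypothetical helper, not part of SetFit: it accepts any callable that maps a list of queries to a sequence of 0/1 labels, and assumes each label can be converted with `int()`. A crude keyword stand-in is used so the example runs without downloading the model.

```python
def split_queries(classify, queries):
    """Partition queries into (geospatial, non_geospatial) by predicted label."""
    labels = [int(label) for label in classify(queries)]
    geospatial = [q for q, label in zip(queries, labels) if label == 1]
    other = [q for q, label in zip(queries, labels) if label == 0]
    return geospatial, other


def fake_classify(queries):
    """Stand-in classifier: a keyword rule, NOT how the SetFit model decides."""
    return [1 if "nearest" in q or "area" in q else 0 for q in queries]


geo, other = split_queries(
    fake_classify, ["nearest hospital", "far from the truth"]
)
# With the model loaded above, the call would presumably be:
#   geo, other = split_queries(model, queries)
```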

Training

Weak labels were generated by running Llama 3.1 five times per query at temperature 0.3 and were then manually verified. For evaluation, the SetFit model was trained for 3 epochs with batch size 64 and learning rate 2e-5 on 200 samples (95 positive, 105 negative). For production inference, it was then retrained on the full gold dataset (1,200 samples).
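
The card does not say how the five Llama 3.1 runs per query were combined. One simple option, shown here purely as an assumption, is a majority vote, with the agreement rate used to prioritise which queries deserve the closest look during manual verification. `aggregate_runs` and the example run outputs are hypothetical.

```python
from collections import Counter


def aggregate_runs(run_labels):
    """Majority-vote a list of 0/1 labels and report the share of agreeing runs."""
    counts = Counter(run_labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(run_labels)


# Hypothetical outputs of five LLM runs per query (1 = geospatial).
weak = {
    "weather in erlanger ky": [1, 1, 1, 1, 1],
    "what is the coordinates of point p": [0, 1, 0, 0, 1],
}
results = {query: aggregate_runs(runs) for query, runs in weak.items()}

# Queries where the runs disagree are natural candidates for careful review.
needs_close_look = [
    query for query, (_, agreement) in results.items() if agreement < 1.0
]
```

With an odd number of runs and two classes, the vote can never tie, which is one practical reason to pick five runs rather than four.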