	---
	tags:
	- setfit
	- sentence-transformers
	- text-classification
	- geospatial
	- spatial-queries
	widget:
	- text: hotel in geneva airport
	- text: what payroll deduction is mpp
	- text: weather in erlanger ky
	- text: what is the coordinates of point p
	- text: what's the weather in roseburg
	metrics:
	- accuracy
	- f1
	pipeline_tag: text-classification
	library_name: setfit
	inference: true
	base_model: BAAI/bge-small-en-v1.5
	license: mit
	---

	# Geospatial (Web Search) Query Detector

	A binary [SetFit](https://github.com/huggingface/setfit) classifier that distinguishes geospatial
	from non-geospatial web search queries. It was trained on 1,200 gold-labelled
	[MS MARCO](https://microsoft.github.io/msmarco/) web search queries, which were weakly labelled with Llama 3.1 and then manually verified. See the COSIT 2026 paper preprint: https://arxiv.org/abs/2605.11336

	The model achieves F1 = 0.931 on a held-out test set of 800 samples (421 non-spatial, 379 spatial).
	That score comes from an evaluation model trained on 200 samples (105 non-spatial, 95 spatial); the deployed model was retrained on the full 1,200.
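For reference, F1 is the harmonic mean of precision and recall. The sketch below assumes the reported score is the binary F1 for the geospatial class (`1`); the card does not state the averaging used. The function name `f1_score_binary` is hypothetical and the toy labels are illustrative only, not the model's test data.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only (1 = geospatial, 0 = non-geospatial)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(f1_score_binary(y_true, y_pred))  # 0.75 (3 TP, 1 FP, 1 FN)
```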


	## What counts as a geospatial query?

	As per [Mai et al. (2021)](https://agile-giss.copernicus.org/articles/2/8/2021/) and
	[Kefalidis et al. (2024)](https://www.sciencedirect.com/science/article/pii/S1569843224005594),
	a query is geospatial if it requires qualitative or quantitative geographic
	knowledge of Earth-bound features to be answered.

	This is usually the case if the query involves:
	- A geographic entity (named place on Earth: city, country, river, POI, address)
	- A geographic concept (place type: city, lake, mountain, park, building)
	- A spatial relation (near, within, north of, between, borders, crosses, distance)

	Non-geospatial queries include 'where' questions about anatomical, microscopic, astronomical,
	fictional, or abstract locations, and any query that needs no geographic knowledge to answer.

	## Model details

	- Sentence Transformer body: [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
	- Classification head: LogisticRegression
	- Training data: 1,200 gold-labelled MS MARCO queries (632 non-spatial, 568 spatial), sampled via K-means centroids over the embedding space of all 1M+ queries so that the sample is representative
	- Labels: `1` = geospatial, `0` = non-geospatial
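The centroid-based sampling mentioned above can be sketched roughly as follows. This is a minimal, standard-library illustration and not the author's pipeline: it assumes one query is picked per cluster (the one closest to each K-means centroid), uses toy 2-D vectors in place of real sentence embeddings, and the names `kmeans` and `sample_representatives` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def sample_representatives(points, k, seed=0):
    """Return the index of the point closest to each centroid, without repeats."""
    centroids = kmeans(points, k, seed=seed)
    chosen = []
    for centroid in centroids:
        candidates = [i for i in range(len(points)) if i not in chosen]
        chosen.append(min(candidates, key=lambda i: sq_dist(points[i], centroid)))
    return chosen


# Toy "embeddings": three groups of 2-D points standing in for query vectors
embeddings = [
    (0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
    (5.0, 5.1), (5.1, 5.0), (5.05, 5.05),
    (9.0, 0.1), (9.1, 0.0), (9.05, 0.05),
]
picked = sample_representatives(embeddings, k=3)
print(sorted(picked))  # indices of the 3 queries that would be sent for labelling
```

At the scale described in this card (1M+ queries), a library implementation such as scikit-learn's `KMeans` would be the practical choice; the loop above only illustrates the idea.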

	## Usage

	```python
	from setfit import SetFitModel

	model = SetFitModel.from_pretrained("ilyankou/is-geospatial-query")
	preds = model([
	    "nearest hospital",
	    "far from the truth",
	    "close to my heart",
	    "flood risk in this area",
	])
	# => [1, 0, 0, 1]
	```
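Since the model returns `1` for geospatial and `0` for non-geospatial, predictions can be paired with their queries for downstream filtering. A small sketch: the helper `split_by_label` is hypothetical, and the predictions are hard-coded from the example above rather than produced by the model.

```python
def split_by_label(queries, preds):
    """Partition queries into (geospatial, non_geospatial) using 1/0 predictions."""
    geo = [q for q, p in zip(queries, preds) if int(p) == 1]
    non_geo = [q for q, p in zip(queries, preds) if int(p) == 0]
    return geo, non_geo


queries = [
    "nearest hospital",
    "far from the truth",
    "close to my heart",
    "flood risk in this area",
]
preds = [1, 0, 0, 1]  # hard-coded from the usage example, not a live model call

geo, non_geo = split_by_label(queries, preds)
print(geo)      # ['nearest hospital', 'flood risk in this area']
print(non_geo)  # ['far from the truth', 'close to my heart']
```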

	## Training

	Weak labels were generated by running Llama 3.1 five times per query at temperature 0.3,
	then manually verified. The SetFit model was first trained for 3 epochs with batch size 64
	and learning rate 2e-5 on 200 samples (95 positive, 105 negative) to evaluate the approach,
	then retrained on the full gold dataset (1,200 samples) for production inference.
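One common way to combine several LLM runs into a single weak label is majority voting, with non-unanimous queries prioritised during manual review. The sketch below assumes that aggregation scheme, which this card does not specify; the function `aggregate_weak_labels` and the example votes are hypothetical.

```python
from collections import Counter


def aggregate_weak_labels(votes):
    """Majority-vote a list of 0/1 labels from repeated LLM runs.

    Returns (label, agreement), where agreement is the share of runs
    that voted for the winning label.
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)


# Hypothetical votes from five runs per query (1 = geospatial, 0 = non-geospatial)
runs = {
    "hotel in geneva airport": [1, 1, 1, 1, 1],
    "what is the coordinates of point p": [0, 1, 0, 0, 1],
}

for query, votes in runs.items():
    label, agreement = aggregate_weak_labels(votes)
    needs_review = agreement < 1.0  # flag disagreements for a closer human check
    print(f"{query!r}: label={label}, agreement={agreement}, review={needs_review}")
```

With an odd number of runs and binary labels there are no ties, so the majority label is always well defined.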