---
tags:
- setfit
- sentence-transformers
- text-classification
- geospatial
- spatial-queries
widget:
- text: hotel in geneva airport
- text: what payroll deduction is mpp
- text: weather in erlanger ky
- text: what is the coordinates of point p
- text: what's the weather in roseburg
metrics:
- accuracy
- f1
pipeline_tag: text-classification
library_name: setfit
inference: true
base_model: BAAI/bge-small-en-v1.5
license: mit
---
# Geospatial (Web Search) Query Detector
A binary [SetFit](https://github.com/huggingface/setfit) classifier that distinguishes geospatial
from non-geospatial web search queries. It was trained on 1,200 gold-labelled
[MS MARCO](https://microsoft.github.io/msmarco/) web search queries, weakly labelled with Llama 3.1 and then manually verified.
See the COSIT 2026 paper preprint: https://arxiv.org/abs/2605.11336
The evaluation model, trained on 200 samples (105 non-spatial, 95 spatial), achieves **F1 = 0.931**
on a held-out test set of 800 samples (421 non-spatial, 379 spatial). The deployed model was trained on the full 1,200 samples.
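
For readers unfamiliar with the metric, the sketch below shows how an F1 score can be computed from true and predicted labels. It assumes binary F1 with label `1` (geospatial) as the positive class, which this card does not state explicitly. The function name and the toy labels are illustrative, not taken from the evaluation code.

```python
def f1_score(true_labels, predicted_labels, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, purely hypothetical (not the real test set)
toy_true = [1, 1, 0, 0]
toy_pred = [1, 0, 0, 1]
toy_f1 = f1_score(toy_true, toy_pred)
```
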
## What counts as a geospatial query?
As per [Mai et al. (2021)](https://agile-giss.copernicus.org/articles/2/8/2021/) and
[Kefalidis et al. (2024)](https://www.sciencedirect.com/science/article/pii/S1569843224005594),
a query is geospatial if it requires qualitative or quantitative geographic
knowledge of Earth-bound features to be answered.
This is usually the case if the query involves:
- A geographic entity (named place on Earth: city, country, river, POI, address)
- A geographic concept (place type: city, lake, mountain, park, building)
- A spatial relation (near, within, north of, between, borders, crosses, distance)
Non-geospatial queries include anatomical, microscopic, astronomical, fictional, or abstract
'where' questions, as well as queries that need no geographic knowledge.
## Model details
- **Sentence Transformer body:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
- **Classification head:** LogisticRegression
- **Training data:** 1,200 gold-labelled MS MARCO queries (632 non-spatial, 568 spatial), sampled via K-means centroids across the full embedding space of all 1M+ queries for representativeness
- **Labels:** `1` = geospatial, `0` = non-geospatial
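
The centroid-based sampling mentioned under training data can be sketched roughly as follows. This is a minimal pure-Python illustration, assuming that the query closest to each K-means centroid is the one selected; the card does not spell out the exact procedure. The toy 2-D points stand in for query embeddings, and all function names are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def representative_indices(points, k, seed=0):
    """Index of the point closest to each centroid (assumed selection rule)."""
    centroids = kmeans(points, k, seed=seed)
    chosen = {
        min(range(len(points)), key=lambda i: squared_distance(points[i], c))
        for c in centroids
    }
    return sorted(chosen)


# Hypothetical 2-D "embeddings" forming two well-separated groups
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_sample = representative_indices(toy_points, k=2)
```

Picking one query per cluster spreads the labelling budget across the embedding space rather than concentrating it on the most common query types.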
## Usage
```python
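from setfit import SetFitModel

# Load the classifier from the Hugging Face Hub
model = SetFitModel.from_pretrained("ilyankou/is-geospatial-query")

# Calling the model returns one label per query: 1 = geospatial, 0 = non-geospatial
preds = model([
    "nearest hospital",
    "far from the truth",
    "close to my heart",
    "flood risk in this area"
])
# => [1, 0, 0, 1]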
```
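
To make the numeric predictions easier to read, they can be paired with their queries and mapped to label names. The helper below is a small sketch; `LABEL_NAMES` and `label_queries` are illustrative names, not part of SetFit, and the predictions are hard-coded here from the example output so the snippet runs without downloading the model.

```python
# Label mapping as documented in this card: 1 = geospatial, 0 = non-geospatial
LABEL_NAMES = {0: "non-geospatial", 1: "geospatial"}


def label_queries(queries, predictions):
    """Pair each query with a human-readable label."""
    # int() also handles tensor or NumPy scalar predictions
    return [(q, LABEL_NAMES[int(p)]) for q, p in zip(queries, predictions)]


queries = [
    "nearest hospital",
    "far from the truth",
    "close to my heart",
    "flood risk in this area",
]
example_preds = [1, 0, 0, 1]  # copied from the usage example

labelled = label_queries(queries, example_preds)
geospatial_only = [q for q, name in labelled if name == "geospatial"]
```
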
## Training
Weak labels were generated by running Llama 3.1 five times per query at temperature 0.3,
then manually verified. For validation, the SetFit model was trained for 3 epochs with batch size 64
and learning rate 2e-5 on 200 samples (95 positive and 105 negative);
it was then retrained on the full gold dataset (1,200 samples) for production inference.
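
One plausible way to combine the five Llama 3.1 runs into a single weak label is a majority vote, with the agreement rate used to prioritise manual review. This is an assumption for illustration: the card states only that five runs were made and that labels were manually verified, not how the runs were aggregated. The function name and the votes below are hypothetical.

```python
from collections import Counter


def aggregate_votes(votes):
    """Return the majority label and the fraction of runs that agreed with it."""
    counts = Counter(votes)
    label, count = counts.most_common(1)[0]
    return label, count / len(votes)


# Hypothetical outputs of five runs for one query (1 = geospatial)
weak_label, agreement = aggregate_votes([1, 1, 0, 1, 1])

# Queries without unanimous agreement could be reviewed first
needs_priority_review = agreement < 1.0
```

With an odd number of runs and two classes, a majority always exists, so no tie-breaking rule is needed in this sketch.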