---
tags:
- setfit
- sentence-transformers
- text-classification
- geospatial
- spatial-queries
widget:
- text: hotel in geneva airport
- text: what payroll deduction is mpp
- text: weather in erlanger ky
- text: what is the coordinates of point p
- text: what's the weather in roseburg
metrics:
- accuracy
- f1
pipeline_tag: text-classification
library_name: setfit
inference: true
base_model: BAAI/bge-small-en-v1.5
license: mit
---

# Geospatial (Web Search) Query Detector

A binary [SetFit](https://github.com/huggingface/setfit) classifier that distinguishes geospatial from non-geospatial web search queries. It was trained on 1,200 gold-labelled [MS MARCO](https://microsoft.github.io/msmarco/) web search queries, which were weakly labelled with Llama 3.1 and then manually verified. See the COSIT 2026 paper preprint: https://arxiv.org/abs/2605.11336

The model achieves **F1 = 0.931** on a held-out test set of 800 samples (421 non-spatial, 379 spatial), with the evaluation model trained on 200 samples (105 non-spatial, 95 spatial). The deployed model was trained on the full 1,200.

## What counts as a geospatial query?

Following [Mai et al. (2021)](https://agile-giss.copernicus.org/articles/2/8/2021/) and [Kefalidis et al. (2024)](https://www.sciencedirect.com/science/article/pii/S1569843224005594), a query is geospatial if answering it requires qualitative or quantitative geographic knowledge of Earth-bound features. This is usually the case if the query involves:

- A geographic entity (named place on Earth: city, country, river, POI, address)
- A geographic concept (place type: city, lake, mountain, park, building)
- A spatial relation (near, within, north of, between, borders, crosses, distance)

Non-geospatial queries include anatomical, microscopic, astronomical, fictional, or abstract 'where' questions, as well as any query that needs no geographic knowledge.

## Model details

- **Sentence Transformer body:** [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)
- **Classification head:** LogisticRegression
- **Training data:** 1,200 gold-labelled MS MARCO queries (632 non-spatial, 568 spatial), sampled via K-means centroids across the full embedding space of all 1M+ queries for representativeness
- **Labels:** `1` = geospatial, `0` = non-geospatial

## Usage

```python
from setfit import SetFitModel

# Load the deployed model from the Hugging Face Hub
model = SetFitModel.from_pretrained("ilyankou/is-geospatial-query")

# Classify a batch of queries: 1 = geospatial, 0 = non-geospatial
preds = model([
    "nearest hospital",
    "far from the truth",
    "close to my heart",
    "flood risk in this area"
])
# => [1, 0, 0, 1]
```

## Training

Weak labels were generated by running Llama 3.1 five times per query at temperature 0.3, then manually verified. For evaluation, the SetFit model was trained for 3 epochs with batch size 64 and learning rate 2e-5 on 200 samples (95 geospatial, 105 non-geospatial). It was then retrained on the full gold dataset (1,200 samples) for production inference.
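
This card does not say how the five Llama 3.1 runs per query were combined before manual verification. The snippet below is a minimal sketch of one plausible approach, assuming a simple majority vote with an agreement score that could flag uncertain queries for closer review. The function name `aggregate_weak_labels` and the example votes are hypothetical and are not taken from the original pipeline.

```python
from collections import Counter


def aggregate_weak_labels(votes):
    """Combine repeated 0/1 weak labels for one query.

    Returns (majority_label, agreement), where agreement is the share
    of runs that voted for the majority label. With an odd number of
    binary votes (e.g. five runs) a tie cannot occur.
    """
    if not votes:
        raise ValueError("need at least one vote")
    label, top_count = Counter(votes).most_common(1)[0]
    return label, top_count / len(votes)


# Hypothetical votes from five runs per query (illustrative only)
weak_votes = {
    "hotel in geneva airport": [1, 1, 1, 1, 1],
    "what is the coordinates of point p": [1, 0, 0, 1, 0],
}

for query, votes in weak_votes.items():
    label, agreement = aggregate_weak_labels(votes)
    # Assumption: anything short of unanimous agreement gets extra scrutiny
    needs_review = agreement < 1.0
    print(f"{query!r}: label={label}, agreement={agreement:.1f}, review={needs_review}")
```

Under this assumption, unanimous queries could be checked quickly, while split votes such as the second example would be prioritised during manual verification.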