---
title: Language Identification Demo with Polyglot Tagger
emoji: 🌍
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 6.13.0
app_file: app.py
pinned: true
short_description: 'Language Identification: Polyglot Tagger'
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

## Offline caches

The demo reads its FLEURS, Tatoeba, and SIB-200 data from local Parquet caches.

Build the FLEURS cache once with:

```bash
./.venv/bin/python fleurs_cache.py
```

The script downloads the FLEURS TSV metadata, deduplicates repeated sentences, drops unused columns, and writes a lean, reusable Parquet file to `data/fleurs/fleurs_text_only.parquet`.
Run it once while online; after that, the app reads only the local Parquet file and does not need the network.
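
The dedupe-and-trim step can be sketched roughly as follows. This is a minimal illustration and not the actual contents of `fleurs_cache.py`: the function names, the `lang`/`text` output columns, and the position of the transcription column in the TSV (`text_col`) are all assumptions.

```python
import csv
import io


def lean_rows(tsv_text, lang, text_col=1):
    """Yield one {'lang', 'text'} dict per unique sentence in a TSV dump.

    `text_col` is the assumed index of the transcription column.
    """
    seen = set()
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= text_col:
            continue  # skip malformed or blank lines
        text = row[text_col].strip()
        if not text or text in seen:
            continue  # drop empty and repeated sentences
        seen.add(text)
        yield {"lang": lang, "text": text}  # every other column is dropped


def write_parquet(rows, path):
    """Write the lean rows to Parquet (assumes pandas and pyarrow are installed)."""
    import pandas as pd

    pd.DataFrame(list(rows), columns=["lang", "text"]).to_parquet(path, index=False)
```

Keeping only the first occurrence of each sentence preserves the original order, which makes the resulting file stable across rebuilds.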

Build the Tatoeba cache once with:

```bash
./.venv/bin/python convert_tatoeba_sentences.py
```

The script converts `sentences.csv` into `data/tatoeba/tatoeba_text.parquet`, keeping only the columns needed for inference.
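
As a rough sketch of what such a conversion involves, the snippet below parses lines in the layout assumed for the Tatoeba export (tab-separated `id`, `lang`, `text`, no header, with `\N` marking an unknown language). The function name and the output column names are hypothetical, and the real `convert_tatoeba_sentences.py` may differ.

```python
import csv


def parse_tatoeba(lines):
    """Yield {'lang', 'text'} dicts from tab-separated 'id, lang, text' lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip lines that do not match the assumed layout
        _sentence_id, lang, text = row
        text = text.strip()
        if lang in ("", "\\N") or not text:
            continue  # skip sentences with no usable language label or text
        yield {"lang": lang, "text": text}
```

Passing an open file handle, for example `parse_tatoeba(open("sentences.csv", encoding="utf-8", newline=""))`, streams the file without loading it into memory.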

Build the SIB-200 cache once with:

```bash
./.venv/bin/python sib200_cache.py
```

The script downloads the `Davlan/sib200` configs, keeps the text plus the language and topic metadata, and writes a lean, reusable Parquet file to `data/sib200/sib200_text.parquet`.
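
Because the app depends on all three files being present, a small pre-flight check can point to the script that still needs to run. The helper below is a hypothetical sketch, not part of the app; it only uses the paths and script names listed in this section.

```python
from pathlib import Path

# Cache file and the script that builds it, as listed in this README.
CACHE_FILES = {
    "FLEURS": ("data/fleurs/fleurs_text_only.parquet", "fleurs_cache.py"),
    "Tatoeba": ("data/tatoeba/tatoeba_text.parquet", "convert_tatoeba_sentences.py"),
    "SIB-200": ("data/sib200/sib200_text.parquet", "sib200_cache.py"),
}


def missing_caches(root="."):
    """Return {dataset name: build script} for every cache file not found under root."""
    root = Path(root)
    return {
        name: script
        for name, (rel_path, script) in CACHE_FILES.items()
        if not (root / rel_path).is_file()
    }


if __name__ == "__main__":
    for name, script in missing_caches().items():
        print(f"{name} cache is missing; run ./.venv/bin/python {script}")
```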