---
title: Language Identification Demo with Polyglot Tagger
emoji: 🌍
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 6.13.0
app_file: app.py
pinned: true
short_description: 'Language Identification: Polyglot Tagger'
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

## Offline caches

The demo now uses local parquet caches for FLEURS, Tatoeba, and SIB-200.

Build the FLEURS cache once with:

```bash
./.venv/bin/python fleurs_cache.py
```

That downloads the FLEURS TSV metadata, dedupes repeated sentences, drops unused columns, and writes a reusable lean parquet file at `data/fleurs/fleurs_text_only.parquet`. Run it once while online; after that, the app reads only the local parquet and does not need the network.

Build the Tatoeba cache once with:

```bash
./.venv/bin/python convert_tatoeba_sentences.py
```

That converts `sentences.csv` into `data/tatoeba/tatoeba_text.parquet` and keeps only the lean inference columns.

Build the SIB-200 cache once with:

```bash
./.venv/bin/python sib200_cache.py
```

That downloads the `Davlan/sib200` configs, keeps the text plus language/topic metadata, and writes a reusable lean parquet file at `data/sib200/sib200_text.parquet`.
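
Because the app reads only the local parquet files once the caches are built, it can help to confirm that all three exist before going offline. The following is a minimal sketch, not part of the repository: the `missing_caches` helper and the `CACHES` mapping are hypothetical names, and the paths are assumed to be relative to the repository root as written above.

```python
from pathlib import Path

# Cache paths as listed in this README, mapped to the script that builds each.
# This mapping is an illustrative assumption, not something app.py defines.
CACHES = {
    "data/fleurs/fleurs_text_only.parquet": "fleurs_cache.py",
    "data/tatoeba/tatoeba_text.parquet": "convert_tatoeba_sentences.py",
    "data/sib200/sib200_text.parquet": "sib200_cache.py",
}


def missing_caches(root="."):
    """Return (cache_path, builder_script) pairs for caches not found under root."""
    base = Path(root)
    return [
        (rel, script)
        for rel, script in CACHES.items()
        if not (base / rel).is_file()
    ]


if __name__ == "__main__":
    # Print one hint per missing cache; print nothing when all are present.
    for rel, script in missing_caches():
        print(f"missing {rel}: run ./.venv/bin/python {script}")
```

Running this from the repository root prints one line per cache that still needs to be built. It only checks that each file exists and does not validate the parquet contents.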