title: Language Extractor Demo
emoji: 🌍
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 6.11.0
app_file: app.py
pinned: false
short_description: 'Language Extractor: Polyglot Tagger'
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
Offline caches
The demo reads FLEURS, Tatoeba, and SIB-200 from local Parquet caches.
Build the FLEURS cache once with:
./.venv/bin/python fleurs_cache.py
That downloads the FLEURS TSV metadata, deduplicates repeated sentences, drops unused columns, and writes a lean, reusable Parquet file to data/fleurs/fleurs_text_only.parquet.
Run it once while online; after that, the app reads only the local parquet and does not need the network.
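The dedupe and column-drop steps described above can be sketched with the standard library alone. This is not the contents of fleurs_cache.py: the function name `lean_rows` and the column names (`lang`, `text`, `audio_path`) are hypothetical, the sample rows are made up for illustration, and the Parquet write is left out.

```python
from typing import Iterable


def lean_rows(rows: Iterable[dict], keep: tuple, key: tuple) -> list:
    """Keep the first row for each distinct key and drop all columns not in `keep`."""
    seen = set()
    out = []
    for row in rows:
        fingerprint = tuple(row[name] for name in key)
        if fingerprint in seen:
            continue  # repeated sentence: skip it
        seen.add(fingerprint)
        out.append({name: row[name] for name in keep})
    return out


# Hypothetical sample rows; the real FLEURS metadata columns may differ.
sample = [
    {"lang": "en_us", "text": "Hello world.", "audio_path": "a.wav"},
    {"lang": "en_us", "text": "Hello world.", "audio_path": "b.wav"},
    {"lang": "fr_fr", "text": "Bonjour.", "audio_path": "c.wav"},
]
lean = lean_rows(sample, keep=("lang", "text"), key=("lang", "text"))
print(lean)
```

Keying the dedupe on language plus text, rather than text alone, is one possible choice; it keeps identical strings that occur in two different languages as separate rows.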
Build the Tatoeba cache once with:
./.venv/bin/python convert_tatoeba_sentences.py
That converts sentences.csv into data/tatoeba/tatoeba_text.parquet and keeps only the columns needed for inference.
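As a rough illustration of the conversion's first half, the sketch below parses rows in the layout the Tatoeba sentences export normally uses (tab-separated sentence ID, language code, text, no header row) and keeps only language and text. That layout is an assumption about the input file, and `read_tatoeba` and the output keys `lang` and `text` are hypothetical names, not taken from convert_tatoeba_sentences.py. Writing the result to Parquet is omitted.

```python
import csv
import io


def read_tatoeba(handle) -> list:
    """Parse tab-separated (id, lang, text) rows into lean {lang, text} records."""
    # QUOTE_NONE treats double quotes inside sentences as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if len(row) != 3:
            continue  # skip malformed lines
        _sentence_id, lang, text = row
        records.append({"lang": lang, "text": text})
    return records


# Made-up sample input standing in for sentences.csv.
sample = io.StringIO('1\tfra\tBonjour.\n2\teng\tShe said "hi".\n')
records = read_tatoeba(sample)
print(records)
```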
Build the SIB-200 cache once with:
./.venv/bin/python sib200_cache.py
That downloads the Davlan/sib200 configs, keeps the text plus language/topic metadata, and writes a lean, reusable Parquet file to data/sib200/sib200_text.parquet.
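Before running the demo offline, it may help to confirm that all three caches exist. The sketch below uses the file paths and script names given above; the helper `missing_caches` is a hypothetical addition, not part of the app, and it assumes the paths are relative to the repository root.

```python
from pathlib import Path

# Cache paths and build scripts as named in this README.
CACHE_FILES = {
    "FLEURS": ("data/fleurs/fleurs_text_only.parquet", "fleurs_cache.py"),
    "Tatoeba": ("data/tatoeba/tatoeba_text.parquet", "convert_tatoeba_sentences.py"),
    "SIB-200": ("data/sib200/sib200_text.parquet", "sib200_cache.py"),
}


def missing_caches(root: Path = Path(".")) -> dict:
    """Map each dataset whose cache file is absent to the script that builds it."""
    return {
        name: script
        for name, (rel_path, script) in CACHE_FILES.items()
        if not (root / rel_path).is_file()
    }


if __name__ == "__main__":
    for name, script in missing_caches().items():
        print(f"{name} cache missing; run ./.venv/bin/python {script}")
```

Run from the repository root, it prints one line per missing cache and nothing when all three files are present.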