
title: Language Extractor Demo
emoji: 🌍
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 6.11.0
app_file: app.py
pinned: false
short_description: 'Language Extractor: Polyglot Tagger'

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Offline caches

The demo reads FLEURS, Tatoeba, and SIB-200 from local Parquet caches.

Build the FLEURS cache once with:

./.venv/bin/python fleurs_cache.py

This downloads the FLEURS TSV metadata, removes duplicate sentences, drops unused columns, and writes a lean, reusable Parquet file to data/fleurs/fleurs_text_only.parquet. Run it once while online; afterwards the app reads only the local Parquet file and does not need network access.
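The dedupe-and-trim step can be pictured with a minimal sketch. This is not the code in fleurs_cache.py: the column names (`id`, `audio_file`, `text`, `speaker`), the `lean_rows` and `write_parquet` helpers, and the choice to dedupe on the sentence text are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical column layout for a header-less TSV; the real FLEURS
# metadata columns may differ.
ASSUMED_COLUMNS = ("id", "audio_file", "text", "speaker")


def lean_rows(tsv_text, lang, columns=ASSUMED_COLUMNS):
    """Return one {"lang", "text"} row per distinct sentence in the TSV."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        fieldnames=columns,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )
    seen = set()
    rows = []
    for record in reader:
        text = (record.get("text") or "").strip()
        if not text or text in seen:
            continue  # skip empty lines and repeated sentences
        seen.add(text)
        rows.append({"lang": lang, "text": text})  # unused columns dropped
    return rows


def write_parquet(rows, path):
    """Write the lean rows to Parquet (assumes pandas plus pyarrow are installed)."""
    import pandas as pd

    pd.DataFrame(rows).to_parquet(path, index=False)
```

Keeping only the language and text columns is what makes the file "lean"; the same sentence read by several speakers collapses to a single row.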

Build the Tatoeba cache once with:

./.venv/bin/python convert_tatoeba_sentences.py

This converts sentences.csv into data/tatoeba/tatoeba_text.parquet, keeping only the columns needed for inference.
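As a rough illustration of the conversion, the sketch below assumes the usual Tatoeba export layout: tab-separated lines of sentence id, language code, and text, with no header row and `\N` marking an unknown language. The `iter_tatoeba` helper and the decision to keep only `lang` and `text` are assumptions, not a description of convert_tatoeba_sentences.py.

```python
def iter_tatoeba(lines):
    """Yield {"lang", "text"} rows from tab-separated Tatoeba lines."""
    for line in lines:
        # Split on the first two tabs only, so tabs inside the text survive.
        parts = line.rstrip("\r\n").split("\t", 2)
        if len(parts) != 3:
            continue  # malformed line
        _sentence_id, lang, text = parts
        if lang == "\\N" or not text:
            continue  # unknown language or empty sentence
        yield {"lang": lang, "text": text}
```

Because this is a generator, it can be fed an open file handle and the rows collected in batches, so the full sentences.csv never has to sit in memory as raw text.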

Build the SIB-200 cache once with:

./.venv/bin/python sib200_cache.py

This downloads the Davlan/sib200 configs, keeps the text along with its language and topic metadata, and writes a lean, reusable Parquet file to data/sib200/sib200_text.parquet.
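Before going offline, it can help to confirm that all three caches exist. The check below is a small sketch using the paths and script names listed above; the `missing_caches` helper is hypothetical and is not part of app.py.

```python
from pathlib import Path

# Cache path -> script that builds it, as listed in this README.
CACHES = {
    "data/fleurs/fleurs_text_only.parquet": "fleurs_cache.py",
    "data/tatoeba/tatoeba_text.parquet": "convert_tatoeba_sentences.py",
    "data/sib200/sib200_text.parquet": "sib200_cache.py",
}


def missing_caches(root="."):
    """Return {cache_path: build_script} for every cache file not found under root."""
    root = Path(root)
    return {
        rel: script
        for rel, script in CACHES.items()
        if not (root / rel).is_file()
    }


if __name__ == "__main__":
    for rel, script in missing_caches().items():
        print(f"missing {rel}: run ./.venv/bin/python {script}")
```

Run from the repository root, it prints one line per cache that still needs to be built and nothing when all three are present.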