Spaces: polyglot-tagger/language-extractor-demo
DerivedFunction1 committed 5 days ago

Commit e4f7d8f (1 parent: 1d100ed)

add

Files changed (2)

README.md CHANGED

@@ -12,11 +12,11 @@ short_description: 'Language Extractor: Polyglot Tagger'
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
-## Offline FLEURS cache
-The demo can now pull examples from a local, text-only FLEURS parquet cache instead of relying on Tatoeba.
-Build the cache once with:
+## Offline caches
+The demo now uses local parquet caches for both FLEURS and Tatoeba.
+Build the FLEURS cache once with:
 ```bash
 ./.venv/bin/python fleurs_cache.py
@@ -24,3 +24,11 @@ Build the cache once with:
 That downloads the FLEURS TSV metadata, dedupes repeated sentences, drops unused columns, and writes a reusable lean parquet file at `data/fleurs/fleurs_text_only.parquet`.
 Run it once while online; after that, the app reads only the local parquet and does not need the network.
+Build the Tatoeba cache once with:
+```bash
+./.venv/bin/python convert_tatoeba_sentences.py
+```
+That converts `sentences.csv` into `data/tatoeba/tatoeba_text.parquet` and keeps only the lean inference columns.
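
The commit does not show `convert_tatoeba_sentences.py` itself, so the following is only a rough sketch of what such a conversion might look like. It assumes the Tatoeba export is tab-separated with no header row (sentence id, language code, text) and that the "lean inference columns" are `lang` and `text`; both are assumptions, and `lean_rows` and `write_parquet` are hypothetical names. Writing parquet here relies on the third-party `pyarrow` package.

```python
import csv
from typing import Iterable, Iterator


def lean_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one {"lang", "text"} dict per well-formed input line.

    Assumption: each line is "<id>\t<lang>\t<text>" with no header row.
    """
    # QUOTE_NONE keeps quotes and apostrophes in sentences as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for record in reader:
        if len(record) != 3:
            continue  # skip malformed lines rather than guess at their layout
        _sentence_id, lang, text = record
        if text:
            yield {"lang": lang, "text": text}


def write_parquet(rows: Iterable[dict], out_path: str) -> None:
    """Write the rows to a parquet file (requires pyarrow)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Materializes all rows in memory; acceptable for a sketch.
    table = pa.Table.from_pylist(list(rows))
    pq.write_table(table, out_path)


if __name__ == "__main__":
    sample = [
        "1\tcmn\t我們試試看！\n",
        "2\teng\tLet's try something.\n",
    ]
    for row in lean_rows(sample):
        print(row)
```

For the real file, one would presumably open `sentences.csv` with `encoding="utf-8"` and `newline=""`, pass the file object to `lean_rows`, and hand the result to `write_parquet`. A streaming writer may be preferable to `list(rows)` given the size of the export.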

data/tatoeba/tatoeba_text.parquet ADDED

+version https://git-lfs.github.com/spec/v1
+oid sha256:bb6078ec4bc24cde0b84d9637f9ae1c6cbe10125ab3ad188b57cf8834a090a22
+size 389231463
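
The three added lines are a Git LFS pointer rather than the parquet data itself: the repository stores only the spec version, the SHA-256 object id, and the byte size (about 389 MB here). As an illustration, a pointer like this could be parsed as follows; `parse_lfs_pointer` is a hypothetical helper written for this example, not part of the repository.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each pointer line is "<key> <value>", separated by a single space.
        key, _, value = line.partition(" ")
        fields[key] = value

    # The oid has the form "<hash algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


if __name__ == "__main__":
    pointer = (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:bb6078ec4bc24cde0b84d9637f9ae1c6cbe10125ab3ad188b57cf8834a090a22\n"
        "size 389231463\n"
    )
    info = parse_lfs_pointer(pointer)
    print(info["hash_algorithm"], info["size_bytes"])
```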