DerivedFunction1 committed on
Commit e4f7d8f · 1 Parent(s): 1d100ed
Files changed (2)
  1. README.md +11 -3
  2. data/tatoeba/tatoeba_text.parquet +3 -0
README.md CHANGED
@@ -12,11 +12,11 @@ short_description: 'Language Extractor: Polyglot Tagger'

  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

- ## Offline FLEURS cache
+ ## Offline caches

- The demo can now pull examples from a local, text-only FLEURS parquet cache instead of relying on Tatoeba.
+ The demo now uses local parquet caches for both FLEURS and Tatoeba.

- Build the cache once with:
+ Build the FLEURS cache once with:

  ```bash
  ./.venv/bin/python fleurs_cache.py
@@ -24,3 +24,11 @@ Build the cache once with:

  That downloads the FLEURS TSV metadata, dedupes repeated sentences, drops unused columns, and writes a reusable lean parquet file at `data/fleurs/fleurs_text_only.parquet`.
  Run it once while online; after that, the app reads only the local parquet and does not need the network.
+
+ Build the Tatoeba cache once with:
+
+ ```bash
+ ./.venv/bin/python convert_tatoeba_sentences.py
+ ```
+
+ That converts `sentences.csv` into `data/tatoeba/tatoeba_text.parquet` and keeps only the lean inference columns.
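
Note on the Tatoeba conversion: `convert_tatoeba_sentences.py` itself is not part of this commit, so the following is only a minimal sketch of what such a conversion could look like, not the script's actual code. It assumes `sentences.csv` is tab-separated with no header and three columns (sentence id, language code, text), and that the "lean inference columns" are the language code and the text. The function names and the `lang`/`text` column names are hypothetical. Writing the parquet file assumes pandas with a parquet engine such as pyarrow is installed.

```python
import csv
from typing import Iterable, Iterator, Tuple


def read_lean_rows(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (lang, text) pairs from tab-separated lines, dropping the sentence id.

    Assumed input layout per line: sentence id, language code, text.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip malformed lines rather than guess at their meaning
        _sentence_id, lang, text = row
        text = text.strip()
        if lang and text:
            yield lang, text


def write_parquet(rows: Iterable[Tuple[str, str]], out_path: str) -> None:
    """Write the lean rows to a parquet file (hypothetical column names)."""
    # Imported lazily so the parsing step above needs only the standard library.
    import pandas as pd

    frame = pd.DataFrame(list(rows), columns=["lang", "text"])
    frame.to_parquet(out_path, index=False)
```

To run the whole conversion under these assumptions, open `sentences.csv` with `encoding="utf-8"` and `newline=""`, pass the file handle to `read_lean_rows`, and hand the result to `write_parquet` with `data/tatoeba/tatoeba_text.parquet` as the output path.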
data/tatoeba/tatoeba_text.parquet ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bb6078ec4bc24cde0b84d9637f9ae1c6cbe10125ab3ad188b57cf8834a090a22
+ size 389231463
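
Note on the added file: the three lines above are a Git LFS pointer, not the parquet data itself. The pointer records the spec version, the SHA-256 of the real file, and its size in bytes (389231463, roughly 389 MB). As a hedged sketch, a downloaded parquet file could be checked against the pointer as follows. The helper names are hypothetical, and the diff's leading `+ ` markers are assumed to have been stripped from the pointer text.

```python
import hashlib
from typing import Dict


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Parse the `key value` lines of a Git LFS pointer file into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(path: str, pointer: Dict[str, str], chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the size and SHA-256 the pointer records."""
    algo, _, expected_hex = pointer["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so a file of several hundred MB is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size == int(pointer["size"]) and digest.hexdigest() == expected_hex
```

If `matches_pointer` returns `False` for `data/tatoeba/tatoeba_text.parquet`, one likely cause is that the checkout still contains the pointer text rather than the LFS object.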