Commit · e4f7d8f
1 Parent(s): 1d100ed
add
- README.md +11 -3
- data/tatoeba/tatoeba_text.parquet +3 -0
README.md
CHANGED
@@ -12,11 +12,11 @@ short_description: 'Language Extractor: Polyglot Tagger'
12
13    Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
14
15  - ## Offline
15  + ## Offline caches
16
17  - The demo
17  + The demo now uses local parquet caches for both FLEURS and Tatoeba.
18
19  - Build the cache once with:
19  + Build the FLEURS cache once with:
20
21    ```bash
22    ./.venv/bin/python fleurs_cache.py
@@ -24,3 +24,11 @@ Build the cache once with:
24
25    That downloads the FLEURS TSV metadata, dedupes repeated sentences, drops unused columns, and writes a reusable lean parquet file at `data/fleurs/fleurs_text_only.parquet`.
26    Run it once while online; after that, the app reads only the local parquet and does not need the network.
27  +
28  + Build the Tatoeba cache once with:
29  +
30  + ```bash
31  + ./.venv/bin/python convert_tatoeba_sentences.py
32  + ```
33  +
34  + That converts `sentences.csv` into `data/tatoeba/tatoeba_text.parquet` and keeps only the lean inference columns.
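The README changes above describe two cache-building steps: deduplicating repeated sentences and keeping only the columns needed for inference. Neither `fleurs_cache.py` nor `convert_tatoeba_sentences.py` is part of this commit, so the following is only a minimal sketch of that idea using the standard library. The helper `lean_rows`, the sample data, and the assumed layout (tab-separated, headerless rows of id, language code, text) are illustrative assumptions, not the repository's actual code.

```python
import csv
import io
from typing import Iterable, Iterator, Optional, Sequence, Set, Tuple


def lean_rows(
    rows: Iterable[Sequence[str]],
    keep: Sequence[int],
    dedupe_on: Optional[int] = None,
) -> Iterator[Tuple[str, ...]]:
    """Yield only the `keep` columns; skip rows whose `dedupe_on` column repeats.

    Hypothetical helper: the column indices are assumptions for illustration.
    """
    needed = max(keep) if dedupe_on is None else max(max(keep), dedupe_on)
    seen: Set[str] = set()
    for row in rows:
        if len(row) <= needed:
            continue  # skip short or malformed rows
        if dedupe_on is not None:
            key = row[dedupe_on]
            if key in seen:
                continue  # repeated sentence, already emitted once
            seen.add(key)
        yield tuple(row[i] for i in keep)


def read_tsv(text: str) -> Iterator[Sequence[str]]:
    """Parse tab-separated text without treating quote characters specially."""
    return csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)


# Assumed sample layout: id <TAB> language code <TAB> sentence text.
SAMPLE = "1\teng\tHello.\n2\tfra\tBonjour.\n3\teng\tHello.\n"

# Keep (language, text) and drop the id column; dedupe on the text column.
lean = list(lean_rows(read_tsv(SAMPLE), keep=(1, 2), dedupe_on=2))
print(lean)
```

Writing the result to parquet is left out to keep the sketch dependency-free. One option, assuming pandas with a parquet engine is installed, would be to build a `DataFrame` from `lean` and call `DataFrame.to_parquet`.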
data/tatoeba/tatoeba_text.parquet
ADDED
@@ -0,0 +1,3 @@
1  + version https://git-lfs.github.com/spec/v1
2  + oid sha256:bb6078ec4bc24cde0b84d9637f9ae1c6cbe10125ab3ad188b57cf8834a090a22
3  + size 389231463
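The three added lines are a Git LFS pointer, not the parquet data itself: the repository stores only the SHA-256 digest and byte size of the real file. As a rough sketch of how a downloaded file could be checked against such a pointer, the following parses the pointer text and compares size and digest. The names `LfsPointer`, `parse_pointer`, and `matches` are hypothetical helpers written for this example.

```python
import hashlib
import io
from typing import BinaryIO, NamedTuple


class LfsPointer(NamedTuple):
    version: str
    oid: str  # hex-encoded SHA-256 digest
    size: int  # size of the real file in bytes


def parse_pointer(text: str) -> LfsPointer:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo!r}")
    return LfsPointer(fields["version"], digest, int(fields["size"]))


def matches(pointer: LfsPointer, stream: BinaryIO, chunk_size: int = 1 << 20) -> bool:
    """Return True if the stream's size and SHA-256 digest match the pointer."""
    digest = hashlib.sha256()
    total = 0
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
        total += len(chunk)
    return total == pointer.size and digest.hexdigest() == pointer.oid


# The pointer text as added in this commit.
POINTER_TEXT = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:bb6078ec4bc24cde0b84d9637f9ae1c6cbe10125ab3ad188b57cf8834a090a22\n"
    "size 389231463\n"
)
ptr = parse_pointer(POINTER_TEXT)

# Self-check with a small in-memory payload instead of the real 389 MB file.
payload = b"hello"
demo_ptr = LfsPointer(ptr.version, hashlib.sha256(payload).hexdigest(), len(payload))
print(ptr.size, matches(demo_ptr, io.BytesIO(payload)))
```

To check the real cache, one would open `data/tatoeba/tatoeba_text.parquet` in binary mode after the LFS download and pass the file object to `matches(ptr, ...)`.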