Replace metadata.duckdb with optimized version (27GB → 14GB)
#1
by netzhang - opened
Replace metadata.duckdb with optimized version (27GB → 14GB)
Optimizations: ENUM types, URL prefix deduplication, taxonomy sort, UUID native type, INT downcast.
Backfilled 28.3M rows (observation.org + EOL) with recovered taxonomy and URLs.
Added in_bioclip2_training and has_url boolean columns.
⚠️ WARNING: Merge code repo PR #23 (https://github.com/Imageomics/bioclip-image-search-lite/pull/23) BEFORE merging this data change. The app code in PR #23 adapts to the new optimized schema (ENUM types, url_prefixes table, new boolean columns). Merging this data PR first will break the live app.
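For reviewers unfamiliar with URL prefix deduplication, here is a minimal sketch of the general idea, not the actual migration code. It assumes each URL is split at its last `/`, with the shared prefix stored once in a lookup list (standing in for a `url_prefixes`-style table) and each row keeping only an integer prefix id plus the suffix. The function names `split_url`, `dedupe_prefixes`, and `rebuild_url` are hypothetical, and the real split rule may differ.

```python
def split_url(url: str) -> tuple[str, str]:
    """Split a URL into (prefix, suffix) at the last '/'.

    Assumption for illustration only: the real schema may split elsewhere.
    """
    head, sep, tail = url.rpartition("/")
    if not sep:
        # No slash at all: treat the whole string as the suffix.
        return "", url
    return head + sep, tail


def dedupe_prefixes(urls: list[str]) -> tuple[list[str], list[tuple[int, str]]]:
    """Return (prefixes, rows), where each row is (prefix_id, suffix).

    Each distinct prefix is stored once; its list index is its id.
    """
    prefix_ids: dict[str, int] = {}
    rows: list[tuple[int, str]] = []
    for url in urls:
        prefix, suffix = split_url(url)
        # Assign the next free id the first time a prefix is seen.
        pid = prefix_ids.setdefault(prefix, len(prefix_ids))
        rows.append((pid, suffix))
    # Dicts preserve insertion order, so position in this list equals the id.
    prefixes = list(prefix_ids)
    return prefixes, rows


def rebuild_url(prefixes: list[str], row: tuple[int, str]) -> str:
    """Reassemble the full URL from the prefix lookup and a stored row."""
    pid, suffix = row
    return prefixes[pid] + suffix
```

Because many image URLs from the same source share a long common prefix, storing that prefix once and joining it back at query time is one plausible way such a change contributes to a smaller file. It is also why app code reading the old single-column URL layout would need to be updated first.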
netzhang changed pull request title from the full description text above to Replace metadata.duckdb with optimized version (27GB → 14GB)
netzhang changed pull request status to merged