Replace metadata.duckdb with optimized version (27GB → 14GB)

#1
HDR Imageomics Institute org
edited Mar 13

Optimizations: ENUM types, URL prefix deduplication, taxonomy sort, native UUID type, and integer downcasting.
Backfilled 28.3M rows (observation.org + EOL) with recovered taxonomy and URLs.
Added in_bioclip2_training and has_url boolean columns.
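
To illustrate the URL prefix deduplication idea, here is a minimal sketch in plain Python. It is not the code used to build the database. The split point (the last `/`), the function names, and the example URLs are all assumptions made for illustration. The real `url_prefixes` table may be laid out differently.

```python
# Sketch of URL prefix deduplication (hypothetical helper names).
# Each distinct prefix is stored once; every row keeps only a small
# integer id plus the short suffix.

def split_url(url: str) -> tuple[str, str]:
    """Split a URL into (prefix, suffix) at the last '/' (assumed split rule)."""
    head, sep, tail = url.rpartition("/")
    return head + sep, tail


def dedupe_prefixes(urls: list[str]) -> tuple[list[str], list[tuple[int, str]]]:
    """Return (prefixes, rows), where rows hold (prefix_id, suffix) per URL."""
    prefix_ids: dict[str, int] = {}
    rows: list[tuple[int, str]] = []
    for url in urls:
        prefix, suffix = split_url(url)
        # Assign the next id the first time a prefix is seen.
        pid = prefix_ids.setdefault(prefix, len(prefix_ids))
        rows.append((pid, suffix))
    # Dict insertion order matches id order, so the list index is the id.
    return list(prefix_ids), rows


def rebuild(prefixes: list[str], rows: list[tuple[int, str]]) -> list[str]:
    """Reconstruct the full URLs from the prefix table and the rows."""
    return [prefixes[pid] + suffix for pid, suffix in rows]


# Made-up example URLs, not taken from the dataset.
urls = [
    "https://example.org/images/a.jpg",
    "https://example.org/images/b.jpg",
    "https://example.net/media/c.png",
]
prefixes, rows = dedupe_prefixes(urls)
```

Because many image URLs from the same source share a long common prefix, storing that prefix once and joining it back at query time can shrink the URL column considerably. The saving depends on how repetitive the prefixes are.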

⚠️ WARNING: Merge code repo PR #23 (https://github.com/Imageomics/bioclip-image-search-lite/pull/23) BEFORE merging this data change. The app code in PR #23 adapts to the new optimized schema (ENUM types, url_prefixes table, new boolean columns). Merging this data PR first will break the live app.
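
The merge-order hazard can be pictured as a schema compatibility check. The sketch below is an assumption-laden illustration, not code from PR #23. The main table name `metadata` is hypothetical, and the schema is passed in as a plain mapping of table name to column names. In practice that mapping could be read from DuckDB's `information_schema.columns`.

```python
# Sketch of a startup guard that reports which parts of the optimized
# schema are missing. Table name "metadata" is hypothetical; the
# url_prefixes table and the two boolean columns come from the PR text.

REQUIRED: dict[str, set[str]] = {
    "url_prefixes": set(),  # the table must exist; its columns are not checked here
    "metadata": {"in_bioclip2_training", "has_url"},
}


def missing_schema_parts(schema: dict[str, set[str]]) -> list[str]:
    """Return missing tables and 'table.column' names, empty if compatible."""
    missing: list[str] = []
    for table, columns in REQUIRED.items():
        if table not in schema:
            missing.append(table)
            continue
        for column in sorted(columns - set(schema[table])):
            missing.append(f"{table}.{column}")
    return missing


# A made-up pre-optimization schema lacks the new table and columns.
old_schema = {"metadata": {"uuid", "url"}}
problems = missing_schema_parts(old_schema)
```

A guard like this would let an app fail with a clear message when code and data are out of sync, instead of failing on the first query.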

netzhang changed pull request title from the full description text above (title plus body) to Replace metadata.duckdb with optimized version (27GB → 14GB)
netzhang changed pull request status to merged