---
language: en
license: apache-2.0
tags:
- marine-biology
- metagenomics
- environmental-modeling
- protein-domains
- tara-oceans
- pfam
- pytorch
library_name: pytorch
pipeline_tag: tabular-regression
---

# ELF-NET: Environment-Linked Functional Network

Bidirectional neural network checkpoints linking marine environmental variables to microalgal protein domain (Pfam) abundance profiles from the TARA Oceans metagenomic dataset.

## Model Description

ELF-NET consists of two complementary prediction directions:

### env2pfam (Environment → Pfam Abundance)
Predicts the abundance of every Pfam protein domain in the dataset (17,245 or 20,318, depending on the extraction strategy) at a marine sampling site, given 94 environmental features (30 oceanographic/atmospheric variables + 64 AlphaEarth spectral eigenvectors).

### pfam2env (Pfam Abundance → Environmental Features)
Predicts 64 environmental features from observed Pfam domain abundance profiles (9,611 input domains).

## Repository Structure

```
├── env2pfam/
│   ├── algagpt_full/     # AlgaGPT-extracted proteomes, full architecture
│   ├── algagpt_light/    # AlgaGPT-extracted proteomes, light architecture
│   ├── pythia_full/      # LA4SR-Pythia-extracted proteomes, full architecture
│   └── pythia_light/     # LA4SR-Pythia-extracted proteomes, light architecture
├── pfam2env/
│   ├── full/             # Full architecture
│   └── light/            # Light architecture
└── README.md
```

Each subdirectory contains:
- `best_model.pt`: PyTorch checkpoint (model_state_dict, optimizer_state_dict, best_val_loss)
- `config.json`: Hyperparameters and feature lists
- `final_metrics.json`: Train/val/test metrics
- `training_history.json`: Per-epoch training curves
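
After downloading the repository, it can help to confirm that every variant directory holds the four files listed above. The following is a minimal sketch using only the standard library; `missing_files` is a hypothetical helper, and the repository root path is whatever location you cloned or downloaded to.

```python
from pathlib import Path

# Variant directories and per-variant files, as listed in this section.
VARIANT_DIRS = [
    "env2pfam/algagpt_full",
    "env2pfam/algagpt_light",
    "env2pfam/pythia_full",
    "env2pfam/pythia_light",
    "pfam2env/full",
    "pfam2env/light",
]
EXPECTED_FILES = [
    "best_model.pt",
    "config.json",
    "final_metrics.json",
    "training_history.json",
]


def missing_files(repo_root):
    """Return the relative paths of expected files that are not present."""
    root = Path(repo_root)
    return [
        f"{variant}/{name}"
        for variant in VARIANT_DIRS
        for name in EXPECTED_FILES
        if not (root / variant / name).is_file()
    ]


# "." assumes the script runs from the repository root.
missing = missing_files(".")
total = len(VARIANT_DIRS) * len(EXPECTED_FILES)
print(f"{len(missing)} of {total} expected files are missing")
```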

## Architectures

### env2pfam Full
```
Input(94) → Linear(512) + BN + ReLU + Dropout(0.2)
          → Linear(1024) + BN + ReLU + Dropout(0.2)
          → Linear(2048) + BN + ReLU + Dropout(0.2)
          → Linear(4096) + BN + ReLU + Dropout(0.2)
          → Linear(output_dim)
```

### env2pfam Light
```
Input(94) → Linear(256) + BN + ReLU + Dropout(0.2)
          → Linear(512) + BN + ReLU + Dropout(0.2)
          → Linear(1024) + BN + ReLU + Dropout(0.2)
          → Linear(2048) + BN + ReLU + Dropout(0.2)
          → Linear(output_dim)
```
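
The two env2pfam diagrams differ only in their hidden widths, so both can be sketched with one builder. This is a minimal PyTorch sketch, not the released model class: `build_env2pfam` is a hypothetical name, "BN" is assumed to mean `nn.BatchNorm1d`, and the resulting parameter names may not match the keys stored in `best_model.pt`.

```python
import torch
from torch import nn

# Hidden widths read off the two diagrams above.
HIDDEN_DIMS = {
    "full": (512, 1024, 2048, 4096),
    "light": (256, 512, 1024, 2048),
}


def build_env2pfam(variant, output_dim, input_dim=94, dropout=0.2):
    """Hypothetical builder: stacks Linear + BN + ReLU + Dropout blocks."""
    layers = []
    prev = input_dim
    for width in HIDDEN_DIMS[variant]:
        layers += [
            nn.Linear(prev, width),
            nn.BatchNorm1d(width),  # assumption: "BN" is 1-D batch norm
            nn.ReLU(),
            nn.Dropout(dropout),
        ]
        prev = width
    layers.append(nn.Linear(prev, output_dim))  # plain linear regression head
    return nn.Sequential(*layers)


# 17,245 is the LA4SR-Pythia output dimension.
model = build_env2pfam("light", output_dim=17245)
model.eval()  # eval mode: BN uses running statistics, dropout is disabled
with torch.no_grad():
    out = model(torch.randn(8, 94))
print(out.shape)  # torch.Size([8, 17245])
```

Before loading real weights into a model built this way, compare its `state_dict()` keys and shapes with those in the checkpoint.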

### pfam2env Full
```
InputBatchNorm(9611) → Linear(2048) + ReLU + Dropout
                     → Linear(512) + ReLU + Dropout
                     → Linear(128) + ReLU + Dropout
                     → Linear(64)
```

### pfam2env Light
```
InputBatchNorm(9611) → Linear(512) + ReLU + Dropout
                     → Linear(256) + ReLU + Dropout
                     → Linear(128) + ReLU + Dropout
                     → Linear(64)
```
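
The pfam2env diagrams can be sketched in the same spirit. The dropout rate is not given in the diagrams, so it is a required argument here and the value used below is only a placeholder; `build_pfam2env` is a hypothetical name, and "InputBatchNorm" is assumed to be an `nn.BatchNorm1d` applied to the raw input.

```python
import torch
from torch import nn

# Hidden widths read off the two pfam2env diagrams above.
PFAM2ENV_HIDDEN = {"full": (2048, 512, 128), "light": (512, 256, 128)}


def build_pfam2env(variant, dropout, input_dim=9611, output_dim=64):
    """Hypothetical builder for the pfam2env diagrams."""
    # Assumption: "InputBatchNorm" normalizes the raw abundance vector.
    layers = [nn.BatchNorm1d(input_dim)]
    prev = input_dim
    for width in PFAM2ENV_HIDDEN[variant]:
        layers += [nn.Linear(prev, width), nn.ReLU(), nn.Dropout(dropout)]
        prev = width
    layers.append(nn.Linear(prev, output_dim))
    return nn.Sequential(*layers)


# dropout=0.3 is a placeholder; take the real rate from config.json.
model = build_pfam2env("full", dropout=0.3).eval()
with torch.no_grad():
    env_pred = model(torch.rand(4, 9611))
print(env_pred.shape)  # torch.Size([4, 64])
```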

## Performance

### env2pfam (Environment → Pfam)

| Variant | Dataset | Output Dim | LR | Test R² | Test MSE | Test MAE |
|---|---|---|---|---|---|---|
| **pythia_full** | LA4SR-Pythia | 17,245 | 1e-3 | **0.1487** | 14.597 | 2.411 |
| pythia_light | LA4SR-Pythia | 17,245 | 1e-4 | 0.1432 | 14.561 | 2.454 |
| algagpt_full | AlgaGPT | 20,318 | 1e-3 | 0.1189 | 14.006 | 2.381 |
| algagpt_light | AlgaGPT | 20,318 | 1e-4 | 0.1070 | 14.136 | 2.415 |

R² is the mean across all output Pfam dimensions. The modest R² values reflect the high dimensionality of the output space (roughly 17,000 to 20,000 Pfam domains) and the inherent stochasticity of metagenomic sampling.
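
To make "mean across all output dimensions" concrete, the sketch below computes R² separately for each output column and then takes the unweighted average. It is an illustration in plain Python on toy numbers, not the evaluation script used for the table; in particular, how constant columns (zero variance) were handled is not stated here, and this sketch simply assumes there are none.

```python
def r2_per_output(y_true, y_pred):
    """R^2 for each output column; rows are samples."""
    n_outputs = len(y_true[0])
    scores = []
    for j in range(n_outputs):
        true_col = [row[j] for row in y_true]
        pred_col = [row[j] for row in y_pred]
        mean_j = sum(true_col) / len(true_col)
        ss_res = sum((t - p) ** 2 for t, p in zip(true_col, pred_col))
        ss_tot = sum((t - mean_j) ** 2 for t in true_col)
        scores.append(1.0 - ss_res / ss_tot)  # assumes the column is not constant
    return scores


def mean_r2(y_true, y_pred):
    """Unweighted mean of the per-output R^2 values."""
    scores = r2_per_output(y_true, y_pred)
    return sum(scores) / len(scores)


# Toy data: 4 samples, 2 outputs. Output 0 is predicted exactly,
# output 1 is predicted by its own mean (5.0).
y_true = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
y_pred = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [4.0, 5.0]]
print(r2_per_output(y_true, y_pred))  # [1.0, 0.0]
print(mean_r2(y_true, y_pred))        # 0.5
```

With thousands of outputs, a few well-predicted domains can coexist with many near-zero ones, so the per-output distribution is usually more informative than the mean alone.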

### pfam2env (Pfam → Environment)

| Variant | Input Dim | LR | Test R² | Test MSE | Test MAE |
|---|---|---|---|---|---|
| full | 9,611 | 1e-3 | -0.0057 | 0.00931 | 0.0724 |
| light | 9,611 | 1e-3 | -0.0055 | 0.00931 | 0.0724 |

The slightly negative R² values indicate performance no better than a mean-prediction baseline, which scores R² = 0. These checkpoints document the pfam→env direction of the bidirectional framework and are included for completeness and reproducibility.
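
A small worked example shows how a slightly negative R² can arise. Any constant prediction c gives R² = -n(c - ȳ)² / SS_tot, which is zero only when c equals the mean ȳ of the evaluated targets and negative otherwise. Predicting the training-set mean on a test set with a slightly different mean therefore lands just below zero. The numbers below are invented for illustration and are not taken from the ELF-NET data.

```python
def r2(y_true, y_pred):
    """Coefficient of determination for a single output."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented toy targets; the test mean (0.51) differs slightly from the train mean (0.50).
train_targets = [0.40, 0.50, 0.60, 0.50]
test_targets = [0.41, 0.51, 0.61, 0.51]

train_mean = sum(train_targets) / len(train_targets)
baseline_pred = [train_mean] * len(test_targets)  # constant prediction

score = r2(test_targets, baseline_pred)
print(round(score, 3))  # -0.02
```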

## Input Features (env2pfam)

**30 environmental variables:**
- Air temperature (mean, max, min, range °C)
- Precipitation (mean mm)
- Solar radiation (MJ/m²)
- Elevation (m), bathymetry (m), distance to coast (km)
- Land cover class
- Sea surface temperature (SST mean, max, min, range °C; MODIS SST mean)
- Chlorophyll-a (mean, max, min mg/m³)
- Normalized fluorescence line height (NFLH mean)
- Particulate organic carbon (POC mean mg/m³)
- Remote sensing reflectance (Rrs at 412, 443, 469, 488, 531, 547, 555, 645, 667, 678 nm)

**64 AlphaEarth spectral eigenvectors** (A00–A63)
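
The feature counts above add up to the 94-dimensional env2pfam input. The sketch below tallies the 30 environmental variables per bullet and concatenates them with the 64 AlphaEarth values. The group labels, the `assemble_features` helper, and the "environmental first, AlphaEarth second" ordering are assumptions for illustration; the authoritative column names and order are the feature lists in `config.json`.

```python
# Number of variables in each bullet of the list above.
# The keys are hypothetical labels, not column names from config.json.
ENV_GROUP_SIZES = {
    "air_temperature": 4,          # mean, max, min, range
    "precipitation": 1,
    "solar_radiation": 1,
    "elevation_bathymetry_coast": 3,
    "land_cover": 1,
    "sea_surface_temperature": 5,  # mean, max, min, range, MODIS mean
    "chlorophyll_a": 3,            # mean, max, min
    "nflh": 1,
    "poc": 1,
    "rrs": 10,                     # ten wavelengths
}
N_ENV = sum(ENV_GROUP_SIZES.values())

# Assumption: AlphaEarth columns are named A00..A63, as the range above suggests.
ALPHAEARTH_NAMES = [f"A{i:02d}" for i in range(64)]


def assemble_features(env_values, alphaearth_values):
    """Concatenate both groups into one flat vector (ordering is an assumption)."""
    if len(env_values) != N_ENV:
        raise ValueError(f"expected {N_ENV} environmental values, got {len(env_values)}")
    if len(alphaearth_values) != len(ALPHAEARTH_NAMES):
        raise ValueError(
            f"expected {len(ALPHAEARTH_NAMES)} AlphaEarth values, got {len(alphaearth_values)}"
        )
    return list(env_values) + list(alphaearth_values)


features = assemble_features([0.0] * N_ENV, [0.0] * 64)
print(N_ENV, len(ALPHAEARTH_NAMES), len(features))  # 30 64 94
```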

## Datasets

Two LLM-mediated proteome extraction strategies were applied to TARA Oceans metagenomic assemblies:

- **LA4SR-Pythia**: 2,049 samples → 17,245 Pfam domains
- **AlgaGPT**: 2,044 samples → 20,318 Pfam domains

Both used SNAP gene prediction followed by hmmsearch against the Pfam database. The different extraction strategies yield different protein sets and domain profiles from the same underlying metagenomes.

## Training Details

- **Framework**: PyTorch
- **Loss**: MSE
- **Optimizer**: Adam (weight_decay=1e-4 for pfam2env)
- **Scheduler**: Cosine annealing (pfam2env)
- **Early stopping**: Patience 20 (env2pfam) / 30 (pfam2env)
- **Batch size**: 32
- **Max epochs**: 200
- **Seed**: 42
- **Hardware**: CUDA GPU
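
The settings above can be wired together roughly as follows. This is a minimal sketch using the pfam2env values (weight decay, cosine annealing, patience 30), not the original training script. The data are synthetic, the model is a toy stand-in, the cosine schedule is assumed to span the full 200 epochs, and early stopping is assumed to monitor validation MSE.

```python
import copy

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)  # seed from the list above

# Synthetic stand-in data; the real features and targets are not used here.
X, Y = torch.randn(256, 94), torch.randn(256, 16)
train_loader = DataLoader(TensorDataset(X[:192], Y[:192]), batch_size=32, shuffle=True)
val_X, val_Y = X[192:], Y[192:]

model = nn.Sequential(nn.Linear(94, 64), nn.ReLU(), nn.Linear(64, 16))  # toy model
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
max_epochs, patience = 200, 30
# Assumption: a single cosine cycle spanning max_epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_epochs)

best_val_loss, best_state, stale_epochs = float("inf"), None, 0
for epoch in range(max_epochs):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()
    scheduler.step()  # one scheduler step per epoch

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(val_X), val_Y).item()

    if val_loss < best_val_loss:
        # New best: remember the weights and reset the patience counter.
        best_val_loss, stale_epochs = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())
    else:
        stale_epochs += 1
        if stale_epochs >= patience:
            break  # early stopping

# Same top-level keys as the released best_model.pt files.
checkpoint = {
    "model_state_dict": best_state,
    "optimizer_state_dict": optimizer.state_dict(),
    "best_val_loss": best_val_loss,
}
print(f"stopped after epoch {epoch + 1}, best val MSE {best_val_loss:.4f}")
```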

## Usage

```python
import json

import torch

# Load model config (hyperparameters and feature lists)
with open("env2pfam/pythia_full/config.json") as f:
    config = json.load(f)

# Load checkpoint. weights_only=False unpickles arbitrary Python objects,
# so only use it with checkpoint files you trust.
checkpoint = torch.load(
    "env2pfam/pythia_full/best_model.pt",
    map_location="cpu",
    weights_only=False,
)
state_dict = checkpoint["model_state_dict"]

# Reconstruct model (requires the ELF-NET model class)
# model.load_state_dict(state_dict)
```
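
Because the final step requires the ELF-NET model class, it can be useful to inspect the tensor names and shapes in `model_state_dict` first and check that any reconstructed model matches them. The sketch below uses a small stand-in checkpoint written to a temporary directory so it runs without the real files; the toy layer sizes are invented, and only the top-level key `model_state_dict` is taken from this repository.

```python
import os
import tempfile

import torch
from torch import nn


def make_toy():
    """Invented stand-in network; not an ELF-NET architecture."""
    return nn.Sequential(nn.Linear(94, 8), nn.BatchNorm1d(8), nn.ReLU(), nn.Linear(8, 3))


# Write and reload a stand-in checkpoint with the same top-level layout.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "best_model.pt")
    torch.save({"model_state_dict": make_toy().state_dict(), "best_val_loss": 0.0}, path)
    checkpoint = torch.load(path, map_location="cpu", weights_only=False)

state_dict = checkpoint["model_state_dict"]

# Tensor names and shapes reveal the layer widths the model class must reproduce.
shapes = {name: tuple(tensor.shape) for name, tensor in state_dict.items()}
for name, shape in shapes.items():
    print(name, shape)

# load_state_dict is strict by default: it raises on missing/unexpected keys
# or on shape mismatches, which makes it a cheap compatibility check.
rebuilt = make_toy()
result = rebuilt.load_state_dict(state_dict)
print(result.missing_keys, result.unexpected_keys)  # [] []
```

For the real checkpoints, replace the temporary file with the path to `best_model.pt` and `make_toy()` with the actual model class.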

## Citation

If you use these checkpoints, please cite the associated manuscript (citation forthcoming).

## License

Apache 2.0