---
license: apache-2.0
language:
- en
tags:
- marine-biology
- biodiversity
- oceanography
- ecology
- species-identification
- foundation-model
datasets:
- OBIS
- WoRMS
- FishBase
- GBIF
pipeline_tag: text-generation
---

# NatureCode Ocean Life

A 9.91-billion-parameter foundation model specialized for marine biodiversity, ocean ecosystems, and aquatic life sciences.

## Model Status: In Training (Checkpoint 11000/50000)

This is an intermediate checkpoint at step 11,000 of 50,000 total training steps (22% complete).

### Current Training Progress

- **Current Step:** 11,000
- **Target Steps:** 50,000
- **Progress:** 22%

## What's Needed to Complete the Model

### Remaining Training

- Continue training from step 11,000 to step 50,000 (39,000 more steps)
- Estimated compute: ~30 hours on 8x H100 GPUs
- Training data: 871,304 marine biology examples

### Infrastructure Requirements

- 8x NVIDIA H100 80GB GPUs (a3-highgpu-8g)
- Training config: batch_size=8, grad_accum=4, lr=3e-4
- Mixed precision: BF16 with gradient checkpointing

### Data Sources Already Included

- OBIS (Ocean Biodiversity Information System): 62GB
- WoRMS (World Register of Marine Species): 4K records
- FishBase: 4K records
- GBIF (marine subset): 1.8K records
- arXiv marine papers: ~1,000 papers
- OceanInstruct Q&A: 4.9MB
- Curated marine text: 3.4GB

### Data Gaps (Future Improvements)

- Freshwater ecosystems
- Climate/ocean interaction data
- Satellite imagery integration
- Acoustic/bioacoustic data
- eDNA sequences

## Model Architecture

| Specification | Value |
|---------------|-------|
| Parameters | 9.91B |
| Hidden Size | 4096 |
| Layers | 48 |
| Attention Heads | 32 |
| FFN Size | 16384 |
| Context Length | 4096 tokens |
| Vocab Size | 50257 (GPT-2) |

## Usage

```python
import torch
from transformers import GPT2Tokenizer

# Load the raw checkpoint weights onto the CPU
state_dict = torch.load('pytorch_model.bin', map_location='cpu')

# Tokenizer: assumes the stock GPT-2 vocabulary (50,257 tokens) listed above
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# The model uses a custom architecture; see train.py for the model class,
# or wait for the final release with HuggingFace transformers integration.
```

## License

Apache 2.0

## Citation

```bibtex
@misc{naturecode-ocean-life-2026,
  title={NatureCode Ocean Life: A Foundation Model for Marine Biodiversity},
  author={NatureCode Team},
  year={2026},
  url={https://huggingface.co/naturecodeproject/ocean-life}
}
```
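
## Appendix: Sanity-Checking the Numbers

The training-progress and architecture figures in this card can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from `train.py`. The function names are hypothetical, and the parameter estimate assumes a GPT-2-style decoder (tied input/output embeddings, learned position embeddings, biases on all linear layers, two LayerNorms per block), which the card does not confirm.

```python
"""Cross-check the figures quoted in this model card (illustrative sketch)."""

# Values copied from the model card.
CURRENT_STEP = 11_000
TARGET_STEPS = 50_000
ESTIMATED_HOURS_REMAINING = 30  # "~30 hours on 8x H100 GPUs"

HIDDEN_SIZE = 4096
NUM_LAYERS = 48
FFN_SIZE = 16_384
VOCAB_SIZE = 50_257
CONTEXT_LENGTH = 4096


def training_progress(current_step, target_steps):
    """Return (remaining steps, percent complete)."""
    remaining = target_steps - current_step
    percent = 100.0 * current_step / target_steps
    return remaining, percent


def implied_seconds_per_step(remaining_steps, hours):
    """Average step time implied by the card's compute estimate."""
    return hours * 3600.0 / remaining_steps


def estimate_gpt2_style_params(hidden, layers, ffn, vocab, context):
    """Approximate parameter count.

    ASSUMPTION: GPT-2-style block layout; the real model class may differ.
    """
    attention = 4 * hidden * hidden + 4 * hidden   # Q, K, V, output projections + biases
    mlp = 2 * hidden * ffn + ffn + hidden          # up/down projections + biases
    layer_norms = 2 * 2 * hidden                   # two LayerNorms (scale + shift) per block
    per_layer = attention + mlp + layer_norms
    embeddings = vocab * hidden + context * hidden  # token + learned position embeddings
    final_layer_norm = 2 * hidden
    return layers * per_layer + embeddings + final_layer_norm


if __name__ == "__main__":
    remaining, percent = training_progress(CURRENT_STEP, TARGET_STEPS)
    print(f"Remaining steps: {remaining:,} ({percent:.0f}% complete)")

    step_time = implied_seconds_per_step(remaining, ESTIMATED_HOURS_REMAINING)
    print(f"Implied average step time: {step_time:.2f} s")

    total = estimate_gpt2_style_params(
        HIDDEN_SIZE, NUM_LAYERS, FFN_SIZE, VOCAB_SIZE, CONTEXT_LENGTH
    )
    print(f"Estimated parameters: {total / 1e9:.2f}B")
```

Under these assumptions the estimate comes to roughly 9.89B parameters, within about 0.2% of the stated 9.91B. The card does not list the architectural details that would account for the remaining gap, so treat the estimate as a plausibility check and not as an exact count.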