maxxxzdn committed on
Commit
5f226eb
·
verified ·
1 Parent(s): 71ee02b

Initial release: Mosaic weather model (era5 + hres variants)

.gitattributes CHANGED
@@ -33,3 +33,9 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ figures_weather/bsa.jpg filter=lfs diff=lfs merge=lfs -text
+ figures_weather/bsa_runtime.jpg filter=lfs diff=lfs merge=lfs -text
+ figures_weather/hurricane_tracking.jpg filter=lfs diff=lfs merge=lfs -text
+ figures_weather/main.jpg filter=lfs diff=lfs merge=lfs -text
+ figures_weather/results_hres.jpg filter=lfs diff=lfs merge=lfs -text
+ figures_weather/results_spectra_pareto.jpg filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,10 @@
+ __pycache__/
+ *.pyc
+ *.pyo
+ *.pyd
+ *.egg-info/
+ dist/
+ build/
+ .env
+ *.npz
+ checkpoints/
README.md ADDED
@@ -0,0 +1,218 @@
+ ---
+ license: cc-by-nc-4.0
+ library_name: pytorch
+ tags:
+ - weather
+ - weather-forecasting
+ - climate
+ - atmospheric-science
+ - sparse-attention
+ - transformer
+ - probabilistic-forecasting
+ ---
+
+ # Mosaic — Block-Sparse Attention for Weather Forecasting
+
+ **Mosaic** is a probabilistic weather forecasting model that operates on native-resolution grids via mesh-aligned block-sparse attention. At 1.5° resolution with 214M parameters, Mosaic matches or outperforms models trained on 6× finer resolution on key variables, and individual ensemble members exhibit near-perfect spectral alignment across all resolved frequencies. A 24-member, 10-day forecast takes under 12 s on a single H100 GPU.
+
+ > **(Sparse) Attention to the Details: Preserving Spectral Fidelity in ML-based Weather Forecasting Models** \
+ > Maksim Zhdanov, Ana Lucic, Max Welling, Jan-Willem van de Meent \
+ > *ICML 2026* · [arXiv:2604.16429](https://arxiv.org/abs/2604.16429) · [GitHub](https://github.com/maxxxzdn/mosaic)
+
+ ![Spectral fidelity and skill–speed Pareto](figures_weather/results_spectra_pareto.jpg)
+
+ ## TL;DR
+
+ Mosaic addresses two distinct failure modes behind spectral degradation in ML-based weather prediction:
+
+ 1. **Spectral damping** caused by deterministic training against ensemble means. Mosaic addresses this with learned functional perturbations that produce ensemble members that preserve realistic spectral variability.
+ 2. **High-frequency aliasing** caused by compressive encoding onto a coarse latent grid. Mosaic operates at native resolution via block-sparse attention before any coarsening, eliminating the compress-first bottleneck.
+
+ The block-sparse attention captures long-range dependencies at **linear** cost by sharing keys and values across spatially adjacent queries arranged on the HEALPix mesh. Each query block jointly selects which key blocks to attend to.
+
+ ## Published Variants
+
+ This repository ships **two trained variants**. They share the same Mosaic architecture and 82-channel variable set but differ in training data, time cadence, input history, neighbour count, and normalization statistics.
+
+ | Variant | Training data | Native step | Input history | k-neighbours | Suggested input zarr |
+ |---------|---------------|-------------|---------------|--------------|----------------------|
+ | `era5` | ERA5 reanalysis only | 24 h | 2 states (48 h) | 24 | WB2 ERA5 1.5° |
+ | `hres` | ERA5 pretrain + HRES finetune | 6 h | 4 states (24 h) | 20 | WB2 HRES-fc0 1.5° |
+
+ Choose `era5` when initializing from reanalysis (it matches the training distribution); choose `hres` when initializing from HRES analysis or a similar operational state.
+
+ ## Architecture
+
+ **Inputs.** 82 atmospheric channels at 1.5° equiangular resolution (240 lon × 121 lat = 29 040 points) plus 3 static channels and sinusoidal day/year time encodings.
+
+ **Backbone.** A U-Net of transformer blocks over the HEALPix mesh, where spatial neighbours occupy contiguous memory and queries can be grouped into hardware-aligned blocks:
+
+ | Stage | Nside | Hidden dim | Heads | Enc / Dec depth |
+ |------------|------:|-----------:|------:|----------------:|
+ | Stage 1 | 64 | 768 | 12 | 4 / 2 |
+ | Stage 2 | 32 | 1024 | 16 | 4 / 2 |
+ | Bottleneck | 16 | 1280 | 20 | 2 |
+
+ Grouped-Query Attention with ratio 4 (3, 4, and 5 KV heads across the three stages), 2D RoPE on (longitude, latitude), and additive noise injection in SwiGLU gates for ensemble generation. ~214M parameters total.
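The HEALPix mesh at resolution `Nside` has `12 * Nside**2` pixels, so the token count at each stage, and the number of 128-point sparse blocks it yields (matching `sparse_block_size=128` in `inference.py`), follows directly. A quick sanity check:

```python
# Token counts per U-Net stage on the HEALPix mesh (Npix = 12 * Nside^2),
# and how many 128-point sparse blocks each stage holds. Stage sizes are
# taken from the table above.
def healpix_npix(nside: int) -> int:
    return 12 * nside * nside

for name, nside in [("Stage 1", 64), ("Stage 2", 32), ("Bottleneck", 16)]:
    npix = healpix_npix(nside)
    print(f"{name}: nside={nside} -> {npix} tokens, {npix // 128} blocks of 128")
    # Stage 1: 49152 tokens / 384 blocks; Stage 2: 12288 / 96; Bottleneck: 3072 / 24
```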
+
+ ![Block-sparse attention for weather forecasting](figures_weather/bsa.jpg)
+
+ **Block-sparse attention.** Three branches combined by learned gates: (i) **compression** — block-to-block coarse attention captures broad synoptic patterns at $\mathcal{O}(N^2/b^2)$; (ii) **selection** — each query block top-k-selects fine-scale key blocks at $\mathcal{O}(Nnb)$; (iii) **local** — full attention inside each block at $\mathcal{O}(Nb)$. Spatially close points occupy contiguous memory on the HEALPix mesh, enabling coalesced GPU reads and hardware-aligned block computation. Implemented as a single Triton kernel; in practice up to **61.8× faster than dense attention** and **9.4× faster than NSA**.
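As an illustration of the selection branch, here is a minimal NumPy sketch in which each query block scores every key block through mean-pooled block representatives and keeps only its top-k. The shapes and scoring rule are illustrative simplifications, not the released Triton kernel:

```python
import numpy as np

# Toy sketch of the "selection" branch: each query block scores all key
# blocks via mean-pooled representatives, then attends only to its top-k.
rng = np.random.default_rng(0)
N, b, d, k = 1024, 128, 64, 4           # tokens, block size, channels, blocks kept

q = rng.standard_normal((N, d))
kv = rng.standard_normal((N, d))

nb = N // b                              # 8 blocks of 128 tokens
q_rep = q.reshape(nb, b, d).mean(axis=1)   # (nb, d) query-block representatives
k_rep = kv.reshape(nb, b, d).mean(axis=1)  # (nb, d) key-block representatives

scores = q_rep @ k_rep.T                 # (nb, nb) block-to-block affinities
topk = np.argsort(-scores, axis=-1)[:, :k]  # (nb, k) selected key blocks per query block

# Each query block now attends to k*b = 512 keys instead of all N = 1024.
selected = kv.reshape(nb, b, d)[topk]    # (nb, k, b, d)
selected = selected.reshape(nb, k * b, d)
print(selected.shape)                    # (8, 512, 64)
```

Because the key/value gather is shared by all `b` queries in a block, the cost grows linearly in `N` for fixed `k` and `b`.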
+
+ <p align="center">
+ <img src="figures_weather/healpix.jpg" alt="HEALPix mesh" width="48%">
+ <img src="figures_weather/bsa_runtime.jpg" alt="Runtime scaling" width="48%">
+ </p>
+
+ ### Variables (82 channels)
+
+ - **Surface (4):** `2m_temperature`, `10m_u_component_of_wind`, `10m_v_component_of_wind`, `mean_sea_level_pressure`
+ - **Pressure level (6 × 13 = 78):** `geopotential`, `specific_humidity`, `temperature`, `u_component_of_wind`, `v_component_of_wind`, `vertical_velocity` at levels [50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000] hPa
+ - **Static (3, conditioning only — not in output):** `geopotential_at_surface`, `land_sea_mask`, `soil_type`
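A hedged sketch of how the 82 channel names can be assembled from the lists in `config.py`, assuming surface channels come first and each pressure-level variable is expanded over all 13 levels with a `_{level}` suffix (the ordering that makes a name like `geopotential_500` resolvable):

```python
# Reconstruct the 82 channel names from config.py's lists. The ordering
# (surface first, then each pressure-level variable expanded over levels)
# is an assumption for illustration.
SL_VARS = ["2m_temperature", "10m_u_component_of_wind",
           "10m_v_component_of_wind", "mean_sea_level_pressure"]
PL_VARS = ["geopotential", "specific_humidity", "temperature",
           "u_component_of_wind", "v_component_of_wind", "vertical_velocity"]
LEVELS = [50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000]

variables = SL_VARS + [f"{v}_{lev}" for v in PL_VARS for lev in LEVELS]
print(len(variables))                       # 82
print(variables.index("geopotential_500"))  # 11
```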
+
+ ## Results
+
+ The headline plot is at the top of this page: individual ensemble members preserve realistic kinetic-energy spectra (left, 1.5°; centre, 0.25°), and Mosaic sits on the favourable end of the skill–speed–memory Pareto front (right). All metrics are computed at 240 h lead time, over 720 initial conditions spanning the 2020 test year (1.5° benchmark) and a single 6 h forecast (0.25° benchmark).
+
+ On the 0.25° HRES benchmark, Mosaic competes with state-of-the-art 0.25° models despite operating at 1.5°:
+
+ ![HRES benchmark results](figures_weather/results_hres.jpg)
+
+ A case study of **Hurricane Ian (2022)** shows Mosaic's ensemble correctly bracketing the observed track 7 days ahead, with spread narrowing progressively as lead time decreases:
+
+ ![Hurricane Ian ensemble tracks](figures_weather/hurricane_tracking.jpg)
+
+ See the paper for full benchmark tables, CRPS curves, and spread-to-skill ratios.
+
+ ## Hardware Requirements
+
+ - **GPU:** any CUDA GPU; 16 GB is enough for a 1-member rollout, A100/H100 recommended for multi-member ensembles
+ - **Memory:** ~9 GB GPU RAM for a 1-member, 40-step (10-day) rollout in float16
+ - **Throughput:** 24-member, 10-day forecast in under 12 s on a single H100
+ - **CUDA:** 11.8+ with matching `triton` and `flash-attn` versions
+
+ ## Installation
+
+ ```bash
+ pip install -r requirements.txt
+ pip install flash-attn --no-build-isolation  # built separately; needs nvcc
+ ```
+
+ For reading data from Google Cloud Storage (WeatherBench2 zarr stores):
+
+ ```bash
+ pip install gcsfs
+ ```
+
+ ## Quick Start
+
+ ```bash
+ # ERA5 variant — 10-day forecast at 24 h resolution from ERA5 reanalysis
+ python inference.py --variant era5 \
+     --zarr gs://weatherbench2/datasets/era5/1959-2023_01_10-6h-240x121_equiangular_with_poles_conservative.zarr \
+     --init-time "2020-01-01T00:00" \
+     --steps 10 --members 1 \
+     --output forecast_era5.npz
+
+ # HRES variant — 10-day forecast at 6 h resolution from HRES initial conditions
+ python inference.py --variant hres \
+     --zarr gs://weatherbench2/datasets/hres_t0/2016-2022-6h-240x121_equiangular_with_poles_conservative.zarr \
+     --init-time "2022-01-01T00:00" \
+     --steps 40 --members 1 \
+     --output forecast_hres.npz
+
+ # Ensemble forecast (16 members) — change --members
+ python inference.py --variant hres --zarr <...> --init-time "2020-06-15T12:00" \
+     --steps 40 --members 16 --output ensemble.npz
+ ```
+
+ `--variant` selects the checkpoint, normalization statistics, history length, time stride, and neighbour count automatically. Pass `--checkpoint` or `--norm-stats` to override the bundled defaults.
+
+ ## Output Format
+
+ The output `.npz` file contains:
+
+ | Array | Shape | Description |
+ |-------|-------|-------------|
+ | `forecasts` | `(members, steps, 240, 121, 82)` | Predicted states in physical units |
+ | `variables` | `(82,)` | Variable names |
+ | `lead_time_hours` | `(steps,)` | Lead times (era5: 24, 48, …; hres: 6, 12, …) |
+ | `init_time` | scalar | Initialization timestamp |
+ | `longitude` | `(240,)` | Longitude values (0° to 358.5°) |
+ | `latitude` | `(121,)` | Latitude values, South→North (−90° to 90°) |
+
+ ### Reading the output
+
+ ```python
+ import numpy as np
+
+ data = np.load("forecast_era5.npz", allow_pickle=True)
+ forecasts = data['forecasts']         # (members, steps, 240, 121, 82)
+ variables = list(data['variables'])   # ['2m_temperature', ...]
+ lead_hours = data['lead_time_hours']  # e.g. [24, 48, ..., 240]
+
+ # Extract 500 hPa geopotential at 24 h lead time
+ z500_idx = variables.index('geopotential_500')
+ i_24h = int(np.where(lead_hours == 24)[0][0])
+ z500_24h = forecasts[0, i_24h, :, :, z500_idx]  # (240, 121) lon × lat
+ ```
+
+ ## Input Data Format
+
+ The model accepts ERA5 or HRES data in zarr format at 1.5° resolution with:
+
+ - **Grid:** 240 lon × 121 lat, equiangular with poles
+ - **Time:** 6-hourly timesteps (integer hours since an arbitrary origin, parsed from the `units` zarr attribute, or as datetime64)
+ - **Variables:** all 10 atmospheric variables listed above; per-variable layout is auto-detected from `_ARRAY_DIMENSIONS` (either `(time, latitude, longitude)` or `(time, longitude, latitude)`), and latitude is flipped if stored North→South
+
+ Compatible zarr stores from [WeatherBench2](https://weatherbench2.readthedocs.io/):
+
+ ```
+ gs://weatherbench2/datasets/era5/1959-2023_01_10-6h-240x121_equiangular_with_poles_conservative.zarr
+ gs://weatherbench2/datasets/hres_t0/2016-2022-6h-240x121_equiangular_with_poles_conservative.zarr
+ ```
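The latitude-orientation rule above can be sketched in a few lines; this mirrors the flip performed by `load_initial_state` in `inference.py`, with synthetic data standing in for a real store:

```python
import numpy as np

# Some stores hold latitude North->South; the model expects South->North,
# so the latitude axis (and the matching data axis) is flipped when needed.
latitude = np.linspace(90, -90, 121)   # N->S ordering, as some stores provide
field = np.arange(240 * 121, dtype=np.float64).reshape(240, 121)  # (lon, lat) slice

if latitude[0] > latitude[-1]:         # detect N->S ordering
    latitude = latitude[::-1].copy()
    field = field[:, ::-1].copy()

print(latitude[0], latitude[-1])       # -90.0 90.0
print(field.shape)                     # (240, 121)
```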
+
+ ## Repository Contents
+
+ | File | Description |
+ |------|-------------|
+ | `inference.py` | Main inference script (variant-aware via `--variant {era5,hres}`) |
+ | `mosaic.py` | Mosaic U-Net transformer |
+ | `primitives.py` | Attention blocks, RoPE, HEALPix sampling, noise generator |
+ | `ops.py` | Triton block-sparse attention kernels |
+ | `utils.py` | HEALPix grid utilities |
+ | `base.py` | `WeatherModel` wrapper |
+ | `config.py` | Variable / level definitions |
+ | `dataset.py` | Metadata dataclasses |
+ | `norm_stats_era5.npz` | Normalization statistics for the `era5` variant |
+ | `norm_stats_hres.npz` | Normalization statistics for the `hres` variant |
+ | `static_vars.npz` | Static fields (orography, land–sea mask, soil type) — shared between variants |
+ | `era5_best.pt` | Trained checkpoint, `era5` variant (~1.7 GB) |
+ | `hres_best.pt` | Trained checkpoint, `hres` variant (~1.7 GB) |
+ | `figures_weather/` | Figures from the paper |
+
+ ## Limitations
+
+ Mosaic operates at 1.5° (~166 km), which cannot resolve mesoscale phenomena such as tropical-cyclone inner-core structure or individual severe thunderstorms. The block-sparse attention is designed to scale linearly with sequence length, so finer grids (e.g. 0.25°, ~700k tokens) are a natural next step but are not part of this release.
+
+ ## Citation
+
+ If you use Mosaic, please cite:
+
+ ```bibtex
+ @inproceedings{zhdanov2026mosaic,
+     title     = {(Sparse) Attention to the Details: Preserving Spectral Fidelity in ML-based Weather Forecasting Models},
+     author    = {Zhdanov, Maksim and Lucic, Ana and Welling, Max and van de Meent, Jan-Willem},
+     booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
+     year      = {2026},
+     url       = {https://arxiv.org/abs/2604.16429}
+ }
+ ```
+
+ ## License
+
+ Released under [CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/). Free for non-commercial research and educational use with attribution; commercial use requires a separate license. Underlying training data (ERA5, HRES) is subject to its own licensing terms set by ECMWF.
+
+ ## Acknowledgements
+
+ MZ acknowledges support from Microsoft Research AI4Science. JWvdM acknowledges support from the European Union Horizon Framework Programme (Grant agreement ID: 101120237). This work used the Dutch national e-infrastructure with the support of the SURF Cooperative using grant no. EINF-16923. Computations were partially performed using the UvA/FNWI HPC Facility.
base.py ADDED
@@ -0,0 +1,17 @@
+ import torch
+ from torch import nn
+ from dataset import WeatherMetadata
+
+
+ class WeatherModel(nn.Module):
+     """Weather forecasting model wrapper."""
+
+     def __init__(self, model: nn.Module, weather_metadata: WeatherMetadata):
+         super().__init__()
+         self.model = model
+         self.model.initialize_static_vars(weather_metadata.static_data, weather_metadata.longitude, weather_metadata.latitude)
+         self.model.initialize_interpolation(weather_metadata.longitude, weather_metadata.latitude)
+         self.weather_metadata = weather_metadata
+
+     def forward(self, norm_state: torch.Tensor, day_year_time: torch.Tensor, num_noise_samples: int):
+         return self.model(norm_state, day_year_time, num_noise_samples)
config.py ADDED
@@ -0,0 +1,25 @@
+ # Variable names and pressure levels for the Mosaic weather forecasting model.
+
+ SL_VARS: list[str] = [
+     "2m_temperature",
+     "10m_u_component_of_wind",
+     "10m_v_component_of_wind",
+     "mean_sea_level_pressure",
+ ]
+
+ PL_VARS: list[str] = [
+     "geopotential",
+     "specific_humidity",
+     "temperature",
+     "u_component_of_wind",
+     "v_component_of_wind",
+     "vertical_velocity",
+ ]
+
+ ST_VARS: list[str] = [
+     "geopotential_at_surface",
+     "land_sea_mask",
+     "soil_type",
+ ]
+
+ LEVELS: list[int] = [50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000]
dataset.py ADDED
@@ -0,0 +1,33 @@
+ """Metadata dataclasses for weather forecasting inference."""
+
+ import torch
+ from dataclasses import dataclass
+
+
+ @dataclass(frozen=True)
+ class NormalizationStats:
+     """Normalization statistics for state variables."""
+     state_mean: torch.Tensor
+     state_std: torch.Tensor
+     residual_mean: torch.Tensor
+     residual_std: torch.Tensor
+
+     def to(self, device) -> 'NormalizationStats':
+         return NormalizationStats(
+             state_mean=self.state_mean.to(device),
+             state_std=self.state_std.to(device),
+             residual_mean=self.residual_mean.to(device),
+             residual_std=self.residual_std.to(device),
+         )
+
+
+ @dataclass(frozen=True)
+ class WeatherMetadata:
+     """Metadata for the weather dataset."""
+     variables: list[str]
+     static_variables: list[str]
+     longitude: torch.Tensor
+     latitude: torch.Tensor
+     static_data: torch.Tensor
+     day_year_delta: torch.Tensor
+     norm_stats: NormalizationStats
era5_best.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:187df31a0caef3e934e4eb8a11506c9fea518ff0e02ce6d1f804fc6c8a78a940
+ size 1713557607
figures_weather/bsa.jpg ADDED

Git LFS Details

  • SHA256: 8da4601faed4832cc5339d78ef80f149ec841493146b91c9bdee7a623d83d0e9
  • Pointer size: 131 Bytes
  • Size of remote file: 302 kB
figures_weather/bsa_runtime.jpg ADDED

Git LFS Details

  • SHA256: 266b2199d88bf159212c3e26c232cc273258ce561a601f28d565e419d5e6299a
  • Pointer size: 131 Bytes
  • Size of remote file: 111 kB
figures_weather/healpix.jpg ADDED
figures_weather/hurricane_tracking.jpg ADDED

Git LFS Details

  • SHA256: c11d5a0477d2b5f3ad591e6077bb09176aaa6abbe7b4bb5ca62f42c06315ef98
  • Pointer size: 131 Bytes
  • Size of remote file: 268 kB
figures_weather/main.jpg ADDED

Git LFS Details

  • SHA256: 45fb47af2cef25871e5cad8f57b77a55dec907fdb3e42de5076e50b3f67c93dc
  • Pointer size: 131 Bytes
  • Size of remote file: 223 kB
figures_weather/results_hres.jpg ADDED

Git LFS Details

  • SHA256: fc651643d9605fec7545734943bc5d1f19c95caa85df384f441afee3fc22f900
  • Pointer size: 131 Bytes
  • Size of remote file: 509 kB
figures_weather/results_spectra_pareto.jpg ADDED

Git LFS Details

  • SHA256: 2aa7afae00e1a55312820883978caf1f5e8fbdf2a2340e0ea4fcce839606cb89
  • Pointer size: 131 Bytes
  • Size of remote file: 361 kB
hres_best.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6e0bc244382fa1b0f09ecdbd527c601d352cc5989697149742e8983f38a25e5e
+ size 1714571315
inference.py ADDED
@@ -0,0 +1,527 @@
+ """
+ Run autoregressive global weather forecasts with the Mosaic 1.5° model.
+
+ The model predicts 6-hourly atmospheric states autoregressively, supporting
+ both deterministic (1 member) and probabilistic (N members) forecasts.
+
+ Usage:
+     # ERA5 variant (24h steps), default checkpoint and norm stats inferred from --variant
+     python inference.py --variant era5 \\
+         --zarr gs://weatherbench2/datasets/era5/1959-2023_01_10-6h-240x121_equiangular_with_poles_conservative.zarr \\
+         --init-time "2020-01-01T00:00" --steps 10 --output forecast_era5.npz
+
+     # HRES variant (6h steps)
+     python inference.py --variant hres \\
+         --zarr gs://weatherbench2/datasets/hres_t0/2016-2022-6h-240x121_equiangular_with_poles_conservative.zarr \\
+         --init-time "2022-01-01T00:00" --steps 40 --output forecast_hres.npz
+
+ Input zarr format:
+     The zarr store must contain the following variables at 1.5° resolution
+     (240 lon × 121 lat, 6-hourly timesteps):
+     - Surface: 2m_temperature, 10m_u_component_of_wind, 10m_v_component_of_wind,
+       mean_sea_level_pressure
+     - Pressure-level (at 13 levels 50..1000 hPa): geopotential, specific_humidity,
+       temperature, u_component_of_wind, v_component_of_wind, vertical_velocity
+     - Coordinates: longitude (240,), latitude (121,), time (hours since 1959-01-01)
+
+ Output npz:
+     forecasts        float32 (members, steps, 240, 121, 82) – physical units
+     variables        list of 82 variable names
+     lead_time_hours  int32 (steps,) – multiples of step_stride*6h
+                      (era5: 24, 48, ...; hres: 6, 12, ...)
+     init_time        str – initialization timestamp
+
+ Hardware:
+     Requires a CUDA GPU. A 16 GB GPU is enough for 1 member; A100 80 GB recommended
+     for multi-member ensembles. float16 inference (~9 GB for 1 member, 40-step rollout).
+ """
+
+ import argparse
+ import math
+ import os
+ import sys
+ from dataclasses import dataclass
+ from pathlib import Path
+
+ import numpy as np
+ import pandas as pd
+ import torch
+ import zarr
+
+ # ---------------------------------------------------------------------------
+ # Model imports
+ # ---------------------------------------------------------------------------
+ from config import SL_VARS, PL_VARS, ST_VARS, LEVELS
+ from dataset import NormalizationStats, WeatherMetadata
+ from mosaic import Transformer, ModelConfig, StageConfig, BottleneckConfig
+ from base import WeatherModel
+
+ DTYPE = torch.float16
+
+ # ---------------------------------------------------------------------------
+ # Model variant presets
+ # ---------------------------------------------------------------------------
+ # The two published variants share the same Mosaic architecture (stage / bottleneck
+ # sizes) but differ in training data, time cadence, history length, neighbour
+ # count, and normalisation statistics:
+ #   - `era5`: ERA5-only training, 24h steps (4 x 6h), 2 input states, k=24 neighbours
+ #   - `hres`: ERA5 pretrain + HRES finetune, 6h steps, 4 input states, k=20 neighbours
+ # ---------------------------------------------------------------------------
+
+ _STAGE_CFGS_COMMON = [
+     StageConfig(
+         nside=64, dim=768, num_heads=12,
+         block_attn_size=1024, sparse_block_size=128, sparse_block_count=24,
+         encoder_depth=4, decoder_depth=2, mlp_ratio=4.0, gqa_ratio=4,
+     ),
+     StageConfig(
+         nside=32, dim=1024, num_heads=16,
+         block_attn_size=1024, sparse_block_size=128, sparse_block_count=12,
+         encoder_depth=4, decoder_depth=2, mlp_ratio=4.0, gqa_ratio=4,
+     ),
+ ]
+
+ _BOTTLENECK_CFG_COMMON = BottleneckConfig(
+     nside=16, dim=1280, num_heads=20,
+     block_attn_size=1024, sparse_block_size=128, sparse_block_count=4,
+     depth=2, mlp_ratio=4.0, gqa_ratio=4,
+ )
+
+
+ @dataclass
+ class Preset:
+     step_stride: int        # number of native 6h timesteps per model step
+     num_history_steps: int  # number of input states fed to the model
+     k_neighbors: int        # neighbours used in cross-attention interpolation
+     default_checkpoint: str
+     default_norm_stats: str
+     stage_cfgs: list
+     bottleneck_cfg: BottleneckConfig
+
+
+ PRESETS = {
+     "era5": Preset(
+         step_stride=4, num_history_steps=2, k_neighbors=24,
+         default_checkpoint="era5_best.pt",
+         default_norm_stats="norm_stats_era5.npz",
+         stage_cfgs=_STAGE_CFGS_COMMON,
+         bottleneck_cfg=_BOTTLENECK_CFG_COMMON,
+     ),
+     "hres": Preset(
+         step_stride=1, num_history_steps=4, k_neighbors=20,
+         default_checkpoint="hres_best.pt",
+         default_norm_stats="norm_stats_hres.npz",
+         stage_cfgs=_STAGE_CFGS_COMMON,
+         bottleneck_cfg=_BOTTLENECK_CFG_COMMON,
+     ),
+ }
+
+
+ # ---------------------------------------------------------------------------
+ # Time utilities
+ # ---------------------------------------------------------------------------
+
+ def compute_day_year_progress(timestamp: pd.Timestamp):
+     """Return (day_progress, year_progress) fractions for a single timestamp."""
+     day_progress = (timestamp.hour * 3600 + timestamp.minute * 60 + timestamp.second) / 86400.0
+     days_in_year = 366 if timestamp.is_leap_year else 365
+     year_progress = (timestamp.day_of_year - 1) / days_in_year
+     return float(day_progress), float(year_progress)
+
+
+ # ---------------------------------------------------------------------------
+ # Zarr loading
+ # ---------------------------------------------------------------------------
+
+ def _load_zarr_times(store) -> pd.DatetimeIndex:
+     """Load and decode the time coordinate from the zarr store, honouring its units attr."""
+     time_raw = np.asarray(store['time'])
+     if not np.issubdtype(time_raw.dtype, np.integer):
+         return pd.to_datetime(time_raw)
+     # Integer encoding: parse 'units' attr e.g. "hours since 1959-01-01"
+     units = store['time'].attrs.get('units', 'hours since 1959-01-01')
+     try:
+         unit_word, _, origin = units.partition(' since ')
+     except Exception:
+         unit_word, origin = 'hours', '1959-01-01'
+     unit_map = {'hours': 'h', 'hour': 'h', 'days': 'D', 'day': 'D',
+                 'minutes': 'm', 'minute': 'm', 'seconds': 's', 'second': 's'}
+     unit = unit_map.get(unit_word.strip().lower(), 'h')
+     return pd.to_datetime(time_raw, unit=unit, origin=origin.strip() or '1959-01-01')
+
+
+ def load_initial_state(zarr_path: str, init_time: str, num_history_steps: int = 4, step_stride: int = 1):
+     """
+     Load `num_history_steps` timesteps ending at `init_time` from a zarr store,
+     spaced `step_stride * 6h` apart (so step_stride=4 -> 24h spacing).
+
+     Returns:
+         state: np.ndarray of shape (num_history_steps, 240, 121, 82) in physical units
+         day_year_time: tuple (day_progress, year_progress) for init_time
+         longitude: np.ndarray (240,)
+         latitude: np.ndarray (121,) South→North
+     """
+     # Open zarr (supports local paths, gs://, s3://, etc.)
+     if zarr_path.startswith('gs://'):
+         import gcsfs
+         fs = gcsfs.GCSFileSystem(token='anon')
+         store_obj = zarr.open(fs.get_mapper(zarr_path), mode='r')
+     else:
+         store_obj = zarr.open(zarr_path, mode='r')
+
+     times = _load_zarr_times(store_obj)
+     init_ts = pd.Timestamp(init_time)
+
+     # Find the index of init_time
+     idx = times.searchsorted(init_ts)
+     if idx >= len(times) or times[idx] != init_ts:
+         raise ValueError(
+             f"init_time '{init_time}' not found in zarr store. "
+             f"Available range: {times[0]} to {times[-1]}"
+         )
+
+     # history indices: [idx - (H-1)*S, idx - (H-2)*S, ..., idx]
+     history_indices = [idx - (num_history_steps - 1 - i) * step_stride for i in range(num_history_steps)]
+     if history_indices[0] < 0:
+         raise ValueError(
+             f"Not enough history: need {num_history_steps} steps spaced {step_stride*6}h apart "
+             f"before {init_time}, but data starts at {times[0]}"
+         )
+
+     # Load longitude/latitude in the canonical (lon, lat S→N) order the model expects.
+     longitude = np.asarray(store_obj['longitude'])    # (240,) 0..358.5
+     latitude_raw = np.asarray(store_obj['latitude'])  # (121,)
+     if latitude_raw[0] > latitude_raw[-1]:            # N→S in store → flip
+         latitude = latitude_raw[::-1].copy()
+         flip_lat = True
+     else:
+         latitude = latitude_raw.copy()
+         flip_lat = False
+
+     n_lon, n_lat = len(longitude), len(latitude)
+     n_vars = len(SL_VARS) + len(PL_VARS) * len(LEVELS)
+     state = np.empty((num_history_steps, n_lon, n_lat, n_vars), dtype=np.float32)
+
+     def _to_lon_lat(arr: np.ndarray, dims: list) -> np.ndarray:
+         """Normalise a (lat,lon) or (lon,lat) slice to (lon, lat S→N)."""
+         if dims[-2:] == ['latitude', 'longitude']:
+             arr = arr.T  # (lat,lon) -> (lon,lat)
+         elif dims[-2:] != ['longitude', 'latitude']:
+             raise ValueError(f"unexpected dim order: {dims}")
+         if flip_lat:
+             arr = arr[:, ::-1]
+         return np.ascontiguousarray(arr)
+
+     all_levels_in_store = list(np.asarray(store_obj['level'])) if 'level' in store_obj else None
+
+     for step_i, t_idx in enumerate(history_indices):
+         ch = 0
+         for var in SL_VARS:
+             dims = list(store_obj[var].attrs.get('_ARRAY_DIMENSIONS', ['time', 'latitude', 'longitude']))
+             arr = np.asarray(store_obj[var][t_idx])  # 2D
+             state[step_i, :, :, ch] = _to_lon_lat(arr, dims)
+             ch += 1
+
+         for var in PL_VARS:
+             dims = list(store_obj[var].attrs.get('_ARRAY_DIMENSIONS', ['time', 'level', 'latitude', 'longitude']))
+             arr_full = np.asarray(store_obj[var][t_idx])  # 3D (level, ...)
+             spatial_dims = [d for d in dims if d != 'time']  # drop time (already indexed)
+             for level in LEVELS:
+                 lev_idx = all_levels_in_store.index(level) if all_levels_in_store is not None else LEVELS.index(level)
+                 arr = arr_full[lev_idx]  # 2D
+                 # spatial_dims still includes 'level' at the front; pass just the 2D part
+                 state[step_i, :, :, ch] = _to_lon_lat(arr, spatial_dims[1:] if spatial_dims[0] == 'level' else spatial_dims)
+                 ch += 1
+
+     day_progress, year_progress = compute_day_year_progress(init_ts)
+     return state, (day_progress, year_progress), longitude, latitude
+
+
+ # ---------------------------------------------------------------------------
241
+ # Model building
242
+ # ---------------------------------------------------------------------------
243
+
244
+ def build_model(
245
+ checkpoint_path: str,
246
+ variables: list,
247
+ longitude: np.ndarray,
248
+ latitude: np.ndarray,
249
+ preset: Preset,
250
+ norm_stats_path: str = "norm_stats.npz",
251
+ static_vars_path: str = "static_vars.npz",
252
+ device: str = "cuda",
253
+ ):
254
+ """Build and return the WeatherModel with loaded checkpoint and metadata."""
255
+
256
+ # Load normalization statistics
257
+ _ns = np.load(norm_stats_path)
258
+ norm_stats = NormalizationStats(
259
+ state_mean=torch.from_numpy(_ns['state_mean'].astype(np.float32)),
260
+ state_std=torch.from_numpy(_ns['state_std'].astype(np.float32)),
261
+ residual_mean=torch.from_numpy(_ns['residual_mean'].astype(np.float32)) if 'residual_mean' in _ns else torch.zeros(len(variables)),
262
+ residual_std=torch.from_numpy(_ns['residual_std'].astype(np.float32)) if 'residual_std' in _ns else torch.ones(len(variables)),
263
+ )
264
+
265
+ # Load static variables
266
+ _sv = np.load(static_vars_path)
267
+ static_data = torch.from_numpy(_sv['data'].astype(np.float32)) # (lon, lat, 3)
268
+ lon_tensor = torch.from_numpy(longitude.astype(np.float32))
269
+ lat_tensor = torch.from_numpy(latitude.astype(np.float32))
270
+
271
+ day_year_delta = torch.tensor(
272
+ [preset.step_stride / 4.0, preset.step_stride / 365.25], dtype=torch.float32
273
+ )
274
+
275
+ metadata = WeatherMetadata(
276
+ variables=variables,
277
+ static_variables=list(ST_VARS),
278
+ longitude=lon_tensor,
279
+ latitude=lat_tensor,
280
+ static_data=static_data,
281
+ day_year_delta=day_year_delta,
282
+ norm_stats=norm_stats,
283
+ )
284
+
285
+ # Build model
286
+ model_config = ModelConfig(
287
+ dim=preset.stage_cfgs[0].dim,
288
+ num_heads=preset.stage_cfgs[0].num_heads,
289
+ variables=variables,
290
+ static_variables=list(ST_VARS),
291
+ k_neighbors=preset.k_neighbors,
292
+ qk_norm=False,
293
+ rope=True,
294
+ rope_theta=10000,
295
+ sparse_every=1,
296
+ qkv_compress_ratio=1,
297
+ num_history_steps=preset.num_history_steps,
298
+ noise_dim=32,
299
+ rmsnorm_elementwise_affine=False,
300
+ cg_stage_cfgs=preset.stage_cfgs,
301
+ bottleneck_cfg=preset.bottleneck_cfg,
302
+ )
303
+
304
+ backbone = Transformer(model_config)
305
+ model = WeatherModel(backbone, metadata).to(device).eval()
306
+
307
+ # Load checkpoint. The model registers several deterministic buffers (RoPE
308
+ # tables, HEALPix neighbour indices, static_vars) that are recomputed at
309
+ # __init__ from the metadata/config and therefore aren't expected in the
310
+ # saved checkpoint — so we load non-strictly and only warn on *unexpected*
311
+ # keys, which would indicate a real architecture mismatch.
312
+ ckpt = torch.load(checkpoint_path, map_location='cpu', weights_only=False)
313
+ state_dict = ckpt.get('model_state_dict', ckpt)
314
+ result = model.load_state_dict(state_dict, strict=False)
315
+ if result.unexpected_keys:
316
+ raise RuntimeError(
317
+ f"Unexpected keys in checkpoint (architecture mismatch): {result.unexpected_keys[:5]}"
318
+ + (f" ... and {len(result.unexpected_keys)-5} more" if len(result.unexpected_keys) > 5 else "")
319
+ )
320
+ print(f"Loaded checkpoint from {checkpoint_path} (epoch {ckpt.get('epoch', '?')})")
321
+ print(f" {len(state_dict)} keys loaded, {len(result.missing_keys)} buffer keys re-computed from config")
322
+
323
+ return model, metadata
324
+
325
+
# ---------------------------------------------------------------------------
# Autoregressive rollout (direct state prediction)
# ---------------------------------------------------------------------------

@torch.no_grad()
def unroll_direct(
    model: WeatherModel,
    initial_unnorm_state: torch.Tensor,
    day_year_time: torch.Tensor,
    day_year_delta: torch.Tensor,
    norm_stats: NormalizationStats,
    num_unroll_steps: int,
    num_ensemble_members: int,
    dtype: torch.dtype = torch.float16,
) -> torch.Tensor:
    """
    Autoregressively forecast using direct state prediction (learn_direct=True).

    Args:
        model: WeatherModel
        initial_unnorm_state: (B, num_history_steps, lon, lat, channels) in physical units
        day_year_time: (B, 2) day/year progress fractions at init_time
        day_year_delta: (2,) increment per step
        norm_stats: NormalizationStats on the target device
        num_unroll_steps: number of autoregressive model steps to forecast
            (each step spans step_stride * 6h)
        num_ensemble_members: number of ensemble members (noise samples on step 0)
        dtype: computation dtype (float16 recommended)

    Returns:
        trajectory: (B, members, num_history_steps + num_unroll_steps, lon, lat, channels)
    """
    batch_size = initial_unnorm_state.shape[0]
    num_history_steps = initial_unnorm_state.shape[1]
    device = initial_unnorm_state.device

    trajectory = torch.empty(
        (batch_size, num_ensemble_members, num_unroll_steps + num_history_steps)
        + initial_unnorm_state.shape[2:],
        dtype=initial_unnorm_state.dtype,
        device=device,
    )

    # Expand initial state to ensemble dimension
    current_unnorm_state = initial_unnorm_state.unsqueeze(1)  # (B, 1, H, lon, lat, C)
    current_day_year_time = day_year_time.unsqueeze(1)  # (B, 1, 2)

    trajectory[:, :, :num_history_steps] = current_unnorm_state

    for t in range(num_unroll_steps):
        # Expand the ensemble only on the first step; afterwards each member
        # evolves from its own history window.
        num_ens_step = num_ensemble_members if t == 0 else 1

        current_norm_state = (current_unnorm_state - norm_stats.state_mean) / norm_stats.state_std
        with torch.amp.autocast('cuda', dtype=dtype):
            norm_next_state = model(current_norm_state, current_day_year_time, num_ens_step)

        next_unnorm_state = norm_next_state * norm_stats.state_std + norm_stats.state_mean
        current_day_year_time = current_day_year_time + day_year_delta.unsqueeze(0).unsqueeze(0).expand(
            batch_size, num_ens_step, -1
        )

        trajectory[:, :, t + num_history_steps] = next_unnorm_state
        current_unnorm_state = trajectory[:, :, t + 1 : t + 1 + num_history_steps]

    return trajectory

# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------

def main():
    parser = argparse.ArgumentParser(description="Mosaic 1.5° Weather Forecast Inference")
    parser.add_argument("--variant", type=str, required=True, choices=sorted(PRESETS.keys()),
                        help="Model variant: 'era5' (ERA5-only, 24h steps) or 'hres' (ERA5+HRES finetune, 6h steps)")
    parser.add_argument("--checkpoint", type=str, default=None,
                        help="Path to model checkpoint (.pt). Default: preset's default_checkpoint")
    parser.add_argument("--zarr", type=str, required=True,
                        help="Path or GCS URI to zarr store with ERA5/HRES data at 1.5°")
    parser.add_argument("--init-time", type=str, required=True,
                        help="Initialization time (ISO 8601), e.g. '2020-01-01T00:00'")
    parser.add_argument("--steps", type=int, default=10,
                        help="Number of forecast steps (each step = step_stride*6h; e.g. era5 step=24h, hres step=6h)")
    parser.add_argument("--members", type=int, default=1,
                        help="Number of ensemble members (default: 1)")
    parser.add_argument("--output", type=str, default="forecast.npz",
                        help="Output file path (default: forecast.npz)")
    parser.add_argument("--norm-stats", type=str, default=None,
                        help="Path to norm_stats .npz. Default: preset's default_norm_stats")
    parser.add_argument("--static-vars", type=str, default="static_vars.npz",
                        help="Path to static_vars.npz (default: static_vars.npz in current dir)")
    parser.add_argument("--k-neighbors", type=int, default=None,
                        help="Override preset's k_neighbors (advanced, for ablation only)")
    parser.add_argument("--no-compile", action="store_true",
                        help="Disable torch.compile (slower but easier to debug)")
    parser.add_argument("--device", type=str, default="cuda",
                        help="Device (default: cuda)")
    args = parser.parse_args()

    preset = PRESETS[args.variant]
    if args.k_neighbors is not None and args.k_neighbors != preset.k_neighbors:
        from dataclasses import replace
        preset = replace(preset, k_neighbors=args.k_neighbors)
    checkpoint_path = args.checkpoint or preset.default_checkpoint
    norm_stats_path = args.norm_stats or preset.default_norm_stats
    print(f"Variant: {args.variant} "
          f"(step_stride={preset.step_stride}, num_history_steps={preset.num_history_steps}, "
          f"k_neighbors={preset.k_neighbors})")

    device = args.device
    torch.set_float32_matmul_precision('high')

    # Build variable list: 4 surface + 6*13 pressure-level = 82 channels
    variables = list(SL_VARS)
    for var in PL_VARS:
        for level in LEVELS:
            variables.append(f"{var}_{level}")
    print(f"Variables: {len(variables)} channels")

    # Load initial state from zarr
    print(f"Loading initial state from zarr: {args.zarr}")
    print(f"  Init time: {args.init_time} (history: {preset.num_history_steps} x {preset.step_stride*6}h steps)")
    initial_state_np, (day_prog, year_prog), longitude, latitude = load_initial_state(
        args.zarr, args.init_time,
        num_history_steps=preset.num_history_steps,
        step_stride=preset.step_stride,
    )
    print(f"  State shape: {initial_state_np.shape} (steps, lon, lat, channels)")

    # Build model and load checkpoint
    print(f"\nBuilding model and loading checkpoint: {checkpoint_path}")
    model, metadata = build_model(
        checkpoint_path=checkpoint_path,
        variables=variables,
        longitude=longitude,
        latitude=latitude,
        preset=preset,
        norm_stats_path=norm_stats_path,
        static_vars_path=args.static_vars,
        device=device,
    )
    num_params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"  Parameters: {num_params:.1f}M")

    # Optionally compile
    if not args.no_compile:
        print("Compiling model with torch.compile (reduce-overhead)...")
        unroll_fn = torch.compile(unroll_direct, mode='reduce-overhead')
    else:
        unroll_fn = unroll_direct

    # Prepare tensors
    initial_state = torch.from_numpy(initial_state_np).unsqueeze(0).to(device)  # (1, H, lon, lat, C)
    day_year_time = torch.tensor([[day_prog, year_prog]], dtype=torch.float32, device=device)  # (1, 2)
    norm_stats_d = metadata.norm_stats.to(device)
    day_year_delta_d = metadata.day_year_delta.to(device)

    # Run forecast
    total_hours = args.steps * preset.step_stride * 6
    print(f"\nRunning {args.steps}-step forecast ({total_hours}h) with {args.members} member(s)...")
    if device == 'cuda':
        torch.cuda.reset_peak_memory_stats()

    with torch.no_grad():
        trajectory = unroll_fn(
            model=model,
            initial_unnorm_state=initial_state,
            day_year_time=day_year_time,
            day_year_delta=day_year_delta_d,
            norm_stats=norm_stats_d,
            num_unroll_steps=args.steps,
            num_ensemble_members=args.members,
            dtype=DTYPE,
        )

    if device == 'cuda':
        torch.cuda.synchronize()
        peak_gb = torch.cuda.max_memory_allocated() / 1e9
        print(f"  Peak GPU memory: {peak_gb:.1f} GB")

    # Extract forecast steps (skip history)
    forecasts = trajectory[0, :, preset.num_history_steps:].cpu().numpy()  # (members, steps, lon, lat, C)
    print(f"  Forecast shape: {forecasts.shape}")

    # Save output
    lead_time_hours = np.arange(1, args.steps + 1) * 6 * preset.step_stride
    np.savez(
        args.output,
        forecasts=forecasts,
        variables=np.array(variables),
        lead_time_hours=lead_time_hours,
        init_time=np.str_(args.init_time),
        longitude=longitude,
        latitude=latitude,
    )
    print(f"\nSaved forecast to: {args.output}")
    print(f"  Shape: forecasts {forecasts.shape} (members, steps, lon=240, lat=121, channels=82)")
    print(f"  Lead times: {lead_time_hours[0]}h to {lead_time_hours[-1]}h")


if __name__ == "__main__":
    main()
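The bookkeeping in `unroll_direct` above (a trajectory buffer that holds history plus forecast steps, with each new step re-reading the last `num_history_steps` entries as the next model input) can be illustrated with a minimal pure-Python sketch. Everything here (`toy_unroll`, `step_fn`) is a hypothetical stand-in for illustration only, not the real model or API:

```python
# Miniature of the sliding-history rollout in unroll_direct. The "model" is a
# toy stand-in: the mean of the history window plus a fixed increment.

def toy_unroll(initial_history, num_unroll_steps, step_fn):
    """initial_history: list of floats, len == num_history_steps."""
    num_history_steps = len(initial_history)
    trajectory = list(initial_history)  # history occupies the first slots
    for t in range(num_unroll_steps):
        # Re-read the most recent num_history_steps entries as the next input,
        # exactly as trajectory[:, :, t+1 : t+1+H] does in unroll_direct.
        window = trajectory[t : t + num_history_steps]
        trajectory.append(step_fn(window))
    return trajectory

def step_fn(window):
    return sum(window) / len(window) + 1.0

print(toy_unroll([0.0, 2.0], num_unroll_steps=3, step_fn=step_fn))
# → [0.0, 2.0, 2.0, 3.0, 3.5]
```

The key point carried over from the real code: the ensemble dimension is only expanded on the first step, after which each member's window slides forward independently.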
mosaic.py ADDED
@@ -0,0 +1,337 @@
"""
Mosaic: U-Net transformer with block-sparse attention for weather forecasting.

Architecture:
- Cross-attention interpolation between lon/lat and HEALPix grids
- Block-sparse attention (local block + compressed + top-k selection branches)
  arranged in a U-Net encoder–bottleneck–decoder
- Probabilistic training with noise injection
"""

import math
import torch
import torch.nn as nn
from einops import rearrange, repeat
from dataclasses import dataclass
from torch.nn import RMSNorm

from utils import get_healpix_grid, rad_to_xyz
from primitives import (
    MosaicBlock as _MosaicBlock,
    CrossAttentionInterpolate,
    NoiseGenerator,
    HEALPixDownsample,
    HEALPixUpsample,
)


@dataclass
class StageConfig:
    """Configuration for a U-Net encoder/decoder stage."""
    nside: int
    dim: int
    num_heads: int
    block_attn_size: int
    sparse_block_size: int
    sparse_block_count: int
    encoder_depth: int
    decoder_depth: int
    mlp_ratio: float
    gqa_ratio: int


@dataclass
class BottleneckConfig:
    """Configuration for the U-Net bottleneck stage."""
    nside: int
    dim: int
    num_heads: int
    block_attn_size: int
    sparse_block_size: int
    sparse_block_count: int
    depth: int
    mlp_ratio: float
    gqa_ratio: int


@dataclass
class ModelConfig:
    """Configuration for the Mosaic model."""
    dim: int
    num_heads: int
    k_neighbors: int
    qk_norm: bool
    rope: bool
    rope_theta: int
    sparse_every: int
    variables: list[str]
    static_variables: list[str]
    qkv_compress_ratio: int
    cg_stage_cfgs: list[StageConfig]
    bottleneck_cfg: BottleneckConfig
    num_history_steps: int = 1
    noise_dim: int = 32
    ortho_init: bool = False
    rmsnorm_elementwise_affine: bool = True
    no_compression: bool = False


@dataclass
class _MergedStageConfig:
    """Merges ModelConfig and StageConfig for compatibility with MosaicBlock."""
    dim: int
    num_heads: int
    block_attn_size: int
    sparse_block_size: int
    sparse_block_count: int
    gqa_ratio: int
    qkv_compress_ratio: int
    rope: bool
    rope_theta: int
    mlp_ratio: float
    noise_dim: int
    rmsnorm_elementwise_affine: bool


def _merge_configs(config: ModelConfig, stage_cfg) -> _MergedStageConfig:
    return _MergedStageConfig(
        dim=stage_cfg.dim,
        num_heads=stage_cfg.num_heads,
        block_attn_size=stage_cfg.block_attn_size,
        sparse_block_size=stage_cfg.sparse_block_size,
        sparse_block_count=stage_cfg.sparse_block_count,
        gqa_ratio=stage_cfg.gqa_ratio,
        qkv_compress_ratio=config.qkv_compress_ratio,
        rope=config.rope,
        rope_theta=config.rope_theta,
        mlp_ratio=stage_cfg.mlp_ratio,
        noise_dim=config.noise_dim,
        rmsnorm_elementwise_affine=config.rmsnorm_elementwise_affine,
    )


def _make_mosaic_block(config: ModelConfig, stage_cfg, block_attn_only: bool) -> _MosaicBlock:
    return _MosaicBlock(_merge_configs(config, stage_cfg), block_attn_only, no_compression=config.no_compression)


class UNetStage(nn.Module):
    def __init__(self, config, stage_cfg, depth):
        super().__init__()
        self.nside = stage_cfg.nside
        self.blocks = nn.ModuleList([
            _make_mosaic_block(
                config=config,
                stage_cfg=stage_cfg,
                # Sparse branch every `sparse_every`-th block; the rest use
                # local block attention only.
                block_attn_only=(config.sparse_every <= 0) or not (i % config.sparse_every == 0),
            )
            for i in range(depth)
        ])

    def forward(self, x, z=None):
        for block in self.blocks:
            x = block(x, z)
        return x


class Transformer(nn.Module):
    """U-Net style Transformer for weather forecasting on HEALPix grids."""

    space_dim = 3
    time_dim = 4

    def __init__(self, config: ModelConfig, seed: int = 42):
        super().__init__()

        self.config = config
        self.nside = config.cg_stage_cfgs[0].nside
        self.noise_dim = config.noise_dim

        initial_dim = config.dim
        feature_dim = (len(config.variables) * config.num_history_steps
                       + len(config.static_variables) + self.space_dim + self.time_dim)

        if self.noise_dim > 0:
            self.noise_generator = NoiseGenerator(self.noise_dim, seed)

        self.preprocess = nn.Sequential(
            nn.Linear(feature_dim, initial_dim, bias=False),
            RMSNorm(initial_dim, elementwise_affine=config.rmsnorm_elementwise_affine),
            nn.SiLU(),
            nn.Linear(initial_dim, initial_dim, bias=False),
            RMSNorm(initial_dim, elementwise_affine=config.rmsnorm_elementwise_affine),
        )

        self.interp_to_hp = CrossAttentionInterpolate(config)
        self.interp_to_ll = CrossAttentionInterpolate(config)

        self.encoder_stages = nn.ModuleList()
        self.downsample_layers = nn.ModuleList()

        all_stages = [*config.cg_stage_cfgs, config.bottleneck_cfg]

        for i in range(len(config.cg_stage_cfgs)):
            current_stage = all_stages[i]
            next_stage = all_stages[i + 1]

            self.encoder_stages.append(UNetStage(config=config, stage_cfg=current_stage, depth=current_stage.encoder_depth))
            self.downsample_layers.append(
                HEALPixDownsample(
                    in_dim=current_stage.dim,
                    out_dim=next_stage.dim,
                    nside_before=current_stage.nside,
                    nside_after=next_stage.nside,
                    rmsnorm_elementwise_affine=config.rmsnorm_elementwise_affine,
                )
            )

        self.bottleneck = UNetStage(config=config, stage_cfg=config.bottleneck_cfg, depth=config.bottleneck_cfg.depth)

        self.decoder_stages = nn.ModuleList()
        self.upsample_layers = nn.ModuleList()

        for i in reversed(range(len(config.cg_stage_cfgs))):
            prev_stage = all_stages[i + 1]
            current_stage = all_stages[i]

            self.upsample_layers.append(
                HEALPixUpsample(
                    in_dim=prev_stage.dim,
                    out_dim=current_stage.dim,
                    nside_before=prev_stage.nside,
                    nside_after=current_stage.nside,
                    rmsnorm_elementwise_affine=config.rmsnorm_elementwise_affine,
                )
            )
            self.decoder_stages.append(UNetStage(config=config, stage_cfg=current_stage, depth=current_stage.decoder_depth))

        self.norm_before_interp_ll = RMSNorm(initial_dim, elementwise_affine=config.rmsnorm_elementwise_affine)

        self.postprocess = nn.Sequential(
            RMSNorm(initial_dim, elementwise_affine=config.rmsnorm_elementwise_affine),
            nn.Linear(initial_dim, initial_dim, bias=False),
            nn.SiLU(),
            nn.Linear(initial_dim, len(config.variables), bias=False),
        )

        self.apply(self._initialize_weights)
        self._zero_init_residual_layers()
        self.initialize_rope()

    def _initialize_weights(self, module):
        if module is self:
            return
        ortho_init = self.config.ortho_init

        if isinstance(module, nn.Linear):
            fan_in, fan_out = module.weight.size(1), module.weight.size(0)
            std = 1.0 / math.sqrt(fan_in) * min(1.0, math.sqrt(fan_out / fan_in))
            if ortho_init:
                nn.init.orthogonal_(module.weight)
                module.weight.data.mul_(std)
            else:
                nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                nn.init.zeros_(module.bias)

    def _zero_init_residual_layers(self):
        ortho_init = self.config.ortho_init

        for stage in [*self.encoder_stages, self.bottleneck, *self.decoder_stages]:
            for block in stage.blocks:
                if ortho_init:
                    nn.init.orthogonal_(block.attention.to_o.weight)
                    block.attention.to_o.weight.data.mul_(0.01)
                    nn.init.orthogonal_(block.ffn.w2.weight)
                    block.ffn.w2.weight.data.mul_(0.01)
                else:
                    nn.init.normal_(block.attention.to_o.weight, mean=0.0, std=0.01)
                    nn.init.normal_(block.ffn.w2.weight, mean=0.0, std=0.01)

                if self.noise_dim > 0:
                    nn.init.normal_(block.ffn.noise_bias.weight, mean=0.0, std=0.01)

        for upsample in self.upsample_layers:
            if ortho_init:
                nn.init.orthogonal_(upsample.proj_x.weight)
                upsample.proj_x.weight.data.mul_(0.01)
                nn.init.orthogonal_(upsample.proj_pos.weight)
                upsample.proj_pos.weight.data.mul_(0.01)
            else:
                nn.init.normal_(upsample.proj_x.weight, mean=0.0, std=0.01)
                nn.init.normal_(upsample.proj_pos.weight, mean=0.0, std=0.01)

        if self.noise_dim > 0:
            nn.init.normal_(self.noise_generator.to_noise.weight, mean=0.0, std=0.01)

    def initialize_rope(self):
        if not self.config.rope:
            return
        for stage in [*self.encoder_stages, self.bottleneck, *self.decoder_stages]:
            hp_grid = get_healpix_grid(stage.nside)
            for block in stage.blocks:
                if block.attention.q_rope is not None:
                    block.attention.q_rope.initialize_rope(hp_grid)
                    block.attention.k_rope.initialize_rope(hp_grid)

    def initialize_interpolation(self, longitude: torch.Tensor, latitude: torch.Tensor):
        ll_grid_rad = torch.deg2rad(torch.stack(torch.meshgrid(longitude, latitude, indexing='ij'), -1).reshape(-1, 2))
        hp_grid_rad = torch.deg2rad(get_healpix_grid(self.nside)).to(longitude.device)
        self.interp_to_hp.initialize_interpolation_scheme(ll_grid_rad, hp_grid_rad)
        self.interp_to_ll.initialize_interpolation_scheme(hp_grid_rad, ll_grid_rad)

    @torch.no_grad()
    def initialize_static_vars(self, static_vars: torch.Tensor, longitude: torch.Tensor, latitude: torch.Tensor):
        ll_grid_rad = torch.deg2rad(torch.stack(torch.meshgrid(longitude, latitude, indexing='ij'), -1))
        ll_grid_xyz = rad_to_xyz(ll_grid_rad)
        static_vars = torch.concat([static_vars, ll_grid_xyz], dim=-1)
        static_vars_mean = static_vars.mean(dim=(0, 1), keepdim=True)
        static_vars_std = static_vars.std(dim=(0, 1), keepdim=True) + 1e-6
        static_vars_norm = (static_vars - static_vars_mean) / static_vars_std
        static_vars = rearrange(static_vars_norm, 'lon lat c -> (lon lat) 1 c').contiguous()
        self.register_buffer('static_vars', static_vars, persistent=True)

    @torch.no_grad()
    def time_embedding(self, day_year_time: torch.Tensor):
        day = day_year_time[:, 0:1]
        year = day_year_time[:, 1:2]
        day_sin = torch.sin(2 * math.pi * day)
        day_cos = torch.cos(2 * math.pi * day)
        year_sin = torch.sin(2 * math.pi * year)
        year_cos = torch.cos(2 * math.pi * year)
        return torch.cat([day_sin, day_cos, year_sin, year_cos], dim=-1)

    def forward(self, x: torch.Tensor, day_year_time: torch.Tensor, num_noise_samples: int):
        b, n, _, lon, lat, _ = x.shape
        batch_size = b * num_noise_samples * n

        if self.noise_dim > 0:
            z = self.noise_generator(batch_size, x.device, x.dtype)
        else:
            z = None

        x = repeat(x, 'b n t lon lat c -> (lon lat) (b s n) (t c)', s=num_noise_samples)
        day_year_time = repeat(day_year_time, 'b n d -> (b s n) d', s=num_noise_samples)

        x = torch.cat([
            x,
            self.static_vars.expand(-1, batch_size, -1),
            self.time_embedding(day_year_time).unsqueeze(0).expand(x.shape[0], -1, -1)
        ], dim=-1)

        x = self.preprocess(x)
        x = self.interp_to_hp(x)

        skip_connections = []
        for encoder_stage, downsample in zip(self.encoder_stages, self.downsample_layers):
            x = encoder_stage(x, z)
            skip_connections.append(x)
            x = downsample(x)

        x = self.bottleneck(x, z)

        for decoder_stage, upsample, skip in zip(self.decoder_stages, self.upsample_layers, reversed(skip_connections)):
            x = upsample(x, skip)
            x = decoder_stage(x, z)

        x = self.norm_before_interp_ll(x)
        x = self.interp_to_ll(x)
        x = self.postprocess(x)

        # Unflatten with the same (b s n) ordering used when the noise samples
        # were folded in above, so members line up with their noise draws.
        x = rearrange(x, '(lon lat) (b s n) c -> b (s n) lon lat c', lon=lon, lat=lat, b=b, s=num_noise_samples)
        return x
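The cyclic time conditioning computed by `Transformer.time_embedding` above maps the day/year progress fractions to (sin, cos) pairs, so that times of day (and dates) wrap smoothly: progress 0.0 and 1.0 yield the same features. A minimal pure-Python check of that property, with `time_embedding` here being a standalone re-statement for illustration, not the model method itself:

```python
import math

def time_embedding(day_progress, year_progress):
    """(sin, cos) pairs of day-of-day and day-of-year progress in [0, 1]."""
    return [
        math.sin(2 * math.pi * day_progress),
        math.cos(2 * math.pi * day_progress),
        math.sin(2 * math.pi * year_progress),
        math.cos(2 * math.pi * year_progress),
    ]

# Quarter of the way through the day, half-way through the year:
print(time_embedding(0.25, 0.5))
```

Because the features are periodic with period 1, the `day_year_delta` increments accumulated during rollout never need to be wrapped back into [0, 1) before embedding.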
ops.py ADDED
@@ -0,0 +1,566 @@
+ import torch
2
+ import triton
3
+ import triton.language as tl
4
+
5
+
6
+
7
+ def get_autotuning_configs(q_tile_sizes: list):
8
+ """Generate autotuning configurations optimized for H100."""
9
+ warps = [4, 8]
10
+ stages = [2, 3]
11
+
12
+ return [
13
+ triton.Config({'q_tile_size': t}, num_warps=w, num_stages=s)
14
+ for t in q_tile_sizes
15
+ for w in warps
16
+ for s in stages
17
+ ]
18
+
19
+
20
+ @triton.autotune(
21
+ configs=get_autotuning_configs([64, 128]),
22
+ key=['seq_len', 'feature_dim'],
23
+ )
24
+ @triton.jit
25
+ def mosaic_attn_fwd_kernel(
26
+ q_ptr, k_ptr, v_ptr, output_ptr, lse_ptr, block_indices_ptr,
27
+ softmax_scale: tl.constexpr,
28
+ seq_len: tl.constexpr,
29
+ num_kv_heads: tl.constexpr,
30
+ num_q_heads: tl.constexpr,
31
+ q_heads_per_kv_head: tl.constexpr,
32
+ feature_dim: tl.constexpr,
33
+ kv_block_size: tl.constexpr,
34
+ num_kv_blocks_per_q_block: tl.constexpr,
35
+ q_tile_size: tl.constexpr,
36
+ ):
37
+ """
38
+ Sparse attention forward kernel:
39
+ for each query tile (i.e. block chunk), for each query head, attend to a subset of key/value blocks.
40
+ """
41
+ LOG2_E: tl.constexpr = 1.44269504089
42
+
43
+ q_tile_id = tl.program_id(0)
44
+ q_head_id = tl.program_id(1)
45
+ batch_kv_head_id = tl.program_id(2)
46
+
47
+ batch_idx = batch_kv_head_id // num_kv_heads
48
+ kv_head_idx = batch_kv_head_id % num_kv_heads
49
+ q_head_idx = kv_head_idx * q_heads_per_kv_head + q_head_id
50
+
51
+ batch_offset = batch_idx * seq_len
52
+ q_tile_start = q_tile_id * q_tile_size
53
+ num_blocks_in_seq = seq_len // kv_block_size
54
+ tiles_per_block = kv_block_size // q_tile_size
55
+ q_block_id = q_tile_id // tiles_per_block
56
+
57
+ block_indices_offset = (
58
+ batch_idx * num_blocks_in_seq * num_kv_heads * num_kv_blocks_per_q_block +
59
+ q_block_id * num_kv_heads * num_kv_blocks_per_q_block +
60
+ kv_head_idx * num_kv_blocks_per_q_block
61
+ )
62
+
63
+ q_base_ptr = q_ptr + batch_offset * num_q_heads * feature_dim + q_head_idx * feature_dim
64
+ k_base_ptr = k_ptr + batch_offset * num_kv_heads * feature_dim + kv_head_idx * feature_dim
65
+ v_base_ptr = v_ptr + batch_offset * num_kv_heads * feature_dim + kv_head_idx * feature_dim
66
+
67
+ q_tile_ptr = tl.make_block_ptr(
68
+ base=q_base_ptr,
69
+ shape=(seq_len, feature_dim),
70
+ strides=(num_q_heads * feature_dim, 1),
71
+ offsets=(q_tile_start, 0),
72
+ block_shape=(q_tile_size, feature_dim),
73
+ order=(1, 0)
74
+ )
75
+
76
+ output_tile_ptr = tl.make_block_ptr(
77
+ base=output_ptr + batch_offset * num_q_heads * feature_dim + q_head_idx * feature_dim,
78
+ shape=(seq_len, feature_dim),
79
+ strides=(num_q_heads * feature_dim, 1),
80
+ offsets=(q_tile_start, 0),
81
+ block_shape=(q_tile_size, feature_dim),
82
+ order=(1, 0)
83
+ )
84
+
85
+ lse_base_ptr = lse_ptr + (batch_offset + q_tile_start) * num_q_heads + tl.arange(0, q_tile_size) * num_q_heads + q_head_idx
86
+
87
+ output_accum = tl.zeros([q_tile_size, feature_dim], dtype=tl.float32)
88
+ max_scores = tl.full([q_tile_size], float('-inf'), dtype=tl.float32)
89
+ sum_exp_scores = tl.zeros([q_tile_size], dtype=tl.float32)
90
+
91
+ q_tile = tl.load(q_tile_ptr)
92
+ q_tile = (q_tile * softmax_scale * LOG2_E).to(tl.bfloat16)
93
+
94
+ for i in range(num_kv_blocks_per_q_block):
95
+ kv_block_start = kv_block_size * tl.load(block_indices_ptr + block_indices_offset + i).to(tl.int32)
96
+
97
+ k_block_ptr = tl.make_block_ptr(
98
+ base=k_base_ptr,
99
+ shape=(feature_dim, seq_len),
100
+ strides=(1, num_kv_heads * feature_dim),
101
+ offsets=(0, kv_block_start),
102
+ block_shape=(feature_dim, kv_block_size),
103
+ order=(1, 0)
104
+ )
105
+
106
+ v_block_ptr = tl.make_block_ptr(
107
+ base=v_base_ptr,
108
+ shape=(seq_len, feature_dim),
109
+ strides=(num_kv_heads * feature_dim, 1),
110
+ offsets=(kv_block_start, 0),
111
+ block_shape=(kv_block_size, feature_dim),
112
+ order=(1, 0)
113
+ )
114
+
115
+ k_block = tl.load(k_block_ptr).to(tl.bfloat16)
116
+ v_block = tl.load(v_block_ptr).to(tl.bfloat16)
117
+
118
+ attention_scores = tl.dot(q_tile, k_block)
119
+
120
+ new_max = tl.max(attention_scores, axis=1)
121
+ old_max = max_scores
122
+ max_scores = tl.maximum(max_scores, new_max)
123
+ rescale = tl.exp2(old_max - max_scores)
124
+ attention_probs = tl.exp2(attention_scores - max_scores[:, None])
125
+ sum_exp_scores = sum_exp_scores * rescale + tl.sum(attention_probs, axis=1)
126
+
127
+ output_accum = output_accum * rescale[:, None]
128
+ output_accum += tl.dot(attention_probs.to(tl.bfloat16), v_block)
129
+
130
+ final_output = output_accum / sum_exp_scores[:, None]
131
+ log_sum_exp = (max_scores + tl.log2(sum_exp_scores))
132
+
133
+ tl.store(output_tile_ptr, final_output.to(q_ptr.dtype.element_ty))
134
+ tl.store(lse_base_ptr, log_sum_exp.to(tl.float32))
135
+
136
+
137
+ def mosaic_attn_fwd(
138
+ q: torch.Tensor,
139
+ k: torch.Tensor,
140
+ v: torch.Tensor,
141
+ block_indices: torch.LongTensor,
142
+ block_size: int,
143
+ softmax_scale: float,
144
+ ):
145
+ batch_size, seq_len, num_kv_heads, feature_dim = k.shape
146
+ num_q_heads = q.shape[2]
147
+ num_kv_blocks_per_q_block = block_indices.shape[-1]
148
+ q_heads_per_kv_head = num_q_heads // num_kv_heads
149
+
150
+ output = torch.empty(batch_size, seq_len, num_q_heads, feature_dim, dtype=v.dtype, device=q.device)
151
+ lse = torch.empty(batch_size, seq_len, num_q_heads, dtype=torch.float32, device=q.device)
152
+
153
+ grid = lambda META: (
154
+ triton.cdiv(seq_len, META['q_tile_size']),
155
+ q_heads_per_kv_head,
156
+ batch_size * num_kv_heads
157
+ )
158
+
159
+ mosaic_attn_fwd_kernel[grid](
160
+ q_ptr = q,
161
+ k_ptr = k,
162
+ v_ptr = v,
163
+ output_ptr = output,
164
+ lse_ptr = lse,
165
+ block_indices_ptr = block_indices,
166
+ softmax_scale = softmax_scale,
167
+ seq_len = seq_len,
168
+ num_kv_heads = num_kv_heads,
169
+ num_q_heads = num_q_heads,
170
+ q_heads_per_kv_head = q_heads_per_kv_head,
171
+ feature_dim = feature_dim,
172
+ kv_block_size = block_size,
173
+ num_kv_blocks_per_q_block = num_kv_blocks_per_q_block,
174
+ )
175
+
176
+ return output, lse
177
+
178
+
179
+@triton.autotune(
+    configs=get_autotuning_configs([64, 128]),
+    key=['seq_len', 'feature_dim'],
+)
+@triton.jit
+def mosaic_attn_bwd_q_kernel(
+    q_ptr, k_ptr, v_ptr, lse_ptr, delta_ptr, grad_o_ptr, grad_q_ptr, block_indices_ptr,
+    softmax_scale: tl.constexpr,
+    seq_len: tl.constexpr,
+    num_kv_heads: tl.constexpr,
+    num_q_heads: tl.constexpr,
+    q_heads_per_kv_head: tl.constexpr,
+    feature_dim: tl.constexpr,
+    kv_block_size: tl.constexpr,
+    num_kv_blocks_per_q_block: tl.constexpr,
+    q_tile_size: tl.constexpr,
+):
+    LOG2_E: tl.constexpr = 1.44269504089
+    LN_2: tl.constexpr = 0.69314718056
+
+    q_tile_id = tl.program_id(0)
+    q_head_id = tl.program_id(1)
+    batch_kv_head_id = tl.program_id(2)
+
+    batch_idx = batch_kv_head_id // num_kv_heads
+    kv_head_idx = batch_kv_head_id % num_kv_heads
+    q_head_idx = kv_head_idx * q_heads_per_kv_head + q_head_id
+
+    batch_offset = batch_idx * seq_len
+    q_tile_start = q_tile_id * q_tile_size
+    tiles_per_block = kv_block_size // q_tile_size
+    q_block_id = q_tile_id // tiles_per_block
+    num_q_blocks = seq_len // kv_block_size
+
+    block_indices_offset = (
+        batch_idx * num_q_blocks * num_kv_heads * num_kv_blocks_per_q_block +
+        q_block_id * num_kv_heads * num_kv_blocks_per_q_block +
+        kv_head_idx * num_kv_blocks_per_q_block
+    )
+
+    q_offsets = (
+        tl.arange(0, q_tile_size)[:, None] * num_q_heads * feature_dim +
+        q_head_idx * feature_dim +
+        tl.arange(0, feature_dim)[None, :]
+    )
+
+    lse_offsets = tl.arange(0, q_tile_size) * num_q_heads + q_head_idx
+
+    q_base_ptr = q_ptr + (batch_offset + q_tile_start) * num_q_heads * feature_dim
+    grad_o_base_ptr = grad_o_ptr + (batch_offset + q_tile_start) * num_q_heads * feature_dim
+    delta_base_ptr = delta_ptr + (batch_offset + q_tile_start) * num_q_heads
+    lse_base_ptr = lse_ptr + (batch_offset + q_tile_start) * num_q_heads
+    grad_q_base_ptr = grad_q_ptr + (batch_offset + q_tile_start) * num_q_heads * feature_dim
+
+    grad_q_accum = tl.zeros([q_tile_size, feature_dim], dtype=tl.float32)
+
+    q_tile = tl.load(q_base_ptr + q_offsets)
+    q_tile = (q_tile * softmax_scale * LOG2_E).to(tl.bfloat16)
+
+    grad_o_tile = tl.load(grad_o_base_ptr + q_offsets).to(tl.bfloat16)
+    delta_vals = tl.load(delta_base_ptr + lse_offsets)
+    lse_vals = tl.load(lse_base_ptr + lse_offsets).to(tl.float32)
+
+    for i in range(num_kv_blocks_per_q_block):
+        kv_block_idx = tl.load(block_indices_ptr + block_indices_offset + i).to(tl.int32)
+
+        k_block_ptr = tl.make_block_ptr(
+            base=k_ptr + (batch_offset * num_kv_heads + kv_head_idx) * feature_dim,
+            shape=(feature_dim, seq_len),
+            strides=(1, num_kv_heads * feature_dim),
+            offsets=(0, kv_block_idx * kv_block_size),
+            block_shape=(feature_dim, kv_block_size),
+            order=(0, 1)
+        )
+
+        v_block_ptr = tl.make_block_ptr(
+            base=v_ptr + (batch_offset * num_kv_heads + kv_head_idx) * feature_dim,
+            shape=(feature_dim, seq_len),
+            strides=(1, num_kv_heads * feature_dim),
+            offsets=(0, kv_block_idx * kv_block_size),
+            block_shape=(feature_dim, kv_block_size),
+            order=(0, 1)
+        )
+
+        k_block = tl.load(k_block_ptr).to(tl.bfloat16)
+        v_block = tl.load(v_block_ptr).to(tl.bfloat16)
+
+        attention_scores = tl.dot(q_tile, k_block)
+        attention_probs = tl.exp2(attention_scores - lse_vals[:, None]) * LN_2
+
+        grad_times_v = tl.dot(grad_o_tile, v_block)
+        grad_scores = attention_probs * (grad_times_v - delta_vals[:, None])
+        grad_q_accum += tl.dot(grad_scores.to(tl.bfloat16), tl.trans(k_block.to(tl.bfloat16)))
+
+    grad_q_accum = grad_q_accum * softmax_scale * LOG2_E
+    tl.store(grad_q_base_ptr + q_offsets, grad_q_accum.to(q_ptr.dtype.element_ty))
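A note on the `LOG2_E` / `LN_2` constants above: the kernels work in base 2 (`tl.exp2` maps to fast hardware instructions), using the identity exp(x) = 2^(x·log2 e), and recompute softmax probabilities from the saved base-2 log-sum-exp instead of re-reducing over keys. A stdlib-only sketch of that trick (all names here are illustrative, not from the repo):

```python
import math

LOG2_E = 1.44269504089  # log2(e), as hard-coded in the kernels
LN_2 = 0.69314718056    # ln(2); note LOG2_E * LN_2 == 1, which is why the
                        # two factors cancel between probs and the final grad

def softmax_probs_via_exp2(scores, lse2):
    # Recompute softmax probabilities the way the backward kernels do:
    # p_i = 2^(s_i * log2(e) - lse2), with lse2 = log2(sum_j exp(s_j))
    # saved by the forward pass.
    return [2.0 ** (s * LOG2_E - lse2) for s in scores]

scores = [0.5, -1.2, 2.0]
lse2 = math.log2(sum(math.exp(s) for s in scores))  # base-2 log-sum-exp
probs = softmax_probs_via_exp2(scores, lse2)
```

This recovers exactly the exp-based softmax without storing the full attention matrix, which is what lets the backward kernels rematerialize probabilities block by block.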
+
+
+@torch.compile
+@torch.no_grad()
+def mosaic_block_mask(
+    block_indices: torch.LongTensor,
+):
+    batch_size, num_blocks, num_heads, _ = block_indices.shape
+
+    block_mask = torch.zeros(
+        batch_size, num_blocks, num_heads, num_blocks,
+        dtype=torch.bool, device=block_indices.device
+    )
+
+    batch_idx = torch.arange(batch_size, device=block_indices.device)[:, None, None, None]
+    q_block_idx = torch.arange(num_blocks, device=block_indices.device)[None, :, None, None]
+    head_idx = torch.arange(num_heads, device=block_indices.device)[None, None, :, None]
+
+    block_mask[batch_idx, q_block_idx, head_idx, block_indices] = True
+
+    block_mask_transposed = block_mask.permute(0, 2, 3, 1).contiguous()
+
+    return block_mask_transposed
+
+
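`mosaic_block_mask` scatters each query block's selected KV indices into a dense boolean mask, then transposes it to `(batch, head, kv_block, q_block)` so the dK/dV kernel can scan query blocks for a fixed KV block. A pure-Python sketch of the same scatter with toy sizes (function and variable names are hypothetical):

```python
def block_mask_from_indices(block_indices, num_blocks):
    # block_indices[b][qb][h] lists the kv blocks selected for that
    # (batch, query-block, head). Returns mask[b][h][kv][qb] — already in
    # the transposed layout the dK/dV kernel iterates over.
    B = len(block_indices)
    H = len(block_indices[0][0])
    mask = [[[[False] * num_blocks for _ in range(num_blocks)]
             for _ in range(H)] for _ in range(B)]
    for b in range(B):
        for qb in range(num_blocks):
            for h in range(H):
                for kv in block_indices[b][qb][h]:
                    mask[b][h][kv][qb] = True  # duplicates collapse, as in scatter-assignment
    return mask

# toy example: 1 batch, 2 query blocks, 1 head, 2 selected kv blocks each
indices = [[[[0, 1]], [[1, 1]]]]
mask = block_mask_from_indices(indices, num_blocks=2)
```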
+@triton.autotune(
+    configs=get_autotuning_configs([16, 32]),
+    key=['seq_len', 'feature_dim'],
+)
+@triton.jit
+def mosaic_attn_bwd_kv_kernel(
+    q_ptr, k_ptr, v_ptr, lse_ptr, delta_ptr,
+    grad_o_ptr, grad_k_ptr, grad_v_ptr,
+    block_mask_ptr,
+    softmax_scale: tl.constexpr,
+    seq_len: tl.constexpr,
+    num_kv_heads: tl.constexpr,
+    num_q_heads: tl.constexpr,
+    q_heads_per_kv_head: tl.constexpr,
+    feature_dim: tl.constexpr,
+    kv_block_size: tl.constexpr,
+    q_tile_size: tl.constexpr,
+):
+    LOG2_E: tl.constexpr = 1.44269504089
+    LN_2: tl.constexpr = 0.69314718056
+
+    kv_block_id = tl.program_id(0)
+    batch_kv_head_id = tl.program_id(1)
+
+    batch_idx = batch_kv_head_id // num_kv_heads
+    kv_head_idx = batch_kv_head_id % num_kv_heads
+    batch_offset = batch_idx * seq_len
+
+    num_blocks_in_seq = seq_len // kv_block_size
+    tiles_per_block = kv_block_size // q_tile_size
+
+    fine_mask_start = (
+        batch_idx * num_kv_heads * num_blocks_in_seq * num_blocks_in_seq +
+        kv_head_idx * num_blocks_in_seq * num_blocks_in_seq +
+        kv_block_id * num_blocks_in_seq
+    )
+
+    k_block_ptr = tl.make_block_ptr(
+        k_ptr + (batch_offset * num_kv_heads + kv_head_idx) * feature_dim,
+        (seq_len, feature_dim), (num_kv_heads * feature_dim, 1),
+        (kv_block_id * kv_block_size, 0), (kv_block_size, feature_dim), (1, 0)
+    )
+
+    v_block_ptr = tl.make_block_ptr(
+        v_ptr + (batch_offset * num_kv_heads + kv_head_idx) * feature_dim,
+        (seq_len, feature_dim), (num_kv_heads * feature_dim, 1),
+        (kv_block_id * kv_block_size, 0), (kv_block_size, feature_dim), (1, 0)
+    )
+
+    grad_k_ptr = tl.make_block_ptr(
+        grad_k_ptr + (batch_offset * num_kv_heads + kv_head_idx) * feature_dim,
+        (seq_len, feature_dim), (num_kv_heads * feature_dim, 1),
+        (kv_block_id * kv_block_size, 0), (kv_block_size, feature_dim), (1, 0)
+    )
+
+    grad_v_ptr = tl.make_block_ptr(
+        grad_v_ptr + (batch_offset * num_kv_heads + kv_head_idx) * feature_dim,
+        (seq_len, feature_dim), (num_kv_heads * feature_dim, 1),
+        (kv_block_id * kv_block_size, 0), (kv_block_size, feature_dim), (1, 0)
+    )
+
+    k_block = tl.load(k_block_ptr).to(tl.bfloat16)
+    v_block = tl.load(v_block_ptr).to(tl.bfloat16)
+
+    grad_k_accum = tl.zeros([kv_block_size, feature_dim], dtype=tl.float32)
+    grad_v_accum = tl.zeros([kv_block_size, feature_dim], dtype=tl.float32)
+
+    for q_block_id in range(num_blocks_in_seq):
+        is_connected = tl.load(block_mask_ptr + fine_mask_start + q_block_id)
+
+        if is_connected:
+            for tile_in_block in range(tiles_per_block):
+                tile_idx = q_block_id * tiles_per_block + tile_in_block
+                q_tile_start = tile_idx * q_tile_size
+
+                q_tile_ptr = tl.make_block_ptr(
+                    base=q_ptr + (batch_offset + q_tile_start) * num_q_heads * feature_dim,
+                    shape=(q_tile_size, num_q_heads, feature_dim),
+                    strides=(num_q_heads * feature_dim, feature_dim, 1),
+                    offsets=(0, kv_head_idx * q_heads_per_kv_head, 0),
+                    block_shape=(q_tile_size, q_heads_per_kv_head, feature_dim),
+                    order=(0, 1, 2),
+                )
+
+                grad_o_tile_ptr = tl.make_block_ptr(
+                    base=grad_o_ptr + (batch_offset + q_tile_start) * num_q_heads * feature_dim,
+                    shape=(q_tile_size, num_q_heads, feature_dim),
+                    strides=(num_q_heads * feature_dim, feature_dim, 1),
+                    offsets=(0, kv_head_idx * q_heads_per_kv_head, 0),
+                    block_shape=(q_tile_size, q_heads_per_kv_head, feature_dim),
+                    order=(0, 1, 2),
+                )
+
+                lse_tile_ptr = tl.make_block_ptr(
+                    base=lse_ptr + (batch_offset + q_tile_start) * num_q_heads,
+                    shape=(q_tile_size, num_q_heads),
+                    strides=(num_q_heads, 1),
+                    offsets=(0, kv_head_idx * q_heads_per_kv_head),
+                    block_shape=(q_tile_size, q_heads_per_kv_head),
+                    order=(1, 0),
+                )
+
+                delta_tile_ptr = tl.make_block_ptr(
+                    base=delta_ptr + (batch_offset + q_tile_start) * num_q_heads,
+                    shape=(q_tile_size, num_q_heads),
+                    strides=(num_q_heads, 1),
+                    offsets=(0, kv_head_idx * q_heads_per_kv_head),
+                    block_shape=(q_tile_size, q_heads_per_kv_head),
+                    order=(1, 0),
+                )
+
+                q_tile = tl.load(q_tile_ptr) * softmax_scale * LOG2_E
+                q_tile = tl.reshape(q_tile, (q_tile_size * q_heads_per_kv_head, feature_dim))
+                q_tile = q_tile.to(tl.bfloat16)
+
+                grad_o_block = tl.load(grad_o_tile_ptr)
+                grad_o_block = tl.reshape(grad_o_block, (q_tile_size * q_heads_per_kv_head, feature_dim))
+                grad_o_block = grad_o_block.to(tl.bfloat16)
+
+                lse_vals = tl.load(lse_tile_ptr)
+                lse_vals = tl.reshape(lse_vals, (q_tile_size * q_heads_per_kv_head,))
+
+                delta_vals = tl.load(delta_tile_ptr)
+                delta_vals = tl.reshape(delta_vals, (q_tile_size * q_heads_per_kv_head,))
+
+                attention_scores = tl.dot(k_block, tl.trans(q_tile))
+                attention_probs = tl.exp2(attention_scores - lse_vals[None, :])
+                grad_v_accum += tl.dot(attention_probs.to(tl.bfloat16), grad_o_block)
+                grad_times_v = tl.dot(v_block, tl.trans(grad_o_block))
+                grad_scores = attention_probs * (grad_times_v - delta_vals[None, :]) * LN_2
+                grad_k_accum += tl.dot(grad_scores.to(tl.bfloat16), q_tile)
+
+    tl.store(grad_k_ptr, grad_k_accum.to(grad_k_ptr.dtype.element_ty))
+    tl.store(grad_v_ptr, grad_v_accum.to(grad_v_ptr.dtype.element_ty))
+
+
+def mosaic_attn_bwd(
+    q: torch.Tensor,
+    k: torch.Tensor,
+    v: torch.Tensor,
+    output: torch.Tensor,
+    lse: torch.Tensor,
+    grad_o: torch.Tensor,
+    softmax_scale: float,
+    block_indices: torch.LongTensor,
+    block_size: int,
+):
+    batch_size, seq_len, num_kv_heads, feature_dim = k.shape
+    num_q_heads = q.shape[2]
+    num_kv_blocks_per_q_block = block_indices.shape[-1]
+    q_heads_per_kv_head = num_q_heads // num_kv_heads
+    num_blocks_in_seq = seq_len // block_size
+
+    grad_q = torch.empty_like(q)
+    grad_k = torch.empty_like(k)
+    grad_v = torch.empty_like(v)
+
+    block_mask = mosaic_block_mask(block_indices)
+
+    delta = (output * grad_o).sum(dim=-1)
+
+    grid_dq = lambda META: (
+        triton.cdiv(seq_len, META['q_tile_size']),
+        q_heads_per_kv_head,
+        batch_size * num_kv_heads
+    )
+
+    mosaic_attn_bwd_q_kernel[grid_dq](
+        q_ptr=q,
+        k_ptr=k,
+        v_ptr=v,
+        lse_ptr=lse,
+        delta_ptr=delta,
+        grad_o_ptr=grad_o,
+        grad_q_ptr=grad_q,
+        block_indices_ptr=block_indices,
+        softmax_scale=softmax_scale,
+        seq_len=seq_len,
+        num_kv_heads=num_kv_heads,
+        num_q_heads=num_q_heads,
+        q_heads_per_kv_head=q_heads_per_kv_head,
+        feature_dim=feature_dim,
+        kv_block_size=block_size,
+        num_kv_blocks_per_q_block=num_kv_blocks_per_q_block,
+    )
+
+    grid_dkv = (num_blocks_in_seq, batch_size * num_kv_heads)
+
+    mosaic_attn_bwd_kv_kernel[grid_dkv](
+        q_ptr=q,
+        k_ptr=k,
+        v_ptr=v,
+        lse_ptr=lse,
+        delta_ptr=delta,
+        grad_o_ptr=grad_o,
+        grad_k_ptr=grad_k,
+        grad_v_ptr=grad_v,
+        block_mask_ptr=block_mask,
+        softmax_scale=softmax_scale,
+        seq_len=seq_len,
+        num_kv_heads=num_kv_heads,
+        num_q_heads=num_q_heads,
+        q_heads_per_kv_head=q_heads_per_kv_head,
+        feature_dim=feature_dim,
+        kv_block_size=block_size,
+    )
+
+    return grad_q, grad_k, grad_v
+
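The `delta = (output * grad_o).sum(dim=-1)` precomputation above is the standard FlashAttention backward identity: the gradient of the scores is `dS = P ⊙ (dO·Vᵀ − delta)`. A scalar, stdlib-only finite-difference check of that identity for a single query (toy values; names are illustrative):

```python
import math

def attn_out(s, v):
    # softmax over scores s, then a weighted sum of scalar values v
    m = max(s)
    e = [math.exp(x - m) for x in s]
    z = sum(e)
    p = [x / z for x in e]
    return sum(pi * vi for pi, vi in zip(p, v)), p

s = [0.3, -0.7, 1.1]
v = [2.0, -1.0, 0.5]
o, p = attn_out(s, v)

# analytic gradient of o w.r.t. the scores: dS_j = p_j * (v_j - delta),
# where delta = sum(O * dO) and dO = 1 for this scalar output, so delta = o
delta = o
grad_s = [pj * (vj - delta) for pj, vj in zip(p, v)]

# central finite differences as a ground truth
eps = 1e-6
fd = []
for j in range(3):
    s_hi = list(s); s_hi[j] += eps
    s_lo = list(s); s_lo[j] -= eps
    fd.append((attn_out(s_hi, v)[0] - attn_out(s_lo, v)[0]) / (2 * eps))
```

This is exactly the per-element form of `grad_scores = attention_probs * (grad_times_v - delta_vals)` used in both backward kernels.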
+
+class MosaicAttnFunction(torch.autograd.Function):
+
+    @staticmethod
+    @torch.amp.custom_fwd(device_type='cuda')
+    def forward(
+        ctx: torch.autograd.function.FunctionCtx,
+        q: torch.Tensor,
+        k: torch.Tensor,
+        v: torch.Tensor,
+        block_indices: torch.Tensor,
+        block_size: int,
+        softmax_scale: float
+    ):
+        q, k, v, block_indices = map(lambda x: x.contiguous(), (q, k, v, block_indices))
+
+        ctx.dtype = q.dtype
+
+        output, lse = mosaic_attn_fwd(
+            q=q, k=k, v=v,
+            block_indices=block_indices,
+            block_size=block_size,
+            softmax_scale=softmax_scale,
+        )
+
+        ctx.save_for_backward(q, k, v, output, lse, block_indices)
+        ctx.block_size = block_size
+        ctx.softmax_scale = softmax_scale
+
+        return output.to(q.dtype)
+
+    @staticmethod
+    @torch.amp.custom_bwd(device_type='cuda')
+    def backward(
+        ctx: torch.autograd.function.FunctionCtx,
+        grad_o: torch.Tensor
+    ):
+        q, k, v, output, lse, block_indices = ctx.saved_tensors
+        grad_o = grad_o.contiguous()
+        grad_q, grad_k, grad_v = mosaic_attn_bwd(
+            q=q, k=k, v=v, output=output, lse=lse, grad_o=grad_o,
+            softmax_scale=ctx.softmax_scale,
+            block_indices=block_indices,
+            block_size=ctx.block_size,
+        )
+        return grad_q, grad_k, grad_v, None, None, None
+
+
+def mosaic_sparse_attn(
+    q: torch.Tensor,
+    k: torch.Tensor,
+    v: torch.Tensor,
+    block_indices: torch.LongTensor,
+    block_size: int,
+    softmax_scale: float = None,
+):
+    softmax_scale = q.shape[-1] ** -0.5 if softmax_scale is None else softmax_scale
+    return MosaicAttnFunction.apply(q, k, v, block_indices, block_size, softmax_scale)
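Semantically, `mosaic_sparse_attn` computes ordinary softmax attention in which each query block attends only to its listed KV blocks. A stdlib-only reference for one batch and one head (the real op runs on `(batch, seq, heads, dim)` GPU tensors; shapes and names here are toy assumptions):

```python
import math

def sparse_block_attention(q, k, v, block_indices, block_size):
    # q, k, v: seq_len lists of dim-length vectors.
    # block_indices[qb] lists the kv blocks visible to query block qb;
    # softmax is taken only over the gathered keys, mirroring the kernel.
    dim = len(q[0])
    scale = dim ** -0.5
    out = []
    for i, qi in enumerate(q):
        kv_blocks = block_indices[i // block_size]
        cols = [c for b in kv_blocks
                for c in range(b * block_size, (b + 1) * block_size)]
        scores = [scale * sum(qi[d] * k[c][d] for d in range(dim)) for c in cols]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([
            sum(weights[j] * v[cols[j]][d] for j in range(len(cols))) / z
            for d in range(dim)
        ])
    return out
```

Each output row is a convex combination of the gathered value rows, so properties of the values (e.g. row sums) are preserved.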
primitives.py ADDED
@@ -0,0 +1,359 @@
+"""
+Primitive building blocks for the Mosaic transformer.
+
+Components:
+- Block-sparse attention with learned strategy weighting (local block, compressed,
+  and top-k selection branches combined with a learned gate)
+- Rotary positional embeddings (RoPE) for 2D lon/lat
+- Cross-attention interpolation between grids
+- HEALPix spatial up/downsampling
+- Conditional SwiGLU FFN with noise injection
+"""
+
+import torch
+import torch.nn as nn
+from einops import rearrange, reduce, repeat
+from torch.nn import RMSNorm
+
+try:
+    from flash_attn import flash_attn_func  # FlashAttention v2
+except ImportError:
+    import flash_attn_interface as fa  # FlashAttention v3
+    flash_attn_func = fa.flash_attn_func
+
+from utils import get_healpix_grid, get_neighbors, rad_to_xyz
+from ops import mosaic_sparse_attn
+
+
+def block_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, block_size: int):
+    batch_size = q.shape[0]
+    q, k, v = map(lambda x: rearrange(x, 'b (nb bs) h d -> (b nb) bs h d', bs=block_size), (q, k, v))
+    o_ba = flash_attn_func(q, k, v)
+    return rearrange(o_ba, '(b nb) bs h d -> b (nb bs) h d', b=batch_size)
+
+
+@torch.no_grad()
+def attn_topk(q: torch.Tensor, k: torch.Tensor, block_count: int):
+    Hq, Hk = q.shape[2], k.shape[2]
+    G = Hq // Hk
+    k = k.repeat_interleave(G, dim=2)
+
+    scores = torch.matmul(
+        rearrange(q, 'b t h d -> b h t d'),
+        rearrange(k, 'b t h d -> b h d t')
+    )
+
+    if Hq != Hk:
+        scores = reduce(scores, 'b (g h) t k -> b h t k', 'mean', g=G)
+
+    scores = rearrange(scores, 'b h t k -> b t h k')
+    top_indices = scores.topk(k=block_count, dim=-1, largest=True)[1]
+    return top_indices
+
+
+def mosaic_attn_func(
+    q, k, v,
+    weight_ba_cmp_slc,
+    block_attn_size, sparse_block_size, sparse_block_count,
+    block_attn_only, no_compression=False,
+):
+    o_ba = block_attention(q, k, v, block_attn_size)
+
+    if block_attn_only:
+        return o_ba
+
+    q_cmp = reduce(q, 'b (nb bs) h d -> b nb h d', 'mean', bs=sparse_block_size)
+    k_cmp = reduce(k, 'b (nb bs) h d -> b nb h d', 'mean', bs=sparse_block_size)
+
+    if no_compression:
+        block_indices = attn_topk(q_cmp, k_cmp, sparse_block_count)
+        o_slc = mosaic_sparse_attn(q, k, v, block_indices, sparse_block_size)
+        w_ba = weight_ba_cmp_slc[0]
+        w_slc = weight_ba_cmp_slc[2]
+        w_sum = w_ba + w_slc + 1e-8
+        return o_ba * (w_ba / w_sum) + o_slc * (w_slc / w_sum)
+
+    v_cmp = reduce(v, 'b (nb bs) h d -> b nb h d', 'mean', bs=sparse_block_size)
+    o_cmp = flash_attn_func(q_cmp, k_cmp, v_cmp)
+    o_cmp = o_cmp.repeat_interleave(sparse_block_size, dim=1)
+
+    if sparse_block_count == 0:
+        w_ba = weight_ba_cmp_slc[0]
+        w_cmp = weight_ba_cmp_slc[1]
+        w_sum = w_ba + w_cmp + 1e-8
+        return o_ba * (w_ba / w_sum) + o_cmp * (w_cmp / w_sum)
+
+    block_indices = attn_topk(q_cmp, k_cmp, sparse_block_count)
+    o_slc = mosaic_sparse_attn(q, k, v, block_indices, sparse_block_size)
+
+    return o_ba * weight_ba_cmp_slc[0] + o_cmp * weight_ba_cmp_slc[1] + o_slc * weight_ba_cmp_slc[2]
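The three branches (local block, compressed, top-k selected) are mixed with per-head weights that come from a softmax over learned logits, so the result is always a convex combination of branch outputs. A minimal sketch of that gating for a single position (names are illustrative, not from the repo):

```python
import math

def mix_strategies(logits, o_ba, o_cmp, o_slc):
    # softmax over the three strategy logits, then a convex combination of
    # the block, compressed and selected attention outputs (elementwise)
    m = max(logits)
    e = [math.exp(l - m) for l in logits]
    z = sum(e)
    w = [x / z for x in e]  # w sums to 1
    return [w[0] * a + w[1] * b + w[2] * c for a, b, c in zip(o_ba, o_cmp, o_slc)]

mixed = mix_strategies([0.2, -1.0, 0.8], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5])
```

In the `no_compression` / `sparse_block_count == 0` paths above, the unused branch's weight is dropped and the remaining two are renormalized, which is what the `w_sum = ... + 1e-8` divisions implement.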
+
+
+class cSwiGLU(nn.Module):
+    def __init__(self, dim: int, hidden_dim: int, noise_dim: int):
+        super().__init__()
+        self.w13 = nn.Linear(dim, 2 * hidden_dim, bias=False)
+        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
+        self.act_fn = nn.SiLU()
+
+        if noise_dim > 0:
+            self.noise_bias = nn.Linear(noise_dim, hidden_dim, bias=False)
+
+    def forward(self, x: torch.Tensor, z: torch.Tensor = None):
+        noise = self.noise_bias(z).unsqueeze(0) if z is not None else 0
+        x1, x3 = self.w13(x).chunk(2, dim=-1)
+        return self.w2(self.act_fn(x1 + noise) * x3)
+
+
+class RoPE(nn.Module):
+    def __init__(self, dim, theta=10000):
+        super().__init__()
+        assert dim % 2 == 0
+        self.dim = dim
+        self.theta = theta
+
+    def initialize_rope(self, positions):
+        base_freqs = 1. / (self.theta ** (torch.arange(0, self.dim // 2, 2).float() / (self.dim // 2)))
+        lon_pos = torch.deg2rad(positions[:, 0:1])
+        lat_pos = torch.deg2rad(positions[:, 1:2])
+        lon_freqs = torch.matmul(lon_pos, base_freqs.unsqueeze(0))
+        lat_freqs = torch.matmul(lat_pos, base_freqs.unsqueeze(0))
+        freqs = torch.cat([lon_freqs, lat_freqs], dim=-1)
+        self.register_buffer('cos_freqs', freqs.cos().contiguous(), persistent=True)
+        self.register_buffer('sin_freqs', freqs.sin().contiguous(), persistent=True)
+
+    @staticmethod
+    def rotate_half(x):
+        x = rearrange(x, '... (d r) -> ... d r', r=2)
+        x1, x2 = x.unbind(dim=-1)
+        x = torch.stack((-x2, x1), dim=-1)
+        return rearrange(x, '... d r -> ... (d r)')
+
+    def forward(self, x):
+        cos = self.cos_freqs.unsqueeze(0).unsqueeze(2).repeat_interleave(2, dim=-1)
+        sin = self.sin_freqs.unsqueeze(0).unsqueeze(2).repeat_interleave(2, dim=-1)
+        return (x.float() * cos + self.rotate_half(x.float()) * sin).to(x.dtype)
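The `x*cos + rotate_half(x)*sin` form with `rotate_half: (x1, x2) -> (-x2, x1)` is a plane rotation of each consecutive feature pair, so it preserves vector norms (and hence dot-product magnitudes depend only on relative position). A stdlib-only sketch of the per-pair rotation (names here are illustrative; in the module, half the pairs carry longitude angles and half latitude angles, with `repeat_interleave(2, dim=-1)` duplicating each frequency so both members of a pair share the angle):

```python
import math

def rope_rotate(vec, angles):
    # rotate consecutive pairs (x1, x2) of vec by the given angles,
    # mirroring x*cos + rotate_half(x)*sin with rotate_half = (-x2, x1)
    out = []
    for (x1, x2), a in zip(zip(vec[0::2], vec[1::2]), angles):
        out += [x1 * math.cos(a) - x2 * math.sin(a),
                x1 * math.sin(a) + x2 * math.cos(a)]
    return out

vec = [1.0, 2.0, -0.5, 0.25]
rot = rope_rotate(vec, [0.3, 1.2])
```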
+
+
+class MosaicAttention(nn.Module):
+    def __init__(self, config, block_attn_only: bool, no_compression: bool = False):
+        super().__init__()
+        self.block_attn_only = block_attn_only
+        self.no_compression = no_compression
+        self.block_attn_size = config.block_attn_size
+        self.sparse_block_size = config.sparse_block_size
+        self.sparse_block_count = config.sparse_block_count
+
+        q_heads = config.num_heads
+        gqa_ratio = config.gqa_ratio
+        dim = config.dim
+        qkv_compress_ratio = config.qkv_compress_ratio
+        rope = config.rope
+        rope_theta = config.rope_theta
+
+        kv_heads = q_heads // gqa_ratio
+        head_dim = int(dim // q_heads // qkv_compress_ratio)
+
+        self.q_heads = q_heads
+        self.kv_heads = kv_heads
+
+        self.to_q = nn.Linear(dim, q_heads * head_dim, bias=False)
+        self.to_k = nn.Linear(dim, kv_heads * head_dim, bias=False)
+        self.to_v = nn.Linear(dim, kv_heads * head_dim, bias=False)
+        self.to_o = nn.Linear(q_heads * head_dim, dim, bias=False)
+
+        self.q_rope = RoPE(head_dim, rope_theta) if rope else None
+        self.k_rope = RoPE(head_dim, rope_theta) if rope else None
+
+        if block_attn_only:
+            self.to_strategy_combine_mlp = None
+        else:
+            self.to_strategy_combine_mlp = nn.Linear(dim, 3 * q_heads, bias=False)
+
+    def generate_strategy_weights(self, x):
+        if self.block_attn_only:
+            return [None, None, None]
+        strategy_logits = self.to_strategy_combine_mlp(x)
+        strategy_logits = rearrange(strategy_logits, 't b (h s) -> s b t h', h=self.q_heads)
+        strategy_weights = torch.softmax(strategy_logits.float(), dim=0).type_as(x)
+        strategy_weights = strategy_weights.unsqueeze(-1)
+        return strategy_weights
+
+    def forward(self, x):
+        q = self.to_q(x)
+        k = self.to_k(x)
+        v = self.to_v(x)
+
+        strategy_weights = self.generate_strategy_weights(x)
+
+        q = rearrange(q, 's b (h d) -> b s h d', h=self.q_heads)
+        k = rearrange(k, 's b (h d) -> b s h d', h=self.kv_heads)
+        v = rearrange(v, 's b (h d) -> b s h d', h=self.kv_heads)
+
+        if self.q_rope is not None:
+            q = self.q_rope(q)
+            k = self.k_rope(k)
+
+        output = mosaic_attn_func(
+            q=q, k=k, v=v,
+            weight_ba_cmp_slc=strategy_weights,
+            block_attn_size=self.block_attn_size,
+            sparse_block_size=self.sparse_block_size,
+            sparse_block_count=self.sparse_block_count,
+            block_attn_only=self.block_attn_only,
+            no_compression=self.no_compression,
+        )
+
+        output = rearrange(output, 'b s h d -> s b (h d)')
+        output = self.to_o(output)
+        return output
+
+
+class MosaicBlock(nn.Module):
+    def __init__(self, config, block_attn_only: bool, no_compression: bool = False):
+        super().__init__()
+        dim = config.dim
+        noise_dim = config.noise_dim
+        mlp_ratio = config.mlp_ratio
+
+        self.attention = MosaicAttention(config, block_attn_only, no_compression)
+        self.norm1 = RMSNorm(dim, elementwise_affine=config.rmsnorm_elementwise_affine)
+        self.norm2 = RMSNorm(dim, elementwise_affine=config.rmsnorm_elementwise_affine)
+        self.ffn = cSwiGLU(dim, int(dim * mlp_ratio), noise_dim)
+
+    def forward(self, x: torch.Tensor, z: torch.Tensor = None):
+        x = x + self.attention(self.norm1(x))
+        x = x + self.ffn(self.norm2(x), z)
+        return x
+
+
+class CrossAttentionInterpolate(nn.Module):
+    space_dim = 3
+
+    def __init__(self, config):
+        super().__init__()
+        self.k_neighbors = config.k_neighbors
+
+        dim = config.dim
+        num_heads = config.num_heads
+
+        head_dim = dim // num_heads
+        self.num_heads = num_heads
+        self.head_dim = head_dim
+        self.scale = head_dim ** -0.5
+
+        self.kv_norm = RMSNorm(dim, elementwise_affine=config.rmsnorm_elementwise_affine)
+        self.to_q = nn.Linear(self.space_dim, dim, bias=False)
+        self.to_kv = nn.Linear(dim, 2 * dim, bias=False)
+        self.to_o = nn.Linear(dim, dim, bias=False)
+
+    @torch.no_grad()
+    def initialize_interpolation_scheme(self, pos_from_rad, pos_to_rad):
+        neighbors_np = get_neighbors(pos_from_rad.cpu().numpy(), pos_to_rad.cpu().numpy(), k=self.k_neighbors)
+        neighbors = torch.from_numpy(neighbors_np).long().to(pos_from_rad.device).contiguous()
+
+        pos_to_xyz = rad_to_xyz(pos_to_rad)
+        pos_from_xyz = rad_to_xyz(pos_from_rad)
+
+        rel_pos_xyz = (pos_to_xyz.unsqueeze(1) - pos_from_xyz[neighbors]).contiguous()
+        norm_rel_pos_xyz = torch.nn.functional.normalize(rel_pos_xyz, dim=-1).contiguous()
+
+        self.register_buffer('neighbors', neighbors, persistent=True)
+        self.register_buffer('rel_pos', norm_rel_pos_xyz, persistent=True)
+
+    def forward(self, x_from: torch.Tensor):
+        # The buffers only exist after initialize_interpolation_scheme has run,
+        # so guard with getattr instead of a plain attribute access.
+        if getattr(self, 'neighbors', None) is None or getattr(self, 'rel_pos', None) is None:
+            raise ValueError("Interpolation scheme not initialized.")
+
+        q = self.to_q(self.rel_pos)
+        q = rearrange(q, 's k (h d) -> s k 1 h d', h=self.num_heads)
+
+        x = self.kv_norm(x_from)
+
+        kv = self.to_kv(x)
+        kv = rearrange(kv, 's b (n h d) -> n s b h d', h=self.num_heads, n=2)
+
+        k, v = kv[:, self.neighbors]
+
+        attn_scores = (q * k).sum(dim=-1, keepdim=True) * self.scale
+        attn_weights = torch.softmax(attn_scores, dim=1, dtype=torch.float32).type_as(k)
+        out = (attn_weights * v).sum(dim=1)
+
+        out = rearrange(out, 's b h d -> s b (h d)')
+        out = self.to_o(out)
+        return out
+
+
+class NoiseGenerator(nn.Module):
+    def __init__(self, noise_dim: int, seed: int):
+        super().__init__()
+        self.seed = seed
+        self.to_noise = nn.Linear(noise_dim, noise_dim, bias=False)
+        self.generator = None
+
+    def forward(self, num_samples: int, device: torch.device, dtype: torch.dtype):
+        if self.generator is None:
+            self.generator = torch.Generator(device=device)
+            self.generator.manual_seed(self.seed)
+
+        noise = torch.randn((num_samples, self.to_noise.in_features),
+                            generator=self.generator, device=device, dtype=dtype)
+        noise = self.to_noise(noise)
+        return noise
+
+
+class HEALPixDownsample(nn.Module):
+    space_dim: int = 3
+
+    def __init__(self, in_dim, out_dim, nside_before, nside_after,
+                 rmsnorm_elementwise_affine=True):
+        super().__init__()
+        self.factor = (nside_before // nside_after) ** 2
+
+        self.proj_x = nn.Linear(self.factor * in_dim, out_dim, bias=False)
+        self.proj_pos = nn.Linear(self.factor * self.space_dim, out_dim, bias=False)
+        self.norm = RMSNorm(out_dim, elementwise_affine=rmsnorm_elementwise_affine)
+
+        hp_grid_fine_xyz = rad_to_xyz(torch.deg2rad(get_healpix_grid(nside_before)))
+        hp_grid_coarse_xyz = rad_to_xyz(torch.deg2rad(get_healpix_grid(nside_after)))
+
+        pos = rearrange(hp_grid_fine_xyz, '(n f) d -> n f d', f=self.factor)
+        rel_pos = rearrange(pos - hp_grid_coarse_xyz[:, None], 'n f d -> n (f d)')
+        rel_pos = (rel_pos - rel_pos.mean(dim=0, keepdim=True)) / (rel_pos.std(dim=0, keepdim=True) + 1e-6)
+
+        self.register_buffer('rel_pos', rel_pos.contiguous(), persistent=True)
+
+    def forward(self, x: torch.Tensor):
+        x = rearrange(x, '(n f) b c -> n b (f c)', f=self.factor)
+        x = self.proj_x(x) + self.proj_pos(self.rel_pos).unsqueeze(1)
+        x = self.norm(x)
+        return x
+
+
+class HEALPixUpsample(nn.Module):
+    space_dim: int = 3
+
+    def __init__(self, in_dim, out_dim, nside_before, nside_after,
+                 rmsnorm_elementwise_affine=True):
+        super().__init__()
+        self.factor = (nside_after // nside_before) ** 2
+
+        self.proj_x = nn.Linear(in_dim, out_dim * self.factor, bias=False)
+        self.proj_pos = nn.Linear(self.factor * self.space_dim, out_dim * self.factor, bias=False)
+        self.norm = RMSNorm(out_dim, elementwise_affine=rmsnorm_elementwise_affine)
+
+        hp_grid_fine_xyz = rad_to_xyz(torch.deg2rad(get_healpix_grid(nside_after)))
+        hp_grid_coarse_xyz = rad_to_xyz(torch.deg2rad(get_healpix_grid(nside_before)))
+
+        children_pos_reshaped = rearrange(hp_grid_fine_xyz, '(n f) d -> n f d', f=self.factor)
+        rel_pos = rearrange(children_pos_reshaped - hp_grid_coarse_xyz[:, None], 'n f d -> n (f d)')
+        rel_pos = (rel_pos - rel_pos.mean(dim=0, keepdim=True)) / (rel_pos.std(dim=0, keepdim=True) + 1e-6)
+
+        self.register_buffer('rel_pos', rel_pos.contiguous(), persistent=True)
+
+    def forward(self, x: torch.Tensor, shortcut: torch.Tensor):
+        x = self.proj_x(x) + self.proj_pos(self.rel_pos).unsqueeze(1)
+        x = rearrange(x, 'n b (f d) -> (n f) b d', f=self.factor)
+        x = x + shortcut
+        x = self.norm(x)
+        return x
requirements.txt ADDED
@@ -0,0 +1,11 @@
+torch>=2.0
+einops>=0.6
+healpy>=1.16
+scikit-learn>=1.0
+numpy>=1.24
+zarr>=2.10
+pandas>=1.5
+triton>=2.0
+flash-attn>=2.0
+# For reading WeatherBench2 zarr stores directly from gs://
+gcsfs>=2023.0
utils.py ADDED
@@ -0,0 +1,40 @@
+import torch
+import numpy as np
+import healpy as hp
+from sklearn.neighbors import BallTree
+
+
+def rad_to_xyz(lonlat: torch.Tensor):
+    """Convert lon-lat (in radians) to unit-sphere xyz."""
+    lon = lonlat[..., 0]
+    lat = lonlat[..., 1]
+
+    x = torch.cos(lat) * torch.cos(lon)
+    y = torch.cos(lat) * torch.sin(lon)
+    z = torch.sin(lat)
+
+    return torch.stack([x, y, z], dim=-1)
+
+
+def get_healpix_grid(nside: int) -> torch.Tensor:
+    """Return HEALPix (nested) grid coordinates in degrees as (lon, lat), shape (npix, 2)."""
+    indices = np.arange(hp.nside2npix(nside))
+    theta, phi = hp.pix2ang(nside, indices, nest=True)
+
+    phi = np.rad2deg(phi)
+    theta = 90. - np.rad2deg(theta)
+
+    phi = torch.from_numpy(phi)
+    theta = torch.from_numpy(theta)
+
+    return torch.stack((phi, theta), dim=-1).float()
+
+
+def get_neighbors(pos_from: np.ndarray, pos_to: np.ndarray, k: int = 8) -> np.ndarray:
+    """Build a BallTree over (lon, lat) positions in radians and return the
+    indices of the k nearest neighbors of each target point (haversine metric,
+    which expects (lat, lon) order, hence the column flip)."""
+    pos_from_rad = pos_from[:, ::-1]
+    pos_to_rad = pos_to[:, ::-1]
+
+    tree = BallTree(pos_from_rad, metric='haversine')
+    _, neighbors = tree.query(pos_to_rad, k=k)
+    return neighbors
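`rad_to_xyz` maps lon/lat to points on the unit sphere, which is what makes the relative-position features in the HEALPix and interpolation modules well behaved (norm-1 coordinates, no wrap-around at the dateline). A stdlib-only sketch of the same mapping for a single point (function name is illustrative):

```python
import math

def lonlat_rad_to_xyz(lon, lat):
    # same mapping as rad_to_xyz, for a single (lon, lat) pair in radians
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

p = lonlat_rad_to_xyz(math.radians(30.0), math.radians(-45.0))
```

Every output lies on the unit sphere, and the poles map to (0, 0, ±1) regardless of longitude.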