---
license: cc-by-4.0
tags:
  - mars
  - remote-sensing
  - vision-transformer
  - foundation-model
  - model-merging
  - planetary-science
---

# MOMO: Mars Orbital Model

**MOMO** is the first multi-sensor foundation model for Mars remote sensing, accepted at **CVPR 2026**.

It integrates representations learned independently from three Martian orbital sensors (HiRISE, CTX, and THEMIS), spanning resolutions from 0.25 m/pixel to 100 m/pixel. The sensor-specific models are combined via task arithmetic model merging with a novel **Equal Validation Loss (EVL)** checkpoint selection strategy.
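Task arithmetic merges models by adding weighted task vectors (the differences between each specialist's weights and a shared initialization) back onto that initialization. The sketch below illustrates the general idea on plain state dicts; the function name, the single scaling factor `lam`, and the toy tensors are illustrative assumptions, not taken from the MOMO code:

```python
import torch

def task_arithmetic_merge(base, experts, lam=1.0):
    """Merge expert state dicts into a base via task vectors.

    merged = base + lam * sum_i (expert_i - base)

    All arguments are plain {name: tensor} state dicts; `lam` is an
    illustrative scalar (real merging recipes may tune it per model).
    """
    merged = {}
    for name, w0 in base.items():
        # Sum of task vectors for this parameter across all experts
        task_sum = sum(e[name] - w0 for e in experts)
        merged[name] = w0 + lam * task_sum
    return merged

# Toy example with one-parameter "models"
base = {"w": torch.tensor([0.0, 0.0])}
e1 = {"w": torch.tensor([1.0, 0.0])}
e2 = {"w": torch.tensor([0.0, 2.0])}
merged = task_arithmetic_merge(base, [e1, e2], lam=0.5)
# merged["w"] → tensor([0.5, 1.0])
```

EVL then selects *which* checkpoints of each sensor-specific model enter this merge, by matching their validation losses; see the paper for the exact criterion.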

[![arXiv](https://img.shields.io/badge/arXiv-2604.02719-b31b1b.svg?logo=arxiv&logoColor=white)](https://arxiv.org/abs/2604.02719) [![GitHub](https://img.shields.io/badge/GitHub-kerner--lab%2FMOMO-black?logo=github&logoColor=white)](https://github.com/kerner-lab/MOMO)

---

## Checkpoints

Each model size includes 5 checkpoints:

| File | Description |
|------|-------------|
| `ctx.pth` | Pre-trained on CTX (ConTeXt Camera) |
| `hirise.pth` | Pre-trained on HiRISE (High Resolution Imaging Science Experiment) |
| `themis.pth` | Pre-trained on THEMIS (THermal EMission Imaging System) |
| `hirise_ctx_themis.pth` | Pre-trained jointly on all three sensors |
| `momo.pth` | **MOMO** merged model via task arithmetic + EVL (main contribution) |

Each checkpoint is available for three ViT architectures (all with patch size 16):

| Folder | Architecture |
|--------|-------------|
| `vit-s-16/` | ViT-small |
| `vit-b-16/` | ViT-base |
| `vit-l-16/` | ViT-large |

ViT-base is the primary model reported in the main paper. ViT-small and ViT-large results are reported in the supplementary material.
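Combining the two tables above, the full set of checkpoint paths in this repository can be enumerated with a simple cross product of folders and filenames (paths follow the `folder/file` layout shown):

```python
# Folders (architectures) and filenames (checkpoints) from the tables above
sizes = ["vit-s-16", "vit-b-16", "vit-l-16"]
ckpts = ["ctx.pth", "hirise.pth", "themis.pth", "hirise_ctx_themis.pth", "momo.pth"]

# 3 architectures x 5 checkpoints = 15 files in the repository
paths = [f"{size}/{ckpt}" for size in sizes for ckpt in ckpts]
```

Any of these strings can be passed as `filename=` to `hf_hub_download` as shown in the Usage section below.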

---

## Usage

```python
import torch
from huggingface_hub import hf_hub_download

# Download MOMO ViT-Base checkpoint
path = hf_hub_download(repo_id="Mirali33/MOMO", filename="vit-b-16/momo.pth")
checkpoint = torch.load(path, map_location="cpu", weights_only=False)
```

For full training and fine-tuning code, see the [MOMO GitHub repository](https://github.com/kerner-lab/MOMO).
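The exact key layout of the checkpoint dicts is not documented here. If the weights turn out to be stored under a wrapper prefix (e.g. `module.` from `torch.nn.DataParallel`), a small helper like the following, a sketch assuming such a prefix, can normalize the keys before calling `load_state_dict`:

```python
import torch

def strip_prefix(state_dict, prefix="module."):
    """Drop a wrapper prefix (where present) from every key in a state dict."""
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }

# Toy example: one wrapped and one unwrapped key (names are illustrative)
sd = {
    "module.patch_embed.proj.weight": torch.zeros(1),
    "cls_token": torch.zeros(1),
}
clean = strip_prefix(sd)
# clean has keys "patch_embed.proj.weight" and "cls_token"
```

Consult the GitHub repository for the authoritative loading code before relying on any particular key convention.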

---

## Training Data

MOMO is pre-trained on approximately 12 million samples (4M per sensor) from Mars orbital imagery:
- **HiRISE**: 0.25 m/pixel high-resolution visible spectrum images
- **CTX**: 5 m/pixel context camera images
- **THEMIS**: 100 m/pixel thermal infrared images

---

## Evaluation

MOMO is evaluated on 9 downstream tasks from [Mars-Bench](https://mars-bench.github.io/) (4 classification, 5 segmentation), outperforming ImageNet pre-training, Earth observation foundation models (SatMAE, CROMA, Prithvi, TerraFM), sensor-specific pre-training, and fully supervised baselines.

---

## Citation

```bibtex
@inproceedings{purohit2026momo,
    title={MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications},
    author={Mirali Purohit and Bimal Gajera and Irish Mehta and Bhanu Tokas and Jacob Adler and Steven Lu and Scott Dickenshied and Serina Diniega and Brian Bue and Umaa Rebbapragada and Hannah Kerner},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2026},
    url={https://arxiv.org/abs/2604.02719}
}
```