Galaxy-CLIP (Fine-tuned)

Model Description

Galaxy-CLIP is a domain-adapted vision-language model fine-tuned from OpenAI’s CLIP on astronomical galaxy imagery paired with synthetic natural language descriptions.

The model learns a shared embedding space between galaxy images and descriptive text, enabling tasks such as:

  • Zero-shot galaxy classification
  • Image–text retrieval (e.g. “spiral galaxy with prominent arms”)
  • Semantic search over astronomical datasets
  • Embedding generation for downstream astronomy models

This model is designed to improve CLIP’s understanding of astronomical morphology, which is poorly represented in generic internet-scale datasets.


Model Details

  • Base model: OpenAI CLIP (ViT-based)
  • Architecture: Dual encoder (image encoder + text encoder)
  • Task: Contrastive image-text learning
  • Parameters: ~200M
  • Framework: Hugging Face Transformers
  • Author: Michael Jupp (juppy44)

Training Data

The model was fine-tuned on:

  • Galaxy images (astronomy datasets, e.g. Galaxy Zoo-style data)
  • Text descriptions: astronolan/galaxy-descriptions

These descriptions are VLM-generated captions describing galaxy morphology and structure, such as:

  • “barred spiral galaxy with tightly wound arms”
  • “elliptical galaxy with smooth light distribution”
  • “irregular galaxy with asymmetrical structure”

This setup enables weakly-supervised multimodal learning without requiring manual annotation.


Training Procedure

The model was fine-tuned using standard CLIP-style contrastive learning:

  • Paired (image, text) samples
  • Cross-modal similarity objective (InfoNCE loss)
  • Batch-wise matching of correct vs incorrect pairs

Typical pipeline:

  • Image encoder → visual embeddings
  • Text encoder → language embeddings
  • Cosine similarity used for alignment

Fine-tuning shifts the model from generic vision-language understanding to astronomy-specific semantics, which is critical because base CLIP has little exposure to galaxy morphology.
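
The contrastive objective above can be sketched in plain PyTorch. This is an illustrative reimplementation of symmetric InfoNCE, not the actual training code; the batch uses random tensors in place of real encoder outputs, and the temperature value is an assumption:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalise both sets of embeddings
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise cosine similarities, scaled by temperature
    logits = image_emb @ text_emb.t() / temperature
    # The i-th image matches the i-th text in the batch
    targets = torch.arange(logits.size(0))
    # Symmetric InfoNCE: image-to-text and text-to-image
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Toy batch: 4 (image, text) pairs with 512-dim embeddings
torch.manual_seed(0)
img = torch.randn(4, 512)
txt = torch.randn(4, 512)
loss = clip_contrastive_loss(img, txt)
print(loss.item())
```

Minimising this loss pulls matching (image, text) pairs together in the shared embedding space while pushing mismatched pairs apart, batch by batch.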


Intended Uses

Direct Use

  • Zero-shot classification of galaxy morphology
  • Image–text similarity scoring
  • Semantic search over galaxy datasets
  • Embedding generation for clustering or retrieval
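
Embedding generation for clustering can be sketched with a minimal k-means routine. The vectors below are random stand-ins; in practice they would come from `model.get_image_features` over a galaxy collection, and this toy implementation is not part of the released code:

```python
import torch
import torch.nn.functional as F

def kmeans(embs, k=3, iters=20, seed=0):
    # Minimal k-means over L2-normalised embeddings
    g = torch.Generator().manual_seed(seed)
    embs = F.normalize(embs, dim=-1)
    # Initialise centroids from k random points
    centroids = embs[torch.randperm(embs.size(0), generator=g)[:k]]
    for _ in range(iters):
        # Assign each embedding to its nearest centroid
        dists = torch.cdist(embs, centroids)
        labels = dists.argmin(dim=1)
        # Recompute centroids as cluster means
        for j in range(k):
            mask = labels == j
            if mask.any():
                centroids[j] = embs[mask].mean(dim=0)
    return labels

torch.manual_seed(0)
embs = torch.randn(50, 512)  # stand-ins for image embeddings
labels = kmeans(embs)
print(labels.bincount())
```

With real embeddings, clusters would ideally group morphologically similar galaxies (spirals, ellipticals, irregulars) without labels.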

Downstream Use

  • Astronomy-focused RAG systems
  • Scientific dataset indexing
  • Feature extraction for classification models
  • Input embeddings for reasoning systems (LLMs + astronomy heads)

Out-of-Scope Use

  • High-precision scientific measurement (e.g. photometry, redshift estimation)
  • Astrophysical inference requiring calibrated data
  • Medical or non-astronomy domains

This is a representation model, not a physics model.


Limitations

  • Trained on synthetic descriptions, not human-verified annotations
  • May inherit biases or inaccuracies from caption generation
  • Performance depends heavily on dataset quality and diversity
  • Not robust to non-astronomical imagery

Example Usage

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("juppy44/galaxy-clip-finetuned")
processor = CLIPProcessor.from_pretrained("juppy44/galaxy-clip-finetuned")

# CLIP expects RGB input
image = Image.open("galaxy.jpg").convert("RGB")

texts = [
    "spiral galaxy with arms",
    "elliptical galaxy",
    "irregular galaxy",
]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

# No gradients needed at inference time
with torch.no_grad():
    outputs = model(**inputs)

# One similarity score per candidate caption
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

print(probs)
```
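
Semantic search over a galaxy dataset reduces to cosine-similarity ranking over precomputed embeddings. The sketch below uses random tensors in place of real outputs from `model.get_text_features` (the query) and `model.get_image_features` (the gallery); the helper name is hypothetical:

```python
import torch
import torch.nn.functional as F

def rank_by_similarity(query_emb, gallery_embs, top_k=3):
    # Normalise so dot products equal cosine similarities
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_embs, dim=-1)
    sims = g @ q  # one score per gallery image
    scores, idx = sims.topk(top_k)
    return idx.tolist(), scores.tolist()

# Stand-ins for embeddings of a query caption and 100 galaxy images
torch.manual_seed(0)
query = torch.randn(512)         # e.g. "barred spiral galaxy"
gallery = torch.randn(100, 512)  # image embeddings
idx, scores = rank_by_similarity(query, gallery)
print(idx, scores)
```

Precomputing and caching the gallery embeddings once makes each subsequent text query a single matrix-vector product.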

Evaluation

No formal benchmark evaluation has been conducted yet.

However, the following improvements over base CLIP are expected:

  • Better alignment with astronomy-specific language
  • Improved retrieval for morphology-based queries
  • More meaningful embedding clusters for galaxy types

Bias & Risks

  • Synthetic captions may introduce hallucinated or oversimplified features
  • Model may overfit to text patterns rather than physical structure
  • Not suitable for scientific conclusions without validation

Future Work

  • Replace synthetic captions with expert-labelled datasets
  • Add spectral + tabular modalities
  • Train multi-head architectures for morphology + physics
  • Integrate with astronomy reasoning systems (LLMs + structured inputs)

Citation

If you use this model:

@misc{jupp2026galaxyclip,
  title={Galaxy-CLIP: Domain-adapted vision-language model for galaxy morphology},
  author={Jupp, Michael},
  year={2026},
  howpublished={\url{https://huggingface.co/juppy44/galaxy-clip-finetuned}}
}
