# Galaxy-CLIP (Fine-tuned)
## Model Description
Galaxy-CLIP is a domain-adapted vision-language model fine-tuned from OpenAI’s CLIP on astronomical galaxy imagery paired with synthetic natural language descriptions.
The model learns a shared embedding space between galaxy images and descriptive text, enabling tasks such as:
- Zero-shot galaxy classification
- Image–text retrieval (e.g. “spiral galaxy with prominent arms”)
- Semantic search over astronomical datasets
- Embedding generation for downstream astronomy models
This model is designed to improve CLIP’s understanding of astronomical morphology, which is poorly represented in generic internet-scale datasets.
## Model Details
- Base model: OpenAI CLIP (ViT-based)
- Architecture: Dual encoder (image encoder + text encoder)
- Task: Contrastive image-text learning
- Parameters: ~200M
- Framework: Hugging Face Transformers
- Author: Michael Jupp (juppy44)
## Training Data
The model was fine-tuned on:
- Galaxy images (astronomy datasets, e.g. Galaxy Zoo-style data)
- Text descriptions: astronolan/galaxy-descriptions
These descriptions are captions generated by a vision-language model (VLM), describing galaxy morphology and structure, such as:
- “barred spiral galaxy with tightly wound arms”
- “elliptical galaxy with smooth light distribution”
- “irregular galaxy with asymmetrical structure”
This setup enables weakly-supervised multimodal learning without requiring manual annotation.
## Training Procedure
The model was fine-tuned using standard CLIP-style contrastive learning:
- Paired (image, text) samples
- Cross-modal similarity objective (InfoNCE loss)
- Batch-wise matching of correct vs incorrect pairs
Typical pipeline:
- Image encoder → visual embeddings
- Text encoder → language embeddings
- Cosine similarity used for alignment
Fine-tuning shifts the model from generic vision-language understanding to astronomy-specific semantics, which is critical because base CLIP has little exposure to galaxy morphology.
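The contrastive objective described above can be sketched in a few lines of PyTorch. This is a minimal, self-contained illustration with random tensors standing in for encoder outputs; the function name and the temperature value are assumptions for demonstration, not the model's exact training code:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalise so dot products become cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j)
    logits = image_emb @ text_emb.t() / temperature
    # Matching (image, text) pairs lie on the diagonal
    targets = torch.arange(logits.size(0))
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy batch: 8 paired embeddings standing in for encoder outputs
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
loss = clip_contrastive_loss(img, txt)
print(loss.item())
```

The symmetric cross-entropy pulls matched pairs together and pushes mismatched pairs apart within each batch, which is what makes the batch-wise matching of correct vs incorrect pairs work.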
## Intended Uses
### Direct Use
- Zero-shot classification of galaxy morphology
- Image–text similarity scoring
- Semantic search over galaxy datasets
- Embedding generation for clustering or retrieval
### Downstream Use
- Astronomy-focused RAG systems
- Scientific dataset indexing
- Feature extraction for classification models
- Input embeddings for reasoning systems (LLMs + astronomy heads)
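For retrieval and indexing uses like those listed above, the workflow reduces to cosine similarity between precomputed embeddings. The sketch below uses random tensors in place of real encoder outputs (in practice these would come from the model's image and text encoders); the gallery size and embedding dimension are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical precomputed embeddings (stand-ins for encoder outputs)
gallery = F.normalize(torch.randn(1000, 512), dim=-1)  # 1000 galaxy images
query = F.normalize(torch.randn(512), dim=-1)          # one text query

# Cosine similarity of the query against every gallery image
scores = gallery @ query

# Indices of the 5 best-matching images
top5 = torch.topk(scores, k=5).indices
print(top5.tolist())
```

Because embeddings can be computed once and cached, this scales to semantic search over large astronomical datasets without re-running the encoders per query.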
### Out-of-Scope Use
- High-precision scientific measurement (e.g. photometry, redshift estimation)
- Astrophysical inference requiring calibrated data
- Medical or non-astronomy domains
This is a representation model, not a physics model.
## Limitations
- Trained on synthetic descriptions, not human-verified annotations
- May inherit biases or inaccuracies from caption generation
- Performance depends heavily on dataset quality and diversity
- Not robust to non-astronomical imagery
## Example Usage

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("juppy44/galaxy-clip-finetuned")
processor = CLIPProcessor.from_pretrained("juppy44/galaxy-clip-finetuned")

image = Image.open("galaxy.jpg")
texts = [
    "spiral galaxy with arms",
    "elliptical galaxy",
    "irregular galaxy",
]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Image-text similarity scores, turned into a distribution over the texts
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
print(probs)
```
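Under the hood, `logits_per_image` is the model's learned (exponentiated) temperature times the cosine similarity between the normalised image and text embeddings. The toy reproduction below uses random tensors in place of real encoder outputs; the scale value of 100 is an assumed stand-in with roughly CLIP's typical magnitude:

```python
import torch
import torch.nn.functional as F

# Stand-ins for the encoder outputs: 1 image, 3 candidate texts, 512-dim
image_emb = F.normalize(torch.randn(1, 512), dim=-1)
text_emb = F.normalize(torch.randn(3, 512), dim=-1)
logit_scale = torch.tensor(100.0)  # assumed exponentiated temperature

logits_per_image = logit_scale * image_emb @ text_emb.t()  # shape (1, 3)
probs = logits_per_image.softmax(dim=1)
print(probs)
```

Scaling by the temperature before the softmax sharpens the distribution, which is why zero-shot CLIP predictions tend to concentrate on one candidate text.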
## Evaluation
No formal benchmark evaluation has been conducted yet. Expected (but unverified) improvements over base CLIP include:
- Better alignment with astronomy-specific language
- Improved retrieval for morphology-based queries
- More meaningful embedding clusters for galaxy types
## Bias & Risks
- Synthetic captions may introduce hallucinated or oversimplified features
- Model may overfit to text patterns rather than physical structure
- Not suitable for scientific conclusions without validation
## Future Work
- Replace synthetic captions with expert-labelled datasets
- Add spectral + tabular modalities
- Train multi-head architectures for morphology + physics
- Integrate with astronomy reasoning systems (LLMs + structured inputs)
## Citation
If you use this model, please cite:

```bibtex
@misc{jupp2026galaxyclip,
  title={Galaxy-CLIP: Domain-adapted vision-language model for galaxy morphology},
  author={Jupp, Michael},
  year={2026},
  howpublished={\url{https://huggingface.co/juppy44/galaxy-clip-finetuned}}
}
```