	---
	tags:
	- astronomy
	- time-series
	- light-curves
	- variable-stars
	- onnx
	library_name: onnx
	license: cc-by-4.0
	---

	# AstroM3 (photo encoder)

	HuggingFace: [light-curve/astrom3](https://huggingface.co/light-curve/astrom3)

	## Paper

	Rizhko, M. & Bloom, J. S. (2024). AstroM³: A self-supervised multimodal model for astronomy. arXiv:2411.08842.

	```bibtex
	@article{rizhko2024astrom3,
	author = {Rizhko, Mariia and Bloom, Joshua S.},
	title = {{AstroM³}: A self-supervised multimodal model for astronomy},
	journal = {arXiv preprint arXiv:2411.08842},
	year = {2024}
	}
	```

	## Original code

	<https://github.com/MeriDK/AstroM3> (git submodule at `models/astrom3/code/`)

	## License

	- Code (this repository): MIT — see [LICENSE](LICENSE).
	- Model weights (`AstroMLCore/AstroM3-CLIP-photo`): Creative Commons Attribution 4.0 (CC BY 4.0).

	## Model overview

	AstroM3 is a self-supervised multimodal contrastive model for variable-star classification that jointly trains photometry (light-curve), spectra, and metadata encoders using a CLIP-style objective. This integration exports the photo-only encoder from the pretrained CLIP checkpoint (`AstroMLCore/AstroM3-CLIP-photo`) as an ONNX embedding model.

	The photo encoder is an [Informer](https://ojs.aaai.org/index.php/AAAI/article/view/17325/17132) transformer (ProbSparse attention, 8 layers, d_model=128) trained on ASAS-SN V-band variable-star light curves from the MACC dataset. For ONNX export, the ProbSparse attention layers are replaced with standard scaled dot-product attention. ProbSparse is a sparse approximation of this full attention, and the full version is directly ONNX-exportable.
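For reference, standard scaled dot-product attention computes `softmax(Q Kᵀ / sqrt(d)) V` over every query-key pair. The NumPy sketch below is illustrative only and is not the exported graph; the head count (4) and head width (32) are assumptions, chosen so that 4 × 32 matches `d_model=128`.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """Full softmax attention for arrays shaped [batch, heads, length, head_dim]."""
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
# Assumed sizes: 4 heads of width 32 over 200 timesteps.
q = rng.normal(size=(2, 4, 200, 32)).astype(np.float32)
k = rng.normal(size=(2, 4, 200, 32)).astype(np.float32)
v = rng.normal(size=(2, 4, 200, 32)).astype(np.float32)
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)  # (2, 4, 200, 32)
```

Every query attends to all 200 keys here, whereas ProbSparse evaluates only a subset of queries in full.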

	## Inputs

	| Tensor | Shape | Description |
	|--------|-------|-------------|
	| `x_enc` | `[batch, 200, 9]` | Padded photometry features (9 channels per timestep; see Preprocessing steps) |
	| `mask` | `[batch, 200]` | `1` for valid timesteps, `0` for padding |

	## Outputs (ONNX)

	Single file `astrom3.onnx` with two named outputs:

	| Output | Shape | Aggregation |
	|--------|-------|-------------|
	| `mean` | `[batch, 128]` | Masked mean pool of encoder outputs |
	| `sequence` | `[batch, 200, 128]` | Full per-timestep encoder outputs (unmasked) |
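A minimal sketch of calling the model and of what "masked mean pool" means, assuming ONNX Runtime is used for inference. The tensor names come from the tables above; `embed` and `masked_mean` are hypothetical helper names, and whether the exported `mean` output matches this NumPy pooling exactly is an assumption to verify against the real file.

```python
import numpy as np


def masked_mean(sequence, mask):
    """Average per-timestep vectors over the positions where mask == 1."""
    weights = mask[..., None].astype(sequence.dtype)   # [batch, 200, 1]
    counts = np.maximum(weights.sum(axis=1), 1.0)      # guard against empty masks
    return (sequence * weights).sum(axis=1) / counts   # [batch, 128]


def embed(session, x_enc, mask):
    """Run the encoder; `session` is expected to be an onnxruntime.InferenceSession."""
    mean, sequence = session.run(
        ["mean", "sequence"],
        {"x_enc": x_enc.astype(np.float32), "mask": mask.astype(np.float32)},
    )
    return mean, sequence


# Toy check of the pooling: values at padded positions must not affect the result.
sequence = np.ones((1, 200, 128), dtype=np.float32)
sequence[:, 50:] = 7.0
mask = np.zeros((1, 200), dtype=np.float32)
mask[:, :50] = 1.0
pooled = masked_mean(sequence, mask)

# With the model file available (requires the onnxruntime package):
#   import onnxruntime as ort
#   session = ort.InferenceSession("astrom3.onnx")
#   mean, sequence = embed(session, x_enc, mask)
```

Because `sequence` is returned unmasked, any custom pooling over it should apply `mask` explicitly, as `masked_mean` does.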

	## Preprocessing steps

	The 9 input channels per timestep are built by `preprocess_lc()` in the
	upstream dataset (`AstroMLCore/AstroM3Dataset`):

	| Index | Feature | How obtained |
	|-------|---------|--------------|
	| 0 | `time` (HJD scaled to [0, 1]) | per-observation |
	| 1 | `flux` = `(flux − mean) / MAD` | per-observation |
	| 2 | `flux_err` = `flux_err / MAD` | per-observation |
	| 3 | `amplitude` | ASAS-SN catalog scalar, replicated to every timestep |
	| 4 | `period` | ASAS-SN catalog scalar, replicated |
	| 5 | `lksl_statistic` (Lafler-Kinman string length) | ASAS-SN catalog scalar, replicated |
	| 6 | `rfr_score` (Random Forest Regressor R² for phase-folded LC) | ASAS-SN catalog scalar, replicated |
	| 7 | `log10(MAD_flux)` | global scalar computed from LC, replicated |
	| 8 | `delta_t` = `(max_HJD − min_HJD) / 365` | global scalar computed from LC, replicated |

	Features 3–6 come directly from the ASAS-SN V-band variable-star catalog
	(Jayasinghe et al. 2019) and are not recomputed from the light curve by
	this codebase. Users applying this model to non-ASAS-SN data must supply
	equivalent values themselves (e.g. by running a Lomb-Scargle period finder
	and computing the peak-to-peak amplitude).
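One way to obtain rough stand-ins for `period` and `amplitude` is sketched below with a plain NumPy implementation of the classical Lomb-Scargle periodogram. The function names, the frequency grid, and the synthetic light curve are all assumptions for illustration; whether values computed this way sit on the same scale as the catalog columns has not been checked here.

```python
import numpy as np


def lomb_scargle_power(t, y, freqs):
    """Classical Lomb-Scargle power at each frequency (cycles per unit of t)."""
    y = y - y.mean()
    omega = 2.0 * np.pi * freqs[:, None]                       # [n_freq, 1]
    # Time offset tau that makes the sine and cosine terms orthogonal.
    tau = np.arctan2(np.sin(2 * omega * t).sum(axis=1),
                     np.cos(2 * omega * t).sum(axis=1)) / (2 * omega[:, 0])
    phase = omega * (t[None, :] - tau[:, None])                # [n_freq, n_obs]
    c, s = np.cos(phase), np.sin(phase)
    return 0.5 * ((c @ y) ** 2 / (c ** 2).sum(axis=1)
                  + (s @ y) ** 2 / (s ** 2).sum(axis=1))


def period_and_amplitude(t, flux, min_freq=0.1, max_freq=5.0, n_freq=5000):
    """Best-period estimate from the periodogram peak, plus peak-to-peak amplitude."""
    freqs = np.linspace(min_freq, max_freq, n_freq)
    power = lomb_scargle_power(t, flux, freqs)
    period = 1.0 / freqs[np.argmax(power)]
    amplitude = flux.max() - flux.min()
    return period, amplitude


# Synthetic example: a 0.5-day sinusoid sampled irregularly over 100 days.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 100.0, size=300))
flux = 10.0 + np.sin(2 * np.pi * t / 0.5) + rng.normal(scale=0.05, size=t.size)
period, amplitude = period_and_amplitude(t, flux)
```

The frequency range should be adapted to the survey cadence and the expected periods; the defaults above only suit the toy data.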

	Preprocessing recipe for a single light curve:

	1. Deduplicate and sort observations by HJD.
	2. Compute the `mean` and `MAD` of the flux column; normalize flux as `(flux − mean) / MAD` and flux_err as `flux_err / MAD`.
	3. Scale HJD to [0, 1] over the span of the light curve.
	4. Compute `log10(MAD_flux)` and `delta_t = (max_HJD − min_HJD) / 365`, using the raw HJD values from before step 3.
	5. Obtain `amplitude`, `period`, `lksl_statistic`, `rfr_score` from the
	ASAS-SN catalog (or compute equivalents).
	6. Tile the 6 global scalars (features 3–8) across all timesteps; concatenate with columns
	0–2 to produce an `(N, 9)` array.
	7. Pad or center-crop to 200 timesteps; set `mask = 1` for valid positions and `mask = 0` for padded ones.
	8. Use `float32` for all tensors.
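The recipe above can be sketched in NumPy as follows. This is not the upstream `preprocess_lc()`: the function name is hypothetical, `MAD` is assumed to mean the unscaled median absolute deviation, and padding is assumed to be trailing zeros.

```python
import numpy as np

SEQ_LEN = 200


def preprocess_light_curve(hjd, flux, flux_err, amplitude, period,
                           lksl_statistic, rfr_score, seq_len=SEQ_LEN):
    # 1. Drop duplicate rows, then sort by HJD.
    obs = np.unique(np.column_stack([hjd, flux, flux_err]).astype(np.float64), axis=0)
    obs = obs[np.argsort(obs[:, 0], kind="stable")]
    t, f, e = obs[:, 0], obs[:, 1], obs[:, 2]

    # 2. Normalize flux and its error (MAD assumed to be median absolute deviation).
    mad = np.median(np.abs(f - np.median(f)))
    f_norm = (f - f.mean()) / mad
    e_norm = e / mad

    # 4. delta_t uses the raw HJD span, so take it before rescaling time.
    span = t.max() - t.min()
    delta_t = span / 365.0

    # 3. Scale time to [0, 1].
    t_norm = (t - t.min()) / span

    # 5-6. Tile the six scalars and append them to columns 0-2.
    scalars = np.array([amplitude, period, lksl_statistic, rfr_score,
                        np.log10(mad), delta_t])
    x = np.column_stack([t_norm, f_norm, e_norm, np.tile(scalars, (t.size, 1))])

    # 7. Center-crop long curves; zero-pad short ones at the end (assumed layout).
    n = x.shape[0]
    if n > seq_len:
        start = (n - seq_len) // 2
        x = x[start:start + seq_len]
        n = seq_len
    x_enc = np.zeros((seq_len, 9), dtype=np.float32)   # 8. float32 throughout
    x_enc[:n] = x
    mask = np.zeros(seq_len, dtype=np.float32)
    mask[:n] = 1.0
    return x_enc, mask


# Synthetic example: 50 observations over two years, plus one exact duplicate.
rng = np.random.default_rng(1)
hjd = 2458000.0 + np.sort(rng.uniform(0.0, 730.0, size=50))
flux = 100.0 + rng.normal(scale=2.0, size=50)
flux_err = np.full(50, 0.5)
hjd, flux, flux_err = [np.append(a, a[0]) for a in (hjd, flux, flux_err)]

x_enc, mask = preprocess_light_curve(hjd, flux, flux_err, amplitude=0.4,
                                     period=1.7, lksl_statistic=0.2, rfr_score=0.39)
x_batch, mask_batch = x_enc[None], mask[None]          # [1, 200, 9], [1, 200]
```

The sketch does not guard against degenerate inputs such as a single observation or constant flux, where the time span or the MAD is zero.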

	## Weights

	Source: <https://huggingface.co/AstroMLCore/AstroM3-CLIP-photo>

	The `model.safetensors` file is a standalone Informer checkpoint (classification head present but unused; loaded with `strict=False`).

	Dataset: ASAS-SN V-band variable-star light curves (`AstroMLCore/AstroM3Processed`).

	## Applying the model without ASAS-SN catalog features

	Features 3–6 require the ASAS-SN catalog. For users applying the model to
	other surveys, we measured the sensitivity of the mean embedding to each
	feature being replaced. `rfr_score` was studied in detail.

	### rfr_score substitution

	`rfr_score` is the R² of a Random Forest Regressor fit to the phase-folded
	light curve; it quantifies period quality
	(Jayasinghe et al. 2019, MNRAS 486 1907, §5; arXiv:1809.07329).
	In the ASAS-SN test set it ranges from −3.5 to 1.18 (median ≈ 0.39).

	Setting `rfr_score` to the constant 0.392 at every timestep (the empirical optimum,
	equal to the dataset median) minimises the mean cosine distance from the
	true-feature embeddings:

	| Metric | Value |
	|--------|-------|
	| Overall mean cosine distance | 0.049 ± 0.091 |
	| Macro-average per class | 0.049 ± 0.058 |

	Per-class breakdown (5 samples per class from the ASAS-SN test split):

	| Class | Mean dist | Std | True rfr mean |
	|-------|-----------|-----|---------------|
	| EW | 0.005 | 0.005 | −0.07 |
	| SR | 0.004 | 0.003 | +0.50 |
	| EA | 0.060 | 0.032 | +0.95 |
	| RRAB | 0.020 | 0.011 | +0.83 |
	| EB | 0.016 | 0.011 | +0.90 |
	| ROT | 0.002 | 0.002 | +0.85 |
	| RRC | 0.147 | 0.115 | −0.79 |
	| HADS | 0.016 | 0.011 | +0.59 |
	| M | 0.050 | 0.020 | +0.18 |
	| DSCT | 0.170 | 0.182 | −0.86 |
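The metrics above can be computed along the following lines. This is a sketch with hypothetical helper names and toy embeddings, not the evaluation script behind the reported numbers. It takes "±" to be a population standard deviation, an assumption that is consistent with the macro-average row: the ten per-class means in the table give 0.049 ± 0.058.

```python
import numpy as np


def cosine_distance(a, b):
    """Row-wise cosine distance 1 - cos(a, b) for two [n, d] arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - (a * b).sum(axis=1)


def summarize(distances, labels):
    """Overall mean/std plus the macro-average (mean/std of per-class means)."""
    per_class = {c: distances[labels == c].mean() for c in np.unique(labels)}
    class_means = np.array(list(per_class.values()))
    return {
        "overall": (distances.mean(), distances.std()),
        "macro": (class_means.mean(), class_means.std()),
        "per_class": per_class,
    }


# Toy embeddings: with true features vs. with a substituted feature.
true_emb = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [3.0, 0.0]])
sub_emb = np.array([[1.0, 0.0], [2.0, 0.0], [1.0, 1.0], [0.0, 3.0]])
labels = np.array(["EW", "EW", "RRC", "RRC"])
dist = cosine_distance(true_emb, sub_emb)      # approximately [0, 1, 0, 1]
stats = summarize(dist, labels)

# Per-class mean distances copied from the table above.
table_means = np.array([0.005, 0.004, 0.060, 0.020, 0.016,
                        0.002, 0.147, 0.016, 0.050, 0.170])
print(f"{table_means.mean():.3f} ± {table_means.std():.3f}")  # 0.049 ± 0.058
```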

	Classes whose true mean `rfr_score` lies far from 0.39 (RRC, DSCT) are affected most.
	Using an out-of-range value (e.g. ±100) causes cosine distances of roughly 0.93–0.97,
	so staying within the training distribution is important.
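A minimal sketch of applying the substitution to a prepared `x_enc` batch. The helper name, the clipping to the range observed in the test set, and the option to leave padded positions untouched are illustrative choices, not part of the exported model or the upstream code.

```python
import numpy as np

RFR_CHANNEL = 6             # index of rfr_score in the feature table
RFR_DEFAULT = 0.392         # constant reported above
RFR_RANGE = (-3.5, 1.18)    # range observed in the ASAS-SN test set


def substitute_rfr(x_enc, mask=None, value=RFR_DEFAULT):
    """Return a copy of x_enc with the rfr_score channel set to a constant."""
    value = float(np.clip(value, *RFR_RANGE))   # keep the value inside the observed range
    out = x_enc.copy()
    if mask is None:
        out[..., RFR_CHANNEL] = value           # every timestep, padding included
    else:
        # Assumed variant: only overwrite valid timesteps.
        out[..., RFR_CHANNEL] = np.where(mask > 0, value, out[..., RFR_CHANNEL])
    return out


x_enc = np.zeros((2, 200, 9), dtype=np.float32)
mask = np.zeros((2, 200), dtype=np.float32)
mask[:, :30] = 1.0

x_default = substitute_rfr(x_enc, mask)             # 0.392 at the 30 valid steps
x_clipped = substitute_rfr(x_enc, value=100.0)      # clipped to 1.18 everywhere
```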