Commit 750e3a9 (verified) by hombit, parent b52920d: Upload README.md with huggingface_hub

Files changed (1): README.md (added, +125)
---
license: mit
tags:
- astronomy
- time-series
- light-curves
- onnx
library_name: onnx
---

# Astromer 2

## Paper

Donoso-Oliva, C., Becker, I., Protopapas, P., Cabrera-Vives, G., Cádiz-Leyton, M., & Moreno-Cartagena, D. (2026). *Generalizing across astronomical surveys: Few-shot light curve classification with Astromer 2*. Astronomy & Astrophysics (in press).

```bibtex
@article{astromer2,
  author  = {Donoso-Oliva, C. and Becker, I. and Protopapas, P. and
             Cabrera-Vives, G. and C{\'a}diz-Leyton, M. and Moreno-Cartagena, D.},
  title   = {Generalizing across astronomical surveys: Few-shot light curve
             classification with {Astromer} 2},
  journal = {Astronomy \& Astrophysics},
  year    = {2026},
  note    = {In press},
}
```

## Original code

<https://github.com/astromer-science/main-code> (git submodule at `models/astromer2/code/`)

## License

MIT; see [LICENSE](LICENSE).

## Model overview

Astromer 2 is a BERT-inspired transformer encoder pretrained on 1.5 million MACHO light curves via masked magnitude prediction. The encoder processes irregularly sampled photometric time series (time, magnitude) using MJD-aware positional encoding and a trainable mask token. It produces per-timestep contextual embeddings that can be aggregated into a fixed-size representation for downstream tasks such as few-shot classification.

Default configuration: 6 attention blocks, 4 heads, head dimension 64 (d_model = 256), sequence length 200, embedding dimension 256.

## Input data format

The model was pretrained on MACHO survey photometry. MACHO light curves consist of triples `(mjd, mag, err)` where:

- `mjd`: Modified Julian Date of each observation (~48800–51700 for MACHO)
- `mag`: MACHO instrumental magnitude (typically negative values, e.g. −10 to −3 in the MACHO system)
- `err`: photometric error; some observations carry large negative sentinel values (e.g. −3000, −9000) indicating bad data. **These are passed through the pipeline as-is, without filtering.**

## Preprocessing steps

All steps are implemented in `code/src/data/loaders.py` (`get_loader`) and `code/src/data/preprocessing.py`.

### Step 1 — Windowing

If the light curve has more than 200 observations, take the first 200 (a sequential, non-random window). If it has fewer than 200, use all observations and pad in step 3.

Source: `src/data/preprocessing.py:to_windows` with `sampling=False`.

### Step 2 — Zero-mean normalization

Subtract the per-light-curve column mean from **all three columns** (time, magnitude, error):

```
x_norm = x - mean(x, axis=0)  # x has shape [n_obs, 3]
```

After this step, `times` and `input` (magnitudes) are centred around zero. The error column is also normalised but is discarded before the encoder (see step 4).

Source: `src/data/preprocessing.py:standardize`.

### Step 3 — Padding and mask construction

Right-pad the normalised sequence to exactly 200 time steps with zeros. Construct `mask_in`:

```
mask_in[i] = 0  for i < n_obs   (real observation, visible to encoder)
mask_in[i] = 1  for i >= n_obs  (padding, hidden from encoder)
```

> **Note on mask convention:** the internal pipeline uses `mask_in=0` for visible positions and `mask_in=1` for padding/hidden positions. This is the opposite of the ONNX interface (see below).

Source: `src/data/masking.py:mask_sample`, padding block at the end.
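
The padding and internal mask construction can be sketched in numpy as follows; the array names and the stand-in normalised sequence are illustrative, not taken from the repository code:

```python
import numpy as np

MAX_LEN = 200
n_obs = 150  # hypothetical light curve shorter than the 200-step window

# Stand-in for a normalised (time, mag, err) sequence from step 2.
rng = np.random.default_rng(1)
seq = rng.normal(size=(n_obs, 3))

# Right-pad with zeros to exactly MAX_LEN time steps.
padded = np.zeros((MAX_LEN, 3))
padded[:n_obs] = seq

# Internal convention: 0 = real observation, 1 = padding.
mask_in = np.ones((MAX_LEN, 1), dtype=np.float32)
mask_in[:n_obs] = 0.0
```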

### Step 4 — Format encoder inputs

Assemble the encoder inputs from the normalised, padded array:

| Tensor | Source | Shape |
|--------|--------|-------|
| `input` | normalised magnitude column | `[batch, 200, 1]` |
| `times` | normalised time column | `[batch, 200, 1]` |
| `mask_in` | constructed in step 3 | `[batch, 200, 1]` |

The normalised error column is **not** fed to the encoder. Errors appear only in the pretraining reconstruction loss.

Source: `src/data/loaders.py:format_inp_astromer` (`aversion='base'`).
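
Putting steps 1–4 together, a minimal numpy sketch (an illustrative reimplementation, not the repository code; the `preprocess` name and the synthetic light curve are hypothetical):

```python
import numpy as np

MAX_LEN = 200

def preprocess(lc):
    """Steps 1-4 in one place: lc is an [n_obs, 3] array of (mjd, mag, err).
    Returns internal-convention tensors (input, times, mask_in), each [1, 200, 1]."""
    # Step 1: sequential window of at most MAX_LEN observations.
    lc = lc[:MAX_LEN]
    n_obs = lc.shape[0]
    # Step 2: subtract the per-column mean from all three columns.
    lc = lc - lc.mean(axis=0)
    # Step 3: right-pad with zeros; internal mask: 0 = real, 1 = padding.
    padded = np.zeros((MAX_LEN, 3))
    padded[:n_obs] = lc
    mask_in = np.ones((MAX_LEN, 1))
    mask_in[:n_obs] = 0.0
    # Step 4: magnitudes and times go to the encoder; the error column is dropped.
    mags = padded[:, 1:2]   # [200, 1]
    times = padded[:, 0:1]  # [200, 1]
    return mags[None], times[None], mask_in[None]

# Hypothetical MACHO-like light curve with 150 observations.
rng = np.random.default_rng(3)
lc = np.stack([np.sort(rng.uniform(48800, 51700, 150)),
               rng.uniform(-10, -3, 150),
               rng.uniform(0.01, 0.1, 150)], axis=1)
mags, times, mask_in = preprocess(lc)
```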

## Inputs (ONNX)

The exported ONNX models use a **user-friendly mask convention** that is the inverse of the internal pipeline:

| Tensor | Shape | Description |
|--------|-------|-------------|
| `input` | `[batch, 200, 1]` | Zero-mean normalised magnitudes (step 2 above) |
| `times` | `[batch, 200, 1]` | Zero-mean normalised times (step 2 above) |
| `mask_in` | `[batch, 200, 1]` | **1 = valid observation, 0 = padding** |

The ONNX wrapper inverts `mask_in` internally before passing it to the encoder, so consumers can use the intuitive convention.
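
A sketch of assembling a feed dict under this convention: the tensor names match the ONNX interface above, while the stand-in values and variable names are illustrative. The onnxruntime call is shown in comments only, since it requires the exported model file:

```python
import numpy as np

MAX_LEN, n_obs = 200, 150

# Stand-ins for a zero-mean normalised, right-padded light curve.
rng = np.random.default_rng(4)
mags = np.zeros((MAX_LEN, 1), dtype=np.float32)
times = np.zeros((MAX_LEN, 1), dtype=np.float32)
mags[:n_obs] = rng.normal(size=(n_obs, 1))
times[:n_obs] = np.sort(rng.normal(size=n_obs))[:, None]

# ONNX mask convention: 1 = valid observation, 0 = padding.
mask = np.zeros((MAX_LEN, 1), dtype=np.float32)
mask[:n_obs] = 1.0

feeds = {
    "input": mags[None],    # [1, 200, 1]
    "times": times[None],   # [1, 200, 1]
    "mask_in": mask[None],  # [1, 200, 1]
}

# With onnxruntime (not run here):
#   import onnxruntime as ort
#   sess = ort.InferenceSession("astromer2_mean.onnx")
#   (embedding,) = sess.run(None, feeds)  # embedding of shape [1, 256]
```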

## Outputs (ONNX)

| File | Output shape | Aggregation |
|------|-------------|-------------|
| `astromer2_mean.onnx` | `[batch, 256]` | Masked mean pooling: `sum(z * mask_in) / sum(mask_in)` |
| `astromer2_max.onnx` | `[batch, 256]` | Masked max pooling over valid timesteps |
| `astromer2_full.onnx` | `[batch, 200, 256]` | Full per-timestep sequence; consumer aggregates |

ONNX opset: 13.
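
With `astromer2_full.onnx`, the consumer performs the aggregation. For example, the masked mean pooling applied by `astromer2_mean.onnx` can be reproduced in numpy (using a random stand-in for the encoder output `z`):

```python
import numpy as np

batch, seq_len, d = 2, 200, 256
rng = np.random.default_rng(2)
z = rng.normal(size=(batch, seq_len, d))  # stand-in for astromer2_full output

# ONNX-convention mask: 1 = valid observation, 0 = padding.
n_valid = np.array([120, 200])  # valid steps per light curve in the batch
mask = (np.arange(seq_len)[None, :] < n_valid[:, None]).astype(z.dtype)[..., None]

# Masked mean: average embeddings over valid timesteps only.
pooled = (z * mask).sum(axis=1) / mask.sum(axis=1)  # shape [batch, 256]
```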

## Weights

- Source: [Zenodo record 18207945](https://zenodo.org/records/18207945)
- Training dataset: MACHO (1.5 million light curves, V and R bands)
- Checkpoint: `astromer_v2/macho/`