hombit committed · verified
Commit 9d26076 · Parent(s): 8b4a8d2

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +19 -11
README.md CHANGED
@@ -42,10 +42,11 @@ Default configuration: 6 attention blocks, 4 heads, head dimension 64 (d_model =

 ## Input data format

-The model was pretrained on MACHO survey photometry. MACHO light curves consist of triples `(mjd, mag, err)` where:
-- `mjd` – Modified Julian Date of each observation (~48800–51700 for MACHO)
-- `mag` – MACHO instrumental magnitude (typically negative values, e.g. −10 to −3 in the MACHO system)
-- `err` – photometric error; some observations carry large negative sentinel values (e.g. −3000, −9000) indicating bad data; **these are passed through the pipeline as-is without filtering**
+Raw light curves are pairs `(time, mag)`:
+- `time` – observation time in days. It need not be absolute MJD; any consistent time axis in days works, because the pipeline subtracts the per-window mean before the encoder sees it. The pretrained weights were produced from MACHO data with MJD ~48800–51700.
+- `mag` – magnitude. MACHO instrumental magnitudes are typically negative (e.g. −10 to −3); the pipeline is not restricted to that range.
+
+Photometric errors are **not used** at inference. The upstream preprocessing code expects a 3-column `[time, mag, err]` array internally, but errors only appear in the pretraining reconstruction-loss weights (`outputs['w_error']`), which are never passed to the encoder. Pass dummy zeros if you run the pipeline directly, as in the sketch below.
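+
+A minimal sketch of assembling that 3-column array with dummy zero errors (NumPy assumed; `time_days` and `mag` are hypothetical per-object arrays):
+
+```
+import numpy as np
+
+# time_days, mag: 1-D float arrays for one light curve
+x = np.stack([time_days, mag, np.zeros_like(mag)], axis=1)  # [n_obs, 3]: time, mag, dummy err
+```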

 ## Preprocessing steps

@@ -53,19 +54,24 @@ All steps are implemented in `code/src/data/loaders.py` (`get_loader`) and `code

 ### Step 1 – Windowing

-If the light curve has more than 200 observations, take the first 200 (non-random, sequential window). If it has fewer than 200, use all observations and pad in step 3.
+The upstream code supports two windowing strategies via the `sampling` flag of `to_windows`:
+
+- **`sampling=True` – random window** (used during pretraining): a single contiguous window of 200 observations is drawn at a uniformly random starting position. Light curves shorter than 200 observations are used in full.
+- **`sampling=False` – sequential windows** (used for test-data generation): the light curve is divided into sequential, non-overlapping windows of 200 observations. A light curve of length *L* yields ⌊*L*/200⌋ + 1 windows; the last window may be shorter than 200 and is padded in step 3. Light curves shorter than 200 observations produce a single window. When a light curve produces multiple windows, each window yields a separate embedding vector; to obtain a single per-light-curve embedding, average the per-window embeddings (see the sketch below).

-Source: `src/data/preprocessing.py:to_windows` with `sampling=False`.
+Test data is generated with `sampling=False`.
+
+Source: `src/data/preprocessing.py:to_windows`.
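+
+A minimal sketch of the `sampling=False` path and per-light-curve averaging (NumPy assumed; `embed` stands in for whatever runs the encoder on one padded window):
+
+```
+import numpy as np
+
+W = 200  # window length
+# x: [n_obs, 3] array of (time, mag, err) rows for one light curve
+windows = [x[i:i + W] for i in range(0, len(x), W)]  # sequential, non-overlapping
+# upstream may also emit a trailing fully padded window when len(x) is a multiple of W
+per_window = [embed(w) for w in windows]    # one embedding per (padded) window
+lc_embedding = np.mean(per_window, axis=0)  # single per-light-curve embedding
+```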

 ### Step 2 – Zero-mean normalization

-Subtract the per-light-curve column mean from **all three columns** (time, magnitude, error):
+Subtract the per-window column mean from each column:

 ```
-x_norm = x - mean(x, axis=0)  # x has shape [n_obs, 3]
+x_norm = x - mean(x, axis=0)  # x has shape [n_obs, 3]; columns: time, mag, err
 ```

-After this step, `times` and `input` (magnitudes) are centred around zero. The error column is also normalised but is discarded before the encoder (see step 4).
+After this step, `times` = time − mean(time) and `input` = mag − mean(mag) are centred around zero.

 Source: `src/data/preprocessing.py:standardize`.

@@ -102,8 +108,8 @@ The exported ONNX models use a **user-friendly mask convention** that is the inv

 | Tensor | Shape | Description |
 |--------|-------|-------------|
-| `input` | `[batch, 200, 1]` | Zero-mean normalised magnitudes (step 2 above) |
-| `times` | `[batch, 200, 1]` | Zero-mean normalised times (step 2 above) |
+| `input` | `[batch, 200, 1]` | `mag − mean(mag)` over the window (step 2 above) |
+| `times` | `[batch, 200, 1]` | `time − mean(time)` over the window (step 2 above) |
 | `mask_in` | `[batch, 200, 1]` | **1 = valid observation, 0 = padding** |

 The ONNX wrapper inverts `mask_in` internally before passing it to the encoder, so consumers can use the intuitive convention.
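+
+A hedged sketch of running the exported model with `onnxruntime` (the file name `model.onnx`, float32 dtypes, and the `time_days`/`mag` arrays for one window of at most 200 valid observations are assumptions; input names follow the table above):
+
+```
+import numpy as np
+import onnxruntime as ort
+
+def make_inputs(time_days, mag, n=200):
+    """Normalise one window (step 2), zero-pad to n, and build the three tensors."""
+    k = len(mag)
+    t = np.pad(time_days - time_days.mean(), (0, n - k)).astype(np.float32)
+    m = np.pad(mag - mag.mean(), (0, n - k)).astype(np.float32)
+    mask = np.pad(np.ones(k, dtype=np.float32), (0, n - k))  # 1 = valid, 0 = padding
+    shape = (1, n, 1)
+    return {"input": m.reshape(shape), "times": t.reshape(shape), "mask_in": mask.reshape(shape)}
+
+sess = ort.InferenceSession("model.onnx")  # assumed file name
+embedding = sess.run(None, make_inputs(time_days, mag))[0]
+```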
@@ -123,3 +129,5 @@ ONNX opset: 13.
 Source: [Zenodo record 18207945](https://zenodo.org/records/18207945)
 Training dataset: MACHO (1.5 million light curves, V and R bands)
 Checkpoint: `astromer_v2/macho/`
+
+The test-data parquet file was generated with these MACHO weights and `sampling=False` (sequential windows).