Upload README.md with huggingface_hub

Default configuration: 6 attention blocks, 4 heads, head dimension 64 (d_model = …)

## Input data format

Raw light curves are pairs `(time, mag)`:
- `time` – observation time in days. Need not be absolute MJD; any consistent time axis in days works because the pipeline subtracts the per-window mean before the encoder sees it. The pretrained weights were produced from MACHO data with MJD ~48800–51700.
- `mag` – magnitude. MACHO instrumental magnitudes are typically negative (e.g. −10 to −3); the pipeline is not restricted to that range.

Photometric errors are **not used** at inference. The upstream preprocessing code expects a 3-column `[time, mag, err]` array internally, but errors only appear in the pretraining reconstruction-loss weights (`outputs['w_error']`), which are never passed to the encoder. Pass dummy zeros if you run the pipeline directly.
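For illustration, assembling the 3-column array with dummy errors might look like this (a minimal NumPy sketch; the values are made up):

```python
import numpy as np

time = np.array([48900.1, 48905.3, 48910.7])   # observation times in days
mag  = np.array([-6.8, -6.9, -6.7])            # MACHO-style instrumental magnitudes

# The upstream preprocessing expects [time, mag, err]; errors are unused
# at inference, so a dummy zero column is sufficient.
x = np.stack([time, mag, np.zeros_like(mag)], axis=1)
print(x.shape)  # (3, 3)
```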

## Preprocessing steps

All steps are implemented in `code/src/data/loaders.py` (`get_loader`) and `code…`

### Step 1 – Windowing

The upstream code supports two windowing strategies via the `sampling` flag of `to_windows`:

- **`sampling=True` – random window** (used during pretraining): a single contiguous window of 200 observations is drawn at a uniformly random starting position. Light curves shorter than 200 observations are used in full.
- **`sampling=False` – sequential windows** (used for test-data generation): the light curve is divided into sequential, non-overlapping windows of 200 observations. A light curve of length *L* yields ⌊*L*/200⌋ + 1 windows; the last window may be shorter than 200 and is padded in step 3. Light curves shorter than 200 observations produce a single window. When a light curve produces multiple windows, each window yields a separate embedding vector; to obtain a single per-light-curve embedding, average the per-window embeddings.

Test data is generated with `sampling=False`.

Source: `src/data/preprocessing.py:to_windows`.
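A minimal NumPy sketch of the sequential strategy (illustrative only, not the upstream `to_windows` implementation):

```python
import numpy as np

def to_windows_sequential(x, window_size=200):
    """Split a [n_obs, 3] light-curve array into sequential,
    non-overlapping windows; the last window may be shorter.
    Sketch of the sampling=False behaviour, not the upstream code."""
    if len(x) <= window_size:
        return [x]  # short light curves yield a single window
    return [x[i:i + window_size] for i in range(0, len(x), window_size)]

lc = np.zeros((450, 3))                     # 450 observations: time, mag, err
windows = to_windows_sequential(lc)
print([len(w) for w in windows])            # [200, 200, 50]
```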

### Step 2 – Zero-mean normalization

Subtract the per-window column mean from each column:

```
x_norm = x - mean(x, axis=0)  # x has shape [n_obs, 3]; columns: time, mag, err
```

After this step, `times` = time − mean(time) and `input` = mag − mean(mag) are centred around zero.

Source: `src/data/preprocessing.py:standardize`.
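The same operation as a runnable NumPy snippet (the values are made up; only the centering matters):

```python
import numpy as np

x = np.array([[51000.0, -7.2, 0.05],
              [51001.0, -7.0, 0.04],
              [51002.0, -6.8, 0.06]])   # columns: time, mag, err

x_norm = x - x.mean(axis=0)             # subtract per-column mean

# Every column of x_norm is now centred on zero.
print(np.allclose(x_norm.mean(axis=0), 0.0))  # True
```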

The exported ONNX models use a **user-friendly mask convention** that is the inverse of the encoder's internal one:

| Tensor | Shape | Description |
|--------|-------|-------------|
| `input` | `[batch, 200, 1]` | `mag − mean(mag)` over the window (step 2 above) |
| `times` | `[batch, 200, 1]` | `time − mean(time)` over the window (step 2 above) |
| `mask_in` | `[batch, 200, 1]` | **1 = valid observation, 0 = padding** |

The ONNX wrapper inverts `mask_in` internally before passing it to the encoder, so consumers can use the intuitive convention.
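Putting step 2 and the mask convention together, preparing the three tensors for a single window could be sketched as follows (hypothetical helper, not part of the upstream code; feeding the tensors to the ONNX session is left to the consumer):

```python
import numpy as np

def prepare_window(time, mag, max_len=200):
    """Pad one window to max_len and build the user-friendly mask
    (1 = valid observation, 0 = padding). Illustrative sketch only."""
    n = len(time)
    inp   = np.zeros((1, max_len, 1), dtype=np.float32)
    times = np.zeros((1, max_len, 1), dtype=np.float32)
    mask  = np.zeros((1, max_len, 1), dtype=np.float32)
    inp[0, :n, 0]   = mag - np.mean(mag)     # step 2: zero-mean magnitude
    times[0, :n, 0] = time - np.mean(time)   # step 2: zero-mean time
    mask[0, :n, 0]  = 1.0                    # mark valid positions
    return inp, times, mask

t = np.arange(120, dtype=np.float32)         # 120 valid observations
m = -7.0 + 0.1 * np.sin(t)
inp, times, mask = prepare_window(t, m)
print(inp.shape, int(mask.sum()))            # (1, 200, 1) 120
```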

ONNX opset: 13.

Source: [Zenodo record 18207945](https://zenodo.org/records/18207945)
Training dataset: MACHO (1.5 million light curves, V and R bands)
Checkpoint: `astromer_v2/macho/`

The test-data parquet file was generated with these MACHO weights and `sampling=False` (sequential windows).