# TDLight Hierarchical Variable Star Classifier
A hierarchical LightGBM classifier for astronomical variable star classification. The model employs a four-level decision tree of 7 specialized sub-models, achieving 93.6% overall accuracy (weighted F1 = 93.7%) on 10 variable star classes via 5-fold cross-validation on 1.07 million samples.
**⚡ Update (2026-03):** This hierarchical model replaces the previous flat LightGBM classifier. The hierarchical architecture improves accuracy from 92.5% to 93.6% while retaining fast CPU inference.
## Architecture

The classifier uses a hierarchical decision structure where each node is an independent LightGBM model:

```
                    init
                   /    \
            Non-var      Variable
                        /        \
               Extrinsic          Intrinsic
                /     \          /   |   |   \
             ROT      EB      CEP  DSCT  RR   LPV
                     /  \               /  \  /  \
                   EA    EW         RRAB RRC M    SR
```
| Sub-model | Task | Classes |
|---|---|---|
| `init` | Variable vs non-variable | Non-var, Variable |
| `variable` | Extrinsic vs intrinsic | Extrinsic, Intrinsic |
| `extrinsic` | Extrinsic subtypes | EB, ROT |
| `intrinsic` | Intrinsic subtypes | CEP, DSCT, RR, LPV |
| `eb` | Eclipsing binary subtypes | EA, EW |
| `rr` | RR Lyrae subtypes | RRAB, RRC |
| `lpv` | Long-period variable subtypes | M, SR |
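The routing implied by this table can be sketched with plain Python and stub predictors. The node names match the sub-model names above; the `ROUTES` dict, `classify` function, and canned stub labels are illustrative only, not part of the released API (the real loading and prediction code appears under Usage):

```python
# Labels emitted by internal nodes route the sample to the next
# sub-model; any label with no outgoing route is a final class.
ROUTES = {
    "init": {"Variable": "variable"},
    "variable": {"Extrinsic": "extrinsic", "Intrinsic": "intrinsic"},
    "extrinsic": {"EB": "eb"},
    "intrinsic": {"RR": "rr", "LPV": "lpv"},
}

def classify(x, predict):
    """Walk the hierarchy; `predict(node, x)` returns that node's label."""
    node = "init"
    while True:
        label = predict(node, x)
        nxt = ROUTES.get(node, {}).get(label)
        if nxt is None:  # leaf class reached (e.g. Non-var, ROT, EA, RRAB, M)
            return label
        node = nxt

# Demo with canned per-node answers: Variable -> Intrinsic -> RR -> RRC
stub = {"init": "Variable", "variable": "Intrinsic",
        "intrinsic": "RR", "rr": "RRC"}
print(classify(None, lambda node, x: stub[node]))  # -> RRC
```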
## Performance
5-fold cross-validation on the full training set (1,068,220 samples):
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| CEP | 0.8886 | 0.8103 | 0.8476 | 2,451 |
| DSCT | 0.9132 | 0.8234 | 0.8660 | 18,600 |
| EA | 0.9638 | 0.9515 | 0.9576 | 48,156 |
| EW | 0.9593 | 0.9426 | 0.9509 | 334,751 |
| M | 0.9080 | 0.9878 | 0.9462 | 18,874 |
| Non-var | 0.8678 | 0.9629 | 0.9129 | 134,592 |
| ROT | 0.8642 | 0.8876 | 0.8758 | 119,116 |
| RRAB | 0.9753 | 0.9796 | 0.9775 | 41,822 |
| RRC | 0.8987 | 0.8272 | 0.8615 | 18,748 |
| SR | 0.9680 | 0.9393 | 0.9534 | 331,110 |
| Overall | 0.9378 | 0.9362 | 0.9365 | 1,068,220 |
### Early Classification (Truncated Light Curves)
The model supports early classification on incomplete light curves:
| Completeness | Accuracy | Weighted F1 |
|---|---|---|
| 30% | 74.1% | 74.3% |
| 50% | 85.0% | 85.2% |
| 70% | 89.4% | 89.5% |
| 100% | 93.6% | 93.7% |
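One simple way to emulate such truncated input is to keep only the first fraction of each light curve's time span before extracting features. The sketch below (NumPy only) implements that idea; the exact completeness definition used in the evaluation above may differ:

```python
import numpy as np

def truncate_light_curve(time, mag, completeness):
    """Keep only observations within the first `completeness` fraction
    of the light curve's total time span (a simple truncation scheme;
    this may not match the evaluation's exact completeness definition)."""
    t0 = time.min()
    cutoff = t0 + completeness * (time.max() - t0)
    keep = time <= cutoff
    return time[keep], mag[keep]

# Example: a ~100-day synthetic light curve truncated to 50% completeness
rng = np.random.default_rng(0)
time = np.sort(rng.uniform(0.0, 100.0, size=200))
mag = 15.0 + 0.3 * np.sin(2 * np.pi * time / 3.7)
t50, m50 = truncate_light_curve(time, mag, 0.5)
print(len(time), len(t50))  # fewer points survive truncation
```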
## Files
| File | Format | Size | Description |
|---|---|---|---|
| `init.pkl` | pickle | 110 MB | Level 1: Non-var vs Variable |
| `variable.pkl` | pickle | 100 MB | Level 2: Extrinsic vs Intrinsic |
| `extrinsic.pkl` | pickle | 62 MB | Level 3: ROT vs EB |
| `intrinsic.pkl` | pickle | 51 MB | Level 3: CEP/DSCT/RR/LPV |
| `eb.pkl` | pickle | 18 MB | Level 4: EA vs EW |
| `rr.pkl` | pickle | 3.9 MB | Level 4: RRAB vs RRC |
| `lpv.pkl` | pickle | 5.6 MB | Level 4: M vs SR |
| `label_encoders.pkl` | pickle | 698 B | Label encoders for all levels |
| `*.onnx` | ONNX | ~250 MB | ONNX versions for cross-language inference |
## Input Features

The model expects 15 features extracted from light curves using the `feets` package:
| # | Feature | Description |
|---|---|---|
| 1 | `PeriodLS` | Lomb-Scargle period |
| 2 | `Mean` | Mean magnitude |
| 3 | `Rcs` | Range of cumulative sum |
| 4 | `Psi_eta` | Psi-eta statistic |
| 5 | `StetsonK_AC` | Stetson K with autocorrelation |
| 6 | `Gskew` | Skewness of magnitude differences |
| 7 | `Psi_CS` | Psi cumulative sum |
| 8 | `Skew` | Skewness |
| 9 | `Freq1_harmonics_amplitude_1` | First harmonic amplitude (1st) |
| 10 | `Eta_e` | Eta-e variability index |
| 11 | `LinearTrend` | Linear trend coefficient |
| 12 | `Freq1_harmonics_amplitude_0` | First harmonic amplitude (0th) |
| 13 | `AndersonDarling` | Anderson-Darling statistic |
| 14 | `MaxSlope` | Maximum slope |
| 15 | `StetsonK` | Stetson K index |
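To give a feel for what these quantities are, a few of the simpler features can be computed directly with NumPy. These are standard textbook definitions for illustration only; the production pipeline uses `feets`, whose exact implementations may differ in detail:

```python
import numpy as np

# Illustrative implementations of four of the simpler features.
# The real pipeline extracts all 15 features with feets.

def mean_mag(mag):
    """Mean: mean magnitude."""
    return mag.mean()

def skew(mag):
    """Skew: sample skewness of the magnitudes."""
    m = mag - mag.mean()
    return (m**3).mean() / (m**2).mean() ** 1.5

def linear_trend(time, mag):
    """LinearTrend: slope of a least-squares line fit mag ~ a*time + b."""
    return np.polyfit(time, mag, 1)[0]

def max_slope(time, mag):
    """MaxSlope: largest absolute slope between consecutive observations."""
    return np.abs(np.diff(mag) / np.diff(time)).max()

time = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
mag = np.array([15.0, 15.2, 14.9, 15.1, 15.0])
print(mean_mag(mag), skew(mag), linear_trend(time, mag), max_slope(time, mag))
```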
## Usage

### Python (pickle)
```python
import pickle

import joblib
import numpy as np

# Load all sub-models
MODEL_NAMES = ['init', 'variable', 'extrinsic', 'intrinsic', 'eb', 'rr', 'lpv']
models = {}
for name in MODEL_NAMES:
    models[name] = joblib.load(f'{name}.pkl')

with open('label_encoders.pkl', 'rb') as f:
    label_encoders = pickle.load(f)

def hierarchical_predict(X):
    """Predict 10-class labels using the hierarchical model."""
    n = len(X)
    preds = np.array([''] * n, dtype=object)

    # Level 1: Non-var vs Variable
    le = label_encoders['init']
    p1 = le.inverse_transform(models['init'].predict(X))
    preds[p1 == 'Non-var'] = 'Non-var'
    mask_var = p1 == 'Variable'
    if not mask_var.any():
        return preds
    iv = np.where(mask_var)[0]

    # Level 2: Extrinsic vs Intrinsic
    le = label_encoders['variable']
    p2 = le.inverse_transform(models['variable'].predict(X[mask_var]))

    # Extrinsic branch: ROT vs EB, then EA vs EW
    me = np.zeros(n, dtype=bool)
    me[iv[p2 == 'Extrinsic']] = True
    if me.any():
        le = label_encoders['extrinsic']
        p3 = le.inverse_transform(models['extrinsic'].predict(X[me]))
        ie = np.where(me)[0]
        preds[ie[p3 == 'ROT']] = 'ROT'
        mb = np.zeros(n, dtype=bool)
        mb[ie[p3 == 'EB']] = True
        if mb.any():
            le = label_encoders['eb']
            preds[mb] = le.inverse_transform(models['eb'].predict(X[mb]))

    # Intrinsic branch: CEP/DSCT/RR/LPV, then RR and LPV subtypes
    mi = np.zeros(n, dtype=bool)
    mi[iv[p2 == 'Intrinsic']] = True
    if mi.any():
        le = label_encoders['intrinsic']
        p3 = le.inverse_transform(models['intrinsic'].predict(X[mi]))
        ii = np.where(mi)[0]
        preds[ii[p3 == 'CEP']] = 'CEP'
        preds[ii[p3 == 'DSCT']] = 'DSCT'
        mr = np.zeros(n, dtype=bool)
        mr[ii[p3 == 'RR']] = True
        if mr.any():
            le = label_encoders['rr']
            preds[mr] = le.inverse_transform(models['rr'].predict(X[mr]))
        ml = np.zeros(n, dtype=bool)
        ml[ii[p3 == 'LPV']] = True
        if ml.any():
            le = label_encoders['lpv']
            preds[ml] = le.inverse_transform(models['lpv'].predict(X[ml]))
    return preds

# Example
X = np.random.randn(5, 15)  # Replace with real features extracted by feets
predictions = hierarchical_predict(X)
print(predictions)
```
### Using with Hugging Face Hub
```python
import pickle

import joblib
from huggingface_hub import hf_hub_download

repo_id = "bestdo77/Lightcurve_lgbm_111w_15_model"

# Download model files
MODEL_NAMES = ['init', 'variable', 'extrinsic', 'intrinsic', 'eb', 'rr', 'lpv']
models = {}
for name in MODEL_NAMES:
    path = hf_hub_download(repo_id=repo_id, filename=f"{name}.pkl")
    models[name] = joblib.load(path)

le_path = hf_hub_download(repo_id=repo_id, filename="label_encoders.pkl")
with open(le_path, 'rb') as f:
    label_encoders = pickle.load(f)
```
### ONNX Runtime (cross-language)
```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("init.onnx")
X = np.random.randn(1, 15).astype(np.float32)
pred = session.run(None, {"input": X})
```
## Training Details

- Training data: LEAVES dataset (~1.07M labeled light curves from ZTF, ASAS-SN, and Gaia)
- Architecture: hierarchical LightGBM with unlimited `num_leaves`
- Feature extraction: `feets` v0.4
- Framework: part of the TDLight system
## Citation

```bibtex
@software{tdlight_hierarchical_classifier,
  title={TDLight Hierarchical Variable Star Classifier},
  author={Yu, Xinghang and Yu, Ce and Shao, Zeguang and Yang, Bin},
  year={2026},
  url={https://huggingface.co/bestdo77/Lightcurve_lgbm_111w_15_model}
}
```
## License
MIT License