TDLight Hierarchical Variable Star Classifier

A hierarchical LightGBM classifier for astronomical variable star classification. The model routes each sample through a four-level decision hierarchy of 7 specialized sub-models, achieving 93.6% overall accuracy (weighted F1 = 93.7%) across 10 variable star classes under 5-fold cross-validation on 1.07 million samples.

⚡ Update (2026-03): This hierarchical model replaces the previous flat LightGBM classifier. The hierarchical architecture improves overall accuracy from 92.5% to 93.6% while keeping CPU inference fast.

Architecture

The classifier uses a hierarchical decision structure where each node is an independent LightGBM model:

                        init
                       /    \
                Non-var     Variable
                            /      \
                     Extrinsic    Intrinsic
                      /    \       /   |   \
                    ROT    EB    CEP  DSCT  RR   LPV
                          / \              / \   / \
                        EA   EW        RRAB RRC  M  SR
| Sub-model | Task | Classes |
|-----------|------|---------|
| init | Variable vs Non-variable | Non-var, Variable |
| variable | Extrinsic vs Intrinsic | Extrinsic, Intrinsic |
| extrinsic | Extrinsic subtypes | EB, ROT |
| intrinsic | Intrinsic subtypes | CEP, DSCT, RR, LPV |
| eb | Eclipsing binary subtypes | EA, EW |
| rr | RR Lyrae subtypes | RRAB, RRC |
| lpv | Long-period variable subtypes | M, SR |
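
The routing above can be encoded as a nested mapping. This is only an illustrative representation of the diagram, not an artifact shipped with the model:

```python
# Decision hierarchy from the diagram above, as a nested dict.
# None marks a leaf (final class); a dict marks a further sub-model decision.
HIERARCHY = {
    "Non-var": None,
    "Variable": {
        "Extrinsic": {
            "ROT": None,
            "EB": {"EA": None, "EW": None},
        },
        "Intrinsic": {
            "CEP": None,
            "DSCT": None,
            "RR": {"RRAB": None, "RRC": None},
            "LPV": {"M": None, "SR": None},
        },
    },
}

def leaves(node):
    """Collect the leaf class labels reachable from a node."""
    out = []
    for label, child in node.items():
        out.extend([label] if child is None else leaves(child))
    return out

print(sorted(leaves(HIERARCHY)))
# The 10 final classes: CEP, DSCT, EA, EW, M, Non-var, ROT, RRAB, RRC, SR
```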

Performance

5-fold cross-validation on the full training set (1,068,220 samples):

| Class | Precision | Recall | F1-score | Support |
|---------|--------|--------|--------|-----------|
| CEP | 0.8886 | 0.8103 | 0.8476 | 2,451 |
| DSCT | 0.9132 | 0.8234 | 0.8660 | 18,600 |
| EA | 0.9638 | 0.9515 | 0.9576 | 48,156 |
| EW | 0.9593 | 0.9426 | 0.9509 | 334,751 |
| M | 0.9080 | 0.9878 | 0.9462 | 18,874 |
| Non-var | 0.8678 | 0.9629 | 0.9129 | 134,592 |
| ROT | 0.8642 | 0.8876 | 0.8758 | 119,116 |
| RRAB | 0.9753 | 0.9796 | 0.9775 | 41,822 |
| RRC | 0.8987 | 0.8272 | 0.8615 | 18,748 |
| SR | 0.9680 | 0.9393 | 0.9534 | 331,110 |
| Overall | 0.9378 | 0.9362 | 0.9365 | 1,068,220 |
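
As a sanity check, the Overall weighted F1 can be reproduced from the per-class rows: it is the support-weighted mean of the per-class F1 scores (values below are copied from the table):

```python
import numpy as np

# Per-class F1 and support, in the same order as the table rows.
f1 = np.array([0.8476, 0.8660, 0.9576, 0.9509, 0.9462,
               0.9129, 0.8758, 0.9775, 0.8615, 0.9534])
support = np.array([2451, 18600, 48156, 334751, 18874,
                    134592, 119116, 41822, 18748, 331110])

# Weighted F1 = sum(support_i * f1_i) / sum(support_i)
weighted_f1 = float((f1 * support).sum() / support.sum())
print(f"{weighted_f1:.4f}")  # ~0.9365, matching the Overall row
```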

Early Classification (Truncated Light Curves)

The model supports early classification on incomplete light curves:

| Completeness | Accuracy | Weighted F1 |
|--------------|----------|-------------|
| 30% | 74.1% | 74.3% |
| 50% | 85.0% | 85.2% |
| 70% | 89.4% | 89.5% |
| 100% | 93.6% | 93.7% |
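
The exact truncation protocol behind these numbers is not specified here; one plausible scheme, keeping only observations within the first fraction of the time baseline before re-extracting features, can be sketched as:

```python
import numpy as np

def truncate_light_curve(time, mag, completeness):
    """Keep observations within the first `completeness` fraction of the
    time baseline. Illustrative only; the evaluation protocol used for
    the table above may differ."""
    t0 = time.min()
    cutoff = t0 + completeness * (time.max() - t0)
    keep = time <= cutoff
    return time[keep], mag[keep]

# Toy light curve: 200 evenly spaced epochs over 100 days.
time = np.linspace(0, 100, 200)
mag = 15.0 + 0.3 * np.sin(2 * np.pi * time / 7.0)

t50, m50 = truncate_light_curve(time, mag, 0.5)
print(len(t50), len(time))  # half the epochs survive a 50% cut
```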

Files

| File | Format | Size | Description |
|------|--------|------|-------------|
| init.pkl | pickle | 110 MB | Level 1: Non-var vs Variable |
| variable.pkl | pickle | 100 MB | Level 2: Extrinsic vs Intrinsic |
| extrinsic.pkl | pickle | 62 MB | Level 3: ROT vs EB |
| intrinsic.pkl | pickle | 51 MB | Level 3: CEP/DSCT/RR/LPV |
| eb.pkl | pickle | 18 MB | Level 4: EA vs EW |
| rr.pkl | pickle | 3.9 MB | Level 4: RRAB vs RRC |
| lpv.pkl | pickle | 5.6 MB | Level 4: M vs SR |
| label_encoders.pkl | pickle | 698 B | Label encoders for all levels |
| *.onnx | ONNX | ~250 MB | ONNX versions for cross-language inference |

Input Features

The model expects 15 features extracted from light curves using feets:

| # | Feature | Description |
|---|---------|-------------|
| 1 | PeriodLS | Lomb-Scargle period |
| 2 | Mean | Mean magnitude |
| 3 | Rcs | Range of cumulative sum |
| 4 | Psi_eta | Eta_e index on the phase-folded light curve |
| 5 | StetsonK_AC | Stetson K on the autocorrelation function |
| 6 | Gskew | Median-based measure of skew |
| 7 | Psi_CS | Cumulative-sum range (Rcs) on the phase-folded light curve |
| 8 | Skew | Skewness of the magnitudes |
| 9 | Freq1_harmonics_amplitude_1 | Amplitude of the 1st harmonic of the dominant frequency |
| 10 | Eta_e | Eta-e variability index |
| 11 | LinearTrend | Linear trend coefficient |
| 12 | Freq1_harmonics_amplitude_0 | Amplitude of the fundamental of the dominant frequency |
| 13 | AndersonDarling | Anderson-Darling statistic |
| 14 | MaxSlope | Maximum slope between consecutive observations |
| 15 | StetsonK | Stetson K index |
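
feets is the intended feature extractor, but a few of the simpler features can be illustrated directly in NumPy. The moment-based definitions below are approximations for intuition on toy data and may differ in detail from feets' implementations:

```python
import numpy as np

# Toy light curve: 500 irregularly sampled epochs of a noisy sinusoid.
rng = np.random.default_rng(0)
time = np.sort(rng.uniform(0, 100, 500))
mag = 15.0 + 0.3 * np.sin(2 * np.pi * time / 1.7) + rng.normal(0, 0.05, 500)

mean = mag.mean()                                   # feature 2: Mean
skew = ((mag - mean) ** 3).mean() / mag.std() ** 3  # feature 8: Skew (moment-based)
slope = np.polyfit(time, mag, 1)[0]                 # feature 11: LinearTrend (fitted slope)

print(mean, skew, slope)
```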

Usage

Python (pickle)

import pickle
import joblib
import numpy as np

# Load all sub-models
MODEL_NAMES = ['init', 'variable', 'extrinsic', 'intrinsic', 'eb', 'rr', 'lpv']
models = {}
for name in MODEL_NAMES:
    models[name] = joblib.load(f'{name}.pkl')

with open('label_encoders.pkl', 'rb') as f:
    label_encoders = pickle.load(f)

def hierarchical_predict(X):
    """Predict 10-class labels using the hierarchical model."""
    n = len(X)
    preds = np.array([''] * n, dtype=object)

    # Level 1: Non-var vs Variable
    le = label_encoders['init']
    p1 = le.inverse_transform(models['init'].predict(X))
    preds[p1 == 'Non-var'] = 'Non-var'

    mask_var = p1 == 'Variable'
    if not mask_var.any():
        return preds
    iv = np.where(mask_var)[0]

    # Level 2: Extrinsic vs Intrinsic
    le = label_encoders['variable']
    p2 = le.inverse_transform(models['variable'].predict(X[mask_var]))

    # Extrinsic branch
    me = np.zeros(n, dtype=bool)
    me[iv[p2 == 'Extrinsic']] = True
    if me.any():
        le = label_encoders['extrinsic']
        p3 = le.inverse_transform(models['extrinsic'].predict(X[me]))
        ie = np.where(me)[0]
        preds[ie[p3 == 'ROT']] = 'ROT'
        mb = np.zeros(n, dtype=bool)
        mb[ie[p3 == 'EB']] = True
        if mb.any():
            le = label_encoders['eb']
            preds[mb] = le.inverse_transform(models['eb'].predict(X[mb]))

    # Intrinsic branch
    mi = np.zeros(n, dtype=bool)
    mi[iv[p2 == 'Intrinsic']] = True
    if mi.any():
        le = label_encoders['intrinsic']
        p3 = le.inverse_transform(models['intrinsic'].predict(X[mi]))
        ii = np.where(mi)[0]
        preds[ii[p3 == 'CEP']] = 'CEP'
        preds[ii[p3 == 'DSCT']] = 'DSCT'
        mr = np.zeros(n, dtype=bool)
        mr[ii[p3 == 'RR']] = True
        if mr.any():
            le = label_encoders['rr']
            preds[mr] = le.inverse_transform(models['rr'].predict(X[mr]))
        ml = np.zeros(n, dtype=bool)
        ml[ii[p3 == 'LPV']] = True
        if ml.any():
            le = label_encoders['lpv']
            preds[ml] = le.inverse_transform(models['lpv'].predict(X[ml]))

    return preds

# Example
X = np.random.randn(5, 15)  # Replace with real features
predictions = hierarchical_predict(X)
print(predictions)

Using with Hugging Face Hub

from huggingface_hub import hf_hub_download
import joblib, pickle, numpy as np

repo_id = "bestdo77/Lightcurve_lgbm_111w_15_model"

# Download model files
MODEL_NAMES = ['init', 'variable', 'extrinsic', 'intrinsic', 'eb', 'rr', 'lpv']
models = {}
for name in MODEL_NAMES:
    path = hf_hub_download(repo_id=repo_id, filename=f"{name}.pkl")
    models[name] = joblib.load(path)

le_path = hf_hub_download(repo_id=repo_id, filename="label_encoders.pkl")
with open(le_path, 'rb') as f:
    label_encoders = pickle.load(f)

ONNX Runtime (cross-language)

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("init.onnx")

# Input/output tensor names depend on the conversion tool; inspect them
# rather than hard-coding "input":
input_name = session.get_inputs()[0].name

X = np.random.randn(1, 15).astype(np.float32)
pred = session.run(None, {input_name: X})

Training Details

  • Training data: LEAVES dataset (~1.07M labeled light curves from ZTF, ASAS-SN, Gaia)
  • Architecture: Hierarchical LightGBM with unlimited num_leaves
  • Feature extraction: feets v0.4
  • Framework: Part of the TDLight system

Citation

@software{tdlight_hierarchical_classifier,
  title={TDLight Hierarchical Variable Star Classifier},
  author={Yu, Xinghang and Yu, Ce and Shao, Zeguang and Yang, Bin},
  year={2026},
  url={https://huggingface.co/bestdo77/Lightcurve_lgbm_111w_15_model}
}

License

MIT License
