# TDLight Hierarchical Variable Star Classifier
A hierarchical LightGBM classifier for astronomical variable star classification. The model employs a four-level decision tree of 7 specialized sub-models, achieving 93.6% overall accuracy (weighted F1 = 93.7%) on 10 variable star classes via 5-fold cross-validation on 1.07 million samples.
**⚡ Update (2026-03):** This hierarchical model replaces the previous flat LightGBM classifier. The hierarchical architecture improves accuracy from 92.5% to 93.6% while retaining fast CPU inference.
## Architecture

The classifier uses a hierarchical decision structure where each node is an independent LightGBM model:

```
                    init
                   /    \
            Non-var      Variable
                        /        \
               Extrinsic          Intrinsic
                /     \          /   |   |   \
             ROT      EB      CEP  DSCT  RR   LPV
                     /  \               /  \  /  \
                   EA    EW         RRAB RRC M    SR
```
| Sub-model | Task | Classes |
|---|---|---|
| `init` | Variable vs non-variable | Non-var, Variable |
| `variable` | Extrinsic vs intrinsic | Extrinsic, Intrinsic |
| `extrinsic` | Extrinsic subtypes | EB, ROT |
| `intrinsic` | Intrinsic subtypes | CEP, DSCT, RR, LPV |
| `eb` | Eclipsing binary subtypes | EA, EW |
| `rr` | RR Lyrae subtypes | RRAB, RRC |
| `lpv` | Long-period variable subtypes | M, SR |
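The routing implied by this table can be sketched with plain Python and stub predictors. The node names match the sub-model names above; the `ROUTES` dict, `classify` function, and canned stub labels are illustrative only, not part of the released API (the real loading and prediction code appears under Usage):

```python
# Labels emitted by internal nodes route the sample to the next
# sub-model; any label with no outgoing route is a final class.
ROUTES = {
    "init": {"Variable": "variable"},
    "variable": {"Extrinsic": "extrinsic", "Intrinsic": "intrinsic"},
    "extrinsic": {"EB": "eb"},
    "intrinsic": {"RR": "rr", "LPV": "lpv"},
}

def classify(x, predict):
    """Walk the hierarchy; `predict(node, x)` returns that node's label."""
    node = "init"
    while True:
        label = predict(node, x)
        nxt = ROUTES.get(node, {}).get(label)
        if nxt is None:  # leaf class reached (e.g. Non-var, ROT, EA, RRAB, M)
            return label
        node = nxt

# Demo with canned per-node answers: Variable -> Intrinsic -> RR -> RRC
stub = {"init": "Variable", "variable": "Intrinsic",
        "intrinsic": "RR", "rr": "RRC"}
print(classify(None, lambda node, x: stub[node]))  # -> RRC
```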
## Performance
5-fold cross-validation on the full training set (1,068,220 samples):
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| CEP | 0.8886 | 0.8103 | 0.8476 | 2,451 |
| DSCT | 0.9132 | 0.8234 | 0.8660 | 18,600 |
| EA | 0.9638 | 0.9515 | 0.9576 | 48,156 |
| EW | 0.9593 | 0.9426 | 0.9509 | 334,751 |
| M | 0.9080 | 0.9878 | 0.9462 | 18,874 |
| Non-var | 0.8678 | 0.9629 | 0.9129 | 134,592 |
| ROT | 0.8642 | 0.8876 | 0.8758 | 119,116 |
| RRAB | 0.9753 | 0.9796 | 0.9775 | 41,822 |
| RRC | 0.8987 | 0.8272 | 0.8615 | 18,748 |
| SR | 0.9680 | 0.9393 | 0.9534 | 331,110 |
| Overall | 0.9378 | 0.9362 | 0.9365 | 1,068,220 |
### Early Classification (Truncated Light Curves)
The model supports early classification on incomplete light curves:
| Completeness | Accuracy | Weighted F1 |
|---|---|---|
| 30% | 74.1% | 74.3% |
| 50% | 85.0% | 85.2% |
| 70% | 89.4% | 89.5% |
| 100% | 93.6% | 93.7% |
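One simple way to emulate such truncated input is to keep only the first fraction of each light curve's time span before extracting features. The sketch below (NumPy only) implements that idea; the exact completeness definition used in the evaluation above may differ:

```python
import numpy as np

def truncate_light_curve(time, mag, completeness):
    """Keep only observations within the first `completeness` fraction
    of the light curve's total time span (a simple truncation scheme;
    this may not match the evaluation's exact completeness definition)."""
    t0 = time.min()
    cutoff = t0 + completeness * (time.max() - t0)
    keep = time <= cutoff
    return time[keep], mag[keep]

# Example: a ~100-day synthetic light curve truncated to 50% completeness
rng = np.random.default_rng(0)
time = np.sort(rng.uniform(0.0, 100.0, size=200))
mag = 15.0 + 0.3 * np.sin(2 * np.pi * time / 3.7)
t50, m50 = truncate_light_curve(time, mag, 0.5)
print(len(time), len(t50))  # fewer points survive truncation
```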
## Files
| File | Format | Size | Description |
|---|---|---|---|
| `init.pkl` | pickle | 110 MB | Level 1: Non-var vs Variable |
| `variable.pkl` | pickle | 100 MB | Level 2: Extrinsic vs Intrinsic |
| `extrinsic.pkl` | pickle | 62 MB | Level 3: ROT vs EB |
| `intrinsic.pkl` | pickle | 51 MB | Level 3: CEP/DSCT/RR/LPV |
| `eb.pkl` | pickle | 18 MB | Level 4: EA vs EW |
| `rr.pkl` | pickle | 3.9 MB | Level 4: RRAB vs RRC |
| `lpv.pkl` | pickle | 5.6 MB | Level 4: M vs SR |
| `label_encoders.pkl` | pickle | 698 B | Label encoders for all levels |
| `*.onnx` | ONNX | ~250 MB | ONNX versions for cross-language inference |
## Input Features

The model expects 15 features extracted from light curves using the `feets` package:
| # | Feature | Description |
|---|---|---|
| 1 | `PeriodLS` | Lomb-Scargle period |
| 2 | `Mean` | Mean magnitude |
| 3 | `Rcs` | Range of cumulative sum |
| 4 | `Psi_eta` | Psi-eta statistic |
| 5 | `StetsonK_AC` | Stetson K with autocorrelation |
| 6 | `Gskew` | Skewness of magnitude differences |
| 7 | `Psi_CS` | Psi cumulative sum |
| 8 | `Skew` | Skewness |
| 9 | `Freq1_harmonics_amplitude_1` | First harmonic amplitude (1st) |
| 10 | `Eta_e` | Eta-e variability index |
| 11 | `LinearTrend` | Linear trend coefficient |
| 12 | `Freq1_harmonics_amplitude_0` | First harmonic amplitude (0th) |
| 13 | `AndersonDarling` | Anderson-Darling statistic |
| 14 | `MaxSlope` | Maximum slope |
| 15 | `StetsonK` | Stetson K index |
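To give a feel for what these quantities are, a few of the simpler features can be computed directly with NumPy. These are standard textbook definitions for illustration only; the production pipeline uses `feets`, whose exact implementations may differ in detail:

```python
import numpy as np

# Illustrative implementations of four of the simpler features.
# The real pipeline extracts all 15 features with feets.

def mean_mag(mag):
    """Mean: mean magnitude."""
    return mag.mean()

def skew(mag):
    """Skew: sample skewness of the magnitudes."""
    m = mag - mag.mean()
    return (m**3).mean() / (m**2).mean() ** 1.5

def linear_trend(time, mag):
    """LinearTrend: slope of a least-squares line fit mag ~ a*time + b."""
    return np.polyfit(time, mag, 1)[0]

def max_slope(time, mag):
    """MaxSlope: largest absolute slope between consecutive observations."""
    return np.abs(np.diff(mag) / np.diff(time)).max()

time = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
mag = np.array([15.0, 15.2, 14.9, 15.1, 15.0])
print(mean_mag(mag), skew(mag), linear_trend(time, mag), max_slope(time, mag))
```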
## Usage

### Python (pickle)
```python
import pickle

import joblib
import numpy as np

# Load all sub-models
MODEL_NAMES = ['init', 'variable', 'extrinsic', 'intrinsic', 'eb', 'rr', 'lpv']
models = {}
for name in MODEL_NAMES:
    models[name] = joblib.load(f'{name}.pkl')

with open('label_encoders.pkl', 'rb') as f:
    label_encoders = pickle.load(f)

def hierarchical_predict(X):
    """Predict 10-class labels using the hierarchical model."""
    n = len(X)
    preds = np.array([''] * n, dtype=object)

    # Level 1: Non-var vs Variable
    le = label_encoders['init']
    p1 = le.inverse_transform(models['init'].predict(X))
    preds[p1 == 'Non-var'] = 'Non-var'
    mask_var = p1 == 'Variable'
    if not mask_var.any():
        return preds
    iv = np.where(mask_var)[0]

    # Level 2: Extrinsic vs Intrinsic
    le = label_encoders['variable']
    p2 = le.inverse_transform(models['variable'].predict(X[mask_var]))

    # Extrinsic branch: ROT vs EB, then EA vs EW
    me = np.zeros(n, dtype=bool)
    me[iv[p2 == 'Extrinsic']] = True
    if me.any():
        le = label_encoders['extrinsic']
        p3 = le.inverse_transform(models['extrinsic'].predict(X[me]))
        ie = np.where(me)[0]
        preds[ie[p3 == 'ROT']] = 'ROT'
        mb = np.zeros(n, dtype=bool)
        mb[ie[p3 == 'EB']] = True
        if mb.any():
            le = label_encoders['eb']
            preds[mb] = le.inverse_transform(models['eb'].predict(X[mb]))

    # Intrinsic branch: CEP/DSCT/RR/LPV, then RR and LPV subtypes
    mi = np.zeros(n, dtype=bool)
    mi[iv[p2 == 'Intrinsic']] = True
    if mi.any():
        le = label_encoders['intrinsic']
        p3 = le.inverse_transform(models['intrinsic'].predict(X[mi]))
        ii = np.where(mi)[0]
        preds[ii[p3 == 'CEP']] = 'CEP'
        preds[ii[p3 == 'DSCT']] = 'DSCT'
        mr = np.zeros(n, dtype=bool)
        mr[ii[p3 == 'RR']] = True
        if mr.any():
            le = label_encoders['rr']
            preds[mr] = le.inverse_transform(models['rr'].predict(X[mr]))
        ml = np.zeros(n, dtype=bool)
        ml[ii[p3 == 'LPV']] = True
        if ml.any():
            le = label_encoders['lpv']
            preds[ml] = le.inverse_transform(models['lpv'].predict(X[ml]))
    return preds

# Example
X = np.random.randn(5, 15)  # Replace with real features extracted by feets
predictions = hierarchical_predict(X)
print(predictions)
```
### Using with Hugging Face Hub
```python
import pickle

import joblib
from huggingface_hub import hf_hub_download

repo_id = "bestdo77/Lightcurve_lgbm_111w_15_model"

# Download model files
MODEL_NAMES = ['init', 'variable', 'extrinsic', 'intrinsic', 'eb', 'rr', 'lpv']
models = {}
for name in MODEL_NAMES:
    path = hf_hub_download(repo_id=repo_id, filename=f"{name}.pkl")
    models[name] = joblib.load(path)

le_path = hf_hub_download(repo_id=repo_id, filename="label_encoders.pkl")
with open(le_path, 'rb') as f:
    label_encoders = pickle.load(f)
```
### ONNX Runtime (cross-language)
```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("init.onnx")
X = np.random.randn(1, 15).astype(np.float32)
pred = session.run(None, {"input": X})
```
## Training Details

- Training data: LEAVES dataset (~1.07M labeled light curves from ZTF, ASAS-SN, and Gaia)
- Architecture: hierarchical LightGBM with unlimited `num_leaves`
- Feature extraction: `feets` v0.4
- Framework: part of the TDLight system
## Citation

```bibtex
@software{tdlight_hierarchical_classifier,
  title={TDLight Hierarchical Variable Star Classifier},
  author={Yu, Xinghang and Yu, Ce and Shao, Zeguang and Yang, Bin},
  year={2026},
  url={https://huggingface.co/bestdo77/Lightcurve_lgbm_111w_15_model}
}
```
## License
MIT License