---
title: Flare
emoji: 🔥
colorFrom: green
colorTo: blue
sdk: streamlit
pinned: false
python_version: 3.11.7
---

# FLARE

*Fine-grained Learning for Alignment of spectra–molecule REpresentations*

## Authors

Yan Zhou Chen, Soha Hassoun
Department of Computer Science, Tufts University


## Overview

FLARE learns a joint embedding space for MS/MS spectra (represented as per-peak chemical formulas from a subformula assigner) and molecular graphs. The default publication model uses FILIP-style contrastive learning (`filipContrastive`): fine-grained similarity between spectrum tokens and graph nodes, with a temperature-scaled loss.

Use cases:

- **Retrieval:** rank a list of candidate SMILES for each query spectrum (MassSpecGym-style evaluation).
- **Interpretation:** the Streamlit app visualizes peak-to-node correspondence for a single spectrum–molecule pair.
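To make the fine-grained scoring concrete, here is a minimal sketch of a FILIP-style late-interaction similarity: each spectrum token is matched to its best graph node, the node-to-token direction is scored the same way, and the two are averaged. This is an illustration of the general technique only, not FLARE's actual masked, temperature-scaled training loss.

```python
def filip_similarity(token_emb, node_emb):
    """FILIP-style score between one spectrum (list of token vectors) and
    one molecule (list of node vectors): mean over tokens of the max
    token-node cosine similarity, symmetrized with the reverse direction.
    Illustrative sketch only; not FLARE's training objective."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)

    # token -> best-matching node, averaged over tokens
    t2n = sum(max(cos(t, n) for n in node_emb) for t in token_emb) / len(token_emb)
    # node -> best-matching token, averaged over nodes
    n2t = sum(max(cos(n, t) for t in token_emb) for n in node_emb) / len(node_emb)
    return 0.5 * (t2n + n2t)
```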

## Model (default stack)

| Component | Setting (see `params.yaml`) |
|---|---|
| Spectrum input | `SpecFormula` — formula peaks from JSON in `subformula_dir_pth` |
| Formula source | `default` — MIST-compatible JSON (`load_mist_data`); optional `sirius` |
| Spectrum encoder | `Transformer_Formula` |
| Molecule encoder | GNN (DGL + dgllife GCN), node embeddings for FILIP |
| Training objective | `filipContrastive` — masked FILIP loss, temperature `contr_temp` |
| Output | Embeddings for cosine / FILIP similarity at test time |

Hyperparameters are grouped into run/logging, the training loop, data paths, featurizers, encoder widths/depths, and evaluation (`at_ks`, `myopic_mces_kwargs`). Only the keys present in `params.yaml` are required; paths can be relative to the repository root (recommended) or absolute.
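A fragment illustrating the grouping (aside from `contr_temp`, `subformula_dir_pth`, and `at_ks`, which this README names, the keys and values below are hypothetical placeholders; the shipped `params.yaml` is authoritative):

```yaml
# Illustrative structure only -- consult the shipped params.yaml for real keys.
run_name: flare_demo                  # run/logging (hypothetical key)
contr_temp: 0.07                      # FILIP temperature (placeholder value)
subformula_dir_pth: data/subformulae  # data path, relative to the repo root
at_ks: [1, 5, 20]                     # evaluation cutoffs (placeholder values)
```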


## Repository layout

| Path | Role |
|---|---|
| `params.yaml` | Canonical training/testing/app hyperparameters |
| `hparams.yaml` | Symlink to `params.yaml` (Hugging Face Spaces convention) |
| `flare/` | Training (`train.py`, `test.py`, `tune.py`), models, data pipeline |
| `massspecgym/` | Vendored MassSpecGym Lightning base classes and utilities |
| `app.py`, `app_utils/` | Streamlit peak–node visualization |
| `pretrained_models/` | Place public checkpoints here (e.g. `flare.ckpt`) |
| `experiments/` | Default output root for new runs (see `flare/definitions.py`) |
| `archive/` | Older scripts and features not part of the slim release (MAGMA, class experiments, legacy YAML, etc.); nothing was deleted |

## Environment variables (no hardcoded machine paths)

| Variable | Purpose |
|---|---|
| `FLARE_PARAMS` | Path to YAML params (default: `<repo>/params.yaml`) |
| `FLARE_CHECKPOINT` | Checkpoint for the app or manual runs |
| `FLARE_DEBUG_DATASET` | When `debug: true`, TSV path for a tiny local dataset |
| `FLARE_REPO_ROOT` | Optional; overrides the repo root for resolving relative paths in `default_param_path()` |
| `MASSSPECGYM_ROOT` | Optional extra `sys.path` root if you use an external `massspecgym` checkout |
| `FLARE_UPLOAD_CKPT`, `HF_REPO_ID`, `HF_REPO_TYPE`, `HF_TOKEN` | See `app_utils/upload_model.py` for HF uploads |
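For example, to point a manual run or the app at a specific config and checkpoint (the paths below are placeholders, not required locations):

```shell
# Point FLARE at a custom params file and checkpoint (placeholder paths).
export FLARE_PARAMS="$PWD/params.yaml"
export FLARE_CHECKPOINT="$PWD/pretrained_models/flare.ckpt"
```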

## Setup

```shell
git clone https://huggingface.co/spaces/HassounLab/FLARE
cd FLARE

conda create -n flare python=3.11
conda activate flare
pip install -r requirements.txt
```

Place the MassSpecGym (or your own) spectrum TSV, candidate JSON, and subformula JSON directory wherever you like, then set the corresponding paths in `params.yaml` (relative paths such as `data/MassSpecGym.tsv` resolve from the repo root).


## Data preparation

Per-spectrum subformula JSON files (one file per spectrum id, MIST-style) are required for `SpecFormula`. Generate them with the bundled assigner (adapted from MIST):

```shell
cd flare/subformula_assign
export SPEC_FILES=/path/to/spectra.tsv
export OUTPUT_DIR=/path/to/subformulae_out
export LABELS_FILE=/path/to/spectra.tsv   # often the same as SPEC_FILES
export MAX_FORMULAE=60
bash run.sh
```

Defaults in `run.sh` point at `data/sample/` under the repo if you add a small sample there.
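After the assigner finishes, a quick sanity check is to confirm that every spectrum in the TSV received a JSON file. A sketch of such a check (the `identifier` column name and the `<id>.json` naming are assumptions here, not FLARE's documented contract; adjust them to your TSV and the assigner's actual output):

```python
import csv
from pathlib import Path

def missing_subformulae(tsv_path, json_dir, id_column="identifier"):
    """Return spectrum ids from the TSV that have no <id>.json in json_dir.
    Column name and file-naming scheme are assumptions for illustration."""
    json_dir = Path(json_dir)
    with open(tsv_path, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        return [row[id_column] for row in reader
                if not (json_dir / f"{row[id_column]}.json").exists()]
```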


## Training

From the repository root (so `flare` and `massspecgym` import correctly):

```shell
cd flare
python train.py                          # uses FLARE_PARAMS or ../params.yaml
python train.py --param_pth /path/to/custom.yaml
```

`train.py` creates `experiments/<YYYYMMDD>_<run_name>/`, writes TensorBoard logs there, and saves checkpoints. `df_test_path` defaults to `<experiment_dir>/result.pkl` if unset.


## Testing (retrieval)

```shell
cd flare
python test.py \
  --checkpoint_pth /path/to/epoch=....ckpt \
  --exp_dir /path/to/experiment_dir   # optional; else latest matching run_name
```

Useful flags: `--candidates_pth`, `--df_test_pth`, `--external_test` (no positive label in the candidate list). Override the params file with `--param_pth` or `FLARE_PARAMS`.
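When `--exp_dir` is omitted, `test.py` falls back to the latest run matching `run_name`. A rough sketch of that kind of lookup, relying on the `<YYYYMMDD>_<run_name>` directory naming described above (FLARE's actual resolution logic lives in `test.py` and may differ):

```python
from pathlib import Path

def latest_experiment(root, run_name):
    """Pick the newest experiments/<YYYYMMDD>_<run_name> directory by its
    date prefix (YYYYMMDD sorts lexicographically). Illustrative sketch."""
    dirs = [p for p in Path(root).iterdir()
            if p.is_dir() and p.name.endswith(f"_{run_name}")]
    return max(dirs, key=lambda p: p.name.split("_", 1)[0], default=None)
```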


## Hyperparameter search

```shell
cd flare
python tune.py --n_trials 20
```

Uses Optuna; the study database and logs live under `experiments/<date>_<run_name>_optuna/`. The best configuration is written to `best_params.yaml` in that folder.


## Streamlit app (peak-to-node visualization)

```shell
streamlit run app.py
```

The app loads architecture settings from `FLARE_PARAMS` (default `params.yaml`) and weights from `FLARE_CHECKPOINT` (default `pretrained_models/flare.ckpt`). Ensure the checkpoint matches the architecture in the YAML.


## Acknowledgments


## Contact

For questions: soha.hassoun@tufts.edu