---
title: Flare
emoji: 🔥
colorFrom: green
colorTo: blue
sdk: streamlit
pinned: false
python_version: 3.11.7
---

# FLARE

*Fine-grained Learning for Alignment of spectra–molecule REpresentations*

## Authors

Yan Zhou Chen, Soha Hassoun
Department of Computer Science, Tufts University


## Overview

FLARE learns a joint embedding space for MS/MS spectra (represented as per-peak chemical formulas from a subformula assigner) and molecular graphs. The default publication model uses FILIP-style contrastive learning (`filipContrastive`): fine-grained similarity between spectrum tokens and graph nodes, with a temperature-scaled loss.

Use cases:

- **Retrieval:** rank a list of candidate SMILES for each query spectrum (MassSpecGym-style evaluation).
- **Interpretation:** the Streamlit app visualizes peak-to-node correspondence for a single spectrum–molecule pair.
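To make the fine-grained scoring concrete, here is a minimal sketch of a FILIP-style late-interaction similarity: each spectrum token is matched to its best graph node, the node-to-token direction is scored the same way, and the two are averaged. This is an illustration of the general technique only, not FLARE's actual masked, temperature-scaled training loss.

```python
def filip_similarity(token_emb, node_emb):
    """FILIP-style score between one spectrum (list of token vectors) and
    one molecule (list of node vectors): mean over tokens of the max
    token-node cosine similarity, symmetrized with the reverse direction.
    Illustrative sketch only; not FLARE's training objective."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)

    # token -> best-matching node, averaged over tokens
    t2n = sum(max(cos(t, n) for n in node_emb) for t in token_emb) / len(token_emb)
    # node -> best-matching token, averaged over nodes
    n2t = sum(max(cos(n, t) for t in token_emb) for n in node_emb) / len(node_emb)
    return 0.5 * (t2n + n2t)
```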

## Model (default stack)

| Component | Setting (see `params.yaml`) |
|---|---|
| Spectrum input | `SpecFormula` — formula peaks from JSON in `subformula_dir_pth` |
| Formula source | `default` — MIST-compatible JSON (`load_mist_data`); optional `sirius` |
| Spectrum encoder | `Transformer_Formula` |
| Molecule encoder | GNN (DGL + dgllife GCN), node embeddings for FILIP |
| Training objective | `filipContrastive` — masked FILIP loss, temperature `contr_temp` |
| Output | Embeddings for cosine / FILIP similarity at test time |

Hyperparameters are grouped into run/logging, the training loop, data paths, featurizers, encoder widths/depths, and evaluation (`at_ks`, `myopic_mces_kwargs`). Only the keys present in `params.yaml` are required; paths can be relative to the repository root (recommended) or absolute.
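A fragment illustrating the grouping (aside from `contr_temp`, `subformula_dir_pth`, and `at_ks`, which this README names, the keys and values below are hypothetical placeholders; the shipped `params.yaml` is authoritative):

```yaml
# Illustrative structure only -- consult the shipped params.yaml for real keys.
run_name: flare_demo                  # run/logging (hypothetical key)
contr_temp: 0.07                      # FILIP temperature (placeholder value)
subformula_dir_pth: data/subformulae  # data path, relative to the repo root
at_ks: [1, 5, 20]                     # evaluation cutoffs (placeholder values)
```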


## Repository layout

| Path | Role |
|---|---|
| `params.yaml` | Canonical training/testing/app hyperparameters |
| `hparams.yaml` | Symlink to `params.yaml` (Hugging Face Spaces convention) |
| `flare/` | Training (`train.py`, `test.py`, `tune.py`), models, data pipeline |
| `massspecgym/` | Vendored MassSpecGym Lightning base classes and utilities |
| `app.py`, `app_utils/` | Streamlit peak–node visualization |
| `pretrained_models/` | Place public checkpoints here (e.g. `flare.ckpt`) |
| `experiments/` | Default output root for new runs (see `flare/definitions.py`) |
| `archive/` | Older scripts and features not part of the slim release (MAGMA, class experiments, legacy YAML, etc.); nothing was deleted |

## Environment variables (no hardcoded machine paths)

| Variable | Purpose |
|---|---|
| `FLARE_PARAMS` | Path to YAML params (default: `<repo>/params.yaml`) |
| `FLARE_CHECKPOINT` | Checkpoint for the app or manual runs |
| `FLARE_DEBUG_DATASET` | When `debug: true`, TSV path for a tiny local dataset |
| `FLARE_REPO_ROOT` | Optional; overrides the repo root for resolving relative paths in `default_param_path()` |
| `MASSSPECGYM_ROOT` | Optional extra `sys.path` root if you use an external `massspecgym` checkout |
| `FLARE_UPLOAD_CKPT`, `HF_REPO_ID`, `HF_REPO_TYPE`, `HF_TOKEN` | See `app_utils/upload_model.py` for HF uploads |
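For example, to point a manual run or the app at a specific config and checkpoint (the paths below are placeholders, not required locations):

```shell
# Point FLARE at a custom params file and checkpoint (placeholder paths).
export FLARE_PARAMS="$PWD/params.yaml"
export FLARE_CHECKPOINT="$PWD/pretrained_models/flare.ckpt"
```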

## Setup

```shell
git clone https://huggingface.co/spaces/HassounLab/FLARE
cd FLARE

conda create -n flare python=3.11
conda activate flare
pip install -r requirements.txt
```

Place the MassSpecGym (or your own) spectrum TSV, candidate JSON, and subformula JSON directory wherever you like, then set the corresponding paths in `params.yaml` (relative paths such as `data/MassSpecGym.tsv` resolve from the repo root).


## Data preparation

Per-spectrum subformula JSON files (one file per spectrum id, MIST-style) are required for `SpecFormula`. Generate them with the bundled assigner (adapted from MIST):

```shell
cd flare/subformula_assign
export SPEC_FILES=/path/to/spectra.tsv
export OUTPUT_DIR=/path/to/subformulae_out
export LABELS_FILE=/path/to/spectra.tsv   # often the same as SPEC_FILES
export MAX_FORMULAE=60
bash run.sh
```

Defaults in `run.sh` point at `data/sample/` under the repo if you add a small sample there.
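After the assigner finishes, a quick sanity check is to confirm that every spectrum in the TSV received a JSON file. A sketch of such a check (the `identifier` column name and the `<id>.json` naming are assumptions here, not FLARE's documented contract; adjust them to your TSV and the assigner's actual output):

```python
import csv
from pathlib import Path

def missing_subformulae(tsv_path, json_dir, id_column="identifier"):
    """Return spectrum ids from the TSV that have no <id>.json in json_dir.
    Column name and file-naming scheme are assumptions for illustration."""
    json_dir = Path(json_dir)
    with open(tsv_path, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        return [row[id_column] for row in reader
                if not (json_dir / f"{row[id_column]}.json").exists()]
```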


## Training

From the repository root (so `flare` and `massspecgym` import correctly):

```shell
cd flare
python train.py                          # uses FLARE_PARAMS or ../params.yaml
python train.py --param_pth /path/to/custom.yaml
```

`train.py` creates `experiments/<YYYYMMDD>_<run_name>/`, writes TensorBoard logs there, and saves checkpoints. `df_test_path` defaults to `<experiment_dir>/result.pkl` if unset.


## Testing (retrieval)

```shell
cd flare
python test.py \
  --checkpoint_pth /path/to/epoch=....ckpt \
  --exp_dir /path/to/experiment_dir   # optional; else latest matching run_name
```

Useful flags: `--candidates_pth`, `--df_test_pth`, `--external_test` (no positive label in the candidate list). Override the params file with `--param_pth` or `FLARE_PARAMS`.
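When `--exp_dir` is omitted, `test.py` falls back to the latest run matching `run_name`. A rough sketch of that kind of lookup, relying on the `<YYYYMMDD>_<run_name>` directory naming described above (FLARE's actual resolution logic lives in `test.py` and may differ):

```python
from pathlib import Path

def latest_experiment(root, run_name):
    """Pick the newest experiments/<YYYYMMDD>_<run_name> directory by its
    date prefix (YYYYMMDD sorts lexicographically). Illustrative sketch."""
    dirs = [p for p in Path(root).iterdir()
            if p.is_dir() and p.name.endswith(f"_{run_name}")]
    return max(dirs, key=lambda p: p.name.split("_", 1)[0], default=None)
```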


## Hyperparameter search

```shell
cd flare
python tune.py --n_trials 20
```

Uses Optuna; the study database and logs live under `experiments/<date>_<run_name>_optuna/`. The best configuration is written to `best_params.yaml` in that folder.


## Streamlit app (peak-to-node visualization)

```shell
streamlit run app.py
```

The app loads architecture settings from `FLARE_PARAMS` (default `params.yaml`) and weights from `FLARE_CHECKPOINT` (default `pretrained_models/flare.ckpt`). Ensure the checkpoint matches the architecture in the YAML.


## Acknowledgments


## Contact

For questions: soha.hassoun@tufts.edu