title: Flare
emoji: 🔥
colorFrom: green
colorTo: blue
sdk: streamlit
pinned: false
python_version: 3.11.7
FLARE
Fine-grained Learning for Alignment of spectra–molecule REpresentations
Authors
Yan Zhou Chen, Soha Hassoun
Department of Computer Science, Tufts University
Overview
FLARE learns a joint embedding space for MS/MS spectra (represented as per-peak chemical formulas from a subformula assigner) and molecular graphs. The default publication model uses FILIP-style contrastive learning (filipContrastive): fine-grained similarity between spectrum tokens and graph nodes, with a temperature-scaled loss.
Use cases:
- Retrieval: rank a list of candidate SMILES for each query spectrum (MassSpecGym-style evaluation).
- Interpretation: the Streamlit app visualizes peak-to-node correspondence for a single spectrum–molecule pair.
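As a rough illustration of the fine-grained similarity described above, the sketch below scores one spectrum against one molecule in the FILIP manner: each peak embedding is matched to its most similar node embedding, and the matches are averaged (and likewise from nodes to peaks). The function names and the plain-list representation are assumptions made for this example; the repository's own implementation operates on batched tensors and may differ in detail.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def filip_similarity(
    peaks: Sequence[Vector], nodes: Sequence[Vector]
) -> Tuple[float, float, List[int]]:
    """Return (spectrum-to-molecule score, molecule-to-spectrum score, best node per peak)."""
    peaks_n = [l2_normalize(p) for p in peaks]
    nodes_n = [l2_normalize(n) for n in nodes]
    # Pairwise cosine similarity: one row per peak, one column per node.
    sims = [[dot(p, n) for n in nodes_n] for p in peaks_n]
    # For every peak, the index of its most similar node (peak-to-node correspondence).
    best_node = [max(range(len(nodes_n)), key=row.__getitem__) for row in sims]
    spec_to_mol = sum(row[j] for row, j in zip(sims, best_node)) / len(peaks_n)
    # For every node, the similarity of its most similar peak, averaged over nodes.
    mol_to_spec = sum(
        max(sims[i][j] for i in range(len(peaks_n))) for j in range(len(nodes_n))
    ) / len(nodes_n)
    return spec_to_mol, mol_to_spec, best_node


# Toy example: two peak embeddings and three node embeddings in 2-D.
example_peaks = [[1.0, 0.0], [0.0, 1.0]]
example_nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s2m, m2s, matches = filip_similarity(example_peaks, example_nodes)
print(matches)  # prints [0, 1]
```

The `matches` list is the kind of peak-to-node assignment that an interpretation view could display, while the two scores are what a retrieval step would compare across candidates.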
Model (default stack)
| Component | Setting (see params.yaml) |
|---|---|
| Spectrum input | SpecFormula — formula peaks from JSON in subformula_dir_pth |
| Formula source | default — MIST-compatible JSON (load_mist_data); optional sirius |
| Spectrum encoder | Transformer_Formula |
| Molecule encoder | GNN (DGL + dgllife GCN), node embeddings for FILIP |
| Training objective | filipContrastive — masked FILIP loss, temperature contr_temp |
| Output | Embeddings for cosine / FILIP similarity at test time |
Hyperparameters are split into: run/logging, training loop, data paths, featurizers, encoder widths/depths, and evaluation (at_ks, myopic_mces_kwargs). Only keys present in params.yaml are required; paths can be relative to the repository root (recommended) or absolute.
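To make the role of the temperature (contr_temp in the table above) more concrete, here is a minimal sketch of a symmetric, temperature-scaled contrastive loss over a batch. It assumes a precomputed square similarity matrix in which entry [i][j] scores spectrum i against molecule j and the diagonal holds the true pairs. The masking used by the actual filipContrastive objective is not reproduced here, and the function names are hypothetical.

```python
import math
from typing import Sequence


def neg_log_softmax_at(logits: Sequence[float], index: int) -> float:
    """Cross-entropy of a softmax over `logits` against the target `index`."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[index]


def contrastive_loss(sim: Sequence[Sequence[float]], temperature: float) -> float:
    """Symmetric InfoNCE-style loss; matching pairs sit on the diagonal of `sim`."""
    n = len(sim)
    # Spectrum -> molecule: each row is a classification over molecules.
    spec_to_mol = sum(
        neg_log_softmax_at([s / temperature for s in sim[i]], i) for i in range(n)
    )
    # Molecule -> spectrum: each column is a classification over spectra.
    mol_to_spec = sum(
        neg_log_softmax_at([sim[i][j] / temperature for i in range(n)], j)
        for j in range(n)
    )
    return 0.5 * (spec_to_mol + mol_to_spec) / n


# A well-separated batch: true pairs score 1.0, mismatched pairs score 0.0.
example_sim = [[1.0, 0.0], [0.0, 1.0]]
sharp_loss = contrastive_loss(example_sim, temperature=0.1)
soft_loss = contrastive_loss(example_sim, temperature=1.0)
print(sharp_loss < soft_loss)  # prints True
```

A lower temperature sharpens the softmax, so the same margin between matched and mismatched pairs yields a smaller loss.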
Repository layout
| Path | Role |
|---|---|
| params.yaml | Canonical training/testing/app hyperparameters |
| hparams.yaml | Symlink to params.yaml (Hugging Face Spaces convention) |
| flare/ | Training (train.py, test.py, tune.py), models, data pipeline |
| massspecgym/ | Vendored MassSpecGym Lightning base classes and utilities |
| app.py, app_utils/ | Streamlit peak–node visualization |
| pretrained_models/ | Place public checkpoints here (e.g. flare.ckpt) |
| experiments/ | Default output root for new runs (see flare/definitions.py) |
| archive/ | Older scripts and features not part of the slim release (MAGMA, class experiments, legacy YAML, etc.); nothing was deleted |
Environment variables (no hardcoded machine paths)
| Variable | Purpose |
|---|---|
| FLARE_PARAMS | Path to YAML params (default: <repo>/params.yaml) |
| FLARE_CHECKPOINT | Checkpoint for the app or manual runs |
| FLARE_DEBUG_DATASET | When debug: true, TSV path for a tiny local dataset |
| FLARE_REPO_ROOT | Optional; overrides repo root for resolving relative paths in default_param_path() |
| MASSSPECGYM_ROOT | Optional extra sys.path root if you use an external massspecgym checkout |
| FLARE_UPLOAD_CKPT, HF_REPO_ID, HF_REPO_TYPE, HF_TOKEN | See app_utils/upload_model.py for HF uploads |
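The lookup order implied by these variables can be sketched as follows. This is an illustration only, not the code in flare/definitions.py or default_param_path(): the helper names are hypothetical, and the rule that relative values are joined onto the repository root is an assumption carried over from how paths in params.yaml are described.

```python
from pathlib import Path
from typing import Dict, Mapping


def resolve_setting(
    env: Mapping[str, str], name: str, default: Path, repo_root: Path
) -> Path:
    """Use the environment value if set, else the default; anchor relative paths at repo_root."""
    raw = env.get(name)
    path = Path(raw).expanduser() if raw else default
    return path if path.is_absolute() else repo_root / path


def resolve_flare_paths(env: Mapping[str, str], fallback_root: Path) -> Dict[str, Path]:
    """Resolve the params file and checkpoint from FLARE_* variables (hypothetical helper)."""
    # FLARE_REPO_ROOT, when set, replaces the fallback repository root.
    repo_root = Path(env.get("FLARE_REPO_ROOT") or fallback_root)
    return {
        "params": resolve_setting(env, "FLARE_PARAMS", Path("params.yaml"), repo_root),
        "checkpoint": resolve_setting(
            env, "FLARE_CHECKPOINT", Path("pretrained_models/flare.ckpt"), repo_root
        ),
    }


# Example with an explicit mapping; pass os.environ to read the real environment.
example_env = {"FLARE_PARAMS": "/tmp/custom.yaml"}
resolved = resolve_flare_paths(example_env, fallback_root=Path("/repo"))
print(resolved["params"])      # the override from FLARE_PARAMS
print(resolved["checkpoint"])  # the default, anchored at the repo root
```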
Setup
git clone https://huggingface.co/spaces/HassounLab/FLARE
cd FLARE
conda create -n flare python=3.11
conda activate flare
pip install -r requirements.txt
Place the MassSpecGym (or your own) spectrum TSV, candidate JSON, and subformula JSON directory wherever you like, then set their paths in params.yaml (relative paths such as data/MassSpecGym.tsv resolve from the repo root).
Data preparation
Per-spectrum subformula JSON files (one file per spectrum id, MIST-style) are required for SpecFormula. Generate them with the bundled assigner (adapted from MIST):
cd flare/subformula_assign
export SPEC_FILES=/path/to/spectra.tsv
export OUTPUT_DIR=/path/to/subformulae_out
export LABELS_FILE=/path/to/spectra.tsv # often same as SPEC_FILES
export MAX_FORMULAE=60
bash run.sh
The defaults in run.sh point to data/sample/ under the repository root, in case you add a small sample there.
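Because SpecFormula needs one JSON file per spectrum id, it can help to check the assigner's output before training. The sketch below lists spectrum ids that have no matching file. It assumes the TSV has an id column named identifier (as in MassSpecGym) and that files are named <id>.json; both are assumptions to adjust to your data, and the helper itself is not part of the repository.

```python
import csv
import tempfile
from pathlib import Path
from typing import List


def missing_subformula_files(
    spectra_tsv: Path, subformula_dir: Path, id_column: str = "identifier"
) -> List[str]:
    """Return spectrum ids from the TSV that lack a <id>.json file in subformula_dir."""
    with open(spectra_tsv, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        ids = [row[id_column] for row in reader]
    return [sid for sid in ids if not (subformula_dir / f"{sid}.json").is_file()]


# Self-contained demo on a throwaway directory with two spectra and one JSON file.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    tsv_path = tmp_dir / "spectra.tsv"
    tsv_path.write_text("identifier\tsmiles\nspec_a\tCCO\nspec_b\tCCN\n")
    out_dir = tmp_dir / "subformulae_out"
    out_dir.mkdir()
    (out_dir / "spec_a.json").write_text("{}")
    missing = missing_subformula_files(tsv_path, out_dir)

print(missing)  # prints ['spec_b']
```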
Training
Start from the repository root, then change into flare/ to run the script (so flare and massspecgym import correctly):
cd flare
python train.py # uses FLARE_PARAMS or ../params.yaml
python train.py --param_pth /path/to/custom.yaml
train.py creates experiments/<YYYYMMDD>_<run_name>/, writes TensorBoard logs there, and saves checkpoints. df_test_path defaults to <experiment_dir>/result.pkl if unset.
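The naming scheme above can be reproduced in a few lines, which is handy when locating a run's outputs from another script. This is a sketch of the convention as described here, with hypothetical function names; it is not the code train.py itself uses.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def experiment_dir(root: Path, run_name: str, today: Optional[date] = None) -> Path:
    """Build <root>/<YYYYMMDD>_<run_name> for the given (or current) date."""
    stamp = (today or date.today()).strftime("%Y%m%d")
    return root / f"{stamp}_{run_name}"


def default_result_path(exp_dir: Path, df_test_path: Optional[str] = None) -> Path:
    """Fall back to <experiment_dir>/result.pkl when df_test_path is unset."""
    return Path(df_test_path) if df_test_path else exp_dir / "result.pkl"


# Example with a fixed date so the output is reproducible.
run_dir = experiment_dir(Path("experiments"), "flare", today=date(2025, 1, 31))
print(run_dir.as_posix())                       # prints experiments/20250131_flare
print(default_result_path(run_dir).as_posix())  # prints experiments/20250131_flare/result.pkl
```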
Testing (retrieval)
cd flare
python test.py \
--checkpoint_pth /path/to/epoch=....ckpt \
--exp_dir /path/to/experiment_dir # optional; else latest matching run_name
Useful flags: --candidates_pth, --df_test_pth, and --external_test (for candidate lists that contain no positive label). Override the params file with --param_pth or FLARE_PARAMS.
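For orientation, retrieval quality at the cutoffs listed in at_ks is typically summarized as a hit rate: the fraction of query spectra whose true molecule appears among the top k ranked candidates. The sketch below shows that computation on toy scores. The function names and the example cutoffs are illustrative assumptions, and the metrics reported by test.py come from the vendored MassSpecGym code rather than from this snippet.

```python
from typing import Dict, List, Sequence


def rank_candidates(scores: Sequence[float]) -> List[int]:
    """Candidate indices ordered from highest to lowest similarity."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def hit_rate_at_ks(
    all_scores: Sequence[Sequence[float]],
    all_labels: Sequence[Sequence[bool]],
    at_ks: Sequence[int],
) -> Dict[int, float]:
    """Fraction of queries with at least one positive candidate in the top k."""
    hits = {k: 0 for k in at_ks}
    for scores, labels in zip(all_scores, all_labels):
        ranking = rank_candidates(scores)
        for k in at_ks:
            if any(labels[i] for i in ranking[:k]):
                hits[k] += 1
    return {k: hits[k] / len(all_scores) for k in at_ks}


# Two query spectra with three candidates each; True marks the correct molecule.
example_scores = [[0.9, 0.2, 0.5], [0.1, 0.8, 0.3]]
example_labels = [[False, False, True], [False, True, False]]
rates = hit_rate_at_ks(example_scores, example_labels, at_ks=(1, 2))
print(rates)  # prints {1: 0.5, 2: 1.0}
```

With an external test set that has no positive label among the candidates, every row of labels would be all False, so only the ranking itself is informative.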
Hyperparameter search
cd flare
python tune.py --n_trials 20
Tuning uses Optuna; the study database and logs are stored under experiments/<date>_<run_name>_optuna/. The best configuration is written to best_params.yaml in that folder.
Streamlit app (peak-to-node visualization)
streamlit run app.py
The app loads architecture settings from FLARE_PARAMS (default params.yaml) and weights from FLARE_CHECKPOINT (default pretrained_models/flare.ckpt). Ensure the checkpoint matches the architecture in the YAML.
Acknowledgments
- Data: MassSpecGym
- Subformula tooling: MIST
Contact
For questions: soha.hassoun@tufts.edu