---
title: Flare
emoji: 🔥
colorFrom: green
colorTo: blue
sdk: streamlit
pinned: false
python_version: 3.11.7
---

# FLARE

**F**ine-grained **L**earning for **A**lignment of spectra–molecule **RE**presentations

### Authors

**Yan Zhou Chen, Soha Hassoun**
Department of Computer Science, Tufts University

---

## Overview

FLARE learns a joint embedding space for **MS/MS spectra** (represented as **per-peak chemical formulas** from a subformula assigner) and **molecular graphs**. The default publication model uses **FILIP-style contrastive learning** (`filipContrastive`): fine-grained similarity between spectrum tokens and graph nodes, with a temperature-scaled loss.

Use cases:

- **Retrieval**: rank a list of candidate SMILES for each query spectrum (MassSpecGym-style evaluation).
- **Interpretation**: the Streamlit app visualizes **peak-to-node** correspondence for a single spectrum–molecule pair.

---

## Model (default stack)

| Component | Setting (see `params.yaml`) |
|-----------|----------------------------|
| Spectrum input | `SpecFormula` — formula peaks from JSON in `subformula_dir_pth` |
| Formula source | `default` — MIST-compatible JSON (`load_mist_data`); optional `sirius` |
| Spectrum encoder | `Transformer_Formula` |
| Molecule encoder | `GNN` (DGL + dgllife GCN), node embeddings for FILIP |
| Training objective | `filipContrastive` — masked FILIP loss, temperature `contr_temp` |
| Output | Embeddings for cosine / FILIP similarity at test time |

Hyperparameters are split into: **run/logging**, **training loop**, **data paths**, **featurizers**, **encoder widths/depths**, and **evaluation** (`at_ks`, `myopic_mces_kwargs`). Only keys present in `params.yaml` are required; paths can be **relative to the repository root** (recommended) or absolute.
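The fine-grained similarity at the heart of the FILIP-style objective can be sketched in a few lines of NumPy. This is a hedged illustration only, not the exact `filipContrastive` implementation: padding masks are omitted and the function name is ours, not one from `flare/`.

```python
import numpy as np

def filip_similarity(peak_tokens: np.ndarray, node_embs: np.ndarray) -> float:
    """FILIP-style similarity between one spectrum and one molecule (illustrative).

    peak_tokens: (n_peaks, d) spectrum token embeddings
    node_embs:   (n_nodes, d) molecular-graph node embeddings
    """
    # L2-normalize so the dot product is a cosine similarity
    t = peak_tokens / np.linalg.norm(peak_tokens, axis=1, keepdims=True)
    g = node_embs / np.linalg.norm(node_embs, axis=1, keepdims=True)
    sim = t @ g.T                          # (n_peaks, n_nodes) cosine matrix
    peak_to_node = sim.max(axis=1).mean()  # each peak matched to its best node
    node_to_peak = sim.max(axis=0).mean()  # each node matched to its best peak
    return 0.5 * (peak_to_node + node_to_peak)
```

In the contrastive setting, such scores over a batch of spectrum–molecule pairs are divided by a temperature (`contr_temp`) before the cross-entropy-style loss; the masked loss in FLARE additionally ignores padded tokens.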
---

## Repository layout

| Path | Role |
|------|------|
| `params.yaml` | Canonical training/testing/app hyperparameters |
| `hparams.yaml` | Symlink to `params.yaml` (Hugging Face Spaces convention) |
| `flare/` | Training (`train.py`, `test.py`, `tune.py`), models, data pipeline |
| `massspecgym/` | Vendored MassSpecGym Lightning base classes and utilities |
| `app.py`, `app_utils/` | Streamlit peak–node visualization |
| `pretrained_models/` | Place public checkpoints here (e.g. `flare.ckpt`) |
| `experiments/` | Default output root for new runs (see `flare/definitions.py`) |
| `archive/` | Older scripts and features **not** part of the slim release (MAGMA, class experiments, legacy YAML, etc.); nothing was deleted |

---

## Environment variables (no hardcoded machine paths)

| Variable | Purpose |
|----------|---------|
| `FLARE_PARAMS` | Path to YAML params (default: `/params.yaml`) |
| `FLARE_CHECKPOINT` | Checkpoint for the app or manual runs |
| `FLARE_DEBUG_DATASET` | When `debug: true`, TSV path for a tiny local dataset |
| `FLARE_REPO_ROOT` | Optional; overrides the repo root for resolving relative paths in `default_param_path()` |
| `MASSSPECGYM_ROOT` | Optional extra `sys.path` root if you use an external `massspecgym` checkout |
| `FLARE_UPLOAD_CKPT`, `HF_REPO_ID`, `HF_REPO_TYPE`, `HF_TOKEN` | See `app_utils/upload_model.py` for HF uploads |

---

## Setup

```bash
git clone https://huggingface.co/spaces/HassounLab/FLARE
cd FLARE
conda create -n flare python=3.11
conda activate flare
pip install -r requirements.txt
```

Place the **MassSpecGym** (or your own) spectrum TSV, **candidate JSON**, and **subformula JSON directory** wherever convenient, then set the paths in `params.yaml` (relative paths like `data/MassSpecGym.tsv` resolve from the repo root).

---

## Data preparation

Per-spectrum subformula JSON files (one file per spectrum id, MIST-style) are required for `SpecFormula`.
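Each per-peak formula ultimately has to become a token the spectrum Transformer can consume. As a rough illustration, a formula string can be turned into an element-count vector over a fixed vocabulary; this sketch is an assumption for exposition, not the actual featurizer in `flare/`, and the element vocabulary here is a placeholder.

```python
import re
import numpy as np

# Assumed (placeholder) element vocabulary for the count vector
ELEMENTS = ["C", "H", "N", "O", "P", "S"]

def formula_to_vector(formula: str) -> np.ndarray:
    """Map a subformula string like 'C6H5O' to a fixed-length count vector."""
    counts = {el: 0 for el in ELEMENTS}
    # Match element symbols (one uppercase letter, optional lowercase) and counts
    for el, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if el in counts:
            counts[el] += int(num) if num else 1
    return np.array([counts[el] for el in ELEMENTS], dtype=float)
```

A sequence of such vectors (one per assigned peak) is the kind of input a `Transformer_Formula`-style encoder could embed into per-peak tokens.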
Generate them with the bundled assigner (adapted from MIST):

```bash
cd flare/subformula_assign
export SPEC_FILES=/path/to/spectra.tsv
export OUTPUT_DIR=/path/to/subformulae_out
export LABELS_FILE=/path/to/spectra.tsv   # often the same as SPEC_FILES
export MAX_FORMULAE=60
bash run.sh
```

Defaults in `run.sh` point at `data/sample/` under the repo if you add a small sample there.

---

## Training

From the repository root (so `flare` and `massspecgym` import correctly):

```bash
cd flare
python train.py                            # uses FLARE_PARAMS or ../params.yaml
python train.py --param_pth /path/to/custom.yaml
```

`train.py` creates `experiments/_/`, writes TensorBoard logs there, and saves checkpoints. `df_test_path` defaults to `/result.pkl` if unset.

---

## Testing (retrieval)

```bash
cd flare
python test.py \
    --checkpoint_pth /path/to/epoch=....ckpt \
    --exp_dir /path/to/experiment_dir   # optional; else latest matching run_name
```

Useful flags: `--candidates_pth`, `--df_test_pth`, `--external_test` (no positive label in the candidate list). Override the params file with `--param_pth` or `FLARE_PARAMS`.

---

## Hyperparameter search

```bash
cd flare
python tune.py --n_trials 20
```

Uses Optuna; the study database and logs live under `experiments/__optuna/`, and the best YAML is written to `best_params.yaml` in that folder.

---

## Streamlit app (peak-to-node visualization)

```bash
streamlit run app.py
```

The app loads architecture settings from `FLARE_PARAMS` (default `params.yaml`) and weights from `FLARE_CHECKPOINT` (default `pretrained_models/flare.ckpt`). Ensure the checkpoint matches the architecture in the YAML.

---

## Acknowledgments

- **Data**: [MassSpecGym](https://github.com/pluskal-lab/MassSpecGym)
- **Subformula tooling**: [MIST](https://github.com/samgoldman97/mist/tree/main_v2)

---

## Contact

For questions: soha.hassoun@tufts.edu