---
title: Flare
emoji: 🔥
colorFrom: green
colorTo: blue
sdk: streamlit
pinned: false
python_version: 3.11.7
---

# FLARE

**F**ine-grained **L**earning for **A**lignment of spectra–molecule **RE**presentations

### Authors

**Yan Zhou Chen, Soha Hassoun**
Department of Computer Science, Tufts University

---

## Overview

FLARE learns a joint embedding space for **MS/MS spectra** (represented as **per-peak chemical formulas** from a subformula assigner) and **molecular graphs**. The default publication model uses **FILIP-style contrastive learning** (`filipContrastive`): fine-grained similarity between spectrum tokens and graph nodes, with a temperature-scaled loss.

Use cases:

- **Retrieval**: rank a list of candidate SMILES for each query spectrum (MassSpecGym-style evaluation).
- **Interpretation**: the Streamlit app visualizes **peak-to-node** correspondence for a single spectrum–molecule pair.

---

## Model (default stack)

| Component | Setting (see `params.yaml`) |
|-----------|----------------------------|
| Spectrum input | `SpecFormula` — formula peaks from JSON in `subformula_dir_pth` |
| Formula source | `default` — MIST-compatible JSON (`load_mist_data`); optional `sirius` |
| Spectrum encoder | `Transformer_Formula` |
| Molecule encoder | `GNN` (DGL + dgllife GCN), node embeddings for FILIP |
| Training objective | `filipContrastive` — masked FILIP loss, temperature `contr_temp` |
| Output | Embeddings for cosine / FILIP similarity at test time |

Hyperparameters are split into: **run/logging**, **training loop**, **data paths**, **featurizers**, **encoder widths/depths**, and **evaluation** (`at_ks`, `myopic_mces_kwargs`). Only keys present in `params.yaml` are required; paths can be **relative to the repository root** (recommended) or absolute.
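The fine-grained similarity at the heart of the FILIP-style objective can be sketched in a few lines of NumPy. This is a hedged illustration only, not the exact `filipContrastive` implementation: padding masks are omitted and the function name is ours, not one from `flare/`.

```python
import numpy as np

def filip_similarity(peak_tokens: np.ndarray, node_embs: np.ndarray) -> float:
    """FILIP-style similarity between one spectrum and one molecule (illustrative).

    peak_tokens: (n_peaks, d) spectrum token embeddings
    node_embs:   (n_nodes, d) molecular-graph node embeddings
    """
    # L2-normalize so the dot product is a cosine similarity
    t = peak_tokens / np.linalg.norm(peak_tokens, axis=1, keepdims=True)
    g = node_embs / np.linalg.norm(node_embs, axis=1, keepdims=True)
    sim = t @ g.T                          # (n_peaks, n_nodes) cosine matrix
    peak_to_node = sim.max(axis=1).mean()  # each peak matched to its best node
    node_to_peak = sim.max(axis=0).mean()  # each node matched to its best peak
    return 0.5 * (peak_to_node + node_to_peak)
```

In the contrastive setting, such scores over a batch of spectrum–molecule pairs are divided by a temperature (`contr_temp`) before the cross-entropy-style loss; the masked loss in FLARE additionally ignores padded tokens.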
---

## Repository layout

| Path | Role |
|------|------|
| `params.yaml` | Canonical training/testing/app hyperparameters |
| `hparams.yaml` | Symlink to `params.yaml` (Hugging Face Spaces convention) |
| `flare/` | Training (`train.py`, `test.py`, `tune.py`), models, data pipeline |
| `massspecgym/` | Vendored MassSpecGym Lightning base classes and utilities |
| `app.py`, `app_utils/` | Streamlit peak–node visualization |
| `pretrained_models/` | Place public checkpoints here (e.g. `flare.ckpt`) |
| `experiments/` | Default output root for new runs (see `flare/definitions.py`) |
| `archive/` | Older scripts and features **not** part of the slim release (MAGMA, class experiments, legacy YAML, etc.); nothing was deleted |

---

## Environment variables (no hardcoded machine paths)

| Variable | Purpose |
|----------|---------|
| `FLARE_PARAMS` | Path to YAML params (default: `/params.yaml`) |
| `FLARE_CHECKPOINT` | Checkpoint for the app or manual runs |
| `FLARE_DEBUG_DATASET` | When `debug: true`, TSV path for a tiny local dataset |
| `FLARE_REPO_ROOT` | Optional; overrides the repo root for resolving relative paths in `default_param_path()` |
| `MASSSPECGYM_ROOT` | Optional extra `sys.path` root if you use an external `massspecgym` checkout |
| `FLARE_UPLOAD_CKPT`, `HF_REPO_ID`, `HF_REPO_TYPE`, `HF_TOKEN` | See `app_utils/upload_model.py` for HF uploads |

---

## Setup

```bash
git clone https://huggingface.co/spaces/HassounLab/FLARE
cd FLARE
conda create -n flare python=3.11
conda activate flare
pip install -r requirements.txt
```

Place the **MassSpecGym** (or your own) spectrum TSV, **candidate JSON**, and **subformula JSON directory** wherever convenient, then set the paths in `params.yaml` (relative paths like `data/MassSpecGym.tsv` resolve from the repo root).

---

## Data preparation

Per-spectrum subformula JSON files (one file per spectrum id, MIST-style) are required for `SpecFormula`.
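Each per-peak formula ultimately has to become a token the spectrum Transformer can consume. As a rough illustration, a formula string can be turned into an element-count vector over a fixed vocabulary; this sketch is an assumption for exposition, not the actual featurizer in `flare/`, and the element vocabulary here is a placeholder.

```python
import re
import numpy as np

# Assumed (placeholder) element vocabulary for the count vector
ELEMENTS = ["C", "H", "N", "O", "P", "S"]

def formula_to_vector(formula: str) -> np.ndarray:
    """Map a subformula string like 'C6H5O' to a fixed-length count vector."""
    counts = {el: 0 for el in ELEMENTS}
    # Match element symbols (one uppercase letter, optional lowercase) and counts
    for el, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if el in counts:
            counts[el] += int(num) if num else 1
    return np.array([counts[el] for el in ELEMENTS], dtype=float)
```

A sequence of such vectors (one per assigned peak) is the kind of input a `Transformer_Formula`-style encoder could embed into per-peak tokens.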
Generate them with the bundled assigner (adapted from MIST):

```bash
cd flare/subformula_assign
export SPEC_FILES=/path/to/spectra.tsv
export OUTPUT_DIR=/path/to/subformulae_out
export LABELS_FILE=/path/to/spectra.tsv   # often the same as SPEC_FILES
export MAX_FORMULAE=60
bash run.sh
```

Defaults in `run.sh` point at `data/sample/` under the repo if you add a small sample there.

---

## Training

From the repository root (so `flare` and `massspecgym` import correctly):

```bash
cd flare
python train.py                            # uses FLARE_PARAMS or ../params.yaml
python train.py --param_pth /path/to/custom.yaml
```

`train.py` creates `experiments/_/`, writes TensorBoard logs there, and saves checkpoints. `df_test_path` defaults to `/result.pkl` if unset.

---

## Testing (retrieval)

```bash
cd flare
python test.py \
    --checkpoint_pth /path/to/epoch=....ckpt \
    --exp_dir /path/to/experiment_dir   # optional; else latest matching run_name
```

Useful flags: `--candidates_pth`, `--df_test_pth`, `--external_test` (no positive label in the candidate list). Override the params file with `--param_pth` or `FLARE_PARAMS`.

---

## Hyperparameter search

```bash
cd flare
python tune.py --n_trials 20
```

Uses Optuna; the study database and logs live under `experiments/__optuna/`, and the best YAML is written to `best_params.yaml` in that folder.

---

## Streamlit app (peak-to-node visualization)

```bash
streamlit run app.py
```

The app loads architecture settings from `FLARE_PARAMS` (default `params.yaml`) and weights from `FLARE_CHECKPOINT` (default `pretrained_models/flare.ckpt`). Ensure the checkpoint matches the architecture in the YAML.

---

## Acknowledgments

- **Data**: [MassSpecGym](https://github.com/pluskal-lab/MassSpecGym)
- **Subformula tooling**: [MIST](https://github.com/samgoldman97/mist/tree/main_v2)

---

## Contact

For questions: soha.hassoun@tufts.edu