diff --git "a/mineru_outputs/6027_MolSpectra_Pre_training_3/auto/dag.json" "b/mineru_outputs/6027_MolSpectra_Pre_training_3/auto/dag.json" new file mode 100644--- /dev/null +++ "b/mineru_outputs/6027_MolSpectra_Pre_training_3/auto/dag.json" @@ -0,0 +1,894 @@ +{ + "nodes": [ + { + "name": "MOLSPECTRA: PRE-TRAINING 3D MOLECULAR REP-RESENTATION WITH MULTI-MODAL ENERGY SPECTRA", + "content": "Liang Wang, Shaozhen Liu, Yu Rong, Deli Zhao, Qiang Liu, Shu Wu, Liang Wang......New Laboratory of Pattern Recognition (NLPR), State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA), School of Artificial Intelligence, University of Chinese Academy of Sciences, DAMO Academy, Alibaba Group", + "github": "", + "edge": [ + "1 INTRODUCTION", + "2 PRELIMINARIES.md", + "3 THE PROPOSED MOLSPECTRA METHOD", + "4 EXPERIMENTS", + "6 CONCLUSION.md" + ], + "level": 0, + "visual_node": [] + }, + { + "name": "1 INTRODUCTION", + "content": "# 1 INTRODUCTION Learning 3D molecular representations from geometric conformations offers a promising approach for understanding molecular geometry and predicting quantum properties and interactions, which is significant in drug discovery and materials science (Musaelian et al., 2023; Batatia et al., 2022; Liao & Smidt, 2023; Wang et al., 2023b; Du et al., 2023b). Given the scarcity of molecular property labels, self-supervised representation pre-training has been proposed and utilized to provide generalizable representations (Hu et al., 2020; Rong et al., 2020; Ma et al., 2024). In contrast to contrastive learning (Wang et al., 2022; Kim et al., 2022) and masked modeling (Hou et al., 2022; Liu et al., 2023c; Wang et al., 2024b) on 2D molecular graphs and molecular languages (e.g., SMILES), the design of pre-training strategies on 3D molecular geometries is more closely aligned with physical principles. 
Previous studies (Zaidi et al., 2023; Jiao et al., 2023) have guided representation learning through denoising processes on 3D molecular geometries, theoretically demonstrating that denoising 3D geometries is equivalent to learning molecular force fields, specifically the negative gradient of molecular potential energy with respect to position. Essentially, these studies reveal that establishing the relationship between 3D geometries and the energy states of molecular systems is an effective pathway to learn 3D molecular representations. However, existing methods are limited to the continuous description (i.e., the potential energy function) of the molecular energy states within the classical mechanics, overlooking the quantized (discrete) energy level structures from the quantum mechanical perspective. From the quantum perspective, molecular systems exhibit quantized energy level structures, meaning that energy states can only assume specific discrete values. Specifically, different types of molecular motion, such as electronic, vibrational, and rotational motion, correspond to different energy level structures. Knowledge of these energy levels is crucial in molecular physics and quantum chemistry, as they determine the spectroscopic characteristics, chemical reactivity, and many other important molecular properties. Fortunately, experimental measurements of molecular energy spectra can reflect these structures. Meanwhile, there are many molecular spectra data obtained through experimental measurements or simulations (Zou et al., 2023; Alberts et al., 2024). Therefore, incorporating the knowledge of energy levels into molecular representation learning is expected to facilitate the development of more informative molecular representations. ![](images/d1786526fd710a309dcca5721e53bf82d19ca2622ac3314d3bfc46284f668f1d.jpg) Figure 1: The conceptual view of MolSpectra, which leverages both molecular conformation and spectra for pre-training. 
Prior works only model classical mechanics by denoising on conformations. In this paper, we propose MolSpectra, a framework that incorporates molecular spectra into the pre-training of 3D molecular representations, thereby infusing the knowledge of quantized energy level structures into the representations, as shown in Figure 1. In MolSpectra, we introduce a multispectrum encoder, SpecFormer, to capture both intra-spectrum and inter-spectrum peak correlations by training with a masked patches reconstruction (MPR) objective. Additionally, we employ a contrastive objective to distill the spectral features and their inherent knowledge into the learning of 3D representations. After pre-training, the resulting 3D encoder can be fine-tuned for downstream tasks, providing expressive 3D molecular representations without the need for associated spectral data. Extensive experiments over different downstream molecular property prediction benchmarks show the superiority of MolSpectra. In summary, our contributions are as follows: • We introduce quantized energy level structures and molecular spectra into 3D molecular representation pre-training for the first time, surpassing previous work that relied solely on physical knowledge within the scope of classical mechanics. • We propose SpecFormer as an expressive multi-spectrum encoder, along with the masked patches reconstruction objective for spectral representation learning. • We propose a contrastive objective to align molecular representations in the 3D modality and spectral modalities, enabling the pre-trained 3D encoder to infer molecular spectral features in downstream tasks without relying on spectral data. 
• Experiments across different downstream benchmarks demonstrate that our method effectively enhances the expressiveness of the pre-trained 3D molecular representations.", + "edge": [ + "Background and Current Methods", + "Limitations and Motivation", + "Proposed Method: MolSpectra", + "Contributions" + ], + "level": 1, + "visual_node": [ + "![](images/d1786526fd710a309dcca5721e53bf82d19ca2622ac3314d3bfc46284f668f1d.jpg)" + ] + }, + { + "name": "Background and Current Methods", + "content": "Learning 3D molecular representations from geometric conformations offers a promising approach for understanding molecular geometry and predicting quantum properties and interactions, which is significant in drug discovery and materials science (Musaelian et al., 2023; Batatia et al., 2022; Liao & Smidt, 2023; Wang et al., 2023b; Du et al., 2023b). Given the scarcity of molecular property labels, self-supervised representation pre-training has been proposed and utilized to provide generalizable representations (Hu et al., 2020; Rong et al., 2020; Ma et al., 2024). In contrast to contrastive learning (Wang et al., 2022; Kim et al., 2022) and masked modeling (Hou et al., 2022; Liu et al., 2023c; Wang et al., 2024b) on 2D molecular graphs and molecular languages (e.g., SMILES), the design of pre-training strategies on 3D molecular geometries is more closely aligned with physical principles. Previous studies (Zaidi et al., 2023; Jiao et al., 2023) have guided representation learning through denoising processes on 3D molecular geometries, theoretically demonstrating that denoising 3D geometries is equivalent to learning molecular force fields, specifically the negative gradient of molecular potential energy with respect to position. 
Essentially, these studies reveal that establishing the relationship between 3D geometries and the energy states of molecular systems is an effective pathway to learn 3D molecular representations.", + "edge": [ + "Importance of 3D Molecular Representations", + "Physical Principles in 3D Pre-training" + ], + "level": 2, + "visual_node": [] + }, + { + "name": "Limitations and Motivation", + "content": "However, existing methods are limited to the continuous description (i.e., the potential energy function) of the molecular energy states within the classical mechanics, overlooking the quantized (discrete) energy level structures from the quantum mechanical perspective. From the quantum perspective, molecular systems exhibit quantized energy level structures, meaning that energy states can only assume specific discrete values. Specifically, different types of molecular motion, such as electronic, vibrational, and rotational motion, correspond to different energy level structures. Knowledge of these energy levels is crucial in molecular physics and quantum chemistry, as they determine the spectroscopic characteristics, chemical reactivity, and many other important molecular properties. Fortunately, experimental measurements of molecular energy spectra can reflect these structures. Meanwhile, there are many molecular spectra data obtained through experimental measurements or simulations (Zou et al., 2023; Alberts et al., 2024). 
Therefore, incorporating the knowledge of energy levels into molecular representation learning is expected to facilitate the development of more informative molecular representations.", + "edge": [ + "Limitations of Classical Mechanics", + "Quantum Perspective and Spectra Data" + ], + "level": 2, + "visual_node": [] + }, + { + "name": "Proposed Method: MolSpectra", + "content": "![](images/d1786526fd710a309dcca5721e53bf82d19ca2622ac3314d3bfc46284f668f1d.jpg) Figure 1: The conceptual view of MolSpectra, which leverages both molecular conformation and spectra for pre-training. Prior works only model classical mechanics by denoising on conformations. In this paper, we propose MolSpectra, a framework that incorporates molecular spectra into the pre-training of 3D molecular representations, thereby infusing the knowledge of quantized energy level structures into the representations, as shown in Figure 1. In MolSpectra, we introduce a multispectrum encoder, SpecFormer, to capture both intra-spectrum and inter-spectrum peak correlations by training with a masked patches reconstruction (MPR) objective. Additionally, we employ a contrastive objective to distill the spectral features and their inherent knowledge into the learning of 3D representations. After pre-training, the resulting 3D encoder can be fine-tuned for downstream tasks, providing expressive 3D molecular representations without the need for associated spectral data. 
Extensive experiments over different downstream molecular property prediction benchmarks show the superiority of MolSpectra.", + "edge": [ + "Conceptual View of MolSpectra", + "MolSpectra Framework Details" + ], + "level": 2, + "visual_node": [ + "![](images/d1786526fd710a309dcca5721e53bf82d19ca2622ac3314d3bfc46284f668f1d.jpg)" + ] + }, + { + "name": "Contributions", + "content": "In summary, our contributions are as follows: • We introduce quantized energy level structures and molecular spectra into 3D molecular representation pre-training for the first time, surpassing previous work that relied solely on physical knowledge within the scope of classical mechanics. • We propose SpecFormer as an expressive multi-spectrum encoder, along with the masked patches reconstruction objective for spectral representation learning. • We propose a contrastive objective to align molecular representations in the 3D modality and spectral modalities, enabling the pre-trained 3D encoder to infer molecular spectral features in downstream tasks without relying on spectral data. • Experiments across different downstream benchmarks demonstrate that our method effectively enhances the expressiveness of the pre-trained 3D molecular representations.", + "edge": [ + "Summary Statement", + "List of Contributions" + ], + "level": 2, + "visual_node": [] + }, + { + "name": "Importance of 3D Molecular Representations", + "content": "Learning 3D molecular representations from geometric conformations offers a promising approach for understanding molecular geometry and predicting quantum properties and interactions, which is significant in drug discovery and materials science (Musaelian et al., 2023; Batatia et al., 2022; Liao & Smidt, 2023; Wang et al., 2023b; Du et al., 2023b). 
Given the scarcity of molecular property labels, self-supervised representation pre-training has been proposed and utilized to provide generalizable representations (Hu et al., 2020; Rong et al., 2020; Ma et al., 2024).", + "edge": [], + "level": 3, + "visual_node": [] + }, + { + "name": "Physical Principles in 3D Pre-training", + "content": "In contrast to contrastive learning (Wang et al., 2022; Kim et al., 2022) and masked modeling (Hou et al., 2022; Liu et al., 2023c; Wang et al., 2024b) on 2D molecular graphs and molecular languages (e.g., SMILES), the design of pre-training strategies on 3D molecular geometries is more closely aligned with physical principles. Previous studies (Zaidi et al., 2023; Jiao et al., 2023) have guided representation learning through denoising processes on 3D molecular geometries, theoretically demonstrating that denoising 3D geometries is equivalent to learning molecular force fields, specifically the negative gradient of molecular potential energy with respect to position. Essentially, these studies reveal that establishing the relationship between 3D geometries and the energy states of molecular systems is an effective pathway to learn 3D molecular representations.", + "edge": [], + "level": 3, + "visual_node": [] + }, + { + "name": "Limitations of Classical Mechanics", + "content": "However, existing methods are limited to the continuous description (i.e., the potential energy function) of the molecular energy states within the classical mechanics, overlooking the quantized (discrete) energy level structures from the quantum mechanical perspective. 
From the quantum perspective, molecular systems exhibit quantized energy level structures, meaning that energy states can only assume specific discrete values.", + "edge": [], + "level": 3, + "visual_node": [] + }, + { + "name": "Quantum Perspective and Spectra Data", + "content": "Specifically, different types of molecular motion, such as electronic, vibrational, and rotational motion, correspond to different energy level structures. Knowledge of these energy levels is crucial in molecular physics and quantum chemistry, as they determine the spectroscopic characteristics, chemical reactivity, and many other important molecular properties. Fortunately, experimental measurements of molecular energy spectra can reflect these structures. Meanwhile, there are many molecular spectra data obtained through experimental measurements or simulations (Zou et al., 2023; Alberts et al., 2024). Therefore, incorporating the knowledge of energy levels into molecular representation learning is expected to facilitate the development of more informative molecular representations.", + "edge": [], + "level": 3, + "visual_node": [] + }, + { + "name": "Conceptual View of MolSpectra", + "content": "![](images/d1786526fd710a309dcca5721e53bf82d19ca2622ac3314d3bfc46284f668f1d.jpg) Figure 1: The conceptual view of MolSpectra, which leverages both molecular conformation and spectra for pre-training. Prior works only model classical mechanics by denoising on conformations.", + "edge": [], + "level": 3, + "visual_node": [ + "![](images/d1786526fd710a309dcca5721e53bf82d19ca2622ac3314d3bfc46284f668f1d.jpg)" + ] + }, + { + "name": "MolSpectra Framework Details", + "content": "In this paper, we propose MolSpectra, a framework that incorporates molecular spectra into the pre-training of 3D molecular representations, thereby infusing the knowledge of quantized energy level structures into the representations, as shown in Figure 1. 
In MolSpectra, we introduce a multispectrum encoder, SpecFormer, to capture both intra-spectrum and inter-spectrum peak correlations by training with a masked patches reconstruction (MPR) objective. Additionally, we employ a contrastive objective to distill the spectral features and their inherent knowledge into the learning of 3D representations. After pre-training, the resulting 3D encoder can be fine-tuned for downstream tasks, providing expressive 3D molecular representations without the need for associated spectral data. Extensive experiments over different downstream molecular property prediction benchmarks show the superiority of MolSpectra.", + "edge": [ + "SpecFormer and MPR Objective", + "Contrastive Objective and Fine-tuning" + ], + "level": 3, + "visual_node": [] + }, + { + "name": "Summary Statement", + "content": "In summary, our contributions are as follows:", + "edge": [], + "level": 3, + "visual_node": [] + }, + { + "name": "List of Contributions", + "content": "• We introduce quantized energy level structures and molecular spectra into 3D molecular representation pre-training for the first time, surpassing previous work that relied solely on physical knowledge within the scope of classical mechanics. • We propose SpecFormer as an expressive multi-spectrum encoder, along with the masked patches reconstruction objective for spectral representation learning. • We propose a contrastive objective to align molecular representations in the 3D modality and spectral modalities, enabling the pre-trained 3D encoder to infer molecular spectral features in downstream tasks without relying on spectral data. 
• Experiments across different downstream benchmarks demonstrate that our method effectively enhances the expressiveness of the pre-trained 3D molecular representations.", + "edge": [ + "Contributions 1 and 2", + "Contributions 3 and 4" + ], + "level": 3, + "visual_node": [] + }, + { + "name": "SpecFormer and MPR Objective", + "content": "In this paper, we propose MolSpectra, a framework that incorporates molecular spectra into the pre-training of 3D molecular representations, thereby infusing the knowledge of quantized energy level structures into the representations, as shown in Figure 1. In MolSpectra, we introduce a multispectrum encoder, SpecFormer, to capture both intra-spectrum and inter-spectrum peak correlations by training with a masked patches reconstruction (MPR) objective.", + "edge": [], + "level": 4, + "visual_node": [] + }, + { + "name": "Contrastive Objective and Fine-tuning", + "content": "Additionally, we employ a contrastive objective to distill the spectral features and their inherent knowledge into the learning of 3D representations. After pre-training, the resulting 3D encoder can be fine-tuned for downstream tasks, providing expressive 3D molecular representations without the need for associated spectral data. Extensive experiments over different downstream molecular property prediction benchmarks show the superiority of MolSpectra.", + "edge": [], + "level": 4, + "visual_node": [] + }, + { + "name": "Contributions 1 and 2", + "content": "• We introduce quantized energy level structures and molecular spectra into 3D molecular representation pre-training for the first time, surpassing previous work that relied solely on physical knowledge within the scope of classical mechanics. 
• We propose SpecFormer as an expressive multi-spectrum encoder, along with the masked patches reconstruction objective for spectral representation learning.", + "edge": [], + "level": 4, + "visual_node": [] + }, + { + "name": "Contributions 3 and 4", + "content": "• We propose a contrastive objective to align molecular representations in the 3D modality and spectral modalities, enabling the pre-trained 3D encoder to infer molecular spectral features in downstream tasks without relying on spectral data. • Experiments across different downstream benchmarks demonstrate that our method effectively enhances the expressiveness of the pre-trained 3D molecular representations.", + "edge": [], + "level": 4, + "visual_node": [] + }, + { + "name": "2 PRELIMINARIES.md", + "content": "# 2 PRELIMINARIES # 2.1 NOTATIONS Consider a molecule characterized by its 3D structure and spectra, represented as M = (a, x, S). Here, a ∈ {1, 2, . . . , 118}N specifies the atomic numbers, indicating the types of atoms within the molecule. The vector x ∈ R3N describes the conformation of the molecule, while S represents its spectra. The parameter N denotes the number of atoms in the molecule. Note that the atoms are arranged in the same order in both a and x, ensuring consistency between the atomic numbers and their corresponding spatial coordinates. S = (s1, . . . , s|S|) represents the set of spectra for a molecule, where |S| denotes the number of spectrum types considered. In our study, we focus on three types, so |S| = 3. The first spectrum, s1 ∈ R601, is the UV-Vis spectrum, which spans from 1.5 to 13.5 eV with 601 data points at intervals of 0.02 eV. The second spectrum, s2 ∈ R3501, is the IR spectrum, covering a range from 500 to 4000 cm−1 with 3501 data points at intervals of 1 cm−1. The third spectrum, s3 ∈ R3501, is the Raman spectrum, with the same range and intervals as the IR spectrum. 
Together, these spectra provide a comprehensive description of the molecular characteristics across different spectral modalities. # 2.2 PRE-TRAINING 3D MOLECULAR REPRESENTATION VIA DENOISING Denoising has emerged as a prominent pre-training objective in 3D molecular representation learning, excelling in various downstream tasks. This method involves training models to predict and remove noise introduced deliberately into molecular structures. This approach is physically interpretable due to its proven equivalence to learning the molecular force field. Equivalence between denoising and learning molecular force fields. The equivalence between coordinate denoising and force field learning is established by Zaidi et al. (2023). For a given molecule M, perturb its equilibrium structure x0 according to the distribution p(x|x0), where x is the noisy conformation. Assuming the molecular distribution adheres to the energy-based Boltzmann distribution with respect to the energy function E(·), then ![](images/b1279274391a8ee385cd510483726424c1b7bb542dd3dfcaa664dd7784591ce0.jpg) where GNNθ(x) denotes a graph neural network parameterized by θ, which processes the conformation x to produce node-level predictions. The notation ≃ signifies the equivalence of different objectives. The proof of this equivalence is provided in the Appendix A. In prior research, the energy function E(·) has been defined in several forms. Below are three representative studies. Energy function I: mixture of isotropic Gaussians. In Coord (Zaidi et al., 2023), the energy function is approximated using a mixture of isotropic Gaussians centered at the known equilibrium structures to replace the Boltzmann distribution, since these structures are local maxima of the Boltzman distribution. 
Leveraging the equivalence between the score-matching objective and denoising autoencoders (Vincent, 2011), the following denoising-based energy function ECoord(·) is derived: ![](images/7ad84d22cd3c911b24b3de89b1077c3b9f08c6b762840ecd38160beb882f3a16.jpg) Note that this objective is derived under the assumption of isotropic Gaussian noise, i.e., p(x|x0) ∼ N (x0, τ 2c I3N ), where I3N represents the identity matrix of size 3N , and the subscript c indicates the coordinate denoising approach. Energy function II: mixture of anisotropic Gaussians. Considering rigid and flexible components in molecular structures, isotropic Gaussian can lead to significant approximation errors. To address the anisotropic distribution, Frad (Feng et al., 2023) introduces hybrid noise on dihedral angles of rotatable bonds and atomic coordinates, incorporating fractional denoising of the coordinate noise. The equilibrium structure x0 is initially perturbed by dihedral angle noise p(ψa|ψ0) ∼ N (ψ0, σ2f Im), followed by coordinate noise p(x|xa) ∼ N (xa, τ 2f I3N ). Here, ψa, ψ0 ∈ [0, 2π)m represent to the dihedral angles of rotatable bonds in structures xa and x0, respectively, with m denoting the number of rotatable bonds. The subscript f indicates the fractional denoising approach. Subsequently, the energy function is induced: ![](images/b5f78fc1f86d0c7c6f996497a567b0462a6e0529447787d81ffe3fdbb71b3ccb.jpg) where Στf ,σf = τ 2f I3N + σ2f CC⊤, and C ∈ R3N×m is a matrix used to linearly transform the dihedral angle noise into coordinate change, expressed as ∆x ≈ C∆ψ. Energy function III: classical potential energy theory. SliDe (Ni et al., 2024) derives energy function from classical molecular potential energy theory (Alavi, 2020; Zhou & Liu, 2022). In this ![](images/1ee1e5bd4e35d7acf3b80e7404b113daf11b6925490369aaa637e60964fb744d.jpg) Figure 2: Overview of the MolSpectra pre-training framework. 
Our pre-training framework comprises three sub-objectives: the denoising objective and the MPR objective, which respectively guide the representation learning of the 3D and spectral modalities, and the contrastive objective, which aligns the representations of both modalities. form, the total intramolecular potential energy is mainly attributed to three types of interactions: bond stretching, bond angle bending, and bond torsion. The following energy function is derived: ![](images/bdae202d54ee161317731819baf6967d189260b12ff333dc514a8cc065c47f97.jpg) where r ∈ (R≥0)m1 , θ ∈ [0, 2π)m2 , ϕ ∈ [0, 2π)m3 represent vectors of the bond lengths, bond angles, and bond torsion angles of the molecule, respectively. r0, θ0, ϕ0 correspond to the respective equilibrium values. The parameter vectors kB, kA, kT determine the interaction strength.", + "edge": [ + "2.1 Notations", + "2.2 Pre-training 3D Molecular Representation via Denoising" + ], + "level": 1, + "visual_node": [ + "![](images/b1279274391a8ee385cd510483726424c1b7bb542dd3dfcaa664dd7784591ce0.jpg)", + "![](images/7ad84d22cd3c911b24b3de89b1077c3b9f08c6b762840ecd38160beb882f3a16.jpg)", + "![](images/b5f78fc1f86d0c7c6f996497a567b0462a6e0529447787d81ffe3fdbb71b3ccb.jpg)", + "![](images/1ee1e5bd4e35d7acf3b80e7404b113daf11b6925490369aaa637e60964fb744d.jpg)", + "![](images/bdae202d54ee161317731819baf6967d189260b12ff333dc514a8cc065c47f97.jpg)" + ] + }, + { + "name": "2.1 Notations", + "content": "# 2.1 NOTATIONS Consider a molecule characterized by its 3D structure and spectra, represented as M = (a, x, S). Here, a ∈ {1, 2, . . . , 118}N specifies the atomic numbers, indicating the types of atoms within the molecule. The vector x ∈ R3N describes the conformation of the molecule, while S represents its spectra. The parameter N denotes the number of atoms in the molecule. 
Note that the atoms are arranged in the same order in both a and x, ensuring consistency between the atomic numbers and their corresponding spatial coordinates. S = (s1, . . . , s|S|) represents the set of spectra for a molecule, where |S| denotes the number of spectrum types considered. In our study, we focus on three types, so |S| = 3. The first spectrum, s1 ∈ R601, is the UV-Vis spectrum, which spans from 1.5 to 13.5 eV with 601 data points at intervals of 0.02 eV. The second spectrum, s2 ∈ R3501, is the IR spectrum, covering a range from 500 to 4000 cm−1 with 3501 data points at intervals of 1 cm−1. The third spectrum, s3 ∈ R3501, is the Raman spectrum, with the same range and intervals as the IR spectrum. Together, these spectra provide a comprehensive description of the molecular characteristics across different spectral modalities.", + "edge": [ + "Molecule Structure Definition", + "Spectra Definition" + ], + "level": 2, + "visual_node": [] + }, + { + "name": "2.2 Pre-training 3D Molecular Representation via Denoising", + "content": "# 2.2 PRE-TRAINING 3D MOLECULAR REPRESENTATION VIA DENOISING Denoising has emerged as a prominent pre-training objective in 3D molecular representation learning, excelling in various downstream tasks. This method involves training models to predict and remove noise introduced deliberately into molecular structures. This approach is physically interpretable due to its proven equivalence to learning the molecular force field. Equivalence between denoising and learning molecular force fields. The equivalence between coordinate denoising and force field learning is established by Zaidi et al. (2023). For a given molecule M, perturb its equilibrium structure x0 according to the distribution p(x|x0), where x is the noisy conformation. 
Assuming the molecular distribution adheres to the energy-based Boltzmann distribution with respect to the energy function E(·), then ![](images/b1279274391a8ee385cd510483726424c1b7bb542dd3dfcaa664dd7784591ce0.jpg) where GNNθ(x) denotes a graph neural network parameterized by θ, which processes the conformation x to produce node-level predictions. The notation ≃ signifies the equivalence of different objectives. The proof of this equivalence is provided in the Appendix A. In prior research, the energy function E(·) has been defined in several forms. Below are three representative studies. Energy function I: mixture of isotropic Gaussians. In Coord (Zaidi et al., 2023), the energy function is approximated using a mixture of isotropic Gaussians centered at the known equilibrium structures to replace the Boltzmann distribution, since these structures are local maxima of the Boltzman distribution. Leveraging the equivalence between the score-matching objective and denoising autoencoders (Vincent, 2011), the following denoising-based energy function ECoord(·) is derived: ![](images/7ad84d22cd3c911b24b3de89b1077c3b9f08c6b762840ecd38160beb882f3a16.jpg) Note that this objective is derived under the assumption of isotropic Gaussian noise, i.e., p(x|x0) ∼ N (x0, τ 2c I3N ), where I3N represents the identity matrix of size 3N , and the subscript c indicates the coordinate denoising approach. Energy function II: mixture of anisotropic Gaussians. Considering rigid and flexible components in molecular structures, isotropic Gaussian can lead to significant approximation errors. To address the anisotropic distribution, Frad (Feng et al., 2023) introduces hybrid noise on dihedral angles of rotatable bonds and atomic coordinates, incorporating fractional denoising of the coordinate noise. The equilibrium structure x0 is initially perturbed by dihedral angle noise p(ψa|ψ0) ∼ N (ψ0, σ2f Im), followed by coordinate noise p(x|xa) ∼ N (xa, τ 2f I3N ). 
Here, ψa, ψ0 ∈ [0, 2π)m represent to the dihedral angles of rotatable bonds in structures xa and x0, respectively, with m denoting the number of rotatable bonds. The subscript f indicates the fractional denoising approach. Subsequently, the energy function is induced: ![](images/b5f78fc1f86d0c7c6f996497a567b0462a6e0529447787d81ffe3fdbb71b3ccb.jpg) where Στf ,σf = τ 2f I3N + σ2f CC⊤, and C ∈ R3N×m is a matrix used to linearly transform the dihedral angle noise into coordinate change, expressed as ∆x ≈ C∆ψ. Energy function III: classical potential energy theory. SliDe (Ni et al., 2024) derives energy function from classical molecular potential energy theory (Alavi, 2020; Zhou & Liu, 2022). In this ![](images/1ee1e5bd4e35d7acf3b80e7404b113daf11b6925490369aaa637e60964fb744d.jpg) Figure 2: Overview of the MolSpectra pre-training framework. Our pre-training framework comprises three sub-objectives: the denoising objective and the MPR objective, which respectively guide the representation learning of the 3D and spectral modalities, and the contrastive objective, which aligns the representations of both modalities. form, the total intramolecular potential energy is mainly attributed to three types of interactions: bond stretching, bond angle bending, and bond torsion. The following energy function is derived: ![](images/bdae202d54ee161317731819baf6967d189260b12ff333dc514a8cc065c47f97.jpg) where r ∈ (R≥0)m1 , θ ∈ [0, 2π)m2 , ϕ ∈ [0, 2π)m3 represent vectors of the bond lengths, bond angles, and bond torsion angles of the molecule, respectively. r0, θ0, ϕ0 correspond to the respective equilibrium values. 
The parameter vectors kB, kA, kT determine the interaction strength.", + "edge": [ + "Denoising Fundamentals and Equivalence", + "Representative Energy Functions" + ], + "level": 2, + "visual_node": [ + "![](images/b1279274391a8ee385cd510483726424c1b7bb542dd3dfcaa664dd7784591ce0.jpg)", + "![](images/7ad84d22cd3c911b24b3de89b1077c3b9f08c6b762840ecd38160beb882f3a16.jpg)", + "![](images/b5f78fc1f86d0c7c6f996497a567b0462a6e0529447787d81ffe3fdbb71b3ccb.jpg)", + "![](images/1ee1e5bd4e35d7acf3b80e7404b113daf11b6925490369aaa637e60964fb744d.jpg)", + "![](images/bdae202d54ee161317731819baf6967d189260b12ff333dc514a8cc065c47f97.jpg)" + ] + }, + { + "name": "Molecule Structure Definition", + "content": "Consider a molecule characterized by its 3D structure and spectra, represented as M = (a, x, S). Here, a ∈ {1, 2, . . . , 118}N specifies the atomic numbers, indicating the types of atoms within the molecule. The vector x ∈ R3N describes the conformation of the molecule, while S represents its spectra. The parameter N denotes the number of atoms in the molecule. Note that the atoms are arranged in the same order in both a and x, ensuring consistency between the atomic numbers and their corresponding spatial coordinates.", + "edge": [], + "level": 3, + "visual_node": [] + }, + { + "name": "Spectra Definition", + "content": "S = (s1, . . . , s|S|) represents the set of spectra for a molecule, where |S| denotes the number of spectrum types considered. In our study, we focus on three types, so |S| = 3. The first spectrum, s1 ∈ R601, is the UV-Vis spectrum, which spans from 1.5 to 13.5 eV with 601 data points at intervals of 0.02 eV. The second spectrum, s2 ∈ R3501, is the IR spectrum, covering a range from 500 to 4000 cm−1 with 3501 data points at intervals of 1 cm−1. The third spectrum, s3 ∈ R3501, is the Raman spectrum, with the same range and intervals as the IR spectrum. 
Together, these spectra provide a comprehensive description of the molecular characteristics across different spectral modalities.", + "edge": [], + "level": 3, + "visual_node": [] + }, + { + "name": "Denoising Fundamentals and Equivalence", + "content": "Denoising has emerged as a prominent pre-training objective in 3D molecular representation learning, excelling in various downstream tasks. This method involves training models to predict and remove noise introduced deliberately into molecular structures. This approach is physically interpretable due to its proven equivalence to learning the molecular force field. Equivalence between denoising and learning molecular force fields. The equivalence between coordinate denoising and force field learning is established by Zaidi et al. (2023). For a given molecule M, perturb its equilibrium structure x0 according to the distribution p(x|x0), where x is the noisy conformation. Assuming the molecular distribution adheres to the energy-based Boltzmann distribution with respect to the energy function E(·), then ![](images/b1279274391a8ee385cd510483726424c1b7bb542dd3dfcaa664dd7784591ce0.jpg) where GNNθ(x) denotes a graph neural network parameterized by θ, which processes the conformation x to produce node-level predictions. The notation ≃ signifies the equivalence of different objectives. The proof of this equivalence is provided in the Appendix A.", + "edge": [ + "Introduction to Denoising", + "Mathematical Equivalence" + ], + "level": 3, + "visual_node": [ + "![](images/b1279274391a8ee385cd510483726424c1b7bb542dd3dfcaa664dd7784591ce0.jpg)" + ] + }, + { + "name": "Representative Energy Functions", + "content": "In prior research, the energy function E(·) has been defined in several forms. Below are three representative studies. Energy function I: mixture of isotropic Gaussians. 
In Coord (Zaidi et al., 2023), the energy function is approximated using a mixture of isotropic Gaussians centered at the known equilibrium structures to replace the Boltzmann distribution, since these structures are local maxima of the Boltzmann distribution. Leveraging the equivalence between the score-matching objective and denoising autoencoders (Vincent, 2011), the following denoising-based energy function ECoord(·) is derived: ![](images/7ad84d22cd3c911b24b3de89b1077c3b9f08c6b762840ecd38160beb882f3a16.jpg) Note that this objective is derived under the assumption of isotropic Gaussian noise, i.e., p(x|x0) ∼ N (x0, τ 2c I3N ), where I3N represents the identity matrix of size 3N , and the subscript c indicates the coordinate denoising approach. Energy function II: mixture of anisotropic Gaussians. Considering rigid and flexible components in molecular structures, isotropic Gaussian can lead to significant approximation errors. To address the anisotropic distribution, Frad (Feng et al., 2023) introduces hybrid noise on dihedral angles of rotatable bonds and atomic coordinates, incorporating fractional denoising of the coordinate noise. The equilibrium structure x0 is initially perturbed by dihedral angle noise p(ψa|ψ0) ∼ N (ψ0, σ2f Im), followed by coordinate noise p(x|xa) ∼ N (xa, τ 2f I3N ). Here, ψa, ψ0 ∈ [0, 2π)m represent the dihedral angles of rotatable bonds in structures xa and x0, respectively, with m denoting the number of rotatable bonds. The subscript f indicates the fractional denoising approach. Subsequently, the energy function is induced: ![](images/b5f78fc1f86d0c7c6f996497a567b0462a6e0529447787d81ffe3fdbb71b3ccb.jpg) where Στf ,σf = τ 2f I3N + σ2f CC⊤, and C ∈ R3N×m is a matrix used to linearly transform the dihedral angle noise into coordinate change, expressed as ∆x ≈ C∆ψ. Energy function III: classical potential energy theory. 
SliDe (Ni et al., 2024) derives energy function from classical molecular potential energy theory (Alavi, 2020; Zhou & Liu, 2022). In this ![](images/1ee1e5bd4e35d7acf3b80e7404b113daf11b6925490369aaa637e60964fb744d.jpg) Figure 2: Overview of the MolSpectra pre-training framework. Our pre-training framework comprises three sub-objectives: the denoising objective and the MPR objective, which respectively guide the representation learning of the 3D and spectral modalities, and the contrastive objective, which aligns the representations of both modalities. form, the total intramolecular potential energy is mainly attributed to three types of interactions: bond stretching, bond angle bending, and bond torsion. The following energy function is derived: ![](images/bdae202d54ee161317731819baf6967d189260b12ff333dc514a8cc065c47f97.jpg) where r ∈ (R≥0)m1 , θ ∈ [0, 2π)m2 , ϕ ∈ [0, 2π)m3 represent vectors of the bond lengths, bond angles, and bond torsion angles of the molecule, respectively. r0, θ0, ϕ0 correspond to the respective equilibrium values. The parameter vectors kB, kA, kT determine the interaction strength.", + "edge": [ + "Energy Function I: Isotropic Gaussians", + "Energy Function II: Anisotropic Gaussians", + "Energy Function III: Classical Potential" + ], + "level": 3, + "visual_node": [ + "![](images/7ad84d22cd3c911b24b3de89b1077c3b9f08c6b762840ecd38160beb882f3a16.jpg)", + "![](images/b5f78fc1f86d0c7c6f996497a567b0462a6e0529447787d81ffe3fdbb71b3ccb.jpg)", + "![](images/1ee1e5bd4e35d7acf3b80e7404b113daf11b6925490369aaa637e60964fb744d.jpg)", + "![](images/bdae202d54ee161317731819baf6967d189260b12ff333dc514a8cc065c47f97.jpg)" + ] + }, + { + "name": "Introduction to Denoising", + "content": "Denoising has emerged as a prominent pre-training objective in 3D molecular representation learning, excelling in various downstream tasks. This method involves training models to predict and remove noise introduced deliberately into molecular structures. 
This approach is physically interpretable due to its proven equivalence to learning the molecular force field.", + "edge": [], + "level": 4, + "visual_node": [] + }, + { + "name": "Mathematical Equivalence", + "content": "Equivalence between denoising and learning molecular force fields. The equivalence between coordinate denoising and force field learning is established by Zaidi et al. (2023). For a given molecule M, perturb its equilibrium structure x0 according to the distribution p(x|x0), where x is the noisy conformation. Assuming the molecular distribution adheres to the energy-based Boltzmann distribution with respect to the energy function E(·), then ![](images/b1279274391a8ee385cd510483726424c1b7bb542dd3dfcaa664dd7784591ce0.jpg) where GNNθ(x) denotes a graph neural network parameterized by θ, which processes the conformation x to produce node-level predictions. The notation ≃ signifies the equivalence of different objectives. The proof of this equivalence is provided in the Appendix A.", + "edge": [], + "level": 4, + "visual_node": [ + "![](images/b1279274391a8ee385cd510483726424c1b7bb542dd3dfcaa664dd7784591ce0.jpg)" + ] + }, + { + "name": "Energy Function I: Isotropic Gaussians", + "content": "In prior research, the energy function E(·) has been defined in several forms. Below are three representative studies. Energy function I: mixture of isotropic Gaussians. In Coord (Zaidi et al., 2023), the energy function is approximated using a mixture of isotropic Gaussians centered at the known equilibrium structures to replace the Boltzmann distribution, since these structures are local maxima of the Boltzman distribution. 
Leveraging the equivalence between the score-matching objective and denoising autoencoders (Vincent, 2011), the following denoising-based energy function ECoord(·) is derived: ![](images/7ad84d22cd3c911b24b3de89b1077c3b9f08c6b762840ecd38160beb882f3a16.jpg) Note that this objective is derived under the assumption of isotropic Gaussian noise, i.e., p(x|x0) ∼ N (x0, τ 2c I3N ), where I3N represents the identity matrix of size 3N , and the subscript c indicates the coordinate denoising approach.", + "edge": [], + "level": 4, + "visual_node": [ + "![](images/7ad84d22cd3c911b24b3de89b1077c3b9f08c6b762840ecd38160beb882f3a16.jpg)" + ] + }, + { + "name": "Energy Function II: Anisotropic Gaussians", + "content": "Energy function II: mixture of anisotropic Gaussians. Considering rigid and flexible components in molecular structures, isotropic Gaussian can lead to significant approximation errors. To address the anisotropic distribution, Frad (Feng et al., 2023) introduces hybrid noise on dihedral angles of rotatable bonds and atomic coordinates, incorporating fractional denoising of the coordinate noise. The equilibrium structure x0 is initially perturbed by dihedral angle noise p(ψa|ψ0) ∼ N (ψ0, σ2f Im), followed by coordinate noise p(x|xa) ∼ N (xa, τ 2f I3N ). Here, ψa, ψ0 ∈ [0, 2π)m represent to the dihedral angles of rotatable bonds in structures xa and x0, respectively, with m denoting the number of rotatable bonds. The subscript f indicates the fractional denoising approach. 
Subsequently, the energy function is induced: ![](images/b5f78fc1f86d0c7c6f996497a567b0462a6e0529447787d81ffe3fdbb71b3ccb.jpg) where Στf ,σf = τ 2f I3N + σ2f CC⊤, and C ∈ R3N×m is a matrix used to linearly transform the dihedral angle noise into coordinate change, expressed as ∆x ≈ C∆ψ.", + "edge": [], + "level": 4, + "visual_node": [ + "![](images/b5f78fc1f86d0c7c6f996497a567b0462a6e0529447787d81ffe3fdbb71b3ccb.jpg)" + ] + }, + { + "name": "Energy Function III: Classical Potential", + "content": "Energy function III: classical potential energy theory. SliDe (Ni et al., 2024) derives energy function from classical molecular potential energy theory (Alavi, 2020; Zhou & Liu, 2022). In this ![](images/1ee1e5bd4e35d7acf3b80e7404b113daf11b6925490369aaa637e60964fb744d.jpg) Figure 2: Overview of the MolSpectra pre-training framework. Our pre-training framework comprises three sub-objectives: the denoising objective and the MPR objective, which respectively guide the representation learning of the 3D and spectral modalities, and the contrastive objective, which aligns the representations of both modalities. form, the total intramolecular potential energy is mainly attributed to three types of interactions: bond stretching, bond angle bending, and bond torsion. The following energy function is derived: ![](images/bdae202d54ee161317731819baf6967d189260b12ff333dc514a8cc065c47f97.jpg) where r ∈ (R≥0)m1 , θ ∈ [0, 2π)m2 , ϕ ∈ [0, 2π)m3 represent vectors of the bond lengths, bond angles, and bond torsion angles of the molecule, respectively. r0, θ0, ϕ0 correspond to the respective equilibrium values. 
The parameter vectors kB, kA, kT determine the interaction strength.", + "edge": [], + "level": 4, + "visual_node": [ + "![](images/1ee1e5bd4e35d7acf3b80e7404b113daf11b6925490369aaa637e60964fb744d.jpg)", + "![](images/bdae202d54ee161317731819baf6967d189260b12ff333dc514a8cc065c47f97.jpg)" + ] + }, + { + "name": "3 THE PROPOSED MOLSPECTRA METHOD", + "content": "# 3 THE PROPOSED MOLSPECTRA METHOD Considering the complementarity of different spectra, we introduce multiple spectra into molecular representation learning. To effectively comprehend molecular spectra, we designed a Transformerbased multi-spectrum encoder, SpecFormer, along with a masked reconstruction objective to guide its training. Finally, a contrastive objective is employed to align the 3D encoding guided by the denoising objective with the spectra encoding guided by the reconstruction objective, endowing the 3D encoding with the capability to understand spectra and the knowledge they encompass. # 3.1 SPECFORMER: A SINGLE-STREAM ENCODER FOR MULTI-MODAL ENERGY SPECTRA For different types of spectra, each spectrum is independently patched and initially encoded. Then, all the resulting patch embeddings are concatenated and encoded using a Transformer-based encoder. Patching. Compared to directly encoding individual frequency points, we divided each spectrum into multiple patches. This approach offers two distinct advantages: (i) By forming patches from adjacent frequency points, local semantic features, such as absorption peaks, can be captured more effectively. (ii) It reduces the computational overhead of subsequent Transformer layers. Technically, each spectrum si ∈ RLi where i = 1, · · · , |S| is first divided into patches according to the patch length Pi and the stride Di. When 0 < Di < Pi, the consecutive patches will be overlapped with overlapping region length Pi − Di. When Di = Pi, the consecutive patches will be nonoverlapped. Li denotes the length of si. 
The patching process on each spectrum will generate a sequence of patches pi ∈ RNi×Pi, where Ni = Li−PiDi + 1 is the number of patches. Patch encoding and position encoding. Prior to be fed into the encoder, the patches of the i-th spectrum are mapped to the latent space of dimension d via a trainable linear projection Wi ∈ RPi×d. A learnable additive position encoding W posi ∈ RNi×d is applied to maintain the order of the patches: p′i = piWi + W posi , where p′i ∈ RNi×d denotes the latent representation of the spectrum si that will be fed into the subsequent SpecFormer encoder. SpecFormer: multi-spectrum Transformer encoder. Although several encoders have been proposed to map molecular spectrum into implicit representations, such as the CNN-AM (Tao et al., 2024) based on one-dimensional convolution, these encoders are designed to encode only a single type of spectrum. In our approach, multiple molecular spectra (UV-Vis, IR, Raman) are jointly considered. When encoding multiple spectra of a molecule simultaneously, an observation caught our attention and led us to adopt a Transformer-based encoder with multiple spectra as input, similar to the single-stream Transformer in multi-modal learning (Shin et al., 2021). The observation refers to the fact that the same functional group not only causes multiple peaks within a single spectrum, but also generates peaks across different spectra. As shown on the left of Figure 3, the different vibrational modes of the methyl group (-CH3) in methanol (CH3OH) result in three peaks in the IR spectrum, indicating intra-spectrum dependencies among these peaks. A similar phenomenon occurs with the hydroxyl group (-OH) in methanol. Additionally, the aromatic ring in phenol (C6H5OH), shown on the right of ![](images/6c50a187dd8276b7372af9e7e00b99521d3741fa210eee7a1af570b334053570.jpg) Figure 3: Illustration of intra-spectrum (left) and interspectrum (right) dependencies. 
Figure 3, not only produces multiple peaks in the IR spectrum due to different vibrational modes but also causes an absorption peak near 270 nm in the UV-Vis spectrum due to the π → π∗ transition in the aromatic ring, demonstrating the existence of inter-spectrum dependencies. Such dependencies have been theoretically studied, for example, in the context of vibronic coupling (Kong et al., 2021). To capture intra-spectrum and inter-spectrum dependencies, we concatenate the embeddings obtained from patch encoding and position encoding of different spectra: pˆ = p′1∥ · · · ∥p′|S| ∈ R(P|S|i=1 Ni)×d, and then input them into the Transformer encoder as depicted in Figure 2. Then each head h = 1, . . . , H in multi-head attention will transform them into query matrices Qh = pWˆ Qh , key matrices Kh = pWˆ Kh and value matrices Vh = pWˆ Vh , where W Qh , W Kh ∈ Rd×dk and WVh ∈ Rd× dH . Afterward, a scaled product is utilized to obtain the attention output Oh ∈ R(P|S|i=1 Ni)× dH : ![](images/16005489fb642f13912d85a6d90523ef0a5e1d56b5c30d9d2ee35336fdb04f4f.jpg) The multi-head attention block also includes BatchNorm layers and a feed forward network with residual connections as shown in Figure 2. After combining the outputs of all heads, it generates the representation denoted as z ∈ R(P|S|i=1 Ni)×d. Finally, a flatten layer with representation projection head is used to obtain the molecular spectra representation zs ∈ Rd. # 3.2 MASKED PATCHES RECONSTRUCTION PRE-TRAINING FOR SPECTRA Before distilling the spectra information into 3D molecular representation learning, we need first ensure that the spectrum encoder can effectively comprehend molecular spectra and generate spectral representations. Considering the success of masking modeling across various domains (Devlin et al., 2019; He et al., 2022; Hou et al., 2022; Xia et al., 2023; Wang et al., 2024b; Nie et al., 2023), we propose a masked patches reconstruction (MPR) objective to guide the training of SpecFormer. 
After the patching step, we randomly select a portion of patches according to the mask ratio α and replace them with zero vectors to implement the masking. Subsequently, the masked patches undergo patch encoding and position encoding. In this way, the semantics of the masked patches (the absorption intensity at specific wavelengths) are obscured during patch encoding, while the positional information is retained to facilitate the reconstruction of the original semantics. After encoding by SpecFormer, the encoded results corresponding to the masked patches are input into a spectrum-specific reconstruction head to reconstruct the original spectral values that were masked. The mean squared error (MSE) between the reconstruction results and the original masked spectra serves as the loss function for the MPR task, guiding the training of SpecFormer: ![](images/6b6a48b7abea8c9dc0c3c72345c9c94cbb0ac1d1ef8824d485ff0ace9b5a0a1e.jpg) where Pi denotes the set of masked patches in the i-th type of molecular spectra, and pˆi,j denotes the reconstructed patch corresponding to the masked patch pi,j . # 3.3 CONTRASTIVE LEARNING BETWEEN 3D STRUCTURES AND SPECTRA Under the guidance of the denoising objective for 3D representation learning and the MPR objective for spectral representation learning, we further introduce a contrastive objective to align the representations across these two modalities. We treat the 3D representation zx ∈ Rd and spectral representation zs ∈ Rd of the same molecule as positive samples, and negative samples otherwise. Subsequently, the consistency between positive samples and the discrepancy between negative samples are maximized through the contrastive objective. 
Given the theoretical and empirical effectiveness, we employ InfoNCE (van den Oord et al., 2018) as the contrastive objective: ![](images/35c4ba087f1ae7299aa554638aef5647f9efa2de01450fec172beb01bf64538e.jpg) where zjx, zjs are randomly sampled 3D and spectra views regarding to the positive pair (zx, zs). fx(zx, zs) and fs(zs, zx) are scoring functions for the two corresponding views, with flexible formulations. Here we adopt fx(zx, zs) = fs(zs, zx) = ⟨zx, zs⟩. Note that the denoising objective can utilize any form from existing 3D molecular representation pre-training studies, enabling seamless integration of our method into these frameworks. # 3.4 TWO-STAGE PRE-TRAINING PIPELINE Previous pre-training efforts for 3D molecular representation have been conducted on unlabeled datasets using denoising objective. These datasets typically provide only equilibrium 3D structures without offering spectra for all molecules. To enhance the pre-training effect by incorporating spectra while leveraging denoising pre-training, we employ a two-stage pre-training approach. The first stage involves training on a larger dataset (Nakata & Shimazaki, 2017) without spectra using only the denoising objective. 
Subsequently, the second stage involves training on a dataset that includes spectra using the complete objective as follows: ![](images/977b41496865477e7652249bec51630a1a037097fa74f17c10e8a72851cd7ce3.jpg) where βDenoising, βMPR, and βContrast denote the weights of each sub-objective.", + "edge": [ + "Method Overview", + "SpecFormer Encoder (3.1)", + "Masked Patches Reconstruction (3.2)", + "Contrastive Learning (3.3)", + "Two-Stage Pipeline (3.4)" + ], + "level": 1, + "visual_node": [ + "![](images/6c50a187dd8276b7372af9e7e00b99521d3741fa210eee7a1af570b334053570.jpg)", + "![](images/16005489fb642f13912d85a6d90523ef0a5e1d56b5c30d9d2ee35336fdb04f4f.jpg)", + "![](images/6b6a48b7abea8c9dc0c3c72345c9c94cbb0ac1d1ef8824d485ff0ace9b5a0a1e.jpg)", + "![](images/35c4ba087f1ae7299aa554638aef5647f9efa2de01450fec172beb01bf64538e.jpg)", + "![](images/977b41496865477e7652249bec51630a1a037097fa74f17c10e8a72851cd7ce3.jpg)" + ] + }, + { + "name": "Method Overview", + "content": "Considering the complementarity of different spectra, we introduce multiple spectra into molecular representation learning. To effectively comprehend molecular spectra, we designed a Transformerbased multi-spectrum encoder, SpecFormer, along with a masked reconstruction objective to guide its training. Finally, a contrastive objective is employed to align the 3D encoding guided by the denoising objective with the spectra encoding guided by the reconstruction objective, endowing the 3D encoding with the capability to understand spectra and the knowledge they encompass.", + "edge": [], + "level": 2, + "visual_node": [] + }, + { + "name": "SpecFormer Encoder (3.1)", + "content": "# 3.1 SPECFORMER: A SINGLE-STREAM ENCODER FOR MULTI-MODAL ENERGY SPECTRA For different types of spectra, each spectrum is independently patched and initially encoded. Then, all the resulting patch embeddings are concatenated and encoded using a Transformer-based encoder. Patching. 
Compared to directly encoding individual frequency points, we divide each spectrum into multiple patches. This approach offers two distinct advantages: (i) By forming patches from adjacent frequency points, local semantic features, such as absorption peaks, can be captured more effectively. (ii) It reduces the computational overhead of subsequent Transformer layers. Technically, each spectrum si ∈ RLi where i = 1, · · · , |S| is first divided into patches according to the patch length Pi and the stride Di. When 0 < Di < Pi, the consecutive patches will be overlapped with overlapping region length Pi − Di. When Di = Pi, the consecutive patches will be nonoverlapped. Li denotes the length of si. The patching process on each spectrum will generate a sequence of patches pi ∈ RNi×Pi, where Ni = ⌊(Li − Pi)/Di⌋ + 1 is the number of patches. Patch encoding and position encoding. Prior to being fed into the encoder, the patches of the i-th spectrum are mapped to the latent space of dimension d via a trainable linear projection Wi ∈ RPi×d. A learnable additive position encoding W posi ∈ RNi×d is applied to maintain the order of the patches: p′i = piWi + W posi , where p′i ∈ RNi×d denotes the latent representation of the spectrum si that will be fed into the subsequent SpecFormer encoder. SpecFormer: multi-spectrum Transformer encoder. Although several encoders have been proposed to map molecular spectrum into implicit representations, such as the CNN-AM (Tao et al., 2024) based on one-dimensional convolution, these encoders are designed to encode only a single type of spectrum. In our approach, multiple molecular spectra (UV-Vis, IR, Raman) are jointly considered. When encoding multiple spectra of a molecule simultaneously, an observation caught our attention and led us to adopt a Transformer-based encoder with multiple spectra as input, similar to the single-stream Transformer in multi-modal learning (Shin et al., 2021). 
The observation refers to the fact that the same functional group not only causes multiple peaks within a single spectrum, but also generates peaks across different spectra. As shown on the left of Figure 3, the different vibrational modes of the methyl group (-CH3) in methanol (CH3OH) result in three peaks in the IR spectrum, indicating intra-spectrum dependencies among these peaks. A similar phenomenon occurs with the hydroxyl group (-OH) in methanol. Additionally, the aromatic ring in phenol (C6H5OH), shown on the right of ![](images/6c50a187dd8276b7372af9e7e00b99521d3741fa210eee7a1af570b334053570.jpg) Figure 3: Illustration of intra-spectrum (left) and interspectrum (right) dependencies. Figure 3, not only produces multiple peaks in the IR spectrum due to different vibrational modes but also causes an absorption peak near 270 nm in the UV-Vis spectrum due to the π → π∗ transition in the aromatic ring, demonstrating the existence of inter-spectrum dependencies. Such dependencies have been theoretically studied, for example, in the context of vibronic coupling (Kong et al., 2021). To capture intra-spectrum and inter-spectrum dependencies, we concatenate the embeddings obtained from patch encoding and position encoding of different spectra: pˆ = p′1∥ · · · ∥p′|S| ∈ R(P|S|i=1 Ni)×d, and then input them into the Transformer encoder as depicted in Figure 2. Then each head h = 1, . . . , H in multi-head attention will transform them into query matrices Qh = pWˆ Qh , key matrices Kh = pWˆ Kh and value matrices Vh = pWˆ Vh , where W Qh , W Kh ∈ Rd×dk and WVh ∈ Rd× dH . Afterward, a scaled product is utilized to obtain the attention output Oh ∈ R(P|S|i=1 Ni)× dH : ![](images/16005489fb642f13912d85a6d90523ef0a5e1d56b5c30d9d2ee35336fdb04f4f.jpg) The multi-head attention block also includes BatchNorm layers and a feed forward network with residual connections as shown in Figure 2. 
After combining the outputs of all heads, it generates the representation denoted as z ∈ R(P|S|i=1 Ni)×d. Finally, a flatten layer with representation projection head is used to obtain the molecular spectra representation zs ∈ Rd.", + "edge": [ + "Patching and Initial Encoding", + "Transformer Architecture and Dependencies" + ], + "level": 2, + "visual_node": [ + "![](images/6c50a187dd8276b7372af9e7e00b99521d3741fa210eee7a1af570b334053570.jpg)", + "![](images/16005489fb642f13912d85a6d90523ef0a5e1d56b5c30d9d2ee35336fdb04f4f.jpg)" + ] + }, + { + "name": "Patching and Initial Encoding", + "content": "For different types of spectra, each spectrum is independently patched and initially encoded. Then, all the resulting patch embeddings are concatenated and encoded using a Transformer-based encoder. Patching. Compared to directly encoding individual frequency points, we divided each spectrum into multiple patches. This approach offers two distinct advantages: (i) By forming patches from adjacent frequency points, local semantic features, such as absorption peaks, can be captured more effectively. (ii) It reduces the computational overhead of subsequent Transformer layers. Technically, each spectrum si ∈ RLi where i = 1, · · · , |S| is first divided into patches according to the patch length Pi and the stride Di. When 0 < Di < Pi, the consecutive patches will be overlapped with overlapping region length Pi − Di. When Di = Pi, the consecutive patches will be nonoverlapped. Li denotes the length of si. The patching process on each spectrum will generate a sequence of patches pi ∈ RNi×Pi, where Ni = Li−PiDi + 1 is the number of patches. Patch encoding and position encoding. Prior to be fed into the encoder, the patches of the i-th spectrum are mapped to the latent space of dimension d via a trainable linear projection Wi ∈ RPi×d. 
A learnable additive position encoding W posi ∈ RNi×d is applied to maintain the order of the patches: p′i = piWi + W posi , where p′i ∈ RNi×d denotes the latent representation of the spectrum si that will be fed into the subsequent SpecFormer encoder.", + "edge": [ + "Process Overview", + "Patching Strategy", + "Patch and Position Encoding" + ], + "level": 3, + "visual_node": [] + }, + { + "name": "Process Overview", + "content": "For different types of spectra, each spectrum is independently patched and initially encoded. Then, all the resulting patch embeddings are concatenated and encoded using a Transformer-based encoder.", + "edge": [], + "level": 4, + "visual_node": [] + }, + { + "name": "Patching Strategy", + "content": "Patching. Compared to directly encoding individual frequency points, we divided each spectrum into multiple patches. This approach offers two distinct advantages: (i) By forming patches from adjacent frequency points, local semantic features, such as absorption peaks, can be captured more effectively. (ii) It reduces the computational overhead of subsequent Transformer layers. Technically, each spectrum si ∈ RLi where i = 1, · · · , |S| is first divided into patches according to the patch length Pi and the stride Di. When 0 < Di < Pi, the consecutive patches will be overlapped with overlapping region length Pi − Di. When Di = Pi, the consecutive patches will be nonoverlapped. Li denotes the length of si. The patching process on each spectrum will generate a sequence of patches pi ∈ RNi×Pi, where Ni = Li−PiDi + 1 is the number of patches.", + "edge": [], + "level": 4, + "visual_node": [] + }, + { + "name": "Patch and Position Encoding", + "content": "Patch encoding and position encoding. Prior to be fed into the encoder, the patches of the i-th spectrum are mapped to the latent space of dimension d via a trainable linear projection Wi ∈ RPi×d. 
A learnable additive position encoding W posi ∈ RNi×d is applied to maintain the order of the patches: p′i = piWi + W posi , where p′i ∈ RNi×d denotes the latent representation of the spectrum si that will be fed into the subsequent SpecFormer encoder.", + "edge": [], + "level": 4, + "visual_node": [] + }, + { + "name": "Transformer Architecture and Dependencies", + "content": "SpecFormer: multi-spectrum Transformer encoder. Although several encoders have been proposed to map molecular spectrum into implicit representations, such as the CNN-AM (Tao et al., 2024) based on one-dimensional convolution, these encoders are designed to encode only a single type of spectrum. In our approach, multiple molecular spectra (UV-Vis, IR, Raman) are jointly considered. When encoding multiple spectra of a molecule simultaneously, an observation caught our attention and led us to adopt a Transformer-based encoder with multiple spectra as input, similar to the single-stream Transformer in multi-modal learning (Shin et al., 2021). The observation refers to the fact that the same functional group not only causes multiple peaks within a single spectrum, but also generates peaks across different spectra. As shown on the left of Figure 3, the different vibrational modes of the methyl group (-CH3) in methanol (CH3OH) result in three peaks in the IR spectrum, indicating intra-spectrum dependencies among these peaks. A similar phenomenon occurs with the hydroxyl group (-OH) in methanol. Additionally, the aromatic ring in phenol (C6H5OH), shown on the right of ![](images/6c50a187dd8276b7372af9e7e00b99521d3741fa210eee7a1af570b334053570.jpg) Figure 3: Illustration of intra-spectrum (left) and interspectrum (right) dependencies. 
Figure 3, not only produces multiple peaks in the IR spectrum due to different vibrational modes but also causes an absorption peak near 270 nm in the UV-Vis spectrum due to the π → π∗ transition in the aromatic ring, demonstrating the existence of inter-spectrum dependencies. Such dependencies have been theoretically studied, for example, in the context of vibronic coupling (Kong et al., 2021). To capture intra-spectrum and inter-spectrum dependencies, we concatenate the embeddings obtained from patch encoding and position encoding of different spectra: pˆ = p′1∥ · · · ∥p′|S| ∈ R(P|S|i=1 Ni)×d, and then input them into the Transformer encoder as depicted in Figure 2. Then each head h = 1, . . . , H in multi-head attention will transform them into query matrices Qh = pWˆ Qh , key matrices Kh = pWˆ Kh and value matrices Vh = pWˆ Vh , where W Qh , W Kh ∈ Rd×dk and WVh ∈ Rd× dH . Afterward, a scaled product is utilized to obtain the attention output Oh ∈ R(P|S|i=1 Ni)× dH : ![](images/16005489fb642f13912d85a6d90523ef0a5e1d56b5c30d9d2ee35336fdb04f4f.jpg) The multi-head attention block also includes BatchNorm layers and a feed forward network with residual connections as shown in Figure 2. After combining the outputs of all heads, it generates the representation denoted as z ∈ R(P|S|i=1 Ni)×d. Finally, a flatten layer with representation projection head is used to obtain the molecular spectra representation zs ∈ Rd.", + "edge": [ + "Encoder Motivation", + "Spectral Dependencies Observation", + "Attention Mechanism and Output" + ], + "level": 3, + "visual_node": [ + "![](images/6c50a187dd8276b7372af9e7e00b99521d3741fa210eee7a1af570b334053570.jpg)", + "![](images/16005489fb642f13912d85a6d90523ef0a5e1d56b5c30d9d2ee35336fdb04f4f.jpg)" + ] + }, + { + "name": "Encoder Motivation", + "content": "SpecFormer: multi-spectrum Transformer encoder. 
Although several encoders have been proposed to map molecular spectrum into implicit representations, such as the CNN-AM (Tao et al., 2024) based on one-dimensional convolution, these encoders are designed to encode only a single type of spectrum. In our approach, multiple molecular spectra (UV-Vis, IR, Raman) are jointly considered. When encoding multiple spectra of a molecule simultaneously, an observation caught our attention and led us to adopt a Transformer-based encoder with multiple spectra as input, similar to the single-stream Transformer in multi-modal learning (Shin et al., 2021).", + "edge": [], + "level": 4, + "visual_node": [] + }, + { + "name": "Spectral Dependencies Observation", + "content": "The observation refers to the fact that the same functional group not only causes multiple peaks within a single spectrum, but also generates peaks across different spectra. As shown on the left of Figure 3, the different vibrational modes of the methyl group (-CH3) in methanol (CH3OH) result in three peaks in the IR spectrum, indicating intra-spectrum dependencies among these peaks. A similar phenomenon occurs with the hydroxyl group (-OH) in methanol. Additionally, the aromatic ring in phenol (C6H5OH), shown on the right of ![](images/6c50a187dd8276b7372af9e7e00b99521d3741fa210eee7a1af570b334053570.jpg) Figure 3: Illustration of intra-spectrum (left) and interspectrum (right) dependencies. Figure 3, not only produces multiple peaks in the IR spectrum due to different vibrational modes but also causes an absorption peak near 270 nm in the UV-Vis spectrum due to the π → π∗ transition in the aromatic ring, demonstrating the existence of inter-spectrum dependencies. 
Such dependencies have been theoretically studied, for example, in the context of vibronic coupling (Kong et al., 2021).", + "edge": [], + "level": 4, + "visual_node": [ + "![](images/6c50a187dd8276b7372af9e7e00b99521d3741fa210eee7a1af570b334053570.jpg)" + ] + }, + { + "name": "Attention Mechanism and Output", + "content": "To capture intra-spectrum and inter-spectrum dependencies, we concatenate the embeddings obtained from patch encoding and position encoding of different spectra: pˆ = p′1∥ · · · ∥p′|S| ∈ R(P|S|i=1 Ni)×d, and then input them into the Transformer encoder as depicted in Figure 2. Then each head h = 1, . . . , H in multi-head attention will transform them into query matrices Qh = pWˆ Qh , key matrices Kh = pWˆ Kh and value matrices Vh = pWˆ Vh , where W Qh , W Kh ∈ Rd×dk and WVh ∈ Rd× dH . Afterward, a scaled product is utilized to obtain the attention output Oh ∈ R(P|S|i=1 Ni)× dH : ![](images/16005489fb642f13912d85a6d90523ef0a5e1d56b5c30d9d2ee35336fdb04f4f.jpg) The multi-head attention block also includes BatchNorm layers and a feed forward network with residual connections as shown in Figure 2. After combining the outputs of all heads, it generates the representation denoted as z ∈ R(P|S|i=1 Ni)×d. Finally, a flatten layer with representation projection head is used to obtain the molecular spectra representation zs ∈ Rd.", + "edge": [], + "level": 4, + "visual_node": [ + "![](images/16005489fb642f13912d85a6d90523ef0a5e1d56b5c30d9d2ee35336fdb04f4f.jpg)" + ] + }, + { + "name": "Masked Patches Reconstruction (3.2)", + "content": "# 3.2 MASKED PATCHES RECONSTRUCTION PRE-TRAINING FOR SPECTRA Before distilling the spectra information into 3D molecular representation learning, we need first ensure that the spectrum encoder can effectively comprehend molecular spectra and generate spectral representations. 
Considering the success of masking modeling across various domains (Devlin et al., 2019; He et al., 2022; Hou et al., 2022; Xia et al., 2023; Wang et al., 2024b; Nie et al., 2023), we propose a masked patches reconstruction (MPR) objective to guide the training of SpecFormer. After the patching step, we randomly select a portion of patches according to the mask ratio α and replace them with zero vectors to implement the masking. Subsequently, the masked patches undergo patch encoding and position encoding. In this way, the semantics of the masked patches (the absorption intensity at specific wavelengths) are obscured during patch encoding, while the positional information is retained to facilitate the reconstruction of the original semantics. After encoding by SpecFormer, the encoded results corresponding to the masked patches are input into a spectrum-specific reconstruction head to reconstruct the original spectral values that were masked. The mean squared error (MSE) between the reconstruction results and the original masked spectra serves as the loss function for the MPR task, guiding the training of SpecFormer: ![](images/6b6a48b7abea8c9dc0c3c72345c9c94cbb0ac1d1ef8824d485ff0ace9b5a0a1e.jpg) where Pi denotes the set of masked patches in the i-th type of molecular spectra, and pˆi,j denotes the reconstructed patch corresponding to the masked patch pi,j .",
Considering the success of masking modeling across various domains (Devlin et al., 2019; He et al., 2022; Hou et al., 2022; Xia et al., 2023; Wang et al., 2024b; Nie et al., 2023), we propose a masked patches reconstruction (MPR) objective to guide the training of SpecFormer. After the patching step, we randomly select a portion of patches according to the mask ratio α and replace them with zero vectors to implement the masking. Subsequently, the masked patches undergo patch encoding and position encoding. In this way, the semantics of the masked patches (the absorption intensity at specific wavelengths) are obscured during patch encoding, while the positional information is retained to facilitate the reconstruction of the original semantics.", + "edge": [ + "MPR Motivation", + "Masking Strategy" + ], + "level": 3, + "visual_node": [] + }, + { + "name": "MPR Motivation", + "content": "Before distilling the spectra information into 3D molecular representation learning, we need first ensure that the spectrum encoder can effectively comprehend molecular spectra and generate spectral representations. Considering the success of masking modeling across various domains (Devlin et al., 2019; He et al., 2022; Hou et al., 2022; Xia et al., 2023; Wang et al., 2024b; Nie et al., 2023), we propose a masked patches reconstruction (MPR) objective to guide the training of SpecFormer.", + "edge": [], + "level": 4, + "visual_node": [] + }, + { + "name": "Masking Strategy", + "content": "After the patching step, we randomly select a portion of patches according to the mask ratio α and replace them with zero vectors to implement the masking. Subsequently, the masked patches undergo patch encoding and position encoding. 
In this way, the semantics of the masked patches (the absorption intensity at specific wavelengths) are obscured during patch encoding, while the positional information is retained to facilitate the reconstruction of the original semantics.", + "edge": [], + "level": 4, + "visual_node": [] + }, + { + "name": "Reconstruction Head and Loss", + "content": "After encoding by SpecFormer, the encoded results corresponding to the masked patches are input into a spectrum-specific reconstruction head to reconstruct the original spectral values that were masked. The mean squared error (MSE) between the reconstruction results and the original masked spectra serves as the loss function for the MPR task, guiding the training of SpecFormer: ![](images/6b6a48b7abea8c9dc0c3c72345c9c94cbb0ac1d1ef8824d485ff0ace9b5a0a1e.jpg) where Pi denotes the set of masked patches in the i-th type of molecular spectra, and pˆi,j denotes ethe reconstructed patch corresponding to the masked patch pi,j .", + "edge": [], + "level": 3, + "visual_node": [ + "![](images/6b6a48b7abea8c9dc0c3c72345c9c94cbb0ac1d1ef8824d485ff0ace9b5a0a1e.jpg)" + ] + }, + { + "name": "Contrastive Learning (3.3)", + "content": "# 3.3 CONTRASTIVE LEARNING BETWEEN 3D STRUCTURES AND SPECTRA Under the guidance of the denoising objective for 3D representation learning and the MPR objective for spectral representation learning, we further introduce a contrastive objective to align the representations across these two modalities. We treat the 3D representation zx ∈ Rd and spectral representation zs ∈ Rd of the same molecule as positive samples, and negative samples otherwise. Subsequently, the consistency between positive samples and the discrepancy between negative samples are maximized through the contrastive objective. 
Given the theoretical and empirical effectiveness, we employ InfoNCE (van den Oord et al., 2018) as the contrastive objective: ![](images/35c4ba087f1ae7299aa554638aef5647f9efa2de01450fec172beb01bf64538e.jpg) where z_x^j, z_s^j are randomly sampled 3D and spectra views regarding the positive pair (zx, zs). fx(zx, zs) and fs(zs, zx) are scoring functions for the two corresponding views, with flexible formulations. Here we adopt fx(zx, zs) = fs(zs, zx) = ⟨zx, zs⟩. Note that the denoising objective can utilize any form from existing 3D molecular representation pre-training studies, enabling seamless integration of our method into these frameworks.",
Here we adopt fx(zx, zs) = fs(zs, zx) = ⟨zx, zs⟩.", + "edge": [], + "level": 3, + "visual_node": [ + "![](images/35c4ba087f1ae7299aa554638aef5647f9efa2de01450fec172beb01bf64538e.jpg)" + ] + }, + { + "name": "Integration with Denoising", + "content": "Note that the denoising objective can utilize any form from existing 3D molecular representation pre-training studies, enabling seamless integration of our method into these frameworks.", + "edge": [], + "level": 3, + "visual_node": [] + }, + { + "name": "Two-Stage Pipeline (3.4)", + "content": "# 3.4 TWO-STAGE PRE-TRAINING PIPELINE Previous pre-training efforts for 3D molecular representation have been conducted on unlabeled datasets using denoising objective. These datasets typically provide only equilibrium 3D structures without offering spectra for all molecules. To enhance the pre-training effect by incorporating spectra while leveraging denoising pre-training, we employ a two-stage pre-training approach. The first stage involves training on a larger dataset (Nakata & Shimazaki, 2017) without spectra using only the denoising objective. Subsequently, the second stage involves training on a dataset that includes spectra using the complete objective as follows: ![](images/977b41496865477e7652249bec51630a1a037097fa74f17c10e8a72851cd7ce3.jpg) where βDenoising, βMPR, and βContrast denote the weights of each sub-objective.", + "edge": [ + "Pre-training Context", + "Two-Stage Method" + ], + "level": 2, + "visual_node": [ + "![](images/977b41496865477e7652249bec51630a1a037097fa74f17c10e8a72851cd7ce3.jpg)" + ] + }, + { + "name": "Pre-training Context", + "content": "Previous pre-training efforts for 3D molecular representation have been conducted on unlabeled datasets using denoising objective. 
These datasets typically provide only equilibrium 3D structures without offering spectra for all molecules.", + "edge": [], + "level": 3, + "visual_node": [] + }, + { + "name": "Two-Stage Method", + "content": "To enhance the pre-training effect by incorporating spectra while leveraging denoising pre-training, we employ a two-stage pre-training approach. The first stage involves training on a larger dataset (Nakata & Shimazaki, 2017) without spectra using only the denoising objective. Subsequently, the second stage involves training on a dataset that includes spectra using the complete objective as follows: ![](images/977b41496865477e7652249bec51630a1a037097fa74f17c10e8a72851cd7ce3.jpg) where βDenoising, βMPR, and βContrast denote the weights of each sub-objective.", + "edge": [], + "level": 3, + "visual_node": [ + "![](images/977b41496865477e7652249bec51630a1a037097fa74f17c10e8a72851cd7ce3.jpg)" + ] + }, + { + "name": "4 EXPERIMENTS", + "content": "# 4 EXPERIMENTS To comprehensively evaluate the impact of molecular spectra on molecular tasks, we first verify the effectiveness of molecular spectra in the training-from-scratch method for the downstream task. Furthermore, we evaluate the effectiveness of our pre-training framework MolSpectra. # 4.1 EFFECTIVENESS OF MOLECULAR SPECTRA IN TRAINING FROM SCRATCH This pilot experiment aims to demonstrate the rationality for incorporating molecular spectra into pre-training. We introduce additional spectral features into a train-from-scratch molecular property prediction model to observe the impact of spectral information on prediction outcomes. We employ EGNN (Satorras et al., 2021), a representative 3D molecular encoder, equipped with an MLP-based prediction head as the baseline model. While EGNN encodes the 3D representations, the UV-Vis spectrum of each molecule provided by the QM9S (Zou et al., 2023) dataset is encoded into spectral representations by a spectrum encoder. 
Before making predictions with the final MLP, we concatenate the spectral and 3D representations for prediction. The results are presented in Table 1. Table 1: Performance (MAE ↓) when training from scratch on QM9 dataset. ![](images/084cd722defc01e058c1747c1103ac680c0cd8217a93077cab2a30ae06000a37.jpg) We observe that by directly concatenating spectral representations, the performance of molecular property prediction can be effectively enhanced. This indicates that the information from molecular spectra is beneficial for downstream molecular property prediction. Further incorporating molecular spectra into the pre-training phase of molecular representation has the potential to enhance the informativeness and generalization capability of the representations, thereby broadly improving the performance of downstream tasks. # 4.2 EFFECTIVENESS OF MOLECULAR SPECTRA IN REPRESENTATION PRE-TRAINING We conduct experiments to evaluate MolSpectra by first introducing spectral data into the pretraining of 3D representations, followed by evaluating the performance on downstream tasks. For a comprehensive comparison, two types of baselines are adopted: (1) training-from-scratch methods, including SchNet (Schutt et al. ¨ , 2017), EGNN, DimeNet (Klicpera et al., 2020b), DimeNet++ (Klicpera et al., 2020a), PaiNN (Schutt et al. ¨ , 2021), SphereNet (Liu et al., 2021), and TorchMD-Net (Tholke & Fabritiis ¨ , 2022); and (2) pre-training methods, including Transformer-M (Luo et al., 2023), SE(3)-DDM (Liu et al., 2023b), 3D-EMGP (Jiao et al., 2023), and Coord. MolSpectra can be seamlessly plugged into any existing denoising method. To evaluate the enhancement provided by our method compared to denoising alone, we select the representative coordinate denoising (Coord) as our denoising sub-objective. This method also serves as our primary baseline. # 4.2.1 PRE-TRAINING DATASET. 
As described in Section 3.4, we first perform denoising pre-training on the PCQM4Mv2 (Nakata & Shimazaki, 2017) dataset, followed by a second stage of pre-training on the QM9Spectra (QM9S) (Zou et al., 2023) dataset, which includes multi-modal molecular energy spectra. In both stages, we adopt the denoising objective provided by Coord (Zaidi et al., 2023), as defined in Eq. 2. The QM9S dataset comprises organic molecules from the QM9 (Ramakrishnan et al., 2014) dataset. The UV-Vis, IR, and Raman spectra of the molecules are calculated at the B3LYP/def-TZVP level of theory, through frequency analysis and time-dependent density functional theory (TD-DFT). # 4.2.2 QM9 The QM9 dataset is a quantum chemistry dataset comprising over 134,000 small molecules, each consisting of up to 9 heavy atoms (C, N, O, F) and additional H atoms. This dataset provides an equilibrium geometric conformation for each molecule along with 12 property labels. The dataset is divided into a training set of 110k molecules, a validation set of 10k molecules, and a test set containing the remaining over 10k molecules. Prediction errors are measured using the mean absolute error (MAE). The experimental results are presented in Table 2. The 3D molecular representations pre-trained using our method are fine-tuned and used for prediction across various properties, achieving state-of-the-art performance in 8 out of 12 properties and outperforms Coord in 10 out of 12 properties. In conjunction with the observations in Section 4.1, the performance improvement can be attributed to our incorporation of an understanding of molecular spectra and the knowledge they entail into the 3D molecular representations. Table 2: Performance (MAE↓) on QM9 dataset. The compared methods are divided into two groups: training from scratch and pre-training then fine-tuning. The best results are highlighted in bold. 
![](images/5237f9010e1b7bb79b84bdc91c83cb5152b351c1f8b05fe25b9f0b961f759e2d.jpg) Table 3: Performance (MAE↓) on MD17 force prediction (kcal/mol/ ˚A). The methods are divided into two groups: training from scratch and pre-training then fine-tuning. The best results are in bold. ![](images/1695d160ca54e8992fed88c81378f03bd31abccebafebc3c6a3d5dcadc6a747f.jpg) # 4.2.3 MD17 The MD17 dataset contains molecular dynamics trajectories for eight organic molecules, including aspirin, benzene, and ethanol. It offers 150k to nearly 1M conformations per molecule, with energy and force labels. Unlike QM9, MD17 emphasizes dynamic behavior in addition to static properties. We use a standard limited data split: models train on 1k samples, validate on 50, and test on the rest. Performance is evaluated using MAE, with results in Table 3. Our approach also results in the expected performance improvement on MD17. MD17 is a dataset comprising a large number of non-equilibrium molecular structures and their corresponding force fields, which serves to evaluate a model’s understanding of molecular dynamics. However, previous pre-training methods based solely on denoising have only learned force field patterns at static equilibrium states, failing to adequately capture the dynamic evolution of molecular systems. In contrast, our MolSpectra learns the dynamic evolution of molecules by understanding energy level transition patterns, thereby outperforming denoising-based pre-training methods. 4.3 SENSITIVITY ANALYSIS OF PATCH LENGTH Pi, STRIDE Di, AND MASK RATIO α We conduct experiments to evaluate the impact of patch length Pi, stride Di, and mask ratio α. Results are summarized in Table 4 and Table 5. From Table 4, we observe that when consecutive patches have overlap (Di < Pi), the performance of pre-training is superior compared to scenarios without overlap (Di = Pi). Specifically, the performance is optimal when the stride is half of the patch length. 
This is because appropriate overlap can better preserve and capture local features, particularly the information at the patch boundaries. Additionally, we find that choosing an appropriate patch length further enhances performance. In our experiments, the configuration of Pi = 20, Di = 10 yields the best results. Table 4: Sensitivity of patch length and stride. ![](images/07aae76295c011d4cdd34a6c00be2fe8427447017185173c5e25e9468ccf833d.jpg) Table 5: Sensitivity of mask ratio. ![](images/7c18349955cc82bf082d353ddb8f1ff323edb09a34c2ad852dba320d3a0a3faa.jpg) Regarding the mask ratio, α = 0.10 is a preferable choice. A small mask ratio results in insufficient MPR optimization, hindering SpecFormer training. Conversely, a large mask ratio causes excessive spectral perturbation, degrading performance when aligning with the 3D representations with the contrastive objective. An appropriate mask ratio strikes a balance between these two aspects. # 4.4 ABLATION STUDY To rigorously demonstrate the contributions of masked patches reconstruction, the incorporation of molecular spectra, and each spectral modality, we conducted an ablation study on them. Ablation study of masked patches reconstruction. We remove the MPR loss to analyze the impact of masked patches reconstruction, referred to as “w/o MPR” in Table 6. Removing the MPR objective leads to performance deterioration. This is consistent with the sensitivity analysis of the mask ratio α in Section 4.3, as removing MPR is an extreme case where α = 0. This decline is due to the lack of effective guidance in training SpecFormer. Using an undertrained SpecFormer for contrastive learning with 3D encoder outputs limits performance improvement. Table 6: Ablation of optimization objectives. ![](images/71b0bba244aef7a34cb53a9293a46808b73589fc9a53aa7b50884d5e277c2b36.jpg) Ablation study of molecular spectra. 
We retain only the denoising loss, removing both the MPR loss and contrastive loss, referred to as “w/o MPR, Contrast” in Table 6. The only difference between this variant and MolSpectra is the incorporation of molecular spectra into the pre-training. The “w/o MPR, Contrast” results are inferior to those of MolSpectra, highlighting that incorporating molecular spectra effectively enhances the quality and generalizability of molecular 3D representations. Ablation study of each spectral modality. To evaluate the contributions of each spectral modality to the performance, we conduct an ablation study for each modality. The results are presented in Table 7. It can be observed that each spectral modality contributes differently, with the UV-Vis spectrum having the smallest contribution and the IR spectrum the largest, likely due to the varying information content in each modality. Table 7: Ablation of spectral modalities. ![](images/43ac77c0fc35ecea17e2a91075a4f1c272643487659bc83241629c4a80c6ae86.jpg)",
verify the effectiveness of molecular spectra in the training-from-scratch method for the downstream task. Furthermore, we evaluate the effectiveness of our pre-training framework MolSpectra.", + "edge": [], + "level": 2, + "visual_node": [] + }, + { + "name": "4.1 Effectiveness in Training from Scratch", + "content": "# 4.1 EFFECTIVENESS OF MOLECULAR SPECTRA IN TRAINING FROM SCRATCH This pilot experiment aims to demonstrate the rationality for incorporating molecular spectra into pre-training. We introduce additional spectral features into a train-from-scratch molecular property prediction model to observe the impact of spectral information on prediction outcomes. We employ EGNN (Satorras et al., 2021), a representative 3D molecular encoder, equipped with an MLP-based prediction head as the baseline model. While EGNN encodes the 3D representations, the UV-Vis spectrum of each molecule provided by the QM9S (Zou et al., 2023) dataset is encoded into spectral representations by a spectrum encoder. Before making predictions with the final MLP, we concatenate the spectral and 3D representations for prediction. The results are presented in Table 1. Table 1: Performance (MAE ↓) when training from scratch on QM9 dataset. ![](images/084cd722defc01e058c1747c1103ac680c0cd8217a93077cab2a30ae06000a37.jpg) We observe that by directly concatenating spectral representations, the performance of molecular property prediction can be effectively enhanced. This indicates that the information from molecular spectra is beneficial for downstream molecular property prediction. 
Further incorporating molecular spectra into the pre-training phase of molecular representation has the potential to enhance the informativeness and generalization capability of the representations, thereby broadly improving the performance of downstream tasks.", + "edge": [ + "4.1 Methodology", + "4.1 Results Analysis" + ], + "level": 2, + "visual_node": [ + "![](images/084cd722defc01e058c1747c1103ac680c0cd8217a93077cab2a30ae06000a37.jpg)" + ] + }, + { + "name": "4.1 Methodology", + "content": "This pilot experiment aims to demonstrate the rationality for incorporating molecular spectra into pre-training. We introduce additional spectral features into a train-from-scratch molecular property prediction model to observe the impact of spectral information on prediction outcomes. We employ EGNN (Satorras et al., 2021), a representative 3D molecular encoder, equipped with an MLP-based prediction head as the baseline model. While EGNN encodes the 3D representations, the UV-Vis spectrum of each molecule provided by the QM9S (Zou et al., 2023) dataset is encoded into spectral representations by a spectrum encoder. Before making predictions with the final MLP, we concatenate the spectral and 3D representations for prediction. The results are presented in Table 1. Table 1: Performance (MAE ↓) when training from scratch on QM9 dataset. ![](images/084cd722defc01e058c1747c1103ac680c0cd8217a93077cab2a30ae06000a37.jpg)", + "edge": [], + "level": 3, + "visual_node": [ + "![](images/084cd722defc01e058c1747c1103ac680c0cd8217a93077cab2a30ae06000a37.jpg)" + ] + }, + { + "name": "4.1 Results Analysis", + "content": "We observe that by directly concatenating spectral representations, the performance of molecular property prediction can be effectively enhanced. This indicates that the information from molecular spectra is beneficial for downstream molecular property prediction. 
Further incorporating molecular spectra into the pre-training phase of molecular representation has the potential to enhance the informativeness and generalization capability of the representations, thereby broadly improving the performance of downstream tasks.", + "edge": [], + "level": 3, + "visual_node": [] + }, + { + "name": "4.2 Effectiveness in Representation Pre-training", + "content": "# 4.2 EFFECTIVENESS OF MOLECULAR SPECTRA IN REPRESENTATION PRE-TRAINING We conduct experiments to evaluate MolSpectra by first introducing spectral data into the pretraining of 3D representations, followed by evaluating the performance on downstream tasks. For a comprehensive comparison, two types of baselines are adopted: (1) training-from-scratch methods, including SchNet (Schutt et al. ¨ , 2017), EGNN, DimeNet (Klicpera et al., 2020b), DimeNet++ (Klicpera et al., 2020a), PaiNN (Schutt et al. ¨ , 2021), SphereNet (Liu et al., 2021), and TorchMD-Net (Tholke & Fabritiis ¨ , 2022); and (2) pre-training methods, including Transformer-M (Luo et al., 2023), SE(3)-DDM (Liu et al., 2023b), 3D-EMGP (Jiao et al., 2023), and Coord. MolSpectra can be seamlessly plugged into any existing denoising method. To evaluate the enhancement provided by our method compared to denoising alone, we select the representative coordinate denoising (Coord) as our denoising sub-objective. This method also serves as our primary baseline. # 4.2.1 PRE-TRAINING DATASET. As described in Section 3.4, we first perform denoising pre-training on the PCQM4Mv2 (Nakata & Shimazaki, 2017) dataset, followed by a second stage of pre-training on the QM9Spectra (QM9S) (Zou et al., 2023) dataset, which includes multi-modal molecular energy spectra. In both stages, we adopt the denoising objective provided by Coord (Zaidi et al., 2023), as defined in Eq. 2. The QM9S dataset comprises organic molecules from the QM9 (Ramakrishnan et al., 2014) dataset. 
The UV-Vis, IR, and Raman spectra of the molecules are calculated at the B3LYP/def-TZVP level of theory, through frequency analysis and time-dependent density functional theory (TD-DFT). # 4.2.2 QM9 The QM9 dataset is a quantum chemistry dataset comprising over 134,000 small molecules, each consisting of up to 9 heavy atoms (C, N, O, F) and additional H atoms. This dataset provides an equilibrium geometric conformation for each molecule along with 12 property labels. The dataset is divided into a training set of 110k molecules, a validation set of 10k molecules, and a test set containing the remaining over 10k molecules. Prediction errors are measured using the mean absolute error (MAE). The experimental results are presented in Table 2. The 3D molecular representations pre-trained using our method are fine-tuned and used for prediction across various properties, achieving state-of-the-art performance in 8 out of 12 properties and outperforms Coord in 10 out of 12 properties. In conjunction with the observations in Section 4.1, the performance improvement can be attributed to our incorporation of an understanding of molecular spectra and the knowledge they entail into the 3D molecular representations. Table 2: Performance (MAE↓) on QM9 dataset. The compared methods are divided into two groups: training from scratch and pre-training then fine-tuning. The best results are highlighted in bold. ![](images/5237f9010e1b7bb79b84bdc91c83cb5152b351c1f8b05fe25b9f0b961f759e2d.jpg) Table 3: Performance (MAE↓) on MD17 force prediction (kcal/mol/ ˚A). The methods are divided into two groups: training from scratch and pre-training then fine-tuning. The best results are in bold. ![](images/1695d160ca54e8992fed88c81378f03bd31abccebafebc3c6a3d5dcadc6a747f.jpg) # 4.2.3 MD17 The MD17 dataset contains molecular dynamics trajectories for eight organic molecules, including aspirin, benzene, and ethanol. It offers 150k to nearly 1M conformations per molecule, with energy and force labels. 
Unlike QM9, MD17 emphasizes dynamic behavior in addition to static properties. We use a standard limited data split: models train on 1k samples, validate on 50, and test on the rest. Performance is evaluated using MAE, with results in Table 3. Our approach also results in the expected performance improvement on MD17. MD17 is a dataset comprising a large number of non-equilibrium molecular structures and their corresponding force fields, which serves to evaluate a model’s understanding of molecular dynamics. However, previous pre-training methods based solely on denoising have only learned force field patterns at static equilibrium states, failing to adequately capture the dynamic evolution of molecular systems. In contrast, our MolSpectra learns the dynamic evolution of molecules by understanding energy level transition patterns, thereby outperforming denoising-based pre-training methods.", + "edge": [ + "4.2 Baselines and Setup", + "4.2.1 Pre-training Dataset", + "4.2.2 QM9 Evaluation", + "4.2.3 MD17 Evaluation" + ], + "level": 2, + "visual_node": [ + "![](images/5237f9010e1b7bb79b84bdc91c83cb5152b351c1f8b05fe25b9f0b961f759e2d.jpg)", + "![](images/1695d160ca54e8992fed88c81378f03bd31abccebafebc3c6a3d5dcadc6a747f.jpg)" + ] + }, + { + "name": "4.2 Baselines and Setup", + "content": "We conduct experiments to evaluate MolSpectra by first introducing spectral data into the pretraining of 3D representations, followed by evaluating the performance on downstream tasks. For a comprehensive comparison, two types of baselines are adopted: (1) training-from-scratch methods, including SchNet (Schutt et al. ¨ , 2017), EGNN, DimeNet (Klicpera et al., 2020b), DimeNet++ (Klicpera et al., 2020a), PaiNN (Schutt et al. ¨ , 2021), SphereNet (Liu et al., 2021), and TorchMD-Net (Tholke & Fabritiis ¨ , 2022); and (2) pre-training methods, including Transformer-M (Luo et al., 2023), SE(3)-DDM (Liu et al., 2023b), 3D-EMGP (Jiao et al., 2023), and Coord. 
MolSpectra can be seamlessly plugged into any existing denoising method. To evaluate the enhancement provided by our method compared to denoising alone, we select the representative coordinate denoising (Coord) as our denoising sub-objective. This method also serves as our primary baseline.", + "edge": [], + "level": 3, + "visual_node": [] + }, + { + "name": "4.2.1 Pre-training Dataset", + "content": "# 4.2.1 PRE-TRAINING DATASET. As described in Section 3.4, we first perform denoising pre-training on the PCQM4Mv2 (Nakata & Shimazaki, 2017) dataset, followed by a second stage of pre-training on the QM9Spectra (QM9S) (Zou et al., 2023) dataset, which includes multi-modal molecular energy spectra. In both stages, we adopt the denoising objective provided by Coord (Zaidi et al., 2023), as defined in Eq. 2. The QM9S dataset comprises organic molecules from the QM9 (Ramakrishnan et al., 2014) dataset. The UV-Vis, IR, and Raman spectra of the molecules are calculated at the B3LYP/def-TZVP level of theory, through frequency analysis and time-dependent density functional theory (TD-DFT).", + "edge": [], + "level": 3, + "visual_node": [] + }, + { + "name": "4.2.2 QM9 Evaluation", + "content": "# 4.2.2 QM9 The QM9 dataset is a quantum chemistry dataset comprising over 134,000 small molecules, each consisting of up to 9 heavy atoms (C, N, O, F) and additional H atoms. This dataset provides an equilibrium geometric conformation for each molecule along with 12 property labels. The dataset is divided into a training set of 110k molecules, a validation set of 10k molecules, and a test set containing the remaining over 10k molecules. Prediction errors are measured using the mean absolute error (MAE). The experimental results are presented in Table 2. 
The 3D molecular representations pre-trained using our method are fine-tuned and used for prediction across various properties, achieving state-of-the-art performance in 8 out of 12 properties and outperforming Coord in 10 out of 12 properties. In conjunction with the observations in Section 4.1, the performance improvement can be attributed to our incorporation of an understanding of molecular spectra and the knowledge they entail into the 3D molecular representations. Table 2: Performance (MAE↓) on QM9 dataset. The compared methods are divided into two groups: training from scratch and pre-training then fine-tuning. The best results are highlighted in bold. ![](images/5237f9010e1b7bb79b84bdc91c83cb5152b351c1f8b05fe25b9f0b961f759e2d.jpg)",
![](images/5237f9010e1b7bb79b84bdc91c83cb5152b351c1f8b05fe25b9f0b961f759e2d.jpg)", + "edge": [], + "level": 4, + "visual_node": [ + "![](images/5237f9010e1b7bb79b84bdc91c83cb5152b351c1f8b05fe25b9f0b961f759e2d.jpg)" + ] + }, + { + "name": "QM9 Performance Analysis", + "content": "The 3D molecular representations pre-trained using our method are fine-tuned and used for prediction across various properties, achieving state-of-the-art performance in 8 out of 12 properties and outperforms Coord in 10 out of 12 properties. In conjunction with the observations in Section 4.1, the performance improvement can be attributed to our incorporation of an understanding of molecular spectra and the knowledge they entail into the 3D molecular representations.", + "edge": [], + "level": 4, + "visual_node": [] + }, + { + "name": "4.2.3 MD17 Evaluation", + "content": "Table 3: Performance (MAE↓) on MD17 force prediction (kcal/mol/ ˚A). The methods are divided into two groups: training from scratch and pre-training then fine-tuning. The best results are in bold. ![](images/1695d160ca54e8992fed88c81378f03bd31abccebafebc3c6a3d5dcadc6a747f.jpg) # 4.2.3 MD17 The MD17 dataset contains molecular dynamics trajectories for eight organic molecules, including aspirin, benzene, and ethanol. It offers 150k to nearly 1M conformations per molecule, with energy and force labels. Unlike QM9, MD17 emphasizes dynamic behavior in addition to static properties. We use a standard limited data split: models train on 1k samples, validate on 50, and test on the rest. Performance is evaluated using MAE, with results in Table 3. Our approach also results in the expected performance improvement on MD17. MD17 is a dataset comprising a large number of non-equilibrium molecular structures and their corresponding force fields, which serves to evaluate a model’s understanding of molecular dynamics. 
However, previous pre-training methods based solely on denoising have only learned force field patterns at static equilibrium states, failing to adequately capture the dynamic evolution of molecular systems. In contrast, our MolSpectra learns the dynamic evolution of molecules by understanding energy level transition patterns, thereby outperforming denoising-based pre-training methods.", + "edge": [ + "MD17 Dataset Description", + "MD17 Performance Analysis" + ], + "level": 3, + "visual_node": [ + "![](images/1695d160ca54e8992fed88c81378f03bd31abccebafebc3c6a3d5dcadc6a747f.jpg)" + ] + }, + { + "name": "MD17 Dataset Description", + "content": "Table 3: Performance (MAE↓) on MD17 force prediction (kcal/mol/ ˚A). The methods are divided into two groups: training from scratch and pre-training then fine-tuning. The best results are in bold. ![](images/1695d160ca54e8992fed88c81378f03bd31abccebafebc3c6a3d5dcadc6a747f.jpg) # 4.2.3 MD17 The MD17 dataset contains molecular dynamics trajectories for eight organic molecules, including aspirin, benzene, and ethanol. It offers 150k to nearly 1M conformations per molecule, with energy and force labels. Unlike QM9, MD17 emphasizes dynamic behavior in addition to static properties. We use a standard limited data split: models train on 1k samples, validate on 50, and test on the rest. Performance is evaluated using MAE, with results in Table 3.", + "edge": [], + "level": 4, + "visual_node": [ + "![](images/1695d160ca54e8992fed88c81378f03bd31abccebafebc3c6a3d5dcadc6a747f.jpg)" + ] + }, + { + "name": "MD17 Performance Analysis", + "content": "Our approach also results in the expected performance improvement on MD17. MD17 is a dataset comprising a large number of non-equilibrium molecular structures and their corresponding force fields, which serves to evaluate a model’s understanding of molecular dynamics. 
However, previous pre-training methods based solely on denoising have only learned force field patterns at static equilibrium states, failing to adequately capture the dynamic evolution of molecular systems. In contrast, our MolSpectra learns the dynamic evolution of molecules by understanding energy level transition patterns, thereby outperforming denoising-based pre-training methods.", + "edge": [], + "level": 4, + "visual_node": [] + }, + { + "name": "4.3 Sensitivity Analysis", + "content": "4.3 SENSITIVITY ANALYSIS OF PATCH LENGTH Pi, STRIDE Di, AND MASK RATIO α We conduct experiments to evaluate the impact of patch length Pi, stride Di, and mask ratio α. Results are summarized in Table 4 and Table 5. From Table 4, we observe that when consecutive patches have overlap (Di < Pi), the performance of pre-training is superior compared to scenarios without overlap (Di = Pi). Specifically, the performance is optimal when the stride is half of the patch length. This is because appropriate overlap can better preserve and capture local features, particularly the information at the patch boundaries. Additionally, we find that choosing an appropriate patch length further enhances performance. In our experiments, the configuration of Pi = 20, Di = 10 yields the best results. Table 4: Sensitivity of patch length and stride. ![](images/07aae76295c011d4cdd34a6c00be2fe8427447017185173c5e25e9468ccf833d.jpg) Table 5: Sensitivity of mask ratio. ![](images/7c18349955cc82bf082d353ddb8f1ff323edb09a34c2ad852dba320d3a0a3faa.jpg) Regarding the mask ratio, α = 0.10 is a preferable choice. A small mask ratio results in insufficient MPR optimization, hindering SpecFormer training. Conversely, a large mask ratio causes excessive spectral perturbation, degrading performance when aligning with the 3D representations with the contrastive objective. 
An appropriate mask ratio strikes a balance between these two aspects.", + "edge": [ + "Patch Length and Stride Analysis", + "Mask Ratio Analysis" + ], + "level": 2, + "visual_node": [ + "![](images/07aae76295c011d4cdd34a6c00be2fe8427447017185173c5e25e9468ccf833d.jpg)", + "![](images/7c18349955cc82bf082d353ddb8f1ff323edb09a34c2ad852dba320d3a0a3faa.jpg)" + ] + }, + { + "name": "Patch Length and Stride Analysis", + "content": "We conduct experiments to evaluate the impact of patch length Pi, stride Di, and mask ratio α. Results are summarized in Table 4 and Table 5. From Table 4, we observe that when consecutive patches have overlap (Di < Pi), the performance of pre-training is superior compared to scenarios without overlap (Di = Pi). Specifically, the performance is optimal when the stride is half of the patch length. This is because appropriate overlap can better preserve and capture local features, particularly the information at the patch boundaries. Additionally, we find that choosing an appropriate patch length further enhances performance. In our experiments, the configuration of Pi = 20, Di = 10 yields the best results. Table 4: Sensitivity of patch length and stride. ![](images/07aae76295c011d4cdd34a6c00be2fe8427447017185173c5e25e9468ccf833d.jpg)", + "edge": [], + "level": 3, + "visual_node": [ + "![](images/07aae76295c011d4cdd34a6c00be2fe8427447017185173c5e25e9468ccf833d.jpg)" + ] + }, + { + "name": "Mask Ratio Analysis", + "content": "Table 5: Sensitivity of mask ratio. ![](images/7c18349955cc82bf082d353ddb8f1ff323edb09a34c2ad852dba320d3a0a3faa.jpg) Regarding the mask ratio, α = 0.10 is a preferable choice. A small mask ratio results in insufficient MPR optimization, hindering SpecFormer training. Conversely, a large mask ratio causes excessive spectral perturbation, degrading performance when aligning with the 3D representations with the contrastive objective. 
An appropriate mask ratio strikes a balance between these two aspects.", + "edge": [], + "level": 3, + "visual_node": [ + "![](images/7c18349955cc82bf082d353ddb8f1ff323edb09a34c2ad852dba320d3a0a3faa.jpg)" + ] + }, + { + "name": "4.4 Ablation Study", + "content": "# 4.4 ABLATION STUDY To rigorously demonstrate the contributions of masked patches reconstruction, the incorporation of molecular spectra, and each spectral modality, we conducted an ablation study on them. Ablation study of masked patches reconstruction. We remove the MPR loss to analyze the impact of masked patches reconstruction, referred to as “w/o MPR” in Table 6. Removing the MPR objective leads to performance deterioration. This is consistent with the sensitivity analysis of the mask ratio α in Section 4.3, as removing MPR is an extreme case where α = 0. This decline is due to the lack of effective guidance in training SpecFormer. Using an undertrained SpecFormer for contrastive learning with 3D encoder outputs limits performance improvement. Table 6: Ablation of optimization objectives. ![](images/71b0bba244aef7a34cb53a9293a46808b73589fc9a53aa7b50884d5e277c2b36.jpg) Ablation study of molecular spectra. We retain only the denoising loss, removing both the MPR loss and contrastive loss, referred to as “w/o MPR, Contrast” in Table 6. The only difference between this variant and MolSpectra is the incorporation of molecular spectra into the pre-training. The ”w/o MPR, Contrast” results are inferior to those of MolSpectra, highlighting that incorporating molecular spectra effectively enhances the quality and generalizability of molecular 3D representations. Ablation study of each spectral modality. To evaluate the contributions of each spectral modality to the performance, we conduct an ablation study for each modality. The results are presented in Table 7. 
It can be observed that each spectral modality contributes differently, with the UV-Vis spectrum having the smallest contribution and the IR spectrum the largest, likely due to the varying information content in each modality. Table 7: Ablation of spectral modalities. ![](images/43ac77c0fc35ecea17e2a91075a4f1c272643487659bc83241629c4a80c6ae86.jpg)", + "edge": [ + "MPR Ablation", + "Spectra and Modality Ablation" + ], + "level": 2, + "visual_node": [ + "![](images/71b0bba244aef7a34cb53a9293a46808b73589fc9a53aa7b50884d5e277c2b36.jpg)", + "![](images/43ac77c0fc35ecea17e2a91075a4f1c272643487659bc83241629c4a80c6ae86.jpg)" + ] + }, + { + "name": "MPR Ablation", + "content": "To rigorously demonstrate the contributions of masked patches reconstruction, the incorporation of molecular spectra, and each spectral modality, we conducted an ablation study on them. Ablation study of masked patches reconstruction. We remove the MPR loss to analyze the impact of masked patches reconstruction, referred to as “w/o MPR” in Table 6. Removing the MPR objective leads to performance deterioration. This is consistent with the sensitivity analysis of the mask ratio α in Section 4.3, as removing MPR is an extreme case where α = 0. This decline is due to the lack of effective guidance in training SpecFormer. Using an undertrained SpecFormer for contrastive learning with 3D encoder outputs limits performance improvement. Table 6: Ablation of optimization objectives. ![](images/71b0bba244aef7a34cb53a9293a46808b73589fc9a53aa7b50884d5e277c2b36.jpg)", + "edge": [], + "level": 3, + "visual_node": [ + "![](images/71b0bba244aef7a34cb53a9293a46808b73589fc9a53aa7b50884d5e277c2b36.jpg)" + ] + }, + { + "name": "Spectra and Modality Ablation", + "content": "Ablation study of molecular spectra. We retain only the denoising loss, removing both the MPR loss and contrastive loss, referred to as “w/o MPR, Contrast” in Table 6. 
The only difference between this variant and MolSpectra is the incorporation of molecular spectra into the pre-training. The “w/o MPR, Contrast” results are inferior to those of MolSpectra, highlighting that incorporating molecular spectra effectively enhances the quality and generalizability of molecular 3D representations. Ablation study of each spectral modality. To evaluate the contributions of each spectral modality to the performance, we conduct an ablation study for each modality. The results are presented in Table 7. It can be observed that each spectral modality contributes differently, with the UV-Vis spectrum having the smallest contribution and the IR spectrum the largest, likely due to the varying information content in each modality. Table 7: Ablation of spectral modalities. ![](images/43ac77c0fc35ecea17e2a91075a4f1c272643487659bc83241629c4a80c6ae86.jpg)",
It can be observed that each spectral modality contributes differently, with the UV-Vis spectrum having the smallest contribution and the IR spectrum the largest, likely due to the varying information content in each modality. Table 7: Ablation of spectral modalities. ![](images/43ac77c0fc35ecea17e2a91075a4f1c272643487659bc83241629c4a80c6ae86.jpg)", + "edge": [], + "level": 4, + "visual_node": [ + "![](images/43ac77c0fc35ecea17e2a91075a4f1c272643487659bc83241629c4a80c6ae86.jpg)" + ] + }, + { + "name": "6 CONCLUSION.md", + "content": "# 6 CONCLUSION In this study, we explore pre-training molecular 3D representations beyond classical mechanics. By leveraging the correlation between molecular energy level structures and molecular spectra in quantum mechanics, we introduce molecular spectra for pre-training molecular 3D representations (MolSpectra). By aligning the 3D encoder trained with a denoising objective and the spectrum encoder trained with a masked patch reconstruction objective, we enhance the informativeness and transferability of the resulting 3D representations.", + "edge": [ + "Study Overview", + "Methodology and Outcomes" + ], + "level": 1, + "visual_node": [] + }, + { + "name": "Study Overview", + "content": "In this study, we explore pre-training molecular 3D representations beyond classical mechanics.", + "edge": [], + "level": 2, + "visual_node": [] + }, + { + "name": "Methodology and Outcomes", + "content": "By leveraging the correlation between molecular energy level structures and molecular spectra in quantum mechanics, we introduce molecular spectra for pre-training molecular 3D representations (MolSpectra). 
By aligning the 3D encoder trained with a denoising objective and the spectrum encoder trained with a masked patch reconstruction objective, we enhance the informativeness and transferability of the resulting 3D representations.", + "edge": [ + "Proposed Method", + "Technical Alignment and Results" + ], + "level": 2, + "visual_node": [] + }, + { + "name": "Proposed Method", + "content": "By leveraging the correlation between molecular energy level structures and molecular spectra in quantum mechanics, we introduce molecular spectra for pre-training molecular 3D representations (MolSpectra).", + "edge": [], + "level": 3, + "visual_node": [] + }, + { + "name": "Technical Alignment and Results", + "content": "By aligning the 3D encoder trained with a denoising objective and the spectrum encoder trained with a masked patch reconstruction objective, we enhance the informativeness and transferability of the resulting 3D representations.", + "edge": [], + "level": 3, + "visual_node": [] + }, + { + "name": "![](images/d1786526fd710a309dcca5721e53bf82d19ca2622ac3314d3bfc46284f668f1d.jpg)", + "caption": "Figure 1: The conceptual view of MolSpectra, which leverages both molecular conformation and spectra for pre-training. 
Prior works only model classical mechanics by denoising on conformations.", + "visual_node": 1, + "formula": 0, + "resolution": "1091x372" + }, + { + "name": "![](images/b1279274391a8ee385cd510483726424c1b7bb542dd3dfcaa664dd7784591ce0.jpg)", + "caption": "Equation representing the equivalence between denoising and learning molecular force fields.", + "visual_node": 1, + "formula": 1, + "resolution": "633x83" + }, + { + "name": "![](images/7ad84d22cd3c911b24b3de89b1077c3b9f08c6b762840ecd38160beb882f3a16.jpg)", + "caption": "Equation defining the denoising-based energy function ECoord derived from isotropic Gaussian noise.", + "visual_node": 1, + "formula": 1, + "resolution": "436x69" + }, + { + "name": "![](images/b5f78fc1f86d0c7c6f996497a567b0462a6e0529447787d81ffe3fdbb71b3ccb.jpg)", + "caption": "Equation defining the energy function induced by anisotropic Gaussian noise on dihedral angles and coordinates.", + "visual_node": 1, + "formula": 1, + "resolution": "475x58" + }, + { + "name": "![](images/1ee1e5bd4e35d7acf3b80e7404b113daf11b6925490369aaa637e60964fb744d.jpg)", + "caption": "Figure 2: Overview of the MolSpectra pre-training framework. 
Our pre-training framework comprises three sub-objectives: the denoising objective and the MPR objective, which respectively guide the representation learning of the 3D and spectral modalities, and the contrastive objective, which aligns the representations of both modalities.", + "visual_node": 1, + "formula": 0, + "resolution": "1102x619" + }, + { + "name": "![](images/bdae202d54ee161317731819baf6967d189260b12ff333dc514a8cc065c47f97.jpg)", + "caption": "Equation defining the energy function derived from classical molecular potential energy theory involving bond stretching, bending, and torsion.", + "visual_node": 1, + "formula": 1, + "resolution": "877x131" + }, + { + "name": "![](images/6c50a187dd8276b7372af9e7e00b99521d3741fa210eee7a1af570b334053570.jpg)", + "caption": "Figure 3: Illustration of intra-spectrum (left) and interspectrum (right) dependencies.", + "visual_node": 1, + "formula": 0, + "resolution": "656x350" + }, + { + "name": "![](images/16005489fb642f13912d85a6d90523ef0a5e1d56b5c30d9d2ee35336fdb04f4f.jpg)", + "caption": "Equation for the scaled dot-product attention output in the multi-head attention block.", + "visual_node": 1, + "formula": 1, + "resolution": "659x75" + }, + { + "name": "![](images/6b6a48b7abea8c9dc0c3c72345c9c94cbb0ac1d1ef8824d485ff0ace9b5a0a1e.jpg)", + "caption": "Equation defining the masked patches reconstruction (MPR) loss function.", + "visual_node": 1, + "formula": 1, + "resolution": "420x95" + }, + { + "name": "![](images/35c4ba087f1ae7299aa554638aef5647f9efa2de01450fec172beb01bf64538e.jpg)", + "caption": "Equation defining the InfoNCE contrastive objective function.", + "visual_node": 1, + "formula": 1, + "resolution": "1106x92" + }, + { + "name": "![](images/977b41496865477e7652249bec51630a1a037097fa74f17c10e8a72851cd7ce3.jpg)", + "caption": "Equation defining the complete objective function combining denoising, MPR, and contrastive losses.", + "visual_node": 1, + "formula": 1, + "resolution": "628x36" + }, + { + "name": 
"![](images/084cd722defc01e058c1747c1103ac680c0cd8217a93077cab2a30ae06000a37.jpg)", + "caption": "Table 1: Performance (MAE ↓) when training from scratch on QM9 dataset.", + "visual_node": 1, + "formula": 0, + "resolution": "1100x156" + }, + { + "name": "![](images/5237f9010e1b7bb79b84bdc91c83cb5152b351c1f8b05fe25b9f0b961f759e2d.jpg)", + "caption": "Table 2: Performance (MAE↓) on QM9 dataset. The compared methods are divided into two groups: training from scratch and pre-training then fine-tuning. The best results are highlighted in bold.", + "visual_node": 1, + "formula": 0, + "resolution": "1095x394" + }, + { + "name": "![](images/1695d160ca54e8992fed88c81378f03bd31abccebafebc3c6a3d5dcadc6a747f.jpg)", + "caption": "Table 3: Performance (MAE↓) on MD17 force prediction (kcal/mol/ ˚A). The methods are divided into two groups: training from scratch and pre-training then fine-tuning. The best results are in bold.", + "visual_node": 1, + "formula": 0, + "resolution": "986x327" + }, + { + "name": "![](images/07aae76295c011d4cdd34a6c00be2fe8427447017185173c5e25e9468ccf833d.jpg)", + "caption": "Table 4: Sensitivity of patch length and stride.", + "visual_node": 1, + "formula": 0, + "resolution": "598x231" + }, + { + "name": "![](images/7c18349955cc82bf082d353ddb8f1ff323edb09a34c2ad852dba320d3a0a3faa.jpg)", + "caption": "Table 5: Sensitivity of mask ratio.", + "visual_node": 1, + "formula": 0, + "resolution": "358x231" + }, + { + "name": "![](images/71b0bba244aef7a34cb53a9293a46808b73589fc9a53aa7b50884d5e277c2b36.jpg)", + "caption": "Table 6: Ablation of optimization objectives.", + "visual_node": 1, + "formula": 0, + "resolution": "486x150" + }, + { + "name": "![](images/43ac77c0fc35ecea17e2a91075a4f1c272643487659bc83241629c4a80c6ae86.jpg)", + "caption": "Table 7: Ablation of spectral modalities.", + "visual_node": 1, + "formula": 0, + "resolution": "508x192" + } + ] +} \ No newline at end of file