Daily Papers

by AK and the research community

Apr 15

ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning

Chemical reasoning usually involves complex, multi-step processes that demand precise calculations, where even minor errors can lead to cascading failures. Furthermore, large language models (LLMs) encounter difficulties handling domain-specific formulas, executing reasoning steps accurately, and integrating code effectively when tackling chemical reasoning tasks. To address these challenges, we present ChemAgent, a novel framework designed to improve the performance of LLMs through a dynamic, self-updating library. This library is developed by decomposing chemical tasks into sub-tasks and compiling these sub-tasks into a structured collection that can be referenced for future queries. Then, when presented with a new problem, ChemAgent retrieves and refines pertinent information from the library, which we call memory, facilitating effective task decomposition and the generation of solutions. Our method designs three types of memory and a library-enhanced reasoning component, enabling LLMs to improve over time through experience. Experimental results on four chemical reasoning datasets from SciBench demonstrate that ChemAgent achieves performance gains of up to 46% (GPT-4), significantly outperforming existing methods. Our findings suggest substantial potential for future applications, including tasks such as drug discovery and materials science. Our code can be found at https://github.com/gersteinlab/chemagent

  • 12 authors
·
Jan 11, 2025

Optimizing Operation Recipes with Reinforcement Learning for Safe and Interpretable Control of Chemical Processes

Optimal operation of chemical processes is vital for energy, resource, and cost savings in chemical engineering. The problem of optimal operation can be tackled with reinforcement learning, but traditional reinforcement learning methods face challenges due to hard constraints related to quality and safety that must be strictly satisfied, and the large amount of required training data. Chemical processes often cannot provide sufficient experimental data, and while detailed dynamic models can be an alternative, their complexity makes it computationally intractable to generate the needed data. Optimal control methods, such as model predictive control, also struggle with the complexity of the underlying dynamic models. Consequently, many chemical processes rely on manually defined operation recipes combined with simple linear controllers, leading to suboptimal performance and limited flexibility. In this work, we propose a novel approach that leverages expert knowledge embedded in operation recipes. By using reinforcement learning to optimize the parameters of these recipes and their underlying linear controllers, we achieve an optimized operation recipe. This method requires significantly less data, handles constraints more effectively, and is more interpretable than traditional reinforcement learning methods due to the structured nature of the recipes. We demonstrate the potential of our approach through simulation results of an industrial batch polymerization reactor, showing that it can approach the performance of optimal controllers while addressing the limitations of existing methods.

  • 2 authors
·
Nov 20, 2025

A Simple Iterative Approach for Constant Chemical Potential Simulations at Interfaces

Chemical potential of species in solution is essential for understanding various chemical processes at interfaces. Molecular dynamics (MD) simulations, constrained by fixed compositions, cannot satisfy a constant chemical potential condition as solute species can migrate to the interface and deplete the bulk due to solute-interface interactions. In this study, we introduce a simple and computationally efficient approach named iterative constant chemical potential molecular dynamics (iCuMD) simulation, which helps simulate targeted molar concentrations of species in solution. iCuMD overcomes the limitations of conventional MD by adjusting the number of species in the solution to reach a target concentration (chemical potential). We demonstrate our approach using solid-liquid and liquid-air interfacial systems as case studies. Specifically, we perform classical force field-based MD simulations of NaCl(aq)-air and NaCl(aq)-graphite interfaces and machine learning interatomic potential (MLIP)-based MD simulations of the Na2SO4(aq)-graphene interface. Our results show that the iCuMD approach efficiently achieves the desired bulk ion concentration within two iterations and can also be integrated with MLIP-driven simulations which enable constant potential simulations with DFT-level accuracy. We show that iCuMD offers a robust and simple computational framework for constant chemical potential simulations as its only requirement is to be able to converge interfacial simulations with a measurable bulk region.

  • 3 authors
·
Jun 1, 2025

Chemical classification program synthesis using generative artificial intelligence

Accurately classifying chemical structures is essential for cheminformatics and bioinformatics, including tasks such as identifying bioactive compounds of interest, screening molecules for toxicity to humans, finding non-organic compounds with desirable material properties, or organizing large chemical libraries for drug discovery or environmental monitoring. However, manual classification is labor-intensive and difficult to scale to large chemical databases. Existing automated approaches rely either on manually constructed classification rules or on deep learning methods that lack explainability. This work presents an approach that uses generative artificial intelligence to automatically write chemical classifier programs for classes in the Chemical Entities of Biological Interest (ChEBI) database. These programs can be used for efficient deterministic run-time classification of SMILES structures, with natural language explanations. The programs themselves constitute an explainable computable ontological model of chemical class nomenclature, which we call the ChEBI Chemical Class Program Ontology (C3PO). We validated our approach against the ChEBI database, and compared our results against state-of-the-art deep learning models. We also demonstrate the use of C3PO to classify out-of-distribution examples taken from metabolomics repositories and natural product databases. Finally, we demonstrate the potential use of our approach to find systematic classification errors in existing chemical databases, and show how an ensemble artificial intelligence approach combining generated ontologies, automated literature search, and multimodal vision models can be used to pinpoint potential errors requiring expert validation.

  • 7 authors
·
May 23, 2025

Navigating Chemical-Linguistic Sharing Space with Heterogeneous Molecular Encoding

Chemical language models (CLMs) are prominent for their effectiveness in exploring chemical space and enabling molecular engineering. However, while exploring chemical-linguistic space, CLMs suffer from the gap between natural language and molecular representations. This challenge is primarily due to the inherent modeling differences between molecules and texts: molecules operate unified modeling to learn chemical space, while natural language sequentially models the semantic space. Additionally, the limited availability of high-quality text-to-molecule datasets further exacerbates this challenge. To address the problem, we first verified the information bias in molecular representations from different perspectives. We then developed the Heterogeneous Molecular Encoding (HME) framework, a unified molecular encoder compressing the molecular features from fragment sequence, topology, and conformation with Q-learning. To better model chemical-linguistic space, we further constructed the MCMoD dataset, which contains over one million molecules with various conditions, including properties, fragments, and descriptions. Experimentally, HME promotes CLMs to achieve chemical-linguistic sharing space exploration: (1) chemical space exploration with linguistic guidance, where HME achieves significant improvements (+37.8% FCD) for molecular design in multiple constraints, even in zero-shot scenarios; (2) linguistic space exploration with molecular guidance, where HME generates textual descriptions with high qualities (+11.6% BLEU) for molecules. These results highlight the precision of HME in handling multi-objective and cross-domain tasks, as well as its remarkable generalization capability on unseen task combinations. HME offers a new perspective on navigating chemical-linguistic sharing space, advancing the potential of CLMs in both fundamental research and practical applications in chemistry.

  • 8 authors
·
Dec 30, 2024

ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering

Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth. While QA datasets are plentiful in areas like general domain and biomedicine, academic chemistry is less explored. Chemical QA plays a crucial role in both education and research by effectively translating complex chemical information into a readily understandable format. Addressing this gap, we introduce ScholarChemQA, a large-scale QA dataset constructed from chemical papers. This dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of unlabeled data that can be potentially useful. Correspondingly, we introduce a QAMatch model, specifically designed to effectively answer chemical questions by fully leveraging our collected data. We first address the issue of imbalanced label distribution by re-weighting the instance-wise loss based on the inverse frequency of each class, ensuring minority classes are not dominated by majority ones during optimization. Next, we utilize the unlabeled data to enrich the learning process, generating a variety of augmentations based on a SoftMix operation and ensuring their predictions align with the same target, i.e., pseudo-labels. To ensure the quality of the pseudo-labels, we propose a calibration procedure aimed at closely aligning the pseudo-label estimates of individual samples with a desired ground truth distribution. Experiments show that our QAMatch significantly outperforms recent similar-scale baselines and Large Language Models (LLMs) not only on our ScholarChemQA dataset but also on four benchmark datasets. We hope our benchmark and model can facilitate and promote more research on chemical QA.

  • 10 authors
·
Jul 23, 2024
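
The ScholarChemQA abstract above mentions re-weighting the instance-wise loss by the inverse frequency of each class. The snippet below is a minimal sketch of that general idea only. The function names, the normalization (weights average to 1 over samples), and the plain cross-entropy loss are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class proportional to 1 / class frequency.

    Assumed normalization: weight = n / (k * count), so that the weights
    average to 1 over the samples. The paper may normalize differently.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Mean of -w[y] * log p(y) over samples.

    probs is a list of dicts mapping class -> predicted probability.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(p[y])
    return total / len(labels)


# Toy usage: three "yes" labels and one "no" label (hypothetical data).
labels = ["yes", "yes", "yes", "no"]
weights = inverse_frequency_weights(labels)
probs = [{"yes": 0.9, "no": 0.1}] * 4
loss = weighted_cross_entropy(probs, labels, weights)
```

With this weighting, a mistake on the rare class contributes more to the loss than a mistake on the frequent class, which is the effect the abstract describes.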

Chemical Physics of Controlled Wettability and Super Surfaces

Wetting phenomena are widespread in both natural and technological contexts. Despite the well-established nature of this scientific field and our extensive knowledge of its underlying principles, wetting remains a dynamic and vibrant area of study. It continues to pose fundamental questions while offering innovative avenues for controlling these phenomena to develop novel applications. By tailoring the wetting properties of surfaces, researchers and engineers can design materials with specific functionalities, such as self-cleaning surfaces, anti-fog coatings, and enhanced slipperiness. Recent years have witnessed significant advancements in wetting research, owing to the exquisite control achieved in surface topography and chemistry and to the development of novel experimental techniques. Additionally, simulations and theory have played a crucial role in these advancements. They provide the fundamental knowledge and quantitative tools to control wettability and design surfaces with enhanced properties. Given these recent breakthroughs, this special collection Chemical Physics of Controlled Wettability and Super Surfaces becomes particularly timely and significant. It serves as a platform to showcase some of the latest developments in the field of wetting. It highlights the exciting progress and potential applications in controlling wetting properties that are enabled by the synergy between theory, simulations, and experiments.

  • 3 authors
·
Jan 31, 2024

Chemical Heredity as Group Selection at the Molecular Level

Many examples of cooperation exist in biology. In chemical systems, however, which can sometimes be quite complex, we do not appear to observe intricate cooperative interactions. A key question for the origin of life is then how molecular cooperation can first arise in an abiotic system prior to the emergence of biological replication. We postulate that selection at the molecular level is a driving force behind the complexification of chemical systems, particularly during the origins of life. In the theory of multilevel selection the two selective forces are within-group and between-group, where the former tends to favor "selfish" replication of individuals and the latter favors cooperation between individuals, enhancing the replication of the group as a whole. These forces can be quantified using the Price equation, which is a standard tool used in evolutionary biology to quantify evolutionary change. Our central claim is that replication and heredity in chemical systems are subject to selection, and quantifiable using the multilevel Price equation. We demonstrate this using the Graded Autocatalysis Replication Domain computer model, describing a simple protocell composed of molecules and its replication, which are respectively analogous to the group and the individuals. In contrast to previous treatments of this model, we treat the lipid molecules themselves as replicating individuals and the protocells they form as groups of individuals. Our goal is to demonstrate how evolutionary biology tools and concepts can be applied in chemistry, and we suggest that molecular cooperation may arise as a result of group selection. Further, the biological relation of parent-progeny is proposed to be analogous to the reactant-product relation in chemistry, thus allowing tools from evolutionary biology to be applied to chemistry and deepening the connection between chemistry and biology.

  • 3 authors
·
Feb 22, 2018
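
For readers unfamiliar with the multilevel Price equation mentioned in the abstract above, its standard textbook form is sketched below. The notation is an assumption chosen for illustration and is not taken from the paper: groups are indexed by g and individuals by i, w is fitness, z is the trait value, W_g and Z_g are group means, groups are assumed equal in size, and the transmission-bias term is omitted.

```latex
% Multilevel Price equation (equal-sized groups, no transmission bias).
% First term: selection between groups; second term: selection within groups.
\bar{w}\,\Delta\bar{z}
  = \underbrace{\operatorname{Cov}_g\!\left(W_g, Z_g\right)}_{\text{between-group selection}}
  + \underbrace{\operatorname{E}_g\!\left[\operatorname{Cov}_i\!\left(w_{gi}, z_{gi}\right)\right]}_{\text{within-group selection}}
```

In the abstract's analogy, the groups would be protocells and the individuals would be lipid molecules, so the two terms separate selection acting on whole protocells from selection acting among the molecules inside each one.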

Chemical abundances and kinematics of 257 G-, K-type field giants. Setting a base for further analysis of giant-planet properties orbiting evolved stars

We performed a uniform and detailed abundance analysis of 12 refractory elements (Na, Mg, Al, Si, Ca, Ti, Cr, Ni, Co, Sc, Mn, and V) for a sample of 257 G- and K-type evolved stars from the CORALIE planet search program. To date, only one of these stars is known to harbor a planetary companion. We aimed to characterize this large sample of evolved stars in terms of chemical abundances and kinematics, thus setting a solid base for further analysis of planetary properties around giant stars. This sample, being homogeneously analyzed, can be used as a comparison sample for other planet-related studies, as well as for different types of studies related to stellar and Galaxy astrophysics. The abundances of the chemical elements were determined using an LTE abundance analysis relative to the Sun, with the spectral synthesis code MOOG and a grid of Kurucz ATLAS9 atmospheres. To separate the Galactic stellar populations, both a purely kinematical approach and a chemical method were applied. We confirm the overabundance of Na in giant stars compared to the field FGK dwarfs. This enhancement might have a stellar evolutionary character, but departures from LTE may also produce a similar enhancement. Our chemical separation of stellar populations also suggests a "gap" in metallicity between the thick-disk and high-alpha metal-rich stars, as previously observed in dwarf samples from HARPS. The present sample, like most giant star samples, also suffers from the B - V colour cut-off, which excludes low-log g stars with high metallicities and high-log g stars with low [Fe/H]. For future studies of the dependence of planet occurrence on stellar metallicity around these evolved stars, we suggest using a sub-sample of stars in a "cut-rectangle" in the log g - [Fe/H] diagram to overcome the aforementioned issue.

  • 12 authors
·
Mar 28, 2015

Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations

While large language models (LLMs) with Chain-of-Thought (CoT) reasoning excel in mathematics and coding, their potential for systematic reasoning in chemistry, a domain demanding rigorous structural analysis for real-world tasks like drug design and reaction engineering, remains untapped. Current benchmarks focus on simple knowledge retrieval, neglecting step-by-step reasoning required for complex tasks such as molecular optimization and reaction prediction. To address this, we introduce ChemCoTBench, a reasoning framework that bridges molecular structure understanding with arithmetic-inspired operations, including addition, deletion, and substitution, to formalize chemical problem-solving into transparent, step-by-step workflows. By treating molecular transformations as modular "chemical operations", the framework enables slow-thinking reasoning, mirroring the logic of mathematical proofs while grounding solutions in real-world chemical constraints. We evaluate models on two high-impact tasks: Molecular Property Optimization and Chemical Reaction Prediction. These tasks mirror real-world challenges while providing structured evaluability. By providing annotated datasets, a reasoning taxonomy, and baseline evaluations, ChemCoTBench bridges the gap between abstract reasoning methods and practical chemical discovery, establishing a foundation for advancing LLMs as tools for AI-driven scientific innovation.

  • 9 authors
·
May 27, 2025

ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain

Recent advancements in language models have started a new era of superior information retrieval and content generation, with embedding models playing an important role in optimizing data representation efficiency and performance. While benchmarks like the Massive Text Embedding Benchmark (MTEB) have standardized the evaluation of general domain embedding models, a gap remains in specialized fields such as chemistry, which require tailored approaches due to domain-specific challenges. This paper introduces a novel benchmark, the Chemical Text Embedding Benchmark (ChemTEB), designed specifically for the chemical sciences. ChemTEB addresses the unique linguistic and semantic complexities of chemical literature and data, offering a comprehensive suite of tasks on chemical domain data. Through the evaluation of 34 open-source and proprietary models using this benchmark, we illuminate the strengths and weaknesses of current methodologies in processing and understanding chemical information. Our work aims to equip the research community with a standardized, domain-specific evaluation framework, promoting the development of more precise and efficient NLP models for chemistry-related applications. Furthermore, it provides insights into the performance of generic models in a domain-specific context. ChemTEB comes with open-source code and data, contributing further to its accessibility and utility.

  • 7 authors
·
Nov 30, 2024

Machine Learning for Polaritonic Chemistry: Accessing chemical kinetics

Altering chemical reactivity and material structure in confined optical environments is on the rise, and yet, a conclusive understanding of the microscopic mechanisms remains elusive. This originates mostly from the fact that accurately predicting vibrational and reactive dynamics for soluted ensembles of realistic molecules is no small endeavor, and adding (collective) strong light-matter interaction does not simplify matters. Here, we establish a framework based on a combination of machine learning (ML) models, trained using density-functional theory calculations, and molecular dynamics to accelerate such simulations. We then apply this approach to evaluate strong coupling, changes in reaction rate constant, and their influence on enthalpy and entropy for the deprotection reaction of 1-phenyl-2-trimethylsilylacetylene, which has been studied previously both experimentally and using ab initio simulations. While we find qualitative agreement with critical experimental observations, especially with regard to the changes in kinetics, we also find differences in comparison with previous theoretical predictions. The features for which the ML-accelerated and ab initio simulations agree show the experimentally estimated kinetic behavior. Conflicting features indicate that a contribution of dynamic electronic polarization to the reaction process is more relevant than currently believed. Our work demonstrates the practical use of ML for polaritonic chemistry, discusses limitations of common approximations and paves the way for a more holistic description of polaritonic chemistry.

  • 4 authors
·
Nov 16, 2023

A Chemical Modelling Roadmap Linking Protoplanetary Disks and Exoplanet Atmospheres

[Abridged] This review paper discusses which chemical effects may be at play in a planet-forming disk midplane, which effects are relevant under different conditions, and which tools are available for modelling chemical kinetics in a disk midplane. The review goes on to discuss some important efforts in the planet formation modelling community to treat chemical evolution, and, vice versa, efforts in the chemical modelling community to implement more physical effects related to planet formation into the chemical modelling. The aim of this review is both to outline some concepts related to planet formation chemistry and to encourage not just collaboration between the planet formation modelling community and the astrochemical community, but also assistance and guidance from one community to the other: guidance regarding which effects, out of many, might be more relevant than others under certain planet formation conditions, and regarding why certain included effects lead to certain important modelling outcomes. As the research fields of exoplanet atmospheres and protoplanetary disks near new frontiers in observational insights with upcoming facilities, developing appropriate modelling frameworks (including physical and chemical effects) is paramount to ultimately enable the linking of a chemically characterised exoplanet atmosphere to its formation history in its natal protoplanetary disk.

  • 1 author
·
Oct 30, 2022

The chemical inventory of the planet-hosting disk PDS 70

As host to two accreting planets, PDS 70 provides a unique opportunity to probe the chemical complexity of atmosphere-forming material. We present ALMA Band 6 observations of the PDS 70 disk and report the first chemical inventory of the system. With a spatial resolution of 0.4''-0.5'' (~50 au), 12 species are detected, including CO isotopologues and formaldehyde, small hydrocarbons, HCN and HCO+ isotopologues, and S-bearing molecules. SO and CH3OH are not detected. All lines show a large cavity at the center of the disk, indicative of the deep gap carved by the massive planets. The radial profiles of the line emission are compared to the (sub-)mm continuum and infrared scattered light intensity profiles. Different molecular transitions peak at different radii, revealing the complex interplay between density, temperature and chemistry in setting molecular abundances. Column densities and optical depth profiles are derived for all detected molecules, and upper limits are obtained for the non-detections. Excitation temperature is obtained for H2CO. Deuteration and nitrogen fractionation profiles from the hydro-cyanide lines show radially increasing fractionation levels. Comparison of the disk chemical inventory to grids of chemical models from the literature strongly suggests a disk molecular layer hosting a carbon to oxygen ratio C/O>1, thus providing for the first time compelling evidence of planets actively accreting high C/O ratio gas at the present time.

  • 6 authors
·
Jan 20, 2021

Nanoscale Chemical Evolution of Silicon Negative Electrodes Characterized by Low-Loss STEM-EELS

Continuous solid electrolyte interface (SEI) formation remains the limiting factor of the lifetime of silicon nanoparticles (SiNPs) based negative electrodes. Methods that could provide clear diagnosis of the electrode degradation are of utmost necessity to streamline further developments. We demonstrate that electron energy-loss spectroscopy (EELS) in a scanning transmission electron microscope (STEM) can be used to quickly map SEI components and quantify LixSi alloys from single experiments, with resolutions down to 5 nm. Exploiting the low-loss part of the EEL spectrum allowed us to circumvent the degradation phenomena that have so far crippled the application of this technique on such beam-sensitive compounds. Our results provide unprecedented insight into silicon aging mechanisms in full cell configuration. We observe the morphology of the SEI to be extremely heterogeneous at the particle scale but with clear chemical evolutions with extended cycling coming from both SEI accumulation and a transition from lithium-rich carbonate-like compounds to lithium-poor ones. Thanks to the retrieval of several results from a single dataset, we were able to correlate local discrepancies in lithiation to the initial crystallinity of silicon as well as to the local SEI chemistry and morphology. This study emphasizes how initial heterogeneities in the percolating electronic network and the porosity affect SiNPs aggregates along cycling. These findings pinpoint the crucial role of an optimized formulation in silicon-based thick electrodes.

  • 6 authors
·
Nov 14, 2016

Protein Chemical Shift Prediction

Protein chemical shifts hold a large amount of information about the 3-dimensional structure of the protein. A number of chemical shift predictors based on the relationship between structures resolved with X-ray crystallography and the corresponding experimental chemical shifts have been developed. These empirical predictors are very accurate on X-ray structures but tend to be insensitive to small structural changes. To overcome this limitation, it has been suggested to base chemical shift predictors on quantum mechanical (QM) calculations. In this thesis the development of the QM derived chemical shift predictor Procs14 is presented. Procs14 is based on 2.35 million density functional theory (DFT) calculations on tripeptides and contains corrections for hydrogen bonding, ring current and the effect of the previous and following residue. Procs14 is capable of performing predictions for the 13CA, 13CB, 13CO, 15NH, 1HN and 1HA backbone atoms. In order to benchmark Procs14, a number of QM NMR calculations are performed on full protein structures. Of the tested empirical and QM derived predictors, Procs14 reproduced the QM chemical shifts with the highest accuracy. A comparison with the QM derived predictor CheShift-2 on X-ray structures and NMR ensembles with experimental chemical shift data showed that Procs14 predicted the chemical shifts with the best accuracy. The predictions on the NMR ensembles exhibited the best performance. This suggests that future work might benefit from using ensemble sampling when performing simulations of protein folding with chemical shifts. Procs14 is implemented in the Markov chain Monte Carlo protein folding framework PHAISTOS. The computationally efficient implementation of Procs14 allows for rapid predictions and therefore potential use in refinement and folding of protein structures.

  • 1 author
·
Sep 23, 2014

Nonlinear dynamics of a chemically-active drop: from steady to chaotic self-propulsion

Individual chemically active drops suspended in a surfactant solution were observed to self-propel spontaneously with straight, helical, or chaotic trajectories. To elucidate how these drops can exhibit such strikingly different dynamics and "decide" what to do, we propose a minimal axisymmetric model of a spherical active drop, and show that simple and linear interface properties can lead to both steady self-propulsion of the droplet as well as chaotic behavior. The model includes two different mobility mechanisms, namely, diffusiophoresis and the Marangoni effect, that convert self-generated gradients of surfactant concentration into the flow at the droplet surface. In turn, surface-driven flow initiates surfactant advection that is the only nonlinear mechanism and, thus, the only source of dynamical complexity in our model. Numerical investigation of the fully-coupled hydrodynamic and advection-diffusion problems reveals that strong advection (e.g., large droplet size) may destabilize a steadily self-propelling drop; once destabilized, the droplet spontaneously stops and a symmetric extensile flow emerges. If advection is strengthened even further in comparison with molecular diffusion, the droplet may perform chaotic oscillations. Our results indicate that the thresholds of these instabilities depend heavily on the balance between diffusiophoresis and the Marangoni effect. Using linear stability analysis, we demonstrate that diffusiophoresis promotes the onset of high-order modes of monotonic instability of the motionless drop. We argue that diffusiophoresis has a similar effect on the instabilities of a moving drop.

  • 2 authors
·
Jan 8, 2019

ChemLLM: A Chemical Large Language Model

Large language models (LLMs) have made impressive progress in chemistry applications, including molecular property prediction, molecular generation, experimental protocol design, etc. However, the community lacks a dialogue-based model specifically designed for chemistry. The challenge arises from the fact that most chemical data and scientific knowledge are primarily stored in structured databases, and the direct use of these structured data compromises the model's ability to maintain coherent dialogue. To tackle this issue, we develop a novel template-based instruction construction method that transforms structured knowledge into plain dialogue, making it suitable for language model training. By leveraging this approach, we develop ChemLLM, the first large language model dedicated to chemistry, capable of performing various tasks across chemical disciplines with smooth dialogue interaction. ChemLLM beats GPT-3.5 on all three principal tasks in chemistry, i.e., name conversion, molecular caption, and reaction prediction, and surpasses GPT-4 on two of them. Remarkably, ChemLLM also shows exceptional adaptability to related mathematical and physical tasks despite being trained mainly on chemical-centric corpora. Furthermore, ChemLLM demonstrates proficiency in specialized NLP tasks within chemistry, such as literature translation and cheminformatic programming. ChemLLM opens up a new avenue for exploration within chemical studies, while our method of integrating structured chemical knowledge into dialogue systems sets a new frontier for developing LLMs across various scientific fields. Codes, Datasets, and Model weights are publicly accessible at hf.co/AI4Chem/ChemLLM-7B-Chat.

  • 15 authors
·
Feb 9, 2024

Accurate generation of chemical reaction transition states by conditional flow matching

Transition state (TS) structures define the critical geometries and energy barriers underlying chemical reactivity, yet their fleeting nature renders them experimentally elusive and drives the reliance on costly, high-throughput density functional theory (DFT) calculations. Here, we introduce TS-GEN, a conditional flow-matching generative model that maps samples from a simple Gaussian prior directly to transition-state saddle-point geometries in a single, deterministic pass. By embedding both reactant and product conformations as conditioning information, TS-GEN learns to transport latent noise to true TS structures via an optimal-transport path, effectively replacing the iterative optimization common in nudged-elastic band or string-method algorithms. TS-GEN delivers unprecedented accuracy, achieving a root-mean-square deviation of 0.004 Å (vs. 0.103 Å for prior state-of-the-art) and a mean barrier-height error of 1.019 kcal/mol (vs. 2.864 kcal/mol), while requiring only 0.06 s GPU time per inference. Over 87% of generated TSs meet chemical-accuracy criteria (<1.58 kcal/mol error), substantially outpacing existing methods. TS-GEN also exhibits strong transferability to out-of-distribution reactions from a larger database. By uniting sub-angstrom precision, sub-second speed, and broad applicability, TS-GEN will be highly useful for high-throughput exploration of complex reaction networks, paving the way to the exploration of novel chemical reaction mechanisms.

  • 3 authors
·
Jul 14, 2025
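
The optimal-transport path mentioned in the TS-GEN abstract can be illustrated with a toy flow matching setup. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the straight-line interpolant x_t = (1 - t) x_0 + t x_1 is the common choice in the flow matching literature and is assumed here, and the conditioning on reactant and product geometries is omitted.

```python
def ot_path_point(x0, x1, t):
    """Point at time t on the straight line from noise x0 to target x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity field along the straight-line path."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Squared error between predicted and target velocity at time t."""
    xt = ot_path_point(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true))


def euler_integrate(predict_velocity, x0, steps=10):
    """Transport x0 from t=0 to t=1 with fixed-step Euler integration."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = predict_velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical stand-in for a trained network: the exact velocity field
# that moves any point toward one fixed target along a straight line.
TARGET = [1.0, 1.5, 0.5]


def oracle_velocity(x, t):
    return [(b - a) / (1.0 - t) for a, b in zip(x, TARGET)]
```

With the oracle field, a single Euler step (`steps=1`) already lands on the target, which mirrors the single-pass generation described in the abstract; a learned field would only approximate this.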

Generative Discovery of Novel Chemical Designs using Diffusion Modeling and Transformer Deep Neural Networks with Application to Deep Eutectic Solvents

We report a series of deep learning models to solve complex forward and inverse design problems in molecular modeling and design. Using both diffusion models inspired by nonequilibrium thermodynamics and attention-based transformer architectures, we demonstrate a flexible framework to capture complex chemical structures. The model is first trained on the QM9 dataset and a series of quantum mechanical properties (e.g., HOMO, LUMO, free energy, and heat capacity) and then generalized to study and design key properties of deep eutectic solvents. In addition to separate forward and inverse models, we also report an integrated, fully prompt-based multi-task generative pretrained transformer model that solves multiple forward, inverse design, and prediction tasks flexibly within one model. We show that the multi-task generative model has the best overall performance and allows for flexible integration of multiple objectives, within one model, and for distinct chemistries, suggesting that synergies emerge during training of this large language model. Trained jointly on tasks related to the QM9 dataset and deep eutectic solvents (DESs), the model can predict various quantum mechanical properties and critical properties to achieve deep eutectic solvent behavior. Several novel combinations of DESs are proposed based on this framework.

  • 3 authors
·
Apr 24, 2023

Molecular Contrastive Learning with Chemical Element Knowledge Graph

Molecular representation learning contributes to multiple downstream tasks such as molecular property prediction and drug design. To properly represent molecules, graph contrastive learning is a promising paradigm as it utilizes self-supervision signals and has no requirements for human annotations. However, prior works fail to incorporate fundamental domain knowledge into graph semantics and thus ignore the correlations between atoms that have common attributes but are not directly connected by bonds. To address these issues, we construct a Chemical Element Knowledge Graph (KG) to summarize microscopic associations between elements and propose a novel Knowledge-enhanced Contrastive Learning (KCL) framework for molecular representation learning. The KCL framework consists of three modules. The first module, knowledge-guided graph augmentation, augments the original molecular graph based on the Chemical Element KG. The second module, knowledge-aware graph representation, extracts molecular representations with a common graph encoder for the original molecular graph and a Knowledge-aware Message Passing Neural Network (KMPNN) to encode complex information in the augmented molecular graph. The final module is a contrastive objective, where we maximize agreement between these two views of molecular graphs. Extensive experiments demonstrated that KCL achieved superior performance over state-of-the-art baselines on eight molecular datasets. Visualization experiments properly interpret what KCL has learned from atoms and attributes in the augmented molecular graphs. Our code and data are available at https://github.com/ZJU-Fangyin/KCL.

  • 10 authors
·
Dec 1, 2021
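
The contrastive objective in the KCL entry, maximizing agreement between two views of the same molecule, can be sketched with an InfoNCE-style loss. The abstract does not state the exact loss, so the InfoNCE form, the cosine similarity, the temperature value, and the function names are all assumptions made for illustration; the embeddings here are plain lists standing in for encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss; view_a[i] and view_b[i] embed the same molecule.

    For each anchor in view_a, the matching row of view_b is the positive
    and every other row of view_b acts as a negative.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n
```

The loss is small when each molecule's two views are closer to each other than to views of other molecules in the batch. This version is one-directional (anchors come only from `view_a`); symmetric variants average both directions.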

mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules

Despite their ability to understand chemical knowledge, large language models (LLMs) remain limited in their capacity to propose novel molecules with desired functions (e.g., drug-like properties). In addition, the molecules that LLMs propose can often be challenging to make, and are almost never compatible with automated synthesis approaches. To better enable the discovery of functional small molecules, LLMs need to learn a new molecular language that is more effective in predicting properties and inherently synced with automated synthesis technology. Current molecule LLMs are limited by representing molecules based on atoms. In this paper, we argue that just like tokenizing texts into meaning-bearing (sub-)word tokens instead of characters, molecules should be tokenized at the level of functional building blocks, i.e., parts of molecules that bring unique functions and serve as effective building blocks for real-world automated laboratory synthesis. This motivates us to propose mCLM, a modular Chemical-Language Model that comprises a bilingual language model that understands both natural language descriptions of functions and molecular blocks. mCLM front-loads synthesizability considerations while improving the predicted functions of molecules in a principled manner. mCLM, with only 3B parameters, achieves improvements in synthetic accessibility relative to 7 other leading generative AI methods including GPT-5. When tested on 122 out-of-distribution medicines using only building blocks/tokens that are compatible with automated modular synthesis, mCLM outperforms all baselines in property scores and synthetic accessibility. mCLM can also reason on multiple functions and iteratively self-improve to rescue drug candidates that failed late in clinical trials ("fallen angels").

  • 14 authors
·
May 18, 2025

OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery

Understanding the chemical structure from a graphical representation of a molecule is a challenging image caption task that would greatly benefit molecule-centric scientific discovery. Variations in molecular images and caption subtasks pose a significant challenge in both image representation learning and task modeling. Yet, existing methods only focus on a specific caption task that translates a molecular image into its graph structure, i.e., OCSR. In this paper, we propose the Optical Chemical Structure Understanding (OCSU) task, which extends OCSR to molecular image caption from motif level to molecule level and abstract level. We present two approaches: an OCSR-based method and an end-to-end OCSR-free method. The proposed Double-Check achieves SOTA OCSR performance on real-world patent and journal article scenarios via attentive feature enhancement for local ambiguous atoms. Cascading with SMILES-based molecule understanding methods, it can leverage the power of existing task-specific models for OCSU. Mol-VL, in contrast, is an end-to-end optimized VLM-based model. An OCSU dataset, Vis-CheBI20, is built based on the widely used CheBI20 dataset for training and evaluation. Extensive experimental results on Vis-CheBI20 demonstrate the effectiveness of the proposed approaches. Improving OCSR capability can lead to better OCSU performance for the OCSR-based approach, and the SOTA performance of Mol-VL demonstrates the great potential of the end-to-end approach.

  • 8 authors
·
Jan 26, 2025

Towards Foundation Model for Chemical Reactor Modeling: Meta-Learning with Physics-Informed Adaptation

Developing accurate models for chemical reactors is often challenging due to the complexity of reaction kinetics and process dynamics. Traditional approaches require retraining models for each new system, limiting generalizability and efficiency. In this work, we take a step toward foundation models for chemical reactor modeling by introducing a neural network framework that generalizes across diverse reactor types and rapidly adapts to new chemical processes. Our approach leverages meta-learning to pretrain the model on a broad set of reactor dynamics, enabling efficient adaptation to unseen reactions with minimal data. To further enhance generalizability, we incorporate physics-informed fine-tuning, ensuring physically consistent adaptation to new reactor conditions. Our framework is evaluated across three integer-order fundamental reactor types - continuous stirred tank reactors, batch reactors, and plug flow reactors - demonstrating superior few-shot adaptation compared to conventional data-driven, physics-informed, and transfer learning approaches. By combining meta-learning with physics-informed adaptation, this work lays the foundation for a generalizable modeling framework, advancing the development of foundation models for chemical engineering applications. Source code is available at https://github.com/killingbear999/chemical-reactor-foundation-model.

  • 2 authors
·
May 19, 2024

Large-Scale Chemical Language Representations Capture Molecular Structure and Properties

Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets. They perform competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties.

  • 6 authors
·
Jun 17, 2021

What types of chemical problems benefit from density-corrected DFT? A probe using an extensive and chemically diverse test suite

For the large and chemically diverse GMTKN55 benchmark suite, we have studied the performance of density-corrected density functional theory (HF-DFT), compared to self-consistent DFT, for several pure and hybrid GGA and meta-GGA exchange-correlation (XC) functionals (PBE, BLYP, TPSS, SCAN) as a function of the percentage of HF exchange in the hybrid. The D4 empirical dispersion correction has been added throughout. For subsets dominated by dynamical correlation -- particularly noncovalent interaction subsets -- HF-DFT is highly beneficial, particularly at low HF exchange percentages. For subsets with significant static correlation (i.e., where a Hartree-Fock determinant is not a good zero-order wavefunction), HF-DFT may do more harm than good. While the self-consistent series show optima at or near 37.5% (i.e., 3/8) for all four XC functionals -- consistent with Grimme's proposal of the PBE38 functional -- HF-BnLYP-D4, HF-PBEn-D4, and HF-TPSSn-D4 all exhibit minima nearer 25% (i.e., 1/4). Intriguingly, for HF-SCANn-D4, the minimum is near 10%, but the weighted mean absolute error (WTMAD2) for GMTKN55 is only barely lower than that of HF-SCAN-D4 (i.e., where the post-HF step is a pure meta-GGA). The latter becomes an attractive option, only slightly more costly than pure Hartree-Fock, and devoid of adjustable parameters other than the three in the dispersion correction. Moreover, its WTMAD2 is only surpassed by the highly empirical M06-2X and by the combinatorically optimized empirical range-separated hybrids wB97X-V and wB97M-V.

  • 2 authors
·
Oct 4, 2020

Fluctuations and correlations in chemical reaction kinetics and population dynamics

This chapter provides a pedagogical introduction and overview of spatial and temporal correlation and fluctuation effects resulting from the fundamentally stochastic kinetics underlying chemical reactions and the dynamics of populations or epidemics. After reviewing the assumptions and mean-field type approximations involved in the construction of chemical rate equations for uniform reactant densities, we first discuss spatial clustering in birth-death systems, where non-linearities are introduced through either density-limiting pair reactions, or equivalently via local imposition of finite carrying capacities. The competition of offspring production, death, and non-linear inhibition induces a population extinction threshold, which represents a non-equilibrium phase transition that separates active from absorbing states. This continuous transition is characterized by the universal scaling exponents of critical directed percolation clusters. Next we focus on the emergence of depletion zones in single-species annihilation processes and spatial population segregation with the associated reaction fronts in two-species pair annihilation. These strong (anti-)correlation effects are dynamically generated by the underlying stochastic kinetics. Finally, we address noise-induced and fluctuation-stabilized spatio-temporal patterns in basic predator-prey systems, exemplified by spreading activity fronts in the two-species Lotka-Volterra model as well as spiral structures in the May-Leonard variant of cyclically competing three-species systems akin to rock-paper-scissors games.

  • 1 author
·
Jul 3, 2018
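
The stochastic kinetics behind the birth-death systems in this chapter can be illustrated with a Gillespie-style simulation. This is a minimal sketch under simplifying assumptions: it is well mixed (zero-dimensional), so it shows fluctuations and the absorbing empty state but none of the spatial correlation effects the chapter discusses. The reaction set (A -> 2A, A -> 0, 2A -> A), the rate names, and the function name are illustrative choices rather than the chapter's notation.

```python
import random


def gillespie_birth_death(n0, birth, death, coagulation, t_max, rng):
    """Simulate A -> 2A, A -> 0 and 2A -> A; return (times, counts).

    The run stops at t_max or when the population hits the absorbing
    state n = 0, from which no reaction can fire.
    """
    t, n = 0.0, n0
    times, counts = [t], [n]
    while t < t_max and n > 0:
        rates = (
            birth * n,                          # offspring production
            death * n,                          # spontaneous death
            coagulation * n * (n - 1) / 2.0,    # density-limiting pair reaction
        )
        total = sum(rates)
        if total == 0.0:
            break  # all rate constants are zero, nothing can happen
        t += rng.expovariate(total)  # exponential waiting time to next event
        if rng.random() * total < rates[0]:
            n += 1  # birth
        else:
            n -= 1  # death or coagulation, both remove one particle
        times.append(t)
        counts.append(n)
    return times, counts
```

A usage example: `gillespie_birth_death(10, 1.0, 0.5, 0.01, 5.0, random.Random(0))` returns one trajectory; repeating it with different seeds shows the run-to-run scatter that a deterministic rate equation averages away.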

Robust Determination of the Chemical Potential in the Pole Expansion and Selected Inversion Method for Solving Kohn-Sham density functional theory

Fermi operator expansion (FOE) methods are powerful alternatives to diagonalization type methods for solving Kohn-Sham density functional theory (KSDFT). One example is the pole expansion and selected inversion (PEXSI) method, which approximates the Fermi operator by rational matrix functions and reduces the computational complexity to at most quadratic scaling for solving KSDFT. Unlike diagonalization type methods, the chemical potential often cannot be directly read off from the result of a single step of evaluation of the Fermi operator. Hence multiple evaluations must be performed sequentially to compute the chemical potential and ensure the correct number of electrons within a given tolerance. This hinders the performance of FOE methods in practice. In this paper we develop an efficient and robust strategy to determine the chemical potential in the context of the PEXSI method. The main idea of the new method is not to find the exact chemical potential at each self-consistent-field (SCF) iteration, but to dynamically and rigorously update the upper and lower bounds for the true chemical potential, so that the chemical potential converges along with the SCF iteration. Instead of evaluating the Fermi operator multiple times sequentially, our method uses a two-level strategy that evaluates the Fermi operators in parallel. In the regime of full parallelization, the wall clock time of each SCF iteration is always close to the time for one single evaluation of the Fermi operator, even when the initial guess is far away from the converged solution. We demonstrate the effectiveness of the new method using examples with metallic and insulating character, as well as results from ab initio molecular dynamics.

  • 2 authors
·
Aug 14, 2017
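
The idea of tightening upper and lower bounds on the chemical potential rests on the fact that the electron count grows monotonically with the chemical potential. The sketch below shows only that bracketing idea on a toy system. It assumes explicit energy levels are available (PEXSI deliberately avoids computing them), ignores spin degeneracy, runs sequentially, and uses plain bisection; the names are hypothetical and none of this reproduces the paper's two-level parallel strategy.

```python
import math


def fermi(energy, mu, kT):
    """Fermi-Dirac occupation, written to avoid overflow in exp."""
    x = (energy - mu) / kT
    if x > 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))


def electron_count(levels, mu, kT):
    """Number of electrons at chemical potential mu (one per level)."""
    return sum(fermi(e, mu, kT) for e in levels)


def bracket_mu(levels, n_electrons, kT, lo, hi, tol=1e-8, max_iter=200):
    """Shrink [lo, hi] around the mu that yields n_electrons.

    Assumes electron_count(lo) <= n_electrons <= electron_count(hi).
    Returns (mu, lo, hi) so a caller can reuse the bounds later.
    """
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        n_mid = electron_count(levels, mid, kT)
        if abs(n_mid - n_electrons) < tol:
            return mid, lo, hi
        if n_mid < n_electrons:
            lo = mid  # too few electrons: mu must be higher
        else:
            hi = mid  # too many electrons: mu must be lower
    return 0.5 * (lo + hi), lo, hi
```

Returning the final bounds alongside the estimate reflects the abstract's point that bounds, not an exact value, are what get carried from one SCF iteration to the next.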

Deciphering the "chemical" nature of the exotic isotopes of Hydrogen by the MC-QTAIM analysis: The positively charged Muon and the Muonic Helium as new members of the Periodic Table

This report is primarily a survey of the chemical nature of some exotic species containing the positively charged muon and the muonic helium, i.e., the negatively charged muon plus helium nucleus, as exotic isotopes of hydrogen, using the newly developed multi-component quantum theory of atoms in molecules (MC-QTAIM) analysis, employing ab initio non-Born-Oppenheimer wavefunctions. Accordingly, the "atoms in molecules" analysis performed on various asymmetric exotic isotopomers of the hydrogen molecule, recently detected experimentally [Science 331, 448 (2011)], demonstrates that both exotic isotopes are capable of forming atoms in molecules and retaining the identity of the hydrogen atom. Various derived properties of atomic basins containing muonic helium leave no doubt that, apart from its short lifetime, it is a heavier isotope of hydrogen, while the properties of basins containing the positively charged muon are more remote from those of the orthodox hydrogen basins, capable of appreciable donation of electrons as well as large charge polarization; however, with some tolerance, they may also be categorized as hydrogen basins, though with a smaller electronegativity. All in all, the present study also clearly demonstrates that the MC-QTAIM analysis is an efficient approach to decipher the chemical nature of species containing exotic constituents, which are hard to elucidate by experimental and/or alternative theoretical schemes.

  • 2 authors
·
Nov 25, 2013

Multiscale Investigation of Chemical Interference in Proteins

We developed a multiscale approach (MultiSCAAL) that integrates the potential of mean force (PMF) obtained from all-atomistic molecular dynamics simulations with a knowledge-based energy function for coarse-grained molecular simulations to better explore the energy landscape of a small protein under chemical interference such as chemical denaturation. An excessive number of water molecules in all-atomistic molecular dynamics simulations often reduces the sampling efficiency of advanced sampling techniques such as the replica exchange method, which makes it difficult to investigate the effect of chemical interference on protein dynamics. Thus, there is a need to develop an effective strategy that focuses on sampling structural changes in protein conformations rather than solvent molecule fluctuations. In this work, we address this issue by devising a multiscale simulation scheme (MultiSCAAL) that bridges the gap between all-atomistic molecular dynamics simulation and coarse-grained molecular simulation. The two key features of this scheme are the Boltzmann inversion and a protein atomistic reconstruction method we previously developed (SCAAL). Using MultiSCAAL, we were able to enhance the sampling efficiency of proteins solvated by explicit water molecules. Our method has been tested on the folding energy landscape of a small protein, Trp-cage, with explicit solvent under 8M urea using both the all-atomistic replica exchange molecular dynamics (AA-REMD) and MultiSCAAL. We compared computational analyses on ensemble conformations of Trp-cage with its available experimental NOE distances. The analysis demonstrated that conformations explored by MultiSCAAL better agree with the ones probed in the experiments because it can effectively capture the changes in side chain orientations that can flip out of the hydrophobic pocket in the presence of urea and water molecules.

  • 3 authors
·
Apr 9, 2010
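
Boltzmann inversion, one of the two key features named in the MultiSCAAL abstract, converts a sampled distribution into a potential of mean force through U = -kT ln P. The sketch below is a generic illustration, not the authors' code: the function name and binning are hypothetical, and the Jacobian correction usually applied to radial distance histograms is omitted.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def boltzmann_inversion(histogram, temperature):
    """Return the PMF (kcal/mol) per bin, shifted so its minimum is zero.

    histogram holds raw counts per bin from a simulation; bins that were
    never visited get an infinite free energy.
    """
    total = sum(histogram)
    if total == 0:
        raise ValueError("histogram contains no samples")
    kT = K_B * temperature
    pmf = []
    for count in histogram:
        if count == 0:
            pmf.append(math.inf)
        else:
            pmf.append(-kT * math.log(count / total))
    lowest = min(pmf)
    return [u - lowest for u in pmf]
```

For example, a bin visited ten times less often than the most populated bin sits kT ln 10 higher in free energy, about 1.37 kcal/mol at 300 K.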