53.8 kB

Multiscale Investigation of Chemical Interference in Proteins

Antonios Samiotakis*⁺, Dirar Homouz*⁺ and Margaret S. Cheung*

* Department of Physics, University of Houston.

⁺ These authors contributed equally to this work.

Corresponding author email: mscheung@uh.edu

Abstract

We developed a multiscale approach (MultiSCAAL) that integrates the potential of mean force (PMF) obtained from all-atomistic molecular dynamics simulations with a knowledge-based energy function for coarse-grained molecular simulations in better exploring the energy landscape of a small protein under chemical interference such as chemical denaturation. An excessive amount of water molecules in all-atomistic molecular dynamics simulations often negatively impacts the sampling efficiency of some advanced sampling techniques such as the replica exchange method and it makes the investigation of chemical interferences on protein dynamics difficult. Thus, there is a need to develop an effective strategy that focuses on sampling structural changes in protein conformations rather than solvent molecule fluctuations. In this work, we address this issue by devising a multiscale simulation scheme (MultiSCAAL) that bridges the gap between all-atomistic molecular dynamics simulation and coarse-grained molecular simulation. The two key features of this scheme are the Boltzmann inversion and a protein atomistic reconstruction method we previously developed (SCAAL). Using MultiSCAAL, we were able to enhance the sampling efficiency of proteins solvated by explicit water molecules. Our method has been tested on the folding energy landscape of a small protein Trp-cage with explicit solvent under 8M urea using both the all-atomistic replica exchangemolecular dynamics (AA-REMD) and MultiSCAAL. We compared computational analyses on ensemble conformations of Trp-cage with its available experimental NOE distances. The analysis demonstrated that conformations explored by MultiSCAAL better agree with the ones probed in the experiments because it can effectively capture the changes in side chain orientations that can flip out of the hydrophobic pocket in the presence of urea and water molecules. In this regard, MultiSCAAL is a promising and effective sampling scheme for investigating chemical interference which presents a great challenge when modeling protein interactions in vivo.## I. INTRODUCTION

Proteins are the molecular “workhorses” that carry out functions in living organisms. Proteins need to fold into well defined compact structures in order to perform these functions. Such “protein folding” events depend on the amino acid sequences of proteins and their surrounding environment. The physical principle of this process in a crowded and confined cell, however, has not been fully understood¹. The major obstacles lie in complex interactions between odd-shaped macromolecules which often chemically interfere with one another in a crowded space, making experiments extremely difficult. This problem could be better understood by utilizing computer simulation and modeling that offer an effective approach to explore a broad range of parameters and solvent conditions, reducing costs to experiments^2-8. However, performing such a task is challenging because it involves computer simulation of a biological system whose spatiotemporal dynamics span across multiple orders of magnitude that is beyond the capacity of current computing resources.

The recently advanced computer simulations remain in all-atom and coarse-grained molecular simulations which have different strengths as well as shortcomings. All-atomistic molecular dynamics simulations with explicit solvent molecules provide a great deal of detail of protein dynamics on a very short-time scale (e.g. picoseconds). However, it can be too computationally costly⁹ to produce a meaningful folding trajectory that typically elapses in microseconds. On the other hand, Coarse-grained (CG) protein models are very effective in capturing the main features of a protein and have the ability to simulate full protein folding trajectories up to several microseconds¹⁰. However, CG protein models lack the atomistic details and reliable energy functions that reflect changes in solvent conditions, so they appear “notrealistic” enough¹¹. Because of these shortcomings, CG investigations have been restricted to isolated proteins in aqueous solutions. Other methods, like the Generalized Born method^{12, 13}, attempt to reduce the computational cost by treating the solvent implicitly. However, these approaches are also restricted in aqueous solutions and have yet to address the issue of chemical interference.

To solve more complex problems with larger proteins over a broad range of spatiotemporal scales, in terms of multiple orders of magnitudes, several approaches that can mix CG and all-atomistic (or fine-grained, FG) protein models are being developed. Some multiscale algorithms bridge the length scales of FG and CG models. A “force matching” technique that gathers forces on CG models from FG simulations^{14, 15} has been developed and applied to simple systems¹⁶. A hybrid FG/CG molecular simulation system in which the resolutions can be dynamically switched has been proposed^17-20. Alternatively, some methods take hydrodynamic interactions into account and integrate disparate time scales between molecular dynamics and Brownian dynamics²¹. However, these approaches are only applicable to simple isolated systems without complex changes in solvent environment; fundamental concerns of how to integrate force fields from different scales still remain elusive. Little is known, for example, of how to take radical chemical interferences into account on the self-assembly of macromolecules. Therefore, an investigation of protein dynamics and the folding energy landscape inside a cell, a crowded medium where interactions between proteins are often changed by chemical interference, is not available; thus, development of multi-scale approaches to serve this purpose of prompt integration of force fields for protein models at different resolutions is urgentlyneeded. In addition, developments of new efficient multiscale modeling algorithms are required to optimize data transmission that enhances the computing capacity for all system sizes.

In this work, we developed a multi-scaled molecular simulation method (MultiSCAAL) and used it to construct a well-sampled folding energy landscape of a Trp-cage protein under an aqueous condition and at high level of urea concentration that would otherwise be very difficult to simulate using the approach of molecular dynamics simulation alone. Trp-cage is a small fast-folding protein with 20 amino acids that has been studied extensively by experiments^22-24 and it is often used to gauge the validity of force fields²⁵ and new computational methods^26-31. Our method is built on integration of several well-established approaches such as coarse-grained protein models^{32, 33} (Side Chain $C_{\alpha}$ Model, SCM), reconstruction of all-atom protein models from SCM models⁶ (SCAAL), and all-atomistic molecular dynamics simulations^{34, 35}. Our strategy is composed of three steps: (1) The energy function for coarse-grained molecular simulation is derived from the potential of mean force (PMF) from the all-atomistic simulations that contain certain chemical interference using Boltzmann inversion method^{36, 37}; (2) Coarse-grained protein representations in a thermodynamic ensemble of interest are selected according to a Metropolis criterion³⁸ and all-atomistic protein conformation are promptly reconstructed by effectively incorporating an all-atomistic protein model as a template; (3) Folding free energy landscape of a protein that uses reconstructed all-atomistic protein models built from step (2) as initial conformations is effectively simulated by all-atomistic molecular dynamics. A schematic overview of the algorithm is presented in Fig.1 and the focus of our study in part lies in the transition between the energy function for coarse-grained models and the PMF.

While the method of Boltzmann inversion has been applied in other studies^{36 39}, the main issue that limits its practical use is the justification of a reliable reference state to relate the non-bonded energy function of a CG protein model to the PMF obtained from all-atomistic molecular simulations⁴⁰. Here, we attempted to resolve this problem by using an energy function taken from a matrix of knowledge-based potential (or statistical potential) determined from the Protein Data Bank⁴¹ and tested our approach on a small protein, Trp-cage. The rationale is the following: given that statistical potentials for coarse-grained molecular simulations and the force fields for all-atomistic molecular simulations have been independently derived from different experiments, both have been successfully applied for the investigation of research topics that address protein dynamics and structural interactions. By establishing an empirical relationship between PMF from all-atomistic systems and the energy function for non bonded interactions in coarse-grained systems, we may be able to apply this knowledge for an effective integration of multiscale molecular simulations as well as to improve the exploration of the folding energy landscape.

We constructed a folding energy landscape of a small protein Trp-cage under an aqueous condition and 8M urea concentration using all-atomistic replica exchange molecular dynamics (AA-REMD) and MultiSCAAL. Selected atomistic distances in dominant ensemble structures of Trp-cage using both methods were compared with NOE results from experimental NMR studies²⁴. Under the aqueous condition both methods produce nearly the same dominant protein structures. Interestingly, under 8M urea conditions MultiSCAAL was able to render a broader energy landscape of Trp-cage than AA-REMD by providing a wider variety of protein conformations in which side chains are able to flank out of the hydrophobic pocket in the presence of urea. The result of a flipped tryptophan captured by MultiSCAAL better explained short atomistic distances to neighboring amino acids indicated by NOE measurements²⁴.## II. THEORY AND METHODS

1. ENERGY FUNCTION DEPENDENT ON CHEMICAL INTERFERENCE

1.1 Boltzmann inversion

An energy function was developed for coarse-grained molecular simulation that accommodates chemical interference using Boltzmann inversion^{37, 42, 43} from the data derived from all-atomistic molecular dynamics simulations. The pair correlation function between any two amino acid types $i$ and $j$ at a distance $r$ in solvent type $\alpha$ is $g_{ij}^\alpha(r)$ . $r^*$ denotes the first highest peak of $g_{ij}^\alpha(r)$ , where $g_{ij}^\alpha(r^*)$ is a maximum. $g_{ij}^\alpha(r)$ relates to the potential of mean force, $U_{ij}^\alpha(r)$ , between the same pair of amino acids through Boltzmann inversion at temperature $T$ by the following formula³⁶ :

$U_{ij}^\alpha(r) = -k_B T \ln \left[ \frac{g_{ij}^\alpha(r)}{\rho_o} \right] \quad \text{eqn 1}$

where $\rho_o$ is the average density of the system (amino acid pairs and the solvent) and $k_B$ is the Boltzmann constant. The average density $\rho_o$ is used to normalize the pair correlation function at distances, greater than the excluded volume radius. The solvent mediated interactions $\varepsilon_{ij}'^\alpha$ for every pair of amino acids $i$ and $j$ is $U_{ij}^\alpha(r^*)$ . We next shift $\varepsilon_{ij}'^\alpha$ by a constant, $V_o$ ,

$\varepsilon_{ij}^\alpha = \varepsilon_{ij}'^\alpha + V_o \quad \text{eqn 2}$

where $V_o$ is obtained from a Threonine pair by setting $\varepsilon_{TT}'^\alpha$ from the simulation equal to $\varepsilon_{TT}^\alpha$ from the statistical potential of the same amino acid pair⁴¹.A Lennard-Jones potential (LJ), $V_{ij}^a(r)$ , was used to approximate the overall profile of $U_{ij}^a(r)$ ⁴⁴ and it is the energy function for the same type of amino acids in coarse-grained molecular simulation:

$V_{ij}^a(r) = \varepsilon_{ij}^a \left[ \left( \frac{r_{ij}^o}{r} \right)^{12} - 2 \left( \frac{r_{ij}^o}{r} \right)^6 \right] \quad \text{eqn 3}$

$\varepsilon_{ij}^a$ is the solvent-mediated interaction of an amino acid pair $i$ and $j$ in solvent type $\alpha$ . $r_{ij}^o$ is the bonding distance which will be further elaborated in Section III.1

1.2 Computation of pair correlation functions from all – atomistic molecular dynamics simulations with explicit solvent molecules

In order to fit the parameters of solvent-mediated interaction, $\varepsilon_{ij}^a$ , we computed the correlation functions of pairs of amino acids in $\sim 2\text{M}$ concentration for each type of amino acid solvated in explicit water. Amino acids were represented by the United Atom Model of the amber force field ff03ua⁴⁵ without caps (as it was the same construct in another study³⁶), and the water molecules were represented by the TIP3P model⁴⁶. Sodium and chlorine ions ( $\text{Na}^+$ , $\text{Cl}^-$ ) were added accordingly, in order to neutralize systems with charged groups. The cutoff for non-bonded interactions is 1nm. Electrostatic interactions were treated with the Particle Mesh Ewald method (PME)⁴⁷. The direct space cutoff for electrostatic interactions is 1nm, while the Fourier space part was computed using a grid spacing of $1\text{\AA}$ and cubic spline interpolation. A periodic cubic box of a fixed volume with a side $L = 40\text{\AA}$ was used for initial preparation of the system.The system went through several preparation steps before the production run. First, energy minimization was performed using the conjugate gradient method for 1000 steps. Then, each system was heated to 300K in an incremental temperature step of 100K under NVT condition for 50ps. To achieve adequate water density, we performed another equilibration step at the constant pressure of 1.0 atm and of temperature 300K (NPT), maintained by the Andersen algorithm⁴⁸. Finally, another 50 ps constant volume and temperature (NVT) equilibration was performed prior to the following NVT production run. In the production run, a 50 ns NVT simulation was performed.

To extract $g^{\alpha}{ij}(r)$ of each amino acid pair $i$ and $j$ from the simulation, we created a histogram from the distances among the different amino acid pairs with a bin size of 0.1Å. The distance between two amino acids is measured from the heavy atom that is closest to the center of mass of a side chain (Table 1) of each amino acid. This procedure was performed for all 210 combinatorial amino acid pairs, yielding a matrix of interactions $\varepsilon{ij}^{1aq}$ .

We repeated the same procedures to extract the solvent mediated interactions, $\varepsilon_{ij}^{ur}$ , for each amino acid pair in the presence of 8M urea. The urea molecules are represented by the same AMBER force field as the one used for the amino acids.

2. COARSE-GRAINED MOLECULAR SIMULATION ON TRP-CAGE

Coarse-grained (CG) molecular simulations were performed on a mini protein Trp-cage to investigate its folding energy landscape. Trp-cage has 20 amino acids (PDBID: 1L2Y) and it is represented by a side-chain $C_{\alpha}$ model (SCM), in which each amino acid is represented by two beads (excluding glycine) (see the Method Section in the Supplement). The stochastic dynamics to represent the solvent effects is represented by Langevin equations of motion⁴⁹ (see the MethodSection in the Supplement). In the aqueous solution (in absence of chemical interference) the nonbonded interactions follow the Betancourt-Thirumalai statistical potential⁴¹ in which threonine-threonine interaction was used as the reference to fit hydrophobic parameters obtained from partition chromatography experiments⁵⁰. In the 8M urea solution, the nonbonded interactions were taken from the Boltzmann inversion as described in Section 1.1.

In the production run, the Replica Exchange Method (REM)⁵¹ was applied to enhance sampling efficiency by incorporating multiple copies (replicas) of molecular simulation at a broad range of temperatures. Exchanges between neighboring replicas $i$ and $j$ are accepted with a probability:

$P_{acc} = \min\{1, \exp[(\beta_i - \beta_j) \cdot (U(r_i) - U(r_j))]\} \text{ eqn 4.}$

where $\beta = 1/k_B T$ and $U(r)$ represents the potential energy of the system. 16 replicas are distributed between 280K and 580K. An in-house version of AMBER6⁵² is used, where the Langevin equations of motion are integrated in a low friction limit. Each replica produces ~200,000 statistically significant conformations sampled with time separations greater than one characteristic correlation time⁵³.

3. RECONSTRUCTION OF ALL-ATOMISTIC PROTEIN STRUCTURES FROM COARSE-GRAINED MODELS

3.1 Reconstruction algorithm: Side-Chain $C_\alpha$ model to All-atom (SCAAL)

In order to reconstruct a desired all-atomistic structure from its coarse-grained SCM model, its bead positions were used as a part of harmonic constraints and applied to an all-atomistic protein template in the energy minimization step, using the conjugate gradient method.The prototype of SCAAL for the reconstruction of several compact structures was introduced in our prior study⁶. For each residue, $C_\alpha$ positions from the SCM were used as position constraints for $C_\alpha$ in the backbones from the all-atomistic template. The constraints on a side chain were imposed on a heavy atom with the closest proximity to the center of mass of the side chain which is typically less than 1Å. Selection of the chosen side chain heavy atom for each amino acid is listed in Table 1.

Next, through the energy minimization, harmonic constraints imposed by the chosen beads bring the all-atom template to the desired structure without the need for building a protein from individual atoms. The spring constants $k_\alpha = 1000\text{kcal/mol}\cdot\text{\AA}^2$ are used for the constraint applied on a $C_\alpha$ bead, while for a side chain bead is $k_b = 40\text{kcal/mol}\cdot\text{\AA}^2$ . These values are derived after extensive testing to ensure that the reconstructed structures are accurate and do not experience steric frustrations. The concept of using an atomistic “template” for protein reconstruction is depicted schematically in Fig. 2a and the flowchart of the SCAAL reconstruction algorithm is illustrated in Fig. 2b.

Reconstruction of unfolded conformations from coarse-grained protein models: To test our reconstruction algorithm, a pool of 67 non-redundant tested proteins was selected from the protein data bank using the same criteria as in Feig et al⁵⁴ that is exclusively determined by X-ray diffraction experiments (resolution of 1.00 Å or better). The pdb ID’s of these structures are provided in the Supplement. Unfolded conformations of each tested protein are generated at high temperatures. To ensure such unfolded conformations are of little resemblance to the minimized native structures, we set the criteria of structural selection for unfolded structures where the RMSD (root-mean-square-deviation) of heavy atoms with respect to the minimizednative structure exceeds $10\text{\AA}$ and its radius of gyration is at least 15% greater than the minimized native state. Each unfolded conformation is coarse grained into an SCM model (keeping $C_\alpha$ and the center of mass of side chains) and reconstructed into all-atomistic protein models using SCAAL.

4. ALL-ATOMISTIC MOLECULAR SIMULATIONS FOR THE FOLDING ENERGY LANDSCAPE OF TRP-CAGE

In this study, we constructed the folding energy landscape of Trp-cage using two different methods: (1) Using the method of MultiSCAAL we developed in this work (Section 4.1), and (2) using the standard all-atomistic molecular dynamics with the Replica Exchange Method (AA-REMD) as described in Section 4.2.

4.1 MultiSCAAL

The purpose of MultiSCAAL is to enhance the sampling of all-atomistic molecular dynamics simulations by selecting a large set of initial conditions selected from the CG distributions. This scheme allows all atomistic molecular dynamics simulation, following all-atomistic force fields, to efficiently probe and refine the conformations that are strategically reconstructed from CG protein models. Our scheme is distinct from the concept of Resolution Exchange^{55, 56} that performs iterative conformation exchanges between the CG and AA molecular simulations. Instead, we concentrate on the proper selection of initial AA conformations based on CG protein model with knowledge-based energy function that can be adjusted to different environmental conditions as detailed below. First, the conformations sampled from the ensembles of CG molecular simulations in the presence and absence of urea were selected based on the Metropoliscriterion³⁸. The potential energy of the most populated configuration (global minimum in the free energy) in the CG ensemble is $E_{ref}$ , and the energy of state $i$ that is randomly chosen from the CG ensemble is $E_i$ . The acceptance probability $P_{acc}$ at temperature $T$ is defined as following:

$P_{acc} = \min(1, \exp(-\Delta)), \text{ with } \Delta = \frac{E_i - E_{ref}}{k_B T}, \text{ eqn. 5}$

$k_B$ is the Boltzmann constant. One in every 1000 conformations was randomly sampled from the simulation data and subsequently reconstructed into all-atomistic representations by SCAAL as initial conditions for subsequent all-atomistic molecular dynamics simulations.

Secondly, a reconstructed all-atomistic protein model was solvated by explicit water molecules in a periodic cubic box. Depending on the size of a Trp-cage, the sizes of each water/urea box were different for each conformation, ranging between 40Å to 45Å. If the structure was reconstructed from the coarse-grained molecular simulation with the energy function for urea condition, then an 8M urea box is used for solvation. Finally, after solvating the system, a short equilibration was run in a constant volume and temperature of 300K. All-atomistic molecular dynamics simulations of Trp-cage were run at 300K for 1-ns.

4.2 All-atomistic molecular dynamics with replica exchange method (AA-REMD)

A set of AA-REMD was performed on Trp-cage for aqueous conditions and with 8M urea, respectively. Simulations were performed at constant volume and constant temperature (NVT) conditions, in a periodic cubic box as in similar studies²⁷. The AMBER force field ff99SB⁵⁷ was used for Trp-cage, and urea and water molecules were represented by the TIP3P model⁴⁶. Each system initially underwent energy minimization using a combination of steepest descent and conjugate gradient steps, followed by heating to 300K under NVT conditions with a temperaturestep of 100K and then equilibrated at 300K in NPT conditions to adjust the correct solvent density in each system. Finally a set of short equilibration runs of 0.1ns is performed under NVT conditions in order to create the different initial conditions for the Replica Exchange simulations. For each system, different replicas are equilibrated at 40 temperatures between 280 K and 540K. Replica Exchange Method (REM) simulations were applied using these 40 different replicas, each with different initial conditions, generating ~50,000 different configurations from 5,000 exchanges where each exchange is attempted at 2,000 steps. The fraction of exchanged replicas over the attempted ones (exchange rate of REM) is ~20%.### III. RESULTS AND DISCUSSION

1. Relationship of the statistical potential and the energy function derived from the potential of mean force

One of the important steps for multiscale molecular simulation is to bridge the energy functions between protein models with different resolutions. There is a need to develop a reliable algorithm that can accurately transfer essential information such as chemical interference from high-resolution model to low-resolution model without losing its key features. In this regard, we adopted the method of Boltzmann inversion (eqn.1) that provides a link between the potential of mean forces derived from the all-atomistic molecular dynamics simulation and the energy function for the coarse-grained molecular simulation, which has been applied in other studies^{36, 37, 39}. However, computation of the potential of mean force from all-atomistic molecular dynamics simulation is inherently complex and inefficient. In addition, it is questionable whether the thermodynamic properties of coarse-grained molecular simulations can be accurately transferred by the solvent-mediated interaction derived from the potential of mean force. Here, we developed a new and simple approach using the statistical potential parameter table (SPPT) to resolve these issues.

We provided a method to calibrate a reference state of the energy function for coarse-grained molecular simulation (eqn.2) by incorporating statistical potentials from prior bioinformatics studies (eqn.3). Although the interaction parameters for all-atomistic molecular dynamics simulations and the statistical potentials for coarse-grained molecular simulations are both derived from different sets of experimental data, the two should be correlated and this relationship can be used to bridge the molecular simulations in different resolutions. Using this relationship, we can further develop a statistical potential parameter table (SPPT) of amino acidinteractions with chemical interference from all-atomistic molecular dynamics simulations with corresponding chemical conditions.

To determine the relationship of the two energy functions, first we performed all-atomistic molecular simulations of each amino acid pair in aqueous conditions and computed its pair correlation function $g_{ij}^{aq}(r)$ . Using a pair of threonine-threonine (TT) as an example, we calculated its pair correlation function, $g_{TT}^{aq}(r)$ shown in Fig. 3a. After Boltzmann inversion, the solvent-mediated interaction, $\varepsilon_{TT}'^{aq}$ , was obtained from its potential of mean force $U_{TT}^{aq}(r^*)$ (Fig. 3b), where $r^*$ locates the major peak of $g_{TT}^{aq}(r)$ in Fig. 3a. All of the possible 210 solvent-mediated interaction $\varepsilon_{ij}'$ between a pair of amino acid $i$ and $j$ in aqueous condition were obtained from the same framework as the one described above.

Next, we plotted $\varepsilon_{ij}'^{aq}$ against the energy function from the Betancourt-Thirumalai statistical potential⁴¹ of the same amino acid pair, $\varepsilon_{ij}^{aq}$ in Fig. 3. The two sets of parameters correlate very well with a linear regression coefficient 0.79. The distributions of $\varepsilon_{ij}'^{aq}$ and $\varepsilon_{ij}^{aq}$ are also similar (See Supplement Fig S1). The vertical shift of $\varepsilon_{ij}^{aq}$ can be offset by matching Thr-Thr interaction in both parameter sets with $V_o = 0.31 \text{ kcal/mol}$ . This conceptual paradigm was expanded from the Betancourt-Thirumalai study in which the statistical potentials of Miyazawa-Jernigan⁵⁸ were shifted using Thr-Thr interaction as a reference state to match experimental studies. By doing so, we established a reliable tool in transferring the energy functions in all-atomistic and coarse-grained molecular simulations, in which both parameters were tested against different experimental factors.In a Lennard-Jones potential $V_{ij}^{aq}(r)$ (eqn.3) for coarse-grained molecular simulation, $r_{ij}^o$ is the bonding distance between a pair of amino acid $i$ and $j$ . The potential was obtained by the measurement of the distance between the centers of mass of two interacting amino acids. The observed discrepancy between $r^*$ and $r_{ij}^o$ (Fig. 3b) is due to the fact that $r^*$ was sampled from amino acid pairs with uneven atomistic structures, while $r_{ij}^o$ was computed assuming a smooth spherical side chain. This difference exists in all amino acid pairs, but it is not dependent on solvent conditions, therefore, we simply keep $r_{ij}^o$ in coarse-grained molecular simulation for the energy function in Fig. 3c.

Once we established a reliable relationship of the energy functions between all-atomistic and coarse-grained molecular simulation, we were able to repeat the same procedure on amino acid pairs in the presence of 8M urea as the aqueous condition in order to obtain SPPT under the chemical inference of 8M urea. The solvent-mediated interactions, $\epsilon_{ij}^{ur}$ , derived from the potential of mean force, were offset to $\epsilon_{ij}^{ur}$ for the coarse-grained energy function by the same $V_0$ as the one used for the aqueous condition. $\epsilon_{ij}^{ur}$ for all pairs of amino acid $i$ and $j$ in the presence of 8M urea is listed in Table S1. $\epsilon_{ij}^{ur}$ is overall less than $\epsilon_{ij}^{aq}$ and reflects weakened interactions of amino acid pairs in the presence of 8M urea, leading to protein destabilization. The contact energies of the Betancourt-Thirumalai statistical potential are provided in Table S2 as reference.

2. Coarse-grained molecular simulation on Trp-cage under urea condition

With an energy function that can incorporate different conditions, we produced the folding energy landscape of Trp-cage using coarse-grained molecular simulations for both aqueous and 8M urea conditions (Fig. 5 a, b respectively). The free energy landscape at 300K isplotted as a function of the root-mean-square-deviation (RMSD), which provides a quantitative measure of conformational changes and the radius of gyration ( $R_g$ ) that determines the size of a protein.

In Fig.5a, the folded Trp-cage in water is in the region of $2.0\text{\AA} \leq RMSD \leq 3.5\text{\AA}$ and $7.3\text{\AA} \leq R_g \leq 8.0\text{\AA}$ . Under high concentration of urea in Fig.5b, the main free energy basin is in the region of $2.0\text{\AA} \leq RMSD \leq 3.5\text{\AA}$ and $7.7\text{\AA} \leq R_g \leq 8.5\text{\AA}$ . While the main basin shifts to higher values of $R_g$ in the presence of 8M urea, the population of structures with low $R_g$ decreased. In addition, the population of structures with $R_g > 8.5\text{\AA}$ is enhanced in the presence of denaturant conditions. We used a self-organizing neural net^{59, 60} clustering technique^{61, 62} to distinguish the structures without setting any prior structural assumptions.

Typical representative conformations sorted by the clustering method from the two basins are presented for water (Fig.6a) and for 8M urea (Fig.6b). The dominant structure found in the aqueous coarse-grained molecular simulations is similar to the NMR structure obtained from experiments²⁴. The radius of gyration of this ensemble is $\sim 7.6\text{\AA}$ which is very similar to that of the NMR structure of Trp-cage ( $\sim 7.5\text{\AA}$ ). The position of Trp6, as shown in Fig.6e, illustrates that the hydrophobic core of Trp-cage remains stable throughout the simulations.

In contrast, under 8M urea conditions the coarse-grained molecular simulations demonstrated that the size of the radius of gyration of the most dominant cluster grows ( $R_g \sim 8.2\text{\AA}$ ), compared to the aqueous simulations. In this state, the position of Trp6 is no longer “caged” by the hydrophobic residues of the protein (Tyr3, Leu7 Pro12, Pro17, Pro18, Pro19) as shown in Fig. 6b. This agrees with previous theoretical studies of Trp-cage that suggest a similar mechanism of unfolding for Trp-cage, where the structure does not unfold to a random-coil conformations but rather increases in size and exposure of the hydrophobic residues^{25, 26}.### 3. Reconstruction of all-atomistic Trp-cage structure from coarse-grained protein models

The accurate reconstruction of high-resolution protein models from low-resolution ones is a necessary step to incorporate chemical interferences in a multi-scale molecular simulation. Several strategies based on the search for a globally minimized state are used for reconstructing reduced protein to all-atomistic structures over the last years^63-65. Although a strong dependence on rotamer libraries in some of these algorithms has enabled fast and accurate reconstruction methods for the native state protein structures, it is questionable whether the accuracy of such methods can still be achieved or not for protein conformations far from their native states. This aspect is particularly important in the development of a multiscale molecular simulation approach in which protein structures produced at a condition overwhelmingly different from a globally minimized state (e.g. crystal structures). Without an accurate method for protein reconstruction in unfolded conformations, the validity of an algorithm for multiscale molecular simulation is jeopardized.

We tested the accuracy of reconstruction of unfolded proteins conformations using SCAAL (See Method section) against PULCHRA⁶³, in Fig.7. Both of the two methods require a $C_{\alpha}$ bead and the center of mass of side chain as input. However, the major difference lies in the fact that PULCHRA requires a side-chain rotamer library, which is obtained from the protein data bank for the reconstruction of side chains, while SCAAL uses a randomly selected all-atomistic protein conformation of a tested protein as a template for side chain reconstruction.

We produced unfolded protein structures (See Method section) using high temperature molecular simulation models, coarse-grained them into SCM low-resolution protein models and reconstructed them back to all-atomistic protein models using SCAAL and PULCHRA. We thencompared the RMSD between these reconstructed structures and the original chosen unfolded atomistic structures on a basis of residue types as an indicator for the accuracy of the two reconstruction methods in Fig.7. An overall performance regarding accuracy undertaken by SCAAL is much better than PULCHRA by 30%. The average RMSD constructed by SCAAL is within 1.33 Å, and it worked better on amino acids with larger side chains (e.g. R, H, W, K, F, Y, and E) than PULCHRA. This suggests that a reconstruction of side chains based on physical models using harmonic constraints will be more reasonable than the incorporation of rotamer libraries (based on information of native folded states) as far as the reconstruction of unfolded states is concerned.

4. Free energy landscape of Trp-cage produced from all-atomistics molecular simulations with REM and MultiSCAAL

A 2D-free energy landscape as a function of RMSD and $R_g$ of Trp-cage produced by all-atomistic molecular simulations with the replica exchange method (AA-REMD, see Method section) at 300K under aqueous and 8M urea is shown in Fig. 8 (a, b), respectively. In Fig.8a, there is a major population of conformation in aqueous solvent characterized by $RMSD \sim 2.5 \text{ Å}$ and $R_g \sim 7.5 \text{ Å}$ . In Fig.8b, the position of the major population shifts to $RMSD \sim 3.5 \text{ Å}$ and $R_g \sim 8.0 \text{ Å}$ in the presence of 8M urea, indicating that the population of swelled conformations induced by urea increases compared to the ones in water.

Using the same RMSD and $R_g$ parameters, a 2-D free energy landscape is also produced by all-atomistic molecular simulation with MultiSCAAL at 300K, under aqueous and 8M urea in Fig.8 (c, d) (See Method Section). In comparison to Fig.8(a, b), the area of free energy landscape explored by MultiSCAAL is broader in Fig.8(c, d), indicating enhanced sampling for both Trp-cage systems (in water and urea). Even though the position of the main basins in the free energy landscape are similar, the structural characteristics and ensemble distribution of Trp-cage sampled by the MultiSCAAL scheme contains a wider variety of structures than the ones in the AA-REMD simulations as indicated by greater value of RMSD. To further investigate the differences in the structural distribution in the free energy landscape obtained by the two methods (e.g. AA-REMD and MultiSCAAL) we used the clustering technique to distinguish the structures without setting any prior structural assumptions. The most dominant structures obtained from each method, under different conditions, are shown in Fig. 6 for visual comparison.

In the following sections, we investigated dominant ensemble conformations of Trp-cage in aqueous environment and 8M urea in detail:

(a) Aqueous environment: We used the clustering method ^{61, 62} to characterize the three most dominant structures in aqueous conditions sampled by AA-REMD and by MultiSCAAL as shown in Fig.9 at 300K. The first most dominant structures in the folded state with $R_g \sim 7.4 \text{ \AA}$ (Fig.9 a & d) obtained using both approaches are similar given that the calculated RMSD of the heavy atoms is $< 1.0 \text{ \AA}$ . Using a characteristic hydrogen bond between the hydrogen atom (HE1) of W6, and the oxygen atom (O) of R16 (bond W6-R16) that is unique in the folded state of Trp-cage (determined by NMR²²), we evaluated the dominant clusters of the folded states obtained from AA-REMD and MultiSCAAL by the presence of this hydrogen bond in clusters. For the AA-REMD, this hydrogen bond is present in 75% of the structures in the most dominant cluster and for the MultiSCAAL it is 78.3%. This result further supports the resemblance of the two most dominant clusters obtained from the two enhanced sampling techniques in an aqueous environment and they agree with experimental findings.In the second and third most dominant clusters obtained with both methods in aqueous solvent (Fig.9 b & c), we find that the structures from AA-REMD simulations are similar to the ones belonging to the most dominant cluster, ranging from 7.36 Å–7.49 Å. However, in the case of the MultiSCAAL simulations, while Trp–cage retains its folded conformation, the structures differentiate in the position of the side chains of the main hydrophobic residues in Fig.9(e & f) resulting in a richer population of structures with similar $R_g$ but higher values in RMSD due to a wide range of orientation of side chains. Sampling all these changes in different orientations is a very difficult problem in all-atomistic molecular simulations with explicit water models.

(b) 8M urea: In the dominant clusters, the unfolded states of Trp–cage obtained from all atomistic molecular simulation with REM (AA-REMD) and MultiSCAAL differ in the extent of packing of Trp6 against prolines (Fig.10 a & d). For both dominant clusters the radius of gyration increases to $R_g \approx 8.10$ Å in the presence of 8M urea which agrees with previous studies^{24, 25} where an increase of ~10% of the radius of gyration induced by 6M urea was observed. Using the aforementioned W6-R16 characteristic hydrogen bond that is unique to the folded state of Trp-cage, we found that its presence is only in 0.03% and 0.08% of the dominant structures obtained by AA-REMD and MultiSCAAL, respectively (Fig.10 a & d).

Although the sizes of the conformations generated by MultiSCAAL and AA-REMD are similar, the structural details are quite different. From the conformations obtained by the MultiSCAAL approach, the sidechain of Trp6 (an indole group) can completely flip outward and exit from the hydrophobic core of a Trp-cage protein as illustrated in Fig.10d. In addition, when comparing the second and third most dominant clusters from both methods, we can see that in the case of REM, the sampled structures are similar, with $R_g \sim 8.09$ Å and $R_g \sim 8.10$ Å respectively (Fig. 10 b & c). In contrast, the structures in the second and third dominant clusters from theMultiSCAAL simulations are different with $R_g \sim 7.89 \text{ \AA}$ and $R_g \sim 8.20 \text{ \AA}$ , respectively (Fig.10 e & f). In order to evaluate the quality of these structures, we turn to a direct comparison with available experimental data in the next section.

5. Comparison with NMR NOE distances

In a recent experimental NMR study of Trp-cage at 6M urea²⁴, NOE distances were determined between the protons of the side chain (indole group) of Trp6 and the following amino acids: Ile4, Leu7, Pro12 and Arg16. It was found that under high concentration of urea and at low temperature (278K) the majority of these NOE distances were shorter than the ones in the native state of Trp-cage, even though the radius of the molecule as obtained from diffusion experiments²⁴ was found to be $\sim 8 \text{ \AA}$ , that is not far from the size of Trp-cage in aqueous condition. In this regard we calculated the NOE distances between the same proton pairs as in these experiments, for the three most dominant clusters obtained from both AA-REMD and MultiSCAAL simulations under 8M urea condition (Table 2) as a gauge to validate the accuracy of the folding energy landscape of Trp-cage under high content of urea obtained by two different methods. Although the urea concentration in our study is slightly greater than the one used in the experiments, it is still representative of a Trp-cage under high concentration of urea.

In the case of the clustered Trp-cage structures obtained from AA-REMD the majority of the NOE distances greatly deviate from the experimental values, as seen in Table 2. In contrast, the computed NOE distances of the dominant cluster of the Trp-cage structures obtained from MultiSCAAL exhibit remarkable similarity to the one measured by the NMR experiments, in spite of the small difference in the concentration of urea between the experiment and simulations. Ensemble structures sampled by MultiSCAAL match the distance constraints of theNMR experiments better than the ones obtained by AA-REMD, indicating the improved sampling in MultiSCAAL. Examining the Trp-cage structure in the dominant cluster from the MultiSCAAL simulation, we determined that a key factor that contributes to the shortening of the NOE distances is the exit of the side chain of Trp6 outside of the hydrophobic core (Fig. 11). This topological arrangement reduces the distances between proton in Trp6 (HE3) and Ile4 (HG2) to 6.1 Å, which is much closer to the experimental value (4.5 Å), than the distance of 8.0 Å in the ensemble structures sampled by AA-REMD simulations.

We further investigated the reason behind this topological arrangement of the Trp6 by measuring the number of water and urea molecules that are present in the hydrophobic pocket of the protein (e.g. by following the water and urea molecules shared by amino acids Trp6 and Arg16). In the average ensemble structures of Trp-cage in the dominant cluster sampled by AA-REMD, there are 0.53 water molecules and 1.94 urea molecules that appear in the hydrophobic pocket while in the average ensemble structure of Trp-cage in the dominant cluster sampled by MultiSCAAL there are 0.97 water molecules and 1.86 urea molecules in the hydrophobic pocket. This suggests, that there is a better chance for the presence of water molecules inside the hydrophobic core when the side chain of Trp6 (indole group) points away from the core as being adequately captured in the MultiSCAAL simulation and this structural feature can better account for the short NOE distance between Trp6 and other amino acids. Recently, it was demonstrated in a computational study²⁵ that in order to match the three NOE distances between Ile4 and Trp6 from NMR experiments²⁴ under 6M urea at 278K, the system had to be heated to 360K where the unfolding procedure could be expedited. Because MultiSCAAL incorporates coarse-grained protein models with a fewer number of side chain beads, it is easier to use coarse-grained models for exploring different orientations of side chain positions. This avoids the high entropic cost ofrearranging solvent water molecules around side chains when Trp6 exits the hydrophobic core, so that sampling efficiency of protein conformations solvated by explicit water molecules can be greatly enhanced. We believe that it is for this reason that the structures sampled by MultiSCAAL simulation scheme can better match with the experimental NOE distances at a temperature closer to the experimental condition.

6. Overall algorithm performance of MultiSCAAL

We benchmarked the performance of the MultiSCAAL against AA-REMD by measuring the computational time in terms of CPU hours (CH) on 160 CPUs for each AA-REMD simulation on a Linux cluster (AMD Opteron 2.3GHz) at the University of Houston. In order to produce a 50-ns all-atomistic molecular simulation of Trp-cage using AA-REMD a total of 200,000 computing hours (CH) was needed.

All-atomistic molecular simulations with MultiSCAAL consist of three parts. In the first part, the coarse-grained molecular simulation with REM requires ~20,000 CH to complete. In the second part, for the protein reconstruction, the computational burden is negligible compared to the total length of the simulations (< 100CH). In the final part, 1-ns standard all-atomistic molecular dynamics simulation for each reconstructed protein structure as initial conditions takes ~64,000 CH. Upon the completion of MultiSCAAL, 1,280-ns of simulation data is produced in ~84,000 CH. MultiSCAAL simulation provided a considerably enhanced sampling efficiency and lower computational cost than the standard AA-REMD simulations with the total simulation length being ~25 (1280/50) times greater in less computational hours.

IV. ConclusionWe developed an effective MultiSCAAL method that connects protein models in different resolutions and integrates the potential of mean force from all-atomistic molecular dynamics simulations and energy functions for coarse-grained molecular simulations. Our method can be used to enhance the sampling efficiency of protein conformations in a broad range of phase space, particularly in the presence of chemical interference that would be otherwise very difficult to simulate by molecular dynamics alone. Using Trp-cage as a protein model, the folding energy landscape in aqueous condition and at high concentration of urea constructed by MultiSCAAL is broader than AA-REMD, indicating that MultiSCAAL is able to produce a wider distribution of ensemble structures. The NOE distances of the ensemble conformations under high concentration of urea in the dominant structural cluster obtained by MultiSCAAL matched better with NMR experiments than AA-REMD. MultiSCAAL is effective because it uses a reduced representation of side chain beads in a coarse-grained protein model without explicit solvent molecules and this allows it to explore different side-chain orientation much faster than all-atomistic protein models in the presence of excessive water molecules. The algorithm of MultiSCAAL provides genuine transition between high-resolution and low-resolution protein models which helps overcome local entropic traps due to solvation of large side chains in different orientations. In addition, the energy function in the presence of urea for coarse-grained molecular simulations is built from the potential of mean force of all-atomistic molecular dynamics simulations, Boltzmann inversion method and statistical potential and that expedites the computational process needed for equilibrium. Modeling and computation of protein interactions in a cell has been a big challenge, however, MultiSCAAL is a promising and effective sampling scheme we developed for multiscale investigation of chemical interference on protein interactions.## V. Acknowledgement

MSC would like to thank the National Science Foundation (MCB 0919974) and the University of Houston for the support of this research. AS would like to thank the UH Writing Center for improving the readability of this manuscript. We also thank the Texas Advanced Computing Center and the Texas Learning and Computation Center at UH for providing computational resources.**Tables:**Table 1. List of the heavy atoms used in SCAAL to represent the side chain of each amino acid.

Amino acid	Side chain atom	Amino acid	Side chain atom
GLY	-----	ASN	C_γ
PHE	C_γ	ASP	C_β
TRP	C_δ2	THR	C_β
TYR	C_γ	ALA	C_β
GLN	C_γ	GLU	C_γ
SER	C_β	PRO	C_β
VAL	C_β	ARG	C_δ
ILE	C_γ1	HIS	C_γ
LEU	C_γ	MET	S_δ
LYS	C_δ	CYS	S_γ

**Table 2.** 1^st row in bold: NOE distances reported in reference²⁴, from photo–CIDNP (Chemically Induced Dynamic Nuclear Polarization) experiments in 6M urea. Rows 2 to 4: Distances calculated between the same protons as in the experiments, for the first, second and third dominant clusters from AA-REMD simulations in 8M urea. Rows 5 to 7: Distances calculated between the same protons as in the experiments, for the first, second and third dominant clusters from MultiSCAAL simulations in 8M urea.

	Trp6 HE3 Ile4 HB	Trp6 HE3 Ile 4 HG2	Trp6 HE3 Ile 4 HD1	Trp6 HE3 Leu 7 HD1	Trp6 HE3 Leu 7 HD2	Trp6 HH2 Pro 12 HG2	Trp6 HE1 Arg 16 HB2	Trp6 HE1 Arg 16 HB3
NMR (Å)	4.2	4.5	4.5	4.0	3.4	3.3	4.2	3.8
AA-REMD (Å) 1^st	8.0	8.0	8.3	3.95	8.1	4.4	7.0	7.9
AA-REMD (Å) 2^nd	8.6	9.5	8.9	3.7	7.6	4.3	7.0	6.7
AA-REMD(Å) 3^d	8.8	9.8	8.8	3.9	8.0	8.3	10.1	11.2
MULTISCAAL (Å) 1^st	6.8	6.1	6.6	4.0	6.5	4.0	3.8	3.6
MULTISCAAL (Å) 2^nd	8.6	8.8	8.6	3.8	9.6	6.3	5.4	6.5
MULTISCAAL (Å) 3^d	7.6	8.5	9.5	4.1	9.1	8.5	7.4	6.2

### Captions:

Figure 1: A schematic diagram in a multiscale algorithm where a protein configuration switches from all-atomistic (AA) to coarse-grained (CG) representation and vice versa. A side chain- $C_\alpha$ model (SCM) is used as a coarse-grained model. The reconstruction of a protein in an AA representation from CG representation is achieved by SCAAL. The Lennard-Jones (LJ) parameters for an AA representation follow an AMBER force field, while for a CG representation they follow a statistical potential based on bioinformatics and the potential of mean force from the AA molecular dynamic simulations via Boltzmann inversion method. The dynamics of an AA protein is governed by the Newtonian equations of motion. The dynamics of a CG protein is governed by the Langevin/Brownian equations of motion.

Figure 2: (a) A schematic representation of the SCAAL reconstruction method with the use of an all-atomistic protein structure as a template and the positions of coarse-grained side-chain- $C_\alpha$ model (SCM) as harmonic constraints. (Left) $C_\alpha$ beads are in red and the heavy side-chain beads are in yellow. The two beads hold the positions through harmonic constraints for a projected reconstructed all-atomistic protein model. A randomly chosen all-atomistic protein structure that can be far from the crystal structure is introduced as a structural template and shown in a solvent accessible surface area mode. (Right) After the structural reconstruction by SCAAL, an all-atomistic representation of a projected protein structure is created (myoglobin, PDBID 1A6M, is used for illustration). (b) A flow chart of the SCAAL reconstruction method.

Figure 3: (a) Pair correlation function $g_{ij}^{aq}(r)$ for Thr-Thr (solid line) and Val-Val pairs (dotted line) derived from all-atomistic molecular dynamics simulations in aqueous condition. $r = r^*$

Xet Storage Details

Size:: 53.8 kB
Xet hash:: 18d7d38d04500ba158ebca5d474ce368e4fac83f54a332d7e5e69869e92c1af1

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.