PhD thesis submitted November, 2016. This work is supported by Imperial College London, and performed as
partial fulfilment of the PhD in Molecular Biosciences, Faculty of Natural Sciences, Department of Life Sciences,
Division of Cell and Molecular Biology, Imperial College London, United Kingdom.
Nicolas Edmond Jean Génin is with Imperial College London, United Kingdom (corresponding author to provide,
phone: 0044-7453275275; e-mail: [email protected]).
He is under the supervision of Dr. R. Weinzierl, and co-supervision of Prof. M. Buck and Dr. A. De Simone, with
access to the Sir Alexander Fleming Building facilities, South Kensington Campus.
Investigation of the nucleotide triphosphate
diffusion into the active site of RNA Polymerase
N. E. J. Génin
PhD thesis submitted to Imperial College London
in partial fulfilment for the degree of
PhD in Molecular Biosciences
November 2016
Declaration of originality
I hereby declare the work presented in this thesis to be original and to belong solely to the author, except
where stated otherwise, in which case it is rigorously referenced to the best of the author’s knowledge.
Copyright declaration
The copyright of this thesis rests with the author and is made available under a Creative Commons
Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit
the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that
they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear
to others the licence terms of this work.
Abstract
RNA Polymerase can be seen as a mobile molecular structure orchestrating the movement of substrate
NTP and nucleic acids, regulated by some control molecules (transcription factors) and the sequential
interplay of the enzyme domains. For the last 15 years, loading of rNTPs into the active site of the
enzymatic complex has been regarded more or less as a settled issue. Based on the first generated crystal
structures, substrates were thought to load via a pathway termed secondary channel (CH2). This
well-accepted paradigm regarding a fundamental aspect of the transcription process is refuted and a new
model, relying on overlooked structural characteristics (CH3, CH4), and accommodating a large body
of pre-existing information, is presented. Important implications notably include the fact that CH2 is
mainly an exit channel, and that NTPs are selected prior to delivery into the catalytic center. Overlapping
partially with the new loading hypotheses, details about substrate discrimination, error recovery and the
translocation mechanism, which has been an open question in the domain for the past 20 years, are
discussed. Accelerated and Steered Molecular Dynamics simulations are performed and provide insight
into the dynamics of the diffusion process. In-depth conformational and electrostatic analyses are
discussed and allow the propensity for substrate accommodation to be gauged.
Acknowledgements
The author is particularly grateful to Dr. R. Weinzierl, supervising this project, for his continuous
support, for his precious guidance, and for having given the opportunity to the author to undertake this
doctoral project. The author is thankful to Prof. M. Buck and Dr. A. De Simone, co-supervising this
project, for their advice and support. Thanks are also due to post-doctoral researcher Dr. C. Amin who
worked in his group for her help and support. Finally, gratitude is expressed to Imperial College London
alumni students J. Wingfield and A. Valeva, for their support.
Table of Contents
List of tables
List of figures
List of abbreviations
Chapter 1: Literature review
1. Introduction
2. Secondary channel theory
3. Main channel theory
4. Non-controversial properties of CH2 and dynamic error correction
5. The ratchet issue
6. The melting issue and details on cTFs
7. Considerations on nucleotide selection
8. Discussion
9. Concluding remarks
Chapter 2: MD methods
1. Introduction
2. Metabolite pool
3. Forcefields
4. Accelerated MD simulations
5. Steered MD simulations
Chapter 3: Elongation Complex reconstruction
1. Introduction
2. 3D Rotation
3. Illustrative case: adding a single nucleotide
4. Transformations
5. Principle application: constructing a complete EC
6. Closing remarks
Chapter 4: Advanced Characterization of the Diffusional Pathways
1. Introduction
2. Geometric pathway analysis
2.1. Introduction
2.2. Principle of the algorithm
2.3. Detailed description of the algorithm
2.3.1. Refine starting point
2.3.2. Virtual sphere scan
2.3.3. Walk forward along pathway axis
2.3.4. Convert COM map to distance bins
2.3.5. Calculate cross section area
3. Electrostatic analysis
Chapter 5: Results and Discussion
1. Introduction
2. Simulation summary
3. Results
3.1. Diffusional zones
3.2. CH2 Analysis
3.3. CH3A Analysis
3.4. CH3B Analysis
3.5. CH3C Analysis
3.6. CH3D Analysis
3.7. CH4 Analysis
3.8. Misloading recovery investigation
4. Discussion
5. Future works
6. Conclusions
References
Appendix 1
Appendix 2
List of tables
Table 1: Comparison of nucleotide base discrimination between several studies for enzyme with deleted
TL domain.
Table 2: Comparison of nucleotide ribose discrimination between several studies for enzyme with
deleted TL domain.
Table 3: RNA nucleotides to be added.
Table 4: DNA nucleotides to be added.
Table 5: Alignment of an entire template helix to three reference anchoring points.
Table 6: aMD simulation summary.
Table 7: sMD simulation summary.
List of figures
Figure 1: Cross section through Sc RNAP II.
Figure 2: Cutaway view of rNTP loading via CH2.
Figure 3: CH3 access to the main channel.
Figure 4: Electrostatic Fork melting mechanism.
Figure 5: Comparison of FL2 interaction with downstream DNA in Tt RNAP.
Figure 6: Comparison of FL2 interaction with downstream DNA in Sc RNAP II.
Figure 7: TFIIS shielding of RNAP II secondary channel.
Figure 8: 5’-3’ direction of DNA extension.
Figure 9: 3’-5’ direction of DNA extension.
Figure 10: Backbone extension template for both the 5’-3’ and the 3’-5’ directions of DNA extension.
Figure 11: Nucleotide attachment to the DNA backbone host in the 5’-3’ direction.
Figure 12: Nucleotide attachment to the DNA backbone host in the 3’-5’ direction.
Figure 13: Schematic diagram of the first rotation transformation to align a nucleotide backbone to be
incorporated on DNA 5’ end.
Figure 14: Schematic diagram of the second rotation transformation to align a nucleotide backbone to
be incorporated on DNA 5’ end.
Figure 15: Translation transformation attaching the aligned backbone to DNA 5’ end.
Figure 16: DNA nucleotide and backbone references to attach a new base group on the 5’ end.
Figure 17: Schematic diagram of the first rotation transformation to align a nucleotide base group to be
incorporated on DNA 5’ end.
Figure 18: Schematic diagram of the second rotation transformation to align a nucleotide base group to
be incorporated on DNA 5’ end.
Figure 19: Schematic diagram of the translation transformation attaching a new base group to DNA 5’
end backbone.
Figure 20: Schematic diagram of missing nucleotides in PDB#2E2H.
Figure 21: Comparison fit between initial downstream tDNA structure and superposed extended helix.
Figure 22: Comparison fit between initial downstream ntDNA structure and superposed extended helix.
Figure 23: Visualization of downstream DNA reconstruction.
Figure 24: Initial fitting of upstream ntDNA.
Figure 25: Visualization of the initial fitting of ntDNA template relative to the enzymatic structure.
Figure 26: Second fitting of upstream ntDNA.
Figure 27: Visualization of the second fitting of ntDNA template relative to the enzymatic structure.
Figure 28: Mutation of ntDNA template nucleotides to match Table 4 sequence.
Figure 29: Fitting of missing RNA nucleotides.
Figure 30: vdw representation of the full nucleic complex before potential energy minimization.
Figure 31: vdw representation of the full nucleic complex after potential energy minimization.
Figure 32: Schematic diagram of the main dimensions of a pathway.
Figure 33: Schematic diagram of a pathway cross section layer.
Figure 34: Pathway axis of an irregular channel.
Figure 35: Schematic diagram of the visualization through a pathway.
Figure 36: Projection of pathway points onto a tested direction.
Figure 37: Axis scan.
Figure 38: Contour scan.
Figure 39: Interlining atoms extraction.
Figure 40: Virtual sphere scan method.
Figure 41: Virtual sphere scan pathway axis detection.
Figure 42: Cross section area calculation.
Figure 43: CH2 and corridor pathways.
Figure 44: CH3 view from CH2.
Figure 45: Side view of CH3.
Figure 46: Side view of CH3C, CH3D and CH4, relative to CH2.
Figure 47: Front view of CH3C, CH3D and CH4.
Figure 48: Side view of CH3C, CH3D and CH4, relative to CH4.
Figure 49: Bottom view of CH3D entrance to CH3.
Figure 50: Front, side and back view of CH2 pathway axis.
Figure 51: CH2 minimal radius along diffusional path heatmap.
Figure 52: CH2 cross section area along diffusional path heatmap.
Figure 53: CH2 Electrostatic NTP interaction along diffusional path heatmap.
Figure 54: CH2 force-distance plot.
Figure 55: Front and side view of TL closing of opening CH3A.
Figure 56: Front, side and back view of CH3A pathway axis.
Figure 57: CH3A minimal radius along diffusional path heatmap.
Figure 58: CH3A cross section area along diffusional path heatmap.
Figure 59: CH3A Electrostatic NTP interaction along diffusional path heatmap.
Figure 60: CH3A force-distance plot.
Figure 61: Front, side and back view of CH3B pathway axis.
Figure 62: CH3B minimal radius along diffusional path heatmap.
Figure 63: CH3B cross section area along diffusional path heatmap.
Figure 64: CH3B Electrostatic NTP interaction along diffusional path heatmap.
Figure 65: CH3B force-distance plot.
Figure 66: GTP bound at CH3B entrance.
Figure 67: Longitudinal view through CH3C.
Figure 68: Side view of CH3C pathway axis.
Figure 69: CH3C minimal radius along diffusional path heatmap.
Figure 70: CH3C cross section area along diffusional path heatmap.
Figure 71: CH3C Electrostatic NTP interaction along diffusional path heatmap.
Figure 72: CH3C force-distance plot.
Figure 73: NTP diffusion through CH3C state 1.
Figure 74: NTP diffusion through CH3C state 2.
Figure 75: NTP diffusion through CH3C state 3.
Figure 76: NTP diffusion through CH3C state 4.
Figure 77: NTP diffusion at CH3D entrance.
Figure 78: CH4 force-distance plot.
Figure 79: Pre-translocation protein re-adjustments occurring near the active site.
Figure 80: Mechanistic basis for pre-translocation.
Figure 81: Schematic representation of EC-RNAP coordination with substrate diffusion trajectory.
Figure 82: Schematic representation of on-pathway state 1.
Figure 83: Schematic representation of on-pathway state 2.
Figure 84: Schematic representation of on-pathway state 3.
Figure 85: Schematic representation of on-pathway state 4.
Figure 86: Schematic representation of off-pathway state 1.
Figure 87: Schematic representation of off-pathway state 2.
Figure 88: Schematic representation of off-pathway state 3.
Figure 89: Schematic representation of off-pathway state 4.
Figure 90: Schematic representation of off-pathway state 5.
Figure 91: Schematic representation of off-pathway state 6.
Figure 92: Schematic representation of off-pathway state 7.
List of abbreviations
RNAP: RNA Polymerase
Sc: Saccharomyces cerevisiae
Ec: Escherichia coli
Tt: Thermus thermophilus
Ta: Thermus aquaticus
Mj: Methanocaldococcus jannaschii
WT: Wild Type
EC: Elongation Complex
BH: Bridge Helix
TL: Trigger Loop
FL2: Fork Loop 2
SW2: Switch 2 domain
TN: Transition Nucleotide
TF: Transcription Factor
cTF: cleaving Transcription Factor
NAC: Nucleotide Addition Cycle
DS: Downstream
A site: Active site
E site: Entry site
PS site: Pre-insertion site
tDNA: DNA template strand
ntDNA: DNA non-template strand
NTP: nucleoside triphosphate
NMP: nucleoside monophosphate
NDP: nucleoside diphosphate
rNTP: ribo nucleoside triphosphate
cNTP: cognate ribo nucleoside triphosphate
ncNTP: non-complementary ribo nucleoside triphosphate
dNTP: deoxy nucleoside triphosphate
dNMP: deoxy nucleoside monophosphate
ATP: adenosine triphosphate
GTP: guanosine triphosphate
CTP: cytidine triphosphate
UTP: uridine triphosphate
TTP: thymidine triphosphate
A: adenine
G: guanine
C: cytosine
U: uracil
T: thymine
PPi: inorganic pyrophosphate molecule
Pi compound: molecule formed by the association of multiple pyrophosphates
aMD: accelerated Molecular Dynamics
sMD: steered Molecular Dynamics
MD: Molecular Dynamics
VMD: Visual Molecular Dynamics
GPU: Graphic Processing Unit
CPU: Central Processing Unit
PDB: Protein Data Bank
PDB#: Protein Data Bank accession code
PME: particle mesh Ewald
vdw: van der Waals
CH1: Main channel
CH2: Secondary channel
CH3: Tertiary channel
CH3A: Tertiary channel opening A
CH3B: Tertiary channel opening B
CH3AB: Section of the tertiary channel formed by opening A, B and the tertiary channel itself
CH3C: Tertiary channel opening C
CH3D: Tertiary channel opening D
CH4: Quaternary channel
COM: Point lying on a pathway axis
Chapter 1
Literature Review
1. Introduction
RNA Polymerase is a nanoscopic machine located inside the cell nucleus, which is responsible for
transcribing sections of DNA information into mRNA. During the synthesis process, the NTP substrates
enter the molecular machine and reach a zone called the active site where they are assembled into an
RNA chain. According to the widely accepted paradigm, the substrates load into the catalytic center via a
pathway termed “secondary channel” (also referred to as CH2 in this thesis). This channel is
located beneath the active site and consists of a narrow corridor (≈ 7-12 Å in diameter, ≈ 15 Å in length)
that leads directly to the active site cavity and, towards the outside of the enzyme, opens into a
large conical section called the “funnel”, which occupies about two thirds of the pathway length. Access to the
active site from the secondary channel is enabled when the trigger loop is bent into an open conformation
and when the EC is in the post-translocated state (i.e. the RNA 3’ end closes against the BH) [Gnatt, et
al., 2001; Wang, et al., 2006], and is disabled otherwise. TL refolding reduces the dimensions of the
secondary channel at the entrance to the active site from 15 × 22 Å in the open conformation to 11 × 11
Å in the closed conformation [Vassylyev, et al., 2007B]. “Pore” is usually used to refer either to the
narrow corridor or to the entire tunnel. For clarity, in this review, “sec. channel” (CH2) or “pore”
will be used to refer to the entire tunnel and “corridor” for the narrow pathway in proximity of the A
site. The theory according to which the NTPs primarily load into the active site via this pathway will be
referred to as the sec. channel theory (CH2 theory). RNAP also possesses a main channel, which will
also be referred to as CH1, allowing the insertion of the DNA inside the enzymatic complex. The main
channel is delimited by the two largest sub-units, Rpb1 and Rpb2, and the Rpb5 sub-unit. It forms an elbow-shaped
corridor across the crab-claw-like shape of the enzymatic complex separating the jaws of the
claw, and comprises a downstream section (which accommodates 12-13 base-pairs of downstream DNA
[Naryshkina, et al., 2006; Kireeva et al., 2010]) and an upstream section [Semenova, et al., 2005;
Kashkina, et al., 2007]. The sections intersect at the catalytic center [Semenova, et al., 2005; Kashkina,
et al., 2007]. The DNA bases are incrementally channeled from the downstream to the upstream
direction during NAC. During translocation (forward movement of the enzyme on the nucleic acids),
the DNA strands are unwound at the downstream boundary of the main channel and rewound at the
upstream edge of the elbow shaped channel [Naryshkina, et al., 2006; Kireeva et al., 2010]. Also during
the process, the upstream tDNA strand is associated with the RNA transcript and forms an RNA-DNA
hybrid (8-9 base-pairs long), which resides at the beginning of the upstream channel near the upstream
boundary of the transcription bubble [Naryshkina, et al., 2006; Belogurov, et al., 2009; Kireeva et al.,
2010]. The RNA chain when separating from the hybrid is extruded through a pathway termed RNA
exit channel [Vassylyev, et al., 2009]. An alternative theory for the diffusion of NTPs to the catalytic
site has proposed that the primary route of substrate diffusion would be via the main channel (termed
main channel theory or CH1 theory in this review).
Figure 1: Cross section through Sc RNAP II. tDNA, ntDNA, RNA and GTP in the A site, are shown in lime,
light blue, cyan and red respectively. RNAP II surface is shown in gray. The secondary and main channels
are indicated by dark blue and yellow dashed rectangles respectively. Enzyme structure is PDB#2E2H
([Wang, et al., 2006]).
Both theories agree on the NAC two-metal ion mechanism (molecular operations that are involved in
the polymerization reaction). The consensus proposition is the following. The nucleotide addition step
is presumed to involve two Mg²⁺ ions, one stably associated with the enzyme (MgA), located on an Rpb1
aspartyl residue at the entrance of the corridor (from the active site), and the other only transiently associated (MgB),
entering with the NTP [Cramer, et al., 2001; Kettenberger, et al., 2003; Wang, et al., 2006]. Prior to
catalysis, the MgB²⁺ ion binds to O⁻ atoms of the incoming NTP polyphosphate tail and forms an NTP–
MgB complex [Sigel, et al., 2005; Langelier, et al., 2005; Maoileidigh, et al., 2011]. If the incoming
NTP (called NTP + 2) is the correct nucleotide, the complex is allowed to bind to the insertion site (MgA
site), while MgB binds to an aspartyl residue located near the active site [Abbondanzieri, et al., 2005;
Maoileidigh, et al., 2011]. NTP + 2 is then hydrolyzed producing nucleoside monophosphate (NMP)
and pyrophosphate (PPi) [Abbondanzieri, et al., 2005; Maoileidigh, et al., 2011]. MgB is coordinated
by the β and γ phosphates of NTP + 2 (in reality an NMP) [Stano, et al., 2002; Langelier, et al., 2005].
MgA interacts with the pyrophosphate 3’-OH group of NTP + 1 on the RNA 3’ end, thereby lowering its
affinity for the hydrogen, to activate the -OH group for nucleophilic attack on the α-phosphate of NTP
+ 2 where MgB is located [Steitz, et al., 1998; Stano, et al., 2002; Langelier, et al., 2005; Abbondanzieri,
et al., 2005; Landick, et al., 2005; Maoileidigh, et al., 2011]. This results in the formation of a
phosphodiester bond. The PPi molecule (β and γ phosphates of the NTP + 2) and the MgB ion form a
MgB-PPi²⁻ complex (usually referred to as PPi for convenience). PPi is then expelled through the
secondary channel and the polymerase translocates along DNA and the RNA transcript to free the
nucleotide addition site (register +1), allowing for binding of the next NTP. The sequential order
between PPi release and translocation is currently a matter of debate. According to [Martinez-Rucobo,
et al., 2013], the NAC was elucidated with NTP-containing EC crystal structures of RNAP II and of
bacterial RNAP.
In this review, I will first investigate the secondary channel theory, before considering the elements of
the alternative theory. Then the non-controversial properties of the secondary channel together with
dynamic error correction processes partly involving the latter channel will be examined in order to raise
potential implications for our investigation about the substrate mode of diffusion. I will then discuss one
of the main issues disputed in published literature which concerns the translocation model. The model
seems indeed particularly important to decide between the two substrate modes of entry. Thereafter, the
availability of DS registers discussed in the melting issue sub-section will be investigated, before raising
implications for transcription factors (TF) and substrate diffusion. How nucleotides are discriminated
will next be discussed, and we will see how the mechanism fits in each substrate loading model. Finally,
a general discussion will be undertaken.
2. Secondary channel theory
In 1999, the first mention of the secondary channel as a possible pathway for NTP diffusion to the active
site was made simultaneously, in the September issue of the journal Cell, by Zhang et al. [Zhang, et al.,
1999] and Fu et al. [Fu, et al., 1999], based on the observation of the newly generated x-ray
crystallography data of bacterial RNAP and eukaryotic RNAP II at 3 and 5 Å resolution respectively.
The postulate was proposed because the active site appeared directly connected to the exterior of the
enzyme through the secondary channel, and the latter seemed to be the only unobstructed pathway for
NTP diffusion. The hypothesis was subsequently restated by numerous researchers, based on the
generation and observation of T7 RNAP, T. thermophilus RNAP and S. cerevisiae RNAP II x-ray
structures [Korzheva, et al., 2000; Cramer, et al., 2000; Cramer, et al., 2001; Gnatt, et al., 2001; Bushnell,
et al., 2002; Vassylyev, et al., 2002; Westover, et al., 2004A; Kettenberger, et al., 2004; Temiakov, et
al., 2004; Temiakov, et al., 2005; Wang, et al., 2006].
The first sets of evidence in favor of the secondary channel theory came from the fact that NTPs were
observed pre-bound at the entrance of the corridor in proximity of the active site, indicating that NTPs
travelled through the CH2 pathway. In 2003, a non-template entry site (E site) for pre-binding of the
NTP substrate prior to NAC was first hypothesized by [Sosunov, et al., 2003]. From their biochemical
experiments, the researchers observed increased fluorescence (which was directly correlated to
nucleotide imprisonment in the enzymatic complex) when non-complementary nucleotides were
inserted. This was interpreted as a nucleotide binding phenomenon in a non-template site, as the active
site could normally only accommodate complementary nucleotides. However, other biochemical studies
have suggested that NTPs could bind to an allosteric or non-template site in the main channel, which
could explain the increased fluorescence stated above without validating CH2 as the main diffusion path
(details in further paragraphs). In 2004, Westover et al. [Westover, et al., 2004A] extended the
diffraction limit of RNAP II crystals to 2.3 Å, allowing a finer inspection of the complex. A
mismatched NTP was directly observed bound to a site adjacent to the A site, in the secondary channel,
and consequently the hypothesis was raised that nucleotide selection includes an initial binding to an
entry site beneath the active center [Westover, et al., 2004A]. The entry site (E site) hypothesis was
reinforced by Wang et al. [Wang, et al., 2006] in 2006 on the basis of additional crystallographic data.
Figure 2: Cutaway view of rNTP loading via CH2. tDNA, RNA, GTP in PS site, GTP in E site and GTP in
A site are shown in light blue, lime, orange, hashed purple and yellow respectively. Mg²⁺ ions are
represented as black spheres. MgB site is shared between the PS and the E site bound nucleotides. The
pathway represented on the figure is the corridor section of the secondary channel leading to the active site.
Protein wall surface is represented in grey. The figure combines structural information of PDB#1R9T for
the E site [Westover, et al., 2004A], PDB#2O5J for the A site, [Vassylyev, et al., 2007B] and PDB#2PPB for
the PS site, [Vassylyev, et al., 2007B].
In 2004 and 2005, Temiakov, et al. in [Temiakov, et al., 2004] and [Temiakov, et al., 2005], and
Kettenberger et al. in [Kettenberger, et al., 2004], using Fourier Electron Density map calculations
applied to RNAP complexes cocrystallized with a non-hydrolyzable NTP analog, discovered a
preinsertion site to which the NTP substrate was thought to bind before accessing the insertion site
where it undergoes catalysis. Although these results could seem in line with the E site postulate described
above, some important distinctions are to be made. First, the preinsertion site (PS) is located differently
from the E site. Indeed, the PS site is located at register i + 1 where the incoming NTP
binds. The orientation of the register in the preinsertion state is such that the bound NTP is oriented
towards the secondary channel and the polyphosphate tail could therefore be partially inserted and/or
bound there, even though the i + 1 register resides in the A site. As such, only a small fraction of the PS
site can be considered as overlapping the secondary channel. In contrast, the E site resides entirely
outside the active center. Second, the PS site hypothesis does not validate CH2 (secondary channel)
theory, as the NTP could be carried there by pre-binding to tDNA, whereas the E site postulate does
seem to validate CH2 theory, as the only obvious access to the site is via the pore.
In 2004, Mukhopadhyay et al. [Mukhopadhyay, et al., 2004], observed that the insertion of the peptide
microcin J25 led to transcription inhibition in bacterial RNAP. Inhibition was partially competitive with
NTPs (e.g., high concentrations diminished inhibition) leading the researchers to the conclusion that the
toxin molecule interfered at the level of NTP delivery or NTP binding. Because the authors found that
microcin J25 fitted inside CH2, obstructed it almost perfectly and appeared to block passage of an NTP
molecule, they proposed that impediment of substrate diffusion to the active center was part of the
inhibition function: “MccJ25 inhibits transcription by interfering with NTP uptake by binding within
and obstructing the RNAP secondary channel—acting essentially as a cork in a bottle”. It follows that
the hypothesis according to which the secondary channel served substrate loading was reinforced.
Further evidence for CH2 accommodating substrate uptake was provided by results from
Holmes et al. in 2006 [Holmes, et al., 2006]. They found that D675Y and D675V substitutions in Ec
RNAP reduced transcription fidelity. Because the residue is located inside the secondary channel, at
some distance from the catalytic center, the researchers proposed that it played a role in
electrostatically filtering incoming substrates. While still considering that NTPs could diffuse via
multiple routes, they postulated that NTPs would load via CH2 at least sometimes.
In addition to the biochemical and structural evidence supporting the secondary channel theory, namely the
existence of an E site that could bind NTPs in a preliminary step and the delivery of substrates through the
secondary channel, a probabilistic model based on computational diffusion simulations from Batada et al.
[Batada, et al., 2004] seemed both to reinforce the plausibility of the E site hypothesis and to validate
CH2 as a plausible diffusion pathway, while also yielding informative details about the diffusional
properties of the channel. If the secondary channel serves as the main entry route for NTPs,
the structure of the pathway would be expected to play a role in NTP diffusion to the active site and in
substrate discrimination. In their publication, Batada et al. studied the effect of the pore topology and
electrostatics on NTP diffusion. Their MD simulations allowed them to calculate that the topology of
the pore alone (i.e. restriction due to the funnel opening and pore walls), in the absence of an electrostatic
potential, reduced the rate of NTPs accessing the A site by a factor of 1/16800. They also found that the
corridor had a strong negative electrostatic potential, reducing the rate of NTPs accessing the E site (note
that the authors considered electrostatic impediment for diffusion to the E site and not the A site) by a
factor of 1/300. According to Batada and colleagues, this induced a total restriction in NTP diffusion by a
factor of (1/16800) × (1/300) ≈ 2 × 10⁻⁷. Combining this result with the 10¹² s⁻¹ M⁻¹ collision rate between
RNAP and NTPs and a 1 mM concentration of substrate (assumed physiological) seemed to allow
successful diffusion to the A site at a rate of 200 NTPs per second. Because of steric requirements for
binding, the authors then suggested that successful delivery would be reduced by one order of
magnitude, hence 20 NTP s⁻¹, or even by two orders of magnitude, i.e. ≤ 20 NTP s⁻¹. The authors then
stated that their calculated rate of ≤ 20 NTP s⁻¹ was consistent with the ≈ 10 NTP s⁻¹ synthesis rate by RNAP
II in vivo. From their MD simulations, Batada et al. also calculated an enhanced NTP diffusion rate to
the A site in case of prior NTP binding to the E site (with a minimum transient binding time of 10 ns
calculated from chemical dissociation constants). These results seemed to improve their diffusion model
and appeared consistent with the E site hypothesis.
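The arithmetic behind the rate estimate attributed to Batada et al. can be retraced with a short sketch. The numerical values are those quoted in the paragraph above; the function and variable names are illustrative assumptions and do not come from the original publication.

```python
# Back-of-the-envelope reproduction of the delivery-rate estimate quoted in the text.
# All numbers are those cited above; names are illustrative only.

TOPOLOGY_FACTOR = 1 / 16800      # restriction from the funnel opening and pore walls
ELECTROSTATIC_FACTOR = 1 / 300   # restriction from the negative potential of the pore
COLLISION_RATE = 1e12            # RNAP-NTP collision rate, in s^-1 M^-1
NTP_CONCENTRATION = 1e-3         # 1 mM substrate (assumed physiological), in M


def delivery_rate(steric_orders_of_magnitude: int = 0) -> float:
    """Estimated NTP delivery rate in NTP per second.

    steric_orders_of_magnitude is the number of factors of ten by which
    steric binding requirements are assumed to reduce successful delivery.
    """
    restriction = TOPOLOGY_FACTOR * ELECTROSTATIC_FACTOR
    raw_rate = COLLISION_RATE * NTP_CONCENTRATION * restriction
    return raw_rate / (10 ** steric_orders_of_magnitude)


if __name__ == "__main__":
    for orders in (0, 1, 2):
        print(f"steric penalty of 10^-{orders}: {delivery_rate(orders):.1f} NTP/s")
```

Without a steric penalty the estimate is about 198 NTP per second, which the text rounds to 200; one and two orders of magnitude of steric penalty give about 20 and 2 NTP per second respectively, both within the quoted bound of ≤ 20 NTP s⁻¹.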
Another Molecular Dynamics investigation confirmed that the secondary channel was the most suitable
route for accommodating substrates [Zhang, et al., 2015A]. A comparative conformational analysis with
the program CAVER ([Chovancova, et al., 2012; Kozlikova, et al., 2014; Pavelka, et al., 2016]) between
the main and the secondary channel was carried out, and it was concluded that the latter was more
suitable to accommodate NTP substrates. It was also proposed that a substrate remaining in the funnel
is energetically more favorable than if it lies within the main channel, because of decreased Coulombic
repulsion.
3. Main channel theory
The first evidence in favor of the main channel theory arose from the 2001 study from Foster, Holmes
and Erie. By using alternative biochemical transient-state kinetic techniques, the group measured the
kinetics of single NTP incorporation steps as a function of NTP concentration for Ec (Escherichia coli)
RNAP [Foster, et al., 2001]. In their first experiment, they measured the rate of CMP incorporation as a
function of CTP concentration, where CTP is the next nucleotide (templated NTP) to be incorporated.
They noted that the substrate-saturation curve representing CMP incorporation kinetics as a function of
CTP concentration had a quadratic dependence on CTP concentration. It follows that the
kinetics are biphasic (not hyperbolic as expected from the secondary channel paradigm) and thus that
RNAP must contain a second NTP binding site in addition to the catalytic site. This second site acts as an
allosteric site whose occupancy accelerates the incorporation of the templated NTP, so that the next NTP to be
added (CTP) is both the substrate and the allosteric effector. In another experiment, they measured the rate of CMP
incorporation as a function of different concentrations of ATP, GTP and UTP (and with low CTP
concentration for matters of experimental convenience to force a control incorporation state). The
kinetics this time showed that non-templated NTPs did not affect the rate of incorporation, indicating
template specificity for the allosteric function of the binding site. Finally, they measured the kinetics of
AMP incorporation (where AMP is the next nucleotide to be added) as a function of AMP-CPP
concentrations, which showed that the templated but non-incorporable ATP analog accelerated AMP
addition (i.e. activated transcription to the fast state). From these important results the following
conclusion can be drawn: RNAP possesses an allosteric binding site in addition to the catalytic site, where
templated but not mismatched NTPs increase the rate of NTP incorporation, and this allosteric
site probably resides downstream of the template DNA chain in the main channel. This confirmed an
early hypothesis by Nierman et al. ([Nierman, et al., 1980], cited by Foster et al.) drawn from the study
of transcription initiation kinetics stating that RNAP may contain an NTP binding site in addition to the
catalytic site. It was also postulated that NAC can occur in either a fast or a slow state (consistent with a
publication from Davenport et al. in 2000 [Davenport, et al., 2000], cited by Foster et al.), with the
transition to the fast state being induced by NTP binding to the allosteric site (tDNA i + 2 site). In
2003, Holmes and Erie presented compelling new evidence in favor of a secondary binding site in the
main channel [Holmes, et al., 2003]. They assembled mutant DNA templates and observed that the DNA
sequence one base pair downstream from the site of NTP addition affected the rate of subsequent NTP
incorporation. In 2003, Nedialkov, Burton et al. found results consistent with the main channel theory
using pre-steady state kinetics [Nedialkov, et al., 2003]. A running two-bond protocol was built, in which the
experiment followed four ECs termed C40, A43, G44 and G45, which corresponded to
standard elongation positions. The C40 EC is advanced to A43 by adding specific concentrations of NTPs.
After stalling briefly, A43 establishes a steady-state distribution between a paused and an active EC.
When GTP is added, the active A43 EC moves to the G44 and
G45 positions, where the rapid rates of elongation make it possible to reproduce the synthesis rates experimentally.
In this setup, G44 rates indicate recovery from a stalled A43 position, and G45 rates indicate processive
elongation from G44 to G45 (including RNA-DNA hybrid and tDNA translocation). As such, these EC
positions capture snapshots of the steps corresponding to critical sequential processes of the NAC. For
example, translocation and pyrophosphate release are thought to occur between the synthesis of the G44
and G45 bonds (G44 corresponds to the synthesis of a first bond attaching substrate NTP to the growing
RNA chain and G45 corresponds to the synthesis of the next incorporation bond), and if only G44 or only G45 is
monitored, information about translocation could be distorted. The reaction pathways
are stimulated with the elongation factors TFIIF and HDAg (hepatitis δ antigen, an elongation stimulant).
Monitoring the formation rates of the A43, G44 and G45 EC positions as a function of GTP substrate
concentration led to the following observations. Recovery from a stalled EC and processive transition
from one bond (incorporation event) to the other can be highly dependent on the incoming NTP,
indicating that NTPs could pre-bind to a non-catalytic site in the main channel and play a role in driving
and/or triggering translocation. Furthermore, it is to be underlined that, inconsistent with the secondary
channel theory and confirming the results published in 2001 from Erie et al. [Foster, et al., 2001] and
Palangat et al. [Palangat, et al., 2001], the measured rates of NTP incorporation as a function of NTP
concentration did not reflect a hyperbolic dependence. In 2004, using the same RNAP II ECs as above
(notably A43, G44 and G45), in conjunction with TFIIF (which stimulates forward translocation) and
TFIIS (which appeared to improve the quality of the kinetic data by promoting RNA
cleavage and restart), Zhang and Burton [Zhang, et al., 2004] monitored the kinetic pathway between
the key transcription steps embodied by the control EC positions. In other words, they evaluated the
dependence between translocation and nucleotide addition in the interval of two bonds (two nucleotide
incorporation events). By using new quench techniques, they were able to measure the rate of substrate
tightening to the active site (termed G44 isomerization, correlated to the enzymatic complex confining
the active site and detected with EDTA quench) prior to phosphodiester bond formation (termed G44
chemistry, detected with HCl quench). The G44 isomerization state reflects substrate accessing the A
site. At higher GTP concentrations, EDTA quench rate curves for G44 isomerization were biphasic,
consistent with the NTP allosteric effect depicted above. Also, because the isomerization rate proved to
be rapid and not rate-limiting, they concluded that at high GTP concentrations, elongation kinetics were
not dependent on GTP loading. Instead, they found that template-dependent binding of substrate NTP
was coupled with the completion of the previous NAC (indicating that NTPs must pre-bind in the
pre-translocated EC), with the rate-limiting steps being translocation and PPi release. The results appeared
consistent with an NTP-driven translocation mechanism, where downstream substrate NTPs pre-bound in
the main channel have a functional effect on subsequent NTP incorporation, and inconsistent with the
secondary channel theory, which requires rapid Brownian ratchet translocation and rapid PPi expulsion.
Furthermore, the computational diffusion calculations of Batada et al. [Batada, et al., 2004] indicated that
loading at ≤ 20 NTP s⁻¹ is rate-limiting, but in the Zhang and Burton study, one of the measured stable NTP loading rates was
1450 ± 330 s⁻¹, indicating that loading was not rate-limiting for human RNAP II. Finally, in line with
their transient-state kinetics data from 2003 [Zhang, et al., 2003] and inconsistent with the secondary
channel theory, Burton et al. observed that the occlusion of the pore with TFIIS did not appear to hinder
NTP loading. In their 2005 publication [Gong, et al., 2005], using millisecond quench-flow kinetic
techniques (developed by the laboratory), Burton and co-workers obtained crucial results in favor of the
main channel theory with an experimental approach based on a phenomenon termed
isomerization reversal, whose principle is as follows. Translocation is blocked by α-amanitin
(mushroom toxin). High incoming NTP substrate concentrations (corresponding to the template i + 2
NTP), by promoting forward translocation on the EC blocked by α-amanitin, induce isomerization
reversal and dislodge (i.e. reverse the isomerization of the A site) the i + 1 NTP (isomerized i + 1 NTP
about to complete bond synthesis). This phenomenon is possible because tightening of the active site
(which is reversible) occurs before phosphodiester bond formation (which normally becomes
irreversible when PPi is released). Isomerization requires substrate sequestration in the A site, and
detection is allowed by the fact that the MgB ion of the i + 1 GTP becomes shielded from EDTA
chelation. Also, the metal ion not being inactivated by EDTA quench allows GTP to proceed to
phosphodiester bond formation. Quenching with HCl on the other hand stops the reaction instantly
giving precious information about the timing of the bond formation. The researchers experimentally
applied the principle as follows. A 40-CAAAGGCCTTT-50 template was used. Elongation was then
monitored between G44 and G45 (44 and 45 nucleotide RNAs ending in 3’-GMP) starting at a stalled
A43 EC, where G44 corresponded to an isomerized complex (substrate tightening in the A site), G45
corresponded to an incorporated NTP (GTP has formed the phosphodiester bond) and A43 represented
the post-translocated EC where the GTP substrate loads to the i + 1 and i + 2 sites. If an EDTA quench
was added, i + 1 GTP was inactivated, but not i + 2 GTP which was not protected from chelation by the
A site. In the continuing presence of i + 2 GTP substrates, at early EDTA addition (0.002 s),
isomerization was not detected, i.e. more G44 product was observed, but at prolonged EDTA quenching
(0.1 s), isomerization reversal was detected, i.e. more A43 product was observed, indicating that i + 2
GTP dislodged the catalytic GTP. Also, the slow convergence of the EDTA (isomerization time) and HCl
(bond synthesis time) curves indicated a coupling between translocation (hypothesized in their research
to be NTP-driven) and PPi release (which coincided with the end of phosphodiester bond synthesis),
because a high concentration of GTP-Mg²⁺ (detected in G45 by HCl quenching) appeared necessary to
force G44 bond completion. The following three results, obtained using the experimental principle
explained above demonstrate binding of substrate NTP in the pre-translocated EC at the i + 2 and i + 3
downstream sites. First, the researchers showed that i + 2 and i + 3 NTPs contribute to isomerization
reversal. With a 40-CAAAGCCTTT-49 template, i + 2 and/or i + 3 CTP stimulated isomerization, while
dCTP did not (indicating precise selectivity at downstream sites), neither did GTP, ATP, or UTP. With
a 40-CAAAGACTTT-49 template, both i + 2 ATP and i + 3 CTP contributed to i + 1 GTP expulsion,
but i + 2 ATP, i + 2 UTP, i + 2 CTP alone, or i + 2 ATP in conjunction with i + 3 UTP did not. Also, in
the presence of dCTP, the EDTA and HCl quench curves converged slowly, indicating that it is i + 2
and/or i + 3 CTP which drove G44 bond completion. Second, a dynamic error correction process was
postulated on the basis of the following experimental results. With a 40-CAAAGCCTTT-49 template, CTP
cancelled misincorporation of AMP for GMP (i.e. induced isomerization reversal of incorrect i + 1
AMP), but UTP did not. The researchers underlined that dynamic error correction occurs physiologically,
and not just in the presence of α-amanitin. Third, regulation of downstream template opening was
suggested. With a 40-CAAAGTCTTT-49 template, CTP or UTP alone did not appear to stimulate the
formation of the post-translocated A43 EC, indicating that the combination of i + 2 and i + 3 NTPs optimally
triggered its formation. In 2007, Burton and colleagues pursued their isomerization experiments [Xiong,
et al., 2007]. They showed that NTP substrates templated at the i + 2, i + 3 and i + 4 sites, but not mismatched
NTPs, matched dNTPs or matched NDPs, could induce isomerization reversal at the i + 1 site.
With a 40-CAAAGCCUUU-49 template, NTP binding at downstream sites was demonstrated because
accurately templated CTP and possibly UTP at i + 4, i + 5 and i + 6, had an effect on the fate of i + 1
GTP loaded in the active site. When 2.5 mM CTP and UTP were substituted with 5 mM ATP (an NTP
that is not accurately templated at adjacent downstream sites), isomerization reversal was significantly
reduced. Also, when CTP and UTP were replaced with CTP and ATP, the substitution of UTP with ATP
appeared to slightly reduce isomerization reversal, indicating a role for the i + 4 (UTP-templated) binding
site. A second experiment using the same template tested the requirements for occupancy of the i + 2 and
i + 3 CTP sites and of the i + 4, i + 5 and i + 6 UTP sites. The results were as follows. Reversal was
weak in reactions lacking CTP, strong in reactions containing GTP, CTP and UTP, and weak for the
combination using GTP and UTP but substituting CTP with dCTP or CDP. Also, dTTP, dUTP and UDP
did not stimulate reversal in the presence of CTP. In addition, the separation observed between
the EDTA and HCl curves seemed to indicate that at high CTP and UTP concentrations, CTP and UTP
induced increased translocation strain on the EC. In contrast, at low CTP and UTP concentrations, a
reduced translocation pressure was postulated to be applied against the translocation block by the
downstream NTPs. The latter corroborated a regulatory role for downstream NTPs in translocation. In
their 2008 study [Kireeva, et al., 2008], Kireeva, Burton et al. found that in mutant E1103G RNAP II,
the predominantly pre-translocated EC experienced a dramatic increase in NTP sequestration (at least
1200 isomerization events per second) as compared with the wild-type EC, which is inconsistent with
the maximum number of isomerization events that could be accommodated by NTP diffusion through the pore
according to Batada et al.’s diffusion calculations [Batada, et al., 2004]. In addition, the only way NTPs
could enter a predominantly pre-translocated EC would be during hypothetical short pre- to post-
translocated EC time windows (assuming the EC could oscillate between these positions), rendering the
successful diffusion through the secondary channel even less plausible. In their 2011 publication
[Kennedy, et al., 2011], Kennedy and Erie, using transient-state kinetics and a mutant of RNAP,
reported the following results. First, pre-incubating the complex with the NTP templated at the i + 2 site increased the
subsequent rate of NAC, suggesting the existence of an NTP allosteric site in the main channel. Second,
pre-incubating the complex with an ATP at i + 2 led to its rapid sequestration in the active site after the
incorporation of a second CTP nucleotide. This was detected by HCl/EDTA quench assays revealing an
accumulation of enzyme-substrate in the complex, and suggested that CTP was sequestered prior to its
incorporation. Also, EDTA quench measures indicated that the sequestered ATP was committed to bond
formation prior to incorporation of CMP. Therefore, the quench data indicated that RNAP can
simultaneously imprison CTP and ATP prior to incorporation of CMP, which seemed to indicate that
the ATP had to be sequestered in a non-catalytic site without being released from the enzyme after CMP
incorporation. Consequently, it was suggested that NTPs can bind to a site in the main channel (i + 2)
that is involved in the regulation of NAC. In a paper published in 2006, Holmes et al. [Holmes, et al.,
2006] observed that mutating Ec RNAP residues R678 and D814, which in the secondary channel
loading model appear to interact with the nucleotide phosphate group and to coordinate MgB bound on
the NTP, had virtually no effect on the transcription kinetics. This result seemed very inconsistent with
CH2 theory. In accordance with the results of the kinetic experiments described above, three
single-molecule studies [Abbondanzieri, et al., 2005; Larson, et al., 2012; Dangkulwanich, et al., 2013] seemed
to yield consistent information. Using an optical trap assay, the researchers measured the step magnitude
and velocity of translocation events, under assisting or opposing forces, from which they derived the
force dependence of the NAC. They found that the experimental force-velocity data supported a kinetic
model involving a secondary substrate binding site in the pre-translocated state. Finally, we will see in
chapter 5 that routes are available for substrate diffusion to the downstream section of the main channel,
where NTP pre-binding would take place. For the sake of argument, these additional pathways will be
referred to as the tertiary channel (CH3) in the rest of this chapter.
Figure 3: CH3 access to the main channel. tDNA i + 3, i + 2 and i + 1 registers are represented in yellow,
green and blue respectively. i and i - 1 registers are represented in red and are bound to the RNA chain
colored in orange. The GTP substrate in the active site and bound to i + 1 register is represented in pink.
Protein walls are represented as a grey surface. The enzyme structure is PDB#2E2H ([Wang, et al., 2006]).
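The distinction drawn in this section between a hyperbolic saturation curve and a curve with a quadratic dependence on substrate concentration can be illustrated numerically. The sketch below is only a minimal illustration: the Hill form with a coefficient of 2 is an assumption chosen for simplicity and is not the kinetic model fitted by Foster et al., and the parameter names and values are arbitrary.

```python
# Illustrative comparison of two saturation curves (arbitrary units).
# The Hill form with coefficient 2 is an assumed stand-in for a two-site mechanism;
# it is not the model fitted in the cited studies.

def hyperbolic_rate(s: float, k_max: float = 1.0, k_half: float = 1.0) -> float:
    """Single-site dependence: k_max * s / (k_half + s)."""
    return k_max * s / (k_half + s)


def cooperative_rate(s: float, k_max: float = 1.0, k_half: float = 1.0) -> float:
    """Two-site dependence (Hill coefficient 2): k_max * s^2 / (k_half^2 + s^2)."""
    return k_max * s**2 / (k_half**2 + s**2)


def fold_change_on_doubling(rate, s: float) -> float:
    """Ratio rate(2s) / rate(s): close to 2 in a linear regime, close to 4 in a quadratic one."""
    return rate(2 * s) / rate(s)


if __name__ == "__main__":
    low = 0.001  # substrate concentration far below k_half
    print("hyperbolic :", round(fold_change_on_doubling(hyperbolic_rate, low), 3))
    print("cooperative:", round(fold_change_on_doubling(cooperative_rate, low), 3))
```

At low substrate concentration, doubling the concentration roughly doubles the rate for the single-site curve but roughly quadruples it for the two-site curve, which is the kind of signature that distinguishes a quadratic dependence from a hyperbolic one. Both curves saturate at the same maximal rate.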
4. Non-controversial properties of CH2 and dynamic error correction
While the role of the secondary channel in channeling NTP substrates to the active center for catalysis
(which implies exchanging correct and incorrect substrates in and out of the torus structure) is a matter
of debate, it is accepted as an exit channel for incorrect NTPs and PPi.
At this stage, little information is available about the kinetics of incorrect substrate expulsion, but
recent studies have revealed interesting properties of the pore that are involved
in PPi expulsion.
The pore serves as an exit tunnel for two PPi release events: after NAC and after TFIIS/GreA/B cleavage
[Zhang, et al., 2004; Sims III, et al., 2004]. In 2011, Da et al. [Da, et al., 2011] investigated the kinetics
of PPi release on the microsecond timescale by applying a Markov state model (a predictive method that
estimates the pathway a system follows over a prolonged period of time from transitions between known
states) built from all-atom MD simulations and single-mutant simulations. They found that the PPi molecule
experienced a hopping behavior during its expulsion, where hopping sites at the inner extremity of the
pore in the active site and further down in CH2 accelerated the release. Conserved positively charged
residues, such as yeast RNAP II residues Rpb1 K518, 619, 620, 752 and H1085, were shown to offer
favorable electrostatic interactions with the negatively charged (Mg−PPi)²⁻ group, and to play an
important role in the expulsion. The authors note that all five residues are highly conserved among
species. Interestingly, K619 and 752 are located in the E site. Hence, the authors propose that these
residues, which could play a role in attracting the negatively charged substrate during NTP entry, could
have the double purpose of facilitating the expulsion of the negatively charged PPi molecule.
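To make the idea of a Markov state model more concrete, the following toy sketch computes the mean time for a ligand to leave a channel through two intermediate hopping sites. It is only an illustration of the general technique: the states, transition probabilities and lag time are hypothetical values chosen for the example and are not taken from Da et al.

```python
# Toy Markov state model of a ligand leaving a channel via intermediate "hopping" sites.
# States, transition probabilities and lag time are hypothetical illustration values.

STATES = ["active_site", "hop_site_1", "hop_site_2", "released"]
LAG_TIME_NS = 1.0  # assumed duration of one Markov step, in nanoseconds

# Row-stochastic matrix: T[i][j] is the probability of moving from state i to state j
# during one lag time. Each row sums to 1.
T = [
    [0.90, 0.10, 0.00, 0.00],
    [0.05, 0.85, 0.10, 0.00],
    [0.00, 0.05, 0.85, 0.10],
    [0.00, 0.00, 0.00, 1.00],  # "released" is absorbing
]


def mean_first_passage_steps(matrix, target, tol=1e-12, max_iter=1_000_000):
    """Mean number of steps needed to first reach `target` from each state.

    Solves m_i = 1 + sum over j != target of matrix[i][j] * m_j
    by fixed-point iteration, with m_target = 0.
    """
    n = len(matrix)
    m = [0.0] * n
    for _ in range(max_iter):
        new = [
            0.0 if i == target
            else 1.0 + sum(matrix[i][j] * m[j] for j in range(n) if j != target)
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new, m)) < tol:
            return new
        m = new
    raise RuntimeError("iteration did not converge")


if __name__ == "__main__":
    steps = mean_first_passage_steps(T, target=STATES.index("released"))
    for name, value in zip(STATES, steps):
        print(f"{name}: {value * LAG_TIME_NS:.1f} ns to release")
```

With these hypothetical numbers, release from the active site takes 42.5 lag times on average. In an actual study the transition matrix is estimated from many MD trajectories rather than written by hand.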
In 2013, Da and colleagues [Da, et al., 2013], using the same approach as above, studied
the dynamics of PPi release in Tt RNAP. They observed that the expulsion rate of the inorganic
pyrophosphate molecule was three-fold faster than in yeast RNAP II and occurred on a sub-microsecond
timescale. Similarly to the mechanism proposed for eukaryotic RNAP II, they found that PPi exit was
facilitated by favorable electrostatic interactions with basic residues in the secondary channel (K908,
912, 780, 1362 and 1369). The authors suggested that one of the causes of the faster expulsion dynamics
in the case of bacterial RNAP could be the shorter dimensions of CH2.
In addition to its diffusion properties, CH2 also has non-diffusion functions (non-controversial at this
stage): it is the RNA backtracking site and the TFIIS/GreA/B binding site. Unlike DNA Polymerase,
which requires alternative processes (notably the involvement of exonucleases), RNAP can backtrack
the nascent transcript (through the secondary channel) in order to correct
transcription errors or to allow regulatory pauses to occur. This embedded fidelity/regulatory mechanism
underlines the remarkable precision and efficiency of RNAP and makes the molecular machine a
masterpiece of engineering. First, the concept of RNAP backtracking, together with the latest postulates about the
molecular mechanisms underlying such a process, will be examined. Then the TFIIS and GreA/B TFs
which bind in the secondary channel and are involved in the RNAP error correction processes will be
presented. Other TFs (bacterial) which bind in CH2 include DksA and Gfh1.
RNAP enters an off-pathway state when it aborts processive transcription. This off-pathway state
can be subdivided into two states [Xie, 2012]. The first state is referred to as pausing or arrest and
corresponds to a brief suspension of transcription (1–6 s for multi-subunit RNAP) where RNA does not
normally backtrack [Nudler, et al., 1997; Shaevitz, et al., 2003], but where the elongation rate is regulated
[Xie, 2012]. Pausing is thought to be induced by signals coded directly into the DNA template, that is
to say, to be triggered by specific tDNA sequences [Herbert, et al., 2006]. The second state usually
involves prolonged pauses (> 20 s for multi-subunit RNAP) where the enzyme experiences backtracking
[Xie, 2012]. The process underlying the latter state is as follows. RNAP can literally rewind its forward
stepwise motion along DNA and RNA, and slide in the opposite direction on the nucleic acids in order to
reset the transcription mechanism several base pairs backwards or in order to expel a full aberrant RNA
chain. The roles of backtracking include transcription error recovery, control of transcription elongation
(a function slightly distinct from error recovery), recovery from pause-arrest, exposure of damaged DNA
for repair, termination of elongation, and initiation (where the enzyme cycles between several RNA
synthesis and extrusion phases until a 13-15 nucleotide long RNA chain has been successfully
synthesized) [Batada, et al., 2004; Vassylyev, et al., 2007A; Nudler, et al., 2012]. In such a process, the
DNA molecule can be directly extruded through the downstream main channel outside of the enzyme,
but the 3’ end of the nascent RNA transcript, being located at the center of the complex, needs a pathway
inside the RNAP for accommodating its retrograde motion. CH2 serves this very purpose as it connects
to the active site where the RNA 3’ end lies and offers an empty cavity for the transcript to be extruded.
According to Martinez-Rucobo and colleagues [Martinez-Rucobo, et al., 2013], RNA backtracking
through the secondary channel has been elucidated thanks to the direct observation of the phenomenon
in RNAP crystallographic data. According to Xie [Xie, 2012], knowledge about the characteristics of
transcription pausing arose from single-molecule studies of RNAP.
The backtracking state is triggered by a destabilized RNA–DNA hybrid [Nudler, et al., 1997; Shaevitz, et
al., 2003; Sosunov, et al., 2003; Greive, et al., 2005; Kireeva, et al., 2005; Zenkin, et al., 2006]. An
incorporation error leads to a weakening of the hybrid, which in turn increases the probability of
backtracking [Nudler, 2009]. The mechanism by which the hybrid loosens its contacts with the active
site has been theorized by Vassylyev et al. [Vassylyev, et al., 2007A] and Xie [Xie, 2012]. According
to the former group, when the hybrid is packed in the active site, it forms polar and van der Waals
interactions with conserved protein residues. They propose that the protein structure may act as a
shape-sensor of the hybrid, where an incorrect RNA sequence leads to increased repulsive van der Waals
interactions between the protein and the hybrid. The shape-sensor theory was foreseen in 2001 by Gnatt
et al. [Gnatt, et al., 2001]. In 2012, Bochkareva et al., using kinetic transcription assays, generated results
consistent with the shape-sensor theory [Bochkareva, et al., 2012]. Xie on the other hand proposes the
following model. During correct transcription elongation, the RNA-DNA hybrid is not unwound, which
positions the RNA 3’ end away from the secondary channel. However, if an incorporation
error occurs, the resulting mismatch in the nascent hybrid is likely to cause the RNA chain to lose its
canonical A form and to deviate from the DNA. This deviation could greatly increase the probability
of the RNA positioning itself in front of CH2, allowing its extrusion. The author also underlines that when the
RNA-DNA pair is not unwound, which corresponds to correct transcription, the 3’ end of the RNA chain
is positioned at the i site and structurally prevents the enzyme from translocating backwards. In support
of Xie’s model, a frayed RNA 3’ end has been observed in crystallographic structures containing a
misincorporated nucleotide [Sydow, et al., 2009A; Sydow, et al., 2009B; Wang, et al., 2009]. In addition,
Toulokhonov et al. in 2007 ([Toulokhonov, et al., 2007]) found results consistent with RNA 3’ end
fraying during the elemental pause state (probably preceding the other off-pathway states such as
backtracking). Nudler [Nudler, 2012] summarizes the mechanism by stating that incorrect substrate
pairing would facilitate backtracking and its own expulsion through the secondary channel, and therefore
backtracking may assist in NTP selection. In addition, the author proposes in [Nudler, 2012] and
[Nudler, 2009] that the trigger loop may play a role in allowing the backtracking process to occur. The
closed conformation of the trigger loop indeed depends on the accuracy of the loaded NTP. However, the extent
to which backtracking causes or is caused by the trigger loop conformational change does not seem fully
elucidated at this stage.
According to Wang et al. [Wang, et al., 2009], RNA backtracking is reversible for one or a few
nucleotides, but becomes irreversible afterwards. Transcription factors TFIIS for eukaryotic RNAP II
and GreA/B for bacterial RNAP have the ability to rescue an arrested RNAP in a backtracked state, by
cleaving off the RNA chain and facilitating transcriptional restart. Their mechanism of action is as
follows (reviewed in [Conaway, et al., 2003; Sims III, et al., 2004; Nudler, 2009; Cheung, et al.,
2011]). Both TFIIS and GreA/B TFs possess a long protrusion which inserts into the secondary channel,
with a tip referred to as NTD (coiled-coil N-terminal domain) reaching the active center. NTD is thought
to provide one basic and two acidic residues interacting chemically with the active site [Nudler, 2009].
The acidic residues interact with MgA and mobilize MgB, triggering a chemical reaction termed
pyrophosphorolysis (RNA hydrolysis, reverse of the polymerization reaction) resulting in the cleavage
of the RNA backtracked transcript. In other words, the factors allow separating the backtracked biased
chain from the non-backtracked chain, and this separation is done directly in the active site. The cleavage
reaction is driven by a two-metal-ion hydrolysis mechanism [Kettenberger, et al., 2003; Sosunov, et al.,
2003], which is identical to the two-metal-ion mechanism driving the NTP addition cycle, with the fine
distinction that MgA binds the +1 RNA phosphate to align the scissile bond, in contrast to its binding
of the RNA 3’-OH group during nucleotide addition [Cheung, et al., 2011]. The secondary channel can
accommodate both the transcript and the TF protrusion, while not impeding the expulsion of the
transcript. It is also hypothesized that the protein conformational changes induced by TF insertion
realign the RNA chain in the hybrid, therefore allowing forward elongation to resume [Kettenberger, et
al., 2004; Cheung, et al., 2011].
In 2011 [Cheung, et al., 2011] and 2013 [Martinez-Rucobo, et al., 2013], Cramer and colleagues
brought forward further informative details. They suggested that the NTD charged residues might catalyze
proton transfer during the cleavage reaction. The researchers found that the backtracked RNA was gated
from the secondary channel by a tyrosine residue. They postulated that during backtracking, the RNA
chain bypasses the gating residue until it binds to a site in the secondary channel, termed the backtrack site. They
proposed that TFs may facilitate reactivation by competing with the residues in the secondary channel
that bind the extruded transcript (therefore helping to detach the chain) and by locking the trigger loop
away from the transcript. Their findings help to refine what is known about the non-diffusional properties
of the secondary channel (e.g., to shed some light on CH2 residues forming part of the backtrack site).
An additional error recovery mechanism has also been described [Zenkin, et al., 2006; Sydow, et al.,
2009 A; Sydow, et al., 2009 B; Wang, et al., 2009; Martinez-Rucobo, et al., 2013] where the RNA chain
can backtrack its aberrant tailing residue in reaction to an incorporation error, but where the enzymatic
complex does not need to be rescued by a transcription factor. Instead, an intrinsic cleavage phenomenon
occurs. The backtracking motion results in the positioning of the nascent 3’ end at a position termed “P”
(for proofreading site) by [Wang, et al., 2009], which corresponds to the +2 site of backtracked RNA,
where hydrolysis of the scissile phosphodiester bond is stimulated by the favorable chemical
configuration of the active site. According to [Wang, et al., 2009], TFIIS-stimulated cleavage occurs
more than 100 times faster in vivo than intrinsic cleavage. Therefore, one can consider the
TF-stimulated cleavage as the main error recovery pathway. The intrinsic cleavage state is irrelevant to the
properties of the secondary channel, but is relevant for gauging dynamic error avoidance processes that could
occur in both the main channel and secondary channel loading models.
5. The ratchet issue
The hypothesis according to which NTPs load to the active site via CH2 in order to bind directly to the
DNA template register i + 1 was shown to be very consistent with a model of the translocation
mechanism termed the Brownian-ratchet model. The latter model is largely accepted and seems to
be confirmed by a large body of experimental evidence. The main channel theory, on the other hand,
seems inconsistent with one of the postulates of the Brownian-ratchet model, which has resulted in a lot
of controversy. In this section, we will demonstrate that the evidence does indeed validate most of the
Brownian-ratchet model. But a very important point will be raised: while the Brownian-ratchet model
is essentially correct, one of the two following axioms might be wrong: either the incoming NTP acts as the
ratchet bias in the active site, or the EC experiences several oscillations during processive
synthesis. We will show that the Brownian-ratchet evidence does not necessarily contradict the main
channel theory. In other words, while the secondary channel Brownian-ratchet model could be partially
erroneous, its main assumptions are probably right; a Brownian-energetic mechanism does seem to be
involved and is consistent with the main channel theory. We will first consider the translocation
background, theory and implications in general terms, then we will take a closer look at the problem.
Following an early postulate about thermal energy fluctuations powering molecular motors, Guardajo
and Sousa in 1997 [Guardajo, et al., 1997], as well as Oster and Wang in 2002 [Oster, 2002; Wang, et
al., 2002], proposed that RNAP translocation was driven by a Brownian ratchet. More or less at the same
time, the secondary channel theory was formulated. The assumption according to which NTP substrates
diffuse through the secondary channel and load in the active site of the post-translocated EC seemed
to be almost perfectly in line with the more general Brownian-ratchet model. Because the latter model
seemed to be validated by several lines of experimental evidence, it ironically seemed to validate the CH2 theory
in return. While the specific translocation model is still an open question at this stage, experimental
work generally agrees that translocation can oscillate (although whether it can oscillate in
the fast state, and whether the oscillations are rapid, is still disputed), and that the
Brownian molecular storm seems to be the source of energy of the powerful translocation mechanism
(RNAP can be viewed as force-generating for this reason). The latter assumptions seem supported by
strong structural [Gnatt, et al., 1997; Westover, et al., 2004A; Westover, et al., 2004B; Wang, et al.,
2006; Brueckner, et al., 2008; Vassylyev, et al., 2007A], biochemical [Komissarova, et al., 1997A;
Komissarova, et al., 1997B; Bai, et al., 2004; Bar-Nahum, et al., 2005; Guo, et al., 2006; Damsma, et
al., 2007; Brueckner, et al., 2008; Hein, et al., 2011; Maoileidigh, et al., 2011; Malinen, et al., 2012;
Nedialkov, et al., 2012; Imashimizu, et al., 2013], statistical [Wang, et al., 1998; Tadigotla, et al., 2006;
Yu, et al., 2012], single-molecule [Abbondanzieri, et al., 2005; Bai, et al., 2007; Larson, et al., 2012;
Dangkulwanich, et al., 2013] and Molecular Dynamics [Woo, et al., 2008; Feig, et al., 2010; Da, et al.,
2011; Silva, et al., 2014] evidence. Furthermore, details about specific protein domains involved in the
translocation process have emerged, such as the contribution of the TL [Wang, et al., 2006; Vassylyev,
et al., 2007A; Feig, et al., 2010], the BH [Tan, et al., 2008; Weinzierl, 2010A; Weinzierl, 2010B;
Weinzierl, 2011; Kireeva, et al., 2012; Silva, et al., 2014] and the FLoop [Miropolskaya, et al., 2014].
The most popular model, in line with the secondary channel theory, relies on an elegant and simple
concept. The elongation complex oscillates spontaneously between two states: post-translocation and
pre-translocation, and the binding of an NTP in the former state would constitute the ratchet bias. Forward
elongation is triggered by a single and simple event: cognate NTP loading to the active site in the post-
translocated EC. Movies summarizing the whole process have been presented by Cramer et al. in
[Brueckner, et al., 2009; Cheung, et al., 2012] and Silva et al. in [Silva, et al., 2014].
The detailed process is the following. In the absence of NTP in the A site, RNAP slides back and forth
on the nucleic acids frame structure within a single base-pair interval. The EC can be considered to
oscillate freely between two states: the pre- and post-translocation states. The post-translocation process
drives the EC from the pre- to the post-translocated state, where tDNA register i + 2 shifts above the
bridge helix into the active site and occupies the i + 1 register, and the i + 1 register slides towards the
RNA transcript occupying the i register bound to the RNA 3’ end. During the pre-translocation process,
i + 1 register shifts to i + 2 and i register to i + 1. The template register that oscillates between the i + 1
and i + 2 registers is called the transition nucleotide (TN). The latter tDNA base slides back and forth
above the bridge helix. The translocation processes and states are to be differentiated. The pre-
translocated state occurs after the pre-translocation process has been completed and is precisely reached
when the EC has formed a particular geometry: some protein conformational changes have occurred
such as the – 90° tDNA rotation and the straightening of the bridge helix. In the pre-translocation state,
access to the active site from CH2 is prevented because the RNA 3’ end (register i) has shifted into the
active site and because the bridge helix has partially invaded the active site. The post-translocation state
occurs precisely after the post-translocation process has occurred, when the tDNA strand has undertaken
a + 90° rotation, the bridge helix has adopted a bent conformation, and the TN facing the secondary
channel becomes available for base-pairing. If an NTP loads in the A site during the post-translocated
state, the backward oscillation of the TN is disabled and a new oscillation is enabled. The TN now at
register i + 1 cannot shift backwards anymore. Instead the post-translocated template base i + 2
(equivalent to pre-translocated base i + 3) becomes the new transition nucleotide. As such, the loaded
NTP has incremented the ratchet one base-pair forward. More precisely, the ratchet-bias behavior of the
NTP can be considered as follows. While the NTP is inserted in the catalytic center and polymerization
chemistry occurs, backward translocation is impeded. Therefore, the translocation oscillation is biased
towards the forward motion. While the substrate is being added to the RNA 3’ end, forward translocation
proceeds and the oscillation process is reset one template-base forward. Therefore, the nucleotide cycle
has occurred between two post-translocation events: the first one places the TN in the A site, the next
one shifts the next template register (i + 2) to the A site. It is also during this post-translocation 1 to post-
translocation 2 time window, that a base-pair in the DS bubble is melted (according to the main channel
theory, it would probably be i + 3 or i + 4), while a DNA pair is reassociated upstream. Interestingly,
and counter-intuitively, between post-translocation 1 and post-translocation 2, the EC will be
momentarily in the pre-translocated state (with the newly added substrate in the A site attached to the RNA
3’ end, and a kinked bridge helix) without having experienced any pre-translocation motion. It follows that
the pre-translocated state can be divided into two different categories: a transient pre-translocated state
between two post-translocation motions and a standard pre-translocated state resulting from a pre-translocation
motion. In the absence of substrate, the translocation process is not reset one step forward after the
shifting of the TN into the A site, because the unbound template register does not release the
upward pawl; instead, the EC oscillates between the pre- and post-translocated states, where the TN successively
enters and leaves the catalytic cavity. Concerning the location of the DNA registers, the following
consideration is useful. i + 2 base in pre-translocation (normal state) is equivalent to the i + 1 base in
post-translocation 1 (for free 1 increment oscillations), i + 2 in pre-translocation (normal state preceding
addition) is equivalent to i in post-translocation 2 (after addition of NTP) and i + 2 in pre-translocation
(transient state) is equivalent to i + 1 in post-translocation 2.
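The ratchet logic described above can be illustrated with a toy simulation. This is only a sketch under strong simplifying assumptions that are not taken from the literature: time is discretized, the EC flips between the two states at every step, and NTP capture in the post-translocated state is reduced to a single hypothetical probability `p_bind`. The function name and parameters are illustrative.

```python
import random


def simulate_ratchet(n_steps, p_bind, seed=0):
    """Toy two-state Brownian ratchet (illustrative assumptions only).

    At each step the EC either moves pre -> post (thermal forward motion),
    or, when already in the post state, it either traps an NTP with
    probability p_bind (the register advances by one) or slides back to pre.
    Returns (registers_advanced, oscillations_without_addition).
    """
    rng = random.Random(seed)
    state = "pre"
    register = 0
    idle_oscillations = 0
    for _ in range(n_steps):
        if state == "pre":
            # Free thermal motion brings the TN into the A site.
            state = "post"
        elif rng.random() < p_bind:
            # NTP trapped: backward motion is blocked and the ratchet
            # increments; the EC is left in the transient pre state.
            register += 1
            state = "pre"
        else:
            # No NTP: the TN leaves the A site again (standard pre state).
            state = "pre"
            idle_oscillations += 1
    return register, idle_oscillations


if __name__ == "__main__":
    for p in (0.0, 0.3, 1.0):
        print(p, simulate_ratchet(1000, p))
```

With `p_bind = 0` the complex only oscillates and never advances, which corresponds to the substrate-free case; with `p_bind = 1` every visit to the post-translocated state increments the register, which resembles a forward-locked regime.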
Otherwise, an immediate question that can be raised is why translocation oscillates on a single base-pair
increment. The answer is that RNAP cannot slide on an interval of several nucleotides because it is
locked between two pawls: the upward pawl consisting of the post-translocated protein geometry
including the previously added NTP and the downward pawl consisting of the pre-translocated protein
geometry. This explanation seems, however, inconsistent with the fact mentioned above that an
NTP addition occurs between two consecutive post-translocation events. The further explanation is that
when the incoming matched NTP loads in the post-translocated EC, it triggers protein conformational
changes that unlock forward translocation. Therefore, not only does the loaded NTP bias the ratchet
towards forward translocation, but it also temporarily inactivates the upward pawl and consequently
allows one more round of forward translocation. In the secondary channel theory, the EC experiences
several translocation oscillations until the A site is bound by an NTP, allowing RNAP to increment its
cognate register one base forward.
The main channel theory implies an already bound TN, in which case several translocation oscillations appear
inconsistent with the NTP-TN pair binding in the active site and acting as the ratchet bias, because
a single forward translocation would then be enough to block backward translocation. On the other hand, the secondary
channel theory is consistent with several translocation oscillations, where the incoming NTP binds the
TN after being loaded in the active center via the secondary channel and where such a binding acts as
the ratchet bias. It follows that for the main channel theory to be correct, one of the following must be
incorrect: either the EC does not oscillate but only proceeds forwards, or the ratchet bias is not located
in the A site but it is a binding event in the downstream bubble that biases the ratchet to post-
translocation. However, both scenarios are consistent with a Brownian-ratchet mechanism. In the first
case, the elongation complex could be simply locked to the post-translocation mode, where backward
translocation is forbidden, but where the base pair entering the active site allows the upward pawl to be
shifted. Therefore, it is almost equivalent to the Brownian-ratchet model. The second scenario resembles
even more the Brownian ratchet mechanism, with the fine distinction that the ratchet bias trigger point
is located in the downstream channel, not in the A site.
In [Holmes, et al., 2003], Holmes and Erie suggest that binding of the NTP in the downstream channel
facilitates translocation by locking the EC in the post-translocation mode. In other words, allosteric NTP
would abort the translocation oscillations of the EC between pre- and post-translocation. That is to say
that the EC would shift from post-translocated state 1 to post-translocated state 2. The EC would not
experience backwards motion where the TN shifts behind the bridge helix. In contrast, the TN would
shift in a unilateral direction: forward shift where the TN slides in front of the bridge helix and becomes
the i + 1 template register. As mentioned above, this model is consistent with the Brownian ratchet
model if one of the postulates is put aside: the EC does not necessarily oscillate. The Molecular Dynamics
observations of translocation oscillations (e.g., [Silva, et al., 2014]) could then be explained by the fact
that the observed enzymatic complex is not in processive elongation. Indeed, the main channel theory
is consistent with translocation oscillations during non-processive transcription, because the allosteric
effect of downstream NTPs could not be accounted for and/or because the complex is substrate free, not
allowing sequential energy redirection triggering events to occur (triggered by interactions with NTP).
Also, consistent with the EC not oscillating are the observations that the pre-translocated state is
dominant, when no or scarce template NTP is present (e.g., [Kireeva, et al., 2008], [Dangkulwanich, et
al., 2013]). The postulate that translocation only proceeds forward, in normal transcription (four NTPs
present in solution, fast state) seems more plausible than translocation oscillations. This is inferred
because an oscillating, already bound NTP-dNMP pair at the TN position seems to pose a few issues. For in and
out motions to occur, the entering NTP would need to avoid binding to the A site (binding of MgB to Rpb1
D481 and Rpb2 D837, and binding of the NTP to the MgA site). For this to happen, the NTP polyphosphate tail
would need to be shielded from the A site. It is difficult to explain how this could occur, even when
considering the hypothesis that the PPi from the previous NAC stays in the A site for a while and
plays the role of a shield, or the hypothesis that the A site is shielded by the active center geometry (e.g., by
the TL). An alternative solution could be that translocation oscillations are so strong that binding to the
A site does not trap the NTP, and that the enzymatic complex requires binding in the DS bubble (e.g.,
at i + 4 position) in order to bias the ratchet forward. However, NTP diffusion and hence binding in the
DS bubble is not rate limiting in the main channel theory if substrates are not provided at subsaturating
amounts, and therefore one can consider that immediately (to simplify) after a DS register becomes
available, it is paired. Consequently, the hypothesis according to which a binding event in the DS bubble
biases the ratchet forward is inconsistent with the hypothesis of several translocation oscillations. It
follows that the whole assumption of translocation oscillations can be eliminated if the main channel
theory is correct, because it is hard to imagine what would trigger the ratchet forward if it is not an NTP
binding event. As a conclusion, the solution of forward translocation locking seems much simpler and
therefore is probably the right solution. Furthermore, forward translocation locking fits extremely well
in a general and extended model of translocation (explored in chapter 5). Also, a subsequent conclusion
is that the observation of pre-translocated states in experiments is consistent with forward locking,
because translocation oscillations can occur in the absence of substrate and because pre-translocation
can occur in reaction to a misloading event, an incorporation error, or during pause/arrest (e.g., triggered
by specific DNA sequence). Furthermore, we have seen that there exists a transient pre-translocated
state that does not originate from any pre-translocation motion.
Burton and colleagues in [Zhang, et al., 2004] and [Burton, et al., 2005] claim that the allosteric effect
of NTP on transcription means that the downstream dNMP-NTP pair drives forward translocation. In
particular, Burton et al. claim in [Zhang, et al., 2004] that “the dNMP-NTP basepair is thought to drive
RNA DNA hybrid displacement”. The following objections could be raised. First, the allosteric effect
of downstream NTP on transcription, and hypothetically on translocation, is not equivalent to the axiom
that NTP drives translocation. In fact, NTPs could very well facilitate the decoupling of the Brownian
energy in order to accelerate forward translocation, without providing any additional energy. Next,
downstream NTPs could attenuate a rate-limiting factor (e.g., PPi release) distinct from the hypothetical
rate-limiting forward translocation process and therefore allow translocation indirectly (and hence
transcription) to accelerate, without directly driving the translocation.
As a summary of the ratchet issue discussed above, the NTP-translocation and allosteric models might
be right regarding the fact that binding of an NTP has an allosteric effect on translocation, but seem wrong
when they imply that translocation could be energetically NTP-driven. The Brownian ratchet model,
although it might assume wrong hypotheses such as substrate diffusion through the secondary channel,
and spontaneous translocation oscillations, might be correct concerning the source of energy and might
describe translocation occurring in a substrate free enzyme. The main channel theory seems to be
consistent with a Brownian ratchet mechanism. More importantly, because the forward locking
postulate, which is inconsistent with CH2 theory, seems to fit particularly well in an extended model of
translocation (described in chapter 5), the main channel theory gains very serious credibility.
After these general considerations, let us have a closer look at the mechanism. Although it could appear
that there is an argument about the fundamental details of translocation, this is not necessarily the case
when observing closely the conditions under which the process oscillates or not. In particular, if
biochemical and MD experiments are investigated more thoroughly and privileged over other
experimental methods such as X-ray data (which are by nature more static and hence less informative), the following
picture emerges. The literature is actually consistent with the EC oscillating, but in non-processive (i.e.
not fast) elongation and at subsaturating/null substrate concentration [Bar-Nahum, et al., 2005; Feig, et al.,
2010; Silva, et al., 2014; Dangkulwanich, et al., 2013]. Indeed, if for some reason i + 2 is not bound
(subsaturating substrate concentrations, substrate-free enzyme, absence of the i + 2 NTP, etc.), then there
is no obstacle in the CH1 model to translocation oscillating. Also consistent with the
idea of i + 2 not being bound at subsaturating concentrations is the fact that NTP binding is rate limiting
if NTPs are not supplied in sufficient amounts [Bai, et al., 2004; Tadigotla, et al., 2006; Bai, et al., 2007;
Dangkulwanich, et al., 2013]. A second point to investigate further is whether the literature data are actually
consistent with translocation being locked forward when NTP binding is not rate limiting. Of particular
interest is the study of [Dangkulwanich, et al., 2013] where the researchers were able to derive almost
all the kinetic parameters related to translocation in a very precise manner.
The authors obtain the forward kinetic parameter 𝑘𝑓 by solving the following equations.
$$k_f = k_0 \exp\left(\frac{F \Delta}{k_B T}\right) \qquad (1)$$
$$k_b = k_0 \exp\left(-\frac{F (1 - \Delta)}{k_B T}\right) \qquad (2)$$
$$\psi(t) = \left(\frac{k_f}{k_b}\right)^{-0.5} \cdot \frac{\exp(-(k_f + k_b) t)}{t} \cdot \left(2t \left(\frac{k_f}{k_b}\right)^{-0.5}\right)^{-1} \qquad (3)$$
where 𝑘𝑓, 𝑘𝑏, 𝑘0 are the forward, backward and intrinsic zero-force stepping rate constants
respectively, 𝐹 is the assisting or opposing force, ∆ is the transition state distance at each step, 𝑇 is the
temperature, 𝑘𝐵 is the Boltzmann constant, and 𝑡 is time.
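As a small numerical illustration of equations (1) and (2), the following sketch evaluates the two stepping rates. It is not the authors' code. The function name is hypothetical, and the inputs are assumed to be expressed in consistent units so that `force * delta` is an energy comparable to `kBT` (here `kBT` defaults to 1, i.e. energies are given in units of the thermal energy).

```python
import math


def stepping_rates(k0, force, delta, kBT=1.0):
    """Force-dependent stepping rates following equations (1) and (2).

    k0    : intrinsic zero-force stepping rate constant
    force : assisting (positive) or opposing (negative) force, in units
            such that force * delta is an energy (assumption)
    delta : transition state distance at each step
    kBT   : thermal energy in the same energy units (assumed 1.0 by default)
    """
    kf = k0 * math.exp(force * delta / kBT)
    kb = k0 * math.exp(-force * (1.0 - delta) / kBT)
    return kf, kb


if __name__ == "__main__":
    # At zero force both rates reduce to k0.
    print(stepping_rates(10.0, 0.0, 0.5))
    # An assisting force raises kf and lowers kb.
    print(stepping_rates(10.0, 2.0, 0.3))
```

Whatever the value of ∆, the ratio of the two rates in this form depends only on the force: `kf / kb = exp(force / kBT)`.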
In their model of the probability density of pause duration ψ(t), they assume that a pause
is equivalent to a diffusion in one direction followed by a return to the original position. They deduce
the distribution of pauses from this assumption. To simplify, let us follow this reasoning. The shorter the detected
pause, the smaller the probability that it has occurred (<0.2 for a pause <0.5 s); the longer the pause,
the higher the probability that it has occurred (if the pause is >4 s, the probability is >0.8). If the pause is
longer than 10 s, the probability converges to certainty.
They accumulate the distribution of the pause duration probabilities (converging to 1), which yields 𝑘0.
They then solve (1) and find 𝑘𝑓.
Inputting 𝑘𝑓 in the equation describing the forward kinetic parameter when a nucleosome barrier is
present gives the factor 𝛾𝑈 (fraction of the time the nucleosomal barrier is unwrapped):
$$k_f(\mathrm{nucl}) = \gamma_U \, k_f$$
The researchers then calculate the forward translocation rate 𝑘1 and the catalysis rate 𝑘3, by using the
following trick: the nucleosome roadblock induces an asymmetry in the kinetic equations below,
which makes it possible to separate 𝑘1 from 𝑘3:
$$V_{max}(\mathrm{nucl}) = \frac{\gamma_U \, k_1 \, k_3}{\gamma_U \, k_1 + k_3} \, d$$
$$V_{max} = \frac{k_1 \, k_3}{k_1 + k_3} \, d$$
Where 𝑉𝑚𝑎𝑥(𝑛𝑢𝑐𝑙) and 𝑉𝑚𝑎𝑥 are the maximal pause-free velocities in the presence and absence of
nucleosomal DNA, and where 𝑑 is the stepping distance.
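To make the separation of 𝑘1 and 𝑘3 explicit, the two velocity expressions can be inverted algebraically: subtracting the reciprocal velocities gives d·(1/𝑉𝑚𝑎𝑥(𝑛𝑢𝑐𝑙) − 1/𝑉𝑚𝑎𝑥) = (1/𝑘1)·(1/𝛾𝑈 − 1). The sketch below implements this inversion. It is an illustration of the algebra only, not the fitting procedure used by the authors; the function names and the value of 𝛾𝑈 used in the example are hypothetical.

```python
def v_max_model(k1, k3, gamma_u=1.0, d=1.0):
    """Pause-free velocity for translocation rate k1 and catalysis rate k3.

    gamma_u = 1.0 corresponds to bare DNA; gamma_u < 1.0 scales k1 for
    nucleosomal DNA, as in the two velocity expressions of the text.
    """
    return (gamma_u * k1 * k3) / (gamma_u * k1 + k3) * d


def separate_k1_k3(v_max, v_max_nucl, gamma_u, d=1.0):
    """Recover k1 and k3 from the two velocities and the unwrapped fraction."""
    if not 0.0 < gamma_u < 1.0:
        raise ValueError("gamma_u must lie strictly between 0 and 1")
    # d * (1/V_nucl - 1/V) = (1/k1) * (1/gamma_u - 1)
    inv_k1 = d * (1.0 / v_max_nucl - 1.0 / v_max) / (1.0 / gamma_u - 1.0)
    # d / V = 1/k1 + 1/k3
    inv_k3 = d / v_max - inv_k1
    return 1.0 / inv_k1, 1.0 / inv_k3


if __name__ == "__main__":
    gamma_u = 0.4  # hypothetical value, for illustration only
    v = v_max_model(112.0, 35.0)
    v_nucl = v_max_model(112.0, 35.0, gamma_u=gamma_u)
    print(separate_k1_k3(v, v_nucl, gamma_u))
```

Without the nucleosomal measurement there is a single equation for two unknowns, which is why the roadblock asymmetry is needed.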
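In the end, the following important rates, expressed here as the corresponding dwell times, are calculated.
$$1/k_1\ (\text{post-translocation}) = 1/112\ \mathrm{s} = 0.0089\ \mathrm{s}$$
$$1/k_3\ (\text{catalysis}) = 1/35\ \mathrm{s} = 0.02857\ \mathrm{s}$$
$$\text{Elongation (pause-free velocity, 2 mM NTPs)} = 26.7\ \mathrm{nt\,s^{-1}} = 0.037453\ \mathrm{s\,nt^{-1}}$$
It is worth noticing that $1/k_1 + 1/k_3 \approx 0.0375\ \mathrm{s\,nt^{-1}}$. Hence, standard error put aside, the elongation
time per nucleotide is virtually equal to the translocation time plus the catalysis time. Their study shows that elongation is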
indeed locked forward (𝑘1 is the forward translocation rate), and that NTP binding is not rate limiting
(given amounts not diverging far from physiology). Their results seem to fit virtually perfectly with a
locked post-translocation model of elongation (given that NTPs are supplemented in an amount
comparable to physiology), and consequently with the CH1 model. Another recent experimental work
is consistent with translocation being locked forward in the fast state [Nedialkov, et al., 2012]. Finally,
as mentioned in the main channel theory section, recent single-molecule studies also corroborate that
there exists a secondary binding site that is not related to the translocation state [Larson, et al., 2012;
Dangkulwanich, et al., 2013].
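The arithmetic behind the comparison of translocation time, catalysis time and elongation velocity can be checked in a few lines. The sketch assumes that the quoted figures 1/112 and 1/35 are dwell times in seconds, i.e. that the rates are 112 s⁻¹ and 35 s⁻¹; this reading is an interpretation of the values given above, and the function name is illustrative.

```python
def time_per_nucleotide(k1, k3):
    """Mean time per added nucleotide for two sequential steps (s/nt)."""
    return 1.0 / k1 + 1.0 / k3


if __name__ == "__main__":
    k1 = 112.0        # forward translocation rate (1/s), as read from the text
    k3 = 35.0         # catalysis rate (1/s), as read from the text
    velocity = 26.7   # pause-free velocity (nt/s), quoted in the text

    from_rates = time_per_nucleotide(k1, k3)
    from_velocity = 1.0 / velocity
    relative_gap = abs(from_rates - from_velocity) / from_velocity
    print(from_rates, from_velocity, relative_gap)
```

The two estimates of the time per nucleotide differ by well under one percent, which is the agreement the paragraph above relies on.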
Taking the elements altogether, we can hypothesize that it is indeed the trapping of i + 2 NTP at i + 1
position during post-translocation that constitutes the very first step that will lead to the ratchet being
incremented forward and that translocation is locked forward in normal processive elongation. This
resolves the problem of the NTP leaving the catalytic center and coming back to CH1, where the
mechanism would not be rectified and the upward pawl would not be unblocked. Hence, while not
invalidating CH2, it is at least consistent with CH1, where an already bound i + 2 register seems not
easy to reconcile with rapid oscillations. But if for some reason i + 2 is not bound (subsaturating
substrate concentrations, substrate-free enzyme, absence of the i + 2 NTP, etc.), then there is no obstacle
in the CH1 model to translocation oscillating: free translocation oscillations in the
absence of substrate (binding to i + 2) do appear to be correct. In chapter 5, a general
translocation/NTP loading mechanism will be proposed.
6. The melting issue and details on cTFs
It has been suggested from several studies that substrate pre-binding in the main channel was impossible
as DNA strands were evidenced to be fully associated (the opposite is referred to as melted) up to the i
+ 2 or i + 3 registers. For example, Vassylyev et al. and Kashkina et al., on the basis of structural and
biochemical data, have proposed that i + 2 was paired [Vassylyev, et al., 2007A; Vassylyev, et al.,
2007B; Kashkina, et al., 2007]. The evidence on melting seemed to confirm the secondary channel as the
only possible pathway. It is worth mentioning that a paired i + 2 register seems rather inconsistent with
free translocation oscillations during fast transcription, as the repeated breaking of the hydrogen bonds
between the ntDNA and tDNA strands at the i + 2 register would appear to be too costly in energy. It follows that
if the i + 2 register were really paired (a claim that will be proven wrong in this chapter), it would probably
only leave forward translocation locking as a plausible option anyway. In this section, we will
investigate the theory of strand separation, before analyzing structural and footprinting biochemical
experiments which could be informative about downstream DNA association and finally expanding on
the role of transcription factors, which appear to both play a key role in DNA melting and to impose
new conditions in order to decide between the modes of substrate channeling to the catalytic center.
Before proposing an extended theory of strand separation, let us demonstrate that the mechanism is
likely to be universal, at least for bacterial RNAP and eukaryotic RNAP II, and gain insight into
the relative positions of the strands. In 2009, Andreacka et al. using single-molecule Fluorescence
Resonance Energy Transfer (smFRET) resolved the trajectory of the ntDNA strand in yeast RNAP II
[Andreacka, et al., 2009]. They found that the ntDNA strand passed above lobe region (Rpb2 272-278),
close to rudder (Rpb1 305-324) residues 309-315, near FL1 (Rpb2 461-480) residue 471, that the nt and
tDNA strands separated near FL2 (Rpb2 501-511) residue 504, and that most residues were conserved
in human RNAP II. In 2012, Zhang et al. resolved the structure of Tt RNAP IC using a complete ntDNA
strand (PDB#4G7O, [Zhang, et al., 2012]). The ntDNA strand was resolved on its full length and its
trajectory is very consistent with the FRET results from Andreacka and colleagues, where the strand
shifts at 90° from the tDNA strand near register i + 2, pointing towards the inside of the enzyme, before
looping backwards and perpendicularly above the tDNA strand and running outside of the enzymatic
complex. Therefore, one can reasonably postulate that the mechanism of strand separation is universal.
In 2011, Kireeva et al., [Kireeva, et al., 2011], using a RNAP mutant lacking the FL2 loop interacting
with i + 2, showed that FL2 did not play a significant role in melting. It is commonly believed that
electrostatic “switches” adjacent to the DS bubble are responsible for DNA melting. For instance, in
2004, Kettenberger and colleagues ([Kettenberger, et al., 2004]) proposed that three Rpb1 positively
charged residues (R326, K330, and R337) belonging to switch region 2 and three Rpb1 negatively
charged residues (E1403, E1404, and E1407) belonging to switch region 1 could separate the strands
near i + 2 to i + 4 registers, with the negative amino-acids repelling the ntDNA strand, while the positive
amino acids pulled the tDNA strand away from the helix axis.
In this paragraph, let us propose a reviewed and extended theory of strand separation based on the
analysis of yeast RNAP II structure (PDB#2E2H, [Wang, et al., 2006]). Observation of the electrostatic
configuration of amino acids near DNA register i + 2 makes it possible to propose an “electrostatic fork” theory
of strand melting. The electrostatic fork comprises three zones of charged amino acids. Zone 1 consists
of Rpb1 residues K1102 (∈ TL), R840 (∈ BH), R1386 (∈ switch 1) and attracts the ntDNA strand downwards
and towards the left (towards the inside of the enzyme). Zone 3 comprises Rpb1 residues R839 (∈ BH), K330,
K332, R337, and attracts the tDNA strand on the right further upstream. Zone 2, consisting of Rpb1
negatively charged residues E1403, E1404, E1407, all belonging to the switch 1 region, creates a buffer
area preventing the tDNA strand from being attracted towards zone 1 and the ntDNA strand from being attracted to zone 3.
Rpb1 residue E884 (∈ BH) appears to play a subtle role, pushing tDNA towards zone 1 and pushing
the ntDNA strand away. The principle is summarized in Figure 4.
Figure 4: Electrostatic Fork melting mechanism. Key junction area is represented, where tDNA and ntDNA
strands melt. Left figure displays separation of tDNA (light blue vdw representation) and ntDNA (cyan vdw
representation) and is taken from the PDB#2E2H ([Wang, et al., 2006]) RNAP II structure. Electrostatic zone 1
pulling tDNA towards direction A, electrostatic zone 3 deviating ntDNA towards direction B (allowing
looping above tDNA), and electrostatic zone 2 creating a wedge region between zone 1 and 3, are shown in
magenta, light pink and dark pink surface representations respectively. tDNA i + 2 register is indicated in
yellow. Right figure displays the same information as the left figure, with the distinction that the specific
electrostatic residue indices are indicated and that the residues are represented as cubes, which
simplifies and characterizes the Electrostatic Fork region as consisting of two attractive layers and one repelling
layer. DNA strands are represented as ribbons.
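The composition of the three zones of the electrostatic fork can be summarized as a small data structure. The sketch below only tallies formal side-chain charges (+1 for K and R, −1 for E and D), which is a crude assumption that ignores pKa shifts and the local environment; the residue lists are those given in the text for Rpb1, and all names in the code are illustrative.

```python
# Formal side-chain charges at neutral pH (simplifying assumption).
FORMAL_CHARGE = {"K": +1, "R": +1, "E": -1, "D": -1}

# Residues of each zone as listed in the text (Rpb1 numbering).
FORK_ZONES = {
    "zone 1": ["K1102", "R840", "R1386"],
    "zone 2": ["E1403", "E1404", "E1407"],
    "zone 3": ["R839", "K330", "K332", "R337"],
}


def net_charge(residues):
    """Sum the formal charges of residues written as one-letter code + index."""
    return sum(FORMAL_CHARGE.get(res[0], 0) for res in residues)


ZONE_CHARGES = {zone: net_charge(res) for zone, res in FORK_ZONES.items()}

if __name__ == "__main__":
    for zone, charge in ZONE_CHARGES.items():
        role = "attractive" if charge > 0 else "repelling"
        print(f"{zone}: net formal charge {charge:+d} ({role} towards DNA)")
```

The tally reproduces the qualitative picture of two positively charged (attractive) layers separated by one negatively charged (repelling) layer.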
In addition to this key fork junction region, other residues appear to guide in a subtle manner the DNA
strands. Downstream deviation of the ntDNA strand is initiated around registers i + 9 to i + 11 thanks to
Rpb1 R175, K100 and Rpb2 R337 residues. Upstream guidance of the ntDNA strand, notably in order
to initiate perpendicular looping of the chain above the T strand, is performed by Rpb1 TL residues
R1100, K1109 and K1112. Otherwise, downstream deviation of the tDNA strand could be initiated
around i + 6 to i + 11 positions by Rpb2 residues K228, K257, R261, K277 (∈ lobe region) and K471 (∈
FL1). Furthermore, amino acids K228, K257, K277 and K471 could have the double purpose of guiding
the upstream section of the ntDNA strand (positions i to i – 6) above the tDNA strand.
From the above electrostatic model of strand separation, it follows that the downstream bubble needs to
close sufficiently for optimal DNA melting, in order to bring the deoxyribonucleic helix and the
electrostatic protein residues close together. Another fact to be considered is that temperature might play
a direct role in promoting DNA melting. In 1983, Kirkegaard et al., [Kirkegaard, et al., 1983], using
cytosine methylation DNA footprinting found that melting of an Ec RNAP IC was strongly dependent
on temperature.
The crystallographic experiments performed on single subunit, bacterial RNAP and eukaryotic RNAP
II, which could be informative about downstream DNA association, will be reviewed. Several remarks
are to be stated before investigating the structural data. Although the i + 2 base in pre-translocation is
equivalent to the i + 1 base in post-translocation, translocation conserves the relative positions of the
bases. For example, if i + 2 is melted in pre-translocation, then the position will also be melted in
post-translocation, as the relative position of the t and ntDNA strands will not change. Only the position of the i
+ 1 register relative to the RNA 3’ end will change, for RNAP walks away from RNA in post-translocation
and walks towards RNA in standard pre-translocation, or the catalytic site is occupied by the newly
added NTP in transient pre-translocation (see ratchet issue above). Therefore, in this sub-section,
numbering of the nucleic registers will ignore the translocation state in order to focus on the melting
properties. In addition, unresolved RNA and DNA registers that are positioned outwards, near the
external surface of the enzymatic complex, will be ignored when resolution of the nucleic bases is
discussed, as they provide no informative detail about DNA strand separation in the DS bubble. The
resolution of tDNA registers will be ignored because in almost all the structures they are resolved owing to their
stabilization by the wall of the downstream bubble. Particular focus will be given to the molecular
resolution of ntDNA registers close to the active site, because when a base is resolved or unresolved, it
corresponds to a well-ordered or disordered base respectively, which can give insight by extension to
strand melting. In other words, if a base is not resolved by electron density refinement although the
non-template strand used in the crystallographic experiment included that
base, the strands might be unpaired at this position, because one would expect strand
association to stabilize the ntDNA strand. However, this is not definite evidence, as an ntDNA base could be
disordered (i.e. mobile) while being associated. Stronger evidence of melting is obtained when an ntDNA base is
resolved and observed melted.
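The interpretation rule stated above can be summarised as a small decision procedure. The sketch below is only an illustration of that rule: the function name, its arguments and the returned labels are hypothetical and are not taken from any of the cited studies.

```python
from typing import Optional


def melting_evidence(in_construct: bool, resolved: bool,
                     observed_paired: Optional[bool] = None) -> str:
    """Classify what one ntDNA register suggests about strand melting.

    Hypothetical helper; the labels paraphrase the rule given in the text.
    in_construct    -- the ntDNA strand used in the experiment included this base
    resolved        -- the base was resolved by electron density refinement
    observed_paired -- for a resolved base, whether it is seen paired with the tDNA
    """
    if not in_construct:
        # The base was never part of the scaffold: nothing can be concluded.
        return "no information"
    if not resolved:
        # A disordered base might be unpaired, but it could also be mobile while paired.
        return "possibly melted (weak evidence)"
    if observed_paired is None:
        raise ValueError("a resolved base needs an observed pairing state")
    if observed_paired:
        return "paired"
    # Resolved and seen unpaired: the stronger kind of evidence.
    return "melted (stronger evidence)"


print(melting_evidence(in_construct=True, resolved=False))
print(melting_evidence(in_construct=True, resolved=True, observed_paired=False))
```

Only the last category is treated in the text as strong evidence. The unresolved case remains ambiguous by construction, and this ambiguity is what the structural review below repeatedly runs into.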
First, let us analyze crystallographic/electron density refinement data in favor of i + 2 pairing. In 2002,
Tahirov et al. ([Tahirov, et al., 2002], PDB#1H38), as well as Temiakov et al. in 2004 ([Temiakov, et
al., 2004], PDB#1S0V), resolved the atomic coordinates of viral T7 RNAP EC using a tDNA, ntDNA
and RNA strand template of 18, 10 and 8 bases in length respectively. The tDNA and RNA strands can be
considered as complete, whereas the ntDNA strand stops at i + 1. Inspection of PDB#1H38 and PDB#1S0V shows
that the ntDNA was resolved over its full length, up to i + 1, and that the DNA strands are associated up to i + 2. In the
former PDB structure, ntDNA i + 2 base is slightly shifted relative to the opposite tDNA register, as the
base competes with protein residue F644, and could be considered as partially melted. In the latter
structure, the downstream DNA bases are ill-aligned (helix keeps its canonical form but base moiety-
hydrogen bonds are out of plane), which indicates that downstream DNA is partially disordered. From
2007 to 2012, several crystallographic experiments were performed on Tt RNAP and are the following.
In 2007, Vassylyev et al. generated PDB#2O5I ([Vassylyev, et al., 2007A]) and PDB#2O5J
([Vassylyev, et al., 2007B]) using tDNA and RNA templates which can be considered as complete and
a ntDNA template stopping at i + 1 register. In both structures, downstream DNA duplex is observed
well-ordered and paired up to i + 2 register. In the Tt RNAP IC from Zhang et al. with ntDNA resolved
virtually on its full length [Zhang, et al., 2012], DS registers were observed paired up to i + 2 and well-
ordered. DNA strands were mismatched between i + 1 and i – 6 positions, which did not appear to affect
the downstream DNA stability. Finally, a structural study of yeast RNAP II from Cheung and Cramer
in 2011 [Cheung, et al., 2011], showed paired i + 2 register in arrested RNAP II EC (PDB# 3PO2), with
a ntDNA resolved up to i + 1 register, using a nucleic template stopping and containing a mismatch at i
+ 1 position.
Next, let us review structural data which does not display downstream nucleic association and therefore
could support i + 2 melting. In 2013, Weixlbaumer et al. generated two sets of atomic coordinates for a
Tt RNAP paused EC (PDB#4GZY, 4GZZ, [Weixlbaumer, et al., 2013]), using a ntDNA strand stopping
at i + 2 register. Both structures are virtually identical, display ntDNA bases paired and resolved up to i
+ 4, and a downstream bubble largely open. i + 2/i + 3 bases are not resolved, which could be consistent
with the DNA pair being melted at these positions. Otherwise, from 2001 to 2011, several generated
yeast RNAP II structures could support downstream unwinding. In 2001, Gnatt and co-researchers
conducted a crystallographic study of RNAP II (PDB#1I6H, [Gnatt, et al., 2001]) using ntDNA that can
be considered as complete (stops at i – 10). The authors proposed that the strands were melted from the i +
4 register and upstream because their electron density data only exhibited double-helix DNA up to i +
5. However, the evidence for this is not strong, as the electron density was weak and discontinuous,
allowing only an approximate localization of double-stranded downstream DNA. In 2004, Kettenberger
et al., using a full ntDNA strand, resolved the bases of the latter chain up to i + 3 (PDB#1Y77, 1Y1W,
[Kettenberger, et al., 2004]). The structure consisted of a TFIIS-bound RNAP II. Although the presence of
a mismatch at the i + 2 position in the nucleic template does not allow a conclusion to be drawn
concerning this register, the fact that H-bond alignment deviation occurs from register i + 4 could indicate
partial melting from i + 4 and upstream. Westover et al. in 2004, and Wang et al. in 2006, using the
same nucleic template consisting of ntDNA running up to i + 5 position, generated PDB#1R9T
([Westover, et al., 2004A]) and PDB#2E2H ([Wang, et al., 2006]) respectively, which both displayed the
following features: the i + 6 DNA bases were misaligned, indicating a possible initiation of strand deviation, and the i + 5 base of
the ntDNA chain was resolved and observed melted. Brueckner and colleagues solved the structure of an
RNAP EC in 2008 (PDB#2VUM, [Brueckner, et al., 2008]) with an ntDNA strand stopping at i + 3. The i +
4 position was observed paired, yet it is to be noted that RNAP II was bound to α-amanitin. Because the i +
3 position was not detected, one can postulate its melting. In 2011, Cheung and Cramer generated a
second set of atomic coordinates using the same nucleic template and experimental setup as described in
the previous paragraph, which corresponded to an RNAP II reactivation intermediate (PDB#3PO3, [Cheung,
et al., 2011]). This time, the i + 2 ntDNA position was not resolved, indicating its possible melting.
The structural data presented above seems puzzling. In viral RNAP structure from Tahirov et al.
([Tahirov, et al., 2002]), ntDNA base i + 2 is slightly shifted relative to the opposite tDNA register, as
the base competes with protein residue Phe:644, and could be considered as partially melted. On the
other hand, Temiakov et al.’s structure ([Temiakov, et al., 2004]) displays i + 2 association. For T.
thermophilus RNAP, some experiments seem to support associated i + 2 register ([Vassylyev, et al.,
2007A; Vassylyev, et al., 2007B; Zhang, et al., 2012]), while others support the possibility of its melting
([Weixlbaumer, et al., 2013]). The same holds for yeast RNAP II, where the PDB#3PO2 structure ([Cheung,
et al., 2011]) supports i + 2 association and where the structures listed in the previous paragraph are
consistent with i + 2 melting. Recent developments even display up to i + 6 melting in a complete
transcription bubble [Barnes, et al., 2015]. In this paragraph, we will resolve this apparent dilemma and
demonstrate that the structural data is particularly inconclusive concerning DNA melting. First, all the
structures display a downstream bubble that is reasonably or largely open. However, as mentioned in
the theory of strand separation, it is possible that the downstream part of the main channel needs to close
sufficiently in order to trigger the electrostatic separation mechanism (as DNA is to be brought close
enough to the key electrostatic protein residues). More importantly, the three-dimensional configurations
resolved by the X-ray and electron density studies are partly unnatural due to crystal packing
(mechanical constraints applied to certain domains between adjacent RNAPs in crystals) and/or
temperature (low, unphysiological temperatures are used in order to prepare the crystals). As discussed
in the theory of strand separation, melting could be dependent on temperature. It is worth mentioning
that the physiological temperature at which T. thermophilus grows is 65 °C, which is very far from the
experimental conditions. It cannot be excluded that the RNAP of this particular organism requires a
higher temperature to initiate DS bubble closing. Otherwise, let us propose a hypothesis concerning the
inconsistency of base resolution in experiments. Close inspection of the T. thermophilus RNAP and
yeast RNAP II structures shows that, where the i + 2 base is resolved, the FL2 domain is in close
proximity (see Figure 5 for bacterial RNAP and Figure 6 for yeast RNAP II). This suggests that FL2
stochastically promotes stabilization of DS DNA and hence its resolution. Another possibility, although
unlikely, for association being observed when FL2 closes on the i + 2 ntDNA base is that the strands are
melted when FL2 does not close and that interaction with FL2 brings them together. In any case, the FL2
stochastic interaction explanation does not contradict the fact mentioned above that the domain is not
involved in strand separation. The domain only seems to allow stabilization of bases in crystallographic
experiments allowing their resolution. In other words, the discrepancy between the studies might be due
to the stochastic stabilization of the bases by protein domains. For the structures from Westover et al. and
Wang et al. ([Westover, et al., 2004A; Wang, et al., 2006]), it is to be noted that i + 5 is probably
resolved (although the DNA is disordered) because it forms an electrostatic interaction with one of the residues
of the trigger loop, which reinforces the idea that ntDNA base resolution requires interaction with the
protein structure. Non-resolution of the ntDNA strand or deviation of bases is inconclusive, as base
resolution only requires stochastic stabilization by protein domains, and observation of i + 2
association is also inconclusive, as the topology and experimental temperatures strongly distort the
structure and do not allow normal melting to occur.
Figure 5: Comparison of FL2 interaction with downstream DNA in Tt RNAP. FL2 domain and protein
walls are represented as lime and grey surfaces respectively. tDNA and ntDNA, are represented as red and
green ribbons. i + 2 tDNA register is indicated in blue. A) RNAP EC structure from [Vassylyev, et al.,
2007B] (PDB#2O5I) displays strong stabilization of i + 2 positions with FL2. B) RNAP IC structure from
[Zhang, et al., 2012] (PDB#4G7O) displays strong interaction between FL2 and i + 2 register. C) RNAP
paused EC structure from [Weixlbaumer, et al., 2013] (PDB#4GZZ), displays deviation of NT-strand near
i + 2 register and probably corresponds to a weak interaction of t and ntDNA i + 2 register with FL2 domain.
Figure 6: Comparison of FL2 interaction with downstream DNA in Sc RNAP II. FL2 domain and protein
walls are represented as lime and grey surfaces respectively. tDNA and ntDNA, are represented as red and
green ribbons. i + 2 tDNA register is indicated in blue. A) RNAP EC structure from [Cheung, et al., 2011]
(PDB#3PO2) displays strong stabilization of i + 2 positions with FL2. B) RNAP EC structure from [Cheung,
et al., 2011] (PDB#3PO3). This structure and C, E and F display weak interaction between FL2 and i + 2
registers. C) RNAP paused EC structure from [Westover, et al., 2004B] (PDB#1R9T). Nucleic acids are
indicated in CPK representation instead of ribbons because the strands are too distorted in the initial
structure. D) RNAP EC from [Kettenberger, et al., 2004] (PDB#1Y77). It is to be noted that the last ntDNA
strand base is i + 3 position, and that i + 2 ntDNA would probably position in front of tDNA register i + 2
(represented in blue ribbon). FL2 interaction is close to A) but i + 2 ntDNA base was not resolved. A possible
explanation could be that FL2 shape near the extremity of ntDNA is concave, while for the tDNA strand it
is convex, inducing an unstable interaction. Otherwise, distribution of electrostatic charges might disfavor
ntDNA strand interaction. E) RNAP EC from [Wang, et al., 2006] (PDB#2E2H). F) RNAP EC from
[Brueckner, et al., 2008] (PDB#2VUM).
Now, let us investigate biochemical experiments tackling the DNA melting issue. In 2007, Kashkina et
al. [Kashkina, et al., 2007], proposed that multi-subunit RNAPs did not melt any downstream base-pairs
and therefore that the main channel theory could not be right. The downstream melting was detected
using the following biochemical approach. A template strand scaffold was modified with a pyrrolo-
cytosine (pC) or 2-aminopurine fluorescent base analogue at i + 1, i + 2, i + 3 or i + 4 position. In case
of stacking with adjacent bases, which is thought to be strengthened when the DNA is double-stranded,
the fluorescence of the base analogue is quenched. Therefore, strand separation is detected by the appearance of high fluorescence.
The researchers proposed that only the i + 1 register was melted and that the main channel theory should be
discarded because, for T7 and Ec RNAP as well as for Sc RNAP II, the fluorescence data did not show
strong fluorescence either at i + 2 (the minimum requirement for the main channel theory) or up to i
+ 4 (required for multiple-substrate pre-loading). Consistent with the latter claim, for yeast RNAP II, strong
fluorescence at i + 2 tDNA probe only appeared after addition of i + 1 NTP, leading to the shift of i + 2
NTP in the active site and strand separation. However, let us have a close look at their scientific
correlations. First the study from Kashkina et al. seems completely inconclusive as the levels of
fluorescence detected do not accurately match a simple event of strand separation. In other words,
correlating the fluorescent values, which do not converge in clear distinct sub-groups, to a single event
of strand separation, does not make any physical sense. For Ec RNAP, Figure 2B (therein) seems to
indicate that i + 2 could be partially melted as the relative fluorescence is smaller (about 40%) than that
of the melted i + 1 register but higher (about 33%) than that of the i + 3/i + 4 registers further
downstream. However, the possibility that the i + 2 register experiences reduced quenching due to
decreased confinement in the main channel (e.g., i + 2 is kept strongly separated from upstream register
i + 1 by the bridge helix) cannot be excluded. Hence, although the data could indicate i + 2 partial
melting, it could also indicate an unrelated phenomenon. In any case, the authors’ claim stating that i +
2 register is associated seems to be very questionable. For yeast RNAP II, Figure 2C (therein) shows
different values of fluorescence for a given register between two experiments: difference of about 20%,
25% and 25% for i + 1, i + 2 and i + 3 registers respectively. This is an indication that their method
of detecting strand separation is inaccurate. Figure S2 (therein) shows a level of fluorescence about two-fold
higher for the i + 3 register as compared to i + 2 for bacterial EC. Also, the fluorescence quenching is
much higher (60-80%) than that of eukaryotic and viral EC (40-45%). Therefore, not only do the detected
fluorescence emissions not appear to indicate strand separation precisely, but they
could also depend on other factors such as the type of the surrounding nucleic acids and the type of RNAP.
The results of Kashkina et al.’s experiments therefore appear inconclusive. Finally, the authors’ claim
that i + 2 is associated is directly contradicted by several biochemical studies supporting the opposite
phenomenon. In 1995, Zaychikov et al. conducted chemical footprinting on Ec RNAP [Zaychikov, et
al., 1995]. Melting up to i + 3 register was detected in some of the ECs. In 2004, Santangelo and Roberts
([Santangelo, et al., 2008]), using notably covalent DNA interstrand crosslinks, showed that inhibiting
downstream strand separation impairs transcript release during elongation termination. Their data also
suggested that elongation termination normally consists of forward translocation over an interval of 4 base
pairs. These hypotheses, taken together with the evidence that termination generally involves a dA/dT-rich
downstream sequence (which promotes strand unwinding), seem to suggest that the transcription
bubble involves the melting of the i + 2 to i + 4 registers preceding and/or during elongation termination.
Although the above postulate concerns intrinsic transcription termination, it seems consistent with a
normal dissociation of a few base pairs downstream from the catalytic center during transcription
elongation. In 2009, Saeki and Svejstrup detected up to i + 3 register melting in yeast RNAP II with
potassium permanganate footprinting [Saeki, et al., 2009]. Consistent with the latter result, in 2011,
Kireeva et al. defended the partial melting of i + 2 register in their ECs and detected i + 3 register in
hybridization equilibrium in one EC, on the basis of potassium permanganate footprinting of yeast
RNAP II EC [Kireeva, et al., 2011]. Finally, in 2009, Andreacka et al. ([Andreacka, et al., 2009]) on the
basis of smFRET experiment on yeast RNAP II, suggested that DNA strands separated at i + 2 register,
which indicates its melting.
Now let us discuss the melting results presented above. All the kinetic studies presented in the main
channel section ([Foster, et al., 2011; Holmes, et al., 2003; Nedialkov, et al., 2003; Zhang, et al., 2003;
Zhang, et al., 2004; Gong, et al., 2005; Holmes, et al., 2006; Xiong, et al., 2007; Kireeva, et al., 2008;
Kennedy, et al., 2011]) are indirect evidence of i + 2 melting, since pre-loading in the main channel
requires DNA to be in at least a partially melted state. Furthermore, substrate pre-binding in the
downstream bubble might not require significant strand separation. For example, a slight longitudinal
shift of the tDNA dNMP nucleotide could allow hybridization with an incoming NTP to occur. If
considering the tertiary channel as the substrate entrance in the downstream channel, only the tailing
part of the tDNA base would need to be oriented towards the pathway to allow for pre-binding.
Otherwise, the studies in [Gong, et al., 2005] and in [Xiong, et al., 2007] from Burton et al. defend the
melting of up to i + 3 and i + 4 positions respectively. Let us correlate the latter results with the melting
information presented above. How is it possible that i + 3 and i + 4 melting are not always detected?
First, in the [Gong, et al., 2005] study, i + 4 melting was not tested for; one can nevertheless postulate its
melting, as the experimental conditions resemble those of the second study. A reasonable hypothesis
is that i + 4 melting occurs because TFIIS in conjunction with TFIIF is present in these
experiments, whereas in the other melting studies presented above, TFIIF is never present and TFIIS
is only sometimes present. It follows that TFIIF (possibly only in the presence of TFIIS) appears to promote
downstream melting. Another apparent inconsistency is the irregular detection of i + 3. It is possible that
the latter register exists in a hybridization equilibrium (as termed by Kireeva et al.) and stochastically
melts. One can also hypothesize that under real transcription conditions, the i + 3 register could remain
melted. The mechanism for such persistent melting could be rapid translocation hindering the
stochastic reformation of hydrogen bonds, or the presence of transcription factors (naturally present in the cell)
such as TFIIF promoting downstream bubble re-adjustments and, by extension, DNA melting.
Alternatively, periodic or incoming NTP-triggered availability of the i + 3 position could allow an NTP to
pre-load at the base. The extent to which TFIIS alone promotes DNA melting is unclear at this stage and
requires further investigation. Altogether, it is hypothesized that in physiological conditions (hence in
the presence of TFIIS and TFIIF) downstream melting up to the i + 4 register (and perhaps further downstream)
is achieved. A subsidiary conclusion is that experiments lacking
TFIIF might not accurately depict DNA melting. Finally, one can consider that the minimum melting
requirement for the main channel theory to hold, namely a melted i + 2 register, is met.
In this sub-section, we will investigate details of the mechanism of the cleaving transcription factors and
its consequences for our discussion about DNA melting and substrate loading. For brevity,
the cleaving TF will be referred to as cTF. As mentioned in previous sections, cTFs exist in two forms:
TFIIS/SII for eukaryotic RNAP II and GreA/B for bacterial RNAP. Although the molecules are
sequence unrelated, they are considered to behave in the same way (e.g., both share the same basic
structural geometry and principle of action). Therefore, information about one type of cTF can be
approximately considered to apply for the other molecule. In this sub-section, only eukaryotic TFIIS
will be investigated and one will assume that the findings apply to GreA/B TFs. In addition, TFIIS
domain I will be ignored as it is not required for activity and only plays a minor role. The recent
molecular dynamics results from Eun et al. ([Eun, et al., 2014]) make it possible to address the cTF mechanism in
a new way. The researchers found that TFIIS was in the folded form (also referred to as the closed form) in
solution, where the contracted linker region brings together domain III and domain II and the molecule
forms a compact mass, reducing hydrophobic contacts with the surrounding solvent. This finding has a
very important implication: TFIIS always binds to RNAP II in the folded form,
where domain II (and possibly a fraction of the linker region at a smaller extent) binds to the external
surface of the enzyme near the funnel entrance. Also, it follows that the insertion of the transcription
factor in the enzyme requires the molecule to switch from the folded to the unfolded form (where the
linker region extends outwards and longitudinally) after a binding event has occurred, allowing the
linker to insert inside the secondary channel, bringing domain III near the active site, while domain II
stays bound at the surface of the enzymatic complex. In short, the elementary behavior of the cTF can be seen
as a two-step process: binding to the surface of the enzyme, then unfolding, which allows insertion.
This process can also be viewed as a harpoon mechanism, where domain II is the fixed element launching
domain III via the linker acting as the rope, and where the domain III head holds the sharp arrow
(the acidic hairpin region at the extremity of domain III containing the key second metal ion allowing
the two-metal ion pyrophosphorolysis cleaving reaction to occur) that triggers the cleavage reaction.
Domain III can also affect the active site geometry such as the realignment of a distorted RNA chain.
Other key information concerning cTF arose from the 2003 and 2004 crystallographic experiments from
Kettenberger and colleagues. In the 2003 experiment [Kettenberger, et al., 2003], RNAP II complexes
lacking nucleic acids were soaked with TFIIS, and the resolution of the Cα atoms (PDB#1PQV)
evidenced that the transcription factor was inserted inside the protein, i.e. that the linker region was
positioned inside the secondary channel and that domain III was located at the extremity of the channel
near the active site. In their 2004 study [Kettenberger, et al., 2004], the researchers soaked RNAP II
complex with a tDNA template consisting of 3’-AGTACTTACGCCTGGTCAT-5’ (C denotes i + 1
position), a 5’-TCATGAA-3’ ntDNA strand running from i + 3 to i + 9 registers, and a 5’-
CGGACCAGAA-3’ RNA molecule running from i to i – 9 registers. The DNA duplex did not contain
mismatches, neither did the RNA-DNA hybrid. The TFIIS molecule was resolved and observed inserted
inside the enzyme (PDB#1Y1V, nucleic acids are not present in the PDB structure but present in the
crystallographic process). In both experiments, soaking RNAP II crystals with TFIIS yielded
TFIIS in the inserted form, although the complex did not need to be rescued by the latter molecule. This
information seems to support several important conclusions. First, TFIIS can bind to any complex, even
when not needed. Second, because the molecule was resolved in the inserted form, meaning
that this conformation persisted, it appears that an inserted TFIIS could retract or unbind only after
a cleavage reaction has occurred. This can be inferred because in the 2003 study, a fully active TFIIS
was used, but no nucleic acids were present, which prevented any cleavage event from occurring. In the 2004
experiment, the TFIIS used was mutated to neutralize its cleaving capability
(negatively charged hairpin residues D290 and E291 replaced by neutral alanine). Alternatively, the
possibility that the unphysiological crystallographic conditions altered a TF retraction process cannot be
excluded.
Another puzzling question raised by TFIIS being resolved inserted inside the enzyme is how the factor
switched from the closed to the open conformation if the insertion of domain III was not needed. Two possibilities
arise: either the cTF automatically unfolds upon initial binding of domain II to RNAP, inducing its inserted
conformation, or the crystallization process triggered an unnatural unfolding. Let us examine the
unfolding in more detail. The question to be raised is: what are the source of energy and the mechanism driving
cTF unfolding? Eun et al. [Eun, et al., 2014] suggest that hydrophobic forces can be excluded (based
on potential of mean force umbrella sampling calculations) and that the only remaining candidate is
protein-protein interactions. This indeed seems to be a credible explanation. However, one can then
wonder what molecular mechanism underlies such protein-protein interactions. A possible
explanation could be the following. During normal transcription, a fraction of the energy generated by
thermal fluctuations is liberated in the form of translocation oscillations. In the event of misincorporation
and entry into an off-pathway state, the RNA chain backtracks, RNAP binds to the backtracked transcript
via the secondary channel and the complex is immobilized. One could therefore imagine that the thermal
fluctuations acting on the structure of RNAP would then increase, as their energy can no longer be released in the form of
translocation. This additional vibrational constraint could propagate to the bound cTF and
facilitate its unfolding. A second possibility could be that fine conformational changes occur within
RNAP upon misincorporation and RNA backtracking and that these conformational changes somehow
propagate to the external surface where the cTF is located, and trigger an equilibrium change in the TF
structure allowing its unfolding. Finally, a third possibility could be that upon initial binding to RNAP,
the equilibrium conformation of the cTF immediately changes and allows it to switch automatically into
open (unfolded) conformation. Possibilities 1 and 3 imply that TFIIS automatically unfolds upon
binding and therefore would imply that TFIIS would necessarily interfere with hypothetical NTP
diffusion via the secondary channel. These mechanisms seem also more plausible than possibility 2, as
the latter seems to require complex long range conformational propagation along the molecular structure
of RNAP. The last pieces of information about TFIIS to be presented before being applied in the
discussion below come from the 2003 and 2004 kinetic studies by Zhang and
co-researchers [Zhang, et al., 2003; Zhang, et al., 2004]. They found that in the presence of TFIIF, TFIIS
did not hinder synthesis rates. Zhang and Burton also found (confirming earlier studies) that,
combined with TFIIF, TFIIS suppressed elemental pause (where no backtracking occurs) by promoting
quick backtracking and/or re-entry in the active synthesis pathway. The latter observation seems
consistent with the hypothesis that TFIIS has a prolonged function during active synthesis, and
consequently seems consistent with the transcription factor staying bound permanently to the enzymatic
complex. However, at this stage there is no definite evidence supporting this hypothesis, as the TF could
stochastically bind and interfere with the structure without necessarily staying bound to or inserted in it.
In this paragraph, the above hypotheses will be applied in order to investigate the possible
scenarios underlying TF function and to draw the implications for substrate loading. Three possible
outcomes can follow initial cTF binding to RNAP. First, the molecule stochastically binds to RNAP and
then unbinds if not needed, i.e. if no cleavage reaction is required. This possibility is inconsistent with
the structural data from Kettenberger et al. ([Kettenberger, et al., 2003; Kettenberger, et al., 2004])
presented above and hence can be eliminated. Second, cTF binds to RNAP and then unfolds
automatically. It seems impossible to explain the maintenance of a high synthesis rate in the presence
of TFIIS (kinetic results from Zhang et al., [Zhang, et al., 2003; Zhang, et al., 2004]) in the secondary
channel paradigm because the insertion of the molecule appears to strongly hinder substrate loading via
the secondary channel (see Figure 7) and appears to only be able to aggravate the rate limiting factor of
substrate diffusion to the active site. In other words, the postulate of cTF automatic unfolding eliminates
the plausibility of the secondary channel theory. The third scenario is that TFIIS binds to RNAP, but
only unfolds if required (i.e. unfolds only if the complex is arrested and backtracked). The latter scenario can be
subdivided into three potential outcomes. First, TFIIS stays permanently inserted inside CH2. It follows
that NTP diffusion via the secondary channel would be greatly reduced, which is inconsistent with
substrate loading being rate limiting in the CH2 theory paradigm. On the other hand, it could be
consistent with loading via the main channel, if CH2 can accommodate both PPi expulsion and a bound
TF (see Figure 7), if PPi exit is not rate limiting, and if more generally active site chemistry and
enzymatic function can be maintained. The only two remaining possibilities which could be consistent
with cTF not impeding hypothetical NTP diffusion via the secondary channel would be that TFIIS unbinds
and diffuses out of the complex after cleavage, or that TFIIS refolds, stays bound to the exterior of the
enzyme and clears the path for NTP diffusion via the secondary channel after cleavage. However, the
probabilistic diffusion estimate from Batada et al., which can be seen as the very upper limit (discussed in
more detail in the discussion section), would have to be revised downwards because, during hypothetical cTF
unbinding or refolding, NTP diffusion via CH2 can only be temporarily hindered; consequently, this
imposes an even tighter constraint on the rate-limiting aspect of NTP loading in the secondary channel
paradigm. Interestingly, bound TF leaves intact opening B leading to the tertiary channel (see Figure 7;
for details about opening B, see chapter 5), which seems to indicate that RNAP maintains its substrate
loading/expulsion capacity during TF insertion in the main channel theory paradigm. It does not mean
though that TF stays inserted permanently. Nevertheless, because scenario 1 (cTF unbinds if not needed)
is discarded, as mentioned above, cTF can be considered as permanently bound. This hypothesis
can be raised for the following reason. Only scenarios 2 and 3 appear to hold, where either TFIIS binds
to RNAP and only unbinds upon transcript cleavage, or where TFIIS permanently binds to RNAP and
only retracts upon cleavage. The former possibility is almost equivalent to TFIIS staying
permanently bound to the enzyme, because after a hypothetical unbinding event (after cleavage), another
TFIIS present in the surrounding solvent would quickly and stochastically bind to the enzyme. The
duration of the cleavage process (~10 s) is so much longer than the time needed for stochastic diffusion of TFIIS in solution
and subsequent binding that most of the time RNAP can be considered as bound. It follows that TFIIS
seems to stay attached to RNAP in a prolonged manner during transcription and hence could interfere
in a prolonged manner with substrate diffusion via the secondary channel. Finally, as discussed in this
paragraph, insertion of TFIIS inside the CH2 seems to enhance the complexity and requirements of the
secondary channel model and consequently renders the theory less plausible.
Figure 7: TFIIS shielding of RNAP II secondary channel. Sc RNAP-TFIIS complex is from [Kettenberger,
et al., 2004] (PDB#1Y1V). TFIIS is shown in CPK representation, protein surface is indicated in grey. A:
TFIIS shields a large section of the funnel entrance to the secondary channel. B: TFIIS does not seem to
reduce entrance through the tertiary channel (opening CH3B).
7. Considerations on nucleotide selection
We will investigate in this section the current information about nucleotide discrimination and show
how it fits in the main channel theory paradigm. The goal of this section is to answer the following
questions. Is NTP pre-binding in the main channel consistent with discrimination mechanisms occurring
in the catalytic center? How is misloading recovery achieved in the main channel theory paradigm?
One could postulate that if NTPs are pre-selected in the downstream bubble, active center discrimination
mechanisms should not significantly affect the transcription fidelity. A simple explanation is that pre-
selection in the main channel constitutes only the first layer of discrimination and that selection is further
improved in the catalytic center. Consistent with kinetic, genetic and biochemical studies ([Svetlov, et
al., 2004; Wang, et al., 2006; Malagon, et al., 2006; Kaplan, et al., 2008; Kireeva, et al., 2008; Tan, et
al., 2008; Zhang, et al., 2010; Yuzenkova, et al., 2010; Kaplan, et al., 2012; Fouqueau, et al., 2013]), the
TL interaction network (yeast Rpb1 residues Q1078, L1081, N1082, H1085, R446, N479) constitutes a
significant proofreading checkpoint for base and ribose discrimination. However, kinetic experiments
performed on mutant enzymes with a deleted TL or an inhibited TL (with α-amanitin or streptolydigin)
[Kaplan, et al., 2008; Zhang, et al., 2010; Yuzenkova, et al., 2010; Fouqueau, et al., 2013] and, to a lesser
extent, other studies (e.g., [Svetlov, et al., 2004; Wang, et al., 2006]) show, by comparing the total wild-type
discrimination with the TL interaction network discrimination, that the first layer of
nucleotide selection is achieved without the TL. The authors term the latter state open active center
discrimination. Not only consistent with discrimination occurring without the active site TL, but also
consistent with the first step of selection being achieved in the main channel (while considering
hypothetical substrate pre-binding at that location) are the kinetic experiments presented in the main
channel theory section ([Foster, et al., 2001; Palangat, et al., 2001; Holmes, et al., 2003; Nedialkov, et
al., 2003; Zhang, et al., 2004; Gong et al., 2005; Xiong, et al., 2007; Kennedy, et al., 2011]). The latter
studies are all consistent with base selection being achieved in the process of substrate pre-binding to
downstream DNA registers. It is easy to rationalize such discrimination with H-bonding energies
between complementary bases. Table 1 below summarizes base identity verification results achieved by
mutant enzymes with a deleted TL. Of course, as mentioned above, for base moiety selection, pre-binding
in the main channel represents an obvious filtering mechanism (even though, as shown in Table
1, kinetic discrimination between cATP and ncGTP is only 4-fold for T. aquaticus according to
Yuzenkova and colleagues).
Table 1: Comparison of nucleotide base discrimination between several studies for enzyme with deleted TL
domain. The results colored in green and red are from [Yuzenkova, et al., 2010] and [Fouqueau, et al., 2013]
respectively. Ta, Ec and Mj are the abbreviations for T. aquaticus, E. coli and M. jannaschii RNAP
respectively. d is discrimination level and is defined by the ratio between (kpol/Kdis) for the correct substrate
and (kpol/Kdis) for the incorrect substrate, where kpol is the elongation rate (i.e., misincorporation rate in the
case of incorrect NTP) and Kdis is the dissociation rate. kd is kinetic discrimination and is defined by the
elongation rate divided by the misincorporation rate. ncNTP stands for non-complementary riboNTP and
cNTP stands for cognate riboNTP. cGTP/ncGTP field is filled (in comparison to cATP/ncATP,
cCTP/ncCTP and cUTP/ncUTP fields that are not) because the comparison arises from experiments
performed on different ECs, where i + 1 register pairs GTP and where i + 1 register does not pair GTP.
Table 2 summarizes the results of kinetic experiments performed with RNAP lacking a TL domain and
shows that ribose discrimination is achieved in the open active center state. Moreover, isomerization
reversal kinetic studies from Burton and colleagues [Gong, et al., 2005; Xiong, et al., 2007] indicate that
downstream i + 2 and/or i + 3 complementary 2’dCTP did not stimulate isomerization of i + 1 NTP
(while CTP did) and that isomerization reversal was weak for incorrect i + 1 NTP (strong for CTP), and
that i + 4 to i + 6 complementary 2’dUTP (or 2’dTTP) did not stimulate reversal of i + 1 NTP (while
UTP did). Hence, one can hypothesize that the first step of ribose discrimination is indeed achieved in
the main channel, and that the above findings would be explained by the deoxynucleotide
(ribonucleotide is the right substrate) not binding to downstream DNA during the short time scales of
the kinetic experiments. However, an alternative explanation for the above isomerization observations
could be that 2’dNTPs remain bound to DS register, but impede the translocation sliding degrees of
freedom. Such a hindering effect could arise from an altered Watson-Crick geometry (tilted ribose ring)
inducing steric clashes in the channel. Alternatively, electrostatic and/or hydrophobic impediment could
arise from the fact that a deoxynucleotide lacks a hydroxyl group (negative electrostatic potential). The
isomerization experiments seem to corroborate the idea that dNTPs are discriminated against in the main
channel and are consistent with the observation that such selection is achieved partly without the TL domain
interaction network in the active site.
Although one can postulate that, as mentioned above for potential factors affecting translocation, bond
integrity is disfavored in the main channel for deoxynucleotides, and that this mechanism might
involve additional phenomena such as subtle electrostatic, hydrophobic and/or steric filtering (e.g.,
during translocation, by steric clash with atomic contacts of BH residue Y836), two likely explanations are
that a dNMP base might H-bond with higher affinity to a matched rNTP than to a
complementary dNTP (at this stage, H-bonding chemistry is still not fully elucidated), or that stacking
interactions differ between adjacent deoxy- and adjacent ribonucleotides. In favor of very subtle
interactions occurring between adjacent NTPs or opposite NTP-dNMP pair are the results displayed in
Table 2, which seem to suggest that slight atomic property differences between the NTP types induce
dramatic discrimination differences. Also, Yuzenkova et al.’s finding that ribose rather than base
discrimination depends more on the TL interaction network, is consistent with the idea that H-bonding
discriminates much better the base moiety than the ribose ring and is consistent with the observation
from Fouqueau and colleagues that binding (and incorporation) of 2’dNTP by WT RNAP was 680 times
more frequent than for ncUTP. Consistent with the fact that 3’dNTPs are poorly (e.g., 3-fold kinetic
discrimination for 3’dATP against rATP for T. aquaticus RNAP) or not (e.g., 0.4 kinetic discrimination
for 3’dGTP against rGTP for T. aquaticus RNAP) discriminated against, is the observation that the 3’OH
is located more on the periphery from the adjacent NTP than the 2’OH. Part of the explanation for the
discrepancies between the selectivity levels may be the following. The NTP type (i.e., base identity) is
important, because depending on the type of H-bonding interaction it forms with the opposite dNMP
base, the base would tilt more or less the hydroxyl groups of the ribose moiety towards adjacent pre-
bound NTPs. Alternatively, a possibility that cannot be excluded is that the substrate types do not all
have the same probability of being misincorporated in the absence of the TL. Even if ribose
discrimination were not performed in the downstream bubble but only in the
active center, it would not invalidate the main channel theory. According to Nick McElhinny et al.
([Nick McElhinny, et al., 2010]), rNTPs are 82-fold more abundant than dNTPs in yeast. According
to Traut’s average concentrations ([Traut, et al., 1994]), the ratio is 47-fold in mammalian cells. It
follows that, if no ribose pre-selection were performed, the enzymatic complex would need to recover
from a misloaded dNTP in the active site only a small fraction of the time.
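The size of this "small fraction" can be estimated with a minimal sketch. It assumes, purely for illustration, that in the absence of any ribose pre-selection a nucleotide is picked up in proportion to its share of the combined rNTP + dNTP pool. The function name `dntp_fraction` is hypothetical and does not come from the cited studies.

```python
def dntp_fraction(fold_excess_rntp: float) -> float:
    """Share of dNTPs in the combined pool when rNTPs are `fold_excess_rntp`
    times more abundant (assumption: uptake proportional to concentration)."""
    return 1.0 / (1.0 + fold_excess_rntp)

# Fold excesses quoted in the text.
yeast = dntp_fraction(82)   # Nick McElhinny et al., 2010
mammal = dntp_fraction(47)  # Traut, 1994

print(f"yeast: {yeast:.1%} of loading events would involve a dNTP")
print(f"mammalian cells: {mammal:.1%}")
```

Under this proportionality assumption, roughly 1 to 2 % of loading events would bring a dNTP to the active site.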
Table 2: Comparison of nucleotide ribose discrimination between several studies for enzyme with deleted
TL domain. The results colored in purple, green and red are from [Zhang, et al., 2010], [Yuzenkova, et al.,
2010] and [Fouqueau, et al., 2013] respectively. Ta, Ec and Mj are the abbreviations for T. aquaticus, E. coli
and M. jannaschii RNAP respectively. d is discrimination level and is defined by the ratio between
(kpol/Kdis) for the correct substrate and (kpol/Kdis) for the incorrect substrate, where kpol is the elongation
rate (i.e., misincorporation rate in the case of incorrect NTP) and Kdis is the dissociation rate. kd is kinetic
discrimination and is defined by the elongation rate divided by the misincorporation rate. cd is
concentration discrimination and is defined by the ratio between incorrect and correct substrate
concentrations required to elongate half of the RNA transcript. 2’dNTP and 3’dNTP stand for
complementary 2’deoxyNTP and 3’deoxyNTP respectively, NTP stands for cognate riboNTP.
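The three measures defined in the table captions above can be written out as a minimal sketch. The function names and the numerical inputs are hypothetical and chosen only to illustrate the definitions; they are not values from the cited studies.

```python
def discrimination_level(kpol_correct, kdis_correct, kpol_incorrect, kdis_incorrect):
    """d: (kpol/Kdis) for the correct substrate divided by (kpol/Kdis) for the incorrect one."""
    return (kpol_correct / kdis_correct) / (kpol_incorrect / kdis_incorrect)

def kinetic_discrimination(elongation_rate, misincorporation_rate):
    """kd: elongation rate divided by the misincorporation rate."""
    return elongation_rate / misincorporation_rate

def concentration_discrimination(conc_incorrect_half, conc_correct_half):
    """cd: incorrect over correct substrate concentration required to
    elongate half of the RNA transcript."""
    return conc_incorrect_half / conc_correct_half

# Hypothetical inputs in arbitrary units, for illustration only.
d = discrimination_level(10.0, 2.0, 1.0, 4.0)
kd = kinetic_discrimination(10.0, 1.0)
cd = concentration_discrimination(500.0, 5.0)
print(d, kd, cd)
```

All three are dimensionless ratios, so values from different enzymes can be compared directly as long as numerator and denominator share the same units.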
At first glance, discrimination mechanisms occurring in the active center could appear inconsistent with
the main channel theory. Indeed, in the secondary channel model, NTPs are verified directly in the active
site, and an incorrect NTP in the A site is simply expelled through CH2, freeing i + 1 register for
subsequent binding. However, in the main channel model, a misloaded NTP at i + 1 position seems more
problematic, as its expulsion would leave i + 1 register unpaired while DS registers are paired (e.g., i +
2, i + 3). Furthermore, if the active site TL interaction network constitutes a second
layer of discrimination that detects errors from the first layer of selection (i.e.,
misloading), then the issue is: can the enzyme quickly recover from failures of the first layer? In this
paragraph, we will investigate potential recovery mechanisms. We will show that the main channel
loading model could very well accommodate pre-selection errors, and hence sometimes allow a wrong
substrate to be channeled into the catalytic center. Let us assume that pre-binding in the downstream channel is
granted by an opening connecting the site to the solution and let us term this opening tertiary channel
(CH3). Let us also assume that pre-binding in the downstream DNA channel can occur at i + 2 or i + 3
register sequentially (findings from [Xiong, et al., 2007] imply that the first available allosteric site could
be i + 4), that is to say that every nucleotide first binds at these sites before being incrementally shifted
to the upstream position after each nucleotide addition cycle. Two scenarios could explain how RNAP
would recover from a loading error in the main channel paradigm. The first recovery mechanism could
be the following. If an incorrect NTP is loaded from the main channel to the catalytic center, the TL
interaction network stimulates its expulsion (while forbidding catalysis) via the secondary channel.
Now, i + 1 is unpaired, while i + 2 and i + 3 are paired. One could postulate that the latter configuration
(i.e., “hole” at i + 1, while DS registers are paired) induces a deviation of tDNA strand, which in turn
weakens the RNA-DNA hybrid. Forward translocation could then be hindered and backtracking
promoted. Two steps of backtracking could reposition i + 1 register at the i + 3 pre-binding site, i + 2
and i + 3 NTPs could detach from the DNA by stalling against the tertiary channel walls and the non-
template strand could rewind with the template strand. If two steps of backtracking are too costly, one
could examine another possibility. In case of a wrong substrate in the active site, and its subsequent
expulsion via the pore, a simple pre-translocation event would reposition i + 1 register at i + 2 position.
Then, a NTP would simply need to rebind to i + 2, and i + 3 NTP-dNMP pair would not be affected.
This would only require the i + 3 pair not blocking the passage for i + 2 NTP, which seems to be validated
by the observation of structural data. The transcription process can then resume. Scenario one seems
more complicated because it involves the requirement of detachment of the downstream pre-bound
NTPs. However, because it is a known fact that the TF cleavage process occurs in an arrested EC where
backtracking normally consists of a several nucleotides length interval, it appears that for this
phenomenon to occur in the main channel theory paradigm, the detachment of the downstream substrates
must be possible. Backtracking of tDNA in the downstream bubble that preserves the base-pair
hydrogen bonds and does not require rewinding of the downstream DNA strands would only be
possible for a few registers, because a longer backtracking that conserves paired tDNA bases would
strongly interfere with the rewinding of the template and non-template strands. In short, scenario 1 does
not necessarily require detachment of the downstream NTPs, but such a phenomenon appears to occur in the
backtracking process that is notably involved in the elementary step of the cleavage process.
Authors, supporting the secondary channel theory, claim that all NTPs bind to the E site, while only an
NTP able to base-pair with i + 1 DNA position will bind to the A site [Batada, et al., 2004; Wang, et al.,
2006; Martinez-Rucobo, et al., 2013]. This could be interpreted as a pre-filtering mechanism for the
base identity occurring between the E and the A sites. The authors seem to suggest that all rNTPs do not
necessarily enter completely inside the catalytic site and bind to the i + 1 position. This raises an
immediate issue: how could an rNTP be selected at a distance from the i + 1 register, when what determines the
correct rNTP (out of the 4 types) is the set of properties of the i + 1 DNA base? Authors have suggested that
the TL could serve as a bridge, allowing the DNA and a distant NTP to be read at the same time. However,
Yuzenkova et al.’s findings (consistent with kinetic data and consistent with structural data) eliminate
this far-fetched possibility: the TL proofreading mechanism only concerns a NTP bound at i + 1 position
being located inside the active center. In addition, discrimination in the open active center state also
concerns a H-bonding event. In other words, in order to be discriminated against, all rNTPs need to try
and bind to the i + 1 register. The initial binding configuration in the catalytic center has been proposed
to concern a location distinct from the addition site, however this issue is not important for our
discussion. The bottom line is that in order to be discriminated against, rNTPs must position in front of
the i + 1 register hence enter inside the active center in the secondary channel theory paradigm. This has
a very important implication. The rate limiting aspect of NTP diffusion in the secondary channel model
is much more important than commonly considered (details in the next section) and the presence of the
hypothetical E site does not change the problem at all: most of the time, an incorrect NTP will load in
the catalytic site via CH2, try and bind to i + 1, and will have to diffuse away to clear the path for the
correct substrate. On the other hand, because pre-binding in the main channel theory constitutes the first
layer of discrimination, the misloading/expulsion frequency is much lower than in the alternative
model.
8. Discussion
The secondary channel mode of substrate entry in the active site appears to suffer severe limitations.
First, the properties of the pathway pose an immediate issue. The far end of the channel comprises a
narrow corridor, which has a diameter oscillating between 7 and 12 Å (according to the literature, but see
the in-depth analysis in chapter 5), and can also completely contract. Not only is the corridor
very constricted, it also has a strong negative electrostatic potential. Incoming MgB-NTP complexes
have a net charge of −2 and a minimum diameter of 6 Å. It follows that the substrate
experiences repulsion preventing it from approaching the corridor and that the channel can only accommodate
one NTP at a time.
In their 2004 study, Batada and colleagues defended the plausibility of the pathway as mode of loading,
because taking into account the restrictions mentioned above seemed to allow a synthesis rate consistent
with the normal elongation rate in vivo [Batada, et al., 2004]. However, they failed to take into account
several important limiting parameters. First, the NTP trajectory to the active site is significantly more
obstructed than they considered. The corridor needs to exchange wrong substrates in and out. According
to the secondary channel theory, all substrates can enter the corridor without discrimination where they
can bind to the E site. As described in the previous section, in fact a rNTP bound to the E site needs to
rotate in the catalytic center and bind to i + 1 register (equated to A site in this paragraph for sake of
simplicity, but initial i + 1 binding site might be slightly distinct from A site, which has no importance
for our discussion) in order to be discriminated against. Because A and E sites are mutually exclusive
(both sites cannot be occupied simultaneously, notably because of shared MgB binding contact), let us
consider to simplify the problem that every time a rNTP binds to the E site, it then rotates to the A site,
and is expelled if it is the wrong rNTP or is incorporated if it is the right substrate. It follows that because
there are four different kinds of NTP, most of the time an incorrect NTP will bind to the E site, rotate to
the A site, and finally be expelled. Let us term this time window, where access of the correct NTP is
blocked, the “NTP access” window. The substrate concentrations measured in yeast by Nick McElhinny et al.
[Nick McElhinny, et al., 2010] allow the issue to be investigated in more detail. NTP concentrations in yeast are the
following: rATP: 3.0 mM, rUTP: 1.7 mM, rGTP: 0.7 mM, rCTP: 0.5 mM, dTTP: 30 μM, dATP: 16 μM,
dCTP: 14 μM, dGTP: 12 μM.
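As a quick check on these figures, the sketch below computes each rNTP's share of the total rNTP pool and the overall rNTP/dNTP ratio from the concentrations listed above. Treating the chance that a given rNTP occupies the access path as equal to its share of the pool is a simplifying assumption made for illustration, and the variable names are hypothetical.

```python
# Yeast concentrations from the text, in mM (dNTPs converted from micromolar).
rntp_mM = {"rATP": 3.0, "rUTP": 1.7, "rGTP": 0.7, "rCTP": 0.5}
dntp_mM = {"dTTP": 0.030, "dATP": 0.016, "dCTP": 0.014, "dGTP": 0.012}

total_r = sum(rntp_mM.values())
total_d = sum(dntp_mM.values())

# Share of each rNTP in the rNTP pool (assumed proportional to concentration).
share = {name: conc / total_r for name, conc in rntp_mM.items()}
# Probability that a competing, non-cognate rNTP is encountered instead.
blocked = {name: 1.0 - s for name, s in share.items()}

print(f"rNTP/dNTP ratio: {total_r / total_d:.0f}-fold")
for name in rntp_mM:
    print(f"{name}: share {share[name]:.0%}, competitors {blocked[name]:.0%}")
```

With these inputs the shares come out at about 51 %, 29 %, 12 % and 8 % for rATP, rUTP, rGTP and rCTP, and the rNTP/dNTP ratio at about 82-fold.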
According to Traut’s average concentrations in mammalian cells, ribose compounds represent about
0.13 mM and Pi compounds 4.4 mM [Traut, et al., 1994]. The latter concentrations are informative about
the fact that while dNTPs could be neglected, Pi and ribose compounds would often encounter NTPs in
solvent, if average mammalian concentrations apply more or less to yeast organisms. Next, let us
consider substrate competition at the E site and let us simplify the problem by only considering rNTPs.
ATPs represent ~51 % of polymerization substrates, UTPs: ~ 29 %, GTPs: ~12 %, CTPs: ~8 %. One
could then postulate that if GTP or CTP is the next nucleotide to be added, NTP
access will be blocked by an alternative substrate 88 to 92 % of the time. In other words, for a matched GTP or CTP to enter
the active center, the cognate substrate would need to wait 88 to 92 % of the time for insertion/expulsion
of the wrong NTP to occur. We can see that this dramatically discredits the secondary channel as a
plausible pathway. Batada et al. calculated the probability of successful diffusion by releasing one
substrate at a time from the entrance of the funnel and counted how many times the molecule eventually
binds to the E site. This assumption fails to include all the other molecules impeding the trajectory,
especially other substrates bound in the E site or misloaded rNTPs in the catalytic site. The authors
mention that “the rate of collisions resulting in binding may be further reduced by one or even two orders
of magnitude by the steric requirements for binding” and they apply this constraint in their final
estimated diffusion rate. Simply dividing the success rate by 10 or more does not constitute a rigorous
calculation method, all the more so because the authors do not specify the physical grounds for their
assumption. Let us recalculate Batada et al.’s diffusion probability using realistic rNTP substrate
concentrations. According to the requirements of the CH2-hypothesis, the rNTP needs to be oriented in
a specific way with the polyphosphate tail oriented ahead (towards the active center) and longitudinally
with the axis of the corridor. It follows that successful diffusion must be decreased by a steric clash
factor. Let us take into account this impediment by using the steric clash factor of 𝑥 ≤ 0.1 proposed by
Batada and colleagues. It follows that the upper limit of diffusion probabilities is given by:
$$\text{successful diffusion of rNTP} = \text{rate of rNTPs accessing E site} \times \text{concentration of rNTP} \times \text{probability of rNTP access not occupied by wrong rNTP} \times \text{steric clash factor}$$
By using Batada et al.’s rate of rNTPs accessing the E site ($2\times10^{5}\ \text{s}^{-1}\,\text{M}^{-1}$), this equation can be
rewritten as:
$$\text{successful diffusion of rNTP} = (2\times10^{5}) \times \text{concentration of rNTP} \times \text{probability of rNTP access not occupied by wrong rNTP} \times x, \quad x \le 0.1$$
Hence,
$$\text{successful diffusion of rATP} = (2\times10^{5}) \times 0.0030 \times 0.51 \times x \le 30.60\ \text{rATP}\,\text{s}^{-1}$$
$$\text{successful diffusion of rUTP} = (2\times10^{5}) \times 0.0017 \times 0.29 \times x \le 9.86\ \text{rUTP}\,\text{s}^{-1}$$
$$\text{successful diffusion of rGTP} = (2\times10^{5}) \times 0.0007 \times 0.12 \times x \le 1.68\ \text{rGTP}\,\text{s}^{-1}$$
$$\text{successful diffusion of rCTP} = (2\times10^{5}) \times 0.0005 \times 0.08 \times x \le 0.80\ \text{rCTP}\,\text{s}^{-1}$$
Now, in order to compare with the assumed in vivo RNAP II polymerization rate of ~10 rNTP s⁻¹, let us
consider an even more generous upper limit and assume that rNTPs are incorporated immediately after
binding or expelled instantly if non-cognate. Because DNA bases in the tDNA strand can generally be
considered as fairly evenly distributed in most organisms, the in vivo rate of 10 rNTP s⁻¹ can be simplified
to an incorporation segment consisting of 2.5 s⁻¹ of each NTP. In this ideal model (incorporation delay
ignored, delay of NTP rotation through the corridor ignored), and assuming that an NTP bound in the E site can
even rotate to the A site (which still requires direct evidence), 1 second is not enough to incorporate the
required number of GTPs or CTPs, since the upper limits of 1.68 rGTP s⁻¹ and 0.80 rCTP s⁻¹ both fall
below 2.5 s⁻¹. Hence, although the NTP concentrations used in the calculation must
be taken with care, because intracellular compartmentalization processes could occur and represent an
unknown parameter, it is unclear whether the calculated diffusion probabilities are realistic.
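The upper-limit estimates above can be reproduced with a short sketch. It uses the access rate attributed to Batada et al. (2×10^5 s⁻¹ M⁻¹), the yeast rNTP concentrations, the largest allowed steric clash factor (x = 0.1), and the assumption made in the text that the access probability of each rNTP equals its rounded share of the rNTP pool. The variable and function names are hypothetical.

```python
ACCESS_RATE = 2e5          # s^-1 M^-1, rate of rNTPs accessing the E site (Batada et al.)
STERIC_FACTOR_MAX = 0.1    # upper bound of the steric clash factor x
REQUIRED_PER_NTP = 10 / 4  # s^-1, assuming ~10 rNTP/s spread evenly over the 4 bases

# name: (concentration in M, probability that access is not blocked by a wrong rNTP)
substrates = {
    "rATP": (0.0030, 0.51),
    "rUTP": (0.0017, 0.29),
    "rGTP": (0.0007, 0.12),
    "rCTP": (0.0005, 0.08),
}

def upper_limit(conc_molar: float, p_access: float) -> float:
    """Upper bound on successful diffusion events per second for one rNTP type."""
    return ACCESS_RATE * conc_molar * p_access * STERIC_FACTOR_MAX

limits = {name: upper_limit(c, p) for name, (c, p) in substrates.items()}
for name, rate in limits.items():
    verdict = "meets" if rate >= REQUIRED_PER_NTP else "falls short of"
    print(f"{name}: <= {rate:.2f} per s, {verdict} {REQUIRED_PER_NTP} per s")
```

Under these assumptions only rATP and rUTP reach the 2.5 s⁻¹ needed per base, while rGTP and rCTP do not, even with every delay ignored.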
Another issue to be raised with their study is the following. When estimating the diffusion impediment
induced by the electrostatic potential, they equated successful diffusion with an NTP binding to the E
site. However, there is no experimental evidence that an NTP bound in the E site can rotate to the A
site. This has only been inferred and never observed. If this unproven assumption is wrong, then a
matched substrate binding to the E site is not at all equivalent to successful diffusion to the catalytic
center. At this stage, an rNTP bound to the E site still needs to undergo an almost 180° rotation through
the narrow corridor, and therefore the diffusional impairment induced by the corridor dimensions and
electrostatics is not yet fully accounted for. If such a rotation does not occur, then the probability of
diffusion from the E site to the A site is likely to be greatly reduced. Indeed, because the E site is located in
the first two thirds of the corridor, the full diffusional impairment induced by the corridor dimensions
and electrostatics is not accounted for. Furthermore, rotation from the E site to the A site seems
difficult to explain. When a matched NTP binds to the E site, MgB is temporarily bound to the pore wall.
Consequently, the MgB contribution to the repulsion is partially neutralized because it is anchored to
the wall, and serves only as a rotor. MgB is positively charged and therefore, the rest of the NTP that
remains in free motion and that accounts for most of the total - 2 negative charge of the NTP still needs
to overcome the negative repulsion of the pore during the rotation. It could be hypothesized that MgB
temporarily screens the electromagnetic field lines of the potential allowing the NTP to rotate, yet this
seems far-fetched. There is no physical basis to explain how an NTP bound to the E site in an inverted
position would rotate. Furthermore, the fact that the crystallography experiments were able to capture
an NTP in an inverted position in the E site could indicate that this arrangement persisted and that the NTP was
unable to rotate (discussed in more detail below). On the other hand, the fact that no NTPs pre-bound
in the main channel have been seen in crystallographic data could simply mean that they were not
immobilized long enough in that position. They could also be mistaken for paired bases.
The researchers claim that “delivery of NTPs by diffusion may be just sufficient to maintain the rate of
RNA synthesis”. However, for all the reasons mentioned above, it seems clear that the probability of
successful diffusion via the secondary channel is not sufficient at all to allow a physiological rate of
processive elongation. In short, because the calculated diffusion probability is already barely sufficient
and can be considered as the very upper limit, it seems that this study is in fact strong evidence against
the secondary channel theory. Their research has nevertheless yielded crucial information about
the restrictions imposed by the pore’s properties on diffusion. It should be noted that the restrictions
imposed by the secondary channel are very likely to apply to the other RNAP species. For example, the
negatively charged residues of the corridor (Rpb1 D481, D483, D485, E486, E822, D826, E1074, and
Rpb2 E529, E836, D837) are absolutely conserved among yeast, M. jannaschii, C. elegans, Drosophila,
human and mouse. Among the negatively charged residues that are directly adjacent to the pore, Rpb1 D356,
D526 and Rpb2 D978 are conserved, Rpb1 E833 and D1359 are highly conserved, and Rpb1 E771 and Rpb2 D1100
are moderately conserved. Also, bacterial RNAPs display a cone-shaped secondary channel, which would
impose a similar topological impairment, although the pathway is shorter.
In [Kireeva, et al., 2010], Kireeva, Burton, et al. underline that the diffusion rates calculated by
Batada and colleagues are 50 times slower than the experimentally observed rates of template-specific
NTP sequestration for human and yeast RNAP II and E. coli RNAP in [Foster, et al., 2001; Holmes,
et al., 2003; Nedialkov, et al., 2003; Zhang, et al., 2004; Kireeva, et al., 2008; Kireeva, et al., 2009].
According to the successful diffusion rates calculated above, notably for CTP, which already represent more than the very
upper limit, the issue would be even worse. So even if template specificity can facilitate successful
diffusion in the CH2 paradigm (e.g., by suppressing non-template roadblocks at the E site and greatly reducing
diffusion competition), it seems hard to explain such a sequestration rate (i.e., successful catalytic loading
rate) with the restrictions imposed by the channel.
Concerning the second computational study ([Zhang, et al., 2015A]), which seemingly eliminates the main
channel as a credible substrate pathway on the grounds that it is unfavorable both conformationally and
electrostatically, the experiments carried out suffer from the following issues. First, the researchers
ran a pathway detection program, CAVER ([Chovancova, et al., 2012; Kozlikova, et al., 2014; Pavelka,
et al., 2016]), to identify cavity routes inside the enzyme. The resulting proposed substrate-accessible
zone within CH1 seems particularly absurd in light of the conformational results presented in chapter 5.
The work carried out in this thesis strongly refutes their conformational analysis. To run properly, the
CAVER program needs an initial starting point for the pathway search to be defined, and it is possible that the authors
severely misused the computer tool. Second, the methodology of fitting NTPs directly into the estimated
available empty areas (an estimate that is wrong to begin with) is very questionable: it does not shed
any light on the diffusion process. Third, they reach the conclusion that an NTP fitted inside the
secondary channel experiences less repulsion than an NTP inside the main channel. However, the
diffusion impediments associated with the secondary channel theory do not concern the entire secondary
channel, but only a select area: the last narrow section, which is the corridor. There is of course plenty
of space in the first two thirds of the secondary channel, which appears to serve another purpose than
substrate loading (conic shape is ideal for expelling inorganic pyrophosphates, misloaded NTPs, large
area to accommodate TFIIS, etc.). Finally, their electrostatic analysis is not corroborated by the work
presented in chapter 5. It is possible that the main channel substrate route they detected (perpendicular to
CH1, and appearing to encircle the ntDNA strand) is indeed electrostatically unfavorable
because it is too close to the ntDNA. Alternative pathways, such as CH3C or CH3A, have not been taken
into account. Their claim that the secondary channel is electrostatically balanced is refuted by this thesis,
but also notably by [Batada, et al., 2004].
Now let us specifically examine the E site evidence. The argument in favor of the secondary channel
theory is a question: why would NTPs be observed bound in CH2, very close to the active site, if they load through
a different pathway, while no NTPs have been observed bound in the downstream channel in
crystallographic/Fourier electron density data? Several remarks can be made. First, binding in the E site
could represent a singular event rather than the normal reaction pathway. While biochemical and
rapid kinetic techniques could be more suitable for capturing the dynamic elongation process, the
experimental procedure used to generate enzyme crystals does not represent processive
elongation. The enzymatic complexes are soaked in a solution containing only one type of NTP, which
forbids sequential processive elongation to occur. In short, it could be that the experimental conditions
do not allow normal processive elongation to occur and hence do not allow hypothetical normal loading
through the main channel. In other words, a possibility to be considered is that binding events to the E
site occur because the normal reaction pathway through the main channel is eliminated. Therefore, even
though diffusion through the secondary channel could be less favorable than loading through the main
channel during normal transcription rate conditions, it could become the default pathway in
crystallographic experimental conditions. It follows that even with the diffusional restrictions exposed
previously, if granted a sufficient amount of time, a NTP could very well successfully bind to the E site
rather than bind in the downstream bubble. Very important to mention is that the E site is located near
the beginning of the corridor, hence diffusion to the E site concerns the most favorable route through
CH2, as the main impediment of the pathway occurs from the corridor. Furthermore, Kireeva et al.
[Kireeva, et al., 2010] have suggested that in the experimental procedure used for generating crystals,
blocking chemistry at the i + 1 site (necessary to fix the i + 1 NTP) might disable substrate loading via
the main channel.
Now let us consider the possibility that loading through the tertiary channel and via the main channel
was not distorted. The study from Batada et al. is consistent with NTPs being able to bind to the E site,
even if the event is rare. One could then object that such binding would disturb the main channel pathway,
for example by preventing the incoming NTP-dNMP pair from binding to the A site. However, under real
in vivo conditions (e.g., presence of all types of rNTPs in the solvent buffer), binding to the E site could be
almost permanently prevented by occupancy of the A site by NTPs loaded from the main
channel. One possibility is that in the fast state, NTPs never have time to bind to the E site, because
translocation could be locked forward and the E site could always be gated: a nucleotide is being
incorporated, which forbids access to the E site; then translocation brings a new NTP into the active center
before access to the E site is clear (e.g., because PPi is not yet released or because the RNA 3’ end gates
binding to the E site), which binds the next nucleotide to the A site and still forbids access. The cycle can
then resume, and the E site will always be gated by the successive loading/incorporation of NTPs incrementally
translated from the main channel. If the loaded NTP is incorrect, its expulsion would forbid access
to the E site, and rapid backtracking motion could prevent access. Finally, it could even be possible that
the enzymatic complex would support a few NTPs binding to the E site in normal transcription. The
requirement would then be that activity is not distorted. For example, RNAP could just wait for the NTP
to dissociate from the E site, or alternatively, the incoming NTP channeled from i + 2 to i + 1 position,
could expel the parasitic NTP bound to the E site, by competitive binding.
Other evidence has been proposed for loading via the secondary channel. In 2009, Erie and colleagues
found that mutating E. coli residue D675 led to a significant increase in misincorporations [Erie, et al.,
2009]. The authors suggested that the residue played a role in filtering substrate diffusing through the
secondary channel. However, the residue is located directly adjacent to the bridge helix (notably, β’ 772, 775
and 779), and within electrostatic interaction distance of the TL tip. So mutating the residue could very
well impair a key function. Studies on TL E1103G and bridge helix mutations have shown that these
domains affect fidelity, probably indirectly by affecting the bridge helix or directly by affecting TL
mobility. Hence, the D675 mutation is not conclusive evidence for secondary channel loading. Alternatively,
the negatively charged residue might promote the electrostatic expulsion process: the secondary channel,
and in particular the corridor, can be seen as an electrostatic gun expelling the negatively charged PPi
molecule and negatively charged misloaded NTPs, as explained in chapter 5. It follows that removing the
charged amino acid might hinder the expulsion of misloaded NTPs, and hence indirectly promote transcription errors.
Concerning the microcin J25 evidence, let us show that it is very weak. First, concerning the residues
that bind the toxin molecule, the authors claim that “The side chains of the majority of implicated
residues are solvent accessible—directed into the lumen of the RNAP secondary channel or toward the
exterior of RNAP—and make no obvious interactions important for RNAP structure or function”
[Mukhopadhyay, et al., 2004]. However, this is an exaggerated statement. E. coli binding residues β’
775-777, 779, 780, 782-786, 789, 790 belong to BH, β’ 922, 926, 927, 930-933, 1136, 1137 belong to
TL, β’ 744,748 belong to Floop and β 543-545 belong to FL2. Hence, insertion of microcin J25 would
notably directly interfere with two of the most important domains involved in the NAC (TL and BH).
In addition, the molecule could inhibit transcription by preventing the release of PPi. So not only
would microcin inhibit transcription by trapping the PPi molecule very near to
the A site, which would completely disturb the active site geometry and electrostatics, but it also clearly
appears to restrict the conformational degrees of freedom of key transcription domains such as
BH, TL and FL2. Furthermore, the fact that inhibition is partially overcome at high NTP concentration
does not seem very consistent with the assumption that it blocks substrate loading. If the molecule stays
in place, direct inspection makes it clear that no substrate should be able to bypass it to
access the corridor (microcin almost perfectly seals off the secondary channel, leaving no room for the
passage of a molecule the size of an NTP).
Before concluding this discussion, the puzzling studies about the Brownian ratchet mechanism must be
addressed. Substrate diffusion/loading and translocation are concepts that go hand in hand, because NTP
binding belongs to the more general translocation/transcription cycle. It is therefore not surprising that
these processes have almost always been considered together. To study translocation, the key process of
transcription, it makes intuitive sense to pull on the nucleic acid frame and/or the enzyme
in a controlled manner. Single-molecule optical tweezers experiments serve that purpose by
holding the extremities of the DNA in an optical trap and exerting an assisting or opposing force. The
basic concept underlying these studies is to fit a kinetic equation describing translocation
(including stepping distance, force, temperature, etc.) to experimental measurements under different
conditions, such as varying force, NTP concentration or nucleic acid translocation track, and in return to
validate the axioms of the model. Although seemingly impressive and very accurate, this methodology suffers
from the following limitations. The greatest weakness of fitting experimental measurements to a
kinetic model is that a model that describes reality is not necessarily reality itself. In other
words, good agreement between a kinetic fit and a model does not mean that all the starting assumptions
of the model are correct. For example, some researchers suggested that results supporting the main
channel theory were invalid because a secondary NTP binding site was not a necessary assumption of
their kinetic model: “we were able to obtain reproducible global fits with the two pawl model without
the need to introduce additional NTP binding sites.” [Bar-Nahum, et al., 2005], “this model does not
invoke additional NTP binding sites at different translocation states, allosteric NTP binding sites,
active/inactive conformational states” [Bai, et al., 2007], “the quality of this fit to our conceptually
simpler model indicates that a more complex model with two NTP binding sites is not necessary to
explain this data” [Maoileidigh, et al., 2011]. Not only does their kinetic fit suffer from limitations that
will be discussed below, but their data can actually be explained with an NTP binding to i + 2, with the only
distinction that it is not the initial binding of the NTP that rectifies the ratchet but only its loading into the
active site. Hence, these papers reason hastily when they claim that the agreement between their kinetic
equation and their initial hypothesis (that NTP binds directly to i + 1) implies that the main channel theory is
incorrect: “It is reassuring that our model not only explains all the biochemical experiments presented
in the present paper but also provides a consistent and natural explanation of published kinetic data”
[Bar-Nahum, et al., 2005] and “our model does not invoke any hypothetical allosteric and/or
template-specific NTP binding sites other than i + 1 to explain the biphasic rate curves. Simply, under
substrate-limiting conditions, the F bridge has a higher probability to melt the 3’ end of the hybrid, thus
facilitating backtracking.” [Bar-Nahum, et al., 2005]. The authors offer no real explanation
as to why the existence of a secondary binding site should be discarded, and no explanation at all of the
kinetic data supporting the main channel theory. There is no link between the BH (also referred to as the
F bridge) facilitating backtracking on particular occasions and pre-binding of NTPs in the downstream bubble
facilitating forward translocation. Furthermore, recent single-molecule studies [Larson, et al., 2012;
Dangkulwanich, et al., 2013], offering much more balanced views, quite directly contradict the claim
that a secondary binding site does not exist. Some experiments deriving kinetic
parameters from force-velocity relations should be regarded with caution, because they might involve
wrong starting assumptions such as rapid translocation equilibrium, which is strongly contested
[Dangkulwanich, et al., 2013]. Second, single-molecule studies do not always monitor translocation as
precisely as they seem to. At non-subsaturating NTP concentrations, i.e. in normal processive elongation,
the resolution of the single-molecule experiments is limited to a three-base-pair interval
[Maoileidigh, et al., 2011]. Also, kinetic fits can involve the averaging of normalized data, or multiple
independent fit parameters, hence erasing details and artificially improving the apparent agreement with
the model used. The study from [Dangkulwanich, et al., 2013] seems more general than previous attempts to
characterize the kinetics of translocation because its model does not assume translocation equilibrium,
ignores NTP binding rates in its initial equation assumptions, and treats forward and reverse
translocation with separate parameters. Its findings are in full concordance with the CH1 model,
namely post-translocation locked forward at non-subsaturating substrate concentrations and the existence
of a secondary binding site independent of the translocation state.
The study from [Bar-Nahum, et al., 2005] mentioned in the previous paragraph poses another issue. The
authors find that when i + 2 NTP is supplemented (in EC34 therein), forward translocation is reduced,
and conclude that the allosteric results supporting the CH1 model seem invalid. Let us try to explain their
experimental data with the following reasoning. If the presence of i + 2 NTP reduces the fraction of ECs
in the forward state, it means that somehow there was a deleterious binding competition
between i + 1 and i + 2 NTPs. In the CH2 model, this competition only concerns successful
diffusion to i + 1. If their result is valid, namely lower forward translocation in the presence of 0.5
mM GTP (matched to i + 1) and 0.5 mM ATP (matched to i + 2) than with 1 mM GTP alone, then
one just needs to replace one postulate: the deleterious binding competition happened at i + 2 and not at
i + 1, since any NTP that is to load into the active site must first bind at the i + 2 position. Their experiment
seems far from invalidating the CH1 kinetic results, since NTP chases performed in a very controlled manner,
combined with precise substrate-saturation kinetic curves, are a better characterization method than
measuring the fractions of ECs that are pre- or post-translocated.
None of the arguments in favor of the secondary channel seems solid, and all could be discarded, whereas the
main channel theory is supported by virtually undeniable proof: the fact that NTPs can pre-bind in the
main channel, and that the latter constitutes the default state, is supported by the strong evidence
described in the main channel theory section. In particular, it appears impossible to explain the allosteric
effect of several downstream templated NTPs without accepting that they must pre-bind to the
DNA template strand in the downstream bubble.
9. Concluding remarks
Further elements can be raised to shed some light on the controversy over the mode of substrate loading. A
possible alternative function for the E site arose in 2007, when Toulokhonov et al.
proposed that the frayed RNA 3’ end could bind to the E site during the nonbacktracked pause state
[Toulokhonov, et al., 2007]. Alternatively, the E site could be rationalized by the fact that it simply
represents the MgB binding site (where the inverted NTP binds according to [Westover, et al., 2004A]).
In 2008, Weinzierl and colleagues [Tan, et al., 2008] conducted mutagenesis on bridge helix residues
and observed that some mutations led to increased transcription rates. Because the bridge helix is linked
to translocation and not to substrate loading, this seems to indicate that substrate loading is not rate limiting,
and the result therefore seems inconsistent with the secondary channel theory. In conclusion, the
secondary channel theory is inconsistent with the results and observations presented in this review and
appears impossible.
Chapter 2
MD Methods
1. Introduction
The main channel theory seems to describe the default mode of substrate loading during processive elongation.
However, little is still known about the details of RNAP substrate loading: “currently, no electrostatic
or diffusion modelling is available to indicate how NTPs might load through the main channel” [Kireeva,
et al., 2010]. Also, although this is scientifically questionable given the solidity of the kinetic
evidence, the common consensus is that the CH1 theory still requires “direct” evidence. It therefore appears
necessary not only to shed some light on how the diffusion process might occur, but also to
offer additional evidence. MD is an ideal candidate for such work. Indeed, MD is
a revolutionary computational simulation method allowing unprecedented levels of inspection at the Å
level and from the femtosecond timescale onwards. It is possibly the best method to characterize the
dynamics of a biomolecular system [Meller, 2001; Frenkel, et al., 2002]. Because diffusion is an ultrafast
process, it makes sense to inspect it using a very precise method. By analogy, it may be no
coincidence that the most compelling evidence for diffusion so far has come from kinetic assays,
which can capture ultrafast processes. Crystal structures render an image of
biomolecular systems with atomic precision, yet display no time evolution. MD, using a set of X-ray
crystallography or NMR coordinates as starting input, can be seen as a tool that brings the static image to
life. This section covers MD philosophy and methodology, from the preparation of a static model
to advanced MD procedures that make it possible to extract mechanisms of the diffusion/loading process,
which is currently not well understood, and perhaps to further support the main channel theory. The procedures
have been fully automated and scripted, and are given in the appendices to facilitate reproducibility of
the simulations. To achieve optimal computational power, simulations were run on workstations based on
NVIDIA CUDA Graphics Processing Units (GPUs), which were assembled for this work. The simulated
system is S. cerevisiae RNAP II.
2. Metabolite pool
Choosing realistic metabolite concentrations is important to mimic physiological conditions in MD
simulations. For instance, they can play a crucial role in electrostatic mechanisms (e.g., shielding,
screening), affect the characteristics of the diffusional routes, and impact the overall stability
of the enzyme. In this sub-section, the focus is on the metabolites that are charged, and
particularly on those present in non-negligible proportions. All the concentrations are intracellular (whole
cell or cytoplasmic) and discussed for the yeast S. cerevisiae. In general, measurements derived from aerobic
glucose-limited chemostat experiments, or in reasonable agreement with in vivo-like conditions, have been
selected over batch cultivation experiments. Concentrations expressed in μmol/gDW or mg/gDW are
converted to mM using the factor of 2.38 mL/gDW ([Theobald, et al., 1997; Hans, et al., 2001]), except
for values from [van Eunen, et al., 2010], where the researchers use a conversion factor of 2.083 mL/gDW
(based on their measured culture dry weight of 3.6 g/L and 2.5 × 10^11 cells/L, and an
assumed cell volume of 3 × 10^-14 L). Charged amino acid metabolites present in the
intracellular environment in non-negligible amounts have been measured as Glu: 71-82 mM, Asp: 8-9 mM,
His: 2-2.5 mM, Lys: 1.7-1.9 mM ([Hans, et al., 2003; Canelas, et al., 2008A]) and Arg: 6 mM ([Hans,
et al., 2003]). Realistic intracellular NTP substrate concentrations of ATP: 3 mM, CTP: 0.5 mM, GTP:
0.7 mM, and UTP: 1.7 mM have been measured [Nick McElhinny, et al., 2010]. The resulting ≈ 6 mM
total rNTP content is in reasonable agreement with Traut’s average concentrations in mammalian cells [Traut,
1994]. The ATP level is close to measurements giving intracellular ATP levels around 2.6-3.5
mM [Gonzales, et al., 2000; Canelas, et al., 2008B; Boer, 2009; Volkov, 2015; Magdenoska, et al.,
2015]. Intracellular phosphorus content is 304-320 mM [Graschopf, et al., 2001; van Eunen, et al., 2010].
According to [van Eunen, et al., 2010], most of these atoms are bound and form phosphate groups, which
is consistent with literature data placing phosphate values around 7-43 mM [Lagunas, et al., 1983;
Theobald, et al., 1996; Gonzales, et al., 2000; Auesukaree, et al., 2004; Canelas, et al., 2008B; Zhang,
et al., 2015B]. Sulfur atoms amount to a 44-45 mM intracellular concentration [Graschopf, et al., 2001;
van Eunen, et al., 2010]. Most are bound to glutathione, leaving a free sulfate content of about 5
mM [van Eunen, et al., 2010]. Intracellular Ca2+ concentrations (1.9-2.2 mM, [Graschopf, et al., 2001;
van Eunen, et al., 2010]) result in an estimated 0.5 mM of free cations, as most of them are
compartmentalized in the vacuole [van Eunen, et al., 2010]. Total intracellular Mg2+ content is 51-55
mM [Graschopf, et al., 2001; van Eunen, et al., 2010]. However, only about 1-2 mM of free magnesium
is estimated [van Eunen, et al., 2010]. Indeed, most of the cations bind to anionic compounds such as
nucleic acids, NTPs, NDPs, polyphosphates, etc., or are stored in compartments, e.g. sequestered in
mitochondria and the endoplasmic reticulum [Romani, et al., 1992; Swaminathan, et al., 2003;
van Eunen, et al., 2010]. Intracellular K+ can vary from 50 to 300 mM depending on growth
phase and extracellular K+/Na+ ratio [Volkov, 2015]. However, some studies suggest that K+ can fall to 5 mM
under dramatically disadvantageous extracellular K+/Na+ proportions, while others report a lower threshold
not much below 100 mM even at seriously scarce external potassium content (reviewed in [Volkov,
2015]). Nevertheless, potassium concentrations appear to be well balanced and much more
resilient to changes in environmental conditions than Na+, which depends more on initial extracellular
concentrations [Volkov, 2015]. A study published in [Kahm, et al., 2012] suggests that when
the external medium contains more than 1 mM of potassium, the cation reaches an internal
plateau of 300 mM. Intracellular K+ concentration is optimal around 200 to 300 mM
[Rodriguez-Navarro, 2000; Arino, et al., 2010], consistent with the 208-340 mM concentrations from
literature data [Olz, et al., 1993; Sunder, et al., 1996; van Eunen, et al., 2010], and with the
cation being the most abundant metabolite in yeast [Kahm, et al., 2012]. Although intracellular Na+
concentration can vary significantly depending on the growth conditions [Herrera, et al., 2013; Volkov,
2015], a substantial body of research (e.g., [Sychrova, et al., 2004; Arino, et al.,
2010; Ramos, et al., 2016]) stresses that, to avoid sodium toxicity, the
intracellular proportion of K+ must be significantly higher than that of Na+. To this end,
several mechanisms appear to greatly favor K+ influx over Na+ (e.g., an extreme K+/Na+
transporter selectivity ratio of 1000:1 [Matthius, et al., 1999], strong Na+ efflux mechanisms,
and vacuolar compartmentalization [Montiel, et al., 2007]). Na+ is not even strictly
necessary for S. cerevisiae growing in plentiful potassium [Camacho, et al., 1981]. Low levels of sodium
relative to potassium are well in line with published data measuring 5-28 mM intracellular
concentrations [Olz, et al., 1993; Sunder, et al., 1996; Graschopf, et al., 2001; Kolacna, et al., 2005; van
Eunen, et al., 2010]. An optimal Na+ concentration of 25 mM for phosphate uptake activation has
also been proposed [Martinez, et al., 1998]. Although the yeast S. cerevisiae is a fungus,
the K+/Na+ ratio of 20:1 found in animal cells [Matthius, et al., 1999] can serve as an
indication. Concerning Cl- anions, S. cerevisiae requirements are very low [Rodriguez-Navarro,
2000; Jennings, et al., 2008]. Consequently, the anion can be used solely to ensure charge
neutrality in our MD system, rather than for any significant intrinsic physiological contribution. To
summarize the investigation, let us consider the following overall metabolite study. A team of nineteen
co-researchers attempted to facilitate the transfer of experimental enzyme kinetic data to systems
biology fields such as mathematical metabolic modelling, computational simulation, etc. [van Eunen, et
al., 2010]. To do so, the authors investigated the design of a defined medium for cell-free, in vivo-like
enzyme kinetic assays, whose composition mimics in vivo physiological intracellular
cytosolic concentrations (and pH) as closely as possible while remaining simple (i.e., minimizing the
diversity of metabolites). In other words, they aimed to define a standard assay medium for molecular biology
experiments whose composition resembles the S. cerevisiae in vivo cytosolic metabolite pool. One
application is to allow accurate kinetic mathematical modelling of the in vivo dynamics of metabolic
pathways under the most physiologically relevant intracellular conditions [van Eunen, et al., 2014].
The philosophy of their research overlaps well with our MD metabolite investigation: setting up a
realistic S. cerevisiae intracellular metabolite pool. They propose a K+: 300 mM, Na+: 20 mM,
phosphates: 50 mM, sulfates: 5 mM, free Mg2+: 2 mM, Ca2+: 0.5 mM composition. Verification of
this medium (supplemented with NTP substrates) against cytosolic enzyme activity by kinetic
assay returns good Km values and confirms its credibility as a good physiological fit. In addition, the
values are physiologically credible according to the literature, and agree well with the elements discussed
previously. In the absence of extensive metabolite studies focused on the nucleus
itself, their values seem to be a good standard and initial guess for setting up realistic solvation box
components in the system to be simulated. The following distinction must however be made. Phosphates and
glutamates are the components whose concentrations differ most from physiological measurements. We
propose that MD simulations use a lower phosphate concentration, which is more
in line with the literature and should not affect the system behavior, because varying the
concentration (between 10 and 75 mM) does not seem to have an impact (cytosolic enzymatic activity
unchanged, [van Eunen, et al., 2010]). At pH = 7.0, phosphate is present as approximately
62% dihydrogen phosphate (H2PO4-) and 38% hydrogen phosphate (HPO4^2-). As far as glutamate
is concerned, van Eunen et al. used an unphysiologically high amount, above all
as an experimental convenience: glutamate is naturally abundant in cells, and using a higher concentration to
balance the overall charge, instead of introducing another type of counter-ion, is handy.
Adding Cl- counter-ions is trivial in an MD simulation, so a glutamate concentration that
better fits literature data is chosen.
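The two arithmetic steps used in this sub-section, the dry-weight conversion factor and the phosphate
speciation at pH 7.0, can be checked with a short Python sketch. The function names are illustrative only,
and the second phosphate pKa of 7.21 is an assumed textbook value (25 °C, dilute solution) that is not
stated in the text; with this assumption the sketch gives a split close to the quoted 62%/38%.

```python
def conversion_factor_ml_per_gdw(cells_per_l, cell_volume_l, dry_weight_g_per_l):
    """Intracellular volume per gram of dry weight, in mL/gDW."""
    intracellular_l_per_l_culture = cells_per_l * cell_volume_l
    return intracellular_l_per_l_culture * 1000.0 / dry_weight_g_per_l


def umol_per_gdw_to_mM(amount_umol_per_gdw, factor_ml_per_gdw):
    """umol/gDW divided by mL/gDW gives umol/mL, which is numerically mM."""
    return amount_umol_per_gdw / factor_ml_per_gdw


def phosphate_speciation(total_mM, pH, pKa2=7.21):
    """Split total phosphate into (H2PO4-, HPO4^2-) with Henderson-Hasselbalch.

    pKa2 = 7.21 is an ASSUMED value, not taken from the text.
    """
    ratio = 10.0 ** (pH - pKa2)          # [HPO4^2-] / [H2PO4-]
    h2po4 = total_mM / (1.0 + ratio)
    return h2po4, total_mM - h2po4


# Values quoted for [van Eunen, et al., 2010]: 2.5e11 cells/L, 3e-14 L/cell, 3.6 gDW/L
factor = conversion_factor_ml_per_gdw(2.5e11, 3e-14, 3.6)
h2po4, hpo4 = phosphate_speciation(25.0, 7.0)
print(f"conversion factor: {factor:.3f} mL/gDW")
print(f"H2PO4-: {h2po4:.1f} mM, HPO4^2-: {hpo4:.1f} mM")
```

A measurement reported as, say, 10 μmol/gDW would then be converted with umol_per_gdw_to_mM(10.0, 2.38),
i.e. roughly 4.2 mM with the 2.38 mL/gDW factor.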
In summary, combining an updated version of van Eunen et al.’s standard intracellular concentrations with
Nick McElhinny et al.’s NTP content yields the following proposed MD solvent box metabolite content:
• NTPs: 5.9 mM
• K+: 300 mM
• Na+: 20 mM
• Glu, Arg, Lys, His, Asp: 80, 6, 2, 2.5, 8.5 mM
• Phosphates: 25 mM, i.e. H2PO4- = 15.5 mM, HPO4^2- = 9.5 mM
• Sulfates: 5 mM
• Mg2+: 2 mM
• Ca2+: 0.5 mM
• Cl-: concentration required to ensure charge neutrality
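To illustrate how such a list translates into a simulation box, the sketch below converts each concentration
into a particle count for a given box volume and then derives the number of Cl- ions needed for neutrality.
It is only a sketch under stated assumptions: the 150 Å cubic box is hypothetical, the formal charges
(NTP at -4, His neutral, zwitterionic amino acids) are assumptions made here for illustration, and the
solute charge must be supplied by the user.

```python
AVOGADRO = 6.02214076e23   # particles per mole
LITRES_PER_A3 = 1.0e-27    # one cubic angstrom expressed in litres

# Proposed pool from the list above, in mM.
POOL_MM = {
    "NTP": 5.9, "K+": 300.0, "Na+": 20.0,
    "Glu": 80.0, "Arg": 6.0, "Lys": 2.0, "His": 2.5, "Asp": 8.5,
    "H2PO4-": 15.5, "HPO4^2-": 9.5, "SO4^2-": 5.0,
    "Mg2+": 2.0, "Ca2+": 0.5,
}

# ASSUMED net formal charges (illustrative, not taken from the text).
CHARGE = {
    "NTP": -4, "K+": 1, "Na+": 1,
    "Glu": -1, "Arg": 1, "Lys": 1, "His": 0, "Asp": -1,
    "H2PO4-": -1, "HPO4^2-": -2, "SO4^2-": -2,
    "Mg2+": 2, "Ca2+": 2,
}


def n_particles(conc_mM, box_volume_A3):
    """Number of particles at conc_mM in a box of box_volume_A3 cubic angstroms."""
    moles = conc_mM * 1.0e-3 * box_volume_A3 * LITRES_PER_A3
    return round(moles * AVOGADRO)


def pool_counts(box_volume_A3, pool=POOL_MM):
    """Particle count for every species of the pool."""
    return {name: n_particles(c, box_volume_A3) for name, c in pool.items()}


def chloride_needed(counts, solute_charge=0, charges=CHARGE):
    """Cl- ions required to neutralize the solute plus the metabolite pool."""
    net = solute_charge + sum(charges[name] * n for name, n in counts.items())
    if net < 0:
        raise ValueError("net charge is negative; Cl- cannot neutralize it")
    return net


box_volume = 150.0 ** 3    # hypothetical 150 angstrom cubic box
counts = pool_counts(box_volume)
print(counts)
print("Cl- needed (solute charge ignored):", chloride_needed(counts))
```

Because the counts are rounded to whole particles, low-concentration species such as Ca2+ are represented by
only a handful of ions in a box of this size, which is worth keeping in mind when interpreting the results.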
The proposed metabolite content is not perfect, and further investigations are required to refine it. Current
knowledge about free versus compartmentalized and/or bound metabolites, especially ions, is still
approximate. More importantly, the values discussed in this section correspond to cytosolic or whole-cell
total intracellular concentrations, which can not only differ from each other but also differ from the
nuclear content. The study used as a standard for our proposed MD metabolite content [van Eunen, et
al., 2010] (see also [van Eunen, et al., 2014]) provides near-physiological intracellular concentrations
optimized for cytosolic enzymatic activity, not for nuclear enzymes. Therefore, the values could be
refined with further intracellular measurements, using procedures such as [Herrera, et al., 2013] to
isolate nuclear concentrations (a relevant piece of information for our investigation is, for example, that the
nuclear potassium content is 29% of the total concentration), and with enzyme kinetic assay validation
using protocols such as [van Eunen, et al., 2010]. This would be ideal to mimic the in vivo-like nuclear
metabolite environment of RNAP II.
3. Forcefields
Molecular Dynamics simulations rely on a set of parameters describing the physical interactions between
the atomic components. These parameters make it possible to model the time evolution of a set of atomic
coordinates by calculating the resulting force applied to each atom. The bonded parameters can be pictured as
controlling the protein degrees of freedom such as stretching (between 2 atoms), bending (3 atoms) and torsion
(4 atoms). Non-bonded interactions include Coulombic and van der Waals (vdw) interactions (between each atom
and all the other atoms, although methods such as PME reduce the number of calculations required by
using Fourier reciprocal space). Additional parameters to be defined include the nature of each bond (single
or double), particle mass and partial charge.
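For reference, the terms listed above are usually gathered in a single potential energy function. The
expression below is the generic fixed-charge, Amber-type form (harmonic terms written without a factor 1/2);
the symbols are standard notation introduced here for illustration and are not defined elsewhere in this
chapter.

```latex
V(\mathbf{r}) =
  \sum_{\text{bonds}} k_b \,(b - b_0)^2
+ \sum_{\text{angles}} k_\theta \,(\theta - \theta_0)^2
+ \sum_{\text{dihedrals}} \frac{V_n}{2}\,\bigl[1 + \cos(n\phi - \gamma)\bigr]
+ \sum_{i<j} \left[ \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}}
+ \frac{q_i q_j}{4\pi\varepsilon_0\, r_{ij}} \right]
```

The first three sums are the stretching, bending and torsion terms; the last sum contains the vdw
(Lennard-Jones) and Coulombic contributions, and the force on each atom is the negative gradient of V with
respect to its coordinates.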
MD parameters are of the highest importance to correctly model a system, and notably so for diffusion
simulations, which involve a substrate bound to a highly charged metal ion. Studies comparing the
most popular forcefields (CHARMM, Amber, Gromacs, OPLS, etc.) have shown that simulated systems
can diverge substantially over time, which underlines how strongly subtle forcefield parameter differences
impact MD performance [Best, et al., 2008; Beauchamp, et al., 2008; Lange, et al., 2010; Cino, et al.,
2012; Lindorff-Larsen, et al., 2012; Piana, et al., 2014].
Finding adequate parameters was a significant challenge in this research project.
Preliminary simulations led to NTPs forming unphysiological clusters due to suboptimal magnesium
parameters. Other issues involved substrates not diffusing sufficiently, and almost directly binding to and
being trapped on the surface of the protein. Furthermore, it appeared important to add many charged
metabolites (which reflects physiological conditions, see previous section) to address the
latter issue, notably by allowing electrostatic shielding of exposed charged residues. Therefore,
adequate metabolite parameters needed to be set up. This section describes the
parameterization choices that were made.
Earlier simulations used Amber12, Amber14 and CHARMM27/36 parameters, which will
not be discussed further. For the latest results, presented in chapter 5, the Amber16 forcefield was chosen
because it belongs to the family of the most popular and tested forcefields, because it is compatible with
a very wide range of molecular types, and because it is the only one allowing the use of the 12-6-4
vdw potential.
More precisely, for the parameters of DNA, RNA and protein amino acids, the latest choices recommended
by the Amber developers were taken: DNA.OL15, RNA.OL3 and ff14SB [Wang, et al., 2000;
Perez, et al., 2007; Zgarbova, et al., 2011; Krepl, et al., 2012; Zgarbova, et al., 2013; Zgarbova, et al.,
2015; Maier, et al., 2015].
As far as the rNTP substrates are concerned, the most commonly recommended parameters for Amber were
taken, namely those of [Meagher, et al., 2003]. The set is relatively old, yet it has been used in recently published
research ([Duan, et al., 2014; Jiang, et al., 2015; Perez-Villa, et al., 2015]). A more recent set of
parameters exists for NTPs, perhaps improving the flexibility of the triphosphate moiety, yet it is only
intended for use with the CHARMM forcefield [Komuro, et al., 2014].
Physical modelling of the Mg2+ ion for MD has been a severe challenge for experts, because the ion possesses
a high charge and singular vdw properties, and optimal parameters are very difficult to derive for
fixed-charge models. Several attempts were made to correctly parameterize the cation. Initially, the set from
[Aqvist, 1990] (default Amber 12 parameters) was tested and led to very unphysiological behaviors
such as heavy clustering. Then, the model from [Allner, et al., 2012] allowed a significant leap forward
in simulation performance. A 2015 study ([Panteva, et al., 2015A]) compared seventeen Mg2+
forcefield models (including that of [Allner, et al., 2012]), and proposed that the optimal model was the one
from [Li, et al., 2014], where a third r^-4 term is added to the Lennard-Jones potential, partially
taking polarizability into account. The latter model was further optimized for use with nucleic
acids [Panteva, et al., 2015B], and gave the best results with the TIP4PEW water model. Therefore, the
modified 12-6-4 set for nucleic acids ([Panteva, et al., 2015B]) was used, together with TIP4PEW water
([Horn, et al., 2004; Horn, et al., 2005]).
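The difference between the conventional 12-6 Lennard-Jones potential and the 12-6-4 variant can be made
concrete with a minimal Python sketch. It only shows the functional form, with the extra attractive
-C4/r^4 term added to the usual 12-6 expression; the function names and the numerical values of epsilon,
r_min and c4 are purely illustrative and are not the published parameters of [Li, et al., 2014] or
[Panteva, et al., 2015B].

```python
def lj_12_6(r, epsilon, r_min):
    """Conventional Lennard-Jones energy with well depth epsilon at distance r_min."""
    x = (r_min / r) ** 6
    return epsilon * (x * x - 2.0 * x)


def lj_12_6_4(r, epsilon, r_min, c4):
    """12-6-4 energy: the 12-6 term plus an attractive -c4 / r**4 contribution."""
    return lj_12_6(r, epsilon, r_min) - c4 / r ** 4


# Illustrative numbers only (kcal/mol and angstroms are assumed units).
epsilon, r_min, c4 = 0.1, 3.0, 81.0
for r in (2.5, 3.0, 4.0, 6.0):
    print(r, lj_12_6(r, epsilon, r_min), lj_12_6_4(r, epsilon, r_min, c4))
```

Since the r^-4 term decays more slowly than r^-6, it changes the interaction at intermediate distances as
well as near contact, which is why it has to be parameterized together with the water model.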
For the other monovalent and divalent ions (K+, Na+, Cl-, Ca2+), 12-6-4 parameters were also used and
are described in [Li, et al., 2014; Li, et al., 2015].
To simulate the glutamate, aspartate, lysine, histidine and arginine metabolites, the zwitterionic amino
acid set from [Horn, 2014] was used.
Mass, bond, angle and non-bonded parameters for sulfate atom types S and O2, were taken from
Amber16 GLYCAM_06.dat parameter library file and partial charges were taken from [Cannon, et al.,
1994] (model “std 1”). Hydrogen and dihydrogen phosphate files were prepared by analogy with
[Homeyer, et al., 2005; Steinbrecher, et al., 2014].
When no existing parameter libraries were available, they were written using the following procedure.
Mass, bond, angle, dihedral and non-bonded parameters were written in an Amber .dat file in
the correct format. Then the relevant topology (.lib) file was prepared by defining connectivity, bond
nature and partial charges with the LEaP module of Amber16 [Case, et al., 2016].
Lastly, the OpenMM library ([Friedrichs, et al., 2009; Pande, et al., 2010; Eastman, et al., 2010A;
Eastman, et al., 2010B; Eastman, et al., 2013]) was chosen to run the simulations because it is, to the
author’s knowledge, the only existing MD tool able to run the 12-6-4 potential on GPU.
4. Accelerated MD simulations
aMD is an MD sampling technique that greatly accelerates progress along a reaction coordinate by
biasing the potential energy landscape. When the potential energy falls below an
energy threshold, a boost is added, which allows energetic barriers to be crossed much faster. The key
advantage of aMD is that it partially overcomes two main limitations of conventional MD,
namely the accessible timescale and trapping within local potential energy basins.
In its original implementation [Hammelberg, et al., 2004], aMD added
an energetic boost to the dihedral component of the potential energy function (which describes the
physical interactions between the elements composing an MD system). Torsional degrees of freedom are
generally considered the main drivers of conformational change, and the dihedral boost
has indeed been shown to enhance sampling in protein simulations. The method was then
extended to the total potential energy (i.e., a total boost is added to all the components of the
forcefield), mainly to accelerate diffusive motion [de Oliveira, et al., 2006]. Because solvent molecules
are very numerous, the total boost mainly affects the non-bonded component of the solvent atoms and
hence mainly accelerates diffusion within the system. A dual boost method ([Hammelberg,
et al., 2007]) combines the two preceding techniques by adding a boost to both the dihedral and the total
potential, and is commonly the method of choice. It accelerates at the
same time the conformational exploration of the polypeptide chains (dihedral boost) and solvent
diffusion (total boost).
The accuracy and usefulness of the method have been extensively validated by a variety of studies ([Grant, et
al., 2009; Bucher, et al., 2011; de Oliveira, et al., 2011; Markwick, et al., 2011; Lindert, et al., 2013;
Kappel, et al., 2015; Song, et al., 2015; Miao, et al., 2016]). The method has made it possible to reach
very long timescales, up to the millisecond range ([Markwick, et al., 2007; Pierce, et al., 2012]), and to
improve the modelling of experimental phenomena [Markwick, et al., 2011]. Recent developments of the
method, which boost non-bonded terms and dihedrals separately, show great promise [Doshi, et al., 2014].
The aMD boost method relies on the following theoretical background.
When the potential energy is lower than a threshold energy parameter 𝐸𝑏, i.e. for 𝑉(𝑟) < 𝐸𝑏, the
boosted potential is defined by:
$$V^{*}(r) = V(r) + \Delta V(r)$$
where
$$\Delta V(r) = \frac{\left(E_b - V(r)\right)^{2}}{E_b - V(r) + \alpha}$$
and where 𝛼 is an acceleration parameter.
When the potential energy 𝑉(𝑟) of the system does not fall under an energy threshold 𝐸𝑏,
i.e. when 𝑉(𝑟) ≥ 𝐸𝑏, potential energy is kept untouched and ∆ 𝑉(𝑟) = 0.
The above modification of the potential energy surface results in a new force experienced by each atom
of the system.
For an atom 𝑖 belonging to the system, the new force will be ([Hammelberg, et al., 2007; Markwick, et
al., 2011]):
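$$F_i^{*} = -\nabla_{\mathbf{r}_i}\left[V(\mathbf{r}) + \Delta V(\mathbf{r})\right] = F_i \cdot \frac{\alpha^{2}}{\left(\alpha + E_b - V(\mathbf{r})\right)^{2}}$$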
Where 𝐹𝑖 is the original force.
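The relations above can be checked numerically with a minimal sketch such as the following. The
function names and the numerical values are illustrative assumptions, not taken from the simulation
scripts of this work. The finite-difference slope of the boosted potential with respect to 𝑉 should
match the analytical force scaling factor.

```python
def boost_potential(v, e_b, alpha):
    """Boost dV added to the potential energy v (all arguments in the same energy unit)."""
    if v >= e_b:
        return 0.0  # above the threshold the potential is left untouched
    return (e_b - v) ** 2 / (e_b - v + alpha)


def boosted_potential(v, e_b, alpha):
    """Modified potential V* = V + dV."""
    return v + boost_potential(v, e_b, alpha)


def force_scale(v, e_b, alpha):
    """Factor multiplying the original force: alpha^2 / (alpha + E_b - V)^2 below the threshold."""
    if v >= e_b:
        return 1.0
    return alpha ** 2 / (alpha + e_b - v) ** 2


if __name__ == "__main__":
    e_b, alpha, v, h = 10.0, 4.0, 6.0, 1e-6  # arbitrary illustrative values
    # dV*/dV by central finite difference; since F* = -(dV*/dV) * grad V, it equals the scaling factor
    slope = (boosted_potential(v + h, e_b, alpha) - boosted_potential(v - h, e_b, alpha)) / (2 * h)
    print(boost_potential(v, e_b, alpha), force_scale(v, e_b, alpha), slope)
```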
At each step of the simulation, the unbiased potential is calculated, then the modified boost potential is
computed, which is then translated to a boost force assigned to the concerned force components
[Markwick, et al., 2011].
The boost force acting on a component of the forcefield (e.g. the dihedral term) can be expressed as:
$$F_{comp}^{*} = -\nabla V_{comp}(\mathbf{r}) \cdot \frac{\alpha_{comp}^{2}}{\left(\alpha_{comp} + \beta_{comp}\right)^{2}} = F_{comp}\,\gamma_{comp}$$
where
𝑉𝑐𝑜𝑚𝑝(𝒓) is the potential of the component to which the boost is applied,
𝛼𝑐𝑜𝑚𝑝 is an acceleration parameter,
𝛽𝑐𝑜𝑚𝑝 = 𝐸𝑏 − 𝑉(𝒓),
𝐹𝑐𝑜𝑚𝑝 is the unboosted force for the component, and
$$\gamma_{comp} = \frac{\alpha_{comp}^{2}}{\left(\alpha_{comp} + \beta_{comp}\right)^{2}}$$
The overall force in the boosted system is then obtained as:
$$F^{*} = (F - F_{comp}) + F_{comp}\,\gamma_{comp}$$
where 𝐹 is the unboosted total force of the system [Lindert, et al., 2013].
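As an illustration of the force recombination, the short sketch below applies the scaling factor to
one boosted component for a single atom. The helper names and the force values are hypothetical and
only serve to show the arithmetic.

```python
def gamma(v_comp, e_b, alpha):
    """Scaling factor alpha^2 / (alpha + beta)^2 with beta = E_b - V; 1.0 at or above the threshold."""
    beta = e_b - v_comp
    if beta <= 0.0:
        return 1.0
    return alpha ** 2 / (alpha + beta) ** 2


def boosted_total_force(f_total, f_comp, g):
    """Apply F* = (F - F_comp) + F_comp * gamma to the 3D force vectors of one atom."""
    return tuple((ft - fc) + fc * g for ft, fc in zip(f_total, f_comp))


if __name__ == "__main__":
    g = gamma(v_comp=6.0, e_b=10.0, alpha=4.0)  # toy values
    # total force on the atom and the part of it that comes from the boosted component
    print(boosted_total_force((3.0, -1.0, 0.5), (2.0, 0.0, 0.5), g))
```

Only the boosted component is rescaled; the remainder of the force is left unchanged.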
Although aMD does not require any a priori knowledge of the states of a system, the 𝛼 and 𝐸𝑏
parameters must be defined, and tuning them can be challenging. The 𝐸𝑏 parameter
controls the portion of the energy landscape that is affected by the boost, while 𝛼 modifies the shape of
the energy surface [Wang, et al., 2011A; Wang, et al., 2011B]. Both parameters affect the strength of
the acceleration: a stronger acceleration is obtained by increasing 𝐸𝑏 or by decreasing 𝛼.
In practice, each system has optimal 𝐸𝑏 and 𝛼 parameters, and finding them usually
requires some testing, for example keeping one parameter constant while varying the other
[Markwick, et al., 2011; Bucher, et al., 2011B].
The equations used to calculate the parameters are the following (energies in kcal.mol-1):
$$E_{dihed} = V_{dihed} + ctr, \qquad \alpha = 0.20 \times ctr$$
where ctr is taken as one of the following, nb_prot_res and nb_prot_atms being the numbers of protein
residues and protein atoms:
ctr = (3 to 5 kcal.mol-1) × nb_prot_res, [Markwick, et al., 2011; Miao, et al., 2016]
ctr = 0.20 kcal.mol-1 × nb_prot_atms, [Markwick, et al., 2009; Wang, et al., 2011B]
ctr = (0.3, 0.4 or 0.5) × V_dihed, [Tikhonova, et al., 2013; Kappel, et al., 2015; Song, et al., 2015]
The most widely agreed relations for the dihedral acceleration parameters, based on comparative
analysis of several studies, are the 𝐸_𝑑𝑖ℎ𝑒𝑑 and 𝛼 formulas above, with:
ctr = 3.5 kcal.mol-1 × nb_prot_res, [Lindert, et al., 2012]
For the total acceleration parameters, the values currently advised for
most systems (based on comparative studies) are:
$$E_{total} = V_{total} + ctr, \qquad \alpha = ctr$$
with ctr = (0.16 or 0.20 kcal.mol-1) × nb_atms [Hammelberg, et al., 2007; Markwick, et al., 2011;
Kappel, et al., 2015; Miao, et al., 2016].
Note that the 𝛼 parameter of the total boost is higher than that of the dihedral boost,
notably in order not to distort the solvent too heavily.
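The parameter relations above reduce to a few lines of arithmetic, sketched below. The function
names are hypothetical, and the input energies and system sizes are made-up numbers; V_dihed and
V_total are assumed here to be average energies measured in a preceding conventional MD run, which
the text does not state explicitly.

```python
def dihedral_boost_params(v_dihed, n_prot_res, per_residue=3.5, alpha_fraction=0.20):
    """Return (E_dihed, alpha_dihed) in kcal.mol-1 using ctr = per_residue * n_prot_res."""
    ctr = per_residue * n_prot_res
    return v_dihed + ctr, alpha_fraction * ctr


def total_boost_params(v_total, n_atoms, per_atom=0.16):
    """Return (E_total, alpha_total) in kcal.mol-1 using ctr = per_atom * n_atoms."""
    ctr = per_atom * n_atoms
    return v_total + ctr, ctr


if __name__ == "__main__":
    # Hypothetical system: 100 protein residues, 10000 atoms in total
    print(dihedral_boost_params(v_dihed=1000.0, n_prot_res=100))
    print(total_boost_params(v_total=-50000.0, n_atoms=10000))
```

With these conventions the four values returned are the four acceleration parameters needed by a
dual boost run.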
The aMD simulation procedure is the following.
First, the static model of the protein is prepared. The original RNAP II atomic coordinates used in this
work are those of PDB#2E2H, whose structure determination is described in [Wang, et al., 2006]. Missing
loops are added with the Yasara-Structure software [Krieger, et al., 2002]. The initial model also has
missing nucleic acid bases, which are added following the procedure outlined in chapter 3. All the
subsequent steps are automated by a script: running the script
in appendix 1 should (after a few adjustments, for example to match the sequence of the PDB file
used) perform all the tasks detailed below.
Second, the static model is “pre-minimized”. That is to say, minimization is first done on an
expurgated system, in order to refine the static model and to facilitate the work of the minimization
algorithms, notably for the newly inserted nucleic acid templates. The static system of step 1 is further
prepared by specifying N- and C-termini at the extremities of the subunits. The system is completed
with missing heavy atoms, hydrogenated, neutralized with K+ ions, and solvated with a TIP4PEW water
box ensuring a minimum solute-to-edge distance of 15 Å, with the LEaP module of Amber16 [Case, et
al., 2016]. Then ten rounds of minimization 1 (min 1), each directly followed by minimization 2 (min 2),
are computed with the Amber16 Sander module ([Case, et al., 2016]) on Central Processing Unit
(CPU). Min 1 consists of 1000 steps of steepest descent and 4000 steps of conjugate gradient,
with a 500 kcal.mol-1 harmonic restraint on protein and nucleic residues and an electrostatic cutoff of
10 Å. Min 2 consists of 2500 steps of each algorithm without restraints. Amber16 is chosen over
OpenMM for minimization because its algorithms perform better for this task (notably, they lower
the potential energy further).
Third, the refined static model of step 2 is prepared for simulation. The number of water molecules
required for a TIP4PEW water box with a buffer of 15 Å is calculated with LEaP. The number
of metabolite molecules to be inserted is then calculated from this number of water molecules.
The amount of Cl- is adjusted to ensure overall charge neutrality. Using the AddToBox module of Amber16,
the metabolite molecules are inserted in the refined static model of step 2. Phosphate molecules are
not added, because of simulation instabilities with the 12-6-4 potential (their parameter set worked fine
otherwise, i.e. without the 12-6-4 potential). The system is then hydrogenated, and the simulation
coordinate and parameter files are generated with LEaP. The simulation files are further processed with
the Amber16 Parmed module, in order to add the 12-6-4 Lennard-Jones matrix to the relevant
molecules and to apply the [Panteva, et al., 2015B] nucleic acid modifications by changing the
polarization atom type of some nucleic atoms. Please refer to appendix 1 for a more detailed procedure.
Fourth, a first round of simulation is done without the substrates, in order to give the metabolites
enough time to relax and to improve the electrostatic configuration. The final system (as opposed to the
expurgated system) is minimized using the same procedure as above. Then heating, velocity
equilibration, box equilibration, and final equilibration are executed with OpenMM ([Friedrichs, et al.,
2009; Pande, et al., 2010; Eastman, et al., 2010A; Eastman, et al., 2010B; Eastman, et al., 2013]) on
GPU using the mixed CUDA precision model [Le Grand, et al., 2013], a Langevin integrator with a
time step of 2 fs, a temperature of 300 K and a thermal coupling collision frequency of 1.0 ps-1, bonds
involving hydrogen constrained and water molecules set to rigid. A PME non-bonded method with a cutoff
distance of 8 Å and a 10 kcal.mol-1 harmonic restraint on protein and nucleic atoms are used for heating.
A PME cutoff distance of 10 Å and a 50 kcal.mol-1 harmonic restraint on the DNA anchoring residues
(extremities) are used otherwise. The system is heated for 20 ps. Velocity equilibration is run for 100
ps as NVT (constant moles, volume and temperature). Box equilibration is done for 20 ns as NPT
(constant moles, pressure and temperature), using a Monte Carlo barostat at a pressure of 1 bar.
The system is then relaxed for 20 ns as NVT.
Fifth, the substrates are added. In order to represent an NTP influx corresponding to a total
concentration of 5.9 mM, regardless of the rNTP type, 5.9 mM of GTP is chosen. As outlined
in chapter 3, i + 2 and i + 4 are strategically mutated to cytosine, and consequently the i + 2 to i + 4
registers of tDNA (i + 3 is already cytosine in the original PDB structure) are available for pairing
with an incoming GTP substrate. Water molecules are stripped from the last trajectory frame of the final
20 ns relaxation of round 1. A calculated number of Cl- ions is also removed in order to ensure a neutral
charge once 5.9 mM of NTPs of charge -2 are added. Then GTP molecules are inserted using the AddToBox
module. The next steps are identical to round 1.
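One simple way to turn a target concentration into a number of molecules is to scale it by the
number of water molecules in the box, as sketched below. This is an illustrative assumption rather
than the exact calculation performed by the appendix 1 script; the function names, the box size and
the use of 55.5 mol/L for liquid water are choices made here.

```python
WATER_MOLARITY = 55.5  # mol/L, approximate molar concentration of liquid water


def molecules_for_concentration(conc_molar, n_waters):
    """Number of solute molecules giving roughly conc_molar (mol/L) in a box of n_waters waters."""
    return round(conc_molar * n_waters / WATER_MOLARITY)


def chloride_to_remove(n_solute, solute_charge):
    """Number of Cl- ions (charge -1) to remove so that adding anionic solutes keeps the net charge."""
    if solute_charge > 0:
        raise ValueError("this sketch only handles neutral or anionic solutes")
    return n_solute * (-solute_charge)


if __name__ == "__main__":
    n_waters = 100_000  # hypothetical box size
    n_gtp = molecules_for_concentration(5.9e-3, n_waters)
    print(n_gtp, chloride_to_remove(n_gtp, solute_charge=-2))
```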
Finally, the actual aMD simulation is executed. Acceleration parameters are calculated using equations
similar to those outlined in the introduction of this subsection, and are listed in chapter. Runs of
several durations have been performed. The simulation is configured with a DualBoost integrator using a
time step of 2 fs and the four acceleration parameters, an Andersen thermostat with a 300 K temperature
bath and a collision frequency of 1.0 ps-1, a PME non-bonded method with a cutoff of 8 Å, constrained
bonds involving hydrogen, rigid water molecules, a 50 kcal.mol-1 harmonic restraint on the DNA anchoring
residues, and mixed CUDA GPU precision.
5. Steered MD simulations
Steered MD (sMD) is a simulation technique that biases a reaction-pathway coordinate by applying a
“pulling” force to one or several atoms. It was devised by transposing the concept of Atomic Force
Microscopy (where a cantilever exerts a force on a biomolecule) to MD.
For the simple pulling of an atom along a direction, the force can be defined as:
$$\mathrm{Force\_sMD} = k \left[(x - x_0)^{2} + (y - y_0)^{2} + (z - z_0)^{2}\right]$$
Where,
𝑘 is the force magnitude
𝑥, 𝑦, 𝑧 are the coordinates of the pulled atom
𝑥0, 𝑦0, 𝑧0 are the coordinates towards which the force is exerted
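If the expression above is read as a harmonic biasing term (an interpretation made here, since
the text does not say how the expression is passed to the MD engine), then its negative gradient is a
vector pointing from the pulled atom towards the target. The sketch below evaluates both; the function
names and the numbers are illustrative.

```python
def smd_bias(pos, target, k):
    """Value of k * ((x - x0)^2 + (y - y0)^2 + (z - z0)^2) for 3D points pos and target."""
    return k * sum((p - t) ** 2 for p, t in zip(pos, target))


def smd_pull_vector(pos, target, k):
    """Negative gradient of smd_bias with respect to pos; it points from pos towards target."""
    return tuple(-2.0 * k * (p - t) for p, t in zip(pos, target))


if __name__ == "__main__":
    pos, target, k = (1.0, 2.0, 2.0), (0.0, 0.0, 0.0), 0.5  # arbitrary units
    print(smd_bias(pos, target, k), smd_pull_vector(pos, target, k))
```

The pull vanishes when the atom reaches the target and grows linearly with the distance to it.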
While aMD is rarely used to model diffusion, sMD is the most common method for this purpose, owing to
the ease with which a system can be forced through the desired pathway. The method however requires
a priori information about what is going to happen (the direction of the pulling force), which is not
required for aMD. Several flavors of sMD (e.g., velocity sMD, adaptive bias sMD), which can be seen as
umbrella sampling techniques, make it possible to extract information such as work or free-energy
differences, which were judged of priority importance for this research project. Hence, classical force
sMD has been performed.
Let us now consider the sMD computer routines. The basic approach employed is to divide
the sMD trajectory into several checkpoints, each defined by a residue
index. The advantage of this method is that it maximizes the portability of the results with
minimal user input. In other words, the pulling direction is not defined by abstract coordinates but by
precise landmarks within the structure itself, which greatly facilitates the reproducibility of the
results. In addition, this strategy has made it possible to fully script and automate the procedure. For
researchers wishing to reproduce the simulations, an example sMD trajectory script is provided in
appendix 2.
The starting structure is the last frame of simulation round 1 presented above. It consists of an
equilibrated metabolite and water box containing RNAP, in which the system has been minimized, heated,
velocity equilibrated, box equilibrated and further relaxed, without NTPs. Two Cl- ions are stripped from
the PDB file so that overall charge neutrality is preserved once the sMD GTP substrate is
added. Water is also removed. A GTP molecule is then inserted within an inner box
surrounding checkpoint 0, the solvent accessible area lying in front of checkpoint 1. It is not placed
directly at the checkpoint coordinates, but within a certain x, y, z threshold (hence the inner box), in
order not to overlap with existing metabolites. This is done by extracting an inner box surrounding the
checkpoint from the global PDB file, then adding a GTP molecule to the inner box with the AddToBox
module of Amber16, adjusting the x, y and z ranges correspondingly, and finally copying the GTP back to
the global PDB.
The system is then completed with missing heavy atoms, hydrogenated, and solvated with a TIP4PEW water
box respecting a 15 Å minimal distance to the solute, with LEaP. The 12-6-4 potential including the
nucleic atom modifications is applied in the same fashion as outlined in section 3. Minimization, heating
and velocity equilibration are then performed as described in the previous section, except that
velocity equilibration is run for 20 ps and that, instead of harmonic restraints on the DNA
anchoring residues, mass constraints are used (this reduces the complexity of computing the forces at
play for sMD). These steps are required even though the initial system was already relaxed,
because when starting from a static model, without velocity information, it is necessary
to bring the system back to the target temperature.
Next, the checkpoint loops are executed. For each checkpoint along the sMD trajectory, the
ith checkpoint loop is repeated until a threshold distance has been reached, before
switching to the next checkpoint. The threshold distances and details of the checkpoints used are
listed in chapter 5.
In addition, an iteration check within each checkpoint loop stops the loop
after 2 ns if the trajectory has not converged, in order to avoid a memory crash.
Each checkpoint loop is run with a Langevin integrator using a 300 K temperature, a 1.0 ps-1 thermal
coupling and a 2 fs time step, a PME non-bonded method with an 8 Å electrostatic cutoff, constrained
bonds involving hydrogen, rigid water, and mass constraints applied to the DNA extremities.
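The control flow of the checkpoint loops can be sketched as follows. This is a toy illustration and
not the appendix 2 script: the function names are hypothetical, the dynamics are replaced by a
stand-in that moves a point part of the way to the target, and the 0.1 ns segment length is an
arbitrary choice.

```python
import math


def run_checkpoints(start, checkpoints, thresholds, step_segment, segment_ns=0.1, max_ns=2.0):
    """Drive step_segment towards each checkpoint in turn.

    step_segment(pos, target) -> new position stands in for one short sMD segment.
    Returns the final position and, per checkpoint, a (reached, simulated_ns) pair.
    """
    pos, report = start, []
    for target, threshold in zip(checkpoints, thresholds):
        elapsed = 0.0
        # Repeat the loop until the threshold distance is reached or the time budget is spent
        while math.dist(pos, target) > threshold and elapsed < max_ns:
            pos = step_segment(pos, target)
            elapsed += segment_ns
        report.append((math.dist(pos, target) <= threshold, round(elapsed, 6)))
    return pos, report


def toy_segment(pos, target, fraction=0.5):
    """Toy stand-in for dynamics: move a fixed fraction of the way to the target."""
    return tuple(p + fraction * (t - p) for p, t in zip(pos, target))


if __name__ == "__main__":
    final, report = run_checkpoints((0.0, 0.0, 0.0),
                                    [(8.0, 0.0, 0.0), (8.0, 8.0, 0.0)],
                                    [1.0, 1.0], toy_segment)
    print(final, report)
```

A checkpoint that is never reached is abandoned once the time budget is exhausted, which mirrors the
iteration check described above.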
For sMD simulations through CH2, preliminary sMD pulls were applied on TL CA atoms of scRPB1
1082, 1087, 1088 and 1092 residues, and the latter residues were kept fixed during simulation, in order
to maintain the TL open.
Finally, sMD in combination with aMD has been tested, where the procedure is the same as for sMD,
except that the checkpoint loop runs with a DualBoost integrator and an Andersen thermostat, instead
of a Langevin integrator.
Preliminary work has been performed on PDB#5C4J (see chapter 5). The same procedures as listed in
this section were employed, with the distinction that CTP molecules were added in the system instead
of GTPs, and that CTP parameters provided by Prof. R. Amaro from UCSD were used.
Chapter 3
Elongation Complex Reconstruction
1. Introduction
As discussed in chapter 1, most of the crystal structures available for RNAP II do not contain a full
nucleic Elongation Complex (EC). In our starting model, PDB#2E2H, ntDNA is not resolved after i + 5.
Several conventional and aMD simulations have been run on the incomplete structure (data not shown),
and the following observations have been made. The incomplete presence of ntDNA bases inside the
protein is problematic, as the conformation of the nucleic Elongation Complex plays a critical role in
the diffusion of rNTPs. It significantly modifies the conformation and electrostatics of the CH1/CH3
channels, hence directly affecting the diffusion of substrates; it does not prevent DS register slippage;
and it does not allow pre-binding at the right registers immediately downstream of the loading position.
In addition to these factors, the upstream portion of the ntDNA (after i + 5) seems important
for stabilizing the tDNA registers in a configuration that welcomes pre-binding substrates, notably by
lowering tDNA, by minimizing parasitic electrostatic repulsion between the backbone and the substrates,
and possibly by improving diffusion through stabilization of the nucleic acids. Furthermore, RNA bases
are not present after i + 10, whereas a complex is considered elongation ready when the RNA strand
consists of at least 13 bases. In simulations with the incomplete RNA chain, the strand took distorted
conformations, bending towards the inside of the protein close to the inner DNA instead of pointing
towards the RNA exit channel.
Reconstructing a complete EC is also highly relevant for simulating translocation events,
which, as presented in chapter 1, are linked to the loading of substrates and can consequently shed some
light on the full diffusion/loading process. It is therefore important to reconstruct a
complete and physiologically adequate EC for RNAP, in order to characterize
nucleotide diffusion/loading to a higher degree of precision and to optimize the scientific plausibility
of the experiments. The DNA extremities are to be kept fixed during simulation; consequently, the
starting structure must be as good as possible, as restraints can prevent DNA from naturally relaxing
into a more native conformation during simulation. In this chapter, we investigate mathematical
tools, the development of algorithms and their application, in order to recreate a complete EC.
2. 3D Rotation
Before proceeding to the mathematical tools and the algorithms, let us first define
the strategy to be employed. The goal is to add the missing RNA and DNA bases to the initial RNAP
atomic coordinates. To do so, geometric information already present in the structure is used
to guess the shape of the overall DNA frame, and the missing bases are added incrementally. The guess
need not be perfect, as minimizing the potential energy of the structure will optimize the geometry of
the nucleic strands. However, the guess must be close enough for the minimization algorithms to
run, and to converge to a local minimum that is not irrelevantly high in energy. Once we
know where to add the missing bases, the next step is to insert them incrementally with the right atomic
coordinates. In order to position an object in 3D space, two rotations are needed for the object to adopt
the right orientation, and an additional translation is needed to complete the
positioning.
Consider an object in space to be aligned in a specific manner with a reference object. Two consecutive
rotation alignments are to be done. A rotation alignment between a vector of the reference object and a
vector of the object to be aligned is defined by an axis that is normal to both vectors,
and by the angle between the two vectors. The rotation about this axis by this angle can then be
expressed mathematically as three successive rotations around the x, y and z axes (bringing the total
number of rotations needed to align the object to six). This is referred to as an axis-angle to Euler
angle conversion.
Three methods are generally used to carry out such tasks, in domains as varied as
aeronautics (computing the heading, attitude and bank of an aircraft), video games and graphical design
(rotating and visualizing a 3D object). These methods are rotation matrices, quaternions, and
Rodrigues’ rotation formula. Quaternions are a method of choice owing to the limited number of
operations required and the ease with which an entire 3D object can be manipulated at once.
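A quaternion is a four-dimensional representation of a rotation and is defined by:
$$q = a + b\,\mathbf{i} + c\,\mathbf{j} + d\,\mathbf{k}$$
where
𝒊, 𝒋 and 𝒌 are the fundamental quaternion units and satisfy $\mathbf{i}^{2} = \mathbf{j}^{2} = \mathbf{k}^{2} = \mathbf{i}\mathbf{j}\mathbf{k} = -1$,
$$a = \cos\left(\frac{angle}{2}\right), \quad b = axis_x \sin\left(\frac{angle}{2}\right), \quad c = axis_y \sin\left(\frac{angle}{2}\right), \quad d = axis_z \sin\left(\frac{angle}{2}\right)$$
and 𝑎𝑛𝑔𝑙𝑒 is the angle of rotation.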
Deriving the quaternion equations in terms of Euler angles gives the following transformations, to be
executed in the right order, to express a 3D rotation around a normalized axis (𝑥, 𝑦, 𝑧) by a given angle:
$$Rot_y = \operatorname{atan2}\left(y \sin(angle) - x z\,(1 - \cos(angle)),\; 1 - (y^{2} + z^{2})(1 - \cos(angle))\right)$$
$$Rot_z = \operatorname{asin}\left(x y\,(1 - \cos(angle)) + z \sin(angle)\right)$$
$$Rot_x = \operatorname{atan2}\left(x \sin(angle) - y z\,(1 - \cos(angle)),\; 1 - (x^{2} + z^{2})(1 - \cos(angle))\right)$$
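The three expressions above can be checked against a direct axis-angle rotation matrix, as in the
sketch below. The function names are illustrative, and the composition order R = Ry · Rz · Rx with a
column-vector convention is an assumption made here for the cross-check; it does not include the
sign changes used later for VMD.

```python
import math


def axis_angle_to_euler(axis, angle):
    """Return (rot_y, rot_z, rot_x) in radians for a rotation of angle about a unit axis."""
    x, y, z = axis
    s, t = math.sin(angle), 1.0 - math.cos(angle)
    rot_y = math.atan2(y * s - x * z * t, 1.0 - (y * y + z * z) * t)
    rot_z = math.asin(x * y * t + z * s)
    rot_x = math.atan2(x * s - y * z * t, 1.0 - (x * x + z * z) * t)
    return rot_y, rot_z, rot_x


def rot_matrix(axis_name, a):
    """Elementary rotation matrix about the x, y or z axis."""
    c, s = math.cos(a), math.sin(a)
    return {"x": [[1, 0, 0], [0, c, -s], [0, s, c]],
            "y": [[c, 0, s], [0, 1, 0], [-s, 0, c]],
            "z": [[c, -s, 0], [s, c, 0], [0, 0, 1]]}[axis_name]


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def axis_angle_matrix(axis, angle):
    """Rotation matrix for a unit axis and an angle, used here only as a cross-check."""
    x, y, z = axis
    c, s, t = math.cos(angle), math.sin(angle), 1.0 - math.cos(angle)
    return [[c + x * x * t, x * y * t - z * s, x * z * t + y * s],
            [y * x * t + z * s, c + y * y * t, y * z * t - x * s],
            [z * x * t - y * s, z * y * t + x * s, c + z * z * t]]


def euler_matrix(rot_y, rot_z, rot_x):
    """Compose R = Ry(rot_y) * Rz(rot_z) * Rx(rot_x)."""
    return matmul(rot_matrix("y", rot_y), matmul(rot_matrix("z", rot_z), rot_matrix("x", rot_x)))


if __name__ == "__main__":
    n = math.sqrt(14.0)
    axis, angle = (1.0 / n, 2.0 / n, 3.0 / n), 0.7  # arbitrary unit axis and angle
    print(axis_angle_to_euler(axis, angle))
```

The agreement holds as long as the middle angle stays away from ±90 degrees, where the decomposition
becomes degenerate.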
3. Illustrative case: adding a single nucleotide
Here, we introduce the principles of the algorithm by considering the addition of a single
nucleotide: a DNA strand is to be elongated by one
nucleotide. The first step is to extend the strand with a sugar backbone.
a) Backbone extension
Two cases are to be considered:
In the first case, a nucleotide is to be added to a DNA strand in the 5’-3’ direction. Hence, O3’ atom of
the extremity nucleotide is to be bound to a new backbone to be inserted in the structure.
Figure 8: 5’-3’ direction of DNA extension. A thymine nucleotide is shown in CPK representation. The O3’
binding atom for extending the strand in the 5’-3’ direction is indicated by the dashed rectangle.
In the second case, a nucleotide is to be added to a DNA strand in the 3’-5’ direction. Hence, the P atom
of the extremity nucleotide is to be bound to a new backbone to be inserted in the structure.
Figure 9: 3’-5’ direction of DNA extension. A thymine nucleotide is shown in CPK representation. P binding
atom for extending the strand in the 3’-5’ direction is indicated in the dashed rectangle.
To extend DNA by one nucleotide, both cases are dealt with using the same molecular template. The
template consists of the standard backbone and extended sugar geometry, containing the P, O1P, O2P,
O5’, C5’, C4’, O4’, C3’, O3’, C2’ and C1’ atoms. The template also includes extra dummy atoms
used to perform the extension alignment.
Figure 10: Backbone extension template for both the 5’-3’ and the 3’-5’ directions of DNA extension. The
three anchoring atoms in the left dashed rectangle allow a new nucleotide to be attached in the 5’-3’
direction, while the dashed rectangle on the right contains the anchoring atoms for extending DNA in the
opposite direction.
If DNA is to be extended in the 5’-3’ direction, then in order to bind the O3’ atom of the reference
nucleotide to a new backbone, the C4’, C3’ and O3’ dummy atoms of the template are aligned with the C4’,
C3’ and O3’ atoms of the reference.
Figure 11: Nucleotide attachment to the DNA backbone host in the 5’-3’ direction. The atoms to be
superposed are indicated by the dashed rectangle.
By the same logic, if extension is pursued in the 3’-5’ direction, the C5’, O5’ and P dummy atoms of the
template are aligned with the corresponding reference atoms. In both cases, a template backbone is
aligned with three landmark atoms of the nucleotide at the extremity of the strand to be extended.
Figure 12: Nucleotide attachment to the DNA backbone host in the 3’-5’ direction. The atoms to be
superposed are indicated by the dashed rectangle.
4. Transformations
We now illustrate how the addition transformations are done, together with the corresponding steps of
the algorithm. The algorithm is coded in two languages: Perl as the host code, which conveniently
manipulates files and sub-programs, and Tcl as the called program, in order to communicate with VMD
([Humphrey, et al., 1996]) and perform the transformations. First, the reference nucleotide is extracted
from the PDB file to be extended and written to a separate file. Then the DNA extension direction
is determined by looking at the extremity atoms of the reference nucleotide and checking whether
they are bound or free. Atoms 𝑎1, 𝑎2 and 𝑎3 of the reference structure and 𝑏1, 𝑏2 and 𝑏3 of the
template, which will be aligned as 𝑏1 to 𝑎1, 𝑏2 to 𝑎2 and 𝑏3 to 𝑎3, are then defined. For example, if the
DNA direction is 5’-3’, the O3’ atom of the reference nucleotide is unbound, and the C4’, C3’ and O3’
atoms are to be superposed between the two structures. The order of the atoms is also significant, with
𝑎1 and 𝑏1 serving as the central atoms for defining the transformation vectors (explained in more detail
below). In order to perform the alignments, the reference and template structures are then translated
to the origin of the coordinates. The reference structure is translated by the vector that moves
atom 𝑎1 to {0, 0, 0}. Before the translation is done, the original coordinates of 𝑎1 are saved
in order to reset the position once the structures have been aligned. The same operation is done with
the template structure. The dummy atoms are distinguished from the other atoms by an occupancy field
value of 9.0 in the PDB file. Once the two structures are translated to the origin, and hence superposed
via 𝑎1 and 𝑏1, the coordinates of the atoms are extracted as (𝐴𝑥, 𝐴𝑦, 𝐴𝑧), (𝐵𝑥, 𝐵𝑦, 𝐵𝑧), (𝐶𝑥, 𝐶𝑦, 𝐶𝑧),
(𝐸𝑥, 𝐸𝑦, 𝐸𝑧), (𝐹𝑥, 𝐹𝑦, 𝐹𝑧) and (𝐺𝑥, 𝐺𝑦, 𝐺𝑧) for atoms 𝑎1, 𝑎2, 𝑎3, 𝑏1, 𝑏2 and 𝑏3 respectively.
The first rotation is then performed in order to align the normal vectors defined by the three atoms of
the reference and template structures respectively, and bring the structures into the same plane. The
normal vectors 𝒏𝟏 and 𝒏𝟐 are calculated as the cross products of the vectors
$\overrightarrow{a_1 a_2}$, $\overrightarrow{a_1 a_3}$ and $\overrightarrow{b_1 b_2}$, $\overrightarrow{b_1 b_3}$ respectively. The axis of rotation is given by the cross
product of the two normal vectors, and the angle of rotation is calculated from the dot product of 𝒏𝟏
and 𝒏𝟐.
$$\begin{aligned}
n_{1x} &= (B_y - A_y)(C_z - A_z) - (B_z - A_z)(C_y - A_y) \\
n_{1y} &= (B_z - A_z)(C_x - A_x) - (B_x - A_x)(C_z - A_z) \\
n_{1z} &= (B_x - A_x)(C_y - A_y) - (B_y - A_y)(C_x - A_x) \\
n_{2x} &= (F_y - E_y)(G_z - E_z) - (F_z - E_z)(G_y - E_y) \\
n_{2y} &= (F_z - E_z)(G_x - E_x) - (F_x - E_x)(G_z - E_z) \\
n_{2z} &= (F_x - E_x)(G_y - E_y) - (F_y - E_y)(G_x - E_x)
\end{aligned}$$
Let (𝑥, 𝑦, 𝑧) be the components of the axis vector, given by the normalized cross product 𝒏𝟏 × 𝒏𝟐:
$$\begin{aligned}
x &= n_{1y} n_{2z} - n_{2y} n_{1z} \\
y &= n_{1z} n_{2x} - n_{2z} n_{1x} \\
z &= n_{1x} n_{2y} - n_{2x} n_{1y} \\
norm &= (x^{2} + y^{2} + z^{2})^{0.5} \\
x &= x / norm, \quad y = y / norm, \quad z = z / norm
\end{aligned}$$
The angle of rotation is given by the dot product of the normalized normal vectors:
$$\begin{aligned}
|n_1| &= (n_{1x}^{2} + n_{1y}^{2} + n_{1z}^{2})^{0.5} \\
n_{1x} &= n_{1x} / |n_1|, \quad n_{1y} = n_{1y} / |n_1|, \quad n_{1z} = n_{1z} / |n_1| \\
|n_2| &= (n_{2x}^{2} + n_{2y}^{2} + n_{2z}^{2})^{0.5} \\
n_{2x} &= n_{2x} / |n_2|, \quad n_{2y} = n_{2y} / |n_2|, \quad n_{2z} = n_{2z} / |n_2| \\
\cos &= n_{1x} n_{2x} + n_{1y} n_{2y} + n_{1z} n_{2z} \\
\theta &= \operatorname{atan2}\left((1 - \cos^{2})^{0.5},\; \cos\right)
\end{aligned}$$
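The computation of the plane normals, the rotation axis and the rotation angle can be sketched in a
few lines. The function names and the coordinates below are illustrative and are not taken from the
Perl/Tcl implementation; the sine is obtained as the square root of 1 − cos², which is valid because
the angle between two vectors lies between 0 and 180 degrees.

```python
import math


def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(u):
    n = math.sqrt(dot(u, u))
    if n == 0.0:
        raise ValueError("zero-length vector (collinear atoms or parallel normals)")
    return tuple(a / n for a in u)


def plane_normal(p1, p2, p3):
    """Normal of the plane through three atoms: (p2 - p1) x (p3 - p1)."""
    return cross(sub(p2, p1), sub(p3, p1))


def axis_and_angle(n1, n2):
    """Unit rotation axis n1 x n2 and the angle between n1 and n2 in radians."""
    n1, n2 = unit(n1), unit(n2)
    axis = unit(cross(n1, n2))
    cos_t = max(-1.0, min(1.0, dot(n1, n2)))  # clamp against rounding errors
    theta = math.atan2(math.sqrt(1.0 - cos_t * cos_t), cos_t)
    return axis, theta


if __name__ == "__main__":
    n1 = plane_normal((0, 0, 0), (1, 0, 0), (0, 1, 0))  # reference atoms a1, a2, a3
    n2 = plane_normal((0, 0, 0), (1, 0, 0), (0, 0, 1))  # template atoms b1, b2, b3
    print(axis_and_angle(n1, n2))
```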
Finally, we can calculate the Euler angle rotation components (derived from quaternions). Because of
the coordinate reference standards used in VMD, where the transformations are executed, the 𝑥, 𝑦
and 𝑧 rotation components are multiplied by -1.
$$\begin{aligned}
s &= \sin(\theta), \quad c = \cos(\theta), \quad t = 1 - \cos(\theta) \\
Rot_y &= -\operatorname{atan2}\left(y s - x z t,\; 1 - (y^{2} + z^{2})\, t\right) \\
Rot_z &= -\operatorname{asin}\left(x y t + z s\right) \\
Rot_x &= -\operatorname{atan2}\left(x s - y z t,\; 1 - (x^{2} + z^{2})\, t\right)
\end{aligned}$$
Executing the above rotations around the y, z and x axes successively aligns the normal vectors
(Figure 13).
Figure 13: Schematic diagram of the first rotation transformation to align a nucleotide backbone to be
incorporated on DNA 5’ end. The figures on the first row show the original out of plane orientation of the
template backbone, represented by the three atoms to be aligned b1, b2 and b3 with the reference atoms a1,
a2 and a3. Normal vectors of the template and reference structures are n2 and n1 respectively. The figures
on the second row depict the in-plane alignment of the template with the reference backbone after rotation
1.
For the two structures to share the same orientation, the vectors $\mathbf{t_1} = \overrightarrow{a_1 a_3}$ and $\mathbf{t_2} = \overrightarrow{b_1 b_3}$ are
aligned through a second rotation transformation. The new coordinates of the template atoms (after
rotation 1) are extracted, and the second rotation is computed. The new coordinates of the template are
also used to check how precisely the first alignment was done: the 𝒏𝒆𝒘 𝒏𝟐 vector is calculated, and a
parallelism score between 𝒏𝟏 and 𝒏𝒆𝒘 𝒏𝟐 is computed as their dot product. This is done only
to verify the algorithm. The axis vector, rotation angle and Euler angle rotation components are
calculated as for rotation 1. The second rotation is then executed (Figure 14).
Figure 14: Schematic diagram of the second rotation transformation to align a nucleotide backbone to be
incorporated on DNA 5’ end. The figures on the first row show the original out of plane orientation of the
template backbone, represented by the three atoms to be aligned b1, b2 and b3 with the reference atoms a1,
a2 and a3. Vectors of the template and reference structures to be aligned are t2 and t1 respectively. The
figures on the second row depict the alignment of the template with the reference backbone after rotation
2.
Finally, a translation is done so as to bind the template backbone. The new coordinates
of the template are extracted, and the previous transformation (rotation 2) is assessed by checking how
well the structures are superposed. Because both structures were initially translated to the origin via
𝑎1 and 𝑏1 respectively, and because the template geometry corresponds to the reference, the structures
are superposed after the three previous transformations (translation to origin, rotation 1, rotation 2).
They share the same orientation, and 𝑎1 and 𝑏1 are virtually perfectly superposed; however, 𝑎2, 𝑏2 and
𝑎3, 𝑏3 are not exactly superposed, as the template represents a standardized geometry and does not
correspond exactly to the reference (which comes from the initial crystal coordinates). The final
translation is calculated as the vector between atom 𝑏2 and the original position of atom 𝑎2 of the
reference structure, i.e. its position before the initial transformation. It results in the
superposition of the dummy atom O3’ with the reference atom O3’, and hence in the binding of the new
backbone.
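The whole sequence (translation to the origin, rotation 1, rotation 2, final translation) can be
sketched as below. This is not the Perl/Tcl code of this work: the rotations are applied here with
Rodrigues' rotation formula instead of Euler angles, the rotation axis is taken as the cross product
from the template vector to the reference vector so that the template is carried onto the reference,
and all names and coordinates are hypothetical.

```python
import math


def sub(u, v): return tuple(a - b for a, b in zip(u, v))
def add(u, v): return tuple(a + b for a, b in zip(u, v))
def dot(u, v): return sum(a * b for a, b in zip(u, v))
def norm(u): return math.sqrt(dot(u, u))
def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])


def rotate(p, axis, angle):
    """Rodrigues rotation of point p about a unit axis passing through the origin."""
    c, s = math.cos(angle), math.sin(angle)
    k_cross_p, k_dot_p = cross(axis, p), dot(axis, p)
    return tuple(p[i] * c + k_cross_p[i] * s + axis[i] * k_dot_p * (1.0 - c) for i in range(3))


def rotation_onto(u, v):
    """Unit axis and angle of the rotation carrying direction u onto direction v."""
    axis = cross(u, v)
    n = norm(axis)
    if n < 1e-12:  # already parallel; the antiparallel case is not handled in this sketch
        return (1.0, 0.0, 0.0), 0.0
    return tuple(a / n for a in axis), math.atan2(n, dot(u, v))


def align_template(ref, tmpl_anchor, tmpl_atoms):
    """ref = (a1, a2, a3), tmpl_anchor = (b1, b2, b3), tmpl_atoms = all template coordinates."""
    a1, a2, _ = ref
    b1 = tmpl_anchor[0]
    a = [sub(p, a1) for p in ref]              # reference translated to the origin via a1
    b = [sub(p, b1) for p in tmpl_anchor]      # template anchors translated via b1
    atoms = [sub(p, b1) for p in tmpl_atoms]
    # Rotation 1: bring the template plane normal onto the reference plane normal
    axis, ang = rotation_onto(cross(b[1], b[2]), cross(a[1], a[2]))
    b = [rotate(p, axis, ang) for p in b]
    atoms = [rotate(p, axis, ang) for p in atoms]
    # Rotation 2: align the in-plane vector b1->b3 with a1->a3
    axis, ang = rotation_onto(b[2], a[2])
    b = [rotate(p, axis, ang) for p in b]
    atoms = [rotate(p, axis, ang) for p in atoms]
    # Final translation: move b2 onto the original position of a2
    shift = sub(a2, b[1])
    return [add(p, shift) for p in atoms]
```

When the template anchors have exactly the same geometry as the reference atoms, the three anchors
end up superposed on the reference; with a standardized template only the atom used for the final
translation is superposed exactly.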
Figure 15: Translation transformation attaching the aligned backbone to DNA 5’ end. The superposed atoms
resulting from translating the template O3’ dummy atom onto the DNA 5’ end O3’ atom are indicated by the
dashed rectangle.
b) Inserting the base group
Once the DNA strand has been extended with a new backbone, the next step is to attach a new base
group on the C1’ (host atom) of the backbone sugar.
Figure 16: DNA nucleotide and backbone references to attach a new base group on the 5’ end. The atom
shown in lime is the attachment point of a new base to the host reference backbone, while the nucleotide
indicated in grey is the extremity nucleotide reference.
The DNA direction is extracted and is used at a final stage to determine whether the base is to be
laterally shifted by + or – 34.2 degrees (B-DNA consecutive base shift). The same strategy as above is
employed, except that the atoms to be aligned are specified in the following manner. If the reference or
template base type is G or A, then the atom types of 𝑎1, 𝑎2 and 𝑎3 (or 𝑏1, 𝑏2 and 𝑏3) are C2, C4 and C6
respectively. Alternatively, if the base type is T or C, then the atom types are taken in the C2, C6 and
C4 order. In doing so, the bases can be aligned properly. For example, when aligning G with A or G with
G, C2, C4 and C6 are respectively superposed, whereas when aligning G with T or C, C2, C4 and C6 of G are
respectively superposed with C2, C6 and C4 of T or C.
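The landmark selection rule above can be written as a small lookup. This is only an illustrative
sketch; the function names are hypothetical and do not come from the thesis code.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"T", "C"}


def base_landmarks(base):
    """Atom names used as (1, 2, 3) alignment landmarks for a base type."""
    if base in PURINES:
        return ("C2", "C4", "C6")
    if base in PYRIMIDINES:
        return ("C2", "C6", "C4")
    raise ValueError(f"unsupported base type: {base}")


def landmark_pairs(ref_base, template_base):
    """Pairs (reference atom, template atom) to be superposed."""
    return list(zip(base_landmarks(ref_base), base_landmarks(template_base)))
```

For instance, `landmark_pairs("G", "T")` pairs C4 of G with C6 of T, and C6 of G with C4 of T, as in the
example given in the text.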
After performing the same steps as previously (insertion of the template and translations to origin, etc.),
rotation 1 is performed (Figure 17).
Figure 17: Schematic diagram of the first rotation transformation to align a nucleotide base group to be
incorporated on DNA 5’ end. The figures on the first row show the original out of plane orientation of the
template base group, represented by the three atoms to be aligned b1, b2 and b3 with the reference atoms
a1, a2 and a3. Normal vectors of the template and reference structures are n2 and n1 respectively. The
figures on the second row depict the in-plane alignment of the template with the reference base group after
rotation 1.
Then rotation 2 is performed. The only difference with the backbone alignment procedure is that the
template base plane is tilted laterally (around its normal vector) relative to the plane of the reference
base, by + or – 34.2 degrees. The alignment angle is calculated as represented in Figure 18, and is then
incremented by +/- 34.2 degrees (not represented) to take the tilt into account.
Figure 18: Schematic diagram of the second rotation transformation to align a nucleotide base group to be
incorporated on DNA 5’ end. The figures on the first row show the original out of plane orientation of the
template base group, represented by the three atoms to be aligned b1, b2 and b3 with the reference atoms
a1, a2 and a3. Normal vectors of the template and reference structures are n2 and n1 respectively. The
figures on the second row depict the in-plane alignment of the template with the reference base group after
rotation 2.
Finally, the base is attached to the sugar by computing the translation of the template dummy atom C1’
to the backbone host attaching atom C1’ (Figure 19).
Figure 19: Schematic diagram of the translation transformation attaching a new base group to DNA 5’ end
backbone. The template base group is shown in silver, while the reference nucleotide is in grey. A: Position
of the aligned base group resulting from rotations 1 and 2. The translation target is represented by the
atom colored in lime. B: Position of the base group attached to DNA after translation transformation.
5. Principle application: constructing a complete EC
The missing nucleotides are represented in Figure 20 and listed in Tables 3 and 4.
Figure 20: Schematic diagram of missing nucleotides in PDB#2E2H. The upstream and downstream
bubbles are indicated. tDNA, ntDNA and RNA are in light blue, cyan and lime ribbon representation
respectively. The red dashed rectangles represent the register rank to be extended, except for tDNA i-5
where the register is indicated for positional comparison with ntDNA. RNA exit channel is indicated by the
green arrow.
Register (i +/-)  RNA base  RNA index
0 A 18
-1 G 17
-2 G 16
-3 A 15
-4 G 14
-5 A 13
-6 G 12
-7 C 11
-8 U 10
-9 A 9
-10 C 8
-11 U 7
-12 A 6
-13 G 5
-14 C 4
-15 G 3
-16 G 2
-17 U 1
Table 3: RNA nucleotides to be added. RNA strand nucleotide types and register ranks are indicated.
Numbers in green indicate existing nucleotides, while red indexes indicate the nucleotides to be added. 5’-
3’ direction is given by the ascending index order. RNA registers are listed from the downstream to the
upstream direction.
Register (i +/-)  T strand (D*): base, index  NT strand (D*): base, index
21 G 19 C 96
20 T 20 A 95
19 A 21 T 94
18 C 22 G 93
17 T 23 A 92
16 A 24 T 91
15 C 25 G 90
14 C 26 G 89
13 G 27 C 88
12 A 28 T 87
11 T 29 A 86
10 A 30 T 85
9 A 31 T 84
8 G 32 C 83
7 C 33 G 82
6 A 34 T 81
5 G 35 C 80
4 A *C 36 G 79
3 C 37 G 78
2 G *C 38 G 77
1 C 39 G 76
0 T 40 A 75
-1 C 41 G 74
-2 C 42 G 73
-3 T 43 A 72
-4 C 44 G 71
-5 T 45 A 70
-6 C 46 G 69
-7 G 47 C 68
-8 A 48 T 67
-9 T 49 A 66
-10 G 50 C 65
-11 A 51 T 64
-12 T 52 A 63
-13 C 53 G 62
-14 A 54 T 61
-15 T 55 A 60
-16 C 56 G 59
-17 T 57 A 58
Table 4: DNA nucleotides to be added. tDNA and ntDNA nucleotide types and register ranks are indicated.
Numbers in green indicate existing nucleotides, while red indexes indicate the nucleotides to be added. 5’-
3’ direction is given by the ascending index order. DNA registers are listed from the downstream to the
upstream direction. The purple letters indicate the existing nucleotide to be mutated as cytosine to allow
GTP substrate pre-binding in MD simulations.
We begin by reconstructing the DNA belonging to the downstream bubble. Instead of adding only the
missing nucleotides, the whole double helix from i + 21 to i + 5 is inserted, so that only one
superposition has to be performed. A perfect B-DNA double helix, whose sequence corresponds to Table 4,
is constructed with the nab tool of the Amber package. The perfect helix consists of segment chain M
5’-3’ resid 1 to 16, and chain O 3’-5’ resid 17 to 2 (starting at 2 instead of 1 in order to include the
P atom of the first residue, the strand direction being 3’-5’). Then, using a modification of the
backbone algorithm, three atoms of the DNA template are aligned with three landmarks belonging to the
original structure. The three reference atoms of PDB#2E2H are resid 9, chain T, atom P; resid 13, chain
T, atom O3’; resid 2, chain N, atom C3’. The logic for choosing these landmarks is that they belong to
the backbone, and that they are distant from each other, which reduces noise, but not too far from the
center of the protein, which reduces uncertainty due to crystal packing distortion. Two of the landmarks
are close to the binding register (i + 5), and one of them, 13:O3’, is the binding atom. Landmark 13:O3’
of the reference and 16:O3’ of the template are superposed so as to bind the refitted and extended
downstream helix directly to i + 5. Several landmark combinations were tested, and the one that gave the
best superposition score was retained.
Register (i +): 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6
2e2h  chain T: 9:P, 13:O3’
2e2h  chain N: 2:P
nab   chain M: 12:P, 16:O3’
nab   chain O: 2:P
Table 5: Alignment of an entire template helix to three reference anchoring points. The template helix atoms
to be superposed are indicated in the « nab » field, while the reference anchoring atoms are listed in the
« 2e2h » row. The atoms indicated in blue, red and purple, are to be respectively superposed.
Figure 21: Comparison fit between initial downstream tDNA structure and superposed extended helix.
Template and initial structure tDNA strands are represented in green and light blue respectively. Reference
nucleotide attaching the extended template is shown in yellow.
Figure 22: Comparison fit between initial downstream ntDNA structure and superposed extended helix.
Template and initial structure ntDNA strands are represented in green and cyan respectively. Reference
nucleotide attaching the extended template is shown in yellow.
Figure 23: Visualization of downstream DNA reconstruction. tDNA, ntDNA and protein walls are shown as
grey, light blue and cyan surfaces respectively.
The tDNA and ntDNA i + 5 registers are kept from the original structure, which leaves the following
missing nucleotides to be added: ntDNA registers i + 4 to i – 17 and tDNA registers i – 10 to i – 17. The
extension is split into two procedures: extension of ntDNA from i + 4 to i – 9, and extension of the t
and ntDNA strands from i – 10 to i – 17. The reason for this split is that the chains reanneal at i – 10.
The next step of the reconstruction procedure is to add the missing ntDNA nucleotides from i + 4 to
i – 9. The goal is to fit the extra segment in the protein, such that it connects to i + 5 at one
extremity and approaches i – 9 at the other for reannealing. The i + 5 ntDNA position lacks the P, O1P
and O2P atoms because it is at the 5’ extremity. Hence, in order to extend the strand in the 5’
direction, a dummy nucleotide is superposed with i + 5, and its P, O1P and O2P atom coordinates are
copied into the reference structure. Then, the task of adding nucleotides from i + 4 to i – 9 is pursued.
This task is a special case compared to standard geometrical fitting, because the conformation of the
ntDNA is not canonical and diverges far from a B-DNA helix. Indeed, the path of the strand is not
helical, because it is distorted inside the protein before it reassociates with the t strand outside of
the enzyme. In order to solve this problem, let us focus on the basic requirement. Ignoring clashes with
the protein, the first requirement is that a DNA strand 14 nucleotides long has to be inserted such that
it starts at the extremity of i + 5 and ends approximately in front of, and one base before, tDNA
register i – 9. To meet this requirement, the following trick is applied. The tDNA segment running from
i + 4 to i – 9, which is provided in the initial structure, almost perfectly covers the distance between
the two aforementioned landmarks over a length of 14 bases. Because the t and ntDNA i + 4 blocks face
each other, as do the i – 9 registers, the preceding nucleotides must be close to each other for the DNA
to reanneal at i – 10. Thus the trick is to use the tDNA structure from i + 4 to i – 9 as the starting
template guess for the extension of the nt strand, and to insert it between the landmarks so as to cover
a nucleotide distance ranging from the extremity of ntDNA i + 5 to the front of tDNA i – 9.
Furthermore, it is to be noted that tDNA roughly follows an elbow-shaped path; hence, for ntDNA to go
between roughly the same starting and ending points, the elbow-shaped structure of tDNA is to be
inverted. In addition, this allows ntDNA to be in the right 3’-5’ direction. The figure below describes
this concept.
Figure 24: Initial fitting of upstream ntDNA. Initial tDNA region to match template upstream ntDNA is
indicated in the left figure by a dashed circle. The right figure displays the initial insertion of upstream
ntDNA (dashed circle) to fit corresponding tDNA from the start and end association areas. Refitted tDNA
and ntDNA derived previously are indicated in light blue and cyan, while template upstream ntDNA is
represented in yellow.
Fitting the nt strand using the inverted path of the t strand gives the result displayed in Figure 25,
where the path of the strand has very few vdw clashes with the protein.
Figure 25: Visualization of the initial fitting of ntDNA template relative to the enzymatic structure. Existing
tDNA and ntDNA strands are in light blue and cyan respectively, fitted ntDNA is in yellow and protein walls
are in grey. A: Side view. B: Front view.
After the initial insertion, a few adjustments are made to improve the path of the strand, minimizing
vdw contacts and orienting the strand between the starting and the ending landmarks, especially for
the segment which binds at i + 4. These are done manually in VMD, by closely inspecting the
structure. The optimized geometry is depicted in Figures 26 and 27 below.
Figure 26: Second fitting of upstream ntDNA. The path of the template is modified to connect to
downstream ntDNA around register i + 4. Refitted tDNA and ntDNA from previously are indicated in light
blue and cyan, while template upstream ntDNA is represented in yellow.
Figure 27: Visualization of the second fitting of ntDNA template relative to the enzymatic structure. Existing
tDNA and ntDNA strands are in light blue and cyan respectively, fitted ntDNA is in yellow and protein walls
are in grey. A: Side view. B: Front view.
The adjusted geometry appears in reasonable agreement with Andreacka et al.’s fluorescent probing of
ntDNA [Andreacka, et al., 2009].
Then, the ntDNA bases are mutated to the required types, using the same procedure as for the base group
alignment; the result is displayed in Figure 28.
Figure 28: Mutation of ntDNA template nucleotides to match Table 4 sequence. The nucleic acid strand
used for insertion geometry alignment is modified to the wanted sequence. The mutated base groups (blue)
are aligned to the groups to be replaced, belonging to the reference strand (light blue).
Then, the t and ntDNA i – 10 to i – 17 and RNA i – 9 to i – 17 portions are inserted using the same
procedures as for the sugar and base insertions explained previously, and are manually adjusted in VMD
to minimize vdw contacts and optimize their path, such that the exiting upstream DNA helix is
approximately helical and the RNA is extruded through the RNA exit channel.
Figure 29: Fitting of missing RNA nucleotides. The initial RNA strand (lime) is prolonged by aligned
template nucleotides forming the yellow strand. A: Enzyme-free view. B: Visualization of the extension
relative to the protein (grey).
Finally, the potential energy is minimized, by running ten rounds of minimization 1 (10 kcal.mol-1
restraint on protein residues), and minimization 2 (whole system is minimized), in order to refine the
nucleic acid frame geometry, and notably to create the correct bonding distances.
Figure 30: vdw representation of the full nucleic complex before potential energy minimization.
Figure 31: vdw representation of the full nucleic complex after potential energy minimization.
6. Closing remarks
Several mathematical methods have been investigated in the biosciences field to characterize helix
geometry occurring in nanostructures [McLahan, 1979; Aqvist, 1986; Kahn, 1988; Christopher, et al.,
1996; Lu, et al., 2003; Dalton, et al., 2003; Lee, et al., 2007; Enkbayar, et al., 2008; Kumar, et al., 2012;
Bansal, et al., 2012]. In this section, helix geometry was recreated by fitting an optimal template
to three landmark atoms, using 3D rotations. The advantage of the method presented here is that its
minimum requirement as starting data is three atoms that are not necessarily consecutive, but for which
the registers are known, whereas other methods require at least four consecutive atoms.
Nevertheless, it is worth inspecting the alternative procedures outlined in the above references in
order to identify what could be optimized.
The outcome of the EC reconstruction shows that, while refining the potential energy (minimization)
works, the nucleic acids from i - 10 to i - 17 seem to reach only a passable conformation, because the
strands do not form a well-defined double helix. Indeed, potential energy refinement is a very efficient
method, but it relies on algorithms that can become trapped in local minima that are too high in energy.
For example, minimizing the same structure presented in this section, but with many surrounding
metabolites and with the 12-6-4 potential, did not allow a satisfactory minimization of the nucleic
acids because the initial system was too far from relaxation. Hence, not only to further refine the
structure before minimization, but also to port the method presented here to a more complex system, the
mathematical refinement methods referenced above could be investigated.
For example, for segments adopting a conventional shape, investigating mathematical helical parameter
extraction tools could be of interest. In particular, the non-linear optimization procedure presented in
[Enkbayar, et al., 2008] seems to be the best tool so far to derive, notably, a helix axis, and could,
using information present in the atomic coordinates of the initial crystal structure, be used to refine
the positioning of missing nucleotides. This method works in three steps.
First, the function 𝑓1 is minimized, i.e., seven variables (two vectors, one radius) are calculated for
which the “energy” of the function is minimal. The second step is to calculate the helix pitch. Then the
eleven parameters (two 3D vectors, two 3D points, one radius) of the function 𝑓2 are minimized (at the
same time), using as starting guess the values of 𝑃, 𝑟, 𝒂 and 𝒐 from steps 1 and 2.
Where:
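\[
f_1(r, \mathbf{a}, \mathbf{o}) = \sum_{i}^{N} \left( \left| \mathbf{x}_i - \mathbf{o} - (\mathbf{x}_i \cdot \mathbf{a})\,\mathbf{a} \right| - r \right)^2,
\]
\[
f_2(P, r, \mathbf{a}, \mathbf{o}, t_0) = \sum_{i}^{N} \left| \mathbf{x}_i - \left( \mathbf{o} + \mathbf{a} P t - r \left( \mathbf{v}\cos(t) + \mathbf{w}\sin(t) \right) \right) \right|^2
\]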
And:
𝑟 is the helix radius,
𝒂 is the helix axis direction vector,
𝒐 is the perpendicular vector from the coordinate origin (0,0,0) to the starting of the helix axis,
𝒙𝒊 is the ith data point vector (vector from the origin to the ith point belonging to the helix),
𝑃 is the helix pitch,
𝒗 is a unit vector perpendicular to 𝒂,
𝒘 is a unit vector perpendicular to 𝒂 and 𝒗,
𝑡 is an independent variable representing the rotation angle around 𝒂,
𝑡0 is the first data point (the first point lying at the beginning of the helix verifies 𝒐 + 𝒂𝑃𝑡 −
𝑟(𝒗𝑐𝑜𝑠(𝑡) + 𝒘𝑠𝑖𝑛(𝑡)) = 𝑡0).
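To make the two objective functions concrete, the sketch below evaluates them on synthetic helix
points. It is an illustration written directly from the expressions and definitions above, not the
implementation of [Enkbayar, et al., 2008]: the function names are hypothetical, the vectors 𝒂, 𝒗 and 𝒘
are assumed to be unit vectors, 𝒐 is assumed perpendicular to 𝒂, and the angle 𝑡 of each data point is
assumed to be known. No minimizer is included.

```python
import numpy as np


def f1(r, a, o, x):
    """Sum of squared deviations between the radial distances and r."""
    radial = x - o - np.outer(x @ a, a)   # x_i - o - (x_i . a) a
    return float(np.sum((np.linalg.norm(radial, axis=1) - r) ** 2))


def helix(P, r, a, o, v, w, t):
    """Model points o + a P t - r (v cos t + w sin t) for an array of angles t."""
    return o + np.outer(P * t, a) - r * (np.outer(np.cos(t), v) + np.outer(np.sin(t), w))


def f2(P, r, a, o, v, w, t, x):
    """Sum of squared distances between the data points and the model helix."""
    return float(np.sum(np.linalg.norm(x - helix(P, r, a, o, v, w, t), axis=1) ** 2))


# Synthetic check: points generated from the model give f1 = f2 = 0.
a = np.array([0.0, 0.0, 1.0])
v = np.array([1.0, 0.0, 0.0])
w = np.array([0.0, 1.0, 0.0])
o = np.array([2.0, -1.0, 0.0])            # perpendicular to a
t = np.linspace(0.0, 4.0 * np.pi, 20)
x = helix(0.5, 10.0, a, o, v, w, t)
print(f1(10.0, a, o, x), f2(0.5, 10.0, a, o, v, w, t, x))
```

Changing the radius passed to `f1` by one unit increases the sum by one per data point, which gives a
quick way to check an implementation before handing it to a non-linear optimizer.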
Chapter 4
Advanced Characterization of the Diffusional Pathways
1. Introduction
For advanced characterization of the diffusion process, the meaningful parameters to be extracted can be
divided into two categories: the conformational contribution and the long-range interaction
contribution. In this section, a novel algorithm for extracting the diffusive cross section areas along
pathways, and other useful parameters, is presented. We then focus our investigation on how to
characterize non-bonded phenomena.
2. Geometric pathway analysis
2.1. Introduction
In order to characterize how the geometry of the pathways impacts nucleotide diffusion, parameters of
particular interest include the pathway axis (which defines a protein-free central trajectory) and the
cross section area. There exist tools such as CAVER 3.0 ([Chovancova, et al., 2012; Kozlikova, et al.,
2014; Pavelka, et al., 2016]), PoreWalker ([Pellegrini-Calace, et al., 2009]) or MolAxis ([Yaffe, et al.,
2008]) that offer automated analysis of pathways in proteins. However, these tools are based on
algorithms that either function poorly or do not allow a physically sound cross section area to be
extracted. It is therefore necessary to investigate how to express the parameters of the channels
mathematically in a rigorous manner, in order to be able, for example, to state that CH3 is wider than
CH2 and hence offers greater accessibility. The task of mathematically expressing the parameters of
protein pathways in space is not straightforward. The shape of pathways in proteins generally does not
exhibit a canonical geometry but can be very irregular. Furthermore, defining a surface or a volume with
atoms poses an additional issue: the true dimensions of an object composed of atoms are not derived
directly from the coordinates of the atomic centers; rather, the true shape is given by the
electromagnetic contour, which can be represented by the van der Waals (vdw) radius. Let us consider
this issue more closely. Let us assume that an arbitrary pathway lies in space, and let an axis a
traverse the pathway. Let us assume that the cross section area is to be calculated at a given point
along the axis. The task presents the following difficulties. First, the diffusive cross section is
defined only by the inner surface contour, thus only the atoms whose vdw spheres are closest to the
inside of the pathway are to be taken into consideration (Figure 33). Second, investigating the lateral
component of the pathway in two dimensions with a cross section plane is not satisfactory: because of
the vdw radius, atoms that lie just in front of or behind the plane also affect the inner cross section
area in the plane (Figure 34). In other words, because of the vdw radius, atomic points are represented
as spheres, and hence a third dimension, depth, affects the lateral contour (Figure 32).
Figure 32: Schematic diagram of the main dimensions of a pathway.
Combining the two latter issues means that, for any lateral direction, only the atomic sphere that is
laterally closest to the inside of the pathway, and that lies within a certain longitudinal vdw
threshold, contributes to the inner dimensions of the channel. An important fact to underline is that
extracting only the interlining atomic contributions allows the correct axial geometric center to be
calculated, whereas including extra atoms in the atomic selection can severely bias the calculation
(Figure 33).
Figure 33: Schematic diagram of a pathway cross section layer. Spheres in cyan represent the cross section
selection of atoms of a channel. Left: geometric center of the pathway (blue) is erroneous if not excluding
the outer-lining atoms. Right: geometric center is correct when excluding outer-lining atoms (red).
There is another problem to solve: defining the right axial direction. Defining the diffusive axis as a
single straight line across a pore is erroneous, because the lateral width of an irregular cavity is to
be defined from the biggest lateral void dimensions along the pathway. Let us take the following
example. If the axis of a given pathway is defined as a straight line running from the start to the end
of the structure, then the cross section area, defined in the plane perpendicular to that axis, will not
characterize the real accessibility. It is more accurate to define a readjusted axis along the pathway,
so that it is orthogonal to the lateral contour offering the biggest accessibility (Figure 34).
Figure 34: Pathway axis of an irregular channel. Left: if the diffusive cross section is defined as the plane
(red rectangle) orthogonal to a fixed axis (solid arrow) from the start to the end of the pathway, then the
cross-sectional area will be erroneous. Right: Correct non-fixed pathway axis defining diffusive cross section
areas.
2.2. Principle of the algorithm
The main issues expressed above make it possible to specify the task more precisely. Hereafter, the
main principles of an algorithm solving this task are explained. A way to tackle the issue is to
imagine that one is looking axially towards a pore (Figure 35).
Figure 35: Schematic diagram of the visualization through a pathway. Figures on the left indicate the
visualization direction towards a pathway represented as a tube lying in space. Figures on the right indicate
variation of the void space projected in front. A: out of axis direction. B: axial direction.
The visualization angle displaying the biggest opening gives the correct accessibility direction. To
develop this concept further, let the eye above be replaced by a plane onto which the pathway points
lying immediately in front are projected. One way to define the best accessibility direction is as the
plane direction for which the projected points have the biggest minimal atomic distance to the inner
contour center, compared with the other directions (Figure 36). This is in fact a simplification, as the
best accessibility direction will be assessed with a radius in 3D and not only from the projections
above. A more precise definition is that the best accessibility direction is the projection for which
the contour geometric center has the biggest minimal distance to any other atoms of the pathway.
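The selection criterion can be illustrated with a short sketch. It assumes that each tested direction
has already produced a set of contour points in its projection plane; the function names are
hypothetical, and vdw radii are ignored for simplicity.

```python
import numpy as np


def min_distance_to_center(contour):
    """Smallest distance between the contour points and their geometric center."""
    center = contour.mean(axis=0)
    return float(np.linalg.norm(contour - center, axis=1).min())


def best_direction(contours):
    """Index of the contour with the biggest minimal distance, and all scores."""
    scores = [min_distance_to_center(c) for c in contours]
    return int(np.argmax(scores)), scores
```

With a circular contour of radius 5 Å and an elliptical contour with semi-axes of 5 Å and 2 Å, the
circular contour is selected, since the narrowest opening of the ellipse is only 2 Å from its center.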
Figure 36: Projection of pathway points onto a tested direction. Figures on the left represent a tested axial
direction of a pathway with a plane. Figures on the right correspond to the projection of the atoms belonging
to a channel (grey) and lying immediately in front of the plane, onto the tested plane. Optimal axial direction
(B) gives a minimal distance to the interlining cluster of atoms center greater than the wrong tested direction
(A).
The algorithm starts from an initial direction defined by a pathway start guess point and a pathway end
guess point. This is the only user input required, i.e. 6 values (x, y and z coordinates of the two
guesses). It would theoretically be possible to require no user input by automatically generating the
guess points, for example by detecting borders between protein and protein-free regions using
mathematical convolutions. An even simpler way would be to map the entire protein with a series of
adjacent spheres. The spheres that contain fewer atoms than a threshold value are then selected as void
cavities, and void cavities that are adjacent to each other are selected to define a linked void area,
or in other words a pathway. This extra complexity is however unnecessary for our investigation.
The axis is scanned by tilting around the initial direction. Each scanned direction is represented by a
vector and the initial point. For every direction, the points that belong to a 3 Å window in front are
projected onto the plane defined by the direction vector and passing through the initial point (Figure 37).
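The window selection and projection step can be sketched as follows. The function name and arguments
are hypothetical; the sketch only keeps atoms whose distance along the tested direction lies between 0
and 3 Å, and does not apply the cylindrical restriction shown in Figure 37.

```python
import numpy as np


def project_window(atoms, point, direction, depth=3.0):
    """Project the atoms lying within `depth` in front of `point` onto the scan plane.

    atoms     : (n, 3) array of atomic coordinates
    point     : initial point of the scan (3,)
    direction : tested direction vector (3,), not necessarily normalized
    Returns the projected coordinates and the boolean selection mask.
    """
    n = direction / np.linalg.norm(direction)
    along = (atoms - point) @ n                 # signed distance along the direction
    mask = (along >= 0.0) & (along <= depth)    # atoms in the forward window
    projected = atoms[mask] - np.outer(along[mask], n)
    return projected, mask
```

Every projected point lies in the plane that passes through `point` and is orthogonal to `direction`, so
the contour analysis can then be carried out in two dimensions.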
Figure 37: Axis scan. Starting from an initial direction (black arrow), an axis scan is performed by rotating
a test vector (red dashed arrow) about the initial direction. A: Generation of the scan directions. B: For
each scan direction (red dashed arrow), atoms belonging to a cylindric region in the direction of the scan
are extracted.
Next, the contour of the projection is extracted by selecting only the inner atoms (Figure 39). This is
done by rotating around the scan direction vector and analyzing the contour by single dials (Figure 38).
Figure 38: Contour scan. For any given scan direction (black arrow), the pathway contour is scanned
around the axis by dial increments. The first figure shows the starting dial (blue) of the contour, calculated
from the closest atom to the axis (black point) displaying a certain vdw radius (red circle). The second figure
displays the atom extraction performed for the dial, and the purple, green and orange atomic points are
selected. The third figure indicates the selection of the dial atom that is closest to the axis (interlining atom).
The dial is then incremented to scan a new angular region (purple dial).
Figure 39: Interlining atoms extraction. Left: atomic selection (grey points) before performing the contour
scan. Right: interlining atomic selection computed by the dial calculations.
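A simplified version of the dial-based contour scan is sketched below. It works on coordinates already
projected onto the scan plane and expressed relative to the axis point. Two simplifications are assumed
and differ from the procedure of Figure 38: the dials have a fixed angular width (`dial_deg`, a
hypothetical parameter) instead of a width derived from the vdw radius, and atoms are treated as points.

```python
import numpy as np


def interlining_atoms(xy, dial_deg=20.0):
    """Indices of the atoms closest to the axis in each angular dial.

    xy : (n, 2) coordinates in the projection plane, relative to the axis point.
    The first dial starts at the angular position of the atom closest to the axis.
    """
    dist = np.linalg.norm(xy, axis=1)
    angles = np.degrees(np.arctan2(xy[:, 1], xy[:, 0]))
    start = angles[np.argmin(dist)]
    rel = (angles - start) % 360.0              # angle measured from the first dial
    dial = np.floor(rel / dial_deg).astype(int)
    selected = []
    for d in np.unique(dial):
        members = np.where(dial == d)[0]
        selected.append(int(members[np.argmin(dist[members])]))
    return sorted(selected)
```

Atoms that sit behind a closer atom of the same dial are discarded, which corresponds to excluding the
outer-lining atoms before the geometric center of the contour is computed.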
Then, the contours (of the scanned axes) are assessed against each other. The contour that has the
biggest minimal distance to its geometric center is selected: the new best axial direction forward has
been found. The second part of the algorithm uses a similar approach, but scans the pathway by tilting a
virtual sphere from the previously detected pathway axis point, and selects the virtual sphere whose
center has the biggest minimal radius compared to the other virtual spheres scanned. For the start of
the pathway, the first point is the geometric center of the winning contour projection along the winning
scanned direction. The second part of the algorithm also uses a “fixed axis” principle. A fixed axis is
defined with the start and end guess points (see previous paragraphs) and allows the algorithm to run
across the channel from roughly the start to the end guesses, without exploring sub-pathways in the main
channel, by going backwards for example. This is done by setting up a two-step virtual sphere scan of 45
degrees maximum around the fixed axis, such that the scan does not go backwards (i.e., more than 90
degrees). A second trick is employed to prevent the pathway exploration from escaping the channel, and
consists in defining an outer tube around the fixed axis. This allows the best curvature of the inner
pathway axis to be computed without escaping from the outer tube, and is done by defining virtual atoms
in the outer tube. Finally, the algorithm increments the scan forward to advance along the pathway by
starting a new virtual sphere scan.
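The two constraints tied to the fixed axis can be illustrated as follows. The function names are
hypothetical, and the outer tube is represented here by a simple distance test to the fixed axis, which
is an assumption made for brevity: the text above implements the tube with virtual atoms.

```python
import numpy as np


def within_cone(direction, fixed_axis, max_deg=45.0):
    """True if `direction` deviates from `fixed_axis` by at most `max_deg` degrees."""
    cos_angle = np.dot(direction, fixed_axis) / (
        np.linalg.norm(direction) * np.linalg.norm(fixed_axis)
    )
    return bool(cos_angle >= np.cos(np.radians(max_deg)))


def distance_to_axis(point, start, end):
    """Distance between `point` and the fixed axis line running from `start` to `end`."""
    axis = (end - start) / np.linalg.norm(end - start)
    rel = point - start
    return float(np.linalg.norm(rel - np.dot(rel, axis) * axis))
```

A candidate step would then be kept only if `within_cone` is true for its direction and if
`distance_to_axis` of the candidate point stays below the chosen tube radius.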
2.3. Detailed description of the algorithm
2.3.1. Refine starting point
a) Scan axis
The first step, the scan of the axis, is done by storing in three arrays the respective x, y and z
coordinates of a point projected from the starting point 𝐴 along each tested direction. To do so, the
initial direction vector (hereafter named 𝒏) is rotated laterally and vertically, in 5 degree
increments, covering a spherical scan of -35 to +35 degrees. First, the initial point 𝐴 is projected 1 Å
along 𝒏, and a point called 𝑁𝑖𝑛𝑖 (N initial) is set. Each tested scan direction 𝒏𝒔𝒄𝒂𝒏 is represented by
the new position of point 𝑁𝑖𝑛𝑖 in space and is specified by the points 𝑁𝑝𝑟𝑜𝑗 (N projected), the lateral
shift of point 𝑁𝑖𝑛𝑖, and 𝑁𝑝𝑟𝑜𝑗𝑃 (N projected prime), the vertical shift of 𝑁𝑝𝑟𝑜𝑗, which hence represents
the combination of the lateral and vertical shifts. To understand how this represents a new direction
(for example a lateral shift of -30 deg. and a vertical shift of +5 deg.), note that vector (𝐴, 𝑁𝑝𝑟𝑜𝑗𝑃)
is the vector 𝒏 starting from 𝐴 but pointing in a new direction. This direction is given by the
projection of point 𝐴 along the vector, which is point 𝑁𝑝𝑟𝑜𝑗𝑃. To rotate 𝒏 laterally and vertically,
i.e. to rotate point 𝑁𝑖𝑛𝑖, two vectors are defined: 𝝎, a vector orthogonal to 𝒏, and 𝝍, a vector
orthogonal to both 𝒏 and 𝝎.
Let the initial vector 𝒏 be specified by initial point 𝐴(𝐴𝑥, 𝐴𝑦, 𝐴𝑧), pointing towards point
𝐵(𝐵𝑥, 𝐵𝑦, 𝐵𝑧).
Let 𝑁𝑥,𝑁𝑦,𝑁𝑧 be the parameters of unit vector 𝒏:
𝑁𝑥 = 𝐵𝑥 − 𝐴𝑥, 𝑁𝑦 = 𝐵𝑦 − 𝐴𝑦, 𝑁𝑧 = 𝐵𝑧 − 𝐴𝑧
Vector magnitude is: \( N_{norm} = (N_x^2 + N_y^2 + N_z^2)^{0.5} \)
𝑁𝑥 = 𝑁𝑥/𝑁𝑛𝑜𝑟𝑚, 𝑁𝑦 = 𝑁𝑦/𝑁𝑛𝑜𝑟𝑚, 𝑁𝑧 = 𝑁𝑧/𝑁𝑛𝑜𝑟𝑚
Note that the following terminology is used. When the same variable occurs on the left and the right of
an equation, the left variable corresponds to the new value of the right variable and overwrites it.
A vector 𝝎 orthogonal to 𝒏 verifies
𝑑𝑜𝑡(𝒏,𝝎) = 0
Hence 𝝎(𝑊𝑥,𝑊𝑦,𝑊𝑧) = (0, −𝑁𝑧,𝑁𝑦) is orthogonal to 𝒏 and is set.
Unit vector parameters are given by:
\( W_{norm} = (W_x^2 + W_y^2 + W_z^2)^{0.5} \)
\( W_x = W_x / W_{norm}, \quad W_y = W_y / W_{norm}, \quad W_z = W_z / W_{norm} \)
A vector 𝝍(𝑌𝑥, 𝑌𝑦, 𝑌𝑧) that is orthogonal to both 𝒏 and 𝝎 verifies 𝑐𝑟𝑜𝑠𝑠(𝒏,𝝎) = 𝝍:
𝑌𝑥 = 𝑁𝑦 ∗ 𝑊𝑧 − 𝑁𝑧 ∗ 𝑊𝑦
𝑌𝑦 = 𝑁𝑧 ∗ 𝑊𝑥 − 𝑁𝑥 ∗ 𝑊𝑧
𝑌𝑧 = 𝑁𝑥 ∗ 𝑊𝑦 − 𝑁𝑦 ∗ 𝑊𝑥
Unit vector components are calculated as:
\( Y_{norm} = (Y_x^2 + Y_y^2 + Y_z^2)^{0.5} \)
𝑌𝑥 = 𝑌𝑥/𝑌𝑛𝑜𝑟𝑚, 𝑌𝑦 = 𝑌𝑦/𝑌𝑛𝑜𝑟𝑚, 𝑌𝑧 = 𝑌𝑧/𝑌𝑛𝑜𝑟𝑚
The unshifted position of vector 𝒏 is represented by the 1 Å projection of point 𝐴 along 𝒏, and is point
𝑁𝑖𝑛𝑖(𝑁𝑖𝑛𝑖_𝑥, 𝑁𝑖𝑛𝑖_𝑦, 𝑁𝑖𝑛𝑖_𝑧):
𝑁𝑖𝑛𝑖_𝑥 = 𝐴𝑥 + 𝑁𝑥, 𝑁𝑖𝑛𝑖_𝑦 = 𝐴𝑦 + 𝑁𝑦, 𝑁𝑖𝑛𝑖_𝑧 = 𝐴𝑧 + 𝑁𝑧
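The construction of the unit vectors 𝒏, 𝝎 and 𝝍 and of point 𝑁𝑖𝑛𝑖 can be sketched as follows. The
function name is hypothetical. The explicit error for a direction parallel to the x axis is an addition
of this sketch: in that case (0, −𝑁𝑧, 𝑁𝑦) is the null vector and cannot be normalized.

```python
import numpy as np


def scan_basis(A, B):
    """Return the unit vectors n, omega, psi and the point N_ini = A + n."""
    n = B - A
    n = n / np.linalg.norm(n)                   # unit direction from A towards B
    omega = np.array([0.0, -n[2], n[1]])        # orthogonal to n: dot(n, omega) = 0
    w_norm = np.linalg.norm(omega)
    if w_norm < 1e-12:
        raise ValueError("n is parallel to the x axis; (0, -Nz, Ny) vanishes")
    omega = omega / w_norm
    psi = np.cross(n, omega)                    # orthogonal to both n and omega
    psi = psi / np.linalg.norm(psi)
    n_ini = A + n                               # 1 Å projection of A along n
    return n, omega, psi, n_ini
```

The three vectors form an orthonormal frame attached to point 𝐴, which is what the lateral and vertical
rotations of the scan rely on.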
The vertical scan is done by rotating a point 𝑁𝑝𝑟𝑜𝑗(𝑁𝑝𝑟𝑜𝑗_𝑥, 𝑁𝑝𝑟𝑜𝑗_𝑦, 𝑁𝑝𝑟𝑜𝑗_𝑧) 14 times (in order to
cover the -35 to 35 degrees range in 5 degrees increments); each rotated position corresponds to a point
𝑁𝑝𝑟𝑜𝑗𝑃. 𝑁𝑝𝑟𝑜𝑗 corresponds to the current lateral scan position, and initially
𝑁𝑝𝑟𝑜𝑗(𝑁𝑝𝑟𝑜𝑗_𝑥, 𝑁𝑝𝑟𝑜𝑗_𝑦, 𝑁𝑝𝑟𝑜𝑗_𝑧) = 𝑁𝑖𝑛𝑖(𝑁𝑖𝑛𝑖_𝑥, 𝑁𝑖𝑛𝑖_𝑦, 𝑁𝑖𝑛𝑖_𝑧).
The rotation is calculated with the matrix for the rotation of point 𝑁𝑝𝑟𝑜𝑗(𝑁𝑝𝑟𝑜𝑗_𝑥, 𝑁𝑝𝑟𝑜𝑗_𝑦, 𝑁𝑝𝑟𝑜𝑗_𝑧) by
an angle 𝑡𝑒𝑡𝑎1 about the axis of direction 𝝍(𝑌𝑥, 𝑌𝑦, 𝑌𝑧) going through point 𝐴(𝐴𝑥, 𝐴𝑦, 𝐴𝑧).
Let:
𝑠 = sin(𝑡𝑒𝑡𝑎1) , 𝑐 = cos(𝑡𝑒𝑡𝑎1) , 𝑡 = 1 − 𝑐
Let us apply transformation matrix to point 𝑁𝑝𝑟𝑜𝑗:
𝑚𝑎𝑡1𝑥 = (𝐴𝑥 ∗ (𝑌𝑦2 + 𝑌𝑧2) − 𝑌𝑥 ∗ (𝐴𝑦 ∗ 𝑌𝑦 + 𝐴𝑧 ∗ 𝑌𝑧 − 𝑌𝑥 ∗ 𝑁𝑝𝑟𝑜𝑗_𝑥 − 𝑌𝑦 ∗ 𝑁𝑝𝑟𝑜𝑗_𝑦 − 𝑌𝑧 ∗
𝑁𝑝𝑟𝑜𝑗_𝑧)) ∗ 𝑡
𝑚𝑎𝑡2𝑥 = 𝑁𝑝𝑟𝑜𝑗_𝑥 ∗ 𝑐 + (−𝐴𝑧 ∗ 𝑌𝑦 + 𝐴𝑦 ∗ 𝑌𝑧 − 𝑌𝑧 ∗ 𝑁𝑝𝑟𝑜𝑗_𝑦 + 𝑌𝑦 ∗ 𝑁𝑝𝑟𝑜𝑗_𝑧) ∗ 𝑠
𝑁𝑝𝑟𝑜𝑗𝑃_𝑥 = 𝑚𝑎𝑡1𝑥 + 𝑚𝑎𝑡2𝑥
𝑚𝑎𝑡1𝑦 = (𝐴𝑦 ∗ (𝑌𝑥^2 + 𝑌𝑧^2) − 𝑌𝑦 ∗ (𝐴𝑥 ∗ 𝑌𝑥 + 𝐴𝑧 ∗ 𝑌𝑧 − 𝑌𝑥 ∗ 𝑁𝑝𝑟𝑜𝑗_𝑥 − 𝑌𝑦 ∗ 𝑁𝑝𝑟𝑜𝑗_𝑦 − 𝑌𝑧 ∗ 𝑁𝑝𝑟𝑜𝑗_𝑧)) ∗ 𝑡
𝑚𝑎𝑡2𝑦 = 𝑁𝑝𝑟𝑜𝑗_𝑦 ∗ 𝑐 + (𝐴𝑧 ∗ 𝑌𝑥 − 𝐴𝑥 ∗ 𝑌𝑧 + 𝑌𝑧 ∗ 𝑁𝑝𝑟𝑜𝑗_𝑥 − 𝑌𝑥 ∗ 𝑁𝑝𝑟𝑜𝑗_𝑧) ∗ 𝑠
𝑁𝑝𝑟𝑜𝑗𝑃_𝑦 = 𝑚𝑎𝑡1𝑦 + 𝑚𝑎𝑡2𝑦
𝑚𝑎𝑡1𝑧 = (𝐴𝑧 ∗ (𝑌𝑥^2 + 𝑌𝑦^2) − 𝑌𝑧 ∗ (𝐴𝑥 ∗ 𝑌𝑥 + 𝐴𝑦 ∗ 𝑌𝑦 − 𝑌𝑥 ∗ 𝑁𝑝𝑟𝑜𝑗_𝑥 − 𝑌𝑦 ∗ 𝑁𝑝𝑟𝑜𝑗_𝑦 − 𝑌𝑧 ∗ 𝑁𝑝𝑟𝑜𝑗_𝑧)) ∗ 𝑡
𝑚𝑎𝑡2𝑧 = 𝑁𝑝𝑟𝑜𝑗_𝑧 ∗ 𝑐 + (−𝐴𝑦 ∗ 𝑌𝑥 + 𝐴𝑥 ∗ 𝑌𝑦 − 𝑌𝑦 ∗ 𝑁𝑝𝑟𝑜𝑗_𝑥 + 𝑌𝑥 ∗ 𝑁𝑝𝑟𝑜𝑗_𝑦) ∗ 𝑠
𝑁𝑝𝑟𝑜𝑗𝑃_𝑧 = 𝑚𝑎𝑡1𝑧 + 𝑚𝑎𝑡2𝑧
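The rotation of a point around an axis going through 𝐴 can be written as one reusable function. The sketch below transcribes the mat1/mat2 terms given above; the function name rotate_about_line and its argument names are hypothetical, and the axis direction is assumed to be a unit vector.

```python
import math

def rotate_about_line(p, a, u, theta):
    """Rotate point p by theta (radians) about the line through a with unit direction u."""
    x, y, z = p
    ax, ay, az = a
    ux, uy, uz = u
    s, c = math.sin(theta), math.cos(theta)
    t = 1.0 - c
    along = ux * x + uy * y + uz * z  # component of p along the axis direction
    # Each coordinate is mat1 (the term multiplied by t) plus mat2 (the c and s terms)
    rx = ((ax * (uy**2 + uz**2) - ux * (ay * uy + az * uz - along)) * t
          + x * c + (-az * uy + ay * uz - uz * y + uy * z) * s)
    ry = ((ay * (ux**2 + uz**2) - uy * (ax * ux + az * uz - along)) * t
          + y * c + (az * ux - ax * uz + uz * x - ux * z) * s)
    rz = ((az * (ux**2 + uy**2) - uz * (ax * ux + ay * uy - along)) * t
          + z * c + (-ay * ux + ax * uy - uy * x + ux * y) * s)
    return (rx, ry, rz)
```

As a quick check, rotating (1, 0, 0) by 90 degrees about the z axis through the origin returns (0, 1, 0) up to rounding.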
After each vertical -35 to +35 deg. scan, i.e. after 14 vertical rotations, the scan is rotated laterally in order
to update the lateral position 𝑁𝑝𝑟𝑜𝑗 from which the next vertical rotation is performed: the scan is
restarted in order to cover a new vertical region. The lateral shift is performed 14 times. Hence, taking into
account that the first vertical scan does not require a lateral rotation, the total number of tilts is 14 * 14 =
196, which covers a forward spherical scan of -35 to +35 degrees.
Lateral rotation coordinates 𝑁𝑝𝑟𝑜𝑗(𝑁𝑝𝑟𝑜𝑗_𝑥, 𝑁𝑝𝑟𝑜𝑗_𝑦, 𝑁𝑝𝑟𝑜𝑗_𝑧) of point 𝑁𝑖𝑛𝑖 around 𝝍 are given by:
𝑠 = sin(𝑡𝑒𝑡𝑎2) , 𝑐 = cos(𝑡𝑒𝑡𝑎2) , 𝑡 = 1 − 𝑐, where 𝑡𝑒𝑡𝑎2 is the lateral angle increment.
𝑚𝑎𝑡1𝑥 = (𝐴𝑥 ∗ (𝑊𝑦^2 + 𝑊𝑧^2) − 𝑊𝑥 ∗ (𝐴𝑦 ∗ 𝑊𝑦 + 𝐴𝑧 ∗ 𝑊𝑧 − 𝑊𝑥 ∗ 𝑁𝑖𝑛𝑖_𝑥 − 𝑊𝑦 ∗ 𝑁𝑖𝑛𝑖_𝑦 − 𝑊𝑧 ∗ 𝑁𝑖𝑛𝑖_𝑧)) ∗ 𝑡
𝑚𝑎𝑡2𝑥 = 𝑁𝑖𝑛𝑖_𝑥 ∗ 𝑐 + (−𝐴𝑧 ∗ 𝑊𝑦 + 𝐴𝑦 ∗ 𝑊𝑧 − 𝑊𝑧 ∗ 𝑁𝑖𝑛𝑖_𝑦 + 𝑊𝑦 ∗ 𝑁𝑖𝑛𝑖_𝑧) ∗ 𝑠
𝑁𝑝𝑟𝑜𝑗_𝑥 = 𝑚𝑎𝑡1𝑥 + 𝑚𝑎𝑡2𝑥
𝑚𝑎𝑡1𝑦 = (𝐴𝑦 ∗ (𝑊𝑥^2 + 𝑊𝑧^2) − 𝑊𝑦 ∗ (𝐴𝑥 ∗ 𝑊𝑥 + 𝐴𝑧 ∗ 𝑊𝑧 − 𝑊𝑥 ∗ 𝑁𝑖𝑛𝑖_𝑥 − 𝑊𝑦 ∗ 𝑁𝑖𝑛𝑖_𝑦 − 𝑊𝑧 ∗ 𝑁𝑖𝑛𝑖_𝑧)) ∗ 𝑡
𝑚𝑎𝑡2𝑦 = 𝑁𝑖𝑛𝑖_𝑦 ∗ 𝑐 + (𝐴𝑧 ∗ 𝑊𝑥 − 𝐴𝑥 ∗ 𝑊𝑧 + 𝑊𝑧 ∗ 𝑁𝑖𝑛𝑖_𝑥 − 𝑊𝑥 ∗ 𝑁𝑖𝑛𝑖_𝑧) ∗ 𝑠
𝑁𝑝𝑟𝑜𝑗_𝑦 = 𝑚𝑎𝑡1𝑦 + 𝑚𝑎𝑡2𝑦
𝑚𝑎𝑡1𝑧 = (𝐴𝑧 ∗ (𝑊𝑥^2 + 𝑊𝑦^2) − 𝑊𝑧 ∗ (𝐴𝑥 ∗ 𝑊𝑥 + 𝐴𝑦 ∗ 𝑊𝑦 − 𝑊𝑥 ∗ 𝑁𝑖𝑛𝑖_𝑥 − 𝑊𝑦 ∗ 𝑁𝑖𝑛𝑖_𝑦 − 𝑊𝑧 ∗ 𝑁𝑖𝑛𝑖_𝑧)) ∗ 𝑡
𝑚𝑎𝑡2𝑧 = 𝑁𝑖𝑛𝑖_𝑧 ∗ 𝑐 + (−𝐴𝑦 ∗ 𝑊𝑥 + 𝐴𝑥 ∗ 𝑊𝑦 − 𝑊𝑦 ∗ 𝑁𝑖𝑛𝑖_𝑥 + 𝑊𝑥 ∗ 𝑁𝑖𝑛𝑖_𝑦) ∗ 𝑠
𝑁𝑝𝑟𝑜𝑗_𝑧 = 𝑚𝑎𝑡1𝑧 + 𝑚𝑎𝑡2𝑧
Each scanned direction is assigned into three arrays 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑥, 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑦, 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑧,
containing the respective coordinate components of the 1 Å projection of point 𝐴 along a scan direction.
To simplify, let the ensemble of points 𝑁𝑝𝑟𝑜𝑗 and 𝑁𝑝𝑟𝑜𝑗𝑃 (depicting the lateral and vertical rotations)
be 𝑁𝑟𝑜𝑡 (N rotated). A rotation direction rank i is recorded, such that 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑥[𝑖], 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑦[𝑖],
𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑧[𝑖], contain the respective x, y, z coordinates of 1 Å projection of point 𝐴 along the 𝑖th
scan direction, and 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑥[𝑖] = 𝑁𝑟𝑜𝑡[𝑖]_𝑥, 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑦[𝑖] = 𝑁𝑟𝑜𝑡[𝑖]_𝑦, 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑧[𝑖] =
𝑁𝑟𝑜𝑡[𝑖]_𝑧.
b) Project points
For each scan direction, the next step is to project the coordinates of the pathway atom centers lying in
front of the scanned direction onto the plane 𝐷𝐼𝑅 (plane 1) defined by point 𝐴 and the scan direction. The
protein points that belong to a cylinder of radius 20 Å and length 3 Å in front of plane 1 are projected onto
plane 1. Cylinder atoms are the points that lie between plane 1 and plane 2 (𝐷𝐼𝑅2), which is 3 Å ahead of
plane 1, and that are at a distance of at most 20 Å from the axis going from 𝐴 to the 3 Å projection of 𝐴
along the scanned direction.
Plane 𝐷𝐼𝑅 is defined by 𝐴 and 𝒏𝒔𝒄𝒂𝒏, where 𝑁𝑥,𝑁𝑦,𝑁𝑧 are the new parameters of vector 𝒏𝒔𝒄𝒂𝒏.
𝑁𝑥 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑥[𝑖] − 𝐴𝑥, 𝑁𝑦 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑦[𝑖] − 𝐴𝑦, 𝑁𝑧 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑧[𝑖] − 𝐴𝑧
𝒏𝒔𝒄𝒂𝒏 need not be normalized, because the ith 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠 point is already at 1 Å from 𝐴.
Plane 𝐷𝐼𝑅2 is defined by the 3 Å projection of plane DIR along 𝒏𝒔𝒄𝒂𝒏, where:
𝐴𝑃 is the 3 Å projection of 𝐴 along 𝒏𝒔𝒄𝒂𝒏
𝒏𝒑(𝑁𝑃𝑥,𝑁𝑃𝑦,𝑁𝑃𝑧) is a normal vector of the plane going through point 𝐴𝑃(𝐴𝑃𝑥, 𝐴𝑃𝑦, 𝐴𝑃𝑧), and
pointing towards 𝐴, verifying 𝒏𝒑 = −𝒏𝒔𝒄𝒂𝒏.
𝐴𝑃𝑥 = 𝐴𝑥 + 3 ∗ 𝑁𝑥
𝐴𝑃𝑦 = 𝐴𝑦 + 3 ∗ 𝑁𝑦
𝐴𝑃𝑧 = 𝐴𝑧 + 3 ∗ 𝑁𝑧
𝑁𝑃𝑥 = −𝑁𝑥
𝑁𝑃𝑦 = −𝑁𝑦
𝑁𝑃𝑧 = −𝑁𝑧
Let an atom that belongs to the protein structure be defined by the point
𝑎𝑡𝑜𝑚(𝑎𝑡𝑜𝑚_𝑥, 𝑎𝑡𝑜𝑚_𝑦, 𝑎𝑡𝑜𝑚_𝑧).
Let a vector 𝒖(𝑈𝑥, 𝑈𝑦, 𝑈𝑧) go from 𝐴 to 𝑎𝑡𝑜𝑚.
𝑈𝑥 = 𝑎𝑡𝑜𝑚_𝑥 − 𝐴𝑥, 𝑈𝑦 = 𝑎𝑡𝑜𝑚_𝑦 − 𝐴𝑦, 𝑈𝑧 = 𝑎𝑡𝑜𝑚_𝑧 − 𝐴𝑧
Let a vector 𝒗(𝑉𝑥, 𝑉𝑦, 𝑉𝑧) go from 𝐴𝑃 to 𝑎𝑡𝑜𝑚.
𝑉𝑥 = 𝑎𝑡𝑜𝑚_𝑥 − 𝐴𝑃𝑥, 𝑉𝑦 = 𝑎𝑡𝑜𝑚_𝑦 − 𝐴𝑃𝑦, 𝑉𝑧 = 𝑎𝑡𝑜𝑚_𝑧 − 𝐴𝑃𝑧
An atom that lies in between the two planes will verify 𝑑𝑜𝑡_1 = 𝑑𝑜𝑡(𝒏𝒔𝒄𝒂𝒏, 𝒖) > 0 and 𝑑𝑜𝑡_2 =
𝑑𝑜𝑡(𝒏𝒑, 𝒗) > 0
𝑑𝑜𝑡_1 = 𝑁𝑥 ∗ 𝑈𝑥 + 𝑁𝑦 ∗ 𝑈𝑦 + 𝑁𝑧 ∗ 𝑈𝑧
𝑑𝑜𝑡_2 = 𝑁𝑃𝑥 ∗ 𝑉𝑥 + 𝑁𝑃𝑦 ∗ 𝑉𝑦 + 𝑁𝑃𝑧 ∗ 𝑉𝑧
An atom whose distance to the axis going through 𝐴 and 𝐴𝑃 is also less than or equal to 20 Å
belongs to the forward cylinder of length 3 Å and radius 20 Å, where the distance is calculated as:
𝑟𝑎𝑑𝑖𝑢𝑠 = |𝑐𝑟𝑜𝑠𝑠(𝒖, 𝒗)| / |𝑨 − 𝑨𝑷|
Let 𝒘 = 𝑨 − 𝑨𝑷:
𝑊𝑥 = 𝐴𝑥 − 𝐴𝑃𝑥, 𝑊𝑦 = 𝐴𝑦 − 𝐴𝑃𝑦, 𝑊𝑧 = 𝐴𝑧 − 𝐴𝑃𝑧
|𝑨 − 𝑨𝑷| = 𝑊𝑛𝑜𝑟𝑚 = (𝑊𝑥^2 + 𝑊𝑦^2 + 𝑊𝑧^2)^0.5
Let 𝑐𝑟𝑜𝑠𝑠(𝒖, 𝒗) = 𝑐𝑟𝑜𝑠𝑠_𝑥, 𝑐𝑟𝑜𝑠𝑠_𝑦, 𝑐𝑟𝑜𝑠𝑠_𝑧
𝑐𝑟𝑜𝑠𝑠_𝑥 = 𝑈𝑦 ∗ 𝑉𝑧 − 𝑈𝑧 ∗ 𝑉𝑦
𝑐𝑟𝑜𝑠𝑠_𝑦 = 𝑈𝑧 ∗ 𝑉𝑥 − 𝑈𝑥 ∗ 𝑉𝑧
𝑐𝑟𝑜𝑠𝑠_𝑧 = 𝑈𝑥 ∗ 𝑉𝑦 − 𝑈𝑦 ∗ 𝑉𝑥
|𝑐𝑟𝑜𝑠𝑠(𝒖, 𝒗)| = 𝑐𝑟𝑜𝑠𝑠_𝑛𝑜𝑟𝑚 = (𝑐𝑟𝑜𝑠𝑠_𝑥^2 + 𝑐𝑟𝑜𝑠𝑠_𝑦^2 + 𝑐𝑟𝑜𝑠𝑠_𝑧^2)^0.5
𝑟𝑎𝑑𝑖𝑢𝑠 = 𝑐𝑟𝑜𝑠𝑠_𝑛𝑜𝑟𝑚/𝑊𝑛𝑜𝑟𝑚
Each atom that verifies 𝑑𝑜𝑡_1 > 0, 𝑑𝑜𝑡_2 > 0, and 𝑟𝑎𝑑𝑖𝑢𝑠 ≤ 20, is projected onto plane 1:
Let 𝑡_𝑝𝑟𝑜𝑗 be the projection parameter, such that the projection of the atom onto plane 1, defined by
𝐴(𝐴𝑥, 𝐴𝑦, 𝐴𝑧) and 𝒏𝒔𝒄𝒂𝒏(𝑁𝑥,𝑁𝑦, 𝑁𝑧), is given by 𝑎𝑡𝑜𝑚 + 𝑡_𝑝𝑟𝑜𝑗 * 𝒏𝒔𝒄𝒂𝒏.
𝑡_𝑝𝑟𝑜𝑗 verifies:
𝑡_𝑝𝑟𝑜𝑗 = (𝑁𝑥 ∗ 𝐴𝑥 − 𝑁𝑥 ∗ 𝑎𝑡𝑜𝑚_𝑥 + 𝑁𝑦 ∗ 𝐴𝑦 − 𝑁𝑦 ∗ 𝑎𝑡𝑜𝑚_𝑦 + 𝑁𝑧 ∗ 𝐴𝑧 − 𝑁𝑧 ∗ 𝑎𝑡𝑜𝑚_𝑧)/(𝑁𝑥^2 + 𝑁𝑦^2 + 𝑁𝑧^2)
Let 𝑎𝑡𝑜𝑚_𝑝𝑟𝑜𝑗(𝑎𝑡𝑜𝑚_𝑝𝑟𝑜𝑗_𝑥, 𝑎𝑡𝑜𝑚_𝑝𝑟𝑜𝑗_𝑦, 𝑎𝑡𝑜𝑚_𝑝𝑟𝑜𝑗_𝑧) be 𝑎𝑡𝑜𝑚(𝑎𝑡𝑜𝑚_𝑥, 𝑎𝑡𝑜𝑚_𝑦, 𝑎𝑡𝑜𝑚_𝑧)
projected onto plane 1:
𝑎𝑡𝑜𝑚_𝑝𝑟𝑜𝑗_𝑥 = 𝑎𝑡𝑜𝑚_𝑥 + 𝑡_𝑝𝑟𝑜𝑗 ∗ 𝑁𝑥
𝑎𝑡𝑜𝑚_𝑝𝑟𝑜𝑗_𝑦 = 𝑎𝑡𝑜𝑚_𝑦 + 𝑡_𝑝𝑟𝑜𝑗 ∗ 𝑁𝑦
𝑎𝑡𝑜𝑚_𝑝𝑟𝑜𝑗_𝑧 = 𝑎𝑡𝑜𝑚_𝑧 + 𝑡_𝑝𝑟𝑜𝑗 ∗ 𝑁𝑧
Each atom belonging to the cylinder selection and projected onto the plane defined by the scanned
direction and 𝐴 is stored by respective x, y, z components into arrays 𝑝𝑟𝑜𝑗_𝑥, 𝑝𝑟𝑜𝑗_𝑦, 𝑝𝑟𝑜𝑗_𝑧.
Its vdw radius is also stored in array 𝑝𝑟𝑜𝑗_𝑣𝑑𝑤, in the following fashion. If the atom is of hydrogen,
carbon, nitrogen, oxygen, phosphorus, sulfur, magnesium or zinc type, then its vdw radius is set to 1.20,
1.70, 1.55, 1.52, 1.80, 1.80, 1.73 or 1.39 Å respectively. It follows that each scanned direction is
associated with an ensemble of projected points, with their corresponding vdw radii.
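The cylinder selection and the projection onto plane 1 can be combined in a single pass over the atoms. The following is a minimal sketch under stated assumptions: the function name, the (element, position) input format and the element keys of the VDW table are hypothetical, and 𝒏𝒔𝒄𝒂𝒏 is assumed to be a unit vector so that |𝑨 − 𝑨𝑷| equals the cylinder length.

```python
import math

# vdw radii (in Å) listed in the text; the element-symbol keys are an assumed convention
VDW = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52,
       "P": 1.80, "S": 1.80, "MG": 1.73, "ZN": 1.39}

def project_cylinder_atoms(A, n_scan, atoms, length=3.0, max_radius=20.0):
    """Return [(x, y, z, vdw)] for atoms inside the forward cylinder, projected onto plane 1.

    atoms is a list of (element, (x, y, z)); n_scan is assumed to be a unit vector.
    """
    AP = tuple(a + length * n for a, n in zip(A, n_scan))
    projected = []
    for element, pos in atoms:
        u = tuple(p - a for p, a in zip(pos, A))    # vector from A to the atom
        v = tuple(p - a for p, a in zip(pos, AP))   # vector from AP to the atom
        dot_1 = sum(n * c for n, c in zip(n_scan, u))
        dot_2 = -sum(n * c for n, c in zip(n_scan, v))  # plane 2 normal is -n_scan
        cx = u[1] * v[2] - u[2] * v[1]
        cy = u[2] * v[0] - u[0] * v[2]
        cz = u[0] * v[1] - u[1] * v[0]
        # distance to the axis: |cross(u, v)| / |A - AP|, with |A - AP| = length
        radius = math.sqrt(cx * cx + cy * cy + cz * cz) / length
        if dot_1 > 0 and dot_2 > 0 and radius <= max_radius:
            t_proj = -dot_1 / sum(n * n for n in n_scan)
            point = tuple(p + t_proj * n for p, n in zip(pos, n_scan))
            projected.append(point + (VDW[element],))
    return projected
```

With 𝐴 at the origin and a scan direction along z, a carbon atom at (1, 0, 1) is kept and projected to (1, 0, 0), whereas atoms behind plane 1, beyond plane 2 or further than 20 Å from the axis are discarded.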
c) Scan contour
Once a projection map is assigned to each scanned axis, one proceeds to the scan of the projection
contour. The goal is to assign to each projection map a unique contour, such that only the relevant
interlining atoms are selected.
First point 𝐵𝑃 (B prime) is defined as a second point (in addition to 𝐴), to delineate a direction axis.
𝐵𝑃𝑥 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑥[𝑖], 𝐵𝑃𝑦 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑦[𝑖], 𝐵𝑃𝑧 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑧[𝑖]
Then, the point closest to 𝐴 is selected in order to start the circular scan from a first point belonging to
the contour. To perform the latter selection, the distance between each atom belonging to the projection
and 𝐴 is stored into an array called 𝑑𝑖𝑠𝑡. Then, the index of the first occurrence of the minimal
distance in the 𝑑𝑖𝑠𝑡 array is retrieved.
It is worth underlining that 𝐴 represents the point from which a direction axis is pointing. Hence, for
each scanned axis, the distance between the projected points and 𝐴, is equivalent to the distance between
the non-projected points and the axis.
For each projected atom, at index 𝑐𝑜𝑢𝑛𝑡 of the projection arrays, the lateral distance is given by:
𝑑𝑖𝑠𝑡_𝑥 = 𝑝𝑟𝑜𝑗_𝑥[𝑐𝑜𝑢𝑛𝑡] − 𝐴𝑥
𝑑𝑖𝑠𝑡_𝑦 = 𝑝𝑟𝑜𝑗_𝑦[𝑐𝑜𝑢𝑛𝑡] − 𝐴𝑦
𝑑𝑖𝑠𝑡_𝑧 = 𝑝𝑟𝑜𝑗_𝑧[𝑐𝑜𝑢𝑛𝑡] − 𝐴𝑧
𝑑𝑖𝑠𝑡 = (𝑑𝑖𝑠𝑡_𝑥^2 + 𝑑𝑖𝑠𝑡_𝑦^2 + 𝑑𝑖𝑠𝑡_𝑧^2)^0.5 − 𝑝𝑟𝑜𝑗_𝑣𝑑𝑤[𝑐𝑜𝑢𝑛𝑡]
Where −𝑝𝑟𝑜𝑗_𝑣𝑑𝑤[𝑐𝑜𝑢𝑛𝑡] accounts for the deduction of the vdw radius.
Let 𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥 be the index of the first iteration of the minimal value stored in 𝑑𝑖𝑠𝑡 array and let 𝑀𝐼𝑁
be the point corresponding to the first atom belonging to the contour:
𝑀𝐼𝑁𝑥 = 𝑝𝑟𝑜𝑗_𝑥[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]
𝑀𝐼𝑁𝑦 = 𝑝𝑟𝑜𝑗_𝑦[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]
𝑀𝐼𝑁𝑧 = 𝑝𝑟𝑜𝑗_𝑧[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]
Also, because the projected atom coordinates were stored in the 𝑝𝑟𝑜𝑗_𝑥, 𝑝𝑟𝑜𝑗_𝑦, and 𝑝𝑟𝑜𝑗_𝑧 arrays
in the same order as in the 𝑝𝑟𝑜𝑗_𝑣𝑑𝑤 array, the corresponding vdw radius of atom 𝑀𝐼𝑁 is:
𝑣𝑑𝑤_𝑎𝑡𝑚 = 𝑝𝑟𝑜𝑗_𝑣𝑑𝑤[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]
The four values of the first contour atom (x, y, z coordinates, vdw radius) are then stored into 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑥,
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑦, 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑧, and 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤 arrays respectively.
The latter arrays store information about the contour, correspond to a refined state of the projection map,
and will later be used, notably to determine the correct axial direction.
Then the actual 360 deg. contour scan is performed, starting from the atom selected above, and using
the projection map and the tested axis direction. This is done in six steps that are outlined below.
The calculation that will be performed starts from the previous atom belonging to the contour, and
searches for atoms belonging to a 2 Å lateral window from that atom.
In order to cover a full 360 deg. circular scan, each increment of the scan is done successively in the
same circular direction: positive angle.
Step 1
The angle required to cover a lateral range of 2 Å is calculated.
Let radius be the distance to 𝐴, and 𝑀𝐼𝑁 be the previous contour atom from which the next dial is to be
scanned.
𝑟𝑎𝑑𝑖𝑢𝑠 = ((𝑀𝐼𝑁𝑥 − 𝐴𝑥)^2 + (𝑀𝐼𝑁𝑦 − 𝐴𝑦)^2 + (𝑀𝐼𝑁𝑧 − 𝐴𝑧)^2)^0.5
Let 𝑡𝑒𝑡𝑎 be the angle required to cover a lateral range of 2 Å and 𝑙𝑎𝑡_𝑤𝑖𝑛𝑑𝑜𝑤 = 2 Å,
𝑐𝑜𝑠 = (2 ∗ 𝑟𝑎𝑑𝑖𝑢𝑠^2 − 𝑙𝑎𝑡_𝑤𝑖𝑛𝑑𝑜𝑤^2)/(2 ∗ 𝑟𝑎𝑑𝑖𝑢𝑠^2)
𝑡𝑒𝑡𝑎 = 𝑎𝑡𝑎𝑛2((1 − 𝑐𝑜𝑠^2)^0.5, 𝑐𝑜𝑠)
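Step 1 is an application of the law of cosines to the isosceles triangle formed by 𝐴 and a chord of length 𝑙𝑎𝑡_𝑤𝑖𝑛𝑑𝑜𝑤 at distance 𝑟𝑎𝑑𝑖𝑢𝑠. A minimal sketch is given below, where the sine is recovered as the square root of 1 − cos²; the function name dial_angle is hypothetical, 𝑟𝑎𝑑𝑖𝑢𝑠 is assumed to be strictly positive, and the clamping of the cosine is an added assumption for the case where the window is wider than twice the radius.

```python
import math

def dial_angle(radius, lat_window=2.0):
    """Angle (radians) subtended at A by a chord of length lat_window at distance radius."""
    cos_t = (2 * radius**2 - lat_window**2) / (2 * radius**2)
    # Assumption (not in the text): clamp so that a window wider than 2 * radius
    # yields a half turn instead of a math domain error.
    cos_t = max(-1.0, min(1.0, cos_t))
    sin_t = math.sqrt(1.0 - cos_t**2)
    return math.atan2(sin_t, cos_t)
```

For a contour atom at 2 Å from 𝐴, the 2 Å window corresponds to an equilateral triangle, and dial_angle(2.0) returns 60 degrees (π/3 radians).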
Step 2
Second, the borders of the dial are calculated, in order to extract atoms belonging to the dial (i.e.
belonging to that particular circular region).
To do so, two planes are to be defined.
One dial border is represented by the plane going through atom 𝑀𝐼𝑁 (dial start) and 𝐴, with normal
vector orthogonal to the axis.
Plane 1:
Let 𝑾(𝑊𝑥,𝑊𝑦,𝑊𝑧) be the axis vector,
𝑊𝑥 = 𝐵𝑃𝑥 − 𝐴𝑥, 𝑊𝑦 = 𝐵𝑃𝑦 − 𝐴𝑦, 𝑊𝑧 = 𝐵𝑃𝑧 − 𝐴𝑧
Let 𝑼(𝑈𝑥, 𝑈𝑦, 𝑈𝑧) be the vector between point 𝐴 and point 𝑀𝐼𝑁,
𝑈𝑥 = 𝑀𝐼𝑁𝑥 − 𝐴𝑥, 𝑈𝑦 = 𝑀𝐼𝑁𝑦 − 𝐴𝑦, 𝑈𝑧 = 𝑀𝐼𝑁𝑧 − 𝐴𝑧
Plane 1 is defined by normal vector 𝒏𝟏(𝑛1_𝑥, 𝑛1_𝑦, 𝑛1_𝑧) going through point 𝐴, where,
𝒏𝟏 = 𝑐𝑟𝑜𝑠𝑠(𝑾,𝑼)
𝑛1_𝑥 = 𝑊𝑦 ∗ 𝑈𝑧 − 𝑊𝑧 ∗ 𝑈𝑦
𝑛1_𝑦 = 𝑊𝑧 ∗ 𝑈𝑥 − 𝑊𝑥 ∗ 𝑈𝑧
𝑛1_𝑧 = 𝑊𝑥 ∗ 𝑈𝑦 − 𝑊𝑦 ∗ 𝑈𝑥
The upper border of the dial can be expressed as the plane going through atom 𝑀𝐼𝑁 prime (𝑎𝑡𝑜𝑚𝑃),
where 𝑎𝑡𝑜𝑚𝑃 is rotation of 𝑀𝐼𝑁 around axis 𝐴— 𝐵𝑃 (scan axis) of angle 𝑡𝑒𝑡𝑎, with normal vector
orthogonal to the axis, but pointing in the opposite angle direction to plane 1.
To find the upper border of the dial, the positive direction rotation of point 𝑀𝐼𝑁 around axis 𝑾 is
calculated. Let 𝑎𝑡𝑜𝑚𝑃(𝑎𝑡𝑜𝑚𝑃𝑥, 𝑎𝑡𝑜𝑚𝑃𝑦, 𝑎𝑡𝑜𝑚𝑃𝑧) be the rotation image of 𝑀𝐼𝑁 around 𝑾, going
through point 𝐴 with an angle of 𝑡𝑒𝑡𝑎 (angle calculated above corresponding to a lateral window of 2
Å).
The rotation image is calculated as follows:
𝑊𝑛𝑜𝑟𝑚 = (𝑊𝑥^2 + 𝑊𝑦^2 + 𝑊𝑧^2)^0.5
𝑊𝑥 = 𝑊𝑥/𝑊𝑛𝑜𝑟𝑚, 𝑊𝑦 = 𝑊𝑦/𝑊𝑛𝑜𝑟𝑚, 𝑊𝑧 = 𝑊𝑧/𝑊𝑛𝑜𝑟𝑚
𝑠 = sin(𝑡𝑒𝑡𝑎) , 𝑐 = cos(𝑡𝑒𝑡𝑎) , 𝑡 = 1 − 𝑐,
𝑚𝑎𝑡1𝑥 = (𝐴𝑥 ∗ (𝑊𝑦^2 + 𝑊𝑧^2) − 𝑊𝑥 ∗ (𝐴𝑦 ∗ 𝑊𝑦 + 𝐴𝑧 ∗ 𝑊𝑧 − 𝑊𝑥 ∗ 𝑀𝐼𝑁𝑥 − 𝑊𝑦 ∗ 𝑀𝐼𝑁𝑦 − 𝑊𝑧 ∗ 𝑀𝐼𝑁𝑧)) ∗ 𝑡
𝑚𝑎𝑡2𝑥 = 𝑀𝐼𝑁𝑥 ∗ 𝑐 + (−𝐴𝑧 ∗ 𝑊𝑦 + 𝐴𝑦 ∗ 𝑊𝑧 − 𝑊𝑧 ∗ 𝑀𝐼𝑁𝑦 + 𝑊𝑦 ∗ 𝑀𝐼𝑁𝑧) ∗ 𝑠
𝑎𝑡𝑜𝑚𝑃𝑥 = 𝑚𝑎𝑡1𝑥 + 𝑚𝑎𝑡2𝑥
𝑚𝑎𝑡1𝑦 = (𝐴𝑦 ∗ (𝑊𝑥^2 + 𝑊𝑧^2) − 𝑊𝑦 ∗ (𝐴𝑥 ∗ 𝑊𝑥 + 𝐴𝑧 ∗ 𝑊𝑧 − 𝑊𝑥 ∗ 𝑀𝐼𝑁𝑥 − 𝑊𝑦 ∗ 𝑀𝐼𝑁𝑦 − 𝑊𝑧 ∗ 𝑀𝐼𝑁𝑧)) ∗ 𝑡
𝑚𝑎𝑡2𝑦 = 𝑀𝐼𝑁𝑦 ∗ 𝑐 + (𝐴𝑧 ∗ 𝑊𝑥 − 𝐴𝑥 ∗ 𝑊𝑧 + 𝑊𝑧 ∗ 𝑀𝐼𝑁𝑥 − 𝑊𝑥 ∗ 𝑀𝐼𝑁𝑧) ∗ 𝑠
𝑎𝑡𝑜𝑚𝑃𝑦 = 𝑚𝑎𝑡1𝑦 + 𝑚𝑎𝑡2𝑦
𝑚𝑎𝑡1𝑧 = (𝐴𝑧 ∗ (𝑊𝑥^2 + 𝑊𝑦^2) − 𝑊𝑧 ∗ (𝐴𝑥 ∗ 𝑊𝑥 + 𝐴𝑦 ∗ 𝑊𝑦 − 𝑊𝑥 ∗ 𝑀𝐼𝑁𝑥 − 𝑊𝑦 ∗ 𝑀𝐼𝑁𝑦 − 𝑊𝑧 ∗ 𝑀𝐼𝑁𝑧)) ∗ 𝑡
𝑚𝑎𝑡2𝑧 = 𝑀𝐼𝑁𝑧 ∗ 𝑐 + (−𝐴𝑦 ∗ 𝑊𝑥 + 𝐴𝑥 ∗ 𝑊𝑦 − 𝑊𝑦 ∗ 𝑀𝐼𝑁𝑥 + 𝑊𝑥 ∗ 𝑀𝐼𝑁𝑦) ∗ 𝑠
𝑎𝑡𝑜𝑚𝑃𝑧 = 𝑚𝑎𝑡1𝑧 + 𝑚𝑎𝑡2𝑧
Let 𝑿(𝑋𝑥, 𝑋𝑦, 𝑋𝑧) be the vector between point 𝐴 and point 𝑎𝑡𝑜𝑚𝑃,
𝑋𝑥 = 𝑎𝑡𝑜𝑚𝑃𝑥 − 𝐴𝑥, 𝑋𝑦 = 𝑎𝑡𝑜𝑚𝑃𝑦 − 𝐴𝑦, 𝑋𝑧 = 𝑎𝑡𝑜𝑚𝑃𝑧 − 𝐴𝑧
Plane 2 is defined by normal vector 𝒏𝟐(𝑛2_𝑥, 𝑛2_𝑦, 𝑛2_𝑧) going through point 𝐴, where,
𝒏𝟐 = −𝑐𝑟𝑜𝑠𝑠(𝑾,𝑿)
𝑛2_𝑥 = 𝑊𝑧 ∗ 𝑋𝑦 − 𝑊𝑦 ∗ 𝑋𝑧
𝑛2_𝑦 = 𝑊𝑥 ∗ 𝑋𝑧 − 𝑊𝑧 ∗ 𝑋𝑥
𝑛2_𝑧 = 𝑊𝑦 ∗ 𝑋𝑥 − 𝑊𝑥 ∗ 𝑋𝑦
Step 3
Then the atoms that belong to the dial are extracted. This is done by checking if they lie in-between
plane 1 and plane 2.
For each 𝑐𝑜𝑢𝑛𝑡 rank of the projection map array, each atom is represented by:
𝑎𝑡𝑜𝑚_𝑥 = 𝑝𝑟𝑜𝑗_𝑥[𝑐𝑜𝑢𝑛𝑡], 𝑎𝑡𝑜𝑚_𝑦 = 𝑝𝑟𝑜𝑗_𝑦[𝑐𝑜𝑢𝑛𝑡], 𝑎𝑡𝑜𝑚_𝑧 = 𝑝𝑟𝑜𝑗_𝑧[𝑐𝑜𝑢𝑛𝑡]
𝑎𝑡𝑜𝑚_𝑣𝑑𝑤 = 𝑝𝑟𝑜𝑗_𝑣𝑑𝑤[𝑐𝑜𝑢𝑛𝑡]
An atom of the projection map belongs to the dial if it lies in-between plane 1 and plane 2, hence if
𝑑𝑜𝑡_1 = 𝑑𝑜𝑡(𝒏𝟏, 𝑎𝑡𝑜𝑚 − 𝐴) > 0
and 𝑑𝑜𝑡_2 = 𝑑𝑜𝑡(𝒏𝟐, 𝑎𝑡𝑜𝑚 − 𝐴) > 0
Calculation:
𝑑𝑜𝑡_1 = 𝑛1_𝑥 ∗ (𝑎𝑡𝑜𝑚_𝑥 − 𝐴𝑥) + 𝑛1_𝑦 ∗ (𝑎𝑡𝑜𝑚_𝑦 − 𝐴𝑦) + 𝑛1_𝑧 ∗ (𝑎𝑡𝑜𝑚_𝑧 − 𝐴𝑧)
𝑑𝑜𝑡_2 = 𝑛2_𝑥 ∗ (𝑎𝑡𝑜𝑚_𝑥 − 𝐴𝑥) + 𝑛2_𝑦 ∗ (𝑎𝑡𝑜𝑚_𝑦 − 𝐴𝑦) + 𝑛2_𝑧 ∗ (𝑎𝑡𝑜𝑚_𝑧 − 𝐴𝑧)
Because of limited floating-point accuracy, the selection criterion is actually
made to be 𝑑𝑜𝑡_1 > 0.1 and 𝑑𝑜𝑡_2 > 0.1. Otherwise, the algorithm can for example detect 𝑀𝐼𝑁
as lying inside the dial (e.g. 𝑑𝑜𝑡_1 = 0.000000000001647), when it lies exactly on the dial border
(𝑑𝑜𝑡_1 = 0). This would have the effect of selecting twice an atom that was already selected.
Therefore, each projection atom verifying 𝑑𝑜𝑡_1 > 0.1 and 𝑑𝑜𝑡_2 > 0.1 is selected as belonging to the
dial. Its coordinates (𝑎𝑡𝑜𝑚_𝑥, 𝑎𝑡𝑜𝑚_𝑦, 𝑎𝑡𝑜𝑚_𝑧), radius (distance to 𝐴) and vdw (𝑎𝑡𝑜𝑚_𝑣𝑑𝑤) values
are stored into the following arrays: 𝑑𝑖𝑎𝑙_𝑥, 𝑑𝑖𝑎𝑙_𝑦, 𝑑𝑖𝑎𝑙_𝑧, 𝑑𝑖𝑎𝑙_𝑟𝑎𝑑𝑖𝑢𝑠, 𝑑𝑖𝑎𝑙_𝑣𝑑𝑤.
𝑑𝑖𝑎𝑙_𝑟𝑎𝑑𝑖𝑢𝑠 value is calculated as:
𝑟𝑎𝑑𝑖𝑢𝑠 = ((𝑎𝑡𝑜𝑚_𝑥 − 𝐴𝑥)^2 + (𝑎𝑡𝑜𝑚_𝑦 − 𝐴𝑦)^2 + (𝑎𝑡𝑜𝑚_𝑧 − 𝐴𝑧)^2)^0.5 − 𝑎𝑡𝑜𝑚_𝑣𝑑𝑤
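Steps 2 and 3 amount to testing each projected atom against two half-planes that share the scan axis. The sketch below illustrates this under stated assumptions: all function names are hypothetical, and the rotation of 𝑀𝐼𝑁 is written in the Rodrigues form, which is equivalent to the mat1/mat2 expressions for a unit axis going through 𝐴.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rotate_about_axis(p, a, k, theta):
    """Rodrigues rotation of point p about the line through a with unit direction k."""
    v = sub(p, a)
    s, c = math.sin(theta), math.cos(theta)
    kxv, kv = cross(k, v), dot(k, v)
    return tuple(a[i] + v[i] * c + kxv[i] * s + k[i] * kv * (1 - c) for i in range(3))

def atoms_in_dial(A, BP, MIN, theta, proj_atoms, tol=0.1):
    """Return (x, y, z, vdw, radius) for the projected atoms lying between the dial borders."""
    W = sub(BP, A)
    w_norm = math.sqrt(dot(W, W))
    W = tuple(c / w_norm for c in W)
    n1 = cross(W, sub(MIN, A))                      # normal of plane 1 (dial start)
    X = sub(rotate_about_axis(MIN, A, W, theta), A)
    n2 = tuple(-c for c in cross(W, X))             # normal of plane 2 (dial end)
    dial = []
    for x, y, z, vdw in proj_atoms:
        d = sub((x, y, z), A)
        if dot(n1, d) > tol and dot(n2, d) > tol:
            dial.append((x, y, z, vdw, math.sqrt(dot(d, d)) - vdw))
    return dial
```

With 𝐴 at the origin, the axis along z, 𝑀𝐼𝑁 at (2, 0, 0) and a dial angle of 60 degrees, an atom at 30 degrees is selected, whereas 𝑀𝐼𝑁 itself (𝑑𝑜𝑡_1 = 0) and atoms beyond the upper border are rejected.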
Step 4
The dial (i.e., the projection map atoms belonging to that particular angular area) is processed, in order
to extract the atom that is the closest to 𝐴, and therefore corresponds to the relevant interlining atom.
Let 𝑀𝐼𝑁𝑛𝑒𝑤(𝑀𝐼𝑁𝑛𝑒𝑤_𝑥,𝑀𝐼𝑁𝑛𝑒𝑤_𝑦,𝑀𝐼𝑁𝑛𝑒𝑤_𝑧) be that atom, which corresponds to the next atom
belonging to the contour (the first atom being 𝑀𝐼𝑁).
Let 𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥 be the index of the first iteration of the minimal value stored in 𝑑𝑖𝑎𝑙_𝑟𝑎𝑑𝑖𝑢𝑠 array,
𝑀𝐼𝑁𝑛𝑒𝑤 is:
𝑀𝐼𝑁𝑛𝑒𝑤_𝑥 = 𝑑𝑖𝑎𝑙_𝑥[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]
𝑀𝐼𝑁𝑛𝑒𝑤_𝑦 = 𝑑𝑖𝑎𝑙_𝑦[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]
𝑀𝐼𝑁𝑛𝑒𝑤_𝑧 = 𝑑𝑖𝑎𝑙_𝑧[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]
𝑀𝐼𝑁𝑛𝑒𝑤_𝑣𝑑𝑤 = 𝑑𝑖𝑎𝑙_𝑣𝑑𝑤[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]
These four values are stored in the following ith contour arrays: 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑥, 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑦, 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑧,
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤, because the new minimum value (that is to say minimal radius) of the dial belongs to
the pathway contour for the current axis direction being tested. In other words, the atom that has been
extracted through step 1 to 4 corresponds to an inner contour atom, and is consequently stored in the
contour array.
Step 5
So as to keep track of the angular region processed (amount of dial region covered), the angle between
the previously calculated contour atom and the new contour atom is calculated. When the sum of the
dial angles processed reaches 360 degrees, the contour has been scanned in its entirety.
Let 𝑸(𝑄𝑥, 𝑄𝑦, 𝑄𝑧) be the vector going from 𝐴 to 𝑀𝐼𝑁, and 𝑸𝒏𝒆𝒘(𝑄𝑛𝑒𝑤_𝑥, 𝑄𝑛𝑒𝑤_𝑦, 𝑄𝑛𝑒𝑤_𝑧) be the
vector going from 𝐴 to 𝑀𝐼𝑁𝑛𝑒𝑤.
The dial region covered is calculated as the angle between 𝑸 and 𝑸𝒏𝒆𝒘 and is represented by 𝑟𝑜𝑡_𝑠ℎ𝑖𝑓𝑡.
𝑄𝑥 = 𝑀𝐼𝑁𝑥 − 𝐴𝑥, 𝑄𝑦 = 𝑀𝐼𝑁𝑦 − 𝐴𝑦, 𝑄𝑧 = 𝑀𝐼𝑁𝑧 − 𝐴𝑧
𝑄𝑛𝑜𝑟𝑚 = (𝑄𝑥^2 + 𝑄𝑦^2 + 𝑄𝑧^2)^0.5
𝑄𝑥 = 𝑄𝑥/𝑄𝑛𝑜𝑟𝑚, 𝑄𝑦 = 𝑄𝑦/𝑄𝑛𝑜𝑟𝑚, 𝑄𝑧 = 𝑄𝑧/𝑄𝑛𝑜𝑟𝑚
𝑄𝑛𝑒𝑤_𝑥 = 𝑀𝐼𝑁𝑛𝑒𝑤_𝑥 − 𝐴𝑥, 𝑄𝑛𝑒𝑤_𝑦 = 𝑀𝐼𝑁𝑛𝑒𝑤_𝑦 − 𝐴𝑦, 𝑄𝑛𝑒𝑤_𝑧 = 𝑀𝐼𝑁𝑛𝑒𝑤_𝑧 − 𝐴𝑧
𝑄𝑛𝑒𝑤_𝑛𝑜𝑟𝑚 = (𝑄𝑛𝑒𝑤_𝑥^2 + 𝑄𝑛𝑒𝑤_𝑦^2 + 𝑄𝑛𝑒𝑤_𝑧^2)^0.5
𝑄𝑛𝑒𝑤_𝑥 = 𝑄𝑛𝑒𝑤_𝑥/𝑄𝑛𝑒𝑤_𝑛𝑜𝑟𝑚, 𝑄𝑛𝑒𝑤_𝑦 = 𝑄𝑛𝑒𝑤_𝑦/𝑄𝑛𝑒𝑤_𝑛𝑜𝑟𝑚,
𝑄𝑛𝑒𝑤_𝑧 = 𝑄𝑛𝑒𝑤_𝑧/𝑄𝑛𝑒𝑤_𝑛𝑜𝑟𝑚
𝑐𝑜𝑠 = 𝑑𝑜𝑡(𝑸,𝑸𝒏𝒆𝒘) = 𝑄𝑥 ∗ 𝑄𝑛𝑒𝑤_𝑥 + 𝑄𝑦 ∗ 𝑄𝑛𝑒𝑤_𝑦 + 𝑄𝑧 ∗ 𝑄𝑛𝑒𝑤_𝑧
𝑟𝑜𝑡_𝑠ℎ𝑖𝑓𝑡 = 𝑎𝑡𝑎𝑛2((1 − 𝑐𝑜𝑠^2)^0.5, 𝑐𝑜𝑠)
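The angle covered between two successive contour atoms can be sketched as below. The function name rot_shift_angle is hypothetical, the sine is taken as the square root of 1 − cos², and the clamping of the cosine is an added safeguard against rounding; the result is the unsigned angle between 0 and 180 degrees.

```python
import math

def rot_shift_angle(A, MIN, MINnew):
    """Unsigned angle (radians) between the vectors A->MIN and A->MINnew."""
    q = [m - a for m, a in zip(MIN, A)]
    q_new = [m - a for m, a in zip(MINnew, A)]
    q_norm = math.sqrt(sum(c * c for c in q))
    q_new_norm = math.sqrt(sum(c * c for c in q_new))
    cos_t = sum((x / q_norm) * (y / q_new_norm) for x, y in zip(q, q_new))
    # Assumption: guard against rounding that pushes the cosine slightly outside [-1, 1]
    cos_t = max(-1.0, min(1.0, cos_t))
    return math.atan2(math.sqrt(1.0 - cos_t**2), cos_t)

# Accumulating the shifts over four atoms placed every 90 degrees covers a full turn
ring = [(1.0, 0.0, 0.0), (0.0, 3.0, 0.0), (-2.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
rot_shift_inc = 0.0
for i in range(len(ring)):
    rot_shift_inc += rot_shift_angle((0.0, 0.0, 0.0), ring[i], ring[(i + 1) % len(ring)])
```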
Step 6
Finally, the dial is incremented one step forward and the calculations above are repeated until the total
dial region covered has reached 360 degrees.
To increment the next starting dial point, 𝑀𝐼𝑁 of next stage is 𝑀𝐼𝑁𝑛𝑒𝑤 of previous dial, hence:
𝑀𝐼𝑁𝑥 = 𝑀𝐼𝑁𝑛𝑒𝑤_𝑥
𝑀𝐼𝑁𝑦 = 𝑀𝐼𝑁𝑛𝑒𝑤_𝑦
𝑀𝐼𝑁𝑧 = 𝑀𝐼𝑁𝑛𝑒𝑤_𝑧
To keep track of the total dial angle processed, 𝑟𝑜𝑡_𝑠ℎ𝑖𝑓𝑡 dial values are summed into 𝑟𝑜𝑡_𝑠ℎ𝑖𝑓𝑡_𝑖𝑛𝑐
at each stage, with:
𝑟𝑜𝑡_𝑠ℎ𝑖𝑓𝑡_𝑖𝑛𝑐 = 𝑟𝑜𝑡_𝑠ℎ𝑖𝑓𝑡_𝑖𝑛𝑐 + 𝑟𝑜𝑡_𝑠ℎ𝑖𝑓𝑡
If no atoms have been detected inside the dial, the procedures above are repeated but with a larger lateral
window.
This is automated as, if the size of 𝑑𝑖𝑎𝑙_𝑥 array is null (which means that 𝑑𝑖𝑎𝑙_𝑦, 𝑑𝑖𝑎𝑙_𝑧, and 𝑑𝑖𝑎𝑙_𝑣𝑑𝑤
are also empty arrays), then the next dial is incremented with:
𝑙𝑎𝑡_𝑤𝑖𝑛𝑑𝑜𝑤 = 𝑙𝑎𝑡_𝑤𝑖𝑛𝑑𝑜𝑤 + 2
d) Analyze contour
Executing procedures a) to c) yields an ith contour map (whose information is stored into the
contour arrays) for each ith scanned direction. In order to compare the scanned directions with each
other, the object of procedure d) is to further characterize the contours (one contour for each direction)
by assigning to each contour the smallest distance between its van der Waals geometric center and the
surrounding atoms. This serves two purposes at once, since the van der Waals geometric center of the
winning contour will also be used to get the pathway center, the pathway minimal radius at the
corresponding longitudinal region along the pathway, and the cross section area.
First, let us calculate the van der Waals geometric center.
For each 𝑐𝑜𝑢𝑛𝑡 atom rank of the contour array, let:
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑥 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑥[𝑐𝑜𝑢𝑛𝑡]
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑦 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑦[𝑐𝑜𝑢𝑛𝑡]
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑧 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑧[𝑐𝑜𝑢𝑛𝑡]
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤[𝑐𝑜𝑢𝑛𝑡]
To take the atom vdw radius into account in the geometric center calculation, the following relation is
applied, given here as an illustration in the x dimension only:
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤_𝑤𝑒𝑖𝑔ℎ𝑡𝑒𝑑_𝑥 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑥 + 𝑣𝑑𝑤_𝑢𝑛𝑖𝑡_𝑣𝑒𝑐𝑡𝑜𝑟_𝑥 ∗ 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤
Where 𝑣𝑑𝑤_𝑢𝑛𝑖𝑡_𝑣𝑒𝑐𝑡𝑜𝑟 is the unit vector pointing from the contour atom towards 𝐴:
𝑣𝑑𝑤_𝑢𝑛𝑖𝑡_𝑣𝑒𝑐𝑡𝑜𝑟 = (𝐴 − 𝑐𝑜𝑛𝑡𝑜𝑢𝑟) / |𝐴 − 𝑐𝑜𝑛𝑡𝑜𝑢𝑟|
It is important to underline that while 𝐴 represents the initial center of the scan axis, it does not represent
the geometric center.
Calculation:
Let 𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑥, 𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑦, 𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑧 be the components of the vector going from the contour atom to 𝐴, before normalization:
𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑥 = 𝐴𝑥 − 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑥, 𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑦 = 𝐴𝑦 − 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑦, 𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑧 = 𝐴𝑧 −
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑧
𝑣𝑑𝑤_𝑛𝑜𝑟𝑚 = (𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑥^2 + 𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑦^2 + 𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑧^2)^0.5
With 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟 = 𝑣𝑑𝑤_𝑢𝑛𝑖𝑡_𝑣𝑒𝑐𝑡𝑜𝑟 ∗ 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤:
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑥 = (𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑥/𝑣𝑑𝑤_𝑛𝑜𝑟𝑚) ∗ 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑦 = (𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑦/𝑣𝑑𝑤_𝑛𝑜𝑟𝑚) ∗ 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑧 = (𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑧/𝑣𝑑𝑤_𝑛𝑜𝑟𝑚) ∗ 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤
Then summing the different contour atom contributions into 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟 components renders:
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑥 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑥 + 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑥 + 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑥
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑦 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑦 + 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑦 + 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑦
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑧 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑧 + 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑧 + 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑧
Finally, the ith contour interlining atoms vdw geometric center is given by:
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑥 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑥/𝑛𝑏_𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑎𝑡𝑜𝑚𝑠
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑦 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑦/𝑛𝑏_𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑎𝑡𝑜𝑚𝑠
𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑧 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑧/𝑛𝑏_𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑎𝑡𝑜𝑚𝑠
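The vdw geometric center is thus the average of the contour atom positions after each one has been shifted by its vdw radius towards 𝐴. A minimal sketch follows; the function name and the (x, y, z, vdw) tuple layout are hypothetical, and every contour atom is assumed to be distinct from 𝐴.

```python
import math

def vdw_geometric_center(A, contour):
    """Average the contour atoms after shifting each by its vdw radius towards A.

    contour is a list of (x, y, z, vdw) tuples.
    """
    sx = sy = sz = 0.0
    for x, y, z, vdw in contour:
        vx, vy, vz = A[0] - x, A[1] - y, A[2] - z      # from the atom towards A
        norm = math.sqrt(vx * vx + vy * vy + vz * vz)
        sx += x + vx / norm * vdw
        sy += y + vy / norm * vdw
        sz += z + vz / norm * vdw
    nb_contour_atoms = len(contour)
    return (sx / nb_contour_atoms, sy / nb_contour_atoms, sz / nb_contour_atoms)
```

For instance, with 𝐴 at the origin and two atoms of vdw radius 1 Å at (4, 0, 0) and (−2, 0, 0), the shifted points are (3, 0, 0) and (−1, 0, 0), and the center is (1, 0, 0).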
The latter values are then stored in three arrays: 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑥, 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑦, 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑧, that will be
used in procedure e). The vdw geometric center is not a center of mass (i.e. “COM”), but for
convenience, the 𝐶𝑂𝑀 terminology is used.
Before proceeding to step e), the minimal surrounding atom distance to 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟 is calculated.
To do so, the atoms that lie within 20 Å of 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟 are extracted.
The following selection criterion is calculated, where 𝑎𝑡𝑜𝑚(𝑎𝑡𝑜𝑚_𝑥, 𝑎𝑡𝑜𝑚_𝑦, 𝑎𝑡𝑜𝑚_𝑧) is an atom
belonging to the structure, 𝑎𝑡𝑜𝑚_𝑣𝑑𝑤 is its van der Waals radius, 𝑑𝑖𝑠𝑡(𝑑𝑖𝑠𝑡_𝑥, 𝑑𝑖𝑠𝑡_𝑦, 𝑑𝑖𝑠𝑡_𝑧) is the
distance to 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟 and 𝑟𝑎𝑑𝑖𝑢𝑠_𝑐ℎ𝑒𝑐𝑘 is the vdw weighted distance to 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟.
𝑑𝑖𝑠𝑡_𝑥 = 𝑎𝑡𝑜𝑚_𝑥 − 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑥
𝑑𝑖𝑠𝑡_𝑦 = 𝑎𝑡𝑜𝑚_𝑦 − 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑦
𝑑𝑖𝑠𝑡_𝑧 = 𝑎𝑡𝑜𝑚_𝑧 − 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑧
𝑟𝑎𝑑𝑖𝑢𝑠_𝑐ℎ𝑒𝑐𝑘 = (𝑑𝑖𝑠𝑡_𝑥^2 + 𝑑𝑖𝑠𝑡_𝑦^2 + 𝑑𝑖𝑠𝑡_𝑧^2)^0.5 − 𝑎𝑡𝑜𝑚_𝑣𝑑𝑤
To accelerate the selection, only the atoms lying within 20 Å of 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟 (i.e., verifying
𝑟𝑎𝑑𝑖𝑢𝑠_𝑐ℎ𝑒𝑐𝑘 ≤ 20) are stored into an array: 𝑝𝑟𝑜𝑗_𝑟𝑎𝑑𝑖𝑢𝑠_𝑐ℎ𝑒𝑐𝑘.
Let 𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥 be the index of the first iteration of the minimal value stored in
𝑝𝑟𝑜𝑗_𝑟𝑎𝑑𝑖𝑢𝑠_𝑐ℎ𝑒𝑐𝑘 array, and let 𝑚𝑖𝑛 be the minimal contour distance to center:
𝑚𝑖𝑛 = 𝑝𝑟𝑜𝑗_𝑟𝑎𝑑𝑖𝑢𝑠_𝑐ℎ𝑒𝑐𝑘[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]
Finally, the ith contour minimal radius is stored into 𝑝𝑟𝑜𝑗_𝑟𝑎𝑑𝑖𝑢𝑠_𝑚𝑖𝑛 array.
e) Choose the best pathway axis and calculate pathway parameters
Procedures a) to d) are repeated for each tested pathway direction (scanned axis). For each ith scanned
axis, we now have a corresponding minimum radius (stored at ith rank in 𝑝𝑟𝑜𝑗_𝑟𝑎𝑑𝑖𝑢𝑠_𝑚𝑖𝑛) and a van
der Waals reweighted geometric center (stored at ith rank in 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑥, 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑦, 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑧).
The scanned direction whose contour has the largest minimal radius, compared to those of the other
scanned directions, is the winning axis and is selected.
Let 𝑚𝑎𝑥_𝑖𝑛𝑑𝑒𝑥 be the index of the first iteration of the maximal value stored in
𝑝𝑟𝑜𝑗_𝑟𝑎𝑑𝑖𝑢𝑠_𝑚𝑖𝑛 array, and let 𝐶𝑂𝑀(𝐶𝑂𝑀𝑥, 𝐶𝑂𝑀𝑦, 𝐶𝑂𝑀𝑧) be the geometric center of the winning
contour projected onto plane DIR (i.e. from the starting point of the scan):
𝐶𝑂𝑀𝑥 = 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑥[max _𝑖𝑛𝑑𝑒𝑥]
𝐶𝑂𝑀𝑦 = 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑦[max _𝑖𝑛𝑑𝑒𝑥]
𝐶𝑂𝑀𝑧 = 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑧[max _𝑖𝑛𝑑𝑒𝑥]
It follows that the winning axis is given by:
𝑁𝑥 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑥[max _𝑖𝑛𝑑𝑒𝑥]
𝑁𝑦 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑦[max _𝑖𝑛𝑑𝑒𝑥]
𝑁𝑧 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑧[max _𝑖𝑛𝑑𝑒𝑥]
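The selection of the winning axis is a "largest of the minima" search over the scanned directions. A short sketch is given below; the function name and the list-of-tuples layout for the centers and scan points are hypothetical.

```python
def pick_winning_axis(proj_radius_min, proj_COM, scan_axis):
    """Return (max_index, COM, N) for the direction with the largest minimal radius.

    proj_COM and scan_axis are lists of (x, y, z) tuples, indexed like proj_radius_min.
    """
    # max() returns the first index among ties, i.e. the first occurrence of the maximum
    max_index = max(range(len(proj_radius_min)), key=proj_radius_min.__getitem__)
    return max_index, proj_COM[max_index], scan_axis[max_index]
```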
2.3.2. Virtual sphere scan
To advance along the pathway with a constant step size, the second part of the algorithm performs a
virtual sphere scan. Starting from the previous point belonging to the pathway axis (refined starting
point for the beginning of the pathway axis generation), a spherical region is scanned along the fixed
axis. To do so, a principle very similar to 2.3.1 is employed. For each scan axis, a sphere center is defined
as the projection of point 𝐴 along the axis, with a projection step equal to the previous minimum pathway
radius +3 Å. By doing so, the sphere scan is close enough to stay within the pathway resolution, but far
enough to capture information about the longitudinal spread of the pathway. For each sphere projected
along a sphere_step distance in the tested scanned direction, the minimum distance from surrounding
atoms is stored into an array. The sphere with the largest minimal radius, relative to the other scans,
is selected. The latter sphere is used to refine a main pathway direction. Then a second sphere
scan is performed, starting from the winning direction of the first sphere scan, in order to resolve the
inner irregularities of the channel and to generate a final pathway point belonging to the
computed pathway axis. The calculations for the virtual sphere scans are similar to the ones described
previously and will not be specified here. The double virtual sphere scan method (Figure 40) generates
pathway axes that follow even highly irregular channels (Figure 41).
Figure 40: Virtual sphere scan method. The first scan keeps the search within the main longitudinal spread
direction of the pathway. The tested directions and corresponding virtual spheres are shown as purple
arrows and circles respectively. The winning direction, i.e. the one whose spherical region has the largest
minimal radius to surrounding atoms among the spherical regions tested, is represented in blue. The
second scan resolves the inner irregularities and details of the pathway. The tested directions are
represented as cyan arrows. The final winning spherical region is represented in white. The main and sub
pathways are represented as a large and a small grey tube respectively.
Figure 41: Virtual sphere scan pathway axis detection. The double virtual sphere scan method makes it
possible to generate precisely the axis of a very irregular pathway. The protein channel cross section is
represented as a grey surface. The inner contour is very complex, includes turns of almost 90 degrees and
periodically displays very small void areas (e.g., pathway exit on the right). The computed axis is
represented as a series of red spheres.
2.3.3. Walk forward along pathway axis
All the calculations above have been done for the first step along the pathway. The correct pathway
direction has been found. The final step is to increment the scan forward, so as to advance along the
pathway axis. Before moving forward along the pathway, the scan is repeated once in its entirety, but
starting from a re-adjusted position. In other words, the procedures above are repeated with 𝐴 (initial
pathway start guess) replaced by 𝐶𝑂𝑀 (re-adjusted pathway start center) and with 𝐵 (point towards which
the initial pathway axis guess is pointing) replaced by the projection of 𝐶𝑂𝑀 along 𝒏:
𝑁𝑥 = 𝑁𝑥 − 𝐴𝑥, 𝑁𝑦 = 𝑁𝑦 − 𝐴𝑦, 𝑁𝑧 = 𝑁𝑧 − 𝐴𝑧
𝑁𝑛𝑜𝑟𝑚 = (𝑁𝑥^2 + 𝑁𝑦^2 + 𝑁𝑧^2)^0.5
𝑁𝑥 = 𝑁𝑥/𝑁𝑛𝑜𝑟𝑚, 𝑁𝑦 = 𝑁𝑦/𝑁𝑛𝑜𝑟𝑚, 𝑁𝑧 = 𝑁𝑧/𝑁𝑛𝑜𝑟𝑚
New 𝐴 and 𝐵 points are given by:
𝐴𝑥 = 𝐶𝑂𝑀𝑥, 𝐴𝑦 = 𝐶𝑂𝑀𝑦, 𝐴𝑧 = 𝐶𝑂𝑀𝑧
𝐵𝑥 = 𝐶𝑂𝑀𝑥 + 3 ∗ 𝑁𝑥, 𝐵𝑦 = 𝐶𝑂𝑀𝑦 + 3 ∗ 𝑁𝑦, 𝐵𝑧 = 𝐶𝑂𝑀𝑧 + 3 ∗ 𝑁𝑧
Then for all the subsequent walks along the pathway, the new starting position is set to 𝐴 as the 2 Å
projection of 𝐶𝑂𝑀 along 𝑁 (in order to walk along the winning axis) and to 𝐵 as the 4 Å (arbitrary
value) projection of 𝐶𝑂𝑀 along 𝑁 (𝐵 is only used to characterize the direction, and hence could be
projected at any distance along 𝑁):
New 𝐴 and 𝐵 points (for forward shifted scan) are given by:
𝐴𝑥 = 𝐶𝑂𝑀𝑥 + 2 ∗ 𝑁𝑥, 𝐴𝑦 = 𝐶𝑂𝑀𝑦 + 2 ∗ 𝑁𝑦, 𝐴𝑧 = 𝐶𝑂𝑀𝑧 + 2 ∗ 𝑁𝑧
𝐵𝑥 = 𝐶𝑂𝑀𝑥 + 4 ∗ 𝑁𝑥, 𝐵𝑦 = 𝐶𝑂𝑀𝑦 + 4 ∗ 𝑁𝑦, 𝐵𝑧 = 𝐶𝑂𝑀𝑧 + 4 ∗ 𝑁𝑧
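Both updates of 𝐴 and 𝐵 (re-adjustment, then forward walk) follow the same pattern and differ only in the projection distances. The sketch below illustrates this; the function name next_scan_points and its parameters a_step and b_step are hypothetical names for the distances given in the text.

```python
import math

def next_scan_points(COM, N_point, A, a_step=2.0, b_step=4.0):
    """Return the new (A, B) points projected from COM along the winning axis.

    N_point is the winning scan_axis point; the unit axis is (N_point - A) normalized.
    Use a_step=0 and b_step=3 for the initial re-adjustment, and the defaults
    (2 Å and 4 Å) for the subsequent forward walks.
    """
    n = [p - a for p, a in zip(N_point, A)]
    n_norm = math.sqrt(sum(c * c for c in n))
    n = [c / n_norm for c in n]
    new_A = tuple(c + a_step * d for c, d in zip(COM, n))
    new_B = tuple(c + b_step * d for c, d in zip(COM, n))
    return new_A, new_B
```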
2.3.4. Convert COM map to distance bins
In order to split the pathway axis into fixed distance to binding steps, independently from the pathway
axis length (which varies in time as the pathway conformation changes in time), and hence
independently from the simulation frame, the following procedure is employed. The calculation details
will not be specified.
First, each 𝐶𝑂𝑀𝑖 point (defines the pathway axis, derived previously) is projected onto the fixed axis
(see previously), which serves as an invariable reference for the different simulation frames. Let the
projected 𝐶𝑂𝑀𝑖 points be 𝐶𝑂𝑀𝑃𝑖. And let the fixed axis run from points 𝐿1 to 𝐿2. Second, the fixed
axis is divided into 1 Å steps beginning from 𝐿1 and ending at 𝐿2 (corresponding to the target successful
binding position). Third, to each fixed axis step is assigned a lower and an upper bound 𝐶𝑂𝑀 point
(segment of the pathway axis defined by two consecutive 𝐶𝑂𝑀 points), by comparing its position to the
𝐶𝑂𝑀𝑃𝑖 points (initial 𝐶𝑂𝑀𝑖 points that have been projected onto the fixed axis). Fourth, the distance
steps are “reprojected” onto their corresponding 𝐶𝑂𝑀 axis (lower and upper bound 𝐶𝑂𝑀). This is done
by calculating the intersection between the plane defined by the ith step and the normal vector (𝐿1, 𝐿2),
and the assigned 𝐶𝑂𝑀 axis (consecutive 𝐶𝑂𝑀 points selected). Let the final reprojected points
(corresponding to a fixed distance step along (𝐿1, 𝐿2), and belonging to the pathway axis) be
𝐶𝑂𝑀_𝑆𝑇𝐸𝑃𝑖.
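Since the calculation details are not specified, the fourth step can only be illustrated under assumptions. The sketch below shows one standard way to intersect the plane of a fixed-axis step (normal along (𝐿1, 𝐿2)) with the line through the two assigned 𝐶𝑂𝑀 points; the function name and arguments are hypothetical and this is not necessarily the author's implementation.

```python
import math

def reproject_step(L1, L2, step, com_lo, com_hi):
    """Intersect the plane normal to (L1, L2), located `step` Å from L1,
    with the line through the lower and upper bound COM points."""
    axis = [b - a for a, b in zip(L1, L2)]
    length = math.sqrt(sum(c * c for c in axis))
    n = [c / length for c in axis]
    plane_point = [a + step * c for a, c in zip(L1, n)]
    seg = [b - a for a, b in zip(com_lo, com_hi)]
    denom = sum(c * d for c, d in zip(n, seg))
    if abs(denom) < 1e-12:
        raise ValueError("COM segment is parallel to the step plane")
    # Parameter s along the COM segment at which the plane is crossed
    s = sum(c * (p - q) for c, p, q in zip(n, plane_point, com_lo)) / denom
    return tuple(q + s * d for q, d in zip(com_lo, seg))
```

For a fixed axis along z starting at the origin, the 1 Å step plane crosses the segment from (1, 0, 0) to (1, 2, 2) at its midpoint (1, 1, 1).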
2.3.5. Calculate cross section area
A simple approximation of the cross section area of a 2D shape is:
𝐴𝑟𝑒𝑎 = 𝜋 ∗ 𝑟_𝑚𝑒𝑎𝑛^2
Where 𝑟_𝑚𝑒𝑎𝑛 is the mean distance of the contour atoms to the geometric center.
Applying this formula to a square or a 6-branched star shape returns an area with a precision of +/-17
and +/-19 % respectively.
A more accurate method, explained below, is to sum the local areas formed by the atoms, successively
around the contour, i.e. to sum the areas per dial.
Figure 42: Cross section area calculation. The cross section area is computed by summing the area
contribution of the successive dials. The first three dials are represented in green, purple and orange
respectively.
Hence the formula used to calculate the cross section area is:
Area = \sum_{i=1}^{n} \frac{r1_i \cdot r2_i \cdot \sin(teta)}{2}
Where r1_i and r2_i are the radii of the ith and (i+1)th atoms belonging to the contour, and 𝑡𝑒𝑡𝑎 is the
angle between r1_i and r2_i.
To perform this calculation, the pathway is processed again, and loops through each 𝐶𝑂𝑀_𝑆𝑇𝐸𝑃𝑖 points.
The algorithm performs similar computations to the previous dial calculations. In order to calculate an
estimated diffusion area in pathways containing holes inside them, when no atom is detected inside a
dial, the previous atom of the contour is rotated to define a virtual contour atom.
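The per-dial area sum can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, the contour is treated as a closed loop (the last atom is paired with the first), the vdw radii are ignored, and r1_i ∗ r2_i ∗ sin(teta) is evaluated as the norm of the cross product of the two radius vectors.

```python
import math

def cross_section_area(center, contour):
    """Sum the triangle areas r1 * r2 * sin(teta) / 2 over successive contour points.

    contour is an ordered list of (x, y, z) points around the center.
    """
    area = 0.0
    n = len(contour)
    for i in range(n):
        r1 = [c - o for c, o in zip(contour[i], center)]
        r2 = [c - o for c, o in zip(contour[(i + 1) % n], center)]
        # |cross(r1, r2)| equals |r1| * |r2| * sin(teta)
        cx = r1[1] * r2[2] - r1[2] * r2[1]
        cy = r1[2] * r2[0] - r1[0] * r2[2]
        cz = r1[0] * r2[1] - r1[1] * r2[0]
        area += 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)
    return area
```

Four contour points at 1 Å from the center and spaced by 90 degrees give four triangles of area 0.5, hence a total of 2 Å^2, which is the exact area of the corresponding square.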
3. Electrostatic analysis
Three types of forces govern the diffusion: the Brownian random molecular water motion, the non-
bonded interactions and hydrogen bonds.
Given that the hydrophobic non-bonded interactions are indirectly taken into account and given that
hydrogen bonds represent a special case of electrostatic interaction, the long-range non-bonded
interactions are described by the electrostatic and van der Waals potentials between atoms i and j:
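U_{non-bonded} = U_{electrostatics} + U_{vdW} = \frac{q_i q_j}{4\pi\epsilon_0 r_{ij}} + \varepsilon \left[ \left( \frac{R_{min,ij}}{r_{ij}} \right)^{12} - 2 \left( \frac{R_{min,ij}}{r_{ij}} \right)^{6} \right]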
vdw forces are not straightforward to characterize and are dominated by Coulombic interactions at
long distances. Consequently, emphasis is placed on Coulombic electrostatics to characterize the force
guiding or impeding substrate access along the pathways.
More precisely, to characterize how favorable a pathway is for substrate diffusion, the central diffusion
pathway obtained with the methodology of the previous section is used. The series of pathway
𝐶𝑂𝑀_𝑆𝑇𝐸𝑃𝑖 points are used to represent the position of a substrate successively along the pathway.
The long-range electrostatics acting on an rNTP substrate inside a channel is then characterized by
calculating the Coulombic interaction between the protein atom charges and a point of charge -2
representing the substrate at the ith position along the pathway axis. If a NTP at 𝐶𝑂𝑀_𝑆𝑇𝐸𝑃𝑖 position is
represented by the point 𝐶𝑂𝑀𝑖_𝑁𝑇𝑃, and if j and i are the protein atom and NTP indexes respectively,
the force on the 𝐶𝑂𝑀𝑖_𝑁𝑇𝑃 charge due to the protein charges is given by:
\mathbf{F}(COM_i\_NTP) = \frac{q_{NTP}}{4\pi\varepsilon_r\varepsilon_0} \sum_{j=1}^{n} \frac{q_j}{|\mathbf{r}_{ji}|^2} \hat{\mathbf{r}}_{ji}
Using a protein dielectric constant of 74, a NTP charge of -2, and converting to SI units (elementary
charge in Coulombs and Angstroms in meters), the equation can be rewritten as:
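\mathbf{F}(COM_i\_NTP) = \frac{-2}{4\pi \cdot 74 \cdot 8.854187817 \times 10^{-12}} \cdot \frac{(1.6021766208 \times 10^{-19})^2}{10^{-20}} \cdot \sum_{j=1}^{n} \frac{q_j}{|\mathbf{r}_{ji}|^2} \hat{\mathbf{r}}_{ji}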
\mathbf{F}(COM_i\_NTP) = -6.2353446 \times 10^{-10} \cdot \sum_{j=1}^{n} \frac{q_j}{|\mathbf{r}_{ji}|^2} \hat{\mathbf{r}}_{ji}
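The force summation can be sketched as below, with the prefactor recomputed from the constants rather than hard-coded. Several points are assumptions: the function name and input layout are hypothetical, the elementary charge is taken as the CODATA 2014 value, the protein charges are expressed in elementary charge units and the positions in Å, and the vector r_ji is taken to point from protein atom j to the NTP point, so that the result is in newtons.

```python
import math

E_CHARGE = 1.6021766208e-19   # elementary charge in C (CODATA 2014, assumed value)
EPS_0 = 8.854187817e-12       # vacuum permittivity in F/m
EPS_R = 74.0                  # dielectric constant used in the text
Q_NTP = -2.0                  # NTP point charge, in elementary charge units
ANGSTROM = 1e-10              # 1 Å in meters

# Prefactor of the sum, close to -6.2353446e-10 when distances are given in Å
PREFACTOR = Q_NTP * E_CHARGE**2 / (4 * math.pi * EPS_R * EPS_0 * ANGSTROM**2)

def force_on_ntp(ntp_pos, protein_atoms):
    """Return the Coulomb force (N) on the NTP point charge.

    protein_atoms is an iterable of (charge_in_e, (x, y, z)) with positions in Å.
    """
    fx = fy = fz = 0.0
    for q_j, pos in protein_atoms:
        # Assumed convention: r_ji points from protein atom j to the NTP point i
        rx, ry, rz = ntp_pos[0] - pos[0], ntp_pos[1] - pos[1], ntp_pos[2] - pos[2]
        r2 = rx * rx + ry * ry + rz * rz
        scale = q_j / (r2 * math.sqrt(r2))   # q_j / |r|^2 times the unit vector
        fx += scale * rx
        fy += scale * ry
        fz += scale * rz
    return (PREFACTOR * fx, PREFACTOR * fy, PREFACTOR * fz)
```

With this convention, a single positive protein charge located at +1 Å along x pulls the negatively charged NTP point towards positive x, as expected for an attractive interaction.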
Chapter 5
Results and Discussion
1. Introduction
Diffusion is a critical step to provide substrates to molecular machines. One can think of substrate
loading as being mainly stochastic and random in nature. However, a cell is orchestrated in a very precise
manner, and in living organisms, nature has provided advanced and sometimes complex solutions to
control substrate input such as precisely shaped pathways, or elaborate electrostatic filtration. As such,
diffusion, and the biomolecular properties underlying its behavior, can be seen as being part of more
general cellular programming. In RNA synthesis, substrate delivery can be seen as the most elementary
step of elongation.
In this section, we will present new results about substrate diffusion and loading into RNAP. We will
attempt to characterize the diffusion process and check if simulation results are in accordance with the
main channel theory presented in chapter 1. The following questions will be discussed. What are the
diffusion pathways leading to the DS bubble or the catalytic center? Are there conformationally or
electrostatically suitable routes and do they compare favorably to CH2? How does NTP loading fit in a
rationalized more general enzymatic translocation cycle model?
2. Simulation summary
Trajectories derived from five aMD and six sMD simulations are listed in this subsection. aMD
simulations are summarized in Table 6 below.
aMD simulation name   time    A1    A2     A3     A4
aMD1                  20 ns   3.5   0.20   0.50   0.50
                      50 ns   3.5   0.20   0.20   0.20
aMD2                  50 ns   3.5   0.20   0.50   0.50
aMD3                  50 ns   4.5   0.20   0.20   0.20
aMD4                  50 ns   3.5   0.20   0.20   0.20
aMD5                  80 ns   3.5   0.20   0.20   0.20
nb of protein residues: 3795; nb of water mol.: 159600; total nb of atoms: 707874
Table 6: aMD simulation summary. Acceleration parameters are calculated from A1, A2, A3 and A4 as:
𝑬_𝒅𝒊𝒉𝒆𝒅 𝒌𝒄𝒂𝒍.𝒎𝒐𝒍−𝟏 = 𝑽_𝒅𝒊𝒉𝒆𝒅 𝒌𝒄𝒂𝒍.𝒎𝒐𝒍−𝟏 + 𝑨𝟏 𝒌𝒄𝒂𝒍.𝒎𝒐𝒍−𝟏 ∗ 𝒏𝒃_𝒑𝒓𝒐𝒕_𝒓𝒆𝒔,
𝜶_𝒅𝒊𝒉𝒆𝒅 𝒌𝒄𝒂𝒍.𝒎𝒐𝒍−𝟏 = 𝑨𝟐 ∗ (𝑨𝟏 𝒌𝒄𝒂𝒍.𝒎𝒐𝒍−𝟏 ∗ 𝒏𝒃_𝒑𝒓𝒐𝒕_𝒓𝒆𝒔),
𝑬_𝒕𝒐𝒕𝒂𝒍 𝒌𝒄𝒂𝒍.𝒎𝒐𝒍−𝟏 = 𝑽_𝒕𝒐𝒕𝒂𝒍 𝒌𝒄𝒂𝒍.𝒎𝒐𝒍−𝟏 + 𝑨𝟑 𝒌𝒄𝒂𝒍.𝒎𝒐𝒍−𝟏 ∗ 𝒏𝒃_𝒂𝒕𝒎𝒔,
𝜶_𝒕𝒐𝒕𝒂𝒍 𝒌𝒄𝒂𝒍.𝒎𝒐𝒍−𝟏 = 𝑨𝟒 𝒌𝒄𝒂𝒍.𝒎𝒐𝒍−𝟏 ∗ 𝒏𝒃_𝒂𝒕𝒎𝒔
In addition, Force-Distance relationships are generated from the six sMD trajectories outlined in Table
7, specifying the structural checkpoints used along the sMD pull across a pathway, and the magnitude
of the pulling force.
sMD CH2 0.3, sMD CH2 0.4
Landmarks: LD1A = Rpb1 728:NZ; LD1B = Rpb1 1300:NZ; LD2A = Rpb1 1360:CA; LD2B = Rpb1 620:CA; LD3 = Rpb1 476:CA; LD4 = Rpb2 837:CB; LD5 = tDNA i + 1:N3
Checkpoints: CK1 = 2.5 Å from m(LD1A, LD1B); CK2 = 7 Å from m(LD2A, LD2B); CK3 = 6 Å from LD3; CK4 = 5 Å from LD4; CK5 = 3 Å from LD5

sMD CH3C 0.075
Landmarks: LD1A = Rpb1 1222:CG; LD1B = Rpb5 118:CG; LD2 = Rpb1 1278:CG; LD3 = tDNA i + 2:N3
Checkpoints: CK1 = 4 Å from m(LD1A, LD1B); CK2 = 7 Å from LD2; CK3 = 3 Å from LD3

sMD CH3A 0.3
Landmarks: LD1A = Rpb1 728:NZ; LD1B = Rpb1 1300:NZ; LD2A = Rpb1 716:CG; LD2B = Rpb1 1092:CE; LD3A = Rpb1 1113:CB; LD3B = Rpb1 773:CD; LD4A = Rpb1 1112:CE; LD4B = Rpb2 509:CA; LD5 = tDNA i + 2:N3
Checkpoints: CK1 = 2.5 Å from m(LD1A, LD1B); CK2 = 2.5 Å from m(LD2A, LD2B); CK3 = 2.5 Å from m(LD3A, LD3B); CK4 = 2.5 Å from m(LD4A, LD4B); CK5 = 3 Å from LD5

sMD-aMD CH3C 0.04
Landmarks: LD1A = Rpb5 91:CD; LD1B = tDNA i-21:C5'; LD2 = Rpb1 1247:CB; LD3 = Rpb1 771:CG; LD4 = tDNA i + 2:N3
Checkpoints: CK1 = 2.5 Å from m(LD1A, LD1B); CK2 = 7 Å from LD2; CK3 = 7 Å from LD3; CK4 = 3 Å from LD4

sMD CH3B 0.15, sMD CH3B 0.3
Landmarks: LD1A = Rpb1 702:CG; LD1B = Rpb1 1274:CZ; LD1C = Rpb9 92:NH2; LD2A = Rpb1 702:CG; LD2B = Rpb1 1274:CZ; LD3 = Rpb9 50:CB; LD4 = tDNA i + 2:N3; LD5 = Rpb1 771:CG; LD6 = tDNA i + 2:N3
Checkpoints: CK1 = 2.5 Å from 25 Å projection of m(LD1A, LD1B) along LD1C, m(LD1A, LD1B); CK2 = 2.5 Å from m(LD2A, LD2B); CK3 = 12 Å from LD3; CK4 = no checkpoint distance, pulled for 50 ps towards LD4; CK5 = 6.5 Å from LD5; CK6 = 3 Å from LD6

sMD CH4 0.075
Landmarks: LD1 = NA; LD2 = tDNA i + 2:N1
Checkpoints: CK1 = manually positioned at entrance of CH4; CK2 = 3 Å from LD2
Table 7: sMD simulation summary. For each sMD simulation, the corresponding pathway is indicated after
“sMD” in the name, followed by the pulling force in kcal.mol-1.Å-2. LD stands for landmark (x, y, z
coordinates). When a landmark is calculated as the middle between two points, the notation m(A,B) is
used. sMD trajectories are divided into sub-paths, where switching is done at a CK point (CK stands for
checkpoint), once a certain threshold distance is reached (the Å distance given in the table above before
the LD point). The simulated system and pulled molecule are 2E2H and GTP respectively for all the sMD
runs, except for sMD CH4 0.075 where the system is PDB#5C4J and the pulled molecule is CTP.
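The checkpoint logic described in the caption of Table 7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used for the simulations: the function names are hypothetical, the coordinates in the usage example are invented, and a checkpoint is assumed to be reached once the pulled molecule lies within the threshold distance of the current landmark.

```python
import math

def midpoint(a, b):
    """m(A, B): middle point between two landmark coordinates."""
    return tuple((x + y) / 2.0 for x, y in zip(a, b))

def distance(a, b):
    """Euclidean distance between two points, in Å."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def advance_checkpoint(position, checkpoints, current):
    """Return the index of the sub-path to pull along.

    checkpoints is a list of (landmark_xyz, threshold_in_angstrom).
    The index moves to the next sub-path once the pulled molecule is
    within the threshold distance of the current landmark.
    """
    landmark, threshold = checkpoints[current]
    reached = distance(position, landmark) <= threshold
    if reached and current + 1 < len(checkpoints):
        return current + 1
    return current

# Invented coordinates, for illustration only.
ld1a, ld1b, ld2 = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)
checkpoints = [(midpoint(ld1a, ld1b), 2.5),  # CK1, 2.5 Å from m(LD1A, LD1B)
               (ld2, 3.0)]                   # CK2, 3 Å from LD2
print(advance_checkpoint((3.0, 0.0, 0.0), checkpoints, current=0))
```

In this example the pulled molecule is 2 Å from m(LD1A, LD1B), which is below the 2.5 Å threshold, so the pull switches to the second sub-path.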
Finally, an algorithm has been developed (see previous section for explanations) and is executed to
extract the pathway axis, cross section area, minimal radius and electrostatic force experienced by a
virtual NTP point charge of -2 along the diffusional pathway. To characterize the electrostatic force in
an informative fashion, i.e. as a propensity to travel through a pathway, its magnitude and orientation
are combined into a single value by projecting the Coulombic interaction vector between the virtual
NTP point charge and the protein atoms onto the diffusional axis generated by the scanning algorithm.
This axis is not rectilinear (a single axis) but is represented by successive 1 Å long pathway axes, where
each 1 Å axis runs from one pathway center to the next. These centers are referred to as pathway COMs
for simplicity.
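The projection described above can be sketched in a few lines of Python. This is a simplified illustration and not the algorithm of the previous section: the function name is hypothetical, the Coulomb law is applied in vacuum with no cutoff or dielectric screening, and the conversion constant 332.0636 kcal.Å.mol-1.e-2 is an assumed, commonly used MD value.

```python
import math

COULOMB_K = 332.0636  # kcal.Å.mol-1.e-2, assumed conversion constant

def projected_force(coms, atoms, q_probe=-2.0):
    """Project the Coulombic force on a probe charge onto each axis segment.

    coms  : list of (x, y, z) pathway centers, in Å
    atoms : list of (x, y, z, charge) for the protein atoms
    Returns one value per segment (kcal.mol-1.Å-1); a positive value
    pushes the probe forward along the pathway axis.
    """
    projections = []
    for start, end in zip(coms, coms[1:]):
        axis = [e - s for s, e in zip(start, end)]
        norm = math.sqrt(sum(c * c for c in axis))
        unit = [c / norm for c in axis]
        force = [0.0, 0.0, 0.0]
        for x, y, z, q in atoms:
            d = [start[0] - x, start[1] - y, start[2] - z]  # atom -> probe
            r = math.sqrt(sum(c * c for c in d))
            scale = COULOMB_K * q_probe * q / r ** 3
            force = [f + scale * c for f, c in zip(force, d)]
        projections.append(sum(f * u for f, u in zip(force, unit)))
    return projections

# One 1 Å segment along x, and one positive charge located further along the axis.
print(projected_force([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(5.0, 0.0, 0.0, 1.0)]))
```

A positive charge located ahead of the probe along the axis attracts the -2 point charge and gives a positive projection; a negative charge at the same place gives the opposite sign.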
3. Results
3.1. Diffusional zones
Many authors have proposed that substrates have no access to the main channel. However, it appears
impossible to rationalize how downstream templated NTPs could promote the translocation sliding
degrees of freedom, and consequently help expel a misloaded NTP or accelerate the active site delivery
and/or isomerization of a correct NTP, without binding to DS registers. Consistent with the results
presented in the main channel theory section, which seem to indicate that substrates can access the main
channel, several pathways have been identified in the RNAP structure and appear to offer substrate
delivery capabilities to the main channel. In addition to the CH2 pathway widely discussed in the
literature, five channels leading to the DS bubble have been identified. Altogether, the possible
diffusion routes are the following.
CH2 comprises the funnel and a narrow corridor. The sequence of CH2 is, scRPB1: 350, 352, 446-448, 450,
451, 453, 454, 472-477, 479-486, 513, 515-525, 528, 532, 533, 535-538, 588-605, 616-628, 631, 632,
635, 693, 696, 697, 702-739, 743-758, 760, 764-769, 772-774, 819-824, 826-828, 831, 832, 878-888,
946-962, 1025, 1071, 1074, 1075, 1078-1097, 1100, 1113, 1115-1117, 1119, 1281-1291, 1298-1309,
1326, 1328-1330, 1342, 1345, 1346, 1349-1351, 1353-1366, 1368; scRPB2: 529-531, 533, 763, 765,
766, 769, 772, 773, 776, 835, 836, 837, 977, 979, 985-987, 1013, 1016, 1018-1021, 1095-1097, 1102;
scRPB5: 147-149, 151, 200-204.
The sequence of the corridor is, scRPB1: 350, 352, 446 - 448, 450, 451, 453, 454, 472 - 477, 479 - 486,
515, 520 - 525, 528, 623, 624, 750 - 753, 819-824, 826, 827, 1074, 1075, 1078 – 1086; scRPB2: 529 -
531, 533, 763, 765, 766, 769, 772, 773, 776, 836, 837, 977, 979, 985-987, 1018 - 1021, 1095-1097,
1102.
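Residue lists such as the ones above are easier to compare once expanded into sets. The following Python sketch, with a hypothetical helper name, expands a comma-separated list with ranges and checks on a short excerpt of the scRPB1 lists that the corridor residues are contained in CH2. Only the first few entries of each list are used here, to keep the example short.

```python
import re

def parse_residues(text):
    """Expand a list such as '350, 352, 446-448' into a set of residue numbers.

    Ranges may be written with a hyphen or an en dash, with or without spaces.
    """
    residues = set()
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue
        match = re.fullmatch(r"(\d+)\s*[-\u2013]\s*(\d+)", token)
        if match:
            first, last = int(match.group(1)), int(match.group(2))
            residues.update(range(first, last + 1))
        else:
            residues.add(int(token))
    return residues

# Excerpts of the scRPB1 lists given above (beginning of each list only).
ch2 = parse_residues("350, 352, 446-448, 450, 451, 453, 454, 472-477, 479-486")
corridor = parse_residues("350, 352, 446 - 448, 450, 451, 453, 454, 472 - 477")
print(corridor <= ch2)       # True when every corridor residue also lines CH2
print(sorted(ch2 - corridor))
```

The same helper can be applied to the complete lists to check overlaps between channels, for instance between the CH3A and CH3B openings.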
Figure 43: CH2 and corridor pathways. Residues lining the corridor section of CH2 are shown in white, the
remaining part of CH2 is shown in blue. The protein and the RNA 3’ end are colored in grey and lime
respectively.
A complex channel branches into four parts and will be referred to as CH3. The CH3A/B channel runs from
two openings near the funnel of CH2 directly to the downstream bubble, near registers i + 2 to i + 4.
CH3A is formed by a hole in the funnel of CH2. CH3B is adjacent to CH3A and is formed by a hole
lying near the exterior of the enzyme rather than in the funnel. CH3A seems to correspond to the “pore 2”
pathway described briefly by Cramer et al. [Cramer, et al., 2000], but has apparently not been referred
to since.
CH3A/B is composed of the following residues, scRPB1: 700-712, 715, 716, 768-784, 787-791, 796,
797, 814, 815, 817, 819, 826, 827, 829, 835, 837, 840, 1076, 1080 1089, 1089-1116, 1132, 1134-1136,
1138-1141, 1144-1146, 1148, 1198, 1200-1207, 1269, 1274, 1277-1284, 1307-1312, 1329-1334, 1351,
1354, 1355, 1357, 1358, 1381, 1383-1387; scRPB2: 218, 224-241, 254-264, 267, 308, 309, 312, 313,
381, 386-400, 404, 501-517, 535, 699; scRPB9: 44, 46, 48-53, 87, 89-94, 96, 113-120. CH3A opening
is, scRPB1: 705-708, 712, 713, 716, 717, 719, 720, 769, 771-774, 1089-1097, 1100, 1113, 1115, 1117,
1281, 1283, 1285, 1287, 1307, 1309, 1328, 1330, 1350, 1351, 1354, 1357, 1358. CH3B opening is,
scRPB1: 700-706, 708-710, 1132, 1134-1136, 1138-1141, 1144-1146, 1148, 1198, 1200-1207, 1269,
1274, 1277-1279, 1281-1284; scRPB2: 263, 264, 267, 308, 309, 312, 313; scRPB9: 44, 46, 48-53, 90,
92-94, 96, 113-120.
Figure 44: CH3 channel view from CH2. Residues lining opening A of CH3 are shown in green, opening B
leading to CH3 and CH3 are indicated in pink. CH2 is indicated in blue and the protein and nucleic acid
atoms are represented as grey lines.
Figure 45: Side view of CH3. CH3, CH2, downstream tDNA and ntDNA are shown in blue, pink, light blue
and cyan respectively. Protein atoms are represented as grey lines.
CH3C joins CH3A/B on the other side of the protein wall, further away from the funnel, and is shaped
as a tube that is open on one side, over one third of its length, towards DS DNA. Hence CH3C is a
sub-channel of CH1 and partly envelops DS DNA. In addition to CH3C, two further channels lie in the
CH1 area. CH3D runs below and perpendicular to DS DNA and joins CH3C. CH4 is a passage that goes
under a loop formed by ntDNA to enter the pre-binding i + 2 to i + 4 zone, from the direction opposite
to CH3A/B.
The sequences are the following. CH3C: scRPB1: 829, 832, 833, 836, 837, 840, 1095, 1096, 1099, 1100,
1102, 1103, 1105-1114, 1140-1142, 1144, 1145, 1215, 1216, 1218-1224, 1242-1263, 1265-1267, 1269-
1272, 1275-1280, 1309-1315, 1317, 1318, 1329, 1331, 1333, 1334, 1336-1338, 1381-1383, 1385-1387;
scRPB2: 224, 226-234, 237, 239, 255, 257, 259-268, 270, 277, 278, 279, 396-399, 504-511; scRPB5: 5,
7, 8, 11, 112-119, 121, 122, 136-140; ntDNA i-20 to i-4, t strand i-20 to i-2. CH3D: scRPB1: 118-141,
143-147, 860-862, 1393, 1394; scRPB5: 140, 171, 173, 175- 194, 213-215. CH4: scRPB1: 306-316,
ntDNA i + 4 to i + 10.
Figure 46: Side view of CH3C, CH3D and CH4, relative to CH2. CH2, CH3C, CH3D, CH4, tDNA and
ntDNA are shown in blue, yellow, red, orange, light blue and cyan respectively. The represented CH4 surface
includes ntDNA registers i + 4 to i + 10. The rest of the protein is indicated in grey.
Figure 47: Front view of CH3C, CH3D and CH4. CH3C, CH3D, CH4, tDNA, ntDNA and RNA are shown
in yellow, red, orange, light blue, cyan and lime respectively. The protein is indicated in grey.
Figure 48: Side view of CH3C, CH3D and CH4, relative to CH4. CH3C, CH3D, CH4, tDNA, ntDNA and
RNA are shown in yellow, red, orange, light blue, cyan and lime respectively. The protein is indicated in grey.
Figure 49: Bottom view of CH3D entrance to CH3. CH3D, CH3C, tDNA and the overall enzyme are visible
as red, yellow, cyan and grey surfaces respectively.
3.2. CH2 Analysis
Conformational analysis
Let us start our investigation with the well-known secondary channel. The pathway algorithm detected
the following COM axis across the channel.
Figure 50: Front, side and back view of CH2 pathway axis. The pathway is represented in grey surface.
Virtual atoms filling holes in the pathway surface are indicated as silver points along the contour, thereby
drawing a closed diffusive channel. The computed pathway axis is represented as a red Gaussian trajectory.
The pathway axis is indicated as successive red spheres in the first figure to better visualize trajectory along
the funnel.
The path is characterized by two main directions: it first runs from the funnel opening to the entrance
of the corridor, and then turns into the corridor leading to the active site.
The cross-sectional areas and minimal radii along this COM axis are given below. Note that the path
displays a dramatic reduction of diffusive area when entering the corridor (the cross section area
heatmap agrees well with that of the minimal radius).
Figure 51: CH2 minimal radius along diffusional path heatmap. Time against Distance to Binding against
Minimal Radius along the pathway is plotted. The simulation trajectory is aMD5.
Figure 52: CH2 cross section area along diffusional path heatmap. Time against Distance to Binding against
Cross Section Area along the pathway is plotted. The simulation trajectory is aMD5.
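To make the notion of a minimal radius along the path concrete, the following Python sketch computes, for each pathway center, the free distance to the nearest atom surface. It is only an illustration under assumptions: the pathway algorithm of the previous section may define the minimal radius differently, the function names are hypothetical, and the coordinates and van der Waals radii in the example are invented.

```python
import math

def minimal_radius(center, atoms):
    """Free distance (Å) from a pathway center to the nearest atom surface.

    atoms is a list of (x, y, z, vdw_radius).
    """
    return min(math.dist(center, (x, y, z)) - radius
               for x, y, z, radius in atoms)

def radius_profile(coms, atoms):
    """Minimal radius at each pathway center, in path order."""
    return [minimal_radius(center, atoms) for center in coms]

# Invented geometry: two centers 1 Å apart and two wall atoms.
coms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
atoms = [(0.0, 4.0, 0.0, 1.5), (1.0, -3.0, 0.0, 1.5)]
profile = radius_profile(coms, atoms)
bottleneck = min(range(len(profile)), key=profile.__getitem__)
print(profile, "narrowest at center", bottleneck)
```

Repeating this calculation for every frame of a trajectory would give one radius profile per time point, which is the kind of information summarized by a heatmap of time against position along the path.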
Electrostatic analysis
The favorable or impeding electrostatic contribution is characterized by the projection, along the COM
axis, of the Coulombic interaction between a virtual point charge of -2, representing an NTP, and the
protein atoms. The heatmap below displays this information and further weakens the case for CH2 being
a favorable diffusive channel from the corridor section onwards.
Figure 53: CH2 Electrostatic NTP interaction along diffusional path heatmap. Time against NTP
experienced Electrostatic Force projected along channel axis against Cross Section Area along the pathway
is plotted. The simulation trajectory is aMD5.
Force-Distance relationship
To further test how favorable the pathway is to diffusion, several pulling forces were applied to a GTP
molecule along the checkpoints presented in Table 7. The nucleotide triphosphate required a force of 0.4
kcal.mol-1.Å-2 to cross the corridor, while a force of 0.3 kcal.mol-1.Å-2 led to the substrate halting
its diffusion. Furthermore, these were the most favorable conditions, with the TL maintained open with
restraints.
Figure 54: CH2 force-distance plot. The simulation trajectories are sMD CH2 0.3 and sMD CH2 0.4.
Substrate/Metabolite diffusion analysis
In aMD2 and aMD4 simulations, glutamate molecules diffused through the funnel to the entrance of
the corridor, then quickly diffused away, which supports the view that the corridor is not suitable to
accommodate negatively charged molecules.
3.3. CH3A Analysis
Conformational analysis
CH3A is an interesting opening, because its access can be completely gated or greatly expanded. The
pore leads directly to DS DNA around i + 2 to i + 3. Let us first consider the parameters that restrict
the channel. The opening appears to be gated by the TL when the latter is in the extreme open
conformation. For example, the PDB#5C4J crystal structure shows an initial complete gating of CH3A.
However, preliminary simulations of 5C4J seem to indicate that the TL quickly retracts slightly from
CH3A, reducing its gating (data not shown). Also, CH3A access seems to be shielded when TFIIS binds
(chapter 1).
Figure 55: Front and side view of TL closing of opening CH3A. TL, opening A and protein walls are
indicated in grey, red and green respectively. RNAP structure is PDB#5C4J [Barnes, et al., 2015].
In simulations aMD1 to aMD5, CH3A globally maintains a large opening. The entrance expands
stochastically, resulting in periodic merging with CH3B, thereby forming one single opening:
CH3A/B. The pathway algorithm was run on aMD1, where the access displays a very large void surface
and where it was virtually merged with CH3B during the entire 70 ns simulation. The figures below
display the COM axis trajectory and a conformation in which CH3A is merged with CH3B.
Figure 56: Front, side and back view of CH3A pathway axis. The pathway is represented in grey surface.
Virtual atoms filling holes in the pathway surface are indicated as silver points along the contour, thereby
drawing a closed diffusive channel. The computed pathway axis is represented as a red Gaussian trajectory.
The conformation along the channel can be characterized by the following heatmaps.
Figure 57: CH3A minimal radius along diffusional path heatmap. Time against Distance to Binding against
Minimal Radius along the pathway is plotted. The simulation trajectory is aMD5.
Figure 58: CH3A cross section area along diffusional path heatmap. Time against Distance to Binding
against Cross Section Area along the pathway is plotted. The simulation trajectory is aMD5.
Electrostatic analysis
Although the accessible area is very large, the pathway is unfavorable to NTP diffusion due to
an electrostatic force repelling an NTP away from the diffusional path leading to a potential pre-binding.
Figure 59: CH3A Electrostatic NTP interaction along diffusional path heatmap. Time against NTP
experienced Electrostatic Force projected along channel axis against Cross Section Area along the pathway
is plotted. The simulation trajectory is aMD5.
Force-Distance relationship
Several pulling forces were tested, and a magnitude of 0.3 kcal.mol-1.Å-2 was required to overcome the
negative electrostatic potential.
Figure 60: CH3A force-distance plot. The simulation trajectory is sMD CH3A 0.2.
Substrate/Metabolite diffusion analysis
In aMD1 and aMD2 simulations, a glutamate and an aspartate metabolite respectively diffused
completely across the channel, which seems to indicate that the pathway is more favorable than CH2,
where no metabolite was able to go past the E site near the entrance of the corridor.
3.4. CH3B Analysis
Conformational analysis
CH3B is also an interesting pathway, because it displays a very precisely shaped narrow pore running
from an opening outside the enzyme, adjacent to the CH3A opening belonging to CH2 funnel area, and
does not seem to be affected by TL conformation switch or TFIIS binding. The pathway algorithm
generated a COM trajectory axis displaying a mean minimal radius along the diffusive path of about 3
Å only.
Figure 61: Front, side and back view of CH3B pathway axis. The pathway is represented in grey surface.
Virtual atoms filling holes in the pathway surface are indicated as silver points along the contour, thereby
drawing a closed diffusive channel. The computed pathway axis is represented as a red Gaussian trajectory.
Figure 62: CH3B minimal radius along diffusional path heatmap. Time against Distance to Binding against
Minimal Radius along the pathway is plotted. The simulation trajectory is aMD5.
Figure 63: CH3B cross section area along diffusional path heatmap. Time against Distance to Binding
against Cross Section Area along the pathway is plotted. The simulation trajectory is aMD5.
Electrostatic analysis
According to the electrostatic calculations, CH3B is not favorable to NTP accommodation.
Figure 64: CH3B Electrostatic NTP interaction along diffusional path heatmap. Time against NTP
experienced Electrostatic Force projected along channel axis against Cross Section Area along the pathway
is plotted. The simulation trajectory is aMD5.
Force-Distance relationship
Although a negative electrostatic potential lies across the channel, a relatively low pulling force of 0.15
kcal.mol-1.Å-2 was able to make a GTP molecule diffuse almost completely. A force of 0.3
kcal.mol-1.Å-2 made the substrate diffuse very quickly, and compared favorably to the same pulling force
applied in the CH3A case, which seems to indicate that the channel is more favorable than both CH2 and CH3A.
Figure 65: CH3B force-distance plot. The simulation trajectories are sMD CH3B 0.15 and sMD CH3B 0.3.
Substrate/Metabolite diffusion analysis
In aMD2 simulation, a zwitterionic glutamate amino acid diffused through the channel. More importantly,
in aMD5, a GTP molecule bound at the entrance of the channel and remained at that position during the
simulation time, which seems to indicate that there is no energy barrier to access the very beginning of
the pathway.
Figure 66: GTP bound at CH3B entrance. GTP and bound MgB ion are indicated as red CPK drawing and
pink sphere respectively. Protein surface, tDNA and ntDNA are shown in grey, light blue and cyan
respectively.
3.5. CH3C Analysis
Conformational analysis
CH3C is an intriguing pathway because, although it lies next to DS DNA, it remains at a distance from
the nucleic acid helix most of the time in the simulations. The solvent accessible cavity widens in the
first few ns of simulation, meaning that in the initial crystal structure coordinates, crystal packing
forces might partially hide the pathway. The last quarter of the pathway seems to be gated by scRPB2:
204-206. Nevertheless, these residues are folded away most of the time, hence not impeding
accessibility in the last section of the channel. In the 80 ns long aMD5 simulation, scRPB2: 204-206
were always folded away.
Figure 67: Longitudinal view through CH3C. Gating residues near the end of the pathway, protein surface,
tDNA and ntDNA are shown in lime, light blue, cyan and grey respectively.
A diffusive COM axis has been detected, and is presented below.
Figure 68: Side view of CH3C pathway axis. The pathway is represented in grey surface. Virtual atoms
filling holes in the pathway surface are indicated as silver points along the contour, thereby drawing a closed
diffusive channel. The computed pathway axis is represented as a red Gaussian trajectory.
Over the 80 ns of aMD, the channel remains widely accessible. Although this is not obvious from the minimal
radius along the COM axis, the accessibility is better evidenced by the cross section area heatmap.
Figure 69: CH3C minimal radius along diffusional path heatmap. Time against Distance to Binding against
Minimal Radius along the pathway is plotted. The simulation trajectory is aMD5.
Figure 70: CH3C cross section area along diffusional path heatmap. Time against Distance to Binding
against Cross Section Area along the pathway is plotted. The simulation trajectory is aMD5.
Electrostatic analysis
CH3C seems to be suitable electrostatically to accommodate NTP substrates, although an energetic
barrier lies at the very beginning.
Figure 71: CH3C Electrostatic NTP interaction along diffusional path heatmap. Time against NTP
experienced Electrostatic Force projected along channel axis against Cross Section Area along the pathway
is plotted. The simulation trajectory is aMD5.
Force-Distance relationship
sMD simulations of CH3C compare very favorably to those of the alternative pathways: a pulling
force of only 0.075 kcal.mol-1.Å-2 allowed fast diffusion of the substrate to binding. Also, an sMD
simulation using the aMD boost sampling method allowed a virtually complete diffusion (to within a few
angstroms of binding, probably because the trajectory would require a few adjustments) with
a force that can be considered almost negligible: 0.04 kcal.mol-1.Å-2.
Figure 72: CH3C force-distance plot. The simulation trajectories are sMD CH3C 0.075 and sMD CH3C
0.04 aMD.
Substrate/Metabolite diffusion analysis
Around 65 ns into the aMD5 simulation, a GTP molecule initiated diffusion across CH3C. The substrate then
inserted further into the channel. The base group stuck to the protein walls, preventing the molecule from
pursuing its diffusion quickly, which appeared to be due to suboptimal NTP base group parameters. The
nucleic acid forcefield modifications for use with the 12-6-4 potential from [Panteva, et al., 2015B]
were then applied to the NTP, and the molecule unbound and continued a quick diffusion across the
channel. The simulated diffusion could constitute an unbiased (as compared to sMD, where a force that
biases the reaction coordinate is applied), partially successful diffusion. The NTP is bound to an additional
positively charged metabolite: an extra Mg2+ ion. This could help cross the small energetic barrier that
seems to lie (over 3 to 4 Å) at the beginning of the pathway.
Figure 73: NTP diffusion through CH3C state 1. A substrate approaches CH3C around time step 66 ns of
aMD5 simulation. The GTP molecule (red) bound to two Mg2+ (pink spheres) is shown. The protein surface,
tDNA and ntDNA are indicated in grey, light blue and cyan respectively.
Figure 74: NTP diffusion through CH3C state 2. The substrate initiates diffusion around time step 66.5 ns
of aMD5 simulation. The GTP molecule (red) bound to two Mg2+ (pink spheres) is shown. The protein
surface, tDNA and ntDNA are indicated in grey, light blue and cyan respectively.
Figure 75: NTP diffusion through CH3C state 3. The substrate continues diffusion inside CH3C around
time step 80 ns of aMD5 simulation. The GTP molecule (red) bound to two Mg2+ (pink spheres) is shown.
The protein surface, tDNA and ntDNA are indicated in grey, light blue and cyan respectively.
Figure 76: NTP diffusion through CH3C state 4. [Panteva, et al., 2015B] parameters are switched on and
the substrate diffuses along half of CH3C pathway (aMD5-prolonged time step 85.5 ns). The GTP molecule
(red) bound to two Mg2+ (pink spheres) is shown. The protein surface, tDNA and ntDNA are indicated in
grey, light blue and cyan respectively.
In addition to the NTP loading depicted above, glutamate molecules diffused completely along the
channel, arriving near the DS DNA pre-binding area in aMD2, 3, 4 and 5. The diffusion occurred very
quickly (0.5 ns up to 2 ns), significantly faster than any metabolite travel observed in the alternative
pathways. This seems to corroborate both the sMD and the electrostatic analyses, which indicate that CH3C
is the favorable access for NTP loading to the pre-binding registers.
3.6. CH3D Analysis
Preliminary analysis
In aMD simulations, NTPs appeared to be strongly repelled from the entrance of CH3D. Therefore,
the other channels were tested in priority and CH3D has not been thoroughly investigated. In aMD2
simulation, a GTP substrate travelled to the entrance of the channel before diffusing away.
Figure 77: NTP diffusion at CH3D entrance. The GTP molecule and its bound Mg2+ atom are shown in red
and pink respectively. The protein surface, tDNA and ntDNA are indicated in grey, light blue and cyan
respectively.
3.7. CH4 Analysis
Preliminary analysis
The CH4 opening seems to be created mainly by the ntDNA upstream section from i + 4 to i + 10. aMD
simulations were performed with a reconstructed EC whose ntDNA upstream conformation was only
satisfactory. Therefore, CH4 has not been thoroughly examined, because a non-optimal initial
conformation can bias the entire simulation behavior, all the more so because the extremities of the DNA
have to be kept immobile with restraints, which does not necessarily allow the structure to recover from
a potentially suboptimal initial conformation. A complete transcription bubble (PDB#5C4J) has been
published recently, and provides an adequate structure to investigate CH4. Investigation of
CH4 has therefore only just been started by the author and is currently in progress.
Preliminary results seem to show that the access is favorable to substrate diffusion (Figure 78 below).
In addition, i + 2 register appears to orientate most of the time towards CH4, which may be consistent
with the channel being the most favorable NTP loading route. In aMD2 simulation, a glutamate molecule
diffused inside the pre-binding cavity via CH4.
Figure 78: CH4 force-distance plot. The simulation trajectory is sMD CH4 0.075.
3.8. Misloading recovery investigation
We have discussed in chapter 1 hypotheses about how misloading recovery could occur in the CH1
model. The CH2 model appears at first glance more straightforward for proposing a misloading recovery
mechanism. If NTP substrates load via CH2, and a wrong NTP is isomerized in the catalytic site and
subsequently expelled by the TL induced fit mechanism, a new NTP can simply travel again via CH2 and
bind if correct. However, the issue is much subtler in light of the CH1 theory. If an erroneous NTP has
bound to DS registers and has been wrongly loaded to the active site, then expulsion of the
NTP via CH2 leaves, as the only option for recovery, an obligatory repositioning of the i + 1 tDNA register
inside the DS bubble. i + 1 could simply rotate toward CH1 to allow NTP reloading. In other words,
the EC may not necessarily need to be fully pre-translocated to recover from misloading. However, this
phenomenon most likely represents a short, off-pathway time window during which i + 1 stochastically
shifts toward the DS bubble. On the other hand, a full pre-translocation of the EC would allow i + 1 to
position more permanently in the DS bubble and hence would represent the on-pathway recovery state.
It therefore appears interesting to investigate the pre-translocation mechanism, because it allows
details of the critical misloading recovery process to be refined in a more general CH1 model.
aMD3, with a higher acceleration boost on the dihedral component of the forcefield potential, captured
a complete pre-translocation event. Analysis of the interplay between the enzymatic domains raises the
following observations. During the pre-translocation motion, the BH applies a force against the free i + 1
register by bending towards the catalytic site. In contrast to the post-translocation motion following
incorporation of an NTP, this register is not immobilized, because it is unbound. The aMD3 simulation
shows that when the BH starts bending and exerting pressure on the free i + 1 nucleotide, the force is
absorbed by the DNA, which begins to bend, and the force is telescoped to the i + 2 register, which undergoes
an almost 180 degree shift. i + 2 flips and pushes against the Switch 2 domain (SW2), resulting in a net motion
of RNAP towards the RNA 3’ end. While the BH bending continues, the i + 1 base flips as well and stacks
briefly against i + 2 in an inverted position, thereby assisting the push against SW2 while
further freeing the catalytic cavity. Finally, i + 1 and i + 2 return to a non-inverted position and stabilize
in the DS bubble: RNAP has pre-translocated. This mechanism is fascinating for several reasons. First,
the enzyme uses the push against i + 1 indirectly. It does not move away from i + 1, as could be
intuitively assumed; rather, the induced force is telescoped behind the initial pushing direction of
the BH, to i + 2, which pushes against SW2. Second, it is very interesting to note that RNAP utilizes the
exact same initial mechanical domain motion to carry out sliding on DNA in two opposite directions.
The key is that the same force applied by the BH is not decoupled in the same way depending on whether
the EC is in the pre-translocated or the post-translocated geometry, resulting in two net motions in opposite
directions. The BH does not push in the opposite direction from post-translocation to drive
pre-translocation.
Figure 79: Pre-translocation protein re-adjustments occurring near the active site. RNA, tDNA and BH are
shown in lime, light blue and red respectively. i + 1 and i + 2 nucleotides are indicated in yellow and orange
vdw representation respectively. A: The complex is fully post-translocated. B: BH bends and initiates a push
against i + 1 resulting in the flipping of i + 2 register. C: Downstream displacement of the enzymatic complex
is occurring, BH approaches RNA 3’ end and i + 2 register is joined by i + 1 in an inverted position. i + 1
has left the catalytic cavity. D and E: i + 1 switches to the other side of the BH, while i + 2 returns to a
non-inverted position.
Figure 80: Mechanistic basis for pre-translocation. RNA, tDNA, BH, i + 1 nucleotide, i + 2 nucleotide, Switch
1 (scRPB1: 1384-1407) and Switch 2 (scRPB1: 326-345) domains are represented in lime, light blue, red,
yellow, orange, blue and mauve respectively. A and B: while flipping into an inverted position, i + 2 applies
a push against Switch 2 domain. C: i + 1 transiently assists i + 2 pushing against Switch 2 domain, before
being channeled downstream.
4. Discussion
The gallery structure running through RNAP is intricate. In addition to CH2, five channels
have been identified. Some of them are branched, involving overlapping areas, and some constitute
sub-pathways of larger channels (e.g., CH3C). In all the simulations, melting of registers i + 2 to i + 4 has
been observed, which allows substrate pre-binding in the DS bubble. This could potentially occur in
PDB#5C4J. Diffusion across the different channels has been reasonably investigated (see next
subsection for future research to be undertaken), which makes it possible to gauge how NTP diffusion-friendly
a given pathway may be. More importantly, it makes it possible to test the CH1 model against the CH2 loading
theory. In all the investigations carried out, CH2 appears to be the worst option for substrate accommodation.
Not only do Figures 51 and 52 show that, conformationally, the corridor section of CH2 is very constricted,
being virtually completely closed for an important fraction of the time, but CH2 also tested the least
favorably when applying a pull to force an NTP through the corridor. Indeed, 0.4 kcal.mol-1.Å-2 was
required, while 0.3, 0.15, 0.075/0.04 and 0.075 kcal.mol-1.Å-2 were sufficient for travel via CH3A,
CH3B, CH3C and CH4 respectively. The electrostatic analysis, corroborated by the free glutamate
metabolite diffusion observations, indicates that the corridor section of CH2 appears more suitable for
exit diffusion and less suitable for substrate entry to the catalytic center. In addition, the sMD pull
through CH2 involved artificially maintaining the TL wide open: without this operation, the case would
most likely be worse. In contrast, the CH3C and CH4 pathways, leading directly to a pre-binding site in
the DS bubble, appear to be very credible routes of substrate diffusion and loading.
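The comparison of pulling forces made above can be laid out explicitly. The short Python sketch below ranks the pathways by the pulling force reported as sufficient in the text and expresses each value relative to CH2. For CH3C the plain sMD value (0.075) is used rather than the sMD-aMD value (0.04); the variable names are illustrative only.

```python
# Pulling forces (kcal.mol-1.Å-2) reported as required or sufficient in the text.
pulling_force = {
    "CH2": 0.4,
    "CH3A": 0.3,
    "CH3B": 0.15,
    "CH3C": 0.075,  # 0.04 when sMD is combined with the aMD boost
    "CH4": 0.075,
}

# Rank pathways from the lowest to the highest force.
ranked = sorted(pulling_force.items(), key=lambda item: item[1])

for name, force in ranked:
    ratio = pulling_force["CH2"] / force
    print(f"{name:5s} {force:6.3f}   CH2 requires {ratio:.1f} times this force")
```

On these numbers alone, CH2 requires roughly five times the force used for CH3C or CH4. The ranking says nothing about conformational gating or electrostatics, which are treated separately in the text.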
Importantly, in an aMD simulation with no bias on the reaction coordinate (aMD5), using realistic metabolite
concentrations, physiological temperature and a complete transcription bubble, a partly successful
diffusion via CH3C is observed. The NTP travelled through about half the pathway. It seems that
coordination of the incoming substrate with an additional Mg2+ ion is beneficial to diffusion and helps
traverse the energetic barrier lying at the entrance. For penetration through the channel and unsticking
from the protein walls, the [Panteva, et al., 2015B] nucleic acid forcefield parameters were switched on for the
GTP (modified 12-6-4 vdw potential for phosphate oxygen and nitrogen N7 atoms). However, in other
simulations, using these parameters on the substrates led to an increase in NTP stacking
aggregation. In other words, the utilization of the forcefield modification parameter set from [Panteva,
et al., 2015B] reduced the inconvenience of NTPs sticking unphysiologically at the entrance of CH3C,
yet the same parameters led to alternative complications such as an increase in NTP aggregation. This
underlines how complex and subtle the parameterization choices can be. sMD simulations sampled a
successful diffusion with only a small biasing force of 0.075 and even 0.04 kcal.mol-1.Å-2.
Conformational analysis shows that there is sufficient space remaining in time to accommodate diffusive
substrates. CH3C seems to be periodically gated near the end of the pathway, which has been observed
in some simulations for a short amount of time, but not in aMD5. It is hypothesized that the occasional
gating does impede substrate loading. Electrostatically, Figure 71 seems to indicate that an incoming
NTP would only experience an energetic barrier for about 3 to 4 Å at the entrance. It is interesting to
note that the substrate tends to straighten up upon approaching the CH3C entry, and then undergo a rotation
of the polyphosphate tail bound to two Mg2+ ions in the direction of the channel. This mechanism could
involve an alignment of the NTP dipole moment with the local electrostatic field, could involve
ionic screening of the electrostatic field by MgB, or could simply place the more positively charged
part ahead. This phenomenon could permit diffusional attack along CH3C and help overcome the small
negative barrier. Adding credibility to CH3C being an input channel is the observation that glutamate
molecules loaded through the pathway at great speed in aMD2, 3, 4 and 5. CH3C seems overall favorable
for substrate input: accessibility dimensions are wide, electrostatic configuration is globally neutral or
assisting.
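To give a rough sense of the magnitude of the sMD bias quoted above, the following sketch evaluates a harmonic steering restraint for the two constants mentioned (0.075 and 0.04 kcal.mol-1.Å-2). The functional form E = k (x - x0)^2 without a factor 1/2, the function names and the 2 Å lag are assumptions made for illustration only; they are not taken from the simulation inputs.

```python
def smd_restraint_energy(k, x, x0):
    """Harmonic steering energy in kcal/mol, assuming E = k * (x - x0)**2."""
    return k * (x - x0) ** 2


def smd_restraint_force(k, x, x0):
    """Force on the pulled coordinate in kcal/mol/Å: F = -dE/dx = -2k(x - x0)."""
    return -2.0 * k * (x - x0)


if __name__ == "__main__":
    # Hypothetical lag (Å) of the NTP behind the moving restraint point.
    lag = 2.0
    for k in (0.075, 0.04):
        energy = smd_restraint_energy(k, x=0.0, x0=lag)
        force = smd_restraint_force(k, x=0.0, x0=lag)
        print(f"k = {k}: E = {energy:.3f} kcal/mol, F = {force:.3f} kcal/mol/Å")
```

Under these assumptions, the bias energy for a 2 Å lag stays below the thermal energy kT, which is about 0.6 kcal/mol near physiological temperature.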
Only one aMD simulation captured a partly successful diffusion across CH3C. The most likely
explanation is that not enough simulation time was sampled overall. If for the sake of the argument we
assume that a physiological diffusion is very rapid and should be observed in a few nanoseconds, several
hypotheses can be put forward as to why a complete successful diffusion via CH3C has not been
observed in the five relatively short aMD simulations. A first assumption is that forcefield parameters
are suboptimal. In particular, the parameters of the NTP base moiety seem questionable. In the aMD5
simulation, the NTP base group tends to stick against protein walls and slow down diffusion via CH3C.
Furthermore, it has been observed that even with the adoption of the 12-6-4 potential, NTPs still tend
to stick to protein surface walls and to periodically form aggregates by stacking interactions. It is
possible that correctly modelling diffusion that involves nucleic acids and nucleic-acid-like NTPs
would require the use of a polarizable forcefield. It has indeed been suggested that polarizable forcefields
are required to correctly model a system containing nucleic acids [Baker, et al., 2011; Lindert, et al.,
2013]. There is also the issue of the parameters of the highly charged, NTP-bound Mg2+ ion, which may still
not be optimal despite the 12-6-4 vdw potential. Hence, it is possible that aMD simulations did not allow
diffusion to converge adequately. A second assumption is that the forcefield parameters were
relatively correct, but that the short timescales accessible in MD simulations (the aMD boost only increases
diffusion by about 3-fold) did not allow sufficient sampling, so that substrates did not explore the
optimal pathway fast enough. It might take time for an NTP to be positioned randomly at a favorable
diffusion entry window through CH3C, which would explain why it was only observed in one simulation.
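The aMD boost referred to above can be illustrated with the standard functional form introduced in [Hamelberg, et al., 2004]: when the potential V lies below a threshold E, a boost dV = (E - V)^2 / (alpha + E - V) is added. The sketch below applies this to a single energy term; the threshold, the alpha value and the sample energies are illustrative placeholders and not the parameters of aMD1 to aMD5.

```python
def amd_boost(v, e_threshold, alpha):
    """Boost energy dV (kcal/mol) added when the potential v lies below the threshold."""
    if v >= e_threshold:
        return 0.0
    gap = e_threshold - v
    return gap * gap / (alpha + gap)


def amd_modified_potential(v, e_threshold, alpha):
    """Modified potential V* = V + dV sampled by the accelerated simulation."""
    return v + amd_boost(v, e_threshold, alpha)


if __name__ == "__main__":
    e_threshold, alpha = -90.0, 10.0  # hypothetical values, kcal/mol
    for v in (-100.0, -95.0, -90.0, -85.0):
        dv = amd_boost(v, e_threshold, alpha)
        v_star = amd_modified_potential(v, e_threshold, alpha)
        print(f"V = {v:7.1f}  dV = {dv:5.2f}  V* = {v_star:7.2f}")
```

A smaller alpha flattens the modified surface towards the threshold and therefore gives a more aggressive acceleration, which is the trade-off to keep in mind when choosing boost parameters.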
Although CH4 has not been fully investigated at this stage, preliminary analysis seems to indicate that
the pathway is a very credible route of NTP loading to the DS bubble as well. It might even represent
the default mode of substrate loading, since i + 2 register appears to favor orientation towards CH4 (data
not shown).
Several hypotheses can be raised about how downstream pre-bound substrates can be stabilized in the
DS bubble in time, until their loading into the catalytic cavity. One assumption is that stacking
interactions between the adjacent NTP-dNMP pairs or involving DS DNA nucleotides in CH1 might
help their hybridization resist thermal fluctuations. Another hypothesis is that FL2,
directly contacting the ntDNA i + 2 register, may help stabilize DS DNA and indirectly the pre-bound rNTP
at the tDNA i + 2 position. In [Kireeva, et al., 2011], the authors propose that in addition to playing a role in
promoting the isomerization of the active site, FL2 might contribute to the resilience of DS DNA to
thermal fluctuations.
Concerning the electrostatic analyses performed, the following limitations may be noted. First, the true
electrostatic configuration of an NTP-MgB substrate consists of a distribution of partial charges in
space, and modelling the molecule as a simple point charge of -2 along a diffusive path is a
simplification. This might erase details about the spatial positioning of the NTP relative to the protein
structure during diffusion, a positioning which may help optimize the diffusional attack along a given pathway. Second,
vdw interactions have not been taken into account in the calculations and might affect the diffusion
characteristics of the channels. Finally, an NTP might undergo coordination with protein walls by
temporarily binding to the enzyme surface. The stochastic tilting of the protein region coordinating
the NTP could then help push the substrate through a pathway section. The latter phenomenon could
contribute to crossing small energetic barriers, notably the one lying at the entrance of CH3C.
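The point-charge simplification discussed above can be made concrete with a toy calculation: a probe of charge -2 is moved along a path and its Coulomb energy with a set of fixed charges is summed at each step. This is only a sketch under stated assumptions; the plain Coulomb sum, the uniform dielectric constant of 80, the approximate conversion factor and the two-charge environment are all hypothetical and do not reproduce the method or the data behind Figure 71.

```python
import math

# Approximate Coulomb conversion factor in kcal.Å/(mol.e^2).
COULOMB_KCAL = 332.06


def coulomb_energy(probe_charge, probe_pos, fixed_charges, dielectric=80.0):
    """Sum the Coulomb energy (kcal/mol) between a point probe and fixed point charges."""
    total = 0.0
    for charge, pos in fixed_charges:
        r = math.dist(probe_pos, pos)  # distance in Å
        total += COULOMB_KCAL * probe_charge * charge / (dielectric * r)
    return total


def energy_profile(probe_charge, path, fixed_charges, dielectric=80.0):
    """Evaluate the probe energy at each point of a path (list of xyz tuples)."""
    return [coulomb_energy(probe_charge, p, fixed_charges, dielectric) for p in path]


if __name__ == "__main__":
    # Hypothetical toy environment: one acidic (-1) site near the entrance,
    # one basic (+1) site further along the path.
    fixed = [(-1.0, (0.0, 3.0, 0.0)), (1.0, (10.0, -3.0, 0.0))]
    path = [(float(x), 0.0, 0.0) for x in range(0, 12, 2)]
    for point, u in zip(path, energy_profile(-2.0, path, fixed)):
        print(f"x = {point[0]:4.1f} Å  U = {u:6.3f} kcal/mol")
```

Replacing the single probe by the full set of substrate partial charges, each with its own position, would be the natural next step to recover the orientation dependence that the point-charge model erases.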
As far as the misloading recovery investigation is concerned, simulation results show that when pre-translocation
occurs to rescue an unbound i + 1 register, the latter register quickly repositions at the i + 2
position inside the DS bubble, where it becomes available for pairing via CH3C. Both the literature (e.g.,
[Dangkulwanich, et al., 2013]) and the observation of a rapid pre-translocation event in the absence of an i
+ 1 NTP support the following reasoning: since the EC necessarily oscillates if the i + 1 position is unbound, and hence
if NTPs are not loaded immediately following the previous nucleotide incorporation, and since the EC does
not seem to oscillate in normal on-pathway elongation, NTPs must be pre-bound
in normal elongation. In other words, it appears that the only way to prevent rapid spontaneous
pre-translocation from occurring (which does not seem to occur in fast elongation) is to have the EC
immediately locked from one incorporation event to the next, which requires that the next NTP is already
pre-bound to i + 2, resulting in the instantaneous fixing of the EC following the transition between two
incorporations. In addition to all its conceptual drawbacks, the CH2 model does not resolve the
latter issue, whereas the CH1 pre-binding mechanism fits perfectly.
In summary, a general model of substrate delivery, linked to translocation, is proposed in the figures
hereafter.
Figure 81: Schematic representation of EC-RNAP coordination with substrate diffusion trajectory. The
figure depicts an NTP that is complementary to the i + 2 binding site accessible in the downstream bubble,
reaching the CH3 (via CH3C) or CH4 side of the DS DNA helix pre-binding region. RNAP is represented
as a grey train sliding along a DNA frame. tDNA (upper strand) and ntDNA (bottom strand) are represented
as chains of connecting lozenges. RNA strand is constituted of connecting stars, and is extruded through the
RNA exit channel. NTPs are shown as triangles. The cyan, orange, blue and purple colors represent
indistinctly the four bases or NTP types possible. CH1, CH2, CH3, CH4, Switch 2 domain (SW2) and BH
are indicated. The MgA/MgB binding sites are represented by the metallic border fixing NTP number 4 in
the active site. The enzymatic process is simplified by shortening the real length of the nucleic acids, by
representing the downstream binding region by only one available register (only i + 2 is considered and i +
3/4 are ignored), and by radically separating CH1, CH2 and CH3 from each other for visualization purposes.
Also, in reality CH3 and CH4 reach the DS bubble from different directions and are not juxtaposed to each
other.
Figure 82: Schematic representation of on-pathway state 1. While i + 1 NTP is undergoing catalysis in the
active site, i + 2 substrate diffuses via CH4 and binds to i + 2. Until the NTP in the active site has
undergone the chemical reaction incorporating it into the RNA transcript, the EC is notably immobilized by i
+ 1 NTP binding to MgB and MgA sites.
Figure 83: Schematic representation of on-pathway state 2. i + 1 NTP is incorporated at the RNA 3’ end and
PPi-MgB (represented by a small silver ball) is expelled through CH2. MgB site interaction is eliminated,
MgA site interaction is loosened. RNAP is free to move forwards, but not backwards due to the steric block
induced by the RNA 3’ end.
Figure 84: Schematic representation of on-pathway state 3. BH bends and applies a force against the RNA 3’ end,
initiating post-translocation.
Figure 85: Schematic representation of on-pathway state 4. RNAP has undergone post-translocation along
the DNA frame, resetting the nucleotide addition cycle one increment forward. i + 2 NTP is now at i + 1
position. A new NTP diffuses through CH4 and binds to i + 2.
Figure 86: Schematic representation of off-pathway state 1. A wrong NTP has been loaded into the active
site (through wrong pre-binding to i + 2 and subsequent loading to the catalytic center).
Figure 87: Schematic representation of off-pathway state 2. The mismatched NTP is expelled through CH2
via the TL induced-fit mechanism (second layer of nucleotide discrimination).
Figure 88: Schematic representation of off-pathway state 3. BH bends and initiates a push against the free
i + 1 base. tDNA i + 2 nucleotide flips around and initiates a strong push against Switch 2 domain.
Figure 89: Schematic representation of off-pathway state 4. BH bending is further decoupled as a force
pushing against the Switch 2 domain via the flipping of i + 2 and i + 1 bases, thereby driving pre-translocation.
Figure 90: Schematic representation of off-pathway state 5. Resulting from the force applied against Switch
2 domain, RNAP pre-translocates. i register position of the RNA-tDNA hybrid enters the catalytic cavity. i
+ 1 tDNA register repositions at i + 2 location in the downstream channel, where it is available for binding
a new (matched) NTP. RNAP EC has been reset one step backwards.
Figure 91: Schematic representation of off-pathway state 6. BH bends and applies a force against the hybrid,
thereby initiating post-translocation.
Figure 92: Schematic representation of off-pathway state 7. RNAP has post-translocated, i + 2 NTP has
loaded into the active site. i + 1 register is now bound to the right NTP. RNAP EC has been rescued.
5. Future Works
An RNAP structure containing a complete EC has recently been published (PDB#5C4J, [Barnes, et al.,
2015]). Repeating all the work presented in this section with this system is proposed as a priority
for future work, because the path of the nucleic acid strands is optimal compared to a reconstructed EC.
Because the CH1 theory is very controversial, it is a good idea to use the least debatable starting
system possible. Comparison with the reconstructed EC shows that the DNA positions are almost
identical, with a slight difference in the ntDNA trajectory between registers i + 4 and i - 11. Work has been
initiated with 5C4J, where the ntDNA adopts a slightly different conformation, which could improve
diffusion via CH4 and possibly via CH3C. Preliminary simulations appear to confirm the availability of
DS registers, where i + 3 and i + 2 are often in the melted state and i + 4 is in transient association.
Future tasks are also to be pursued with sMD. Accessibility of the pathways can change drastically over
time. Hence, executing sMD runs from different starting pathway conformations could better
characterize how diffusion-friendly a pathway may be. The aMD parameters that were used in combination
with sMD (sMD CH3C 0.04) were very aggressive, and repeating sMD-aMD runs with a moderate
acceleration could be more suitable. For example, a total boost acceleration that is too high can distort
the solvent. Overall, more aMD parameters and sMD forces are to be tested in future research.
Furthermore, sMD, electrostatic, cross section area and minimal radius analyses are to be carried out for
CH3D and CH4, which have not been fully investigated.
A number of options can be explored to improve the sampling of substrate loading in MD simulations.
Raising the temperature does not seem adequate, as diffusion is a subtle process, and increasing the
thermal energy could modify, for example, the conformation of the channels. Modelling the NTP as a
sphere could help tackle the issue of substrates sticking to the protein walls, yet details of the
diffusion process would be lost. A more promising approach could be to repeat the aMD simulations
using only CTPs. A hypothesis is that CTP would diffuse faster because it is smaller than GTP, and would
have an enhanced chance of successfully binding because it forms the G-C base pair, which is
stronger than the A-T pair. At this stage, aMD simulations with PDB#5C4J and 5.9 mM CTPs have
been started. Another direction for future work could consist of providing the solvent box with higher Mg2+
concentrations, because the binding of a second magnesium ion to the NTP substrate balances the overall
electrostatic potential of the molecule. In the partly successful aMD run, the NTP diffusing through
CH3C is bound to an additional Mg2+ ion. This could, however, raise new issues. There are, for example,
question marks about the possibility of catalysis of a loaded NTP coordinated to a second magnesium
ion. A strategy to increase the probability of simulating a complete successful diffusion, without
biasing the reaction-coordinate (such as in sMD), could be to increase NTP concentrations. Long-lived
unproductive stacking aggregations were observed in simulations with a concentration of 5.9 mM.
Hence, multiplying the number of rNTPs in the solvent box could at first glance appear detrimental.
Nevertheless, such an inconvenience seems to be greatly reduced in preliminary simulations with an
alternative set of NTP parameters (provided by Prof. R. Amaro from UCSD) applied to CTP molecules,
without using [Panteva, et al., 2015B] modifications, and with aMD3 acceleration parameters (larger
dihedral boost). Hence, future work could consider adding more substrates into the simulation box,
with the use of well-reasoned parameters, and with a high dihedral boost. Next, glutamate and sulfate
metabolites displayed a tendency to bind to MgB in aMD simulations, thereby increasing the negative
potential of the substrate. Adjusting the metabolite content, for example by reducing the glutamate and
sulfate concentrations, might increase the probability of sampling successful diffusions. Finally, Markov
State modelling could be explored, where several short aMD simulations (e.g., 20 ns) are run to map the
reaction-coordinate probability distribution.
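The Markov State modelling idea mentioned above can be sketched in its simplest form: several short trajectories are discretised into states along the loading pathway, transitions at a fixed lag are counted, and the row-normalised counts give a transition matrix from which long-time populations can be estimated. The state definitions, the toy trajectories and the function names below are hypothetical, and the sketch omits steps a real analysis would need, such as reweighting of aMD frames and validation of the lag time.

```python
def count_transitions(trajectories, n_states, lag=1):
    """Count i -> j transitions separated by `lag` frames in each trajectory."""
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for start in range(len(traj) - lag):
            counts[traj[start]][traj[start + lag]] += 1
    return counts


def transition_matrix(counts):
    """Row-normalise the counts; a never-visited state is kept as a self-loop."""
    matrix = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            matrix.append([1.0 if j == i else 0.0 for j in range(len(row))])
        else:
            matrix.append([c / total for c in row])
    return matrix


def stationary_distribution(matrix, n_iter=1000):
    """Approximate the stationary distribution by repeated propagation p <- pT."""
    n = len(matrix)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        p = [sum(p[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return p


if __name__ == "__main__":
    # Hypothetical discretised runs:
    # 0 = bulk solvent, 1 = CH3C entrance, 2 = pre-binding site.
    runs = [[0, 0, 1, 0, 1, 1, 2, 2], [0, 1, 1, 2, 1, 0, 0, 1]]
    t = transition_matrix(count_transitions(runs, n_states=3))
    for row in t:
        print([round(x, 2) for x in row])
    print([round(x, 2) for x in stationary_distribution(t)])
```

The appeal of this design is that no single run has to capture a complete loading event: many short runs started from different points along the pathway each contribute local transition counts.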
Using a polarizable forcefield such as AMOEBA [Shi, et al., 2013] could be necessary to correctly
model a system containing nucleic acids and highly charged substrates. At this stage, such forcefields
are still in development and lack an important range of parameters for metabolites and nucleic acids.
Also, using a polarizable forcefield increases simulation time by about 10-fold. Developments in the
electronics industry, in particular increasingly powerful GPUs, might allow sufficient sampling
time in the future.
Additional analyses to be performed could include examining the dipole alignment of the NTP with the
local electrostatic field, which could reduce the diffusive degrees of freedom, and monitoring the water
flow across the channel, which could be partly directional and impact input/output of substrates.
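The dipole alignment analysis proposed above could be prototyped as follows: the dipole moment of the substrate is computed from its partial charges, and the cosine of the angle between this dipole and a local field vector is monitored along the trajectory. For a molecule carrying a net charge the dipole depends on the chosen origin; taking the centre of geometry as origin is an assumption of this sketch, as are the function names and the two-site toy molecule.

```python
import math


def dipole_moment(charges, coords):
    """Dipole vector (e.Å) of point charges, relative to their centre of geometry."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    return [
        sum(q * (c[k] - center[k]) for q, c in zip(charges, coords))
        for k in range(3)
    ]


def alignment_cosine(dipole, field):
    """Cosine of the angle between dipole and field (+1 aligned, -1 anti-aligned)."""
    dot = sum(d * f for d, f in zip(dipole, field))
    norm = math.sqrt(sum(d * d for d in dipole)) * math.sqrt(sum(f * f for f in field))
    if norm == 0.0:
        raise ValueError("dipole and field must be non-zero vectors")
    return dot / norm


if __name__ == "__main__":
    # Hypothetical two-site molecule: negative tail at the origin, positive head at x = 1 Å.
    mu = dipole_moment([-1.0, 1.0], [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
    for field in ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-2.0, 0.0, 0.0)):
        print(field, round(alignment_cosine(mu, field), 3))
```

Tracking this cosine frame by frame as the substrate approaches the channel entrance would show whether its orientation is progressively restricted, which is the reduction in diffusive degrees of freedom hypothesised in the text.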
6. Conclusions
The substrate diffusion and loading mechanism to the active site of RNAP has many fundamental
implications concerning matters such as nucleotide discrimination, translocation and the general
sequential orchestration of the enzyme. The molecular architecture of the enzyme is very complex and
structural characteristics have been overlooked, such as the existence of several additional pathways
connecting the inside of the enzyme to the solvent. The secondary channel has been erroneously
considered as being the only unobstructed path of substrate diffusion. We propose that the widespread
CH2 theory about nucleotide triphosphate diffusion should be rejected, because the evidence
supporting the theory does not withstand scrutiny. The channel does not seem suitable, either
conformationally or electrostatically, to accommodate rapid input of substrates; moreover, fast diffusion
through the pathway is not supported by aMD and sMD simulations. The pathway raises conceptual
issues, such as bottleneck roadblocking, where successive substrates must halt in front of a narrow
section, the “corridor”, until wrong alternative substrates bound at the E site or the A site diffuse away,
in order to eventually load to the active site to check if they are matched to the DNA base to be
transcribed. The alternative main channel model, on the other hand, initially proposed on the basis of
kinetics experiments whose evidence was sometimes overlooked, is fully supported by the research
presented in this thesis. An aMD simulation, using realistic conditions, such as a full nucleic acid EC
and physiological concentration of metabolites, captured the initial diffusion process of a nucleotide
travelling through an alternative channel, termed the tertiary channel and leading to a pre-binding region
in the main channel. In particular, a specific potential loading path through the tertiary channel, CH3C,
is supported by conformational, electrostatic and sMD analysis. An alternative pathway, CH4, has been
identified, and also seems to be a credible route of substrate diffusion to CH1. The following general
mechanism of NTP loading is proposed. Nucleotide substrates diffuse via CH3C or CH4. The last quarter
of the CH3C path is sometimes stochastically gated by scRPB2: 204-206, in which case incoming NTPs
would temporarily halt in the CH3 bubble adjacent to DNA or diffuse away until a favorable time window
occurs. They then reach a pre-binding region where i + 2 and i + 3 tDNA registers are predominantly
melted and i + 4 is sometimes available. They bind to the latter registers if they are complementary or
diffuse away and exit the protein. Stacking interactions between multiple pre-bound substrates, between
NTP-dNMPs and DS DNA or interaction of FL2 with ntDNA i + 2 position, might facilitate their
stabilization in the DS bubble. Finally, the pre-bound substrates are loaded sequentially into the active
site, when post-translocation advances the enzymatic complex one tDNA base forward to incorporate
the next nucleotide. Although CH2 does not seem to serve the function of substrate input, we propose
that it is an excellent output pathway, where misloaded substrates and the by-product of the elongation
reaction are expelled. Additional functions of the secondary channel are to serve as a TF binding site (TFIIS for
eukaryotic RNAP II and GreA/B for bacterial RNAP), as a possible transient binding site for RNA during
pause-arrest, and as a site for RNA backtracking. Subsidiary conclusions are the following. NTP loading is
not rate limiting at non-subsaturating concentrations because CH3C/CH4 allow fast substrate input,
and most importantly because, while the i + 1 NTP undergoes incorporation, DS registers have a
substantial time window to bind substrates without impacting the on-pathway kinetics. The latter
considerations are consistent with the very high elongation speeds measured in studies. The first layer of
nucleotide discrimination is performed directly in the downstream bubble, prior to NTP loading, and the
catalytic site only concerns the second layer of selection, notably involving the TL induced-fit
mechanism. We complete the model of substrate loading by suggesting that misloading recovery is
performed in three steps: the mismatched substrate at the i + 1 register is expelled through CH2, the enzyme then
pre-translocates via BH induced nucleotide flip against Switch 2 domain and the register is reset for
base-pairing in the downstream bubble where it becomes available again to a CH3C/CH4 diffusing
rNTP. Finally, we note that the main channel model has several fundamental implications concerning
the manner in which translocation, the central mechanism underlying elongation, proceeds. The standard
Brownian ratchet model is most likely partly incorrect, in that the EC does not necessarily oscillate.
Immediate loading of pre-bound nucleotides via the main channel during translocation is perfectly in line
with a model of forward translocation locking during normal elongation, which is consistent with recent
studies indicating that translocation would not oscillate when substrates are supplied at sufficient
concentrations. RNAP can be seen as a factory assembly line where substrates are lined up inside the enzyme
before undergoing catalysis. The enzymatic machine orchestrating genetic transcription truly is a
masterpiece of engineering.
References
Abbondanzieri, E., et al., Direct observation of base-pair stepping by RNA polymerase, Nature, Vol.
438, 460-465 (2005)
Allner, O., et al., Magnesium Ion−Water Coordination and Exchange in Biomolecular Simulations,
Chem. Theory Comput., Vol. 8, 1493−1502 (2012)
Andrecka, J., et al., Nano positioning system reveals the course of upstream and nontemplate DNA
within the RNA polymerase II elongation complex, Nucleic Acids Research, Vol. 37, 1–7 (2009)
Aqvist, J., A Simple Way to Calculate the Axis of an α-Helix, Computers & Chemistry, Vol. 10, 97-99
(1986)
Aqvist, J., Ion-Water Potentials Derived from Free Energy Perturbation Simulations, J. Phys. Chem.,
Vol. 94, 8021-8024 (1990)
Arino, J., et al., Alkali Metal Cation Transport and Homeostasis in Yeasts, Microbiol. Mol. Biol. Rev.,
Vol. 74, 95–120 (2010)
Armache, K.-J., et al., Architecture of initiation-competent 12-subunit RNA polymerase II, PNAS, Vol.
100, 6964–6968 (2003)
Auesukaree, C., et al., Intracellular Phosphate Serves as a Signal for the Regulation of the PHO Pathway
in Saccharomyces cerevisiae, J. Biol. Chem., Vol. 279, 17289–17294 (2004)
Bai, L., et al., Sequence-dependent Kinetic Model for Transcription Elongation by RNA Polymerase, J.
Mol. Biol., Vol. 344, 335-349 (2004)
Bai, L., et al., Mechanochemical Kinetics of Transcription Elongation, Physical Review Letters, Vol.
98, 068103-1-068103-4 (2007)
Baker, C., M., et al., Development of CHARMM polarizable force field for nucleic acid bases based on
the classical Drude oscillator model, J. Phys. Chem. B, Vol. 115, 580-596 (2011)
Bansal, M., et al., HELANAL: A Program to Characterize Helix Geometry in Proteins, Journal of
Biomolecular Structure & Dynamics, Vol. 17, 811-819 (2012)
Bar-Nahum, G., et al., A Ratchet Mechanism of Transcription Elongation and Its Control, Cell, Vol.
120, 183-193 (2005)
Barnes, C., O., et al., Crystal Structure of a Transcribing RNA Polymerase II Complex Reveals a
Complete Transcription Bubble, Molecular Cell, Vol. 59, 258–269 (2015)
Batada, N., et al., Diffusion of nucleoside triphosphates and role of the entry site to the RNA polymerase
II active center, PNAS, Vol. 101, 17361-17364 (2004)
Beauchamp, K., A., et al., Are Protein Force Fields Getting Better? A Systematic Benchmark on 524
Diverse NMR Measurements, J. Chem. Theory Comput., Vol. 8, 1409-1414 (2012)
Belogurov, G., A., et al., Transcription inactivation through local refolding of the RNA polymerase
structure, Nature, Vol. 457, 332-336 (2009)
Best, R., B., et al., Are Current Molecular Dynamics Force Fields too Helical?, Biophysical Journal:
Biophysical Letters, Vol. 95, L07-L09 (2008)
Bochkareva, A., et al., Factor-independent transcription pausing caused by recognition of the RNA–
DNA hybrid sequence, The EMBO Journal, Vol. 31, 630–639 (2012)
Boer, V., M., Growth-limiting Intracellular Metabolites in Yeast Growing under Diverse Nutrient
Limitations, Molecular Biology of the Cell Vol. 21, 198–211 (2010)
Brueckner, F., Cramer, P., Structural basis of transcription inhibition by α-amanitin and implications for
RNA polymerase II translocation, Nature Structural & Molecular Biology, Vol. 15, 811-816 (2008)
Brueckner, F., et al., A movie of the RNA polymerase nucleotide addition cycle, Current Opinion in
Structural Biology, Vol. 19, 294-299 (2009)
Bucher, D., et al., Accessing a Hidden Conformation of the Maltose Binding Protein Using Accelerated
Molecular Dynamics, PLoS Computational Biology, Vol. 7, e1002034 (2011A)
Bucher, D., et al., On the Use of Accelerated Molecular Dynamics to Enhance Configurational Sampling
in Ab Initio Simulations, J. Chem. Theory Comput., Vol. 7, 890–897 (2011B)
Burton, Z., F., et al., NTP-driven translocation and regulation of downstream template opening by multi-
subunit RNA polymerases, Biochem. Cell Biol., Vol. 83, 486–496 (2005)
Bushnell, D., A., et al., Structural basis of transcription: α-Amanitin–RNA polymerase II cocrystal at
2.8 Å resolution, PNAS, Vol. 99, 1218–1222 (2002)
Bushnell, D., A., Kornberg, R., D., Complete, 12-subunit RNA polymerase II at 4.1-Å resolution:
Implications for the initiation of transcription, PNAS, Vol. 100, 6969–6973 (2003)
Bushnell, D., A., et al., Structural Basis of Transcription: An RNA Polymerase II-TFIIB Cocrystal at
4.5 Angstroms, Science, Vol. 303, 983-988 (2004)
Camacho, M., et al., Potassium requirements of Saccharomyces cerevisiae, Current Microbiology, Vol.
6, 295-299 (1981)
Canelas, A., B., et al., Leakage-free rapid quenching technique for yeast metabolomics, Metabolomics,
Vol. 4, 226–239 (2008A)
Canelas, A., B., et al., Determination of the cytosolic free NAD/ NADH ratio in Saccharomyces
cerevisiae under steady-state and highly dynamic conditions, Biotechnol Bioeng, Vol. 100, 734–743
(2008B)
Cannon, W., R., et al., Sulfate Anion in Water: Model Structural, Thermodynamic, and Dynamic
Properties, J. Phys. Chem., Vol. 98, 6225-6230 (1994)
Case, D., A., et al., AMBER 2016, University of California, San Francisco (2016)
Cheung, A., C., M., Cramer, P., Structural basis of RNA polymerase II backtracking, arrest and
reactivation, Nature, Vol. 471, 249-253 (2011)
Cheung, A., C., M., Cramer, P., A Movie of RNA Polymerase II Transcription, Cell, Vol. 149, 1431-
1437 (2012)
Christopher, J., A., Swanson, R., et al., Algorithms for Finding the Axis of a Helix: Fast Rotational and
Parametric Least-Squares Methods, Computers Chem., Vol. 20, 339-345 (1996)
Chovancova, E., et al., CAVER 3.0: A Tool for the Analysis of Transport Pathways in Dynamic Protein
Structures, PLoS Computational Biology, Vol. 8, e1002708 (2012)
Cino, E., A., et al., Comparison of Secondary Structure Formation Using 10 Different Force Fields in
Microsecond Molecular Dynamics Simulations, J. Chem. Theory Comput., Vol. 8, 2725-2740 (2012)
Conaway, R., C., et al., TFIIS and GreB: Two Like-Minded Transcription Elongation Factors with
Sticky Fingers, Cell, Vol. 114, 272-274 (2003)
Cramer, P., et al., Architecture of RNA Polymerase II and Implications for the Transcription
Mechanism, Science, Vol. 288, 640-649 (2000)
Cramer, P., et al., Structural Basis of Transcription: RNA Polymerase II at 2.8 Ångstrom Resolution,
Science, Vol. 292, 1863-1876 (2001)
Da, L.-T., et al., Dynamics of Pyrophosphate Ion Release and Its Coupled Trigger Loop Motion from
Closed to Open State in RNA Polymerase II, J. Am. Chem. Soc., Vol. 134, 2399−2406 (2011)
Da, L.-T., et al., A Two-State Model for the Dynamics of the Pyrophosphate Ion Release in Bacterial
RNA Polymerase, PLOS Computational Biology, Vol. 9, 1-9 (2013)
Dalton, J., A., R., et al., Calculation of helix packing angles in protein structures, Bioinformatics, Vol.
19, 1298-1299 (2003)
Damsma, G., E., et al., Mechanism of transcriptional stalling at cisplatin-damaged DNA, Nature
Structural & Molecular Biology, Vol. 14, 1127-1133 (2007)
Dangkulwanich, M., et al., Complete dissection of transcription elongation reveals slow translocation
of RNA polymerase II in a linear ratchet mechanism, eLife, Vol. 2, 1-22 (2013)
Davenport, R., J., et al., Single-Molecule Study of Transcriptional Pausing and Arrest by E. coli RNA
Polymerase, Science, Vol. 287, 2497-2500 (2000)
de Oliveira, C., A., F., et al., On the Application of Accelerated Molecular Dynamics to Liquid Water
Simulations, J. Phys. Chem. B, Vol. 110, 22695-22701 (2006)
de Oliveira, C., A., F., et al., Large-Scale Conformational Changes of Trypanosoma cruzi Proline
Racemase Predicted by Accelerated Molecular Dynamics Simulation, PLoS Computational Biology,
Vol. 7, e1002178 (2011)
Domecq, C., et al., Site-directed mutagenesis, purification and assay of Saccharomyces cerevisiae RNA
polymerase II, Protein Expression and Purification, Vol. 69, 83-90 (2010)
Doshi, U., Hamelberg, D., Achieving Rigorous Accelerated Conformational Sampling in Explicit
Solvent, J. Phys. Chem. Lett., Vol. 5, 1217-1224 (2014)
Duan, B., et al., A Critical Residue Selectively Recruits Nucleotides for T7 RNA Polymerase
Transcription Fidelity Control, Biophysical Journal, Vol. 107, 2130-2140 (2014)
Eastman, P., et al., OpenMM 4: A Reusable, Extensible, Hardware Independent Library for High
Performance Molecular Simulation, J. Chem. Theory Comput., Vol. 9, 461-469 (2013)
Eastman, P., Pande, V., S., Constant Constraint Matrix Approximation: A Robust, Parallelizable
Constraint Method for Molecular Simulations, J. Chem. Theory Comput., Vol. 6, 434-437 (2010A)
Eastman, P., Pande, V., S., Efficient Nonbonded Interactions for Molecular Dynamics on a Graphics
Processing Unit, J. Comput. Chem., Vol. 31, 1268–1272 (2010B)
Enkhbayar, P., Damdinsuren, S., et al., HELFIT: Helix fitting by a total least squares method,
Computational Biology and Chemistry, Vol. 32, 307-310 (2008)
Erie, D., A., Kennedy, S., R., Forks, pincers, and triggers: the tools for nucleotide incorporation and
translocation in multi-subunit RNA polymerases, Current Opinion in Structural Biology, Vol. 19, 708-
714 (2009)
Eun, C., et al., Molecular Dynamics Simulation Study of Conformational Changes of Transcription
Factor TFIIS during RNA Polymerase II Transcriptional Arrest and Reactivation, PLOS ONE, Vol. 9,
1-8 (2014)
Feig, M., Burton, Z., F., RNA Polymerase II with Open and Closed Trigger Loops: Active Site
Dynamics and Nucleic Acid Translocation, Biophysical Journal, Vol. 99, 2577-2586 (2010)
Foster, J., E., et al., Allosteric Binding of Nucleoside Triphosphates to RNA Polymerase Regulates
Transcription Elongation, Cell, Vol. 106, 243–252 (2001)
Fouqueau, T., et al., The RNA polymerase trigger loop functions in all three phases of the transcription
cycle, Nucleic Acids Research, Vol. 41, 7048-7059 (2013)
Frenkel, D., Smit, B., Understanding Molecular Simulation, From Algorithms to Applications,
Academic Press, San Diego, USA (2002)
Friedrichs, M., S., et al., Accelerating Molecular Dynamic Simulation on Graphics Processing Units, J.
Comput. Chem., Vol. 30, 864-872 (2009)
Fu, J., et al., Yeast RNA Polymerase II at 5 Å Resolution, Cell, Vol. 98, 799–810 (1999)
Gnatt, A., L., et al., Structural Basis of Transcription: An RNA Polymerase II Elongation Complex at
3.3 Å Resolution, Science, Vol. 292, 1876-1882 (2001)
Gong, X., et al., Dynamic Error Correction and Regulation of Downstream Bubble Opening by Human
RNA Polymerase II, Molecular Cell, Vol. 18, 461–470 (2005)
Gonzalez, B., et al., Dynamic in vivo 31P nuclear magnetic resonance study of Saccharomyces cerevisiae
in glucose-limited chemostat culture during the aerobic-anaerobic shift, Yeast, Vol. 16, 483-497 (2000)
Grant, B., J., et al., Ras Conformational Switching: Simulating Nucleotide- Dependent Conformational
Transitions with Accelerated Molecular Dynamics, PLoS Computational Biology, Vol. 5, e1000325
(2009)
Graschopf, A., et al., The Yeast Plasma Membrane Protein Alr1 Controls Mg2+ Homeostasis and Is
Subject to Mg2+ -dependent Control of Its Synthesis and Degradation, The Journal of Biological
Chemistry, Vol. 276, 16216-16222 (2001)
Greive, S., J., von Hippel, P. H., Thinking Quantitatively About Transcriptional Regulation, Nature
Reviews Molecular Cell Biology, Vol. 6, 221-232 (2005)
Guajardo, R., Sousa, R., A Model for the Mechanism of Polymerase Translocation, J. Mol. Biol., Vol.
265, 8-19 (1997)
Guo, Q., Sousa, R., Translocation by T7 RNA Polymerase: A Sensitively Poised Brownian Ratchet, J.
Mol. Biol., Vol. 358, 241-254 (2006)
Hamelberg, D., et al., Accelerated molecular dynamics: A promising and efficient simulation method
for biomolecules, The Journal of Chemical Physics, Vol. 120, 11919-11929 (2004)
Hamelberg, D., et al., Sampling of slow diffusive conformational transitions with accelerated molecular
dynamics, The Journal of Chemical Physics, Vol. 127, 155102-155110 (2007)
Hans, M., A., et al., Quantification of intracellular amino acids in batch cultures of Saccharomyces
cerevisiae, Appl Microbiol Biotechnol, Vol. 56, 776–779 (2001)
Hans, M., A., et al., Free Intracellular Amino Acid Pools During Autonomous Oscillations in
Saccharomyces cerevisiae, Biotechnology and Bioengineering, Vol. 82, 143-151 (2003)
Hein, P., P., et al., RNA Transcript 3′-Proximal Sequence Affects Translocation Bias of RNA
Polymerase, Biochemistry, Vol. 50, 7002-7014 (2011)
Herbert, K., M., et al., Sequence-Resolved Detection of Pausing by Single RNA Polymerase Molecules,
Cell, Vol. 125, 1083–1094 (2006)
Herrera, R., et al., Subcellular potassium and sodium distribution in Saccharomyces cerevisiae wild-
type and vacuolar mutants, Biochem. J., Vol. 454, 525-532 (2013)
Holmes, S., F., Erie, D. A., Downstream DNA Sequence Effects on Transcription Elongation: Allosteric
Binding Of Nucleoside Triphosphates Facilitates Translocation Via A Ratchet Motion, J. Biol. Chem.,
Vol. 278, 35597-35608 (2003)
Holmes, S., F., et al., Kinetic Investigation of Escherichia coli RNA Polymerase Mutants That Influence
Nucleotide Discrimination and Transcription Fidelity, J. Biol. Chem., Vol. 281, 18677-18683 (2006)
Homeyer, N., et al., AMBER force-field parameters for phosphorylated amino acids in different
protonation states: phosphoserine, phosphothreonine, phosphotyrosine, and phosphohistidine, J Mol
Model, Vol. 12, 281-289 (2006)
Horn, H., W., et al., Development of an improved four-site water model for biomolecular simulations:
TIP4P-Ew, J. Chem. Phys., Vol. 120, 9665-9678 (2004)
Horn, H., W., et al., Characterization of the TIP4P-Ew water model: Vapor pressure and boiling point,
J. Chem. Phys., Vol. 123, 194504 (2005)
Horn, A., H., C., A consistent force field parameter set for zwitterionic amino acid residues, J Mol
Model, Vol. 20, 2478-2491 (2014)
Humphrey, W., et al., VMD-Visual Molecular Dynamics, J. Molec. Graphics, Vol. 14, 33-38 (1996)
Imashimizu, M., et al., Intrinsic Translocation Barrier as an Initial Step in Pausing by RNA Polymerase
II, J. Mol. Biol., Vol. 425, 697-712 (2013)
Jennings, M., L., Cui, J., Chloride homeostasis in Saccharomyces cerevisiae: high affinity influx, V-
ATPase-dependent sequestration, and identification of a candidate Cl− sensor, J. Gen. Physiol., Vol.
131, 379-391 (2008)
Jiang, Y., et al., Refined Dummy Atom Model of Mg2+ by Simple Parameter Screening Strategy with Revised
Experimental Solvation Free Energy, J. Chem. Inf. Model., Vol. 55, 2575-2586 (2015)
Kahm, M., et al., Potassium Starvation in Yeast: Mechanisms of Homeostasis Revealed by Mathematical
Modeling, PLoS Computational Biology, Vol. 8, e1002548 (2012)
Kahn, P., C., Defining the Axis of a Helix, Computers Chem., Vol. 13, 185-189 (1988)
Kaplan, C., D., et al., The RNA Polymerase II Trigger Loop Functions in Substrate Selection and Is
Directly Targeted by α-Amanitin, Molecular Cell, Vol. 30, 547-556 (2008)
Kaplan, C., D., et al., Dissection of Pol II Trigger Loop Function and Pol II Activity–Dependent Control
of Start Site Selection In Vivo, PLoS Genetics, Vol. 8, 1-17 (2012)
Kappel, K., et al., Accelerated molecular dynamics simulations of ligand binding to a muscarinic G-
protein-coupled receptor, Quarterly Reviews of Biophysics, Vol. 48, 479-487 (2015)
Kashkina, E., et al., Multisubunit RNA Polymerases Melt Only a Single DNA Base Pair Downstream
of the Active Site, J. Biol. Chem., Vol. 282, 21578-21582 (2007)
Kennedy, S., Erie, D., Templated nucleoside triphosphate binding to a noncatalytic site on RNA
polymerase regulates transcription, PNAS, Vol. 108, 6079-6084 (2011)
Kettenberger, H., et al., Architecture of the RNA Polymerase II-TFIIS Complex and Implications for
mRNA Cleavage, Cell, Vol. 114, 347–357 (2003)
Kettenberger, H., et al., Complete RNA Polymerase II Elongation Complex Structure and Its
Interactions with NTP and TFIIS, Molecular Cell, Vol. 16, 955–965 (2004)
Kettenberger, H., et al., Structure of an RNA polymerase II-RNA inhibitor complex elucidates
transcription regulation by noncoding RNAs, Nature Structural & Molecular Biology, Vol. 13, 44-48
(2006)
Kireeva, M., L., et al., Nature of the Nucleosomal Barrier to RNA Polymerase II, Molecular Cell, Vol.
18, 97-108 (2005)
Kireeva, M., L., et al., Transient Reversal of RNA Polymerase II Active Site Closing Controls Fidelity
of Transcription Elongation, Molecular Cell, Vol. 30, 557-566 (2008)
Kireeva, M., L., et al., Millisecond phase kinetic analysis of elongation catalyzed by human, yeast and
Escherichia coli RNA polymerase, Methods, Vol. 48, 333-345 (2009)
Kireeva, M., L., et al., Translocation by multi-subunit RNA polymerases, Biochimica et Biophysica
Acta, Vol. 1799, 389-401 (2010)
Kireeva, M., L., et al., Interaction of RNA Polymerase II Fork Loop 2 with Downstream Non-template
DNA Regulates Transcription Elongation, J. Biol. Chem., Vol. 286, 30898-30910 (2011)
Kireeva, M., L., et al., Molecular dynamics and mutational analysis of the catalytic and translocation
cycle of RNA polymerase, BMC Biophysics, Vol. 5, 11.1-11.18 (2012)
Kirkegaard, K., et al., Mapping of single-stranded regions in duplex DNA at the sequence level: Single-
strand-specific cytosine methylation in RNA polymerase-promoter complexes, Proc. Natl. Acad. Sci.,
Vol. 80, 2544-2548 (1983)
Kolacna, L., et al., New phenotypes of functional expression of the mKir2.1 channel in potassium efflux-
deficient Saccharomyces cerevisiae strains, Yeast, Vol. 22, 1315-1323 (2005)
Komissarova, N., Kashlev, M., RNA Polymerase Switches between Inactivated and Activated States By
Translocating Back and Forth along the DNA and the RNA, J. Biol. Chem., Vol. 272, 15329-15338
(1997A)
Komissarova, N., Kashlev, M., Transcriptional arrest: Escherichia coli RNA polymerase translocates
backward, leaving the 3’ end of the RNA intact and extruded, Proc. Natl. Acad. Sci., Vol. 94, 1755-
1760 (1997B)
Komuro, Y., et al., CHARMM Force-Fields with Modified Polyphosphate Parameters Allow Stable
Simulation of the ATP-Bound Structure of Ca2+-ATPase, J. Chem. Theory Comput., Vol. 10,
4133−4142 (2014)
Korzheva, N., et al., A Structural Model of Transcription Elongation, Science, Vol. 289, 619-625 (2000)
Kozlikova, B., et al., CAVER Analyst 1.0: graphic tool for interactive visualization and analysis of tunnels
and channels in protein structures, Bioinformatics, Vol. 30, 2684-2685 (2014)
Krepl, M., et al., Reference simulations of noncanonical nucleic acids with different chi variants of the
AMBER force field: Quadruplex DNA, quadruplex RNA, and Z-DNA, J. Chem. Theory Comp., Vol.
8, 2506–2520 (2012)
Krieger, E., et al., Increasing the precision of comparative models with YASARA NOVA-a
self-parameterizing force field, Proteins, Vol. 47, 393-402 (2002)
Kumar, P., Bansal, M., HELANAL-Plus: a web server for analysis of helix geometry in protein
structures, Journal of Biomolecular Structure and Dynamics, Vol. 30, 773-783 (2012)
Landick, R., NTP-entry routes in multi-subunit RNA polymerases, Trends in Biochemical Sciences,
Vol. 30, 651-654 (2005)
Lange, O., F., et al., Scrutinizing Molecular Mechanics Force Fields on the Submicrosecond Timescale
with NMR Data, Biophysical Journal, Vol. 99, 647-655 (2010)
Langelier, M.-F., et al., The highly conserved glutamic acid 791 of Rpb2 is involved in the binding of
NTP and Mg(B) in the active center of human RNA polymerase II, Nucleic Acids Research, Vol. 33,
2629–2639 (2005)
Larson, M., H., et al., Trigger loop dynamics mediate the balance between the transcriptional fidelity
and speed of RNA polymerase II, PNAS, Vol. 109, 6555-6560 (2012)
Le Grand, S., et al., SPFP: Speed without compromise—A mixed precision model for GPU accelerated
molecular dynamics simulations, Computer Physics Communications, Vol. 184, 374-380 (2013)
Lee, H., S., et al., QHELIX: A Computational Tool for the Improved Measurement of Inter-Helical
Angles in Proteins, Protein J, Vol. 56, 556-561 (2007)
Li, P., et al., Systematic Parameterization of Monovalent Ions Employing the Nonbonded Model, J.
Chem. Theory Comput., Vol. 11, 1645-1657 (2015)
Li, P., Merz Jr., K., M., Taking into Account the Ion-induced Dipole Interaction in the Nonbonded
Model of Ions, J Chem Theory Comput, Vol. 10, 289-297 (2014)
Lindert, S., et al., Dynamics and Calcium Association to the N-Terminal Regulatory Domain of Human
Cardiac Troponin C: A Multiscale Computational Study, J. Phys. Chem. B, Vol. 116, 8449-8459 (2012)
Lindert, S., et al., Accelerated Molecular Dynamics Simulations with the AMOEBA Polarizable Force
Field on Graphics Processing Units, J. Chem. Theory Comput, Vol. 9, 4684−4691 (2013)
Lindorff-Larsen, K., et al., Systematic Validation of Protein Force Fields against Experimental Data,
PLoS ONE, Vol. 7, e32131: 6 (2012)
Lu, X.-J., Olson, W., K., 3DNA: a software package for the analysis, rebuilding and visualization of
three-dimensional nucleic acid structures, Nucleic Acids Research, Vol. 31, 5108-5121 (2003)
Maathuis, F., J., M., Amtmann, A., K+ Nutrition and Na+ Toxicity: The Basis of Cellular K+/Na+ Ratios,
Annals of Botany, Vol. 84, 123-133 (1999)
Magdenoska, O., et al., Quantifying intracellular metabolites in yeast using a matrix with minimal
interference from naturally occurring analytes, Anal. Biochem., Vol. 487, 17-26 (2015)
Maier, J., et al., ff14SB: Improving the Accuracy of Protein Side Chain and Backbone Parameters from
ff99SB, J. Chem. Theory Comput., Vol. 11, 3696–3713 (2015)
Malagon, F., et al., Mutations in the Saccharomyces cerevisiae RPB1 Gene Conferring Hypersensitivity
to 6-Azauracil, Genetics, Vol. 172, 2201-2209 (2006)
Malinen, A., M., et al., Active site opening and closure control translocation of multisubunit RNA
polymerase, Nucleic Acids Research, Vol. 40, 7442-7451 (2012)
Maoileidigh, D., O., et al., A Unified Model of Transcription Elongation: What Have We Learned from
Single-Molecule Experiments?, Biophysical Journal, Vol. 100, 1-10 (2011)
Markwick, P., R., L., et al., Exploring Multiple Timescale Motions in Protein GB3 Using Accelerated
Molecular Dynamics and NMR Spectroscopy, J. Am. Chem. Soc., Vol. 129, 4724-4730 (2007)
Markwick, P., R., L., et al., Toward a Unified Representation of Protein Structural Dynamics in Solution,
J. Am. Chem. Soc., Vol. 131, 16968-16975 (2009)
Markwick, P., R., L., McCammon, J., A., Studying functional dynamics in bio-molecules using
accelerated molecular dynamics, Phys. Chem. Chem. Phys., Vol. 13, 20053-20065 (2011)
Martinez, P., Persson, B., L., Identification, cloning and characterization of a derepressible Na+-coupled
phosphate transporter in Saccharomyces cerevisiae, Mol. Gen. Genet., Vol. 258, 628-638 (1998)
Martinez-Rucobo, F., Cramer, P., Structural basis of transcription elongation, Biochimica et Biophysica
Acta, Vol. 1829, 9-19 (2013)
McLachlan, A., D., Gene Duplication in the Structural Evolution of Chymotrypsin, J. Mol. Biol., Vol.
128, 49-79 (1979)
Meagher, K., L., et al., Development of polyphosphate parameters for use with the AMBER force field,
Journal of Comp. Chemistry, Vol. 24, 1016-1025 (2003)
Meller, J., Molecular Dynamics, Encyclopedia of Life Sciences, Nature Publishing Group (2001)
Meyer, P., A., et al., Phasing RNA Polymerase II Using Intrinsically Bound Zn Atoms: An Updated
Structural Model, Structure, Vol. 14, 973-982 (2006)
Miao, Y., et al., General trends of dihedral conformational transitions in a globular protein, Proteins,
Vol. 84, 501-514 (2016)
Miropolskaya, N., et al., Interplay between the trigger loop and the F loop during RNA polymerase
catalysis, Nucleic Acids Research, Vol. 42, 544-552 (2014)
Montiel, V., Ramos, J., Intracellular Na+ and K+ distribution in Debaryomyces hansenii. Cloning and
expression in Saccharomyces cerevisiae of DhNHX1, FEMS Yeast Res, Vol. 7, 102-109 (2007)
Mukhopadhyay, J., et al., Antibacterial Peptide Microcin J25 Inhibits Transcription by Binding within
and Obstructing the RNA Polymerase Secondary Channel, Molecular Cell, Vol. 14, 739-751 (2004)
Naryshkina, T., et al., The Role of the Largest RNA Polymerase Subunit Lid Element in Preventing the
Formation of Extended RNA-DNA Hybrid, J. Mol. Biol., Vol. 361, 634-643 (2006)
Nedialkov, Y., A., et al., NTP-driven Translocation by Human RNA Polymerase II, J. Biol. Chem., Vol.
278, 18303-18312 (2003)
Nedialkov, Y., A., et al., RNA polymerase stalls in a post-translocated register and can hyper-
translocate, Transcription, Vol. 3, 260-269 (2012)
Nick McElhinny, S., A., et al., Abundant ribonucleotide incorporation into DNA by yeast replicative
polymerases, PNAS, Vol. 107, 4949-4954 (2010)
Nierman, W., C., Chamberlin, M. J., The Effect of Low Substrate Concentrations on the Extent of
Productive RNA Chain Initiation from T7 Promoters A1 and A2 by Escherichia coli RNA Polymerase,
The Journal of Biological Chemistry, Vol. 255, 4495-4500 (1980)
Nudler, E., et al., The RNA–DNA Hybrid Maintains the Register of Transcription by Preventing
Backtracking of RNA Polymerase, Cell, Vol. 89, 33-41 (1997)
Nudler, E., RNA Polymerase Active Center: The Molecular Engine of Transcription, Annu. Rev.
Biochem., Vol. 78, 335-361 (2009)
Nudler, E., RNA Polymerase Backtracking in Gene Regulation and Genome Instability, Cell, Vol. 149,
1438-1443 (2012)
Olz, R., et al., Energy Flux and Osmoregulation of Saccharomyces cerevisiae Grown in Chemostats
under NaCl Stress, Journal of Bacteriology, Vol. 175, 2205-2213 (1993)
Oster, G., Darwin’s motors, Nature, Vol. 417, p.25 (2002)
Palangat, M., Landick, R., Roles of RNA:DNA Hybrid Stability, RNA Structure, and Active Site
Conformation in Pausing by Human RNA Polymerase II, J. Mol. Biol., Vol. 311, 265-282 (2001)
Pande, V., S., Eastman, P., OpenMM: A Hardware-Independent Framework for Molecular Simulations,
Computing in Science & Engineering, Vol. 12, 34-39 (2010)
Panteva, M., T., et al., Comparison of Structural, Thermodynamic, Kinetic and Mass Transport
Properties of Mg2+ Ion Models Commonly used in Biomolecular Simulations, Journal of Computational
Chemistry, Vol. 36, 970-982 (2015A)
Panteva, M., T., et al., Force Field for Mg2+, Mn2+, Zn2+, and Cd2+ Ions That Have Balanced Interactions
with Nucleic Acids, J. Phys. Chem., Vol. 119, 15460-15470 (2015B)
Pavelka, A., et al., CAVER: Algorithms for Analyzing Dynamics of Tunnels in Macromolecules,
Transactions on Computational Biology and Bioinformatics, Vol. 13, 505-517 (2016)
Pellegrini-Calace, P., et al., PoreWalker: A Novel Tool for the Identification and Characterization of
Channels in Transmembrane Proteins from Their Three-Dimensional Structure, PLoS Computational
Biology, Vol. 5, e1000440 (2009)
Perez, A., et al., Refinement of the AMBER Force Field for Nucleic Acids: Improving the Description
of alpha/gamma Conformers, Biophys. J., Vol. 92, 3817-3829 (2007)
Perez-Villa, A., et al., ATP dependent NS3 helicase interaction with RNA: insights from molecular
simulations, Nucleic Acids Research, Vol. 43, 1-10 (2015)
Piana, S., et al., Assessing the accuracy of physical models used in protein-folding simulations:
quantitative evidence from long molecular dynamics simulations, Current Opinion in Structural
Biology, Vol. 24, 98-105 (2014)
Pierce, L., C., T., et al., Routine Access to Millisecond Time Scale Events with Accelerated Molecular
Dynamics, J. Chem. Theory Comput., Vol. 8, 2997-3002 (2012)
Ramos, J., et al., Yeast Membrane Transport, Advances in Experimental Medicine and Biology, ISBN
978-3-319-25304-6, p. 206 (2016)
Rodriguez-Navarro, A., Potassium transport in fungi and plants, Biochimica et Biophysica Acta, Vol.
1469, 1-30 (2000)
Romani, A., Scarpa, A., Regulation of Cell Magnesium, Archives of Biochemistry and Biophysics, Vol.
298, 1-12 (1992)
Saeki, H., Svejstrup, J., Q., Stability, Flexibility, and Dynamic Interactions of Colliding RNA
Polymerase II Elongation Complexes, Molecular Cell, Vol. 35, 191-205 (2009)
Santangelo, T., J., Roberts, J., W., Forward Translocation Is the Natural Pathway of RNA Release at an
Intrinsic Terminator, Molecular Cell, Vol. 14, 117-126 (2004)
Semenova, E., et al., Structure-Activity Analysis of Microcin J25: Distinct Parts of the Threaded Lasso
Molecule Are Responsible for Interaction with Bacterial RNA Polymerase, J. Bacteriol., Vol. 187, 3859-
3863 (2005)
Shaevitz, J., W., et al., Backtracking by single RNA polymerase molecules observed at near-base-pair
resolution, Nature, Vol. 426, 684-687 (2003)
Shi, Y., et al., Polarizable Atomic Multipole-Based AMOEBA Force Field for Proteins, J. Chem. Theory
Comput., Vol. 9, 4046-4063 (2013)
Sigel, H., Griesser, R., Nucleoside 5’-triphosphates: self-association, acid–base, and metal ion-binding
properties in solution, Chem. Soc. Rev., Vol. 34, 875-900 (2005)
Silva, D.-A., et al., Millisecond dynamics of RNA polymerase II translocation at atomic resolution,
PNAS, 1-6 (2014)
Sims III, R., J., et al., Elongation by RNA polymerase II: the short and long of it, Genes Dev., Vol. 18,
2437-2468 (2004)
Song, J., et al., Functional Loop Dynamics of the Streptavidin-Biotin Complex, Scientific Reports, Vol.
5, 7906: 10 (2015)
Sosunov, V., et al., Unified two-metal mechanism of RNA synthesis and degradation by RNA
polymerase, The EMBO Journal, Vol. 22, 2234-2244 (2003)
Stano, N., M., et al., The +2 NTP Binding Drives Open Complex Formation in T7 RNA Polymerase, J.
Biol. Chem., Vol. 277, 37292-37300 (2002)
Steinbrecher, T., et al., Revised AMBER parameters for bioorganic phosphates, J Chem Theory
Comput., Vol. 8, 4405-4412 (2012)
Steitz, T., A mechanism for all polymerases, Nature, Vol. 391, 231-232 (1998)
Sunder, S., et al., Regulation of intracellular level of Na+, K+ and glycerol in Saccharomyces cerevisiae
under osmotic stress, Molecular and Cellular Biochemistry, Vol. 158, 121-124 (1996)
Svetlov, V., et al., Discrimination against Deoxyribonucleotide Substrates by Bacterial RNA Polymerase,
J. Biol. Chem., Vol. 279, 38087-38090 (2004)
Swaminathan, R., Magnesium Metabolism and its Disorders, Clin Biochem Rev, Vol. 24, 47-66 (2003)
Sychrova, H., Yeast as a Model Organism to Study Transport and Homeostasis of Alkali Metal Cations,
Physiol. Res., Vol. 53, S91-S98 (2004)
Sydow, J. F., Cramer, P., RNA polymerase fidelity and transcriptional proofreading, Current Opinion
in Structural Biology, Vol. 19, 732-739 (2009A)
Sydow, J. F., et al., Structural Basis of Transcription: Mismatch-Specific Fidelity Mechanisms and
Paused RNA Polymerase II with Frayed RNA, Molecular Cell, Vol. 34, 710-721 (2009B)
Tadigotla, V., R., et al., Thermodynamic and kinetic modeling of transcriptional pausing, PNAS, Vol.
103, 4439-4444 (2006)
Tahirov, T., H., et al., Structure of a T7 RNA polymerase elongation complex at 2.9 Å resolution, Nature,
Vol. 420, 43-50 (2002)
Tan, L., et al., Bridge helix and trigger loop perturbations generate superactive RNA polymerases,
Journal of Biology, Vol. 7, 40.1-40.15 (2008)
Temiakov, D., et al., Structural Basis for Substrate Selection by T7 RNA Polymerase, Cell, Vol. 116,
381-391 (2004)
Temiakov, D., et al., Structural Basis of Transcription Inhibition by Antibiotic Streptolydigin, Molecular
Cell, Vol. 19, 655-666 (2005)
Theobald, U., et al., Determination of In-vivo Cytoplasmic Orthophosphate Concentration in Yeast,
Biotechnology Techniques, Vol. 10, 297-302 (1996)
Theobald, U., et al., In Vivo Analysis of Metabolic Dynamics in Saccharomyces cerevisiae: I.
Experimental Observations, Biotechnology and Bioengineering, Vol. 55, 305-316 (1997)
Tikhonova, I., G., et al., Simulations of Biased Agonists in the β2 Adrenergic Receptor with Accelerated
Molecular Dynamics, Biochemistry, Vol. 52, 5593-5603 (2013)
Toulokhonov, I., et al., A Central Role of the RNA Polymerase Trigger Loop in Active-Site
Rearrangement during Transcriptional Pausing, Molecular Cell, Vol. 27, 406-419 (2007)
Traut, T., W., Physiological concentrations of purines and pyrimidines, Molecular and Cellular
Biochemistry, Vol. 140, 1-22 (1994)
van Eunen, K., Bakker, B., M., The importance and challenges of in vivo-like enzyme kinetics,
Perspectives in Science, Vol. 1, 126-130 (2014)
van Eunen, K., et al., Measuring enzyme activities under standardized in vivo-like conditions for
systems biology, FEBS Journal, Vol. 277, 749-760 (2010)
Vassylyev, D., G., et al., Crystal structure of a bacterial RNA polymerase holoenzyme at 2.6 Å
resolution, Nature, Vol. 417, 712-719 (2002)
Vassylyev, D., G., et al., Structural basis for transcription elongation by bacterial RNA polymerase,
Nature, Vol. 448, 157-164 (2007A)
Vassylyev, D., G., et al., Structural basis for substrate loading in bacterial RNA polymerase, Nature,
Vol. 448, 163-169 (2007B)
Vassylyev, D., G., Elongation by RNA polymerase: a race through roadblocks, Current Opinion in
Structural Biology, Vol. 19, 691-700 (2009)
Volkov, V., Quantitative description of ion transport via plasma membrane of yeast and small cells,
Front. Plant Sci., Vol. 6, art. 425 (2015)
Wang, H.-Y., et al., Force Generation in RNA Polymerase, Biophysical Journal, Vol. 74, 1186-1202
(1998)
Wang, J., et al., How Well Does a Restrained Electrostatic Potential (RESP) Model Perform in
Calculating Conformational Energies of Organic and Biological Molecules?, Journal of Computational
Chemistry, Vol. 21, 1049-1074 (2000)
Wang, H., Oster, G., Ratchets, power strokes, and molecular motors, Appl. Phys. A, Vol. 75, 315-323
(2002)
Wang, D., et al., Structural basis of transcription: role of the trigger loop in substrate specificity and
catalysis, Cell, Vol. 127, 941-954 (2006)
Wang, D., et al., Structural Basis of Transcription: Backtracked RNA Polymerase II at 3.4 Angstrom
Resolution, Science, Vol. 324, 1203-1206 (2009)
Wang, Y., et al., Enhanced Lipid Diffusion and Mixing in Accelerated Molecular Dynamics, J. Chem.
Theory Comput., Vol. 7, 3199-3207 (2011A)
Wang, Y., et al., Implementation of accelerated molecular dynamics in NAMD, Computational Science
& Discovery, Vol. 4, 015002: 10 (2011B)
Wang, B., et al., Computational Simulation Strategies for Analysis of Multisubunit RNA Polymerases,
Chem. Rev., Vol. 113, 8546-8566 (2013)
Weinzierl, R., O., J., Nanomechanical constraints acting on the catalytic site of cellular RNA
polymerases, Biochem. Soc. Trans., Vol. 38, 428-432 (2010A)
Weinzierl, R., O., J., The nucleotide addition cycle of RNA polymerase is controlled by two molecular
hinges in the Bridge Helix domain, BMC Biology, Vol. 8, 134.1-134.15 (2010B)
Weinzierl, R., O., J., The Bridge Helix of RNA Polymerase Acts as a Central Nanomechanical
Switchboard for Coordinating Catalysis and Substrate Movement, Archaea, Vol. 2011, 608385.1-
608385.7 (2011)
Weixlbaumer, A., et al., Structural Basis of Transcriptional Pausing in Bacteria, Cell, Vol. 152, 431-441
(2013)
Westover, K., D., et al., Structural Basis of Transcription: Nucleotide Selection by Rotation in the RNA
Polymerase II Active Center, Cell, Vol. 119, 481-489 (2004A)
Westover, K., D., et al., Structural Basis of Transcription: Separation of RNA from DNA by RNA
Polymerase II, Science, Vol. 303, 1014-1016 (2004B)
Woo, H.-J., et al., Molecular dynamics studies of the energetics of translocation in model T7 RNA
polymerase elongation complexes, Proteins, Vol. 73, 1021-1036 (2008)
Xie, P., A dynamic model for processive transcription elongation and backtracking long pauses by
multisubunit RNA polymerases, Proteins, Vol. 80, 2020–2034 (2012)
Xiong, Y., Burton, Z., A Tunable Ratchet Driving Human RNA Polymerase II Translocation Adjusted
by Accurately Templated Nucleoside Triphosphates Loaded at Downstream Sites and by Elongation
Factors, The Journal of Biological Chemistry, Vol. 282, 36582-36592 (2007)
Yaffe, E., et al., MolAxis: a server for identification of channels in macromolecules, Nucleic Acids
Research, Vol. 36, W210-W215 (2008)
Yu, J., Oster, G., A Small Post-Translocation Energy Bias Aids Nucleotide Selection in T7 RNA
Polymerase Transcription, Biophysical Journal, Vol. 102, 532-541 (2012)
Yuzenkova, Y., et al., Stepwise mechanism for transcription fidelity, BMC Biology, Vol. 8, art. 54
(2010)
Zaychikov, A., et al., Translocation of the Escherichia coli transcription complex observed in the
registers 11 to 20: "Jumping" of RNA polymerase and asymmetric expansion and contraction of the
"transcription bubble", PNAS, Vol. 92, 1739-1743 (1995)
Zenkin, N., et al., Transcript-Assisted Transcriptional Proofreading, Science, Vol. 313, 518-520 (2006)
Zgarbova, M., et al., Refinement of the Cornell et al. Nucleic Acids Force Field Based on Reference
Quantum Chemical Calculations of Glycosidic Torsion Profiles, J. Chem. Theory Comput., Vol. 7,
2886-2902 (2011)
Zgarbova, M., et al., Toward improved description of DNA backbone: Revisiting epsilon and zeta torsion
force field parameters, J. Chem. Theory Comput., Vol. 9, 2339-2354 (2013)
Zgarbova, M., et al., Refinement of the Sugar-Phosphate Backbone Torsion Beta for AMBER Force
Fields Improves the Description of Z- and B-DNA, J. Chem. Theor. and Comp., Vol. 12, 5723-5736
(2015)
Zhang, G., et al., Crystal Structure of Thermus aquaticus Core RNA Polymerase at 3.3 Å Resolution, Cell,
Vol. 98, 811-824 (1999)
Zhang, C., et al., Combinatorial Control of Human RNA Polymerase II (RNAP II) Pausing and
Transcript Cleavage by Transcription Factor IIF, Hepatitis δ Antigen, and Stimulatory Factor II, J. Biol.
Chem., Vol. 278, 50101-50111 (2003)
Zhang, C., Burton, Z., Transcription Factors IIF and IIS and Nucleoside Triphosphate Substrates as
Dynamic Probes of the Human RNA Polymerase II Mechanism, J. Mol. Biol., Vol. 342, 1085-1099
(2004)
Zhang, J., et al., Role of the RNA polymerase trigger loop in catalysis and pausing, Nature Structural &
Molecular Biology, Vol. 17, 99-105 (2010)
Zhang, Y., et al., Structural Basis of Transcription Initiation, Science, Vol. 338, 1076-1080 (2012)
Zhang, L., et al., Structural Model of RNA Polymerase II Elongation Complex with Complete
Transcription Bubble Reveals NTP Entry Routes, PLOS Computational Biology, Vol. 11, e1004354
(2015A)
Zhang, J., et al., A Fast Sensor for in Vivo Quantification of Cytosolic Phosphate in Saccharomyces
cerevisiae, Biotechnology and Bioengineering, Vol. 112, 1033-1046 (2015B)
Appendix 1: aMD simulation procedure
use File::Slurp;
use Math::Round;
use strict;
use autodie;
use warnings qw(all);
use Statistics::Descriptive;
use List::Util qw( min max );

$ENV{PYTHONPATH} = "/home/ng/amber16/lib/python2.7/site-packages";
$ENV{OPENMM_CUDA_COMPILER} = "/usr/local/cuda-8.0/bin/nvcc";
$ENV{LD_LIBRARY_PATH} = "/usr/local/cuda-7.5/lib64:/lib";
$ENV{PATH} = "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin";
$ENV{AMBERHOME} = "/home/ng/amber16";

my $dna1;
my $dna2;
my $dna3;
my $dna4;
my $dna5;
my $dna6;
my $dna7;
my $dna8;

###############################################################################
##Preliminary notice
###############################################################################
##We start with 2e2h structure,
##with pdb file cleaned up
##i.e. only keep ATOM, HETATM, TER and END lines,
##with extended nucleic acid frame,
##with gtp in A site removed,
##with missing loops added,
##and with C and NTER added,
##the structure is also pre-minimized (see chapter 2)
###############################################################################
##END Preliminary notice
###############################################################################

###############################################################################
##Execute first dummy leap run
###############################################################################
#Note:
#Execute first Leap run on initial structure (without metabolites), only hydrogenize and solvate
#generate hydrogenated structure>struct-ini-hydro.pdb
my $outfile="leap-1.scrpt";
open (FILE2, "> $outfile") || die;
print (FILE2 "source leaprc.protein.ff14SB\n");
print (FILE2 "source leaprc.DNA.OL15\n");
print (FILE2 "source leaprc.RNA.OL3\n");
print (FILE2 "loadoff atomic_ions.lib\n");
print (FILE2 "loadamberparams frcmod.ions1lm_1264_tip4pew\n");
print (FILE2 "loadamberparams frcmod.ions234lm_1264_tip4pew\n");
print (FILE2 "loadoff solvents.lib\n");
print (FILE2 "loadamberparams frcmod.tip4pew\n");
print (FILE2 "sys = loadpdb 2e2h-pre-minimized.pdb\n");
print (FILE2 "saveamberparm sys 2e2h-pre-minimized.prmtop 2e2h-pre-minimized.inpcrd\n");
print (FILE2 "savepdb sys out-leap1-1.pdb\n");
print (FILE2 "solvatebox sys TIP4PEWBOX 15.0\n");
print (FILE2 "saveamberparm sys 2e2h-pre-minimized-solv.prmtop 2e2h-pre-minimized-solv.inpcrd\n");
print (FILE2 "savepdb sys out-leap1-2.pdb\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "/home/ng/amber16/bin/xleap -s -f leap-1.scrpt > out-leap1.out";
system($cmd);
#keep file:
rename "leap.log", "leap-1.log";
###############################################################################
##END Execute first leap run
###############################################################################

###############################################################################
##Extract number of protein residues and extract T and N strand anchors
###############################################################################
my @pdb_input = read_file("out-leap1-1.pdb") or die;
my $array_size=scalar @pdb_input;
############## EXTRACT DNA anchors and protein atom index range
my $down_dna_anchor;
my $up_dna_anchor;
my $down=0;
my $up=0;
my $y=2;
my $down_dna_anchor_first_id;
my $down_dna_anchor_last_id;
my $up_dna_anchor_first_id;
my $up_dna_anchor_last_id;
my $down_dna_anchor_first_id_chain1;
my $down_dna_anchor_last_id_chain1;
my $down_dna_anchor_first_id_chain2;
my $down_dna_anchor_last_id_chain2;
my $up_dna_anchor_first_id_chain1;
my $up_dna_anchor_last_id_chain1;
my $up_dna_anchor_first_id_chain2;
my $up_dna_anchor_last_id_chain2;
my $last_protein_id;
my $last_protein_res_id;
my $AA_prec = 0;
my $AA_foll = 0;
for (my $count=0; $count<$array_size; $count++) {
    my $line= @pdb_input[$count];
    my $trigger_TER=substr($line, 0, 3);
    ############ Detect TER ##################
    if ($trigger_TER eq "TER") {
        #EXTRACT DNA ANCHORS:
        my $line_prec= @pdb_input[$count-1];
        my $RP=substr($line_prec, 17, 2);
        my $RP_alt=substr($line_prec, 18, 2);
        my $line_foll= @pdb_input[$count+1];
        my $RF=substr($line_foll, 17, 2);
        my $RF_alt=substr($line_foll, 18, 2);
        if (($RF eq "DA") or ($RF eq "DG") or ($RF eq "DC") or ($RF eq "DT")){
            $down_dna_anchor = "on";
        }
        if (($RF_alt eq "DA") or ($RF_alt eq "DG") or ($RF_alt eq "DC") or ($RF_alt eq "DT")){
            $down_dna_anchor = "on";
        }
        if (($RP eq "DA") or ($RP eq "DG") or ($RP eq "DC") or ($RP eq "DT")){
            $up_dna_anchor = "on";
        }
        if (($RP_alt eq "DA") or ($RP_alt eq "DG") or ($RP_alt eq "DC") or ($RP_alt eq "DT")){
            $up_dna_anchor = "on";
        }
        ## EXTRACT down_dna_anchor
        if ($down_dna_anchor eq "on") {
            ##Go forwards on residue length
            my $line_DNA_segment_start=@pdb_input[$count+1];
            #get first atom index (for first chain x=1, for second chain x=2)
            $down_dna_anchor_first_id = substr($line_DNA_segment_start, 6, 5);
            my $resid_DNA_segment_start=substr($line_DNA_segment_start, 22, 4);
            for (my $c=1; $c<$y; $c++) {
                my $line_DNA_segment= @pdb_input[$count+$c];
                my $resid_DNA_segment=substr($line_DNA_segment, 22, 4);
                if ($resid_DNA_segment==$resid_DNA_segment_start) {
                    $y++;
                    $down_dna_anchor_last_id=substr($line_DNA_segment, 6, 5);
                }
            }
            $down_dna_anchor = "off";
            $down++;
        }
        ## END EXTRACT down_dna_anchor
        ## EXTRACT up_dna_anchor
        if ($up_dna_anchor eq "on") {
            ##Go forwards on residue length
            my $line_DNA_segment_start=@pdb_input[$count-1];
            #get first atom index (for first chain x=1, for second chain x=2)
            $up_dna_anchor_last_id = substr($line_DNA_segment_start, 6, 5);
            my $resid_DNA_segment_start=substr($line_DNA_segment_start, 22, 4);
            for (my $c=1; $c<$y; $c++) {
                my $line_DNA_segment= @pdb_input[$count-$c];
                my $resid_DNA_segment=substr($line_DNA_segment, 22, 4);
                if ($resid_DNA_segment==$resid_DNA_segment_start) {
                    $y++;
                    $up_dna_anchor_first_id=substr($line_DNA_segment, 6, 5);
                }
            }
            $up_dna_anchor = "off";
            $up++;
        }
        ## END EXTRACT up_dna_anchor
        #EXTRACT PROTEIN atom index range:
        #if the residue preceding the TER is an AA,
        #but the next residue is not an AA
        #then we have reached the end of the protein atoms
        my $RP=substr($line_prec, 17, 3);
        my $RF=substr($line_foll, 17, 3);
        if (($RP eq "ALA") or ($RP eq "ARG") or ($RP eq "ASH") or ($RP eq "ASN")
            or ($RP eq "ASP") or ($RP eq "CYM") or ($RP eq "CYS") or ($RP eq "CYX")
            or ($RP eq "GLN") or ($RP eq "GLU") or ($RP eq "GLY") or ($RP eq "HID")
            or ($RP eq "HIE") or ($RP eq "HIP") or ($RP eq "HYP") or ($RP eq "ILE")
            or ($RP eq "LEU") or ($RP eq "LYN") or ($RP eq "LYS") or ($RP eq "MET")
            or ($RP eq "PHE") or ($RP eq "PRO") or ($RP eq "SER") or ($RP eq "THR")
            or ($RP eq "THR") or ($RP eq "TRP") or ($RP eq "TRP") or ($RP eq "TYR")
            or ($RP eq "VAL")){
            $AA_prec = 1;
        }
        if (($RF eq "ALA") or ($RF eq "ARG") or ($RF eq "ASH") or ($RF eq "ASN")
            or ($RF eq "ASP") or ($RF eq "CYM") or ($RF eq "CYS") or ($RF eq "CYX")
            or ($RF eq "GLN") or ($RF eq "GLU") or ($RF eq "GLY") or ($RF eq "HID")
            or ($RF eq "HIE") or ($RF eq "HIP") or ($RF eq "HYP") or ($RF eq "ILE")
            or ($RF eq "LEU") or ($RF eq "LYN") or ($RF eq "LYS") or ($RF eq "MET")
            or ($RF eq "PHE") or ($RF eq "PRO") or ($RF eq "SER") or ($RF eq "THR")
            or ($RF eq "THR") or ($RF eq "TRP") or ($RF eq "TRP") or ($RF eq "TYR")
            or ($RF eq "VAL")){
            $AA_foll = 1;
        }
        if (($AA_prec == 1) and ($AA_foll == 0)){
            $last_protein_id = substr($line_prec, 6, 5);
            $last_protein_id = $last_protein_id - 1;
            $last_protein_res_id = substr($line_prec, 22, 4);
        }
        $AA_prec = 0;
        $AA_foll = 0;
        #END EXTRACT PROTEIN atom index range
    }
    ############ END Detect TER ##################
    if ($down == 1) {
        $down_dna_anchor_first_id_chain1 = $down_dna_anchor_first_id-1;
        $down_dna_anchor_last_id_chain1 = $down_dna_anchor_last_id-1;
    }
    if ($down == 2) {
        $down_dna_anchor_first_id_chain2 = $down_dna_anchor_first_id-1;
        $down_dna_anchor_last_id_chain2 = $down_dna_anchor_last_id-1;
    }
    if ($up == 1) {
        $up_dna_anchor_first_id_chain1 = $up_dna_anchor_first_id-1;
        $up_dna_anchor_last_id_chain1 = $up_dna_anchor_last_id-1;
    }
    if ($up == 2) {
        $up_dna_anchor_first_id_chain2 = $up_dna_anchor_first_id-1;
        $up_dna_anchor_last_id_chain2 = $up_dna_anchor_last_id-1;
    }
}
############## END of line loop and END EXTRACT DNA anchors
print "chain1_dna_anchors are: $down_dna_anchor_first_id_chain1 to $down_dna_anchor_last_id_chain1 $up_dna_anchor_first_id_chain1 to $up_dna_anchor_last_id_chain1\n";
print "chain2_dna_anchors are: $down_dna_anchor_first_id_chain2 to $down_dna_anchor_last_id_chain2 $up_dna_anchor_first_id_chain2 to $up_dna_anchor_last_id_chain2\n";
print "protein index range is: 1 to $last_protein_id\n";
print "protein resid range is: 1 to $last_protein_res_id\n";
###############################################################################
##END Extract number of protein residues and extract T and N strand anchors
###############################################################################
################################################################################
##Extract number of water molecules
################################################################################
my @pdb_input = read_file("leap-1.log") or die;
my $array_size=scalar @pdb_input;
my $wat;
my $trigger_solvate;
my $trigger_wat=0;
my $trigger_wat_line;
my @line_handle;
for (my $count=0; $count<$array_size; $count++) {
    $trigger_solvate=substr(@pdb_input[$count], 0, 9);
    $trigger_wat_line=substr(@pdb_input[$count], 2, 5);
    if ($trigger_solvate eq "> solvate") {
        $trigger_wat=1;
    }
    if (($trigger_wat_line eq "Added") and ($trigger_wat == 1)){
        @line_handle = split ( /\s+/, @pdb_input[$count] );
        $wat = @line_handle[2];
        $trigger_wat = 0;
    }
}
print "\nwat is *$wat*\n";
################################################################################
##END Extract number of water molecules
################################################################################
################################################################################
##Extract water box size
################################################################################
#Get water dims:
my $outfile="scr-box.vmd";
open (FILE2, "> $outfile") || die;
print (FILE2 "set outFile out-box.txt\n");
print (FILE2 "set out [open \$outFile w]\n");
print (FILE2 "set mol [mol new out-leap1-2.pdb type pdb]\n");
print (FILE2 "set sel [atomselect top water]\n");
print (FILE2 "set minmax [measure minmax \$sel]\n");
print (FILE2 "set b [split \$minmax { }]\n");
print (FILE2 "set xmin [lindex \$b 0]\n");
print (FILE2 "set xmin [string trim \$xmin \"{\"]\n");
print (FILE2 "set ymin [lindex \$b 1]\n");
print (FILE2 "set zmin [lindex \$b 2]\n");
print (FILE2 "set zmin [string trim \$zmin \"}\"]\n");
print (FILE2 "set xmax [lindex \$b 3]\n");
print (FILE2 "set xmax [string trim \$xmax \"{\"]\n");
print (FILE2 "set ymax [lindex \$b 4]\n");
print (FILE2 "set zmax [lindex \$b 5]\n");
print (FILE2 "set zmax [string trim \$zmax \"}\"]\n");
print (FILE2 "set xdim [expr \$xmax - \$xmin]\n");
print (FILE2 "set ydim [expr \$ymax - \$ymin]\n");
print (FILE2 "set zdim [expr \$zmax - \$zmin]\n");
print (FILE2 "puts \$out \"\$xdim\"\n");
print (FILE2 "puts \$out \"\$ydim\"\n");
print (FILE2 "puts \$out \"\$zdim\"\n");
print (FILE2 "exit\n");
close (FILE2);
my $cmd = "vmd -dispdev text -nt -e scr-box.vmd";
system($cmd);
unlink "scr-box.vmd";
################################################################################
##END Extract water box size
################################################################################
################################################################################
##Read box size
################################################################################
my $count=0;
my $x_box;
my $y_box;
my $z_box;
my @pdb_input_ini = read_file("out-box.txt") or die;
$x_box= @pdb_input_ini[0];
$x_box =~ s/^\s+|\s+$//g;
$y_box= @pdb_input_ini[1];
$y_box =~ s/^\s+|\s+$//g;
$z_box= @pdb_input_ini[2];
$z_box =~ s/^\s+|\s+$//g;
$x_box= sprintf "%.3f", $x_box;
$y_box= sprintf "%.3f", $y_box;
$z_box= sprintf "%.3f", $z_box;
print "x_box is *$x_box*\n";
print "y_box is *$y_box*\n";
print "z_box is *$z_box*\n";
################################################################################
##END Read box size
################################################################################
################################################################################
##AMEND pdb, with water box size, and OXT atoms removed
################################################################################
#Note:
#With box size, Prepare AddToBox ready pdb file out-0.pdb >addtobox-ready-struct-ini-hydro.pdb,
#and remove OXT atoms from pdb file (for second leap run)
my $x_pdb;
my $y_pdb;
my $z_pdb;
my $size_x= length($x_box);
if ($size_x == 6){
    $x_pdb= " ". $x_box;
}
if ($size_x == 7){
    $x_pdb= $x_box;
}
my $size_y= length($y_box);
if ($size_y == 6){
    $y_pdb= " ". $y_box;
}
if ($size_y == 7){
    $y_pdb= $y_box;
}
my $size_z= length($z_box);
if ($size_z == 6){
    $z_pdb= " ". $z_box;
}
if ($size_z == 7){
    $z_pdb= $z_box;
}
my $sel_total = "CRYST1 " . $x_pdb . " " . $y_pdb . " " . $z_pdb . " 90.00 90.00 90.00 1" . "\n";
my @pdb_input = read_file("out-leap1-1.pdb") or die;
my @pdb_output;
my $array_size=scalar @pdb_input;
my $outfile= "out-ini.pdb";
my $count_update_output=0;
@pdb_output[0] = "$sel_total";
########## LINE LOOP
for (my $count=1; $count<$array_size; $count++) {
    my $line= @pdb_input[$count];
    my $atom=substr($line, 13, 3);
    if ($atom ne "OXT"){
        @pdb_output[$count+$count_update_output] = "$line";
    }
    if ($atom eq "OXT"){
        $count_update_output--;
    }
}
########## END of LINE LOOP
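#Sketch (not part of the original script): the CRYST1 record is fixed-width in
#the PDB format, so padding the box lengths by hand only works for values of
#6 or 7 characters. A minimal alternative, assuming an orthorhombic box, is to
#let sprintf lay out the columns. The helper name cryst1_line, the space group
#"P 1" and Z = 1 are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: build a PDB CRYST1 record for an orthorhombic box.
# Field widths follow the PDB format: a, b, c as %9.3f; alpha, beta, gamma
# as %7.2f; one blank; space group left-justified in 11 columns; Z in 4.
sub cryst1_line {
    my ($x, $y, $z) = @_;
    return sprintf("CRYST1%9.3f%9.3f%9.3f%7.2f%7.2f%7.2f %-11s%4d\n",
                   $x, $y, $z, 90, 90, 90, "P 1", 1);
}

# Example call with illustrative box lengths (in angstroms).
print cryst1_line(98.123, 101.5, 110);
```

#The same call handles box lengths below 10 or above 999.999 angstroms, which
#the length tests on 6 and 7 characters would leave undefined.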
# print to file
open (FILE, "> $outfile");
for (my $count_2=0; $count_2<$array_size+$count_update_output; $count_2++) {
    print (FILE"@pdb_output[$count_2]");
}
close (FILE);
################################################################################
##END AMEND pdb, with water box size, and OXT atoms removed
################################################################################
################################################################################
##CALCULATE metabolite amounts
################################################################################
#Note:
#Calculate number of K+, Na+, glu, phos, mg, sul, mg2+, ca2+, (and gtps for later) needed
#0.5 mM Ca2+:
my $nb_Ca=($wat/55)*0.0005;
$nb_Ca=nearest (1, $nb_Ca);
#2 mM Mg2+:
my $nb_Mg=($wat/55)*0.002;
$nb_Mg=nearest (1, $nb_Mg);
#5 mM sulfate (SO4 2-):
my $nb_S=($wat/55)*0.005;
$nb_S=nearest (1, $nb_S);
#20 mM Na+:
my $nb_Na=($wat/55)*0.02;
$nb_Na=nearest (1, $nb_Na);
#2 mM Lys:
my $nb_ZK=($wat/55)*0.002;
$nb_ZK=nearest (1, $nb_ZK);
#2.5 mM His:
my $nb_ZHE=($wat/55)*0.0025;
$nb_ZHE=nearest (1, $nb_ZHE);
#6 mM Arg:
my $nb_ZR=($wat/55)*0.006;
$nb_ZR=nearest (1, $nb_ZR);
#8.5 mM Asp:
my $nb_ZD=($wat/55)*0.0085;
$nb_ZD=nearest (1, $nb_ZD);
#80 mM Glu:
my $nb_ZE=($wat/55)*0.08;
$nb_ZE=nearest (1, $nb_ZE);
#300 mM K+:
my $nb_K=($wat/55)*0.3;
$nb_K=nearest (1, $nb_K);
#And calculations for phase two (later),
#when the gtps are added in a metabolite
#relaxed solvent bath:
#5.9 mM NTPs:
my $nb_gtp=($wat/55)*0.0059;
$nb_gtp=nearest (1, $nb_gtp);
#number of Cl- to be removed later:
my $del_Cl=$nb_gtp*2;
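#Sketch (not part of the original script): every count above follows the same
#rule. With bulk water taken as 55 mol/L, a box of N_wat waters has a volume of
#about N_wat/55 "litre-equivalents" per mole, so a solute at concentration c
#(mol/L) needs N = (N_wat/55) * c molecules, rounded to the nearest integer.
#The helper name count_for_conc is hypothetical, and int($n + 0.5) stands in
#for nearest() here so that the sketch needs no extra module.

```perl
use strict;
use warnings;

# Hypothetical helper: number of solute molecules needed to reach a target
# molar concentration in a box containing $n_water water molecules.
sub count_for_conc {
    my ($n_water, $molar) = @_;
    my $n = ($n_water / 55) * $molar;   # 55 mol/L for bulk water, as above
    return int($n + 0.5);               # round to nearest; assumes $n >= 0
}

# Example with an illustrative box of 110000 waters.
my $n_wat = 110000;
printf "Mg2+ (2 mM):  %d\n", count_for_conc($n_wat, 0.002);
printf "K+ (300 mM):  %d\n", count_for_conc($n_wat, 0.3);
```

#Keeping the rule in one place avoids repeating the 55 mol/L constant for each
#species and makes it easy to change the concentration table.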
print "nb_Ca is *$nb_Ca*\n";
print "nb_Mg is *$nb_Mg*\n";
print "nb_S is *$nb_S*\n";
print "nb_HP is *$nb_HP*\n";
print "nb_2HP is *$nb_2HP*\n";
print "nb_Na is *$nb_Na*\n";
print "nb_ZK is *$nb_ZK*\n";
print "nb_ZHE is *$nb_ZHE*\n";
print "nb_ZR is *$nb_ZR*\n";
print "nb_ZD is *$nb_ZD*\n";
print "nb_ZE is *$nb_ZE*\n";
print "nb_K is *$nb_K*\n";
print "nb_gtp is *$nb_gtp*\n";
print "del_Cl is *$del_Cl*\n";
################################################################################
##END CALCULATE metabolite amounts
################################################################################
################################################################################
##ADD first round of metabolite to solvent box
################################################################################
#Note:
#Execute AddToBox on out-ini.pdb to add first round of metabolites (not the gtps yet)
my $nb_protein_res = $last_protein_res_id;
my $cmd = "/home/ng/amber16/bin/AddToBox -c out-ini.pdb -a Ca.pdb -na $nb_Ca -o out2.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out2.pdb -a MG.pdb -na $nb_Mg -o out3.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out3.pdb -a SUL.pdb -na $nb_S -o out4.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out4.pdb -a Na+.pdb -na $nb_Na -o out7.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out7.pdb -a ZK.pdb -na $nb_ZK -o out8.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out8.pdb -a ZHE.pdb -na $nb_ZHE -o out9.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out9.pdb -a ZR.pdb -na $nb_ZR -o out10.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out10.pdb -a ZD.pdb -na $nb_ZD -o out11.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out11.pdb -a ZE.pdb -na $nb_ZE -o out12.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out12.pdb -a K+.pdb -na $nb_K -o out13.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
################################################################################
##END ADD first round of metabolite to solvent box
################################################################################
################################################################################
##EXECUTE second dummy Leap run
################################################################################
#Note:
#Run second Leap run, with required param files, to add hydrogens to the metabolites,
#and get the unbalanced charge
my $outfile="leap-2.scrpt";
open (FILE2, "> $outfile") || die;
print (FILE2 "source leaprc.protein.ff14SB\n");
print (FILE2 "source leaprc.DNA.OL15\n");
print (FILE2 "source leaprc.RNA.OL3\n");
print (FILE2 "loadoff atomic_ions.lib\n");
print (FILE2 "loadamberparams frcmod.ions1lm_1264_tip4pew\n");
print (FILE2 "loadamberparams frcmod.ions234lm_1264_tip4pew\n");
print (FILE2 "loadoff zaa-new.off\n");
print (FILE2 "loadoff SUL.lib\n");
print (FILE2 "loadamberparams frcmod.sul\n");
print (FILE2 "sys = loadpdb out13.pdb\n");
print (FILE2 "charge sys\n");
print (FILE2 "setBox sys vdw 1.0\n");
print (FILE2 "set sys box {$x_box $y_box $z_box}\n");
print (FILE2 "saveamberparm sys out13.prmtop out13.inpcrd\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "/home/ng/amber16/bin/xleap -s -f leap-2.scrpt > out-leap2.out";
system($cmd);
#keep file:
rename "leap.log", "leap-2.log";
################################################################################
##END EXECUTE second dummy Leap run
################################################################################
################################################################################
##EXTRACT unbalanced charge
################################################################################
my @pdb_input = read_file("leap-2.log") or die;
my $array_size=scalar @pdb_input;
my $charge;
my $done_charge=0;
my @line_handle;
for (my $count=0; $count<$array_size; $count++) {
    my $trigger_charge=substr(@pdb_input[$count], 0, 8);
    if (($trigger_charge eq "> charge") and ($done_charge == 0)){
        @line_handle = split ( /\s+/, @pdb_input[$count+1] );
        $charge = @line_handle[3];
        $charge=nearest (1, $charge);
        $done_charge=1;
    }
}
print "\ncharge is *$charge*\n";
################################################################################
##END EXTRACT unbalanced charge
################################################################################
################################################################################
##CALCULATE number of Cl- required to neutralise the system
################################################################################
#Note:
#Calculate number of Cl- required to neutralise the system now
#and for later when the gtps will be added
my $nb_Cl=$charge;
print "nb_Cl is *$nb_Cl*\n";
################################################################################
##END CALCULATE number of Cl- required to neutralise the system
################################################################################
################################################################################
##ADD Cl- and water molecules to solvent box
################################################################################
my $cmd = "/home/ng/amber16/bin/AddToBox -c out13.pdb -a Cl-.pdb -na $nb_Cl -o out14.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out14.pdb -a WAT.pdb -na $wat -o out15.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
################################################################################
##END ADD Cl- and water molecules to solvent box
################################################################################
################################################################################
##EXECUTE non-dummy Leap run
################################################################################
#Note:
#Execute third Leap run to generate the simulation ready amber inpcrd and prmtop files
my $outfile="leap-3.scrpt";
open (FILE2, "> $outfile") || die;
print (FILE2 "source leaprc.protein.ff14SB\n");
print (FILE2 "source leaprc.DNA.OL15\n");
print (FILE2 "source leaprc.RNA.OL3\n");
print (FILE2 "loadoff atomic_ions.lib\n");
print (FILE2 "loadamberparams frcmod.ions1lm_1264_tip4pew\n");
print (FILE2 "loadamberparams frcmod.ions234lm_1264_tip4pew\n");
print (FILE2 "loadoff solvents.lib\n");
print (FILE2 "loadamberparams frcmod.tip4pew\n");
print (FILE2 "loadoff zaa-new.off\n");
print (FILE2 "loadoff SUL.lib\n");
print (FILE2 "loadamberparams frcmod.sul\n");
print (FILE2 "sys = loadpdb out15.pdb\n");
print (FILE2 "setBox sys vdw 1.0\n");
print (FILE2 "set sys box {$x_box $y_box $z_box}\n");
print (FILE2 "saveamberparm sys out15.prmtop out15.inpcrd\n");
print (FILE2 "savepdb sys out15-leap.pdb\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "/home/ng/amber16/bin/xleap -s -f leap-3.scrpt > out-leap3.out";
system($cmd);
#keep file:
rename "leap.log", "leap-3.log";
#Apply C-4 term to Lennard-Jones potential
#and apply Panteva 2015 m1264 refined set
my $outfile="parmed.in";
open (FILE2, "> $outfile") || die;
print (FILE2 "setOverwrite True\n");
print (FILE2 "change AMBER_ATOM_TYPE :A*,DA*\@N7 NAMG\n");
print (FILE2 "change AMBER_ATOM_TYPE :G*,DG*\@N7 NGMG\n");
print (FILE2 "change AMBER_ATOM_TYPE :*\@OP* OPMG\n");
print (FILE2 "addLJType @\%NAMG\n");
print (FILE2 "addLJType @\%NGMG\n");
print (FILE2 "addLJType @\%OPMG\n");
print (FILE2 "add12_6_4 :ZN watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :ZN\n");
print (FILE2 "add12_6_4 :MG watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :MG\n");
print (FILE2 "add12_6_4 :Na+ watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :Na+\n");
print (FILE2 "add12_6_4 :Cl- watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :Cl-\n");
print (FILE2 "add12_6_4 :CA watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :CA\n");
print (FILE2 "add12_6_4 :K+ watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :K+\n");
print (FILE2 "outparm out15-parmed.prmtop out15-parmed.inpcrd\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "/home/ng/amber16/bin/parmed -i parmed.in -p out15.prmtop -c out15.inpcrd >out-parmed.txt";
system($cmd);
################################################################################
##END EXECUTE non-dummy Leap run
################################################################################
################################################################################
##EXECUTE first round of simulations
################################################################################
##################### MIN
my $outfile="min1.in";
open (FILE2, "> $outfile") || die;
print (FILE2 "2e2h: initial minimisation solvent + ions\n");
print (FILE2 " &cntrl\n");
print (FILE2 " imin = 1,\n");
print (FILE2 " ntmin = 2,\n");
print (FILE2 " maxcyc = 5000,\n");
print (FILE2 " ncyc = 1000,\n");
print (FILE2 " ntb = 1,\n");
print (FILE2 " ntr = 1,\n");
print (FILE2 " cut = 10.0\n");
print (FILE2 " /\n");
print (FILE2 "Hold the protein fixed\n");
print (FILE2 "500.0\n");
print (FILE2 "RES 1 $nb_protein_res\n");
print (FILE2 "END\n");
print (FILE2 "END\n");
close (FILE2);
my $outfile="min2.in";
open (FILE2, "> $outfile") || die;
print (FILE2 "2e2h: initial minimisation whole system\n");
print (FILE2 " &cntrl\n");
print (FILE2 " imin = 1,\n");
print (FILE2 " ntmin = 2,\n");
print (FILE2 " maxcyc = 5000,\n");
print (FILE2 " ncyc = 2500,\n");
print (FILE2 " ntb = 1,\n");
print (FILE2 " ntr = 0,\n");
print (FILE2 " cut = 10.0\n");
print (FILE2 " /\n");
print (FILE2 "END\n");
close (FILE2);
my $cmd = "/home/ng/amber16/bin/sander -O -i min1.in -o min1-p.out -p out15-parmed.prmtop -c out15-parmed.inpcrd -r min1-p.rst -ref out15-parmed.inpcrd";
system($cmd);
my $cmd = "/home/ng/amber16/bin/sander -O -i min2.in -o min2-p.out -p out15-parmed.prmtop -c min1-p.rst -r min2-p.rst";
system($cmd);
################################################################################
##EXECUTE next preliminary steps with OPENMM
################################################################################
##################### HEAT (MD1) 20 ps
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");
print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");
print (FILE2 "\n");
print (FILE2 "parm = load_file('out15-parmed.prmtop', 'min2-p.rst')\n");
print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=10*angstroms, constraints=HBonds, rigidWater=True)\n");
print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n");
#Add restraints to protein
print (FILE2 "force_res_prot = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n");
print (FILE2 "force_res_prot.addGlobalParameter(\"k\", 10.0*kilocalories_per_mole/angstroms**2)\n");
print (FILE2 "force_res_prot.addPerParticleParameter(\"x0\")\n");
print (FILE2 "force_res_prot.addPerParticleParameter(\"y0\")\n");
print (FILE2 "force_res_prot.addPerParticleParameter(\"z0\")\n");
print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n");
print (FILE2 "    if (i <= $last_protein_id):\n");
print (FILE2 "        force_res_prot.addParticle(i, atom_crd.value_in_unit(nanometers))\n");
print (FILE2 "system.addForce(force_res_prot)\n");
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");
print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n");
print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");
print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 100, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('md1-p.nc', 5000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('md1-p.rst7', 10000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
print (FILE2 "simulation.step(10000)\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-md1-p.txt";
system($cmd);
##################### EQ-VEL (MD2-eq) 100 ps
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");
print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");
print (FILE2 "\n");
print (FILE2 "parm = load_file('out15-parmed.prmtop', 'md1-p.rst7.10000')\n");
print (FILE2 "inpcrd = AmberInpcrdFile('md1-p.rst7.10000')\n");
print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n");
print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n");
#Add restraints to anchors
print (FILE2 "force_anchors = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n");
print (FILE2 "force_anchors.addGlobalParameter(\"k\", 50.0*kilocalories_per_mole/angstroms**2)\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"x0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"y0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"z0\")\n");
print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n");
print (FILE2 "    if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n");
print (FILE2 "        force_anchors.addParticle(i, atom_crd.value_in_unit(nanometers))\n");
print (FILE2 "system.addForce(force_anchors)\n");
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");
print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n");
print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");
print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n");
print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 1000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True, volume=True, density=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('md2-eq-p.nc', 10000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('md2-eq-p.rst7', 50000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
print (FILE2 "simulation.step(50000)\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-md2-eq-p.txt";
system($cmd);
##################### EQ-BOX (MD2-sim1) 20 ns
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");
print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");
print (FILE2 "\n");
print (FILE2 "parm = load_file('out15-parmed.prmtop', 'md2-eq-p.rst7.50000')\n");
print (FILE2 "inpcrd = AmberInpcrdFile('md2-eq-p.rst7.50000')\n");
print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n");
print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n");
print (FILE2 "system.addForce(MonteCarloBarostat(1*bar, 300*kelvin))\n");
#Add restraints to anchors
print (FILE2 "force_anchors = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n");
print (FILE2 "force_anchors.addGlobalParameter(\"k\", 50.0*kilocalories_per_mole/angstroms**2)\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"x0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"y0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"z0\")\n");
print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n");
print (FILE2 "    if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n");
print (FILE2 "        force_anchors.addParticle(i, atom_crd.value_in_unit(nanometers))\n");
print (FILE2 "system.addForce(force_anchors)\n");
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");
print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n");
print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");
print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n");
print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 250000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True, volume=True, density=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('md2-sim1-p.nc', 250000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('md2-sim1-p.rst7', 1000000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
print (FILE2 "simulation.step(10000000)\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-md2-sim1-p.txt";
system($cmd);
##################### EQ-VEL2 (MD2-sim2) 20 ns
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");
print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");
print (FILE2 "\n");
print (FILE2 "parm = load_file('out15-parmed.prmtop', 'md2-sim1-p.rst7.10000000')\n");
print (FILE2 "inpcrd = AmberInpcrdFile('md2-sim1-p.rst7.10000000')\n");
print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n");
print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n");
#Add restraints to anchors
print (FILE2 "force_anchors = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n");
print (FILE2 "force_anchors.addGlobalParameter(\"k\", 50.0*kilocalories_per_mole/angstroms**2)\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"x0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"y0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"z0\")\n");
print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n");
print (FILE2 "    if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n");
print (FILE2 "        force_anchors.addParticle(i, atom_crd.value_in_unit(nanometers))\n");
print (FILE2 "system.addForce(force_anchors)\n");
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");
print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n");
print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");
print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n");
print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 250000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True, volume=True, density=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('md2-sim2-p-rst.nc', 250000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('md2-sim2-p-rst.rst7', 1000000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
print (FILE2 "simulation.step(1000000)\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-md2-sim2-p-rst.txt";
system($cmd);
################################################################################
##END EXECUTE next preliminary steps with OPENMM
################################################################################
################################################################################
##EXTRACT LAST FRAME
################################################################################
#Note:
##image the trajectory back inside the periodic box
##extract last frame, strip the water and convert to PDB
#NB:
##One can strip the water directly because,
##in contrast to phase 1, one does not need to extract the
##water box dimensions, as the simulation routine has
##automatically written this information to the CRYST line
##of the PDB file
$ENV{LD_LIBRARY_PATH} = "/home/ng/amber16/lib";
my $outfile="scr-frame.vmd";
open (FILE2, "> $outfile") || die;
print (FILE2 "set outFile out-frame.txt\n");
print (FILE2 "set out [open \$outFile w]\n");
print (FILE2 "set mol [mol new out15-parmed.prmtop]\n");
print (FILE2 "mol addfile md2-sim2-p.nc waitfor all molid \$mol\n");
print (FILE2 "set n [molinfo top get numframes]\n");
print (FILE2 "puts \$out \"\$n\"\n");
print (FILE2 "exit\n");
close (FILE2);
my $cmd = "vmd -dispdev text -nt -e scr-frame.vmd";
system($cmd);
#The number of frames written by VMD is used as the index of the last frame
my @pdb_input_ini = read_file("out-frame.txt") or die;
my $last_frame= $pdb_input_ini[0];
$last_frame =~ s/^\s+|\s+$//g;
print "last_frame is *$last_frame*\n";
my $outfile="autoimage.ptraj";
open (FILE2, "> $outfile") || die;
print (FILE2 "trajin md2-sim2-imaged-p.nc $last_frame $last_frame 1\n");
print (FILE2 "strip :WAT\n");
print (FILE2 "trajout md2-sim2-imaged-stripped-p.pdb\n");
close (FILE2);
my $cmd = "/home/ng/amber16/bin/cpptraj out15-parmed.prmtop < autoimage.ptraj > out-ptraj3.txt";
system($cmd);
$ENV{LD_LIBRARY_PATH} = "/usr/local/cuda-7.5/lib64:/lib";
################################################################################
##END EXTRACT LAST FRAME
################################################################################

################################################################################
##Extract water box size
################################################################################
#The box dimensions are read from the first line (CRYST1 record) of the PDB file
my @pdb_input_ini = read_file("md2-sim2-imaged-stripped-p.pdb") or die;
my $line_cryst= $pdb_input_ini[0];
my @line_cryst_handle = split ( /\s+/, $line_cryst );
my $x_box= $line_cryst_handle[1];
my $y_box= $line_cryst_handle[2];
my $z_box= $line_cryst_handle[3];
print "\nx_box is *$x_box*\n";
print "y_box is *$y_box*\n";
print "z_box is *$z_box*\n";
################################################################################
##END Extract water box size
################################################################################
################################################################################
##AMEND PDB
################################################################################
#Note:
##add again NCTER atoms removed by simulation routine ##and remove OXT atoms, because they are not supported by xLeap ##and remove del_Cl amount of Cl- ions my @pdb_input = read_file("md2-sim2-imaged-stripped-p.pdb") or die; my @pdb_output; my $array_size=scalar @pdb_input; my $outfile= "md2-sim2-imaged-stripped-amended-p.pdb"; ##Copy PDB file: for (my $count=0; $count<$array_size; $count++) { @pdb_output[$count] = @pdb_input[$count]; } ############## UPDATE NCTER lines my $update_NTER; my $update_CTER; my $y=2; for (my $count=0; $count<$array_size; $count++) { my $line= @pdb_input[$count]; my $trigger_TER=substr($line, 0, 3); ############ Detect TER ################## if ($trigger_TER eq "TER") { #Detect if TER event occurs: #1/ at the start of the protein #2/ inbetween two protein segments #3/ at the end of the protein #To do so, look at the residue type preceding and following TER line: my $line_prec= @pdb_input[$count-1]; my $RP=substr($line_prec, 17, 3); my $line_foll= @pdb_input[$count+1]; my $RF=substr($line_foll, 17, 3); if (($RP eq "ALA") or ($RP eq "ARG") or ($RP eq "ASH") or ($RP eq "ASN") or ($RP eq "ASP") or ($RP eq "CYM") or ($RP eq "CYS") or ($RP eq "CYX") or ($RP eq "GLN") or ($RP eq "GLU") or ($RP eq "GLY") or ($RP eq "HID") or ($RP eq "HIE") or ($RP eq "HIP") or ($RP eq "HYP") or ($RP eq "ILE") or ($RP eq "LEU") or ($RP eq "LYN") or ($RP eq "LYS") or ($RP eq "MET") or ($RP eq "PHE") or ($RP eq "PRO") or ($RP eq "SER") or ($RP eq "THR") or ($RP eq "THR") or ($RP eq "TRP") or ($RP eq "TRP") or ($RP eq "TYR") or ($RP eq "VAL")){ $update_CTER = "on"; } if (($RF eq "ALA") or ($RF eq "ARG") or ($RF eq "ASH") or ($RF eq "ASN") or ($RF eq "ASP") or ($RF eq "CYM") or ($RF eq "CYS") or ($RF eq "CYX") or ($RF eq "GLN") or ($RF eq "GLU") or ($RF eq "GLY") or ($RF eq "HID") or ($RF eq "HIE") or ($RF eq "HIP") or ($RF eq "HYP") or ($RF eq "ILE") or ($RF eq "LEU") or ($RF eq "LYN") or ($RF eq "LYS") or ($RF eq "MET") or ($RF eq "PHE") or ($RF eq "PRO") or ($RF 
eq "SER") or ($RF eq "THR") or ($RF eq "THR") or ($RF eq "TRP") or ($RF eq "TRP") or ($RF eq "TYR") or ($RF eq "VAL")){ $update_NTER = "on"; } ## UPDATE NTER if ($update_NTER eq "on") { ##Go forwards on residue length my $line_NTER_segment_start=@pdb_input[$count+1]; my $resid_NTER_segment_start=substr($line_NTER_segment_start, 22, 4);
for (my $c=1; $c<$y; $c++) { my $line_NTER_segment= @pdb_input[$count+$c]; my $resid_NTER_segment=substr($line_NTER_segment, 22, 4); if ($resid_NTER_segment==$resid_NTER_segment_start) { $y++; my $sel1=substr($line_NTER_segment, 0, 16); my $sel2="N"; my $sel3=substr($line_NTER_segment, 17, 63); my $sel_tot=$sel1 . $sel2 . $sel3; @pdb_output[$count+$c]="$sel_tot\n"; } } $update_NTER = "off" } ## END UPDATE NTER ## UPDATE CTER if ($update_CTER eq "on") { ##Go backwards on residue length my $line_CTER_segment_start= @pdb_input[$count-1]; my $resid_CTER_segment_start=substr($line_CTER_segment_start, 22, 4); for (my $c=1; $c<$y; $c++) { my $line_CTER_segment= @pdb_input[$count-$c]; my $resid_CTER_segment=substr($line_CTER_segment, 22, 4); if ($resid_CTER_segment==$resid_CTER_segment_start) { $y++; my $sel1=substr($line_CTER_segment, 0, 16); my $sel2= "C"; my $sel3=substr($line_CTER_segment, 17, 63); my $sel_tot = $sel1 . $sel2 . $sel3; @pdb_output[$count-$c] = "$sel_tot\n"; } } $update_CTER = "off" } ## END UPDATE CTER } ############ END Detect TER ################## } ############## END of line loop and END UPDATE NCTER lines # print to file open (FILE, "> $outfile"); for (my $count_2=0; $count_2<$array_size; $count_2++) { print (FILE"@pdb_output[$count_2]"); } close (FILE); #Remove some Cl- to balance the NTPs to be injected #Note: $del_Cl is calculated in phasis 1 my $done=0; my $update_count=0; my $count_update_output=0; my @pdb_input = read_file("md2-sim2-imaged-stripped-amended-p.pdb") or die; my @pdb_output; my $array_size=scalar @pdb_input; my $outfile= "md2-sim2-imaged-stripped-amended2-p.pdb"; for (my $count=0; $count<$array_size; $count++) {
my $line= @pdb_input[$count]; my $trigger_Cl=substr($line, 17, 3); my $atom=substr($line, 13, 3); if ($done==0){ if ($trigger_Cl eq "Cl-") { #then remove twice the number of lines corresponding to #nb of Cl to be removed in order to account for TER $update_count=$del_Cl*2; $done=1; } } if ($atom ne "OXT"){ @pdb_output[$count+$count_update_output] = @pdb_input[$count+$update_count]; } if ($atom eq "OXT"){ $count_update_output--; } } # print to file open (FILE, "> $outfile"); for (my $count_2=0; $count_2<($array_size - $update_count); $count_2++) { print (FILE"@pdb_output[$count_2]"); } close (FILE); ################################################################################ ##END AMEND PDB ################################################################################ ################################################################################ ##INJECT NTPs ################################################################################ my $cmd = "/home/ng/amber16/bin/AddToBox -c md2-sim2-imaged-stripped-amended2-p.pdb -a gtp.pdb -na $nb_gtp -o out16.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1"; system($cmd); ################################################################################ ##END INJECT NTPs ################################################################################ ################################################################################ ##AMEND NTPs ################################################################################ #Note: #MgB is supplemented directly with GTP (same resid for AddToBox) #Hence now MgB residues are to be specified in their own resid my $count=0; my $update_resid=0; my $count_update_output=0; my @pdb_input = read_file("out16.pdb") or die; my @pdb_output;
my $array_size=scalar @pdb_input; my $outfile= "out16-amended.pdb"; ########### LINE LOOP for (my $count=0; $count<$array_size; $count++) { my $line= @pdb_input[$count]; my $resid=substr($line, 22, 7); my $resname=substr($line, 17, 3); my $atom=substr($line, 13, 3); @pdb_output[$count+$count_update_output] = "$line"; if (($resname eq "gtp") and ($atom eq "MG ")){ my $sel1=substr($line, 0, 17); my $sel2="MG "; my $sel3=substr($line, 20, 2); my $sel4=$resid+$update_resid; my $sel5=substr($line, 26, 40); @pdb_output[$count+$count_update_output] = $sel1 . $sel2 . $sel3 . $sel4 . $sel5; $update_resid++; @pdb_output[$count+$count_update_output+1] = "TER \n"; $count_update_output++; } if (($resname eq "gtp") and ($atom ne "MG ")){ my $sel1=substr($line, 0, 22); my $sel2=$resid+$update_resid; my $sel3=substr($line, 26, 40); @pdb_output[$count+$count_update_output] = $sel1 . $sel2 . $sel3; } } open (FILE, "> $outfile"); for (my $count_2=0; $count_2<$array_size+$count_update_output; $count_2++) { print (FILE"@pdb_output[$count_2]"); } close (FILE); ################################################################################ ##END AMEND NTPs ################################################################################ ################################################################################ ##ADD WATER AGAIN ################################################################################ my $cmd = "/home/ng/amber16/bin/AddToBox -c out16-amended.pdb -a WAT.pdb -na $wat -o out17.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1"; system($cmd); ################################################################################ ##END ADD WATER AGAIN ################################################################################ ################################################################################ ##EXECUTE second non-dummy Leap run ################################################################################
#Note: ##Execute Leap run, #to generate simulation ready files #and to be able to count nb of atoms used for the upcoming aMD run my $outfile="leap-4.scrpt"; open (FILE2, "> $outfile") || die; print (FILE2 "source leaprc.protein.ff14SB\n"); print (FILE2 "source leaprc.DNA.OL15\n"); print (FILE2 "source leaprc.RNA.OL3\n"); print (FILE2 "loadoff atomic_ions.lib\n"); print (FILE2 "loadamberparams frcmod.ions1lm_1264_tip4pew\n"); print (FILE2 "loadamberparams frcmod.ions234lm_1264_tip4pew\n"); print (FILE2 "loadoff solvents.lib\n"); print (FILE2 "loadamberparams frcmod.tip4pew\n"); print (FILE2 "loadoff zaa-new.off\n"); print (FILE2 "loadoff SUL.lib\n"); print (FILE2 "loadamberparams frcmod.sul\n"); print (FILE2 "loadamberprep gtp.prep\n"); print (FILE2 "loadamberparams frcmod.gtp\n"); print (FILE2 "sys = loadpdb out17.pdb\n"); print (FILE2 "setBox sys vdw 1.0\n"); print (FILE2 "set sys box {$x_box $y_box $z_box}\n"); print (FILE2 "saveamberparm sys out17.prmtop out17.inpcrd\n"); print (FILE2 "savepdb sys out17-leap.pdb\n"); print (FILE2 "quit\n"); close (FILE2); my $cmd = "/home/ng/amber16/bin/xleap -s -f leap-4.scrpt > out-leap4.out"; system($cmd); #keep file: rename "leap.log", "leap-4.log"; #Apply r-4 term to Lennard-Jones potential #and apply Panteva 2015 m1264 refined set my $outfile="parmed.in"; open (FILE2, "> $outfile") || die; print (FILE2 "setOverwrite True\n"); print (FILE2 "change AMBER_ATOM_TYPE :A*,DA*\@N7 NAMG\n"); print (FILE2 "change AMBER_ATOM_TYPE :G*,DG*\@N7 NGMG\n"); print (FILE2 "change AMBER_ATOM_TYPE :*\@OP* OPMG\n"); print (FILE2 "addLJType @\%NAMG\n"); print (FILE2 "addLJType @\%NGMG\n"); print (FILE2 "addLJType @\%OPMG\n"); print (FILE2 "add12_6_4 :ZN watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :ZN\n"); print (FILE2 "add12_6_4 :MG watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :MG\n"); print (FILE2 "add12_6_4 :Na+ watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :Na+\n"); print (FILE2 "add12_6_4 :Cl- watermodel TIP4PEW\n"); 
print (FILE2 "printLJMatrix :Cl-\n"); print (FILE2 "add12_6_4 :CA watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :CA\n"); print (FILE2 "add12_6_4 :K+ watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :K+\n"); print (FILE2 "outparm out17-parmed.prmtop out17-parmed.inpcrd\n"); print (FILE2 "quit\n"); close (FILE2);
my $cmd = "/home/ng/amber16/bin/parmed -i parmed.in -p out17.prmtop -c out17.inpcrd >out-parmed2.txt"; system($cmd); ################################################################################ ##END EXECUTE second non-dummy Leap run ################################################################################ ################################################################################ ##EXTRACT NB ATOMS (for aMD) ################################################################################ my @pdb_input=read_file("out17-leap.pdb") or die; my $array_size=scalar @pdb_input; my $nb_atoms=substr(@pdb_input[$array_size-2], 6, 6); print "\nnb_atoms is *$nb_atoms*\n"; ################################################################################ ##END EXTRACT NB ATOMS (for aMD) ################################################################################ ################################################################################ ##EXECUTE second round of simulations ################################################################################ ##################### MIN my $cmd = "/home/ng/amber16/bin/sander -O -i min1.in -o min1.out -p out17-parmed.prmtop -c out17-parmed.inpcrd -r min1.rst -ref out17-parmed.inpcrd"; system($cmd); my $cmd = "/home/ng/amber16/bin/sander -O -i min2.in -o min2.out -p out17-parmed.prmtop -c min1.rst -r min2.rst"; system($cmd); ################################################################################ ##EXECUTE next preliminary steps with OPENMM ################################################################################ ##################### HEAT (MD1) 20 ps my $outfile="openmm-input.py"; open (FILE2, "> $outfile") || die; print (FILE2 "from __future__ import absolute_import\n"); print (FILE2 "from simtk.openmm import CustomIntegrator\n"); print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n"); print (FILE2 "from simtk.openmm import CustomExternalForce\n"); print (FILE2 "from 
simtk.openmm.app import *\n"); print (FILE2 "from simtk.openmm import *\n"); print (FILE2 "from simtk.unit import *\n"); print (FILE2 "from sys import stdout\n"); print (FILE2 "from parmed import load_file\n"); print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n"); print (FILE2 "from parmed.openmm.reporters import RestartReporter\n"); print (FILE2 "\n"); print (FILE2 "parm = load_file('out17-parmed.prmtop', 'min2.rst')\n");
print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n"); print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n"); #Add restraints to protein print (FILE2 "force_res_prot = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n"); print (FILE2 "force_res_prot.addGlobalParameter(\"k\", 10.0*kilocalories_per_mole/angstroms**2)\n"); print (FILE2 "force_res_prot.addPerParticleParameter(\"x0\")\n"); print (FILE2 "force_res_prot.addPerParticleParameter(\"y0\")\n"); print (FILE2 "force_res_prot.addPerParticleParameter(\"z0\")\n"); print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n"); print (FILE2 " if (i <= $last_protein_id):\n"); print (FILE2 " force_res_prot.addParticle(i, atom_crd.value_in_unit(nanometers))\n"); print (FILE2 "system.addForce(force_res_prot)\n"); print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n"); print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n"); print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n"); print (FILE2 "simulation.context.setPositions(parm.positions)\n"); print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 100, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True))\n"); print (FILE2 "simulation.reporters.append(NetCDFReporter('md1.nc', 5000, crds=True))\n"); print (FILE2 "restrt = RestartReporter('md1.rst7', 10000, parm.ptr('natom'), netcdf=True)\n"); print (FILE2 "simulation.reporters.append(restrt)\n"); print (FILE2 "simulation.step(10000)\n"); print (FILE2 "quit\n"); close (FILE2); my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-md1.txt"; system($cmd); ##################### EQ-VEL (MD2-eq) 100 ps my $outfile="openmm-input.py"; open (FILE2, "> $outfile") || die; print (FILE2 "from __future__ import absolute_import\n"); print (FILE2 "from simtk.openmm 
import CustomIntegrator\n"); print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n"); print (FILE2 "from simtk.openmm import CustomExternalForce\n"); print (FILE2 "from simtk.openmm.app import *\n"); print (FILE2 "from simtk.openmm import *\n"); print (FILE2 "from simtk.unit import *\n"); print (FILE2 "from sys import stdout\n"); print (FILE2 "from parmed import load_file\n"); print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n"); print (FILE2 "from parmed.openmm.reporters import RestartReporter\n"); print (FILE2 "\n"); print (FILE2 "parm = load_file('out17-parmed.prmtop', 'md1.rst7.10000')\n"); print (FILE2 "inpcrd = AmberInpcrdFile('md1.rst7.10000')\n"); print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n"); print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n"); #Add restraints to anchors
print (FILE2 "force_anchors = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n"); print (FILE2 "force_anchors.addGlobalParameter(\"k\", 50.0*kilocalories_per_mole/angstroms**2)\n"); print (FILE2 "force_anchors.addPerParticleParameter(\"x0\")\n"); print (FILE2 "force_anchors.addPerParticleParameter(\"y0\")\n"); print (FILE2 "force_anchors.addPerParticleParameter(\"z0\")\n"); print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n"); print (FILE2 " if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n"); print (FILE2 " force_anchors.addParticle(i, atom_crd.value_in_unit(nanometers))\n"); print (FILE2 "system.addForce(force_anchors)\n"); print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n"); print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n"); print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n"); print (FILE2 "simulation.context.setPositions(parm.positions)\n"); print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n"); print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n"); print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 1000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True, volume=True, density=True))\n"); print (FILE2 "simulation.reporters.append(NetCDFReporter('md2-eq.nc', 10000, crds=True))\n"); print (FILE2 "restrt = RestartReporter('md2-eq.rst7', 50000, parm.ptr('natom'), netcdf=True)\n"); print (FILE2 "simulation.reporters.append(restrt)\n"); print (FILE2 "simulation.step(50000)\n"); print (FILE2 "quit\n"); close (FILE2); my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-md2-eq.txt"; system($cmd); ##################### EQ-BOX (MD2-sim1) 20 ns my $outfile="openmm-input.py"; open (FILE2, "> $outfile") || die; print (FILE2 "from __future__ import absolute_import\n"); print 
(FILE2 "from simtk.openmm import CustomIntegrator\n"); print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n"); print (FILE2 "from simtk.openmm import CustomExternalForce\n"); print (FILE2 "from simtk.openmm.app import *\n"); print (FILE2 "from simtk.openmm import *\n"); print (FILE2 "from simtk.unit import *\n"); print (FILE2 "from sys import stdout\n"); print (FILE2 "from parmed import load_file\n"); print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n"); print (FILE2 "from parmed.openmm.reporters import RestartReporter\n"); print (FILE2 "\n"); print (FILE2 "parm = load_file('out17-parmed.prmtop', 'md2-eq.rst7.50000')\n"); print (FILE2 "inpcrd = AmberInpcrdFile('md2-eq.rst7.50000')\n"); print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n"); print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n"); print (FILE2 "system.addForce(MonteCarloBarostat(1*bar, 300*kelvin))\n"); #Add restraints to anchors print (FILE2 "force_anchors = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n");
print (FILE2 "force_anchors.addGlobalParameter(\"k\", 50.0*kilocalories_per_mole/angstroms**2)\n"); print (FILE2 "force_anchors.addPerParticleParameter(\"x0\")\n"); print (FILE2 "force_anchors.addPerParticleParameter(\"y0\")\n"); print (FILE2 "force_anchors.addPerParticleParameter(\"z0\")\n"); print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n"); print (FILE2 " if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n"); print (FILE2 " force_anchors.addParticle(i, atom_crd.value_in_unit(nanometers))\n"); print (FILE2 "system.addForce(force_anchors)\n"); print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n"); print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n"); print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n"); print (FILE2 "simulation.context.setPositions(parm.positions)\n"); print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n"); print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n"); print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 250000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True, volume=True, density=True))\n"); print (FILE2 "simulation.reporters.append(NetCDFReporter('md2-sim1.nc', 250000, crds=True))\n"); print (FILE2 "restrt = RestartReporter('md2-sim1.rst7', 1000000, parm.ptr('natom'), netcdf=True)\n"); print (FILE2 "simulation.reporters.append(restrt)\n"); print (FILE2 "simulation.step(10000000)\n"); print (FILE2 "quit\n"); close (FILE2); my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-md2-sim1.txt"; system($cmd); ##################### EQ-VEL2 (MD2-sim2) 20 ns my $outfile="openmm-input.py"; open (FILE2, "> $outfile") || die; print (FILE2 "from __future__ import absolute_import\n"); print (FILE2 "from simtk.openmm import CustomIntegrator\n"); print (FILE2 "from simtk.unit 
import kilojoules_per_mole, is_quantity\n"); print (FILE2 "from simtk.openmm import CustomExternalForce\n"); print (FILE2 "from simtk.openmm.app import *\n"); print (FILE2 "from simtk.openmm import *\n"); print (FILE2 "from simtk.unit import *\n"); print (FILE2 "from sys import stdout\n"); print (FILE2 "from parmed import load_file\n"); print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n"); print (FILE2 "from parmed.openmm.reporters import RestartReporter\n"); print (FILE2 "\n"); print (FILE2 "def forcegroupify(system):\n"); print (FILE2 " forcegroups = {}\n"); print (FILE2 " for i in range(system.getNumForces()):\n"); print (FILE2 " force = system.getForce(i)\n"); print (FILE2 " force.setForceGroup(i)\n"); print (FILE2 " forcegroups[force] = i\n"); print (FILE2 " return forcegroups\n"); print (FILE2 "\n"); print (FILE2 "parm = load_file('out17-parmed.prmtop', 'md2-sim1.rst7.10000000')\n"); print (FILE2 "inpcrd = AmberInpcrdFile('md2-sim1.rst7.10000000')\n");
print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n"); print (FILE2 "fgrps=forcegroupify(system)\n"); print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n"); #Add restraints to anchors print (FILE2 "force_anchors = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n"); print (FILE2 "force_anchors.addGlobalParameter(\"k\", 50.0*kilocalories_per_mole/angstroms**2)\n"); print (FILE2 "force_anchors.addPerParticleParameter(\"x0\")\n"); print (FILE2 "force_anchors.addPerParticleParameter(\"y0\")\n"); print (FILE2 "force_anchors.addPerParticleParameter(\"z0\")\n"); print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n"); print (FILE2 " if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n"); print (FILE2 " force_anchors.addParticle(i, atom_crd.value_in_unit(nanometers))\n"); print (FILE2 "system.addForce(force_anchors)\n"); print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n"); print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n"); print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n"); print (FILE2 "simulation.context.setPositions(parm.positions)\n"); print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n"); print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n"); print (FILE2 "simulation.reporters.append(StateDataReporter('out-openmm-md2-sim2.txt', 250000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True, volume=True, density=True))\n"); print (FILE2 "simulation.reporters.append(NetCDFReporter('md2-sim2.nc', 250000, crds=True))\n"); print (FILE2 "restrt = RestartReporter('md2-sim2.rst7', 1000000, parm.ptr('natom'), netcdf=True)\n"); print (FILE2 "simulation.reporters.append(restrt)\n"); ##Simulate in 
checkpoints in order to supervise data #40 *250000 = 10000000=20 ns print (FILE2 "for i in range (40):\n"); print (FILE2 " simulation.step(250000)\n"); #print total potential energy print (FILE2 " y = simulation.context.getState(getEnergy=True).getPotentialEnergy()\n"); print (FILE2 " y = y/4.184\n"); print (FILE2 " print(\"ET =\", y)\n"); #print dihedral potential energy print (FILE2 " x = simulation.context.getState(getEnergy=True,groups=4).getPotentialEnergy()\n"); print (FILE2 " x = x/4.184\n"); print (FILE2 " print(\"Ed =\", x)\n"); print (FILE2 "simulation.saveState('md2-sim2.rst7')\n"); print (FILE2 "quit\n"); close (FILE2); my $cmd = "~/anaconda3/bin/python openmm-input.py >out-Ep.txt"; system($cmd); ################################################################################ ##END EXECUTE next preliminary steps with OPENMM ################################################################################
################################################################################
##EXTRACT aMD parameters
################################################################################
my $median_EPtot;
my $median_DIHED;
my $count=0;
my @md_output = read_file("out-Ep.txt", chomp => 1) or die;
my @EPtot_output;   #define array for total potential energy values
my @DIHED_output;   #define array for dihedral energy values
my $array_size=scalar @md_output;
my @line_handle;
my $Ep;
my $type;
#Each energy line written by the OpenMM script reads "ET = <value> ..." or "Ed = <value> ..."
for (my $count=0; $count<$array_size; $count++) {
    @line_handle = split ( /\s+/, $md_output[$count] );
    $type = $line_handle[0];
    $Ep = $line_handle[2];
    if ($type eq "ET"){
        push(@EPtot_output,($Ep));
    }
    if ($type eq "Ed"){
        push(@DIHED_output,($Ep));
    }
}
#compute basic statistics on the data
print "\nEPtot statistics:\n";
my $EPtot_stat=Statistics::Descriptive::Full->new();
$EPtot_stat->add_data(@EPtot_output);
my $median=$EPtot_stat->median();
$median_EPtot= round($median);
print "\nMedian value chosen for EPtot aMD parameter calculation is: $median_EPtot\n\n";
print "\nDIHED statistics:\n";
my $DIHED_stat=Statistics::Descriptive::Full->new();
$DIHED_stat->add_data(@DIHED_output);
$median=$DIHED_stat->median();
$median_DIHED= round($median);
print "\nMedian value chosen for DIHED aMD parameter calculation is: $median_DIHED\n\n";
print "**********************************************************************";
print "\n\tCalculating parameters for aMD simulation\n";
print "\talpha factor:\t\t0.20\n\tnumber of residues:\t$nb_protein_res\n\tnumber of atoms:\t$nb_atoms\n\tDIHED:\t\t\t$median_DIHED\n\tEPtot:\t\t\t$median_EPtot\t\n\n";
#The medians are in kcal/mol (the OpenMM script divides by 4.184);
#the factor 4.184 below converts back to kJ/mol
print "Boosting DIHEDRAL potential:";
my $energy_contribution=$nb_protein_res*(3.5*4.184);
print "\n\tenergy contribution (3.5kcal/mol/residue) =\t$energy_contribution";
my $alphaD=round($energy_contribution*0.20);
print "\n\talphaD \t (rounded) =\t\t\t\t$alphaD";
my $EthreshD=round($energy_contribution+($median_DIHED*4.184));
print "\n\tEthreshD (rounded) =\t\t\t\t$EthreshD";
print "\n\nBoosting EPtot potential:";
my $alphaP=round($nb_atoms*(0.20*4.184));
print "\n\talphaP \t (rounded) =\t\t\t\t$alphaP";
my $EthreshP=round(($median_EPtot*4.184)+$alphaP); print "\n\tEthreshP (rounded) =\t\t\t\t$EthreshP"; ################################################################################ ##END EXTRACT aMD parameters ################################################################################ ################################################################################ ##EXECUTE aMD ################################################################################ my $outfile="openmm-input.py"; open (FILE2, "> $outfile") || die; print (FILE2 "from __future__ import absolute_import\n"); print (FILE2 "from simtk.openmm import CustomIntegrator\n"); print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n"); print (FILE2 "from simtk.openmm import CustomExternalForce\n"); print (FILE2 "from simtk.openmm.app import *\n"); print (FILE2 "from simtk.openmm import *\n"); print (FILE2 "from simtk.unit import *\n"); print (FILE2 "from sys import stdout\n"); print (FILE2 "from parmed import load_file\n"); print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n"); print (FILE2 "from parmed.openmm.reporters import RestartReporter\n"); print (FILE2 "\n"); print (FILE2 "def forcegroupify(system):\n"); print (FILE2 " forcegroups = {}\n"); print (FILE2 " for i in range(system.getNumForces()):\n"); print (FILE2 " force = system.getForce(i)\n"); print (FILE2 " force.setForceGroup(i)\n"); print (FILE2 " forcegroups[force] = i\n"); print (FILE2 " return forcegroups\n"); print (FILE2 "\n"); print (FILE2 "parm = load_file('out17-parmed.prmtop', 'md2-sim2.rst7.10000000')\n"); print (FILE2 "inpcrd = AmberInpcrdFile('md2-sim2.rst7.10000000')\n"); print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n"); print (FILE2 "fgrps=forcegroupify(system)\n"); print (FILE2 "integrator = DualAMDIntegrator(2*femtoseconds, 2, $alphaP, $EthreshP, $alphaD, $EthreshD)\n"); print (FILE2 
"system.addForce(AndersenThermostat(300*kelvin, 1.0/picosecond))\n"); #Add restraints to anchors print (FILE2 "force_anchors = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n"); print (FILE2 "force_anchors.addGlobalParameter(\"k\", 50.0*kilocalories_per_mole/angstroms**2)\n"); print (FILE2 "force_anchors.addPerParticleParameter(\"x0\")\n"); print (FILE2 "force_anchors.addPerParticleParameter(\"y0\")\n"); print (FILE2 "force_anchors.addPerParticleParameter(\"z0\")\n"); print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n"); print (FILE2 " if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n"); print (FILE2 " force_anchors.addParticle(i, atom_crd.value_in_unit(nanometers))\n"); print (FILE2 "system.addForce(force_anchors)\n"); print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");
print (FILE2 "test = Platform.getPluginLoadFailures()\n"); print (FILE2 "print(\"test-platform is\", test)\n"); print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n"); print (FILE2 "test = Platform.getPluginLoadFailures()\n"); print (FILE2 "print(\"test-platform is\", test)\n"); print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n"); print (FILE2 "simulation.context.setPositions(parm.positions)\n"); print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n"); print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n"); print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 250000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True, volume=True, density=True))\n"); print (FILE2 "simulation.reporters.append(NetCDFReporter('aMD2.nc', 250000, crds=True))\n"); print (FILE2 "restrt = RestartReporter('aMD2.rst7', 250000, parm.ptr('natom'), netcdf=True)\n"); print (FILE2 "simulation.reporters.append(restrt)\n"); print (FILE2 "simulation.step(50000000)\n"); print (FILE2 "quit\n"); close (FILE2); my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-aMD.txt"; system($cmd); ################################################################################ ##END EXECUTE aMD ################################################################################ exit;
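Note on the aMD boost parameters: the arithmetic of the "EXTRACT aMD parameters" step of the script above can be restated compactly. The following Python sketch is illustrative only and is not part of the original procedure. The function names round_half_away and amd_parameters, the dictionary keys and the example inputs are assumptions introduced here. The sketch follows the script: the medians are taken in kcal/mol, the factor 4.184 converts to kJ/mol, the alpha factor is 0.20 and the dihedral contribution is 3.5 kcal/mol per residue. Rounding is done half away from zero, which is how Math::Round::round behaves.

```python
import math
from statistics import median

KCAL_TO_KJ = 4.184  # conversion factor used throughout the listing


def round_half_away(x):
    """Round to the nearest integer, halves away from zero (as Math::Round::round)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def amd_parameters(eptot_kcal, dihed_kcal, n_residues, n_atoms,
                   alpha_factor=0.20, per_residue_kcal=3.5):
    """Restate the script's aMD parameter arithmetic (hypothetical helper).

    eptot_kcal, dihed_kcal: sampled total and dihedral potential energies (kcal/mol).
    Returns the four boost parameters in kJ/mol, rounded as in the script.
    """
    median_eptot = round_half_away(median(eptot_kcal))
    median_dihed = round_half_away(median(dihed_kcal))

    # Dihedral boost: 3.5 kcal/mol per residue, converted to kJ/mol
    energy_contribution = n_residues * (per_residue_kcal * KCAL_TO_KJ)
    alpha_d = round_half_away(energy_contribution * alpha_factor)
    ethresh_d = round_half_away(energy_contribution + median_dihed * KCAL_TO_KJ)

    # Total potential boost: alpha factor per atom, converted to kJ/mol
    alpha_p = round_half_away(n_atoms * (alpha_factor * KCAL_TO_KJ))
    ethresh_p = round_half_away(median_eptot * KCAL_TO_KJ + alpha_p)

    return {"alphaD": alpha_d, "EthreshD": ethresh_d,
            "alphaP": alpha_p, "EthreshP": ethresh_p}


if __name__ == "__main__":
    # Made-up energies for a toy system of 100 residues and 1000 atoms
    params = amd_parameters([-1000.0, -1010.0, -990.0],
                            [500.0, 510.0, 490.0],
                            n_residues=100, n_atoms=1000)
    print(params)
```

In the script itself the four values are interpolated into the DualAMDIntegrator call of the generated OpenMM input; the sketch only reproduces the numbers and does not run a simulation.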
Appendix 2: sMD simulation procedure
use File::Slurp;
use Math::Round;
use autodie;
use warnings qw(all);

$ENV{PYTHONPATH} = "/home/ng/amber16/lib/python2.7/site-packages";
$ENV{OPENMM_CUDA_COMPILER} = "/usr/local/cuda-8.0/bin/nvcc";
$ENV{LD_LIBRARY_PATH} = "/usr/local/cuda-7.5/lib64:/lib";
$ENV{PATH} = "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin";
$ENV{AMBERHOME} = "/home/ng/amber16";

#System parameters
my $wat=159600;
my $nb_protein_res=3795;
my $x_box=168.024;
my $y_box=187.235;
my $z_box=170.807;
my $last_protein_id=61866;

#Atom index ranges of the DNA anchors
my $dna1=585;
my $dna2=615;
my $dna3=1777;
my $dna4=1809;
my $dna5=1810;
my $dna6=1839;
my $dna7=3028;
my $dna8=3058;

#Working variables
my $line;
my $trigger_Cl;
my $trigger_MG;
my $atom;
my $resid;
my $atom_type;
my $trigger_gtp;
my $resname;

#Landmark coordinates
my $L0x; my $L0y; my $L0z;
my $L1Ax; my $L1Ay; my $L1Az;
my $L1Bx; my $L1By; my $L1Bz;
my $L2x; my $L2y; my $L2z;
my $L3x; my $L3y; my $L3z;
my $L4x; my $L4y; my $L4z;
my $L1x; my $L1y; my $L1z;

#Landmark atom indexes
my $L1A_id;
my $L1B_id; my $L2_id; my $L3_id; my $L4_id; my $gtp_first_id; my $smd_atom_id; my $gtp_last_id; my $last_step; my $exit; ################################################################################ ##EXTRACT LAST FRAME ################################################################################ $ENV{LD_LIBRARY_PATH} = "/home/ng/amber16/lib"; my $outfile="scr-frame.vmd"; open (FILE2, "> $outfile") || die; print (FILE2 "set outFile out-frame.txt\n"); print (FILE2 "set out [open \$outFile w]\n"); print (FILE2 "set mol [mol new out17-parmed.prmtop]\n"); print (FILE2 "mol addfile aMD2-rst.nc waitfor all molid \$mol\n"); print (FILE2 "set n [molinfo top get numframes]\n"); print (FILE2 "puts \$out \"\$n\"\n"); print (FILE2 "exit\n"); close (FILE2); my $cmd = "vmd -dispdev text -nt -e scr-frame.vmd"; system($cmd); my @pdb_input_ini = read_file("out-frame.txt") or die; my $last_frame= @pdb_input_ini[0]; $last_frame =~ s/^\s+|\s+$//g; print "last_frame is *$last_frame*\n"; my $outfile="autoimage.ptraj"; open (FILE2, "> $outfile") || die; print (FILE2 "trajin aMD2-rst.nc $last_frame $last_frame 1\n"); print (FILE2 "strip :WAT\n"); print (FILE2 "strip :gtp\n"); print (FILE2 "trajout frame.pdb\n"); close (FILE2); my $cmd = "/home/ng/amber16/bin/cpptraj out17-parmed.prmtop < autoimage.ptraj > out-ptraj3.txt"; system($cmd); $ENV{LD_LIBRARY_PATH} = "/usr/local/cuda-7.5/lib64:/lib"; ################################################################################ ##END EXTRACT LAST FRAME ################################################################################ ################################################################################################# #################################### PRELIMINARY PROCEDURES ############################### #################################################################################################
my @pdb_input = read_file("frame.pdb") or die; my @pdb_output; my $array_size=scalar @pdb_input; my $outfile= "frame-out.pdb"; ##Copy PDB file: for (my $count=0; $count<$array_size; $count++) { @pdb_output[$count] = @pdb_input[$count]; } ############## UPDATE NCTER lines my $update_NTER; my $update_CTER; my $y=2; for (my $count=0; $count<$array_size; $count++) { my $line= @pdb_input[$count]; my $trigger_TER=substr($line, 0, 3); ############ Detect TER ################## if (($trigger_TER eq "TER") or ($trigger_TER eq "CRY")){ #Detect if TER event occurs: #1/ at the start of the protein #2/ inbetween two protein segments #3/ at the end of the protein #To do so, look at the residue type preceding and following TER line: my $line_prec= @pdb_input[$count-1]; my $RP=substr($line_prec, 17, 3); my $line_foll= @pdb_input[$count+1]; my $RF=substr($line_foll, 17, 3); if (($RP eq "ALA") or ($RP eq "ARG") or ($RP eq "ASH") or ($RP eq "ASN") or ($RP eq "ASP") or ($RP eq "CYM") or ($RP eq "CYS") or ($RP eq "CYX") or ($RP eq "GLN") or ($RP eq "GLU") or ($RP eq "GLY") or ($RP eq "HID") or ($RP eq "HIE") or ($RP eq "HIP") or ($RP eq "HYP") or ($RP eq "ILE") or ($RP eq "LEU") or ($RP eq "LYN") or ($RP eq "LYS") or ($RP eq "MET") or ($RP eq "PHE") or ($RP eq "PRO") or ($RP eq "SER") or ($RP eq "THR") or ($RP eq "THR") or ($RP eq "TRP") or ($RP eq "TRP") or ($RP eq "TYR") or ($RP eq "VAL")){ $update_CTER = "on"; } if (($RF eq "ALA") or ($RF eq "ARG") or ($RF eq "ASH") or ($RF eq "ASN") or ($RF eq "ASP") or ($RF eq "CYM") or ($RF eq "CYS") or ($RF eq "CYX") or ($RF eq "GLN") or ($RF eq "GLU") or ($RF eq "GLY") or ($RF eq "HID") or ($RF eq "HIE") or ($RF eq "HIP") or ($RF eq "HYP") or ($RF eq "ILE") or ($RF eq "LEU") or ($RF eq "LYN") or ($RF eq "LYS") or ($RF eq "MET") or ($RF eq "PHE") or ($RF eq "PRO") or ($RF eq "SER") or ($RF eq "THR") or ($RF eq "THR") or ($RF eq "TRP") or ($RF eq "TRP") or ($RF eq "TYR") or ($RF eq "VAL")){ $update_NTER = "on"; } ## UPDATE NTER if ($update_NTER 
eq "on") { ##Go forwards on residue length my $line_NTER_segment_start=@pdb_input[$count+1]; my $resid_NTER_segment_start=substr($line_NTER_segment_start, 22, 4); for (my $c=1; $c<$y; $c++) { my $line_NTER_segment= @pdb_input[$count+$c]; $line_NTER_segment =~ s/\s*$//; my $resid_NTER_segment=substr($line_NTER_segment, 22, 4);
if ($resid_NTER_segment==$resid_NTER_segment_start) { $y++; my $sel1=substr($line_NTER_segment, 0, 16); my $sel2="N"; my $sel3=substr($line_NTER_segment, 17, 62); my $sel_tot=$sel1 . $sel2 . $sel3; @pdb_output[$count+$c]="$sel_tot\n"; } } $update_NTER = "off" } ## END UPDATE NTER ## UPDATE CTER if ($update_CTER eq "on") { ##Go backwards on residue length my $line_CTER_segment_start= @pdb_input[$count-1]; my $resid_CTER_segment_start=substr($line_CTER_segment_start, 22, 4); for (my $c=1; $c<$y; $c++) { my $line_CTER_segment= @pdb_input[$count-$c]; $line_CTER_segment =~ s/\s*$//; my $resid_CTER_segment=substr($line_CTER_segment, 22, 4); if ($resid_CTER_segment==$resid_CTER_segment_start) { $y++; my $sel1=substr($line_CTER_segment, 0, 16); my $sel2= "C"; my $sel3=substr($line_CTER_segment, 17, 62); my $sel_tot = $sel1 . $sel2 . $sel3; @pdb_output[$count-$c] = "$sel_tot\n"; } } $update_CTER = "off" } ## END UPDATE CTER } ############ END Detect TER ################## } ############## END of line loop and END UPDATE NCTER lines # print to file open (FILE, "> $outfile"); for (my $count_2=0; $count_2<$array_size; $count_2++) { print (FILE"@pdb_output[$count_2]"); } close (FILE); #Remove 2 Cl- to balance the NTPs to be injected #and remove OXT atoms my $done=0; my $update_count=0; my $update_count2=0; my $count_update_output=0; my @pdb_input = read_file("frame-out.pdb") or die; my @pdb_output; my $array_size=scalar @pdb_input; my $outfile= "out1.pdb"; my $del_Cl=32; for (my $count=0; $count<$array_size; $count++) {
$line= @pdb_input[$count]; $trigger_Cl=substr($line, 17, 3); $trigger_MG=substr(@pdb_input[$count+$update_count], 22, 4); $atom=substr($line, 13, 3); if ($done==0){ if ($trigger_Cl eq "Cl-") { #then remove twice the number of lines corresponding to #nb of Cl to be removed in order to account for TER $update_count=2*$del_Cl; $done=1; } } #$trigger_MG < 5565 to remove MgB atoms if (($atom ne "OXT") and ($trigger_MG < 5565)){ @pdb_output[$count+$count_update_output] = @pdb_input[$count+$update_count]; } if ($atom eq "OXT"){ $count_update_output--; } } # print to file open (FILE, "> $outfile"); for (my $count_2=0; $count_2<($array_size - $update_count); $count_2++) { print (FILE"@pdb_output[$count_2]"); } close (FILE); #Specify cubic region defined by the projection of a point #outside of the protein, in front of landmark 1 my @pdb_input = read_file("out1.pdb") or die; my $array_size=scalar @pdb_input; for (my $count=0; $count<$array_size; $count++) { $line= @pdb_input[$count]; $resid=substr($line, 22, 4); $atom_type=substr($line, 12, 4); if (($resid == 1317) and ($atom_type eq ' CG ')){ $L1Ax=substr($line, 31, 7); $L1Ay=substr($line, 39, 7); $L1Az=substr($line, 47, 7); } if (($resid == 3126) and ($atom_type eq ' CG ')){ $L1Bx=substr($line, 31, 7); $L1By=substr($line, 39, 7); $L1Bz=substr($line, 47, 7); } if (($resid == 76) and ($atom_type eq ' CG ')){ $L0x=substr($line, 31, 7); $L0y=substr($line, 39, 7); $L0z=substr($line, 47, 7); } } #Checkpoint 0 calculation : $L1x=($L1Ax+$L1Bx)/2;
$L1y=($L1Ay+$L1By)/2; $L1z=($L1Az+$L1Bz)/2; my $vec_x= $L1x - $L0x; my $vec_y= $L1y - $L0y; my $vec_z= $L1z - $L0z; my $norm=sqrt($vec_x*$vec_x+$vec_y*$vec_y+$vec_z*$vec_z); $vec_x=$vec_x/$norm; $vec_y=$vec_y/$norm; $vec_z=$vec_z/$norm; my $CK0x=$L1x+15*$vec_x; my $CK0y=$L1y+15*$vec_y; my $CK0z=$L1z+15*$vec_z; print "CK0 is $CK0x $CK0y $CK0z\n"; my $x= sprintf "%.3f", $CK0x; my $y= sprintf "%.3f", $CK0y; my $z= sprintf "%.3f", $CK0z; if ($x < 100){ $x= " ". $x; } if ($x < 100){ $x= " ". $x; } if ($y < 100){ $y= " ". $y; } #Extract inner box dimensions: #edges: my $edge1x=$CK0x-8; my $edge2x=$CK0x+8; my $edge1y=$CK0y-8; my $edge2y=$CK0y+8; my $edge1z=$CK0z-8; my $edge2z=$CK0z+8; #dimensions: my $range_x=$edge2x-$edge1x; my $range_y=$edge2y-$edge1y; my $range_z=$edge2z-$edge1z; #format dimensions for later use: $range_x= sprintf "%.3f", $range_x; $range_y= sprintf "%.3f", $range_y; $range_z= sprintf "%.3f", $range_z; if ($range_x < 100){ $range_x= " ". $range_x; } if ($range_y < 100){ $range_y= " ". $range_y; } if ($range_z < 100){ $range_z= " ". $range_z; } my $cryst_line = "CRYST1 " . $range_x . " " . $range_y . " " . $range_z . " 90.00 90.00 90.00 1" . "\n"; print "cryst_line is $cryst_line\n"; print "edge x are $edge1x $edge2x\n";
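The checkpoint 0 geometry computed above can be summarised as follows: L1 is the midpoint of landmarks L1A and L1B, CK0 is obtained by moving 15 Å from L1 along the unit vector pointing from L0 to L1, and the inner box extends 8 Å on either side of CK0 along each axis. The function below is a minimal Python sketch of that arithmetic only. Its name and signature are hypothetical and it is not part of the thesis scripts.

```python
import math


def checkpoint0(l0, l1a, l1b, offset=15.0, half_edge=8.0):
    """Return (CK0, lower box corner, upper box corner), all in angstroms.

    l0, l1a, l1b are (x, y, z) tuples for landmarks L0, L1A and L1B.
    """
    # L1 is the midpoint of the two L1 landmarks
    l1 = tuple((a + b) / 2.0 for a, b in zip(l1a, l1b))
    # Unit vector pointing from L0 towards L1
    vec = tuple(m - o for m, o in zip(l1, l0))
    norm = math.sqrt(sum(c * c for c in vec))
    unit = tuple(c / norm for c in vec)
    # Project CK0 beyond L1 by `offset` along that direction
    ck0 = tuple(m + offset * u for m, u in zip(l1, unit))
    # Cubic inner box centred on CK0
    lower = tuple(c - half_edge for c in ck0)
    upper = tuple(c + half_edge for c in ck0)
    return ck0, lower, upper
```

For example, with L0 at the origin and L1A, L1B at (9, 0, 0) and (11, 0, 0), L1 is (10, 0, 0), CK0 is (25, 0, 0) and the box spans 17 to 33 Å in x and -8 to 8 Å in y and z.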
print "edge y are $edge1y $edge2y\n"; print "edge z are $edge1z $edge2z\n"; my $x1= sprintf "%.3f", $edge1x; my $y1= sprintf "%.3f", $edge1y; my $z1= sprintf "%.3f", $edge1z; my $x2= sprintf "%.3f", $edge2x; my $y2= sprintf "%.3f", $edge2y; my $z2= sprintf "%.3f", $edge2z; if ($x1 < 100){ $x1= " ". $x1; } if ($x2 < 100){ $x2= " ". $x2; } if ($y1 < 100){ $y1= " ". $y1; } if ($y2 < 100){ $y2= " ". $y2; } if ($z1 < 100){ $z1= " ". $z1; } if ($z2 < 100){ $z2= " ". $z2; } print "ATOM 68743 Cl- Cl- 9999 $x $y $z 1.00 0.00\n"; print "ATOM 68743 Cl- Cl- 9999 $x1 $y1 $z1 1.00 0.00\n"; print "ATOM 68743 Cl- Cl- 9999 $x1 $y1 $z2 1.00 0.00\n"; print "ATOM 68743 Cl- Cl- 9999 $x1 $y2 $z1 1.00 0.00\n"; print "ATOM 68743 Cl- Cl- 9999 $x1 $y2 $z2 1.00 0.00\n"; print "ATOM 68743 Cl- Cl- 9999 $x2 $y1 $z1 1.00 0.00\n"; print "ATOM 68743 Cl- Cl- 9999 $x2 $y1 $z2 1.00 0.00\n"; print "ATOM 68743 Cl- Cl- 9999 $x2 $y2 $z1 1.00 0.00\n"; print "ATOM 68743 Cl- Cl- 9999 $x2 $y2 $z2 1.00 0.00\n"; #Extract inner box from pdb: my @pdb_input = read_file("out1.pdb") or die; my @pdb_output; my $array_size=scalar @pdb_input; my $outfile= "box.pdb"; my $count_output=2; my $atom_x; my $atom_y; my $atom_z; @pdb_output[0] = "\n"; @pdb_output[1] = $cryst_line; for (my $count=0; $count<$array_size; $count++) { $atom_x=substr($pdb_input[$count], 31, 7); $atom_y=substr($pdb_input[$count], 39, 7); $atom_z=substr($pdb_input[$count], 47, 7); if (($atom_x >= $edge1x) and ($atom_x <= $edge2x) and ($atom_y >= $edge1y) and ($atom_y <= $edge2y) and ($atom_z >= $edge1z) and ($atom_z <= $edge2z)){ @pdb_output[$count_output] = $pdb_input[$count]; $count_output++; }
} #if no atoms in the inner box, create artificially some reference if ($count_output==2) { @pdb_output[3] = "ATOM 68743 Cl- Cl- 9999 $x$y$z 1.00 0.00\n"; $count_output++; } # print to file open (FILE, "> $outfile"); for (my $count_2=0; $count_2<($count_output + 1); $count_2++) { print (FILE"@pdb_output[$count_2]"); } close (FILE); #Add gtp in the inner box my $cmd = "/home/ng/amber16/bin/AddToBox -c box.pdb -a gtp.pdb -na 1 -o out-box.pdb -P 0 -RP 2.0 -RW 3.0 -G 0.1 -V 1"; system($cmd); #Copy gtp in the global pdb my @pdb_input1 = read_file("out1.pdb") or die; my @pdb_input2 = read_file("out-box.pdb") or die; my @pdb_output; my $array_size1=scalar @pdb_input1; my $array_size2=scalar @pdb_input2; my $outfile= "out2.pdb"; my $i=1; for (my $count=0; $count<$array_size1-1; $count++) { @pdb_output[$count]=$pdb_input1[$count]; } @pdb_output[$array_size1]="TER\n"; for (my $count=0; $count<$array_size2; $count++) { $trigger_gtp=substr($pdb_input2[$count], 17, 3); if ($trigger_gtp eq 'gtp') { @pdb_output[$array_size1+$i]="$pdb_input2[$count]"; $i++; } } @pdb_output[$array_size1+$i]="END\n"; # print to file open (FILE, "> $outfile"); for (my $count_2=0; $count_2<($array_size1+$i+1); $count_2++) { print (FILE"@pdb_output[$count_2]"); } close (FILE); #Amend gtp my $count=0; my $update_resid=0; my $count_update_output=0; my @pdb_input = read_file("out2.pdb") or die; my @pdb_output; my $array_size=scalar @pdb_input; my $outfile= "out3.pdb"; ########### LINE LOOP
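The inner-box extraction earlier in this listing reads coordinates from fixed character offsets (31, 39 and 47, seven characters each) and keeps the lines whose atom lies inside the box. The Python sketch below reproduces that filter under two stated assumptions: the lines follow the standard PDB ATOM column layout, and each coordinate fits in seven characters. Unlike the Perl code, which evaluates non-numeric fields as 0, the sketch skips records without parsable coordinates. Both function names are hypothetical, and pdb_atom_line exists only to build test input.

```python
def pdb_atom_line(serial, name, resname, resid, x, y, z):
    """Build a minimal ATOM record in the standard PDB column layout."""
    return "ATOM  {:5d} {:<4s} {:>3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}\n".format(
        serial, name, resname, "A", resid, x, y, z)


def atoms_in_box(lines, lower, upper):
    """Keep the lines whose coordinates lie within [lower, upper] on every axis."""
    kept = []
    for line in lines:
        try:
            # Same offsets as substr($line, 31, 7), (39, 7) and (47, 7)
            xyz = (float(line[31:38]), float(line[39:46]), float(line[47:54]))
        except ValueError:
            continue  # TER, CRYST1 or other records without coordinates
        if all(lo <= c <= hi for c, lo, hi in zip(xyz, lower, upper)):
            kept.append(line)
    return kept
```

A record at (20, 0, 0) is kept by a box spanning 17 to 33 Å in x and -8 to 8 Å in y and z, whereas a record at (40, 0, 0) and a TER line are dropped.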
for (my $count=0; $count<$array_size; $count++) { $line= @pdb_input[$count]; $resid=9998; $resname=substr($line, 17, 3); $atom=substr($line, 13, 3); @pdb_output[$count+$count_update_output] = "$line"; if (($resname eq "gtp") and ($atom eq "MG ")){ my $sel1=substr($line, 0, 17); my $sel2="MG "; my $sel3=substr($line, 20, 2); my $sel4=$resid+$update_resid; my $sel5=substr($line, 26, 40); @pdb_output[$count+$count_update_output] = $sel1 . $sel2 . $sel3 . $sel4 . $sel5; $update_resid++; @pdb_output[$count+$count_update_output+1] = "TER \n"; $count_update_output++; } if (($resname eq "gtp") and ($atom ne "MG ")){ my $sel1=substr($line, 0, 22); my $sel2=$resid+$update_resid; my $sel3=substr($line, 26, 40); @pdb_output[$count+$count_update_output] = $sel1 . $sel2 . $sel3; } } open (FILE, "> $outfile"); for (my $count_2=0; $count_2<$array_size+$count_update_output; $count_2++) { print (FILE"@pdb_output[$count_2]"); } close (FILE); #Add water: my $cmd = "/home/ng/amber16/bin/AddToBox -c out3.pdb -a WAT.pdb -na $wat -o out4.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1"; system($cmd); my $outfile="leap.scrpt"; open (FILE2, "> $outfile") || die; print (FILE2 "source leaprc.protein.ff14SB\n"); print (FILE2 "source leaprc.DNA.OL15\n"); print (FILE2 "source leaprc.RNA.OL3\n"); print (FILE2 "loadoff atomic_ions.lib\n"); print (FILE2 "loadamberparams frcmod.ions1lm_1264_tip4pew\n"); print (FILE2 "loadamberparams frcmod.ions234lm_1264_tip4pew\n"); print (FILE2 "loadoff solvents.lib\n"); print (FILE2 "loadamberparams frcmod.tip4pew\n"); print (FILE2 "loadoff zaa-new.off\n"); print (FILE2 "loadoff SUL.lib\n"); print (FILE2 "loadamberparams frcmod.sul\n"); print (FILE2 "loadamberprep gtp.prep\n"); print (FILE2 "loadamberparams frcmod.gtp\n"); print (FILE2 "sys = loadpdb out4.pdb\n"); print (FILE2 "charge sys\n");
print (FILE2 "setBox sys vdw 1.0\n"); print (FILE2 "set sys box {$x_box $y_box $z_box}\n"); print (FILE2 "saveamberparm sys out4.prmtop out4.inpcrd\n"); print (FILE2 "savepdb sys out4-leap.pdb\n"); print (FILE2 "quit\n"); close (FILE2); my $cmd = "/home/ng/amber16/bin/xleap -s -f leap.scrpt > out-leap.out"; system($cmd); #keep file: rename "leap.log", "leap-1.log"; #Apply r-4 term to Lennard-Jones potential #and apply Panteva 2015 m1264 refined set my $outfile="parmed.in"; open (FILE2, "> $outfile") || die; print (FILE2 "setOverwrite True\n"); print (FILE2 "change AMBER_ATOM_TYPE :A*,DA*\@N7 NAMG\n"); print (FILE2 "change AMBER_ATOM_TYPE :G*,DG*\@N7 NGMG\n"); print (FILE2 "change AMBER_ATOM_TYPE :*\@OP* OPMG\n"); print (FILE2 "addLJType @\%NAMG\n"); print (FILE2 "addLJType @\%NGMG\n"); print (FILE2 "addLJType @\%OPMG\n"); print (FILE2 "add12_6_4 :ZN watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :ZN\n"); print (FILE2 "add12_6_4 :MG watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :MG\n"); print (FILE2 "add12_6_4 :Na+ watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :Na+\n"); print (FILE2 "add12_6_4 :Cl- watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :Cl-\n"); print (FILE2 "add12_6_4 :CA watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :CA\n"); print (FILE2 "add12_6_4 :K+ watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :K+\n"); print (FILE2 "outparm out4-parmed.prmtop out4-parmed.inpcrd\n"); print (FILE2 "quit\n"); close (FILE2); my $cmd = "/home/ng/amber16/bin/parmed -i parmed.in -p out4.prmtop -c out4.inpcrd >out-parmed.txt"; system($cmd); #EXTRACT LANDMARK INDEX #extract smd atom, landmarks, and gtp indexes my @pdb_input = read_file("out4-leap.pdb") or die; my $array_size=scalar @pdb_input; for (my $count=0; $count<$array_size; $count++) { $line= @pdb_input[$count]; $atom_type=substr($line, 12, 4); $resid=substr($line, 22, 4); $trigger_gtp=substr($line, 17, 3); if (($resid == 1317) and ($atom_type eq ' CG ')){ $L1A_id=substr($line, 6, 6) - 
1; } if (($resid == 3126) and ($atom_type eq ' CG ')){ $L1B_id=substr($line, 6, 6) - 1; } if (($resid == 1373) and ($atom_type eq ' CG ')){ $L2_id=substr($line, 6, 6) - 1;
} if (($resid == 38) and ($atom_type eq ' N3 ')){ $L3_id=substr($line, 6, 6) - 1; } if (($trigger_gtp eq 'gtp') and ($atom_type eq ' O1G')){ $gtp_first_id=substr($line, 6, 6) - 1; } if (($trigger_gtp eq 'gtp') and ($atom_type eq ' N1 ')){ $smd_atom_id=substr($line, 6, 6) - 1; } if (($trigger_gtp eq 'gtp') and ($atom_type eq 'HO\'2')){ $gtp_last_id=substr($line, 6, 6) - 1; } } print "L1A id is $L1A_id\n"; print "L1B id is $L1B_id\n"; print "L2 id is $L2_id\n"; print "L3 id is $L3_id\n"; print "gtp_first_id is $gtp_first_id\n"; print "smd_atom_id is $smd_atom_id\n"; print "gtp_last_id is $gtp_last_id\n"; ################################################################################################# #################################### END PRELIMINARY PROCEDURES ############################### ################################################################################################# ################################################################################################# #################################### PREPARE THE SYSTEM ############################### ################################################################################################# #Minimize the system my $outfile="min1.in"; open (FILE2, "> $outfile") || die; print (FILE2 "2e2h: initial minimisation solvent + ions\n"); print (FILE2 " &cntrl\n"); print (FILE2 " imin = 1,\n"); print (FILE2 " ntmin = 2,\n"); print (FILE2 " maxcyc = 5000,\n"); print (FILE2 " ncyc = 1000,\n"); print (FILE2 " ntb = 1,\n"); print (FILE2 " ntr = 1,\n"); print (FILE2 " cut = 10.0\n"); print (FILE2 " /\n"); print (FILE2 "Hold the protein fixed\n"); print (FILE2 "500.0\n"); print (FILE2 "RES 1 $nb_protein_res\n"); print (FILE2 "END\n"); print (FILE2 "END\n");
close (FILE2); my $outfile="min2.in"; open (FILE2, "> $outfile") || die; print (FILE2 "2e2h: initial minimisation whole system\n"); print (FILE2 " &cntrl\n"); print (FILE2 " imin = 1,\n"); print (FILE2 " ntmin = 2,\n"); print (FILE2 " maxcyc = 5000,\n"); print (FILE2 " ncyc = 2500,\n"); print (FILE2 " ntb = 1,\n"); print (FILE2 " ntr = 0,\n"); print (FILE2 " cut = 10.0\n"); print (FILE2 " /\n"); print (FILE2 "END\n"); close (FILE2); my $cmd = "/home/ng/amber16/bin/sander -O -i min1.in -o min1.out -p out4-parmed.prmtop -c out4-parmed.inpcrd -r min1.rst -ref out4-parmed.inpcrd"; system($cmd); my $cmd = "/home/ng/amber16/bin/sander -O -i min2.in -o min2.out -p out4-parmed.prmtop -c min1.rst -r min2.rst"; system($cmd); ##################### HEAT (MD1) 20 ps my $outfile="openmm-input.py"; open (FILE2, "> $outfile") || die; print (FILE2 "from __future__ import absolute_import\n"); print (FILE2 "from simtk.openmm import CustomIntegrator\n"); print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n"); print (FILE2 "from simtk.openmm import CustomExternalForce\n"); print (FILE2 "from simtk.openmm.app import *\n"); print (FILE2 "from simtk.openmm import *\n"); print (FILE2 "from simtk.unit import *\n"); print (FILE2 "from sys import stdout\n"); print (FILE2 "from parmed import load_file\n"); print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n"); print (FILE2 "from parmed.openmm.reporters import RestartReporter\n"); print (FILE2 "\n"); print (FILE2 "parm = load_file('out4-parmed.prmtop', 'min2.rst')\n"); print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n"); print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n"); #Add restraints to protein and gtp print (FILE2 "force_res_prot = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n"); print (FILE2 "force_res_prot.addGlobalParameter(\"k\", 
10.0*kilocalories_per_mole/angstroms**2)\n"); print (FILE2 "force_res_prot.addPerParticleParameter(\"x0\")\n"); print (FILE2 "force_res_prot.addPerParticleParameter(\"y0\")\n"); print (FILE2 "force_res_prot.addPerParticleParameter(\"z0\")\n"); print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n"); print (FILE2 " if ((i <= $last_protein_id) or (i >= $gtp_first_id and i <= $gtp_last_id)):\n"); print (FILE2 " force_res_prot.addParticle(i, atom_crd.value_in_unit(nanometers))\n"); print (FILE2 "system.addForce(force_res_prot)\n");
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n"); print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n"); print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n"); print (FILE2 "simulation.context.setPositions(parm.positions)\n"); print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 100, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True))\n"); print (FILE2 "simulation.reporters.append(NetCDFReporter('md1.nc', 5000, crds=True))\n"); print (FILE2 "restrt = RestartReporter('md1.rst7', 10000, parm.ptr('natom'), netcdf=True)\n"); print (FILE2 "simulation.reporters.append(restrt)\n"); print (FILE2 "simulation.step(10000)\n"); print (FILE2 "quit\n"); close (FILE2); my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-md1.txt"; system($cmd); ##################### EQ VEL (MD2-sim1) 20 ps my $outfile="openmm-input.py"; open (FILE2, "> $outfile") || die; print (FILE2 "from __future__ import absolute_import\n"); print (FILE2 "from simtk.openmm import CustomIntegrator\n"); print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n"); print (FILE2 "from simtk.openmm import CustomExternalForce\n"); print (FILE2 "from simtk.openmm.app import *\n"); print (FILE2 "from simtk.openmm import *\n"); print (FILE2 "from simtk.unit import *\n"); print (FILE2 "from sys import stdout\n"); print (FILE2 "from parmed import load_file\n"); print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n"); print (FILE2 "from parmed.openmm.reporters import RestartReporter\n"); print (FILE2 "\n"); print (FILE2 "parm = load_file('out4-parmed.prmtop', 'md1.rst7.10000')\n"); print (FILE2 "inpcrd = AmberInpcrdFile('md1.rst7.10000')\n"); print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n"); #Add constraints to anchors and gtp-MG print (FILE2 "for i, atom in 
enumerate(parm.atoms):\n"); print (FILE2 " if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8) or (i >= $gtp_first_id and i <= $gtp_last_id):\n"); print (FILE2 " system.setParticleMass(i, 0*dalton)\n"); print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n"); print (FILE2 "system.addForce(MonteCarloBarostat(1*bar, 300*kelvin))\n"); print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n"); print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n"); print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n"); print (FILE2 "simulation.context.setPositions(parm.positions)\n"); print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n"); print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n"); print (FILE2 "simulation.reporters.append(StateDataReporter('out-openmm-md2-sim1.txt', 1000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True, volume=True, density=True))\n"); print (FILE2 "simulation.reporters.append(NetCDFReporter('md2-sim1.nc', 250000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('md2-sim1.rst7', 10000, parm.ptr('natom'), netcdf=True)\n"); print (FILE2 "simulation.reporters.append(restrt)\n"); print (FILE2 "simulation.step(10000)\n"); print (FILE2 "positions = simulation.context.getState(getPositions=True).getPositions()\n"); #print coordinates for first checkpoint: print (FILE2 "for i, atom_crd in enumerate(positions):\n"); print (FILE2 " if (i == $L1A_id):\n"); print (FILE2 " coords = atom_crd.value_in_unit(angstroms)\n"); print (FILE2 " x = coords[0]\n"); print (FILE2 " y = coords[1]\n"); print (FILE2 " z = coords[2]\n"); print (FILE2 " print(x)\n"); print (FILE2 " print(y)\n"); print (FILE2 " print(z)\n"); print (FILE2 " if (i == $L1B_id):\n"); print (FILE2 " coords = atom_crd.value_in_unit(angstroms)\n"); print (FILE2 " x = coords[0]\n"); print (FILE2 " y = coords[1]\n"); print (FILE2 " z = coords[2]\n"); print (FILE2 " print(x)\n"); print (FILE2 " print(y)\n"); print (FILE2 " print(z)\n"); print (FILE2 "quit\n"); close (FILE2); my $cmd = "~/anaconda3/bin/python openmm-input.py >out-md2-sim1-coords.txt"; system($cmd); ################################################################################################# #################################### END PREPARE THE SYSTEM ############################### ################################################################################################# ################################################################################################# #################################### FIRST CHECKPOINT SMD ############################### ################################################################################################# #First, calculate checkpoint coordinates my @pdb_input = read_file("out-md2-sim1-coords.txt") or die; $L1Ax=$pdb_input[0]; $L1Ax =~ s/^\s+|\s+$//g; $L1Ay=$pdb_input[1]; $L1Ay =~ s/^\s+|\s+$//g; $L1Az=$pdb_input[2]; $L1Az =~ s/^\s+|\s+$//g; $L1Bx=$pdb_input[3]; $L1Bx =~ s/^\s+|\s+$//g; $L1By=$pdb_input[4]; $L1By =~ 
s/^\s+|\s+$//g; $L1Bz=$pdb_input[5]; $L1Bz =~ s/^\s+|\s+$//g;
my $CK1x=($L1Ax+$L1Bx)/2; my $CK1y=($L1Ay+$L1By)/2; my $CK1z=($L1Az+$L1Bz)/2; print "CK1 is $CK1x $CK1y $CK1z\n"; $CK1x= sprintf "%.3f", $CK1x; $CK1y= sprintf "%.3f", $CK1y; $CK1z= sprintf "%.3f", $CK1z; my $k=0.075; ##################### SMD my $outfile="openmm-input.py"; open (FILE2, "> $outfile") || die; print (FILE2 "from __future__ import absolute_import\n"); print (FILE2 "from simtk.openmm import CustomIntegrator\n"); print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n"); print (FILE2 "from simtk.openmm import CustomExternalForce\n"); print (FILE2 "from simtk.openmm.app import *\n"); print (FILE2 "from simtk.openmm import *\n"); print (FILE2 "from simtk.unit import *\n"); print (FILE2 "from sys import stdout\n"); print (FILE2 "from parmed import load_file\n"); print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n"); print (FILE2 "from parmed.openmm.reporters import RestartReporter\n"); print (FILE2 "\n"); print (FILE2 "parm = load_file('out4-parmed.prmtop', 'md2-sim1.rst7.10000')\n"); print (FILE2 "inpcrd = AmberInpcrdFile('md2-sim1.rst7.10000')\n"); print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n"); #Add constraints to anchors print (FILE2 "for i, atom in enumerate(parm.atoms):\n"); print (FILE2 " if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n"); print (FILE2 " system.setParticleMass(i, 0*dalton)\n"); print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n"); print (FILE2 "integrator.setConstraintTolerance(0.0000001)\n"); #End Add constraints to anchors #Add sMD force print (FILE2 "force_smd = CustomExternalForce(\"k*((x-xd)^2+(y-yd)^2+(z-zd)^2)\")\n"); print (FILE2 "force_smd.addGlobalParameter(\"k\", $k*kilocalories_per_mole/angstroms**2)\n"); print (FILE2 "force_smd.addGlobalParameter(\"xd\", $CK1x*angstroms)\n"); print 
(FILE2 "force_smd.addGlobalParameter(\"yd\", $CK1y*angstroms)\n"); print (FILE2 "force_smd.addGlobalParameter(\"zd\", $CK1z*angstroms)\n"); print (FILE2 "force_smd.addPerParticleParameter(\"xd\")\n"); print (FILE2 "force_smd.addPerParticleParameter(\"yd\")\n"); print (FILE2 "force_smd.addPerParticleParameter(\"zd\")\n"); print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n"); print (FILE2 " if (i == $smd_atom_id):\n"); print (FILE2 " force_smd.addParticle(i, atom_crd.value_in_unit(nanometers))\n"); print (FILE2 "system.addForce(force_smd)\n"); #End Add sMD force print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n"); print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n"); print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");
print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n");
print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter('out-openmm-smd1.txt', 1000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('smd1.nc', 50000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('smd1.rst7', 50000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
#### CHECKPOINT LOOP
print (FILE2 "smd_loop = 0\n");
print (FILE2 "it_check = 1\n");
print (FILE2 "stop_check = 0\n");
print (FILE2 "step = 0\n");
print (FILE2 "while smd_loop < it_check:\n");
print (FILE2 "    simulation.step(12500)\n");
print (FILE2 "    positions = simulation.context.getState(getPositions=True).getPositions()\n");
print (FILE2 "    for i, atom_crd in enumerate(positions):\n");
print (FILE2 "        if (i == $smd_atom_id):\n");
print (FILE2 "            coords_smd = atom_crd.value_in_unit(angstroms)\n");
print (FILE2 "            x_smd = coords_smd[0]\n");
print (FILE2 "            y_smd = coords_smd[1]\n");
print (FILE2 "            z_smd = coords_smd[2]\n");
print (FILE2 "    x = x_smd - $CK1x\n");
print (FILE2 "    y = y_smd - $CK1y\n");
print (FILE2 "    z = z_smd - $CK1z\n");
print (FILE2 "    dist = math.sqrt(x*x+y*y+z*z)\n");
print (FILE2 "    step = step + 12500\n");
print (FILE2 "    print(\"step =\", step)\n");
print (FILE2 "    print(\"dist =\", dist)\n");
print (FILE2 "    smd_loop += 1\n");
print (FILE2 "    it_check += 1\n");
print (FILE2 "    stop_check += 1\n");
#synchronise stop check with trajectory writing (every 4 x 12500 = 50000 steps)
print (FILE2 "    if (stop_check == 4):\n");
print (FILE2 "        stop_check = 0\n");
print (FILE2 "        if (dist < 4):\n");
print (FILE2 "            it_check = 0\n");
#avoid an infinite loop in case of a stuck state (80 x 12500 steps x 2 fs = 2 ns)
print (FILE2 "    if (smd_loop == 80):\n");
print (FILE2 "        it_check = 0\n");
print (FILE2 "        print(\"!!SMD stopped: state not reached after 2 ns!!\")\n");
#print coordinates for second checkpoint and rst step:
print (FILE2 "for i, atom_crd in enumerate(positions):\n");
print (FILE2 "    if (i == $L2_id):\n");
print (FILE2 "        coords = atom_crd.value_in_unit(angstroms)\n");
print (FILE2 "        x = coords[0]\n");
print (FILE2 "        y = coords[1]\n");
print (FILE2 "        z = coords[2]\n");
print (FILE2 "        print(x)\n");
print (FILE2 "        print(y)\n");
print (FILE2 "        print(z)\n");
print (FILE2 "print(step)\n");
#### END CHECKPOINT LOOP
print (FILE2 "quit\n");
close (FILE2);
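The checkpoint loop generated above advances the simulation in chunks of 12,500 steps, tests the distance to the checkpoint only on every fourth chunk (every 50,000 steps, matching the restart-file interval) and gives up after 80 chunks, which is 2 ns at a 2 fs time step. The sketch below isolates that control flow in plain Python. It is a simplified restructuring rather than the generated script itself: it assumes the distance test is nested inside the every-fourth-chunk branch, and distance_after_chunk is a hypothetical callback standing in for simulation.step plus the distance measurement.

```python
def run_smd_loop(distance_after_chunk, chunk_steps=12500, check_every=4,
                 threshold=4.0, max_chunks=80):
    """Return (last_step, converged) for a chunked steering run.

    distance_after_chunk(step) must return the distance (angstroms) between
    the steered atom and the checkpoint once `step` steps have been run.
    """
    step = 0
    chunks = 0
    since_check = 0
    while chunks < max_chunks:
        step += chunk_steps
        dist = distance_after_chunk(step)  # stands in for simulation.step(...)
        chunks += 1
        since_check += 1
        # Only stop on a step for which a restart file has been written
        if since_check == check_every:
            since_check = 0
            if dist < threshold:
                return step, True
    # Checkpoint not reached within max_chunks * chunk_steps steps
    return step, False
```

Stopping only on multiples of 50,000 steps matters because the next stage restarts from the file named after the last step. With a distance that shrinks by 1 Å per chunk from 10 Å, the threshold is first crossed at chunk 7, but the loop stops at chunk 8 (step 100,000).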
my $cmd = "~/anaconda3/bin/python openmm-input.py >out-smd1-dist.txt";
system($cmd);
#################################################################################################
#################################### END FIRST CHECKPOINT SMD ##############################
#################################################################################################

#################################################################################################
#################################### SECOND CHECKPOINT SMD ###############################
#################################################################################################
#First, calculate checkpoint coordinates
my @pdb_input = read_file("out-smd1-dist.txt") or die;
my $array_size = scalar @pdb_input;
$L2x=$pdb_input[$array_size-4];
$L2x =~ s/^\s+|\s+$//g;
$L2y=$pdb_input[$array_size-3];
$L2y =~ s/^\s+|\s+$//g;
$L2z=$pdb_input[$array_size-2];
$L2z =~ s/^\s+|\s+$//g;
$last_step=$pdb_input[$array_size-1];
$last_step =~ s/^\s+|\s+$//g;
#EXIT if previous step not converged:
$exit=substr($pdb_input[$array_size-8], 0, 2);
if ($exit eq '!!') {
    exit;
}
my $CK2x=$L2x;
my $CK2y=$L2y;
my $CK2z=$L2z;
print "CK2 is $CK2x $CK2y $CK2z\n";
$CK2x= sprintf "%.3f", $CK2x;
$CK2y= sprintf "%.3f", $CK2y;
$CK2z= sprintf "%.3f", $CK2z;

##################### SMD
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");
#explicit import: math.sqrt is called in the checkpoint loop
print (FILE2 "import math\n");
print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");
print (FILE2 "\n");
print (FILE2 "parm = load_file('out4-parmed.prmtop', 'smd1.rst7.$last_step')\n");
print (FILE2 "inpcrd = AmberInpcrdFile('smd1.rst7.$last_step')\n");
print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n");
#Add constraints to anchors
print (FILE2 "for i, atom in enumerate(parm.atoms):\n");
print (FILE2 "    if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n");
print (FILE2 "        system.setParticleMass(i, 0*dalton)\n");
print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n");
print (FILE2 "integrator.setConstraintTolerance(0.0000001)\n");
#End Add constraints to anchors
#Add sMD force
print (FILE2 "force_smd = CustomExternalForce(\"k*((x-xd)^2+(y-yd)^2+(z-zd)^2)\")\n");
print (FILE2 "force_smd.addGlobalParameter(\"k\", $k*kilocalories_per_mole/angstroms**2)\n");
print (FILE2 "force_smd.addGlobalParameter(\"xd\", $CK2x*angstroms)\n");
print (FILE2 "force_smd.addGlobalParameter(\"yd\", $CK2y*angstroms)\n");
print (FILE2 "force_smd.addGlobalParameter(\"zd\", $CK2z*angstroms)\n");
print (FILE2 "force_smd.addPerParticleParameter(\"xd\")\n");
print (FILE2 "force_smd.addPerParticleParameter(\"yd\")\n");
print (FILE2 "force_smd.addPerParticleParameter(\"zd\")\n");
print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n");
print (FILE2 "    if (i == $smd_atom_id):\n");
print (FILE2 "        force_smd.addParticle(i, atom_crd.value_in_unit(nanometers))\n");
print (FILE2 "system.addForce(force_smd)\n");
#End Add sMD force
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");
print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n");
print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");
print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n");
print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter('out-openmm-smd2.txt', 1000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('smd2.nc', 50000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('smd2.rst7', 50000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
#### CHECKPOINT LOOP
print (FILE2 "smd_loop = 0\n");
print (FILE2 "it_check = 1\n");
print (FILE2 "stop_check = 0\n");
print (FILE2 "step = 0\n");
print (FILE2 "while smd_loop < it_check:\n");
print (FILE2 "    simulation.step(12500)\n");
print (FILE2 "    positions = simulation.context.getState(getPositions=True).getPositions()\n");
print (FILE2 "    for i, atom_crd in enumerate(positions):\n");
print (FILE2 "        if (i == $smd_atom_id):\n");
print (FILE2 "            coords_smd = atom_crd.value_in_unit(angstroms)\n");
print (FILE2 "            x_smd = coords_smd[0]\n");
print (FILE2 "            y_smd = coords_smd[1]\n");
print (FILE2 "            z_smd = coords_smd[2]\n");
print (FILE2 "    x = x_smd - $CK2x\n");
print (FILE2 "    y = y_smd - $CK2y\n");
print (FILE2 "    z = z_smd - $CK2z\n");
print (FILE2 "    dist = math.sqrt(x*x+y*y+z*z)\n");
print (FILE2 "    step = step + 12500\n");
print (FILE2 "    print(\"step =\", step)\n");
print (FILE2 "    print(\"dist =\", dist)\n");
print (FILE2 "    smd_loop += 1\n");
print (FILE2 "    it_check += 1\n");
print (FILE2 "    stop_check += 1\n");
#synchronise stop check with traj writing
print (FILE2 "    if (stop_check == 4):\n");
print (FILE2 "        stop_check = 0\n");
print (FILE2 "        if (dist < 7):\n");
print (FILE2 "            it_check = 0\n");
#avoid infinite loop, in case of stuck state
print (FILE2 "    if (smd_loop == 80):\n");
print (FILE2 "        it_check = 0\n");
print (FILE2 "        print(\"!!SMD stopped: state not reached after 2 ns!!\")\n");
#print coordinates for third checkpoint and last step:
print (FILE2 "for i, atom_crd in enumerate(positions):\n");
print (FILE2 "    if (i == $L3_id):\n");
print (FILE2 "        coords = atom_crd.value_in_unit(angstroms)\n");
print (FILE2 "        x = coords[0]\n");
print (FILE2 "        y = coords[1]\n");
print (FILE2 "        z = coords[2]\n");
print (FILE2 "        print(x)\n");
print (FILE2 "        print(y)\n");
print (FILE2 "        print(z)\n");
print (FILE2 "print(step)\n");
#### END CHECKPOINT LOOP
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "~/anaconda3/bin/python openmm-input.py >out-smd2-dist.txt";
system($cmd);
#################################################################################################
#################################### END SECOND CHECKPOINT SMD ##############################
#################################################################################################

#################################################################################################
#################################### THIRD CHECKPOINT SMD ###############################
#################################################################################################
#First, calculate checkpoint coordinates
my @pdb_input = read_file("out-smd2-dist.txt") or die;
my $array_size = scalar @pdb_input;
$L3x=$pdb_input[$array_size-4];
$L3x =~ s/^\s+|\s+$//g;
$L3y=$pdb_input[$array_size-3];
$L3y =~ s/^\s+|\s+$//g;
$L3z=$pdb_input[$array_size-2];
$L3z =~ s/^\s+|\s+$//g;
$last_step=$pdb_input[$array_size-1];
$last_step =~ s/^\s+|\s+$//g;
#EXIT if previous step not converged:
$exit=substr($pdb_input[$array_size-8], 0, 2);
if ($exit eq '!!') {
    exit;
}
my $CK3x=$L3x;
my $CK3y=$L3y;
my $CK3z=$L3z;
print "CK3 is $CK3x $CK3y $CK3z\n";
$CK3x= sprintf "%.3f", $CK3x;
$CK3y= sprintf "%.3f", $CK3y;
$CK3z= sprintf "%.3f", $CK3z;

##################### SMD
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");
#explicit import: math.sqrt is called in the checkpoint loop
print (FILE2 "import math\n");
print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");
print (FILE2 "\n");
print (FILE2 "parm = load_file('out4-parmed.prmtop', 'smd2.rst7.$last_step')\n");
print (FILE2 "inpcrd = AmberInpcrdFile('smd2.rst7.$last_step')\n");
print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n");
#Add constraints to anchors
print (FILE2 "for i, atom in enumerate(parm.atoms):\n");
print (FILE2 "    if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n");
print (FILE2 "        system.setParticleMass(i, 0*dalton)\n");
print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n");
print (FILE2 "integrator.setConstraintTolerance(0.0000001)\n");
#End Add constraints to anchors
#Add sMD force
print (FILE2 "force_smd = CustomExternalForce(\"k*((x-xd)^2+(y-yd)^2+(z-zd)^2)\")\n");
print (FILE2 "force_smd.addGlobalParameter(\"k\", $k*kilocalories_per_mole/angstroms**2)\n");
print (FILE2 "force_smd.addGlobalParameter(\"xd\", $CK3x*angstroms)\n");
print (FILE2 "force_smd.addGlobalParameter(\"yd\", $CK3y*angstroms)\n");
print (FILE2 "force_smd.addGlobalParameter(\"zd\", $CK3z*angstroms)\n");
print (FILE2 "force_smd.addPerParticleParameter(\"xd\")\n");
print (FILE2 "force_smd.addPerParticleParameter(\"yd\")\n");
print (FILE2 "force_smd.addPerParticleParameter(\"zd\")\n");
print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n");
print (FILE2 "    if (i == $smd_atom_id):\n");
print (FILE2 "        force_smd.addParticle(i, atom_crd.value_in_unit(nanometers))\n");
print (FILE2 "system.addForce(force_smd)\n");
#End Add sMD force
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");
print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n");
print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");
print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n");
print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter('out-openmm-smd3.txt', 1000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('smd3.nc', 50000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('smd3.rst7', 50000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
#### CHECKPOINT LOOP
print (FILE2 "smd_loop = 0\n");
print (FILE2 "it_check = 1\n");
print (FILE2 "stop_check = 0\n");
print (FILE2 "step = 0\n");
print (FILE2 "while smd_loop < it_check:\n");
print (FILE2 "    simulation.step(12500)\n");
print (FILE2 "    positions = simulation.context.getState(getPositions=True).getPositions()\n");
print (FILE2 "    for i, atom_crd in enumerate(positions):\n");
print (FILE2 "        if (i == $smd_atom_id):\n");
print (FILE2 "            coords_smd = atom_crd.value_in_unit(angstroms)\n");
print (FILE2 "            x_smd = coords_smd[0]\n");
print (FILE2 "            y_smd = coords_smd[1]\n");
print (FILE2 "            z_smd = coords_smd[2]\n");
print (FILE2 "    x = x_smd - $CK3x\n");
print (FILE2 "    y = y_smd - $CK3y\n");
print (FILE2 "    z = z_smd - $CK3z\n");
print (FILE2 "    dist = math.sqrt(x*x+y*y+z*z)\n");
print (FILE2 "    step = step + 12500\n");
print (FILE2 "    print(\"step =\", step)\n");
print (FILE2 "    print(\"dist =\", dist)\n");
print (FILE2 "    smd_loop += 1\n");
print (FILE2 "    it_check += 1\n");
print (FILE2 "    stop_check += 1\n");
#synchronise stop check with traj writing
print (FILE2 "    if (stop_check == 4):\n");
print (FILE2 "        stop_check = 0\n");
print (FILE2 "        if (dist < 3):\n");
print (FILE2 "            it_check = 0\n");
#avoid infinite loop, in case of stuck state
print (FILE2 "    if (smd_loop == 80):\n");
print (FILE2 "        it_check = 0\n");
print (FILE2 "        print(\"!!SMD stopped: state not reached after 2 ns!!\")\n");
#### END CHECKPOINT LOOP
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "~/anaconda3/bin/python openmm-input.py >out-smd3-dist.txt";
system($cmd);
#################################################################################################
#################################### END THIRD CHECKPOINT SMD ##############################
#################################################################################################
exit;
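The stopping rule shared by the checkpoint loops above can be separated from OpenMM and sketched in plain Python. This is a minimal illustration, not part of the production script. The function name `run_checkpoint_loop` and its arguments are hypothetical, and the distances are supplied as a list instead of being measured from a running simulation. The convergence test is assumed to be nested inside the `stop_check` test, so that a run can only end on a step for which a restart file has been written.

```python
def run_checkpoint_loop(distances, threshold, chunk_steps=12500,
                        check_every=4, max_chunks=80):
    """Replay the stopping rule on a precomputed list of distances (angstroms).

    Hypothetical helper: distances[n] stands for the distance between the
    steered atom and the checkpoint after chunk n+1 of the simulation.
    Returns (total_steps, converged).
    """
    step = 0
    stop_check = 0
    for chunk, dist in enumerate(distances, start=1):
        step += chunk_steps              # stands in for simulation.step(12500)
        stop_check += 1
        # Test convergence only every 4th chunk (50000 steps), so that the
        # run ends on a step for which a restart file has been written.
        if stop_check == check_every:
            stop_check = 0
            if dist < threshold:
                return step, True
        # Safety limit: 80 chunks x 12500 steps x 2 fs = 2 ns.
        if chunk == max_chunks:
            return step, False
    return step, False                   # input exhausted before either exit


if __name__ == "__main__":
    # The distance falls below 7 angstroms at chunk 3, but the loop only
    # stops at chunk 4, the next restart-synchronised check.
    print(run_checkpoint_loop([9.0, 8.0, 6.5, 6.0], threshold=7))
```

One simplification is worth noting: a run that converges exactly at the 80th chunk is reported here as converged, whereas under the same nesting assumption the script above would also print the warning message in that case.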