Investigation of the nucleotide triphosphate diffusion


PhD thesis submitted November, 2016. This work is supported by Imperial College London, and performed as

partial fulfilment of the PhD in Molecular Biosciences, Faculty of Natural Sciences, Department of Life Sciences,

Division of Cell and Molecular Biology, Imperial College London, United Kingdom.

Nicolas Edmond Jean Génin is with Imperial College London, United Kingdom (corresponding author to provide,

phone: 0044-7453275275; e-mail: [email protected]).

He is under the supervision of Dr. R. Weinzierl, and co-supervision of Prof. M. Buck and Dr. A. De Simone, with

access to the Sir Alexander Fleming Building facilities, South Kensington Campus.

Investigation of the nucleotide triphosphate

diffusion into the active site of RNA Polymerase

N. E. J. Génin

PhD thesis submitted to Imperial College London

in partial fulfilment for the degree of

PhD in Molecular Biosciences

November 2016



Declaration of originality

I hereby declare the work presented in this thesis to be original and to belong solely to the author, except

where stated otherwise, in which case it is rigorously referenced to the best of the author’s knowledge.


Copyright declaration

The copyright of this thesis rests with the author and is made available under a Creative Commons

Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit

the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that

they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear

to others the licence terms of this work.


Abstract

RNA Polymerase can be seen as a mobile molecular structure orchestrating the movement of substrate

NTP and nucleic acids, regulated by some control molecules (transcription factors) and the sequential

interplay of the enzyme domains. For the last 15 years, loading of rNTPs into the active site of the

enzymatic complex has been regarded more or less as a settled issue. Based on the first generated crystal

structures, substrates were thought to load via a pathway termed the secondary channel (CH2). This

well-accepted paradigm, which concerns a fundamental aspect of the transcription process, is refuted here, and a new

model, relying on overlooked structural characteristics (CH3, CH4) and accommodating a large body

of pre-existing information, is presented. Important implications include the fact that CH2 is

mainly an exit channel, and that NTPs are selected prior to delivery into the catalytic center. Overlapping

partially with the new loading hypotheses, details about substrate discrimination, error recovery and the

translocation mechanism, which has been an open question in the domain for the past 20 years, are

discussed. Accelerated and Steered Molecular Dynamics simulations are performed and provide

insight into the dynamics of the diffusion process. In-depth conformational and

electrostatic analyses are discussed and allow the propensity for substrate accommodation to be gauged.


Acknowledgements

The author is particularly grateful to Dr. R. Weinzierl, supervising this project, for his continuous

support, for his precious guidance, and for having given the opportunity to the author to undertake this

doctoral project. The author is thankful to Prof. M. Buck and Dr. A. De Simone, co-supervising this

project, for their advice and support. Thanks are also due to post-doctoral researcher Dr. C. Amin who

worked in his group for her help and support. Finally, gratitude is expressed to Imperial College London

alumni students J. Wingfield and A. Valeva, for their support.


Table of Contents

List of tables

List of figures

List of abbreviations

Chapter 1: Literature review

1. Introduction

2. Secondary channel theory

3. Main channel theory

4. Non-controversial properties of CH2 and dynamic error correction

5. The ratchet issue

6. The melting issue and details on cTFs

7. Considerations on nucleotide selection

8. Discussion

9. Concluding remarks

Chapter 2: MD methods

1. Introduction

2. Metabolite pool

3. Forcefields

4. Accelerated MD simulations

5. Steered MD simulations

Chapter 3: Elongation Complex reconstruction

1. Introduction

2. 3D Rotation

3. Illustrative case: adding a single nucleotide

4. Transformations

5. Principle application: constructing a complete EC

6. Closing remarks

Chapter 4: Advanced Characterization of the Diffusional Pathways

1. Introduction

2. Geometric pathway analysis

2.1. Introduction

2.2. Principle of the algorithm

2.3. Detailed description of the algorithm

2.3.1. Refine starting point

2.3.2. Virtual sphere scan

2.3.3. Walk forward along pathway axis

2.3.4. Convert COM map to distance bins

2.3.5. Calculate cross section area

3. Electrostatic analysis

Chapter 5: Results and Discussion

1. Introduction

2. Simulation summary

3. Results

3.1. Diffusional zones

3.2. CH2 Analysis

3.3. CH3A Analysis

3.4. CH3B Analysis

3.5. CH3C Analysis

3.6. CH3D Analysis

3.7. CH4 Analysis

3.8. Misloading recovery investigation

4. Discussion

5. Future works

6. Conclusions

References

Appendix 1

Appendix 2


List of tables

Table 1: Comparison of nucleotide base discrimination between several studies for enzyme with deleted

TL domain.

Table 2: Comparison of nucleotide ribose discrimination between several studies for enzyme with

deleted TL domain.

Table 3: RNA nucleotides to be added.

Table 4: DNA nucleotides to be added.

Table 5: Alignment of an entire template helix to three reference anchoring points.

Table 6: aMD simulation summary.

Table 7: sMD simulation summary.


List of figures

Figure 1: Cross section through Sc RNAP II.

Figure 2: Cutaway view of rNTP loading via CH2.

Figure 3: CH3 access to the main channel.

Figure 4: Electrostatic Fork melting mechanism.

Figure 5: Comparison of FL2 interaction with downstream DNA in Tt RNAP.

Figure 6: Comparison of FL2 interaction with downstream DNA in Sc RNAP II.

Figure 7: TFIIS shielding of RNAP II secondary channel.

Figure 8: 5’-3’ direction of DNA extension.

Figure 9: 3’-5’ direction of DNA extension.

Figure 10: Backbone extension template for both the 5'-3' and the 3'-5' directions of DNA extension.

Figure 11: Nucleotide attachment to the DNA backbone host in the 5’-3’ direction.

Figure 12: Nucleotide attachment to the DNA backbone host in the 3’-5’ direction.

Figure 13: Schematic diagram of the first rotation transformation to align a nucleotide backbone to be

incorporated on DNA 5’ end.

Figure 14: Schematic diagram of the second rotation transformation to align a nucleotide backbone to

be incorporated on DNA 5’ end.

Figure 15: Translation transformation attaching the aligned backbone to DNA 5’ end.

Figure 16: DNA nucleotide and backbone references to attach a new base group on the 5’ end.

Figure 17: Schematic diagram of the first rotation transformation to align a nucleotide base group to be

incorporated on DNA 5’ end.

Figure 18: Schematic diagram of the second rotation transformation to align a nucleotide base group to

be incorporated on DNA 5’ end.

Figure 19: Schematic diagram of the translation transformation attaching a new base group to DNA 5’

end backbone.

Figure 20: Schematic diagram of missing nucleotides in PDB#2E2H.

Figure 21: Comparison fit between initial downstream tDNA structure and superposed extended helix.

Figure 22: Comparison fit between initial downstream ntDNA structure and superposed extended helix.

Figure 23: Visualization of downstream DNA reconstruction.

Figure 24: Initial fitting of upstream ntDNA.

Figure 25: Visualization of the initial fitting of ntDNA template relative to the enzymatic structure.

Figure 26: Second fitting of upstream ntDNA.

Figure 27: Visualization of the second fitting of ntDNA template relative to the enzymatic structure.

Figure 28: Mutation of ntDNA template nucleotides to match Table 4 sequence.

Figure 29: Fitting of missing RNA nucleotides.

Figure 30: vdw representation of the full nucleic complex before potential energy minimization.


Figure 31: vdw representation of the full nucleic complex after potential energy minimization.

Figure 32: Schematic diagram of the main dimensions of a pathway.

Figure 33: Schematic diagram of a pathway cross section layer.

Figure 34: Pathway axis of an irregular channel.

Figure 35: Schematic diagram of the visualization through a pathway.

Figure 36: Projection of pathway points onto a tested direction.

Figure 37: Axis scan.

Figure 38: Contour scan.

Figure 39: Interlining atoms extraction.

Figure 40: Virtual sphere scan method.

Figure 41: Virtual sphere scan pathway axis detection.

Figure 42: Cross section area calculation.

Figure 43: CH2 and corridor pathways.

Figure 44: CH3 view from CH2.

Figure 45: Side view of CH3.

Figure 46: Side view of CH3C, CH3D and CH4, relative to CH2.

Figure 47: Front view of CH3C, CH3D and CH4.

Figure 48: Side view of CH3C, CH3D and CH4, relative to CH4.

Figure 49: Bottom view of CH3D entrance to CH3.

Figure 50: Front, side and back view of CH2 pathway axis.

Figure 51: CH2 minimal radius along diffusional path heatmap.

Figure 52: CH2 cross section area along diffusional path heatmap.

Figure 53: CH2 Electrostatic NTP interaction along diffusional path heatmap.

Figure 54: CH2 force-distance plot.

Figure 55: Front and side view of TL closing of opening CH3A.

Figure 56: Front, side and back view of CH3A pathway axis.

Figure 57: CH3A minimal radius along diffusional path heatmap.

Figure 58: CH3A cross section area along diffusional path heatmap.

Figure 59: CH3A Electrostatic NTP interaction along diffusional path heatmap.

Figure 60: CH3A force-distance plot.

Figure 61: Front, side and back view of CH3B pathway axis.

Figure 62: CH3B minimal radius along diffusional path heatmap.

Figure 63: CH3B cross section area along diffusional path heatmap.

Figure 64: CH3B Electrostatic NTP interaction along diffusional path heatmap.

Figure 65: CH3B force-distance plot.

Figure 66: GTP bound at CH3B entrance.

Figure 67: Longitudinal view through CH3C.


Figure 68: Side view of CH3C pathway axis.

Figure 69: CH3C minimal radius along diffusional path heatmap.

Figure 70: CH3C cross section area along diffusional path heatmap.

Figure 71: CH3C Electrostatic NTP interaction along diffusional path heatmap.

Figure 72: CH3C force-distance plot.

Figure 73: NTP diffusion through CH3C state 1.




Figure 77: NTP diffusion at CH3D entrance.

Figure 78: CH4 force-distance plot.

Figure 79: Pre-translocation protein re-adjustments occurring near the active site.

Figure 80: Mechanistic basis for pre-translocation.

Figure 81: Schematic representation of EC-RNAP coordination with substrate diffusion trajectory.

Figure 82: Schematic representation of on-pathway state 1.

Figure 83: Schematic representation of on-pathway state 2.



Figure 86: Schematic representation of off-pathway state 1.








List of abbreviations

RNAP: RNA Polymerase

Sc: Saccharomyces cerevisiae

Ec: Escherichia coli

Tt: Thermus thermophilus

Ta: Thermus aquaticus

Mj: Methanocaldococcus jannaschii

WT: Wild Type

EC: Elongation Complex

BH: Bridge Helix

TL: Trigger Loop

FL2: Fork Loop 2

SW2: Switch 2 domain

TN: Transition Nucleotide

TF: Transcription Factor

cTF: cleaving Transcription Factor

NAC: Nucleotide Addition Cycle

DS: Downstream

A site: Active site

E site: Entry site

PS site: Pre-insertion site

tDNA: DNA template strand

ntDNA: DNA non-template strand

NTP: nucleoside triphosphate

NMP: nucleoside monophosphate

NDP: nucleoside diphosphate

rNTP: ribo nucleoside triphosphate

cNTP: cognate ribo nucleoside triphosphate

ncNTP: non-complementary ribo nucleoside triphosphate

dNTP: deoxy nucleoside triphosphate

dNMP: deoxy nucleoside monophosphate

ATP: adenosine triphosphate

GTP: guanosine triphosphate

CTP: cytidine triphosphate

UTP: uridine triphosphate

TTP: thymidine triphosphate


A: adenine

G: guanine

C: cytosine

U: uracil

T: thymine

PPi: inorganic pyrophosphate molecule

Pi compound: molecule formed by the association of multiple pyrophosphates

aMD: accelerated Molecular Dynamics

sMD: steered Molecular Dynamics

MD: Molecular Dynamics

VMD: Visual Molecular Dynamics

GPU: Graphics Processing Unit

CPU: Central Processing Unit

PDB: Protein Data Bank

PDB#: Protein Data Bank accession code

PME: particle mesh Ewald

vdw: van der Waals

CH1: Main channel

CH2: Secondary channel

CH3: Tertiary channel

CH3A: Tertiary channel opening A

CH3B: Tertiary channel opening B

CH3AB: Section of the tertiary channel formed by opening A, B and the tertiary channel itself

CH3C: Tertiary channel opening C

CH3D: Tertiary channel opening D

CH4: Quaternary channel

COM: Point lying on a pathway axis


Chapter 1

Literature Review


1. Introduction

RNA Polymerase is a nanoscopic machine located inside the cell nucleus, which is responsible for

transcribing sections of DNA information into mRNA. During the synthesis process, the NTP substrates

enter the molecular machine and reach a zone called the active site where they are assembled into an

RNA chain. According to the largely accepted paradigm the substrates load to the catalytic center via a

pathway termed the “secondary channel” (also referred to as CH2 in this thesis). This channel is

located beneath the active site. It consists of a narrow corridor (≈ 7-12 Å in diameter, ≈ 15 Å in length)

that leads directly to the active site cavity and, towards the outside of the enzyme, opens into a

large conical section, called the “funnel”, which occupies about two thirds of the pathway length. Access to the

active site from the secondary channel is enabled when the trigger loop is bent into an open conformation

and when the EC is in the post-translocated state (i.e. the RNA 3’ end closes against the BH) [Gnatt, et

al., 2001; Wang, et al., 2006] and disabled otherwise. TL refolding reduces the dimensions of the

secondary channel at the entrance to the active site from 15 × 22 Å in the open conformation to 11 × 11

Å in the closed conformation [Vassylyev, et al., 2007B]. “Pore” is usually used to refer either to the

narrow corridor or to the entire tunnel. For more clarity, in this review, “sec. channel” (CH2) or “pore”

will be used to refer to the entire tunnel and “corridor” for the narrow pathway in proximity of the A

site. The theory according to which the NTPs primarily load to the active site via this pathway will be

referred to as the sec. channel theory (CH2 theory). RNAP also possesses a main channel, which will

also be referred to as CH1, allowing the insertion of the DNA inside the enzymatic complex. The main

channel is delimited by the two largest Rpb1/2 sub-units and the Rpb5 sub-unit. It forms an elbow

shaped corridor across the crab-claw-like shape of the enzymatic complex separating the jaws of the

claw, and comprises a downstream section (which accommodates 12-13 base-pairs of downstream DNA

[Naryshkina, et al., 2006; Kireeva et al., 2010]) and an upstream section [Semenova, et al., 2005;

Kashkina, et al., 2007]. The sections intersect at the catalytic center [Semenova, et al., 2005; Kashkina,

et al., 2007]. The DNA bases are incrementally channeled from the downstream to the upstream

direction during NAC. During translocation (forward movement of the enzyme on the nucleic acids),

the DNA strands are unwound at the downstream boundary of the main channel and rewound at the

upstream edge of the elbow shaped channel [Naryshkina, et al., 2006; Kireeva et al., 2010]. Also during

the process, the upstream tDNA strand is associated with the RNA transcript and forms an RNA-DNA

hybrid (8-9 base-pairs long), which resides at the beginning of the upstream channel near the upstream

boundary of the transcription bubble [Naryshkina, et al., 2006; Belogurov, et al., 2009; Kireeva et al.,

2010]. The RNA chain when separating from the hybrid is extruded through a pathway termed RNA

exit channel [Vassylyev, et al., 2009]. An alternative theory for the diffusion of NTPs to the catalytic

site has proposed that the primary route of substrate diffusion would be via the main channel (termed

main channel theory or CH1 theory in this review).


Figure 1: Cross section through Sc RNAP II. tDNA, ntDNA, RNA and GTP in the A site, are shown in lime,

light blue, cyan and red respectively. RNAP II surface is shown in gray. The secondary and main channels

are indicated by dark blue and yellow dashed rectangles respectively. Enzyme structure is PDB#2E2H

([Wang, et al., 2006]).

Both theories agree on the NAC two-metal ion mechanism (molecular operations that are involved in

the polymerization reaction). The consensus proposition is the following. The nucleotide addition step

is presumed to involve two Mg2+ ions, one stably associated with the enzyme (MgA) located on an Rpb1

aspartyl residue at the entrance of the corridor (from the active site) and the other only transiently (MgB),

entering with the NTP [Cramer, et al., 2001; Kettenberger, et al., 2003; Wang, et al., 2006]. Prior to

catalysis, the MgB2+ ion binds to O- atoms of the incoming NTP polyphosphate tail and forms an

NTP–MgB complex [Sigel, et al., 2005; Langelier, et al., 2005; Maoileidigh, et al., 2011]. If the incoming

NTP (called NTP + 2) is the correct nucleotide, the complex is allowed to bind to the insertion site (MgA

site), while MgB binds to an aspartyl residue located near the active site [Abbondanzieri, et al., 2005;

Maoileidigh, et al., 2011]. NTP + 2 is then hydrolyzed producing nucleoside monophosphate (NMP)

and pyrophosphate (PPi) [Abbondanzieri, et al., 2005; Maoileidigh, et al., 2011]. MgB is coordinated

by the β and γ phosphates of NTP + 2 (in reality an NMP) [Stano, et al., 2002; Langelier, et al., 2005].

MgA interacts with the pyrophosphate 3′-OH group of NTP + 1 on the RNA 3’ end, thereby lowering its

affinity for the hydrogen, to activate the -OH group for nucleophilic attack on the α-phosphate of NTP

+ 2 where MgB is located [Steitz, et al., 1998; Stano, et al., 2002; Langelier, et al., 2005; Abbondanzieri,

et al., 2005; Landick, et al., 2005; Maoileidigh, et al., 2011]. This results in the formation of a


phosphodiester bond. The PPi molecule (β and γ phosphates of the NTP + 2) and the MgB ion form a

MgB-PPi2- complex (usually referred to as PPi for convenience). PPi is then expelled through the

secondary channel and the polymerase translocates along DNA and the RNA transcript to free the

nucleotide addition site (register +1), allowing for binding of the next NTP. The sequential order

between PPi release and translocation is currently a matter of debate. According to [Martinez-Rucobo,

et al., 2013], the NAC was elucidated with NTP-containing EC crystal structures of RNAP II and of

bacterial RNAP.

In this review, I will first investigate the secondary channel theory, before considering the elements of

the alternative theory. Then the non-controversial properties of the secondary channel together with

dynamic error correction processes partly involving the latter channel will be examined in order to raise

potential implications for our investigation about the substrate mode of diffusion. I will then discuss one

of the main issues disputed in published literature which concerns the translocation model. The model

seems indeed particularly important to decide between the two substrate modes of entry. Thereafter, the

availability of DS registers discussed in the melting issue sub-section will be investigated, before raising

implications for transcription factors (TF) and substrate diffusion. How nucleotides are discriminated

will next be discussed, and we will see how the mechanism fits in each substrate loading model. Finally,

a general discussion will be undertaken.


2. Secondary channel theory

In 1999, the first mention of the secondary channel as a possible pathway for NTP diffusion to the active

site was made simultaneously, in the September issue of the journal Cell, by Zhang et al. [Zhang, et al.,

1999] and Fu et al. [Fu, et al., 1999], based on the observation of the newly generated X-ray

crystallography data of bacterial RNAP and eukaryotic RNAP II at 3 and 5 Å resolution respectively.

The postulate was proposed because the active site appeared directly connected to the exterior of the

enzyme through the secondary channel, and the latter seemed to be the only unobstructed pathway for

NTP diffusion. The hypothesis was subsequently restated by numerous researchers, based on the

generation and observation of T7 RNAP, T. thermophilus RNAP and S. cerevisiae RNAP II X-ray

structures [Korzheva, et al., 2000; Cramer, et al., 2000; Cramer, et al., 2001; Gnatt, et al., 2001; Bushnell,

et al., 2002; Vassylyev, et al., 2002; Westover, et al., 2004A; Kettenberger, et al., 2004; Temiakov, et

al., 2004; Temiakov, et al., 2005; Wang, et al., 2006].

The first sets of evidence in favor of the secondary channel theory came from the fact that NTPs were

observed pre-bound at the entrance of the corridor in proximity of the active site, indicating that NTPs

travelled through the CH2 pathway. In 2003, a non-template entry site (E site) for pre-binding of the

NTP substrate prior to NAC was first hypothesized by [Sosunov, et al., 2003]. From their biochemical

experiments, the researchers observed increased fluorescence (which was directly correlated to

nucleotide imprisonment in the enzymatic complex) when non-complementary nucleotides were

inserted. This was interpreted as a nucleotide binding phenomenon in a non-template site, as the active

site could normally only accommodate complementary nucleotides. However, other biochemical studies

have suggested that NTPs could bind to an allosteric or non-template site in the main channel, which

could explain the increased fluorescence stated above without validating CH2 as the main diffusion path

(details in further paragraphs). In 2004, Westover et al. [Westover, et al., 2004A] extended the

diffraction limit of RNAP II crystals to 2.3 Å, allowing a more detailed inspection of the complex. A

mismatched NTP was directly observed bound to a site adjacent to the A site, in the secondary channel,

and consequently the hypothesis was raised that nucleotide selection includes an initial binding to an

entry site beneath the active center [Westover, et al., 2004A]. The entry site (E site) hypothesis was

reinforced by Wang et al. [Wang, et al., 2006] in 2006 on the basis of additional crystallographic data.


Figure 2: Cutaway view of rNTP loading via CH2. tDNA, RNA, GTP in PS site, GTP in E site and GTP in

A site are shown in light blue, lime, orange, hashed purple and yellow respectively. Mg2+ ions are

represented as black spheres. MgB site is shared between the PS and the E site bound nucleotides. The

pathway represented on the figure is the corridor section of the secondary channel leading to the active site.

Protein wall surface is represented in grey. The figure combines structural information of PDB#1R9T for

the E site [Westover, et al., 2004A], PDB#2O5J for the A site, [Vassylyev, et al., 2007B] and PDB#2PPB for

the PS site, [Vassylyev, et al., 2007B].

In 2004 and 2005, Temiakov, et al. in [Temiakov, et al., 2004] and [Temiakov, et al., 2005], and

Kettenberger et al. in [Kettenberger, et al., 2004], using Fourier Electron Density map calculations

applied to RNAP complexes cocrystallized with a non-hydrolyzable NTP analog, discovered a

preinsertion site to which the NTP substrate was thought to bind before accessing the insertion site

where it undergoes catalysis. Although these results could seem in line with the E site postulate described

above, some important distinctions are to be made. First, the preinsertion site (PS) does not coincide

with the E site. Indeed, the PS site is located at register i + 1, where the incoming NTP

binds. The orientation of the register in the preinsertion state is such that the bound NTP is oriented

towards the secondary channel and the polyphosphate tail could therefore be partially inserted and/or

bound there, even though the i + 1 register resides in the A site. As such, only a small fraction of the PS

site can be considered as overlapping the secondary channel. In contrast, the E site resides entirely

outside the active center. Second, the PS site hypothesis does not validate CH2 (secondary channel)

theory, as the NTP could be carried there by pre-binding to tDNA, whereas the E site postulate does

seem to validate CH2 theory, as the only obvious access to the site is via the pore.


In 2004, Mukhopadhyay et al. [Mukhopadhyay, et al., 2004], observed that the insertion of the peptide

microcin J25 led to transcription inhibition in bacterial RNAP. Inhibition was partially competitive with

NTPs (i.e., high NTP concentrations diminished inhibition), leading the researchers to the conclusion that the

toxin molecule interfered at the level of NTP delivery or NTP binding. Because the authors found that

microcin J25 fitted inside CH2, obstructing it almost perfectly, and appeared to block passage of an NTP

molecule, they proposed that impediment of substrate diffusion to the active center was part of the

inhibition function: “MccJ25 inhibits transcription by interfering with NTP uptake by binding within

and obstructing the RNAP secondary channel—acting essentially as a cork in a bottle”. It follows that

the hypothesis according to which the secondary channel served substrate loading was reinforced.

Further evidence for CH2 accommodating substrate uptake was proposed by the following results from

Holmes et al. in 2006 [Holmes, et al., 2006]. They found that D675Y and D675V substitutions in Ec

RNAP reduced transcription fidelity. Because the residue is located inside the secondary channel, at

some distance from the catalytic center, the researchers proposed that it played a role in

electrostatically filtering incoming substrates. While still considering that NTPs could diffuse via

multiple routes, they postulated that NTPs would load via CH2 at least sometimes.

In addition to the biochemical and structural evidence for the secondary channel theory, which points to the existence

of an E site that could bind NTPs in a preliminary step and to substrate delivery through the secondary

channel, a probabilistic model based on computational diffusion simulations by Batada et al.

[Batada, et al., 2004] seemed both to reinforce the plausibility of the E site hypothesis and to validate

CH2 as a plausible diffusion pathway, while also yielding informative details about the diffusional

properties of the channel. The fact that the sec. channel would serve as the main entry route for NTPs

would suggest that the structure of the pathway plays a role in NTP diffusion to the active site and in

substrate discrimination. In their publication, Batada et al. studied the effect of the pore topology and

electrostatics on NTP diffusion. Their MD simulations allowed them to calculate that the topology of

the pore alone (i.e. restriction due to the funnel opening and pore walls), in the absence of an electrostatic

potential, reduced the rate of NTPs accessing the A site by a factor of 1/16800. They also found that the

corridor had a strong negative electrostatic potential, reducing the rate of NTPs accessing the E site (note

that the authors considered electrostatic impediment for diffusion to the E site and not the A site) by a

factor of 1/300. According to Batada and colleagues, this induced a total restriction in NTP diffusion by a

factor of (1/16800) × (1/300) ≈ 2 × 10^-7. Combining this result with the 10^12 s^-1 M^-1 collision rate between

RNAP and NTPs and a 1 mM concentration of substrate (assumed physiological) seemed to allow

successful diffusion to the A site at a level of about 200 NTPs per second. Because of steric requirements for

binding, the authors then suggested that successful delivery would be reduced by one order of

magnitude, hence 20 NTP s^-1, or even two orders of magnitude, i.e. ≤ 20 NTP s^-1. The authors then

stated that their calculated rate of ≤ 20 NTP s^-1 was consistent with the ≈ 10 NTP s^-1 synthesis rate by RNAP

II in vivo. From their MD simulations, Batada et al. also calculated an enhanced NTP diffusion rate to

the A site in case of prior NTP binding to the E site (with a minimum transient binding time of 10 ns

calculated from chemical dissociation constants). These results seemed to improve their diffusion model

and appeared consistent with the E site hypothesis.
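
The arithmetic behind this estimate can be retraced in a few lines. The following is a minimal sketch in Python that uses only the figures quoted above; the function and parameter names are illustrative assumptions and do not come from Batada et al.

```python
# Values quoted in the text from Batada et al. (2004).
TOPOLOGY_FACTOR = 1 / 16800       # rate reduction from pore geometry alone
ELECTROSTATIC_FACTOR = 1 / 300    # rate reduction from the negative potential
COLLISION_RATE = 1e12             # RNAP-NTP collisions, s^-1 M^-1
NTP_CONCENTRATION = 1e-3          # 1 mM, assumed physiological


def delivery_rate(steric_orders_of_magnitude=0):
    """Estimated NTP deliveries per second through the pore.

    steric_orders_of_magnitude is a hypothetical parameter name for the
    extra tenfold reductions attributed to steric binding requirements.
    """
    restriction = TOPOLOGY_FACTOR * ELECTROSTATIC_FACTOR
    collisions_per_second = COLLISION_RATE * NTP_CONCENTRATION
    return collisions_per_second * restriction / 10 ** steric_orders_of_magnitude


for orders in (0, 1, 2):
    print(f"steric reduction 10^-{orders}: {delivery_rate(orders):.1f} NTP/s")
```

Run as written, the sketch gives roughly 198, 20 and 2 NTP per second for zero, one and two orders of magnitude of steric reduction, which is where the ≤ 20 NTP s^-1 bound comes from.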

Another Molecular Dynamics investigation confirmed that the secondary channel was the most suitable

route for accommodating substrates [Zhang, et al., 2015A]. A comparative conformational analysis with

the program CAVER ([Chovancova, et al., 2012; Kozlikova, et al., 2014; Pavelka, et al., 2016]) between

the main and the secondary channel was carried out, and it was concluded that the latter was more

suitable to accommodate NTP substrates. It was also proposed that a substrate residing in the funnel

is energetically more favorable than one lying within the main channel, because of decreased Coulombic

repulsion.

3. Main channel theory

The first evidence in favor of the main channel theory arose from the 2001 study by Foster, Holmes

and Erie. By using alternative biochemical transient-state kinetic techniques, the group measured the

kinetics of single NTP incorporation steps as a function of NTP concentration for Ec (Escherichia coli)

RNAP [Foster, et al., 2001]. In their first experiment, they measured the rate of CMP incorporation as a

function of CTP concentration, where CTP is the next nucleotide (templated NTP) to be incorporated.

They noted that the substrate-saturation curve representing CMP incorporation kinetics as a function of

CTP concentration had a quadratic dependence on CTP concentration. It follows that the

kinetics are biphasic (not hyperbolic as expected from the secondary channel paradigm) and thus that

RNAP must contain a second NTP binding site in addition to the catalytic site, which acts as an allosteric

effector, accelerating the incorporation of the templated NTP, where the next NTP to be added (CTP) is

both the substrate and the allosteric effector. In another experiment, they measured the rate of CMP

incorporation as a function of different concentrations of ATP, GTP and UTP (and with low CTP

concentration for matters of experimental convenience to force a control incorporation state). The

kinetics this time showed that non-templated NTPs did not affect the rate of incorporation, indicating

template specificity for the allosteric function of the binding site. Finally, they measured the kinetics of

AMP incorporation (where AMP is the next nucleotide to be added) as a function of AMP-CPP

concentrations, which showed that the templated but non-incorporable ATP analog accelerated AMP

addition (i.e. activated transcription to the fast state). From these important results the following

conclusion can be drawn: RNAP possesses an allosteric binding site in addition to the catalytic site, where

templated but not mismatched NTPs increase the rate of NTP incorporation, and where the allosteric

site probably resides downstream of the template DNA chain in the main channel. This confirmed an

early hypothesis by Nierman et al. ([Nierman, et al., 1980], cited by Foster et al.) drawn from the study

of transcription initiation kinetics stating that RNAP may contain an NTP binding site in addition to the

catalytic site. It was also postulated that NAC can either occur in a fast or slow state (consistent with a

publication from Davenport et al. in 2000 [Davenport, et al., 2000], cited by Foster et al.), with the

transition to the fast state being induced by the NTP binding to the allosteric site (tDNA i + 2 site).

In 2003, Holmes and Erie presented new compelling evidence in favor of a secondary binding site in the

main channel [Holmes, et al., 2003]. They assembled mutant DNA templates and observed that the DNA

sequence one base pair downstream from the site of NTP addition affected the rate of subsequent NTP

incorporation. In 2003, Nedialkov, Burton et al. found results consistent with the main channel theory

using pre-steady state kinetics [Nedialkov, et al., 2003]. A running two-bond protocol was designed,

involving four ECs termed C40, A43, G44 and G45, which corresponded to

standard elongation positions. C40 EC is advanced to A43 by adding specific concentrations of NTPs.

After stalling briefly, A43 establishes a steady state distribution between a paused and an active EC.

When GTP is added, the active A43 EC moves to the G44 and

G45 positions, where the rapid rates of elongation make it possible to reproduce the synthesis rates experimentally.

In this setup, G44 rates indicate recovery from a stalled A43 position, and G45 rates indicate processive

elongation from G44 to G45 (including RNA-DNA hybrid and tDNA translocation). As such, these EC

positions capture snapshots of the steps corresponding to critical NAC sequential processes. For

example, translocation and pyrophosphate release are thought to occur between the synthesis of the G44

and G45 bonds (G44 corresponds to the synthesis of a first bond attaching substrate NTP to the growing

RNA chain and G45 corresponds to the synthesis of the next incorporation bond) and if G44 or G45 are

monitored exclusively then information about translocation could be distorted. The reaction pathways

are stimulated with TFIIF and HDAg (hepatitis δ antigen, elongation stimulant) elongation factors.

Monitoring the formation rates of the A43, G44 and G45 EC positions as a function of GTP substrate

concentrations led to the following observations. Recovery from a stalled EC and processive transition

from one bond (incorporation event) to the other can be highly dependent on the incoming NTP,

indicating that NTPs could pre-bind to a non-catalytic site in the main channel and play a role in driving

and/or triggering translocation. Furthermore, it is to be underlined that, inconsistent with the secondary

channel theory and confirming the results published in 2001 from Erie et al. [Foster, et al., 2001] and

Palangat et al. [Palangat, et al., 2001], the measured rates of NTP incorporation as a function of NTP

concentration did not reflect a hyperbolic dependence. In 2004, using the same RNAP II ECs as above

(notably A43, G44 and G45), in conjunction with TFIIF (which stimulates forward translocation) and

TFIIS (which appeared to improve the quality of the kinetic experimental data by promoting RNA

cleavage and re-start), Zhang and Burton ([Zhang, et al., 2004]) monitored the kinetic pathway between

the key transcription steps embodied by the control EC positions. In other words, they evaluated the

dependence between translocation and nucleotide addition in the interval of two bonds (two nucleotide

incorporation events). By using new quench techniques, they were able to measure the rate of substrate

tightening to the active site (termed G44 isomerization, correlated to the enzymatic complex confining

the active site and detected with EDTA quench) prior to phosphodiester bond formation (termed G44

chemistry, detected with HCl quench). The G44 isomerization state reflects substrate accessing the A

site. At higher GTP concentrations, EDTA quench rate curves for G44 isomerization were biphasic,

consistent with the NTP allosteric effect depicted above. Also, because the isomerization rate proved to

be rapid and not rate-limiting, they concluded that at high GTP concentrations, elongation kinetics were

not dependent on GTP loading. Instead, they found that template-dependent binding of substrate NTP

was coupled with the completion of the previous NAC (indicating that NTPs must pre-bind in the pre-

translocated EC), with the rate-limiting steps being translocation and PPi release. The results appeared

consistent with an NTP-driven translocation mechanism where downstream substrate NTPs pre-bound in

the main channel have a functional effect on subsequent NTP incorporation and inconsistent with the

secondary channel theory requiring rapid Brownian ratchet translocation and rapid PPi expulsion.
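
The distinction between a hyperbolic substrate-saturation curve and one with a higher-order dependence on NTP concentration, on which several of the arguments above rest, can be illustrated numerically. The Python sketch below is only an illustration under stated assumptions: the Hill-type expression with an exponent of 2 is a generic stand-in for an allosteric effect, not the kinetic scheme fitted in the cited studies, and v_max, k_half and the concentrations are arbitrary values in arbitrary units.

```python
def hyperbolic_rate(s, v_max=1.0, k_half=1.0):
    """Single-site saturation curve: rate is linear in s when s << k_half."""
    return v_max * s / (k_half + s)


def cooperative_rate(s, v_max=1.0, k_half=1.0, n=2):
    """Hill-type curve (assumed form): rate scales as s**n when s << k_half."""
    return v_max * s ** n / (k_half ** n + s ** n)


def fold_change_on_doubling(rate_fn, s):
    """Factor by which the rate increases when the concentration doubles."""
    return rate_fn(2 * s) / rate_fn(s)


low_s = 0.001  # far below k_half, where the two curves differ most clearly
print("hyperbolic :", round(fold_change_on_doubling(hyperbolic_rate, low_s), 3))
print("cooperative:", round(fold_change_on_doubling(cooperative_rate, low_s), 3))
```

At concentrations well below k_half, doubling the substrate roughly doubles the hyperbolic rate but roughly quadruples the cooperative one, which is the kind of signature that distinguishes the two curves.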

Furthermore, the computational diffusion calculations of Batada et al. [Batada, et al., 2004] indicated that

loading at ≤ 20 NTP s^-1 would be rate-limiting, but in this study one of the measured stable NTP loading rates was

1450 ± 330 s^-1, indicating that loading was not rate-limiting for human RNAP II. Finally, in line with

their transient-state kinetics data from 2003 [Zhang, et al., 2003] and inconsistent with the secondary

channel theory, Burton et al. observed that the occlusion of the pore with TFIIS did not appear to hinder

NTP loading. In their 2005 publication [Gong, et al., 2005], using millisecond kinetics quench-flow

techniques (developed by the laboratory), Burton and co-workers obtained crucial results in favor of the

main channel theory by using a fascinating experimental approach based on a phenomenon termed

isomerization reversal, whose principle is the following. Translocation is blocked by α-amanitin

(mushroom toxin). High incoming NTP substrate concentrations (corresponding to the template i + 2

NTP), by promoting forward translocation on the EC blocked by α-amanitin, induce isomerization

reversal and dislodge (i.e. reverse the isomerization of the A site) the i + 1 NTP (isomerized i + 1 NTP

about to complete bond synthesis). This phenomenon is possible because tightening of the active site

(which is reversible) occurs before phosphodiester bond formation (which normally becomes

irreversible when PPi is released). Isomerization requires substrate sequestration in the A site, and

detection is allowed by the fact that the MgB ion of the i + 1 GTP becomes shielded from EDTA

chelation. Also, the metal ion not being inactivated by EDTA quench allows GTP to proceed to

phosphodiester bond formation. Quenching with HCl on the other hand stops the reaction instantly

giving valuable information about the timing of the bond formation. The researchers experimentally

applied the principle as follows. A 40-CAAAGGCCTTT-50 template was used. Elongation was then

monitored between G44 and G45 (44 and 45 nucleotide RNAs ending in 3’-GMP) starting at a stalled

A43 EC, where G44 corresponded to an isomerized complex (substrate tightening in the A site), G45

corresponded to an incorporated NTP (GTP has formed the phosphodiester bond) and A43 represented

the post-translocated EC where the GTP substrate loads to the i + 1 and i + 2 sites. If an EDTA quench

was added, i + 1 GTP was inactivated, but not i + 2 GTP which was not protected from chelation by the

A site. In the continuing presence of i + 2 GTP substrates, at early EDTA addition (0.002 s),

isomerization reversal was not detected, i.e. more G44 product was observed, but at prolonged EDTA quenching

(0.1 s), isomerization reversal was detected, i.e. more A43 product was observed, indicating that i + 2

GTP dislodged the catalytic GTP. Also, the slow convergence of EDTA (isomerization time) and HCl

(bond synthesis time) curves indicated a coupling between translocation (hypothesized in their research

to be NTP-driven) and PPi release (which coincided with the end of the phosphodiester bond synthesis),

because high concentration of GTP-Mg2+ (detected in G45 by HCl quenching) appeared necessary to

force G44 bond completion. The three following experimental results using the experimental principle

explained above demonstrate binding of substrate NTP in the pre-translocated EC at the i + 2 and i + 3

downstream sites. First, the researchers showed that i + 2 and i + 3 NTPs contribute to isomerization

reversal. With a 40-CAAAGCCTTT-49 template, i + 2 and/or i + 3 CTP stimulated isomerization, while

dCTP did not (indicating precise selectivity at downstream sites), neither did GTP, ATP, or UTP. With

a 40-CAAAGACTTT-49 template, both i + 2 ATP and i + 3 CTP contributed to i + 1 GTP expulsion,

but i + 2 ATP, i + 2 UTP, i + 2 CTP alone, or i + 2 ATP in conjunction with i + 3 UTP did not. Also, in

the presence of dCTP, the EDTA and HCl quench curves converged slowly, indicating that it is i + 2

and/or i + 3 CTP which drove G44 bond completion. Second, a dynamic error correction process was

postulated thanks to the following experimental results. With a 40-CAAAGCCTTT-49 template, CTP

cancelled misincorporation of AMP for GMP (i.e. induced isomerization reversal of incorrect i + 1

AMP), but UTP did not. The researchers underlined that physiologically, not just in the presence of α-

amanitin, dynamic error correction occurs. Third, regulation of downstream template opening was

suggested. With a 40-CAAAGTCTTT-49 template, CTP or UTP alone did not appear to stimulate the

formation of the post-translocated A43 EC, indicating that combination of i + 2 and i + 3 optimally

triggered the formation. In 2007, Burton and colleagues pursued their isomerization experiments [Xiong,

et al., 2007]. They showed that NTP substrates templated at the i + 2, i + 3 and i + 4 sites, but not mismatched

NTPs, matched dNTPs or matched NDPs, could induce isomerization reversal at the i + 1 site.
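
The template notation used in these experiments determines which NTP is templated at each downstream register once the EC is stalled. As a reading aid, the following minimal Python sketch assumes that a label such as 40-CAAAGCCTTT-49 gives the transcript sequence from position 40 to position 49, with T or U marking positions where UTP is incorporated; the function name and this interpretation are assumptions made for illustration.

```python
def downstream_registers(template, stalled_at=43):
    """Map registers i+1, i+2, ... to the NTP templated at each position.

    template   -- label of the form "<start>-<sequence>-<end>"
    stalled_at -- position of the last incorporated nucleotide (A43 here)
    """
    start_text, sequence, end_text = template.split("-")
    start, end = int(start_text), int(end_text)
    if len(sequence) != end - start + 1:
        raise ValueError("sequence length does not match the position labels")

    registers = {}
    for position in range(stalled_at + 1, end + 1):
        base = sequence[position - start]
        # T in the label is read as a UTP incorporation site (assumption).
        ntp = ("U" if base == "T" else base) + "TP"
        registers[f"i+{position - stalled_at}"] = ntp
    return registers


for label in ("40-CAAAGCCTTT-49", "40-CAAAGACTTT-49", "40-CAAAGTCTTT-49"):
    print(label, downstream_registers(label))
```

Under this reading, the 40-CAAAGCCTTT-49 template places GTP at i + 1, CTP at i + 2 and i + 3, and UTP at i + 4 to i + 6, which matches the register assignments used in the text.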

With a 40-CAAAGCCUUU-49 template, NTP binding at downstream sites was demonstrated because

accurately templated CTP and possibly UTP at i + 4, i + 5 and i + 6, had an effect on the fate of i + 1

GTP loaded in the active site. When 2.5 mM CTP and UTP were substituted with 5 mM ATP (an NTP

that is not accurately templated at adjacent downstream sites), isomerization reversal was significantly

reduced. Also, when CTP and UTP were replaced with CTP and ATP, the substitution of UTP with ATP

appeared to slightly reduce isomerization reversal indicating a role for the i + 4 (UTP templated) binding

site. A second experiment using the same template tested the requirements for i + 2 and i + 3 CTP sites

occupancy and i + 4, i + 5 and i + 6 UTP sites occupancy. The results were as follows. Reversal was

weak in reactions lacking CTP, strong in reactions containing GTP, CTP and UTP, and weak for the

combination using GTP and UTP but substituting CTP with dCTP or CDP. Also, dTTP, dUTP and UDP

did not stimulate reversal in the presence of CTP. In addition, the separation between

the EDTA and HCl curves seemed to indicate that at high CTP and UTP concentrations, CTP and UTP

induced increased translocation strain on the EC. In contrast, at low CTP and UTP concentrations, a

reduced translocation pressure was postulated to be applied against the translocation block by the

downstream NTPs. This corroborated a regulatory role for downstream NTPs in translocation.

In their 2008 study [Kireeva, et al., 2008], Kireeva, Burton et al. found that in mutant E1103G RNAP II,

the predominantly pre-translocated EC experienced a dramatic increase in NTP sequestration (at least

1200 isomerization events per second) as compared with the wild type EC, which is inconsistent with

the maximum rate of isomerization events that could be sustained by NTP diffusion through the pore

according to Batada et al.’s diffusion calculations [Batada, et al., 2004]. In addition, the only way NTPs

could enter a predominantly pre-translocated EC would be during hypothetical short pre- to post-

translocated EC time windows (assuming the EC could oscillate between these positions), rendering the

successful diffusion through the secondary channel even less plausible. In their 2011 publication

[Kennedy, et al., 2011], Kennedy and Erie, using transient state kinetics and a mutant of RNAP, put

forward the following results. First, pre-incubating the complex with an NTP at i + 2 site increased the

subsequent rate of NAC, suggesting the existence of an NTP allosteric site in the main channel. Second,

pre-incubating the complex with an ATP at i + 2 led to its rapid sequestration in the active site after the

incorporation of a second CTP nucleotide. This was detected by HCl/EDTA quench assays revealing an

accumulation of enzyme-substrate in the complex, and suggested that CTP was sequestered prior to its

incorporation. Also, EDTA quench measures indicated that the sequestered ATP was committed to bond

formation prior to incorporation of CMP. Therefore, the quench data indicated that RNAP can

simultaneously sequester CTP and ATP prior to incorporation of CMP, which seemed to indicate that

the ATP had to be sequestered in a non-catalytic site without being released from the enzyme after CMP

incorporation. Consequently, it was suggested that NTPs can bind to a site in the main channel (i + 2)

that is involved in the regulation of NAC. In a paper published in 2006, Holmes et al. [Holmes, et al.,

2006] observed that mutating Ec RNAP residues R678 and D814, which in the secondary channel

loading model appear to interact with the nucleotide phosphate group and to coordinate MgB bound on

the NTP, had virtually no effect on the transcription kinetics. This result seemed highly inconsistent with

CH2 theory. In accordance with the results of the kinetic experiments described above, three single-

molecule studies [Abbondanzieri, et al., 2005; Larson, et al., 2012; Dangkulwanich, et al., 2013] seemed

to yield consistent information. Using an optical trap assay, the researchers measured the step magnitude

and velocity of translocation events, under assisting or opposing forces, from which they derived the

force dependence of the NAC. They found that the experimental force-velocity data supported a kinetic

model involving a secondary substrate binding site in the pre-translocated state. Finally, we will see in

chapter 5 that routes for substrate diffusion to the downstream section of the main channel,

accommodating NTP pre-binding, are available. For the sake of the argument, these additional pathways will be

referred to as the tertiary channel (CH3) in the rest of this chapter.

Figure 3: CH3 access to the main channel. tDNA i + 3, i + 2 and i + 1 registers are represented in yellow,

green and blue respectively. i and i - 1 registers are represented in red and are bound to the RNA chain

colored in orange. The GTP substrate in the active site and bound to i + 1 register is represented in pink.

Protein walls are represented in grey surface. Enzyme structure is PDB#2E2H ([Wang, et al., 2006]).

4. Non-controversial properties of CH2 and dynamic error correction

While the diffusion function of the secondary channel is a matter of debate concerning its role in

channeling NTP substrates to the active center for catalysis (which implies exchanging correct/wrong

substrates in and out of the torus structure), it is accepted as an exit channel for incorrect NTP and PPi.

At this stage, scarce information is available about the kinetics of incorrect substrate expulsion, but

recent studies have pointed out interesting information concerning the properties of the pore involved

in PPi expulsion.

The pore serves as an exit tunnel for two PPi release events: after NAC and after TFIIS/GreA/B cleavage

[Zhang, et al., 2004; Sims III, et al., 2004]. In 2011, Da et al. [Da, et al., 2011] investigated the kinetics

of PPi release on the microsecond timescale by applying a Markov state model (a predictive method

that estimates the pathway of a system over a prolonged period of time from transitions between known

states) using all-atom MD simulations and single-mutant simulations. They found that the PPi molecule

experienced a hopping behavior during its expulsion where hopping sites at the inner extremity of the

pore in the active site and further down in CH2 accelerated the release. The conserved positively charged

residues, such as yeast RNAP II residues Rpb1 K518, K619, K620, K752 and H1085, were shown to offer

constructive electrostatic interactions with the negatively charged (Mg−PPi)^2− group, and to play an

important role in the expulsion. The authors note that all five residues are highly conserved among

species. Interestingly, K619 and K752 are located in the E site. Hence, the authors propose that these

residues, which could play a role in attracting the negatively charged substrate during NTP entry, could

have the double purpose of facilitating the expulsion of the negatively charged PPi molecule.
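
The principle of a Markov state model can be sketched with a toy example. In the Python sketch below, the three states and all transition probabilities are hypothetical values chosen for illustration and are not taken from Da et al.; the point is only to show how a matrix of short-time transition probabilities is propagated to estimate behavior over a longer period.

```python
# Hypothetical states: 0 = PPi in the active site, 1 = PPi at a hopping
# site in the pore, 2 = PPi released (absorbing state).
# Each row holds made-up transition probabilities for one lag time.
TRANSITIONS = [
    [0.90, 0.09, 0.01],
    [0.05, 0.85, 0.10],
    [0.00, 0.00, 1.00],
]


def propagate(populations, transitions, n_steps):
    """Advance state populations by n_steps lag times (p <- p T)."""
    n_states = len(transitions)
    for _ in range(n_steps):
        populations = [
            sum(populations[i] * transitions[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return populations


start = [1.0, 0.0, 0.0]  # all PPi initially in the active site
for steps in (1, 10, 100):
    released = propagate(start, TRANSITIONS, steps)[2]
    print(f"after {steps:3d} lag times: released fraction = {released:.3f}")
```

In an actual study the states and transition probabilities would be estimated from many short MD trajectories rather than assigned by hand.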

In 2013, Da and colleagues [Da, et al., 2013], using the same approach as above, studied

the dynamics of PPi release in Tt RNAP. They observed that the expulsion rate of the inorganic

pyrophosphate molecule was three-fold faster than in yeast RNAP II and occurred at a submicrosecond

timescale. Similarly to the mechanism proposed for eukaryotic RNAP II, they found that PPi exit was

facilitated by favorable electrostatic interactions with basic residues in the secondary channel (K908,

912, 780, 1362 and 1369). The authors suggested that one of the causes of the faster expulsion dynamics

in the case of bacterial RNAP could result from the shorter dimensions of CH2.

In addition to its diffusion properties, CH2 also has non-diffusion functions (non-controversial at this

stage): it serves as the RNA backtracking site and the TFIIS/GreA/B binding site. In contrast to DNA Polymerase,

RNAP can backtrack the nascent transcript (through the secondary channel) in order to correct

transcription errors or to allow regulatory pauses to occur, whereas DNA Polymerase requires alternative

processes (notably the implication of exonucleases). This embedded fidelity/regulatory mechanism

underlines the remarkable precision and efficiency of RNAP and makes the molecular machine a

masterpiece of engineering. First, the concept of RNAP backtracking with the latest postulates about the

molecular mechanisms underlying such a process will be investigated. Then the TFIIS and GreA/B TFs

which bind in the secondary channel and are involved in the RNAP error correction processes will be

presented. Other TFs (bacterial) which bind in CH2 include DksA and Gfh1.

RNAP enters an off-pathway state when it aborts processive transcription. The latter off-pathway state

can be subdivided into two states [Xie, 2012]. The first state is referred to as pausing or arrest and

corresponds to a brief suspension of transcription (1–6 s for multi-subunit RNAP) where RNA does not

normally backtrack [Nudler, et al., 1997; Shaevitz, et al., 2003], but where the elongation rate is regulated

[Xie, 2012]. Pausing is thought to be induced by signals coded directly into the DNA template, that is

to say to be triggered by specific tDNA sequences [Herbert, et al., 2006]. The second state usually

involves prolonged pauses (> 20 s for multi-subunit RNAP) where the enzyme experiences backtracking

[Xie, 2012]. The process of the latter state is the following. RNAP can literally rewind its forward step-

wise motion along DNA and RNA, and slide in the opposite direction on the nucleic acids in order to

reset the transcription mechanism several base-pairs backwards or in order to expel a full aberrant RNA

chain. The roles of backtracking include transcription error recovery, control of transcription elongation

(function slightly distinct from error recovery), recovery from pause-arrest, exposure of damaged DNA

for repair, termination of elongation and initiation (where the enzyme cycles between several RNA

synthesis and extrusion phases until a 13-15 nucleotide long RNA chain has been successfully

synthesized) [Batada, et al., 2004; Vassylyev, et al., 2007A; Nudler, et al., 2012]. In such a process, the

DNA molecule can be directly extruded through the downstream main channel outside of the enzyme,

but the 3’ end of the nascent RNA transcript, being located at the center of the complex, needs a pathway

inside the RNAP for accommodating its retrograde motion. CH2 serves this very purpose as it connects

to the active site where the RNA 3’ end lies and offers an empty cavity for the transcript to be extruded.

According to Martinez-Rucobo and colleagues [Martinez-Rucobo, et al., 2013], RNA backtracking

through the secondary channel has been elucidated thanks to the direct observation of the phenomenon

in RNAP crystallographic data. According to Xie in [Xie, 2012], knowledge about the transcription

pausing characteristics arose from single-molecule studies of RNAP.

The backtracking state is triggered by a destabilized RNA–DNA hybrid [Nudler, et al., 1997; Shaevitz, et

al., 2003; Sosunov, et al., 2003; Greive, et al., 2005; Kireeva, et al., 2005; Zenkin, et al., 2006]. An

incorporation error leads to a weakening of the hybrid, which in turn increases the probability of

backtracking [Nudler, 2009]. The mechanism by which the hybrid loosens its contacts from the active

site has been theorized by Vassylyev et al. in [Vassylyev, et al., 2007A] and Xie in [Xie, 2012]. According

to the former group, when the hybrid is packed in the active site, it forms polar and van der Waals

interactions with conserved protein residues. They propose that the protein structure may act as a shape-

sensor of the hybrid, where an incorrect RNA sequence leads to increased repulsive van der Waals

interactions between the protein and the hybrid. The shape-sensor theory was anticipated in 2001 by Gnatt et al. [Gnatt,

et al., 2001]. In 2012, Bochkareva et al., using transcription assay kinetic techniques, generated results

consistent with the shape-sensor theory [Bochkareva, et al., 2012]. Xie on the other hand proposes the

following model. During correct transcription elongation, the RNA-DNA hybrid is not unwound which

induces a positioning of the RNA 3’end away from the secondary channel. However, if an incorporation

error occurs, the resulting mismatch in the nascent hybrid is likely to cause the RNA chain to lose its

canonical A form and to deviate from the DNA. This deviation could greatly increase the probability

that the RNA positions itself in front of CH2, allowing its extrusion. The author also underlines that when the

RNA-DNA pair is not unwound, which corresponds to correct transcription, the 3’end of the RNA chain

is positioned at the i site and structurally prevents the enzyme from translocating backwards. In support

of Xie’s model, a frayed RNA 3’ end has been observed in crystallographic structures containing a

misincorporated nucleotide [Sydow, et al., 2009A; Sydow, et al., 2009B; Wang, et al., 2009]. In addition,

Toulokhonov et al. in 2007 ([Toulokhonov, et al., 2007]) found results consistent with RNA 3’ end

fraying during the elemental pause state (probably preceding the other off-pathway states such as

backtracking). Nudler in [Nudler, 2012], summarizes the mechanism by stating that incorrect substrate

pairing would facilitate backtracking and its own expulsion through the secondary channel, and therefore

backtracking may assist in NTP selection. In addition, the author proposes in [Nudler, 2012] and

[Nudler, 2009] that the trigger loop may play a role in allowing the backtracking process to occur. The

trigger loop closed conformation indeed depends on the accuracy of the loaded NTP. However, the extent

to which backtracking causes or is caused by the trigger loop conformation change does not seem fully

elucidated at this stage.

According to Wang et al. in [Wang, et al., 2009], RNA backtracking is reversible for one or a few

nucleotides, but becomes irreversible afterwards. Transcription factors TFIIS for eukaryotic RNAP II

and GreA/B for bacterial RNAP have the ability to rescue an arrested RNAP in a backtracked state, by

cleaving off the RNA chain and facilitating transcriptional restart. Their mechanism of action is the

following (reviewed in [Conaway, et al., 2003; Sims III, et al., 2004; Nudler, 2009; Cheung, et al.,

2011]). Both TFIIS and GreA/B TFs possess a long protrusion which inserts in the secondary channel,

with a tip referred to as NTD (coiled-coil N-terminal domain) reaching the active center. NTD is thought

to provide a basic and two acidic residues interacting chemically with the active site [Nudler, 2009].

The acidic residues interact with MgA and mobilize MgB triggering a chemical reaction termed

pyrophosphorolysis (RNA hydrolysis, reverse of the polymerization reaction) resulting in the cleavage

of the RNA backtracked transcript. In other words, the factors allow separating the backtracked biased

chain from the non-backtracked chain, and this separation is done directly in the active site. The cleavage

reaction is driven by a two metal-ion-hydrolysis mechanism [Kettenberger, et al., 2003; Sosunov, et al.,

2003], which is identical to the two metal-ion mechanism driving the NTP addition cycle, with the fine

distinction that MgA binds the +1 RNA phosphate to align the scissile bond, in contrast to its binding

of the RNA 3’ -OH group during nucleotide addition [Cheung, et al., 2011]. The secondary channel can

accommodate both the transcript and the TF protrusion, while not impeding the expulsion of the

transcript. It is also hypothesized that the protein conformational changes induced by TF insertion

realign the RNA chain in the hybrid, therefore allowing forward elongation to resume [Kettenberger, et

al., 2004; Cheung, et al., 2011].

In 2011 [Cheung, et al., 2011] and 2013 [Martinez-Rucobo, et al., 2013], Cramer and colleagues

brought forward informative details. They suggested that the NTD charged residues might catalyze

proton transfer during the cleavage reaction. The researchers found that the backtracked RNA was gated

from the secondary channel by a tyrosine residue. They postulated that during backtracking, the RNA

chain bypasses the gating residue until it binds to a site in the sec. channel, termed backtrack site. They

proposed that TFs may facilitate reactivation by competing with the residues in the secondary channel

binding the extruded transcript (therefore helping to detach the chain) and by locking the trigger loop

away from the transcript. Their findings help to refine what is known about the sec. channel non-

diffusional properties (e.g., to shed some light on CH2 residues forming part of the backtrack site).

An additional error recovery mechanism has also been described [Zenkin, et al., 2006; Sydow, et al.,

2009 A; Sydow, et al., 2009 B; Wang, et al., 2009; Martinez-Rucobo, et al., 2013] where the RNA chain

can backtrack its aberrant tailing residue in reaction to an incorporation error, but where the enzymatic

complex does not need to be rescued by a transcription factor. Instead, an intrinsic cleavage phenomenon

occurs. The backtracking motion results in the positioning of the nascent 3’ end at a position termed “P”

(for proofreading site) by Wang et al. [Wang, et al., 2009], which corresponds to the +2 site of backtracked RNA,

where hydrolysis of the scissile phosphodiester bond is stimulated by the favorable chemical

configuration of the active site. According to Wang et al. [Wang, et al., 2009], TFIIS-stimulated cleavage occurs

more than 100 times faster in vivo than intrinsic cleavage. Therefore, one can consider

TF-stimulated cleavage as the main error recovery pathway. The intrinsic cleavage state is irrelevant to the

properties of the secondary channel, but is relevant for gauging dynamic error avoidance processes that could

occur in both the main channel and secondary channel loading models.

5. The ratchet issue

The hypothesis according to which NTPs load to the active site via CH2 in order to bind directly to the

DNA template register i + 1 was shown to be very consistent with a model depicting the translocation

mechanism and termed the Brownian-ratchet model. The latter model is largely accepted and seems to

be confirmed by a large body of experimental evidence. The main channel theory, on the other hand,

seems inconsistent with one of the postulates of the Brownian-ratchet model, which has resulted in a lot

of controversy. In this section, we will demonstrate that the evidence does indeed validate most of the

Brownian-ratchet model. However, a very important point will be raised: while the Brownian-ratchet model

is essentially correct, one of the following two axioms might be wrong: either the incoming NTP acts as the

ratchet bias in the active site, or the EC experiences several oscillations during processive

synthesis. We will show that the Brownian-ratchet evidence does not necessarily contradict the main

channel theory. In other words, while the secondary channel Brownian-ratchet model could be partially

erroneous, its main assumptions are probably right; a Brownian-energetic mechanism seems to be indeed

involved and is consistent with the main channel theory. We will first consider the translocation

background, theory and implications in general terms, and then take a closer look at the problem.

Following an early postulate about thermal energy fluctuations powering molecular motors, Guardajo

and Sousa in 1997 [Guardajo, et al., 1997], as well as Oster and Wang in 2002 [Oster, 2002; Wang, et

al., 2002], proposed that RNAP translocation was driven by a Brownian ratchet. More or less at the same

time, the secondary channel theory was formulated. The assumption according to which NTP substrates

diffuse through the secondary channel and load in the active site during the post-translocated EC, seemed

to be almost perfectly in line with the more general Brownian-ratchet model. Because the latter model

seemed to be validated by several lines of experimental evidence, it ironically seemed to validate the CH2 theory

in return. While the specific translocation model is still an open question at this stage, experimental

work generally agrees with the fact that translocation can oscillate (although whether it can oscillate in

the fast state or whether the oscillations are rapid or not, is still disputed), and with the fact that the

Brownian molecular storm seems to be the source of energy of the powerful translocation mechanism

(RNAP can be viewed as force-generating for this reason). The latter assumptions seem supported by

strong structural [Gnatt, et al., 1997; Westover, et al., 2004A; Westover, et al., 2004B; Wang, et al.,

2006; Brueckner, et al., 2008; Vassylyev, et al., 2007A], biochemical [Komissarova, et al., 1997A;

Komissarova, et al., 1997B; Bai, et al., 2004; Bar-Nahum, et al., 2005; Guo, et al., 2006; Damsma, et

al., 2007; Brueckner, et al., 2008; Hein, et al., 2011; Maoileidigh, et al., 2011; Malinen, et al., 2012;

Nedialkov, et al., 2012; Imashimizu, et al., 2013], statistical [Wang, et al., 1998; Tadigotla, et al., 2006;

Yu, et al., 2012], single-molecule [Abbondanzieri, et al., 2005; Bai, et al., 2007; Larson, et al., 2012;

Dangkulwanich, et al., 2013] and Molecular Dynamics [Woo, et al., 2008; Feig, et al., 2010; Da, et al.,

2011; Silva, et al., 2014] evidence. Furthermore, details about specific protein domains involved in the

translocation process have emerged, such as the contribution of the TL [Wang, et al., 2006; Vassylyev,

et al., 2007A; Feig, et al., 2010], the BH [Tan, et al., 2008; Weinzierl, 2010A; Weinzierl, 2010B;

Weinzierl, 2011; Kireeva, et al., 2012; Silva, et al., 2014] and the FLoop [Miropolskaya, et al., 2014].

The most popular model, in line with the secondary channel theory, relies on an elegant and simple

concept. The elongation complex oscillates spontaneously between two states: post-translocation and

pre-translocation, and the binding of an NTP in the former state would constitute the ratchet bias. Forward

elongation is triggered by a single and simple event: cognate NTP loading to the active site in the post-

translocated EC. Movies summarizing the whole process have been presented by Cramer et al. in

[Brueckner, et al., 2009; Cheung, et al., 2012] and Silva et al. in [Silva, et al., 2014].

The detailed process is the following. In the absence of NTP in the A site, RNAP slides back and forth

on the nucleic acids frame structure within a single base-pair interval. The EC can be considered to

oscillate freely between two states: the pre- and post-translocation states. The post-translocation process

drives the EC from the pre- to the post-translocated state, where tDNA register i + 2 shifts above the

bridge helix into the active site and occupies the i + 1 register, and the i + 1 register slides towards the

RNA transcript occupying the i register bound to the RNA 3’ end. During the pre-translocation process,

i + 1 register shifts to i + 2 and i register to i + 1. The template register that oscillates between the i + 1

and i + 2 registers is called the transition nucleotide (TN). The latter tDNA base slides back and forth

above the bridge helix. The translocation processes and states are to be differentiated. The pre-

translocated state occurs after the pre-translocation process has been completed and is precisely reached

when the EC has formed a particular geometry: some protein conformational changes have occurred

such as the −90° tDNA rotation and the straightening of the bridge helix. In the pre-translocation state,

access to the active site from CH2 is prevented because the RNA 3’ end (register i) has shifted into the

active site and because the bridge helix has partially invaded the active site. The post-translocation state

occurs precisely after the post-translocation process has occurred, when the tDNA strand has undertaken

a +90° rotation, the bridge helix has adopted a bent conformation, and the TN facing the secondary

channel becomes available for base-pairing. If an NTP loads in the A site during the post-translocated

state, the backward oscillation of the TN is disabled and a new oscillation is enabled. The TN now at

register i + 1 cannot shift backwards anymore. Instead the post-translocated template base i + 2

(equivalent to pre-translocated base i + 3) becomes the new transition nucleotide. As such, the loaded

NTP has incremented the ratchet one base-pair forward. More precisely, the ratchet-bias behavior of the

NTP can be considered as follows. While the NTP is inserted in the catalytic center and polymerization

chemistry occurs, backward translocation is impeded. Therefore, the translocation oscillation is biased

towards the forward motion. While the substrate is being added to the RNA 3’end, forward translocation

proceeds and the oscillation process is reset one template-base forward. Therefore, the nucleotide cycle

has occurred between two post-translocation events: the first one places the TN in the A site, the next

one shifts the next template register (i + 2) to the A site. It is also during this post-translocation 1 to post-

translocation 2 time window, that a base-pair in the DS bubble is melted (according to the main channel

theory, it would probably be i + 3 or i + 4), while a DNA pair is reassociated upstream. Interestingly,

and counter-intuitively, between post-translocation 1 and post-translocation 2, the EC will be

momentarily in the pre-translocated state (with the newly added substrate in the A site attached to RNA

3’end and kinked bridge helix) without having experienced any pre-translocation motion. It follows that

the pre-translocated state can be divided into two different categories: transient pre-translocated state

between two post-translocation motions and standard pre-translocated resulting from a pre-translocation

motion. In the absence of substrate, the translocation process is not reset one step forward after the

shifting of the TN into the A site, because the unbound template register does not allow the

upward pawl to be released; instead, the EC oscillates between the pre- and post-translocated states, where the TN successively

enters and leaves the catalytic cavity. Concerning the location of the DNA registers, the following

consideration is useful. The i + 2 base in pre-translocation (normal state) is equivalent to the i + 1 base in

post-translocation 1 (for free 1 increment oscillations), i + 2 in pre-translocation (normal state preceding

addition) is equivalent to i in post-translocation 2 (after addition of NTP) and i + 2 in pre-translocation

(transient state) is equivalent to i + 1 in post-translocation 2.
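
The oscillation-and-bias scheme described above can be illustrated with a toy simulation. The sketch below is a minimal two-state model written for illustration only, not a reproduction of any published kinetic scheme. The function name `simulate_ratchet`, the per-step probabilities `p_flip` and `p_bind`, and the decision to fold the transient pre-translocated state into a single forward increment are all assumptions that do not come from the cited studies.

```python
import random


def simulate_ratchet(steps, p_flip=0.5, p_bind=0.1, seed=0):
    """Toy Brownian ratchet: return the number of template registers advanced.

    The EC flips between 'pre' and 'post' with probability p_flip per step
    (free oscillation). When the EC is in 'post', an NTP may bind with
    probability p_bind; binding blocks the backward step and resets the
    oscillation one template base forward. Hypothetical simplification:
    the transient pre-translocated state is not modelled explicitly.
    """
    rng = random.Random(seed)
    state = "pre"
    register = 0
    for _ in range(steps):
        if state == "post" and rng.random() < p_bind:
            # NTP trapped: ratchet incremented, EC stays post-translocated
            # (post-translocation 2 in the terminology used above).
            register += 1
            continue
        if rng.random() < p_flip:
            # Free thermal oscillation between the two states.
            state = "post" if state == "pre" else "pre"
    return register


# Without substrate the EC only oscillates and never advances.
print(simulate_ratchet(1000, p_bind=0.0))
# With substrate, each binding event in the post state advances one register.
print(simulate_ratchet(1000, p_bind=0.1))
```

In this sketch the substrate-free case never increments the register, which mirrors the statement that the translocation process is not reset one step forward in the absence of substrate.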

An immediate question that can be raised is why translocation oscillates over a single base-pair

increment. The answer is that RNAP cannot slide on an interval of several nucleotides because it is

locked between two pawls: the upward pawl consisting of the post-translocated protein geometry

including the previously added NTP and the downward pawl consisting of the pre-translocated protein

geometry. This explanation, however, seems inconsistent with the fact mentioned above that an

NTP addition occurs between two consecutive post-translocation events. A further explanation is that

when the incoming matched NTP loads in the post-translocated EC, it triggers protein conformational

changes that unlock forward translocation. Therefore, not only does the loaded NTP bias the ratchet

towards forward translocation, but it also temporarily inactivates the upward pawl and consequently

allows one more round of forward translocation. In the secondary channel theory, the EC experiences

several translocation oscillations until the A site is bound by an NTP, allowing RNAP to increment its

cognate register one base forward.

The main channel theory implies an already bound TN, where several translocation oscillations appear

inconsistent with the NTP-TN pair binding in the active site and acting as the ratchet bias, because a

single forward translocation would then block backward translocation. On the other hand, the secondary

channel theory is consistent with several translocation oscillations, where the incoming NTP binds the

TN after being loaded in the active center via the secondary channel and where such a binding acts as

the ratchet bias. It follows that for the main channel theory to be correct, one of the following must be

incorrect: either the EC does not oscillate but only proceeds forwards, or the ratchet bias is not located

in the A site but it is a binding event in the downstream bubble that biases the ratchet to post-

translocation. However, both scenarios are consistent with a Brownian-ratchet mechanism. In the first

case, the elongation complex could be simply locked to the post-translocation mode, where backward

translocation is forbidden, but where the base pair entering the active site allows the upward pawl to be

shifted. Therefore, it is almost equivalent to the Brownian-ratchet model. The second scenario resembles

the Brownian ratchet mechanism even more closely, with the fine distinction that the ratchet bias trigger point

is located in the downstream channel, not in the A site.

In [Holmes, et al., 2003], Holmes and Erie suggest that binding of the NTP in the downstream channel

facilitates translocation by locking the EC in the post-translocation mode. In other words, allosteric NTP

would abort the translocation oscillations of the EC between pre- and post-translocation. That is to say

that the EC would shift from post-translocated state 1 to post-translocated state 2. The EC would not

experience backwards motion where the TN shifts behind the bridge helix. In contrast, the TN would

shift in a unilateral direction: forward shift where the TN slides in front of the bridge helix and becomes

the i + 1 template register. As mentioned above, this model is consistent with the Brownian ratchet

model if one of the postulates is put aside: the EC does not necessarily oscillate. The Molecular Dynamics

observations of translocation oscillations (e.g., [Silva, et al., 2014]) could then be explained by the fact

that the observed enzymatic complex is not in processive elongation. Indeed, the main channel theory

is consistent with translocation oscillations during non-processive transcription, because the allosteric

effect of downstream NTPs could not be accounted for and/or because the complex is substrate free, not

allowing sequential energy redirection triggering events to occur (triggered by interactions with NTP).

Also, consistent with the EC not oscillating are the observations that the pre-translocated state is

dominant, when no or scarce template NTP is present (e.g., [Kireeva, et al., 2008], [Dangkulwanich, et

al., 2013]). The postulate that translocation only proceeds forward, in normal transcription (four NTPs

present in solution, fast state) seems more plausible than translocation oscillations. This is inferred

because an oscillating, already bound NTP-dNMP pair at the TN position seems to pose a few issues. For in and

out motions to occur, the entering NTP would need to not bind to the A site (binding of MgB to Rpb1

D481, Rpb2 D837, and binding of NTP to the MgA site). For this to happen, the NTP polyphosphate tail

would need to be shielded from the A site. It is difficult to explain how this could occur, even when

considering the hypothesis that the PPi from the previous NAC stays in the A site for a while and

plays the role of shield or the hypothesis that the A site is shielded by active center geometry (e.g., by

the TL). An alternative solution could be that translocation oscillations are so strong that binding to the

A site does not trap the NTP, and that the enzymatic complex requires binding in the DS bubble (e.g.,

at i + 4 position) in order to bias the ratchet forward. However, NTP diffusion and hence binding in the

DS bubble is not rate limiting in the main channel theory if substrates are not provided at subsaturating

amounts, and therefore one can consider that immediately (to simplify) after a DS register becomes

available, it is paired. Consequently, the hypothesis according to which a binding event in the DS bubble

biases the ratchet forward is inconsistent with the hypothesis of several translocation oscillations. It

follows that the whole assumption of translocation oscillations can be eliminated if the main channel

theory is correct, because it is hard to imagine what would trigger the ratchet forward if it is not an NTP

binding event. In conclusion, the solution of forward translocation locking seems much simpler and

therefore is probably the right solution. Furthermore, forward translocation locking fits extremely well

in a general and extended model of translocation (explored in chapter 5). Also, a subsequent conclusion

is that the observation of pre-translocated states in experiments is consistent with forward locking,

because translocation oscillations can occur in the absence of substrate and because pre-translocation

can occur in reaction to a misloading event, an incorporation error, or during pause/arrest (e.g., triggered

by specific DNA sequence). Furthermore, we have seen that there exists a transient pre-translocated

state that does not originate from any pre-translocation motion.

Burton and colleagues in [Zhang, et al., 2004] and [Burton, et al., 2005] claim that the allosteric effect

of NTP on transcription means that the downstream dNMP-NTP pair drives forward translocation. In

particular, Burton et al. claim in [Zhang, et al., 2004] that “the dNMP-NTP basepair is thought to drive

RNA DNA hybrid displacement”. The following objections could be raised. First, the allosteric effect

of downstream NTP on transcription and hypothetically on translocation is not equivalent to the axiom

that NTP drives translocation. In fact, NTPs could very well facilitate the decoupling of the Brownian

energy in order to accelerate forward translocation, without providing any additional energy. Next,

downstream NTPs could attenuate a rate-limiting factor (e.g., PPi release) distinct from the hypothetical

rate-limiting forward translocation process and therefore allow translocation indirectly (and hence

transcription) to accelerate, without directly driving the translocation.

To summarize the ratchet issue discussed above, the NTP-translocation and allosteric models might

be right in that binding of an NTP has an allosteric effect on translocation, but seem wrong

when they imply that translocation could be energetically NTP-driven. The Brownian ratchet model,

although it might assume wrong hypotheses such as substrate diffusion through the secondary channel,

and spontaneous translocation oscillations, might be correct concerning the source of energy and might

describe translocation occurring in a substrate free enzyme. The main channel theory seems to be

consistent with a Brownian ratchet mechanism. More importantly, because the forward locking

postulate, which is inconsistent with CH2 theory, seems to fit particularly well in an extended model of

translocation (described in chapter 5), the main channel theory gains very serious credibility.

After these general considerations, let us have a closer look at the mechanism. Although it could appear

that there is an argument about the fundamental details of translocation, this is not necessarily the case

when closely observing the conditions under which the process oscillates or not. In particular, if

biochemical and MD experiments are investigated more thoroughly and given priority over other

experimental methods such as X-ray data (which are by nature more static and therefore less informative), the following

picture emerges. The literature is actually consistent with the EC oscillating, but in non-processive (i.e.

not fast) elongation and subsaturating/null substrate concentration [Bar-Nahum, et al., 2005; Feig, et al.,

2010; Silva, et al., 2014; Dangkulwanich, et al., 2013]. Indeed, if for some reason i + 2 is not bound

(subsaturating substrate concentrations, substrate-free enzyme, absence of the i + 2 NTP, etc.), then there

is no obstacle in the CH1 model as to why translocation would not oscillate. Consistent also with the

idea of i + 2 not being bound at subsaturating concentrations is the fact that NTP binding is rate limiting

if not supplemented at sufficient amounts [Bai, et al., 2004; Tadigotla, et al., 2006; Bai, et al., 2007;

Dangkulwanich, et al., 2013]. A second point to investigate further is whether the literature data are actually

consistent with translocation being locked forward when NTP binding is not rate limiting. Of particular

interest is the study of [Dangkulwanich, et al., 2013] where the researchers were able to derive almost

all the kinetic parameters related to translocation in a very precise manner.

The authors obtain the forward kinetic parameter kf by solving the following equations.

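$$k_f = k_0 \exp\left(\frac{F \Delta}{k_B T}\right) \tag{1}$$

$$k_b = k_0 \exp\left(-\frac{F (1 - \Delta)}{k_B T}\right) \tag{2}$$

$$\psi(t) = \left(\frac{k_f}{k_b}\right)^{-0.5} \cdot \frac{\exp\left(-(k_f + k_b)\,t\right)}{t} \cdot \left(2t \left(\frac{k_f}{k_b}\right)^{-0.5}\right)^{-1} \tag{3}$$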

Where, 𝑘𝑓, 𝑘𝑏, 𝑘0 are the forward, backward and intrinsic zero-force stepping rate constants

respectively, 𝐹 is the assisting or opposing force, ∆ is the transition state distance at each step, 𝑇 is the

temperature, 𝑘𝐵 is the Boltzmann constant, 𝑡 is time.
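
Equations (1) and (2) can be evaluated numerically to see how an assisting or opposing force skews the two stepping rates. The sketch below is illustrative only. It assumes that ∆ is expressed as a fraction of one base-pair step, that one step corresponds to 0.34 nm, and that the thermal energy is about 4.11 pN·nm (roughly 298 K). The function name `stepping_rates` and the input values are hypothetical and are not taken from the cited study.

```python
import math

KBT_PN_NM = 4.11  # assumed thermal energy near 298 K, in pN*nm
BP_NM = 0.34      # assumed rise per base pair, in nm


def stepping_rates(k0, force_pn, delta):
    """Return (kf, kb) following equations (1) and (2).

    delta is treated as a fraction of one base-pair step (assumption), so
    the work terms are F*delta*BP_NM and F*(1 - delta)*BP_NM.
    """
    kf = k0 * math.exp(force_pn * delta * BP_NM / KBT_PN_NM)
    kb = k0 * math.exp(-force_pn * (1.0 - delta) * BP_NM / KBT_PN_NM)
    return kf, kb


# Illustrative values only: 5 pN assisting force, transition state halfway.
kf, kb = stepping_rates(k0=1.0, force_pn=5.0, delta=0.5)
print(kf, kb, kf / kb)
```

Under these assumptions the ratio kf/kb equals exp(F·d/kBT), with d the 0.34 nm step, regardless of the value chosen for ∆, and both rates reduce to k0 at zero force.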

They assume, in their model of the probability density of pause durations ψ(t), that a pause

is equivalent to diffusion in one direction followed by a return to the original position. They deduce from

this the distribution of pauses. To simplify, consider the following reasoning. The shorter the detected

pause, the smaller the probability that it has occurred (<0.2 for a pause <0.5s); then the longer the pause,

the higher the probability that it has occurred (if the pause >4s, the probability is >0.8). If the pause is

longer than 10s, the probability converges to certainty.

They compute the cumulative distribution of the pause duration probabilities (converging to 1), which yields k0.

They then solve (1) and find 𝑘𝑓.

Inputting 𝑘𝑓 in the equation describing the forward kinetic parameter when a nucleosome barrier is

present gives the factor 𝛾𝑈 (fraction of the time the nucleosomal barrier is unwrapped):

𝑘𝑓(𝑛𝑢𝑐𝑙) = 𝛾𝑈. 𝑘𝑓

The researchers then calculate the forward translocation rate 𝑘1 and the catalysis rate 𝑘3, by using the

following trick: the nucleosome roadblock induces an asymmetry in the kinetic equations below

which allows 𝑘1 to be separated from 𝑘3:

$$V_{max}(nucl) = \frac{\gamma_U k_1 k_3}{\gamma_U k_1 + k_3} \cdot d$$

$$V_{max} = \frac{k_1 k_3}{k_1 + k_3} \cdot d$$

Where 𝑉𝑚𝑎𝑥(𝑛𝑢𝑐𝑙) and 𝑉𝑚𝑎𝑥 are the maximal pause-free velocities in the presence and absence of

nucleosomal DNA, and where 𝑑 is the stepping distance.
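
The separation of k1 from k3 can be made explicit by inverting the two velocity expressions. Taking reciprocals gives d/Vmax = 1/k1 + 1/k3 and d/Vmax(nucl) = 1/(γU·k1) + 1/k3, so subtracting one from the other isolates 1/k1. The sketch below implements this inversion. The function names `pause_free_vmax` and `separate_rates` and the value γU = 0.5 used in the round-trip check are hypothetical choices for illustration, and the code is not claimed to be the procedure used by the authors.

```python
def pause_free_vmax(k1, k3, d=1.0, gamma_u=1.0):
    """Pause-free velocity from the two-step scheme; gamma_u=1 means no nucleosome."""
    return (gamma_u * k1 * k3) / (gamma_u * k1 + k3) * d


def separate_rates(vmax_free, vmax_nucl, gamma_u, d=1.0):
    """Recover (k1, k3) from the two velocities, assuming gamma_u is known and != 1."""
    inv_free = d / vmax_free   # equals 1/k1 + 1/k3
    inv_nucl = d / vmax_nucl   # equals 1/(gamma_u*k1) + 1/k3
    inv_k1 = (inv_nucl - inv_free) / (1.0 / gamma_u - 1.0)
    inv_k3 = inv_free - inv_k1
    return 1.0 / inv_k1, 1.0 / inv_k3


# Round-trip check with illustrative inputs (gamma_u = 0.5 is hypothetical).
v_free = pause_free_vmax(112.0, 35.0)
v_nucl = pause_free_vmax(112.0, 35.0, gamma_u=0.5)
print(separate_rates(v_free, v_nucl, gamma_u=0.5))
```

The inversion only works because γU multiplies k1 and not k3, which is the asymmetry mentioned above.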

In the end, the following important rates are calculated.

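$$1/k_1\ (\text{post-translocation}) = 1/(112\ \mathrm{s}^{-1}) = 0.0089\ \mathrm{s}$$

$$1/k_3\ (\text{catalysis}) = 1/(35\ \mathrm{s}^{-1}) = 0.02857\ \mathrm{s}$$

$$\text{Elongation (pause-free velocity, 2 mM NTPs)} = 26.7\ \mathrm{nt \cdot s^{-1}} = 0.037453\ \mathrm{s \cdot nt^{-1}}$$

It is worth noticing that $1/k_1 + 1/k_3 \approx 0.0375\ \mathrm{s}$, which is very close to $0.037453\ \mathrm{s \cdot nt^{-1}}$. Hence, standard error put aside, the elongation time per nucleotide

is virtually equivalent to translocation time plus catalysis time. Their study shows that elongation is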

indeed locked forward (𝑘1 is the forward translocation rate), and that NTP binding is not rate limiting

(given amounts not diverging far from physiology). Their results seem to fit virtually perfectly with a

locked post-translocation model of elongation (given that NTPs are supplemented in an amount

comparable to physiology), and consequently with the CH1 model. Another recent experimental work

is consistent with translocation being locked forward in the fast state [Nedialkov, et al., 2012]. Finally,

as mentioned in the main channel theory section, recent single-molecule studies also corroborate that

there exists a secondary binding site that is not related to the translocation state [Larson, et al., 2012;

Dangkulwanich, et al., 2013].
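
The additivity of translocation and catalysis times discussed above can be checked with a few lines of arithmetic. The sketch below assumes that the quoted values correspond to k1 = 112 s⁻¹ and k3 = 35 s⁻¹, as implied by the reciprocals 1/112 and 1/35, and compares the summed dwell times with the time per nucleotide derived from the 26.7 nt/s pause-free velocity. The variable names are illustrative.

```python
k1 = 112.0       # forward translocation rate in s^-1 (implied by 1/112 in the text)
k3 = 35.0        # catalysis rate in s^-1 (implied by 1/35 in the text)
velocity = 26.7  # pause-free velocity in nt/s at 2 mM NTPs

time_per_nt = 1.0 / velocity      # seconds spent per added nucleotide
dwell_sum = 1.0 / k1 + 1.0 / k3   # translocation time plus catalysis time

print(round(time_per_nt, 6), round(dwell_sum, 6))
print(abs(time_per_nt - dwell_sum))
```

The two quantities agree to within about 5 × 10⁻⁵ s. This agreement is what the expression for Vmax predicts when the stepping distance d is one nucleotide, since d/Vmax = 1/k1 + 1/k3.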

Taking these elements together, we can hypothesize that it is indeed the trapping of the i + 2 NTP at the i + 1

position during post-translocation that constitutes the very first step that will lead to the ratchet being

incremented forward and that translocation is locked forward in normal processive elongation. This

resolves the problem of the NTP leaving the catalytic center and coming back to CH1, where the

mechanism would not be rectified and the upward pawl would not be unblocked. Hence, while not

invalidating CH2, it is at least consistent with CH1, where an already bound i + 2 register seems not

easy to reconcile with rapid oscillations. However, if for some reason i + 2 is not bound (subsaturating

substrate concentrations, substrate-free enzyme, absence of the i + 2 NTP, etc.), then there is no obstacle

in the CH1 model to translocation oscillating: free translocation oscillations in the

absence of substrate (binding to i + 2) do indeed appear to be correct. In chapter 5, a general

translocation/NTP loading mechanism will be proposed.

6. The melting issue and details on cTFs

Several studies have suggested that substrate pre-binding in the main channel is impossible,

as DNA strands were shown to be fully associated (the opposite is referred to as melted) up to the i

+ 2 or i + 3 registers. For example, Vassylyev et al. and Kashkina et al., on the basis of structural and

biochemical data, have proposed that i + 2 was paired [Vassylyev, et al., 2007A; Vassylyev, et al.,

2007B; Kashkina, et al., 2007]. The melting evidence seemed to confirm the secondary channel as the

only possible pathway. It is worth mentioning that a paired i + 2 register seems rather inconsistent with

free translocation oscillations during fast transcription, as the repeated breaking of the hydrogen bonds

between the ntDNA and tDNA strands at the i + 2 register would appear to be too energetically costly. It follows that

if the i + 2 register were really paired (a claim that will be proven wrong in this chapter), it would probably

only leave forward translocation locking as a plausible option anyway. In this section, we will

investigate the theory of strand separation, before analyzing structural and footprinting biochemical

experiments which could be informative about downstream DNA association and finally expanding on

the role of transcription factors, which appear to both play a key role in DNA melting and to impose

new conditions in order to decide between the modes of substrate channeling to the catalytic center.

Before proposing an extended theory of strand separation, let us demonstrate that the mechanism is

likely to be universal, at least for bacterial RNAP and eukaryotic RNAP II, and gain insight into

the relative positions of the strands. In 2009, Andreacka et al. using single-molecule Fluorescence

Resonance Energy Transfer (smFRET) resolved the trajectory of the ntDNA strand in yeast RNAP II

[Andreacka, et al., 2009]. They found that the ntDNA strand passed above the lobe region (Rpb2 272-278),

close to the rudder (Rpb1 305-324) residues 309-315, near FL1 (Rpb2 461-480) residue 471, that the nt and

tDNA strands separated near FL2 (Rpb2 501-511) residue 504, and that most residues were conserved

in human RNAP II. In 2012, Zhang et al. resolved the structure of Tt RNAP IC using a complete ntDNA

strand (PDB#4G7O, [Zhang, et al., 2012]). The ntDNA strand was resolved over its full length and its

trajectory is very consistent with the FRET results from Andreacka and colleagues, where the strand

shifts at 90° from the tDNA strand near register i + 2, pointing towards the inside of the enzyme, before

looping backwards and perpendicularly above the tDNA strand and running outside of the enzymatic

complex. Therefore, one can reasonably postulate that the mechanism of strand separation is universal.

In 2011, Kireeva et al. [Kireeva, et al., 2011], using an RNAP mutant lacking the FL2 loop interacting

with i + 2, showed that FL2 did not play a significant role in melting. It is commonly believed that

electrostatic “switches” adjacent to the DS bubble are responsible for DNA melting. For instance, in

2004, Kettenberger and colleagues ([Kettenberger, et al., 2004]) proposed that three Rpb1 positively

charged residues (R326, K330, and R337) belonging to switch region 2 and three Rpb1 negatively

charged residues (E1403, E1404, and E1407) belonging to switch region 1 could separate the strands

near i + 2 to i + 4 registers, with the negative amino-acids repelling the ntDNA strand, while the positive

amino acids pulled the tDNA strand away from the helix axis.

In this paragraph, let us propose a revised and extended theory of strand separation based on the

analysis of yeast RNAP II structure (PDB#2E2H, [Wang, et al., 2006]). Observation of the electrostatic

configuration of amino acids near DNA register i + 2 makes it possible to propose an “electrostatic fork” theory

of strand melting. The electrostatic fork comprises three zones of charged amino acids. Zone 1 consists

of Rpb1 residues K1102 (ε TL), R840 (ε BH), R1386 (ε switch 1) and attracts ntDNA strand downwards

and towards the left (towards inside of enzyme). Zone 3 comprises Rpb1 residues R839 (ε BH), K330,

K332, R337, and attracts the tDNA strand on the right further upstream. And zone 2 consisting of Rpb1

negatively charged residues E1403, E1404, E1407, all belonging to the switch 1 region, creates a buffer

area preventing the tDNA strand from being attracted towards zone 1 and the ntDNA strand from being attracted to zone 3.

Rpb1 residue E884 (ε BH) appears to play a subtle role, pushing tDNA towards zone 1 and pushing

ntDNA strand away. The principle is summarized in Figure 4.

Figure 4: Electrostatic Fork melting mechanism. Key junction area is represented, where tDNA and ntDNA

strands melt. Left figure displays separation of tDNA (light blue vdw representation) and ntDNA (cyan vdw

representation) and is taken from the PDB#2E2H ([Wang, et al., 2006]) RNAP II structure. Electrostatic zone 1

pulling tDNA towards direction A, electrostatic zone 3 deviating ntDNA towards direction B (allowing

looping above tDNA), and electrostatic zone 2 creating a wedge region between zones 1 and 3, are shown in

magenta, light pink and dark pink surface representations respectively. tDNA i + 2 register is indicated in

yellow. Right figure displays the same information as the left figure, with the distinction that the specific

electrostatic residue indices are indicated and that the residues are represented as cubes, which makes it possible to

simplify and characterize the Electrostatic Fork region as consisting of two attractive and one repelling

layers. DNA strands are represented as ribbons.

In addition to this key fork junction region, other residues appear to subtly guide the DNA

strands. Downstream deviation of the ntDNA strand is initiated around registers i + 9 to i + 11 thanks to

Rpb1 R175, K100 and Rpb2 R337 residues. Upstream guidance of the ntDNA strand, notably in order

to initiate perpendicular looping of the chain above the T strand, is performed by Rpb1 TL residues

R1100, K1109 and K1112. Meanwhile, downstream deviation of the tDNA strand could be initiated

around i + 6 to i + 11 positions by Rpb2 residues K228, K257, R261, K277 (ϵ lobe region) and K471 (ϵ

FL1). Furthermore, amino acids K228, K257, K277 and K471 could have the double purpose of guiding

the upstream section of the ntDNA strand (positions i to i – 6) above the tDNA strand.

From the above electrostatic model of strand separation, it follows that the downstream bubble needs to

close sufficiently for optimal DNA melting, in order to bring the DNA helix and the

electrostatic protein residues close together. Another factor to be considered is that temperature might play

a direct role in promoting DNA melting. In 1983, Kirkegaard et al., [Kirkegaard, et al., 1983], using

cytosine methylation DNA footprinting found that melting of an Ec RNAP IC was strongly dependent

on temperature.

The crystallographic experiments performed on single-subunit RNAP, bacterial RNAP and eukaryotic RNAP

II, which could be informative about downstream DNA association, will now be reviewed. Several remarks

are to be stated before investigating the structural data. Although the i + 2 base in pre-translocation is

equivalent to the i + 1 base in post-translocation, translocation conserves the relative positions of the

bases. For example, if i + 2 is melted in pre-translocation, then the same position will also be melted in

post-translocation, as the relative position of the tDNA and ntDNA strands will not change. Only the position of

the i + 1 register relative to the RNA 3′ end will change, as RNAP moves away from the RNA in

post-translocation and towards the RNA in standard pre-translocation, whereas in transient pre-translocation

the catalytic site is occupied by the newly added NTP (see ratchet issue above). Therefore, in this sub-section,

numbering of the nucleic registers will ignore the translocation state in order to focus on the melting

properties. In addition, unresolved RNA and DNA registers that are positioned outwards, near the

external surface of the enzymatic complex, will be ignored when resolution of the nucleic bases is

discussed, as they do not bring informative detail about DNA strand separation in the DS bubble. The

resolution of tDNA registers will be ignored as in almost all the structures they are resolved due to their

stabilization by the wall of the downstream bubble. Particular focus will be given to the molecular

resolution of ntDNA registers close to the active site, because when a base is resolved or unresolved, it

corresponds to a well-ordered or disordered base respectively, which can by extension give insight into

strand melting. In other words, if the base is not resolved after electron density refinement and if the

length of the non-template strand used in the crystallographic experimental procedure included the latter

base, it means that the strands might be unpaired at this position, because one would imagine that strand

association stabilizes the ntDNA strand. However, this is not definite evidence as a NT base could

be disordered (i.e. mobile) while being associated. Stronger evidence of melting is obtained when a ntDNA base is

resolved and observed melted.
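
The register bookkeeping described above can be illustrated with a minimal sketch. The function names below are hypothetical, and a register i + n is represented by the integer n. The only rule taken from the text is that the i + 2 base in pre-translocation is the i + 1 base in post-translocation, so that relabelling changes the name of a base pair but not its pairing state. The pairing values in the example are made up for illustration.

```python
# Minimal sketch of register relabelling between translocation states.
# Function names are hypothetical; a register "i + n" is stored as the integer n.

def pre_to_post(register: int) -> int:
    """Relabel a register from pre- to post-translocation numbering."""
    return register - 1


def post_to_pre(register: int) -> int:
    """Relabel a register from post- to pre-translocation numbering."""
    return register + 1


def relabel_pairing(paired_pre: dict[int, bool]) -> dict[int, bool]:
    """Carry pairing states (True = paired) over to post-translocation labels."""
    return {pre_to_post(reg): paired for reg, paired in paired_pre.items()}


# Illustrative (made-up) pairing states in pre-translocation numbering.
pairing_pre = {1: False, 2: False, 3: True, 4: True}
pairing_post = relabel_pairing(pairing_pre)

# i + 2 (pre) and i + 1 (post) name the same physical base pair,
# so the pairing state is carried over unchanged.
assert pairing_post[pre_to_post(2)] == pairing_pre[2]
```

This is why the translocation state can be ignored in this sub-section when only the melting properties of a given base pair are discussed.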

First, let us analyze crystallographic/electron density refinement data in favor of i + 2 pairing. In 2002,

Tahirov et al. ([Tahirov, et al., 2002], PDB#1H38), as well as Temiakov et al. in 2004 ([Temiakov, et

al., 2004], PDB#1S0V), resolved the atomic coordinates of viral T7 RNAP EC using a tDNA, ntDNA

and RNA strand template of 18, 10 and 8 base lengths respectively. The tDNA and RNA strands can be

considered as complete, whereas the ntDNA strand stops at i + 1. Observation of PDB#1H38 and PDB#1S0V shows

that the ntDNA was resolved over its full length, up to i + 1, and that the DNA strands are associated up to i + 2. In the

former PDB structure, ntDNA i + 2 base is slightly shifted relative to the opposite tDNA register, as the

base competes with protein residue F644, and could be considered as partially melted. In the latter

structure, the downstream DNA bases are ill-aligned (helix keeps its canonical form but base moiety-

hydrogen bonds are out of plane), which indicates that downstream DNA is partially disordered. From

2007 to 2012, several crystallographic experiments were performed on Tt RNAP, as summarized below.

In 2007, Vassylyev et al. generated PDB#2O5I ([Vassylyev, et al., 2007A]) and PDB#2O5J

([Vassylyev, et al., 2007B]) using tDNA and RNA templates which can be considered as complete and

a ntDNA template stopping at i + 1 register. In both structures, downstream DNA duplex is observed

well-ordered and paired up to i + 2 register. In the Tt RNAP IC from Zhang et al. with ntDNA resolved

virtually on its full length [Zhang, et al., 2012], DS registers were observed paired up to i + 2 and well-

ordered. DNA strands were mismatched between i + 1 and i – 6 positions, which did not appear to affect

the downstream DNA stability. Finally, a structural study of yeast RNAP II from Cheung and Cramer

in 2011 [Cheung, et al., 2011], showed paired i + 2 register in arrested RNAP II EC (PDB# 3PO2), with

a ntDNA resolved up to i + 1 register, using a nucleic template stopping and containing a mismatch at i

+ 1 position.

Next, let us review structural data which does not display downstream nucleic association and therefore

could support i + 2 melting. In 2013, Weixlbaumer et al. generated two sets of atomic coordinates for a

Tt RNAP paused EC (PDB#4GZY, 4GZZ, [Weixlbaumer, et al., 2013]), using a ntDNA strand stopping

at i + 2 register. Both structures are virtually identical and display ntDNA bases paired and resolved up to i

+ 4, as well as a largely open downstream bubble. The i + 2/i + 3 bases are not resolved, which could be consistent

with the DNA pair being melted at these positions. In addition, from 2001 to 2011, several

yeast RNAP II structures were generated that could support downstream unwinding. In 2001, Gnatt and co-researchers

conducted a crystallographic study of RNAP II (PDB#1I6H, [Gnatt, et al., 2001]) using ntDNA that can

be considered as complete (stops at i – 10). The authors proposed that the strands were melted from the i +

4 register and upstream because their electron density data only exhibited double-helix DNA up to i +

5. However, the evidence for this is not strong, as the electron density data was weak and discontinuous,

allowing only an approximate localization of double-stranded downstream DNA. In 2004, Kettenberger

et al., using a full ntDNA strand, resolved the bases of the latter chain up to i + 3 (PDB#1Y77, 1Y1W,

[Kettenberger, et al., 2004]). The structure consisted of a TFIIS-bound RNAP II. Although the presence of

a mismatch at the i + 2 position in the nucleic template does not allow a conclusion to be drawn

concerning this register, the fact that H-bond alignment deviation occurs from register i + 4 could indicate

partial melting from i + 4 and upstream. Westover et al. in 2004, and Wang et al. in 2006, using the

same nucleic template consisting of ntDNA running up to i + 5 position, generated PDB#1R9T

([Westover, et al., 2004A]) and 2E2H ([Wang, et al., 2006]) respectively, which both displayed the

following: the i + 6 DNA bases were misaligned, indicating a possible initiation of deviation, and the i + 5 base of

the ntDNA chain was resolved and observed melted. Brueckner and colleagues solved the structure of an

RNAP EC in 2008 (PDB#2VUM, [Brueckner, et al., 2008]) with a ntDNA strand stopping at i + 3. The i +

4 position was observed paired, yet it is to be noted that RNAP II was bound to α-amanitin. Because the i +

3 position was not detected, one can postulate its melting. In 2011, Cheung and Cramer generated a

second set of atomic coordinates using the same nucleic template and experimental setup as described in

the previous paragraph, which corresponded to a RNAP II reactivation intermediate (PDB#3PO3, [Cheung,

et al., 2011]). This time, the i + 2 ntDNA position was not resolved, indicating its possible melting.
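
The observations reviewed in the last two paragraphs can be gathered in a small table, as a reading aid rather than as a new analysis. The two labels used below are a deliberately coarse simplification introduced for this sketch: "paired" groups the structures cited in favor of i + 2 pairing (including the slightly shifted base of PDB#1H38), and "melting_consistent" groups those cited as possibly supporting melting. The PDB identifier used for the structure from Zhang et al. is the one given in the caption of Figure 5, and the variable and function names are hypothetical.

```python
from collections import Counter

# (PDB ID, RNAP source, coarse reading of the downstream region as grouped in the text).
# The two labels are a simplification made for this sketch only.
STRUCTURES = [
    ("1H38", "T7", "paired"),              # Tahirov et al., 2002 (i + 2 slightly shifted)
    ("1S0V", "T7", "paired"),              # Temiakov et al., 2004
    ("2O5I", "Tt", "paired"),              # Vassylyev et al., 2007
    ("2O5J", "Tt", "paired"),              # Vassylyev et al., 2007
    ("4G7O", "Tt", "paired"),              # Zhang et al., 2012
    ("3PO2", "Sc", "paired"),              # Cheung and Cramer, 2011
    ("4GZY", "Tt", "melting_consistent"),  # Weixlbaumer et al., 2013
    ("4GZZ", "Tt", "melting_consistent"),  # Weixlbaumer et al., 2013
    ("1I6H", "Sc", "melting_consistent"),  # Gnatt et al., 2001
    ("1Y77", "Sc", "melting_consistent"),  # Kettenberger et al., 2004
    ("1Y1W", "Sc", "melting_consistent"),  # Kettenberger et al., 2004
    ("1R9T", "Sc", "melting_consistent"),  # Westover et al., 2004
    ("2E2H", "Sc", "melting_consistent"),  # Wang et al., 2006
    ("2VUM", "Sc", "melting_consistent"),  # Brueckner et al., 2008
    ("3PO3", "Sc", "melting_consistent"),  # Cheung and Cramer, 2011
]


def tally(structures):
    """Count structures per (RNAP source, reading) pair."""
    return Counter((source, reading) for _pdb_id, source, reading in structures)


counts = tally(STRUCTURES)
for (source, reading), n in sorted(counts.items()):
    print(f"{source:3s} {reading:20s} {n}")
```

The tally only restates how the structures were grouped above; it does not weigh the quality of the underlying electron density data.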

The structural data presented above seems puzzling. In the viral RNAP structure from Tahirov et al.

([Tahirov, et al., 2002]), ntDNA base i + 2 is slightly shifted relative to the opposite tDNA register, as

the base competes with protein residue F644, and could be considered as partially melted. On the

other hand, Temiakov et al.’s structure ([Temiakov, et al., 2004]) displays i + 2 association. For T.

thermophilus RNAP, some experiments seem to support associated i + 2 register ([Vassylyev, et al.,

2007A; Vassylyev, et al., 2007B; Zhang, et al., 2012]), while others support the possibility of its melting

([Weixlbaumer, et al., 2013]). The same holds for yeast RNAP II, where the PDB#3PO2 structure ([Cheung,

et al., 2011]) supports i + 2 association and where the structures listed in the previous paragraph are

consistent with i + 2 melting. Recent developments even display up to i + 6 melting in a complete

transcription bubble [Barnes, et al., 2015]. In this paragraph, we will resolve this apparent dilemma and

demonstrate that the structural data is particularly inconclusive concerning DNA melting. First, all the

structures display a downstream bubble that is reasonably or largely open. However, as mentioned in

the theory of strand separation, it is possible that the downstream part of the main channel needs to close

sufficiently in order to trigger the electrostatic separation mechanism (as DNA is to be brought close

enough to the key electrostatic protein residues). More importantly, the three-dimensional configurations

resolved by the X-ray and electron density studies are partly unnatural due to crystal packing

(mechanical constraint applied to certain domains between adjacent RNAPs in crystals) and/or

temperature (low, unphysiological temperatures are used in order to prepare the crystals). As discussed

in the theory of strand separation, melting could be dependent on temperature. It is worth mentioning

that the physiological temperature at which T. thermophilus grows is 65 °C, which is very far from the

experimental conditions. It cannot be excluded that the RNAP of this particular organism requires a

higher temperature to initiate DS bubble closing. In addition, let us propose a hypothesis concerning the

inconsistency of base resolution in experiments. Close observation of the T. thermophilus RNAP and

yeast RNAP II structures shows that, in the structures where the i + 2 base is resolved, the FL2 domain is in close

proximity (see Figure 5 for bacterial RNAP and Figure 6 for yeast RNAP II). It follows that FL2

stochastically promotes stabilization of DS DNA and hence its resolution. Another possibility, although

unlikely, for association being observed when FL2 closes on i + 2 ntDNA base is that the strands could

be melted when FL2 does not close and interaction with FL2 brings them together. In any case, the FL2

stochastic interaction explanation does not contradict the fact mentioned above that the domain is not

involved in strand separation. The domain only seems to allow stabilization of bases in crystallographic

experiments allowing their resolution. In other words, the discrepancy between the studies might be due

to the stochastic stabilization of the bases by protein domains. For the structures from Westover et al.

and Wang et al. ([Westover, et al., 2004A; Wang, et al., 2006]), it is to be noted that i + 5 is probably

resolved (although DNA is disordered) because it forms an electrostatic interaction with one of the residues

of the trigger loop, which reinforces the idea that ntDNA base resolution requires interaction with the

protein structure. Non-resolution of the ntDNA strand or deviation of bases is inconclusive as their

resolution only requires stochastic stabilization by protein domains, and observation of i + 2

association is also inconclusive as the topology and experimental temperatures strongly distort the

structure and do not allow normal melting to occur.

Figure 5: Comparison of FL2 interaction with downstream DNA in Tt RNAP. FL2 domain and protein

walls are represented as lime and grey surfaces respectively. tDNA and ntDNA are represented as red and

green ribbons. i + 2 tDNA register is indicated in blue. A) RNAP EC structure from [Vassylyev, et al.,

2007B] (PDB#2O5I) displays strong stabilization of i + 2 positions with FL2. B) RNAP IC structure from

[Zhang, et al., 2012] (PDB#4G7O) displays strong interaction between FL2 and i + 2 register. C) RNAP

paused EC structure from [Weixlbaumer, et al., 2013] (PDB#4GZZ), displays deviation of NT-strand near

i + 2 register and probably corresponds to a weak interaction of t and ntDNA i + 2 register with FL2 domain.

Figure 6: Comparison of FL2 interaction with downstream DNA in Sc RNAP II. FL2 domain and protein

walls are represented as lime and grey surfaces respectively. tDNA and ntDNA are represented as red and

green ribbons. i + 2 tDNA register is indicated in blue. A) RNAP EC structure from [Cheung, et al., 2011]

(PDB#3PO2) displays strong stabilization of i + 2 positions with FL2. B) RNAP EC structure from [Cheung,

et al., 2011] (PDB#3PO3). This structure and C, E and F display weak interaction between FL2 and i + 2

registers. C) RNAP paused EC structure from [Westover, et al., 2004B] (PDB#1R9T). Nucleic acids are

indicated in CPK representation instead of ribbons because the strands are too distorted in the initial

structure. D) RNAP EC from [Kettenberger, et al., 2004] (PDB#1Y77). It is to be noted that the last ntDNA

strand base is i + 3 position, and that i + 2 ntDNA would probably position in front of tDNA register i + 2

(represented in blue ribbon). FL2 interaction is close to A) but i + 2 ntDNA base was not resolved. A possible

explanation could be that FL2 shape near the extremity of ntDNA is concave, while for the tDNA strand it

is convex, inducing an unstable interaction. Otherwise, distribution of electrostatic charges might disfavor

ntDNA strand interaction. E) RNAP EC from [Wang, et al., 2006] (PDB#2E2H). F) RNAP EC from

[Brueckner, et al., 2008] (PDB#2VUM).

Now, let us investigate biochemical experiments tackling the DNA melting issue. In 2007, Kashkina et

al. [Kashkina, et al., 2007], proposed that multi-subunit RNAPs did not melt any downstream base-pairs

and therefore that the main channel theory could not be right. The downstream melting was detected

using the following biochemical approach. A template strand scaffold was modified with a pyrrolo-

cytosine (pC) or 2-aminopurine fluorescent base analogue at i + 1, i + 2, i + 3 or i + 4 position. In case

of stacking with adjacent bases, which is thought to be strengthened when the DNA is double-stranded,

the fluorescent base is quenched. Therefore, strand separation is detected by the appearance of high fluorescence.

The researchers proposed that only the i + 1 register was melted and that the main channel theory should be

discarded because, for T7 and Ec RNAP as well as for Sc RNAP II, the fluorescence data showed strong

fluorescence neither at i + 2 (the minimum requirement for the main channel theory) nor up to i + 4 (as

required for multiple-substrate pre-loading). Consistent with the latter claim, for yeast RNAP II, strong

fluorescence at i + 2 tDNA probe only appeared after addition of i + 1 NTP, leading to the shift of i + 2

NTP in the active site and strand separation. However, let us take a close look at their

interpretation. First, the study from Kashkina et al. seems completely inconclusive as the levels of

fluorescence detected do not accurately match a simple event of strand separation. In other words,

correlating the fluorescence values, which do not cluster into clearly distinct sub-groups, with a single event

of strand separation does not make any physical sense. For Ec RNAP, Figure 2B (therein) seems to

indicate that i + 2 could be partially melted as the relative fluorescence is smaller (about 40%) than that

of the melted i + 1 register but higher (about 33%) than that of the i + 3/i + 4 registers further

downstream. However, the possibility that the i + 2 register experiences reduced quenching due to

decreased confinement in the main channel (e.g., i + 2 is kept strongly separated from upstream register

i + 1 by the bridge helix) cannot be excluded. Hence, although the data could indicate i + 2 partial

melting, it could also indicate an unrelated phenomenon. In any case, the authors’ claim stating that i +

2 register is associated seems to be very questionable. For yeast RNAP II, Figure 2C (therein) shows

different values of fluorescence for a given register between two experiments: difference of about 20%,

25% and 25% for i + 1, i + 2 and i + 3 registers respectively. This is an indication that their method

of detecting strand separation is inaccurate. Figure S2 (therein) shows a level of fluorescence about two-fold

higher for the i + 3 register than for i + 2 for bacterial EC. Also, the fluorescence quenching is

much higher (60-80%) than that of eukaryotic and viral EC (40-45%). Therefore, not only do the detected

fluorescence emissions not appear to indicate strand separation exactly, but these emissions

could also depend on other factors such as the type of the surrounding nucleic acids and the type of RNAP.

The results of Kashkina et al.’s experiments therefore appear inconclusive. Finally, the authors’ claim

that i + 2 is associated is directly contradicted by several biochemical studies supporting the opposite

phenomenon. In 1995, Zaychikov et al. conducted chemical footprinting on Ec RNAP [Zaychikov, et

al., 1995]. Melting up to i + 3 register was detected in some of the ECs. In 2004, Santangelo and Roberts

([Santangelo, et al., 2008]), using notably covalent DNA interstrand crosslinks, showed that inhibiting

downstream strand separation impairs transcript release during elongation termination. Their data also

suggested that elongation termination normally consists of forward translocation over an interval of 4 base

pairs. These hypotheses, taken together with the evidence that termination generally involves a dA/dT-rich

downstream sequence (which promotes strand unwinding), seem to suggest that the transcription

bubble involves the melting of i + 2 to i + 4 registers preceding and/or during elongation termination.

Although the above postulate concerns intrinsic transcription termination, it seems consistent with a

normal dissociation of a few base pairs downstream from the catalytic center during transcription

elongation. In 2009, Saeki and Svejstrup detected up to i + 3 register melting in yeast RNAP II with

potassium permanganate footprinting [Saeki, et al., 2009]. Consistent with the latter result, in 2011,

Kireeva et al. defended the partial melting of i + 2 register in their ECs and detected i + 3 register in

hybridization equilibrium in one EC, on the basis of potassium permanganate footprinting of yeast

RNAP II EC [Kireeva, et al., 2011]. Finally, in 2009, Andreacka et al. ([Andreacka, et al., 2009]), on the

basis of smFRET experiments on yeast RNAP II, suggested that the DNA strands separate at the i + 2 register,

which indicates its melting.

Now let us discuss the melting results presented above. All the kinetic studies presented in the main

channel section ([Foster, et al., 2011; Holmes, et al., 2003; Nedialkov, et al., 2003; Zhang, et al., 2003;

Zhang, et al., 2004; Gong, et al., 2005; Holmes, et al., 2006; Xiong, et al., 2007; Kireeva, et al., 2008;

Kennedy, et al., 2011]) are indirect evidence of i + 2 melting, for pre-loading in the main channel

requires DNA to be in at least a partially melted state. Furthermore, substrate pre-binding in the

downstream bubble might not require significant strand separation. For example, a slight longitudinal

shift of the tDNA dNMP nucleotide could allow hybridization with an incoming NTP to occur. If

the tertiary channel is considered as the substrate entrance into the downstream channel, only the tailing

part of the tDNA base would need to be oriented towards the pathway to allow for pre-binding.

Otherwise, the studies in [Gong, et al., 2005] and in [Xiong, et al., 2007] from Burton et al. defend the

melting of up to i + 3 and i + 4 positions respectively. Let us correlate the latter results with the melting

information presented above. How is it possible that i + 3 and i + 4 melting are not always detected?

First, in the [Gong, et al., 2005] study, i + 4 melting was not tested for, therefore one can postulate its

melting as the experimental conditions resemble those of the second study. A reasonable hypothesis

is that i + 4 melting occurs because TFIIS in conjunction with TFIIF is present in the

experiments, whereas in the other melting studies presented above, TFIIF is never present, and TFIIS

is sometimes present. It follows that TFIIF (possibly only in the presence of TFIIS) appears to promote

downstream melting. Another apparent inconsistency is the irregular detection of i + 3. It is possible that

the latter register exists in a hybridization equilibrium (as termed by Kireeva et al.) and stochastically

melts. One can also hypothesize that under real transcription conditions, the i + 3 register could remain

melted. The mechanism for maintaining such melting could be rapid translocation hindering the

stochastic reformation of hydrogen bonds, or the presence of transcription factors (naturally present in the cell)

such as TFIIF promoting downstream bubble re-adjustments and by extension DNA melting.

Alternatively, periodic or incoming NTP-triggered availability of i + 3 position could allow a NTP to

pre-load at the base. The extent to which TFIIS alone promotes DNA melting is unclear at this stage and

requires further investigation. Altogether, it is hypothesized that in physiological conditions (hence in

the presence of TFIIS and TFIIF) downstream melting up to the i + 4 register (and perhaps further

downstream) is achieved. A subsidiary conclusion to be deduced is that experiments lacking

TFIIF might not accurately depict DNA melting. Finally, one can consider that the minimum melting

requirement for the main channel theory to hold, namely a melted i + 2 register, is satisfied.

In this sub-section, we will investigate details of the mechanism of cleaving transcription factors and

its consequences for our discussion about DNA melting and substrate loading. For brevity,

the cleaving TF will be referred to as cTF. As mentioned in previous sections, cTFs exist in two forms:

TFIIS/SII for eukaryotic RNAP II and GreA/B for bacterial RNAP. Although the molecules are

unrelated in sequence, they are considered to behave in the same way (e.g., both share the same basic

structural geometry and principle of action). Therefore, information about one type of cTF can be

approximately considered to apply for the other molecule. In this sub-section, only eukaryotic TFIIS

will be investigated and one will assume that the findings apply to GreA/B TFs. In addition, TFIIS

domain I will be ignored as it is not required for activity and only plays a minor role. The recent

molecular dynamics results from Eun et al. ([Eun, et al., 2014]) made it possible to approach the cTF mechanism in

a new way. The researchers found that TFIIS was in the folded form (also referred to as closed form) in

solution, where the contracted linker region brings together domain III and domain II and the molecule

forms a compact mass reducing hydrophobic contacts with the surrounding solvent. This finding has a

very important implication, which is the following. TFIIS always binds in the folded form to RNAP II,

where domain II (and possibly a fraction of the linker region, to a lesser extent) binds to the external

surface of the enzyme near the funnel entrance. Also, it follows that the insertion of the transcription

factor in the enzyme requires the molecule to switch from the folded to the unfolded form (where the

linker region extends outwards and longitudinally) after a binding event has occurred, allowing the

linker to insert inside the secondary channel, bringing domain III near the active site, while domain II

stays bound at the surface of the enzymatic complex. In short, the cTF elementary behavior can be seen

as a two-step process: binding to the surface of the enzyme, then unfolding, which allows insertion.

This process can also be viewed as a harpoon mechanism, where domain II is the fixed element shooting

away domain III via the linker acting as the rope, and where the domain III head holds the sharp arrow

(the acidic hairpin region at the extremity of domain III containing the key second metal ion allowing

the two-metal ion pyrophosphorolysis cleaving reaction to occur) triggering the cleavage reaction.

Domain III can also affect the active site geometry such as the realignment of a distorted RNA chain.

Other key information concerning cTF arose from the 2003 and 2004 crystallographic experiments from

Kettenberger and colleagues. In the 2003 experiment [Kettenberger, et al., 2003], RNAP II complexes

lacking nucleic acids were soaked with TFIIS and the resolution of the C alpha atoms (PDB#1PQV)

evidenced that the transcription factor was inserted inside the protein, i.e. that the linker region was

positioned inside the secondary channel and that domain III was located at the extremity of the channel

near the active site. In their 2004 study [Kettenberger, et al., 2004], the researchers soaked RNAP II

complex with a tDNA template consisting of 3’-AGTACTTACGCCTGGTCAT-5’ (C denotes i + 1

position), a 5’-TCATGAA-3’ ntDNA strand running from i + 3 to i + 9 registers, and a 5’-

CGGACCAGAA-3’ RNA molecule running from i to i – 9 registers. The DNA duplex did not contain

mismatches, neither did the RNA-DNA hybrid. The TFIIS molecule was resolved and observed inserted

inside the enzyme (PDB#1Y1V, nucleic acids are not present in the PDB structure but present in the

crystallographic process). In both experiments, soaking RNAP II crystals with TFIIS yielded

a TFIIS in the inserted form, although the complex did not need to be rescued by the latter molecule. This

information seems to lead to several important conclusions. First, TFIIS can bind to any complex, even

when not needed. Second, because the molecule was resolved in the inserted form, which means

that this conformation persisted, it appears that an inserted TFIIS could retract or unbind only after

a cleavage reaction has occurred. This can be inferred because in the 2003 study, a fully active TFIIS

was used, but no nucleic acids were present, which prevented a cleavage event from occurring. In the 2004

experiment, the TFIIS used was mutated to neutralize its cleaving capability

(negatively charged hairpin residues D290 and E291 replaced by neutral alanine). Alternatively, the

possibility that the unphysiological crystallographic conditions altered a TF retraction process cannot be

excluded.
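
The scaffold of the 2004 study can be checked with a short sketch, restricted here to the DNA duplex. It assumes that the template strand, written 3′ to 5′, can be aligned column by column with the non-template strand written 5′ to 3′, and that the i + 1 base is the ninth template base as written (the C that follows the stretch paired with the ntDNA strand), which is inferred from the stated i + 3 to i + 9 span of the ntDNA strand. The function and constant names are hypothetical.

```python
# Minimal sketch: register assignment and Watson-Crick check for the DNA duplex
# of the scaffold quoted above. Names are hypothetical.

TDNA_3TO5 = "AGTACTTACGCCTGGTCAT"  # template strand, written 3'->5' as in the text
NTDNA_5TO3 = "TCATGAA"             # non-template strand, stated to span i + 3 to i + 9

# Assumption inferred from the stated ntDNA span: the 9th template base (index 8)
# is the C at register i + 1.
I_PLUS_1_INDEX = 8

DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def register_of(index: int) -> int:
    """Return n such that the template base at `index` sits at register i + n."""
    return 1 + (I_PLUS_1_INDEX - index)


def duplex_mismatches(tdna: str, ntdna: str, start: int = 0) -> list[int]:
    """Registers at which the ntDNA base is not the complement of the tDNA base."""
    return [
        register_of(start + k)
        for k, base in enumerate(ntdna)
        if DNA_PAIR[tdna[start + k]] != base
    ]


ntdna_registers = [register_of(k) for k in range(len(NTDNA_5TO3))]
print("ntDNA registers:", ntdna_registers)
print("DNA duplex mismatches:", duplex_mismatches(TDNA_3TO5, NTDNA_5TO3))
```

Under these assumptions the ntDNA strand covers registers i + 9 down to i + 3 and the list of DNA duplex mismatches is empty, in line with the description of the duplex given above.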

Another puzzling fact concerning TFIIS being resolved inserted inside the enzyme is how the factor

switched from closed to open conformation if the insertion of domain III was not needed. Two possibilities

arise: cTF automatically unfolds upon initial binding of domain II to RNAP inducing its inserted

conformation, or the crystallization process triggered an unnatural unfolding. Let us examine the

unfolding question further. The question to be raised is: what are the source of energy and the mechanism driving

cTF unfolding? Eun et al. in [Eun, et al., 2014] suggest that hydrophobic forces can be excluded (based

on potential of mean force umbrella sampling calculations) and that the only remaining suspect is

protein-protein interactions. This indeed seems to be a credible explanation. However, one can then

wonder what the molecular mechanism underlying such protein-protein interactions is. A possible

explanation could be the following. During normal transcription, a fraction of the energy generated by

thermal fluctuations is liberated in the form of translocation oscillations. In the event of misincorporation

and entry into an off-pathway state, the RNA chain backtracks, RNAP binds to the backtracked transcript

via the secondary channel and the complex is immobilized. One could therefore imagine that the thermal

fluctuations would then increase on the structure of RNAP, as it cannot be released in the form of

translocation anymore. This additional vibratory constraint could propagate to the bound cTF and

facilitate its unfolding. A second possibility could be that fine conformational changes occur within

RNAP upon misincorporation and RNA backtracking and that these conformational changes somehow

propagate to the external surface where cTF is located, and trigger an equilibrium change in the TF

structure allowing its unfolding. Finally, a third possibility could be that upon initial binding to RNAP,

the equilibrium conformation of the cTF immediately changes and allows it to switch automatically into

open (unfolded) conformation. Possibilities 1 and 3 imply that TFIIS automatically unfolds upon

binding and therefore would imply that TFIIS would necessarily interfere with hypothetical NTP

diffusion via the secondary channel. These mechanisms seem also more plausible than possibility 2, as

the latter seems to require complex long range conformational propagation along the molecular structure

of RNAP. The last pieces of information about TFIIS that will be presented before being applied in the

discussion below are the elements brought forward by the 2003 and 2004 kinetic studies from Zhang and

co-researchers [Zhang, et al., 2003; Zhang, et al., 2004]. They found that in the presence of TFIIF, TFIIS

did not hinder synthesis rates. Zhang and Burton also found (confirming earlier studies) that,

combined with TFIIF, TFIIS suppressed elemental pause (where no backtracking occurs) by promoting

quick backtracking and/or re-entry in the active synthesis pathway. The latter observation seems

consistent with the hypothesis that TFIIS has a prolonged function during active synthesis, and

consequently seems consistent with the transcription factor staying bound permanently to the enzymatic

complex. However, at this stage there is no definite evidence supporting this hypothesis, as the TF could

stochastically bind and interfere with the structure without necessarily staying bound to or inserted in it.

In this paragraph, the above hypotheses will be applied in order to investigate the possible

scenarios underlying TF function and to draw the implications for substrate loading. Three possible

outcomes can follow initial cTF binding to RNAP. First, the molecule stochastically binds to RNAP and

then unbinds if not needed, i.e. if no cleavage reaction is required. This possibility is inconsistent with

the structural data from Kettenberger et al. ([Kettenberger, et al., 2003; Kettenberger, et al., 2004])

presented above and hence can be eliminated. Second, cTF binds to RNAP and then unfolds

automatically. It seems impossible to explain the maintenance of a high synthesis rate in the presence

of TFIIS (kinetic results from Zhang et al., [Zhang, et al., 2003; Zhang, et al., 2004]) in the secondary

channel paradigm because the insertion of the molecule appears to strongly hinder substrate loading via

the secondary channel (see Figure 7) and appears to only be able to aggravate the rate limiting factor of

substrate diffusion to the active site. In other words, the postulate of cTF automatic unfolding eliminates

the plausibility of the secondary channel theory. The third scenario is that TFIIS binds to RNAP, but

only unfolds if required (unfolds only if the complex is arrested and backtracked). The latter scenario can be

subdivided into three potential outcomes. First, TFIIS stays permanently inserted inside CH2. It follows

that NTP diffusion via the secondary channel would be greatly reduced, which is inconsistent with

substrate loading being rate limiting in the CH2 theory paradigm. On the other hand, it could be

consistent with loading via the main channel, if CH2 can accommodate both PPi expulsion and a bound

TF (see Figure 7), if PPi exit is not rate limiting, and if more generally active site chemistry and

enzymatic function can be maintained. The only two remaining possibilities which could be consistent

with cTF not impeding hypothetical NTP diffusion via the secondary channel would be if TFIIS unbinds

and diffuses out of the complex after cleavage or if TFIIS retracts, stays bound to the exterior of the

enzyme and clears the path for NTP diffusion via the secondary channel after cleavage. However, the

probabilistic diffusion study from Batada et al., which can be seen as the very upper limit (discussed in

more detail in the discussion section), would have to be revised downwards because during hypothetical cTF

unbinding or retraction, NTP diffusion via CH2 can only be temporarily hindered, and consequently this

imposes an even stronger constraint on the rate-limiting aspect of NTP loading in the secondary channel

paradigm. Interestingly, bound TF leaves intact opening B leading to the tertiary channel (see Figure 7;

for details about opening B, see chapter 5), which seems to indicate that RNAP maintains its substrate

loading/expulsion capacity during TF insertion in the main channel theory paradigm. It does not mean

though that TF stays inserted permanently. Nevertheless, because as mentioned above scenario 1 is

discarded (cTF unbinds if not needed), cTF can be considered as permanently bound. This hypothesis

can be raised for the following reason. Only scenarios 2 and 3 appear to hold, where either TFIIS binds

to RNAP and only unbinds upon transcript cleavage, or where TFIIS permanently binds to RNAP and

only retracts upon cleavage. The former possibility is almost equivalent to TFIIS staying

permanently bound to the enzyme, because after a hypothetical unbinding event (after cleavage), another

TFIIS present in the surrounding solvent would quickly and stochastically bind to the enzyme. The duration

of the cleavage process (~10 s) is so much longer than the stochastic diffusion of TFIIS in solution

and its subsequent binding that most of the time RNAP can be considered as bound. It follows that TFIIS

seems to stay attached to RNAP in a prolonged manner during transcription and hence could interfere

in a prolonged manner with substrate diffusion via the secondary channel. Finally, as discussed in this

paragraph, insertion of TFIIS inside the CH2 seems to enhance the complexity and requirements of the

secondary channel model and consequently renders the theory less plausible.

Figure 7: TFIIS shielding of RNAP II secondary channel. Sc RNAP-TFIIS complex is from [Kettenberger,

et al., 2004] (PDB#1Y1V). TFIIS is shown in CPK representation, protein surface is indicated in grey. A:

TFIIS shields a large section of the funnel entrance to the secondary channel. B: TFIIS does not seem to

reduce entrance through the tertiary channel (opening CH3B).


7. Considerations on nucleotide selection

We will investigate in this section the current information about nucleotide discrimination and show

how it fits in the main channel theory paradigm. The goal of this section is to answer the following

questions. Is NTP pre-binding in the main channel consistent with discrimination mechanisms occurring

in the catalytic center? How is misloading recovery achieved in the main channel theory paradigm?

One could postulate that if NTPs are pre-selected in the downstream bubble, active center discrimination

mechanisms should not significantly affect the transcription fidelity. A simple explanation is that pre-

selection in the main channel constitutes only the first layer of discrimination and that selection is further

improved in the catalytic center. Consistent with kinetic, genetic and biochemical studies ([Svetlov, et

al., 2004; Wang, et al., 2006; Malagon, et al., 2006; Kaplan, et al., 2008; Kireeva, et al., 2008; Tan, et

al., 2008; Zhang, et al., 2010; Yuzenkova, et al., 2010; Kaplan, et al., 2012; Fouqueau, et al., 2013]), the

TL interaction network (yeast Rpb1 residues Q1078, L1081, N1082, H1085, R446, N479) constitutes a

significant proofreading checkpoint for base and ribose discrimination. However, kinetic experiments

performed on mutant enzyme with deleted TL or with inhibited TL (with α-amanitin or streptolydigin)

[Kaplan, et al., 2008; Zhang, et al., 2010; Yuzenkova, et al., 2010; Fouqueau, et al., 2013] and to a lesser

extent other studies (e.g., [Svetlov, et al., 2004; Wang, et al., 2006]), by subtracting the TL interaction

network discrimination from the total wild type discrimination, provide evidence that the first layer of

nucleotide selection is achieved without the TL. The authors term this state open active center

discrimination. Not only consistent with discrimination occurring without the active site TL, but also

consistent with the first step of selection being achieved in the main channel (while considering

hypothetical substrate pre-binding at that location) are the kinetic experiments presented in the main

channel theory section ([Foster, et al., 2001; Palangat, et al., 2001; Holmes, et al., 2003; Nedialkov, et

al., 2003; Zhang, et al., 2004; Gong et al., 2005; Xiong, et al., 2007; Kennedy, et al., 2011]). The latter

studies are all consistent with base selection being achieved in the process of substrate pre-binding to

downstream DNA registers. It is easy to rationalize such discrimination with H-bonding energies

between complementary bases. Table 1 below summarizes base identity verification results achieved by

mutant enzyme with deleted TL. Of course, as mentioned above, for the base moiety selection,

pre-binding in the main channel represents an obvious filtering mechanism (even though, as shown in Table

1, kinetic discrimination between cATP and ncGTP is only 4-fold for T. aquaticus according to

Yuzenkova and colleagues).

Table 1: Comparison of nucleotide base discrimination between several studies for enzyme with deleted TL

domain. The results colored in green and red are from [Yuzenkova, et al., 2010] and [Fouqueau, et al., 2013]

respectively. Ta, Ec and Mj are the abbreviations for T. aquaticus, E. coli and M. jannaschii RNAP

respectively. d is discrimination level and is defined by the ratio between (kpol/Kdis) for the correct substrate

and (kpol/Kdis) for the incorrect substrate, where kpol is the elongation rate (i.e., misincorporation rate in the

case of incorrect NTP) and Kdis is the dissociation rate. kd is kinetic discrimination and is defined by the

elongation rate divided by the misincorporation rate. ncNTP stands for non-complementary riboNTP and

cNTP stands for cognate riboNTP. cGTP/ncGTP field is filled (in comparison to cATP/ncATP,

cCTP/ncCTP and cUTP/ncUTP fields that are not) because the comparison arises from experiments

performed on different ECs, where i + 1 register pairs GTP and where i + 1 register does not pair GTP.

Table 2 summarizes kinetic experiment results performed with RNAP not containing a TL domain and

evidences that ribose discrimination is achieved in the open active center state. Moreover, isomerization

reversal kinetic studies from Burton and colleagues [Gong, et al., 2005; Xiong, et al., 2007] indicate that

downstream i + 2 and/or i + 3 complementary 2’dCTP did not stimulate isomerization of i + 1 NTP

(while CTP did) and that isomerization reversal was weak for incorrect i + 1 NTP (strong for CTP), and

that i + 4 to i + 6 complementary 2’dUTP (or 2’dTTP) did not stimulate reversal of i + 1 NTP (while

UTP did). Hence, one can hypothesize that the first step of ribose discrimination is indeed achieved in

the main channel, and that the above findings would be explained by the deoxynucleotide

(ribonucleotide is the right substrate) not binding to downstream DNA during the short time scales of

the kinetic experiments. However, an alternative explanation for the above isomerization observations

could be that 2’dNTPs remain bound to DS register, but impede the translocation sliding degrees of

freedom. Such a hindering effect could arise from an altered Watson-Crick geometry (tilted ribose ring)

inducing steric clashes in the channel. Alternatively, electrostatic and/or hydrophobic impediment could

occur from the fact that a deoxynucleotide lacks a hydroxyl group (negative electrostatic potential). The

isomerization experiments seem to corroborate the idea that dNTPs are discriminated against in the main

channel and are consistent with the fact that such a selection is achieved partly without the TL domain

interaction network in the active site.

Although one can postulate that, as mentioned above for potential factors affecting translocation, bond

integrity is disfavored in the main channel for deoxynucleotides, and that the latter mechanism might

involve additional phenomena such as subtle electrostatic, hydrophobic and/or steric filtering (e.g.

during translocation by steric clash with atomic contacts of BH residue Y836), two likely suspects are

the fact that H-bonding to a dNMP base might have a higher affinity for a matched rNTP than for a

complementary dNTP (at this stage, H-bonding chemistry is still not fully elucidated) or that stacking

interactions differ in the case of adjacent deoxy and adjacent ribo nucleotides. In favor of very subtle

interactions occurring between adjacent NTPs or opposite NTP-dNMP pair are the results displayed in

Table 2, which seem to suggest that slight atomic property differences between the NTP types induce

dramatic discrimination differences. Also, Yuzenkova et al.’s finding that ribose discrimination depends

more on the TL interaction network than base discrimination does is consistent with the idea that H-bonding

discriminates the base moiety much better than the ribose ring, and is consistent with the observation

from Fouqueau and colleagues that binding (and incorporation) of 2’dNTP by WT RNAP was 680 times

more frequent than for ncUTP. Consistent with the fact that 3’dNTPs are poorly (e.g., 3-fold kinetic

discrimination for 3’dATP against rATP for T. aquaticus RNAP) or not (e.g., 0.4 kinetic discrimination

for 3’dGTP against rGTP for T. aquaticus RNAP) discriminated against, is the observation that the 3’OH

is located more on the periphery from the adjacent NTP than the 2’OH. Part of the explanation for the

discrepancies between the selectivity levels may be the following. The NTP type (i.e., base identity) is

important, because depending on the type of H-bonding interaction it forms with the opposite dNMP

base, the base would tilt more or less the hydroxyl groups of the ribose moiety towards adjacent pre-

bound NTPs. Alternatively, a possibility that cannot be excluded is that the substrate types do not all

have the same probability of being misincorporated in the absence of the TL. Even if ribose

discrimination was not performed in the downstream bubble, and was only carried out in the

active center, it would not invalidate the main channel theory. According to Nick McElhinny et al.

([Nick McElhinny, et al., 2010]), rNTPs are 82-fold more abundant than dNTPs in yeast. According

to Traut’s average concentrations ([Traut, et al., 1994]), the ratio is 47-fold in mammalian cells. It

follows that only a small fraction of the time, the enzymatic complex would need to recover from a

misloaded dNTP in the active site, if no ribose pre-selection was performed.

Table 2: Comparison of nucleotide ribose discrimination between several studies for enzyme with deleted

TL domain. The results colored in purple, green and red are from [Zhang, et al., 2010], [Yuzenkova, et al.,

2010] and [Fouqueau, et al., 2013] respectively. Ta, Ec and Mj are the abbreviations for T. aquaticus, E. coli

and M. jannaschii RNAP respectively. d is discrimination level and is defined by the ratio between

(kpol/Kdis) for the correct substrate and (kpol/Kdis) for the incorrect substrate, where kpol is the elongation

rate (i.e., misincorporation rate in the case of incorrect NTP) and Kdis is the dissociation rate. kd is kinetic

discrimination and is defined by the elongation rate divided by the misincorporation rate. cd is

concentration discrimination and is defined by the ratio between incorrect and correct substrate

concentrations required to elongate half of the RNA transcript. 2’dNTP and 3’dNTP stand for

complementary 2’deoxyNTP and 3’deoxyNTP respectively, NTP stands for cognate riboNTP.
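
The three discrimination measures defined in the captions of Tables 1 and 2 can be written compactly as follows. This is only a restatement of the caption definitions; the subscripts "c" (correct substrate) and "i" (incorrect substrate) and the symbol $[\cdot]_{1/2}$ (substrate concentration required to elongate half of the RNA transcript) are notational assumptions introduced here.

$$
d = \frac{\left(k_{pol}/K_{dis}\right)_{c}}{\left(k_{pol}/K_{dis}\right)_{i}}, \qquad
k_d = \frac{k_{pol,c}}{k_{pol,i}}, \qquad
c_d = \frac{[\mathrm{NTP}_{i}]_{1/2}}{[\mathrm{NTP}_{c}]_{1/2}}
$$

With these definitions, the discrimination level factorizes as $d = k_d \times \left(K_{dis,i}/K_{dis,c}\right)$, which makes explicit that d combines the kinetic discrimination with the difference in dissociation between incorrect and correct substrates.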

At first glance, discrimination mechanisms occurring in the active center could appear inconsistent with

the main channel theory. Indeed, in the secondary channel model, NTPs are verified directly in the active

site, and an incorrect NTP in the A site is simply expelled through CH2, freeing i + 1 register for

subsequent binding. However, in the main channel model, a misloaded NTP at i + 1 position seems more

problematic, as its expulsion would leave i + 1 register unpaired while DS registers are paired (e.g., i +

2, i + 3). Furthermore, if one considers that the active site TL interaction network constitutes a second

layer of discrimination and that the latter detects errors from the first layer of selection (i.e.,

misloading), then the issue is: can the enzyme quickly recover from failures of the first layer? In this

paragraph, we will investigate potential recovery mechanisms. We will show that the main channel

loading model could very well accommodate pre-selection errors, hence sometimes allow the channeling

of wrong substrate in the catalytic center. Let us assume that pre-binding in the downstream channel is

granted by an opening connecting the site to the solution and let us term this opening tertiary channel

(CH3). Let us also assume that pre-binding in the downstream DNA channel can occur at i + 2 or i + 3

register sequentially (findings from [Xiong, et al., 2007] imply that the first available allosteric site could

be i + 4), that is to say that every nucleotide first binds at these sites before being incrementally shifted

to the upstream position after each nucleotide addition cycle. Two scenarios could explain how RNAP

would recover from a loading error in the main channel paradigm. The first recovery mechanism could

be the following. If an incorrect NTP is loaded from the main channel to the catalytic center, the TL

interaction network stimulates its expulsion (while forbidding catalysis) via the secondary channel.

Now, i + 1 is unpaired, while i + 2 and i + 3 are paired. One could postulate that the latter configuration

(i.e., “hole” at i + 1, while DS registers are paired) induces a deviation of tDNA strand, which in turn

weakens the RNA-DNA hybrid. Forward translocation could then be hindered and backtracking

promoted. Two steps of backtracking could reposition i + 1 register at the i + 3 pre-binding site, i + 2

and i + 3 NTPs could detach from the DNA by stalling against the tertiary channel walls and the non-

template strand could rewind with the template strand. If two steps of backtracking are too costly, one

could examine another possibility. In case of a wrong substrate in the active site, and its subsequent

expulsion via the pore, a simple pre-translocation event would reposition i + 1 register at i + 2 position.

Then, an NTP would simply need to rebind to i + 2, and i + 3 NTP-dNMP pair would not be affected.

This would only require the i + 3 pair not blocking the passage for i + 2 NTP, which seems to be validated

by the observation of structural data. The transcription process can then resume. Scenario one seems

more complicated because it involves the requirement of detachment of the downstream pre-bound

NTPs. However, because it is known that the TF cleavage process occurs in an arrested EC where

backtracking normally spans several nucleotides, it appears that for this

phenomenon to occur in the main channel theory paradigm, the detachment of the downstream substrates

must be possible. The backtracking of tDNA in the downstream bubble, which could leave the base-pair

hydrogen bonds intact and not require rewinding of the downstream DNA strands, would only be a

possibility for a few registers, because a longer backtracking conserving paired tDNA bases would

strongly interfere with the rewinding of the template and non-template strands. In short, scenario 1 does

not necessarily require downstream NTP detachment, but such a phenomenon appears to occur in the

backtracking process that is notably involved in the elementary step of the cleavage process.

Authors supporting the secondary channel theory claim that all NTPs bind to the E site, while only an

NTP able to base-pair with i + 1 DNA position will bind to the A site [Batada, et al., 2004; Wang, et al.,

2006; Martinez-Rucobo, et al., 2013]. This could be interpreted as a pre-filtering mechanism for the

base identity occurring between the E and the A sites. The authors seem to suggest that not all rNTPs

necessarily enter completely inside the catalytic site and bind to the i + 1 position. This raises an

immediate issue: how could an rNTP be selected at a distance from i + 1 register, when what determines the

correct rNTP (out of the 4 types) is the set of properties of the i + 1 DNA base? Authors have suggested that

the TL could serve as a bridge and allow simultaneous reading of the DNA and a distant NTP. However,

Yuzenkova et al.’s findings (consistent with kinetic data and consistent with structural data) eliminate

this far-fetched possibility: the TL proofreading mechanism only concerns an NTP bound at i + 1 position

and located inside the active center. In addition, discrimination in the open active center state also

concerns an H-bonding event. In other words, in order to be discriminated against, all rNTPs need to try

and bind to the i + 1 register. The initial binding configuration in the catalytic center has been proposed

to concern a location distinct from the addition site, however this issue is not important for our

discussion. The bottom line is that in order to be discriminated against, rNTPs must position in front of

the i + 1 register hence enter inside the active center in the secondary channel theory paradigm. This has

a very important implication. The rate limiting aspect of NTP diffusion in the secondary channel model

is much more important than commonly considered (details in the next section) and the presence of the

hypothetical E site does not change the problem at all: most of the time, an incorrect NTP will load in

the catalytic site via CH2, try and bind to i + 1, and will have to diffuse away to clear the path for the

correct substrate. On the other hand, because pre-binding in the main channel theory constitutes the first

layer of discrimination, misloading/expulsion frequency is much lower than in the alternative

model.

8. Discussion

The secondary channel mode of substrate entry in the active site appears to suffer severe limitations.

First, the properties of the pathway pose an immediate issue. The far end of the channel comprises a

narrow corridor, which has a diameter oscillating between 7 and 12 Å (according to literature, but see

in-depth analysis in chapter 5), and can also completely contract. Not only is the corridor’s structure

very constricted, it also has a strong negative electrostatic potential. Incoming MgB-NTP complexes

have an electrostatic charge of −2 and a minimum diameter of 6 Å. It follows that the substrate

experiences repulsion preventing it from approaching the corridor and that the channel can only accommodate

one NTP at a time.

In their 2004 study, Batada and colleagues defended the plausibility of the pathway as mode of loading,

because taking into account the restrictions mentioned above seemed to allow a synthesis rate consistent

with the normal elongation rate in vivo [Batada, et al., 2004]. However, they failed to take into account

several important limiting parameters. First, the NTP trajectory to the active site is significantly more

obstructed than they considered. The corridor needs to exchange wrong substrates in and out. According

to the secondary channel theory, all substrates can enter the corridor without discrimination where they

can bind to the E site. As described in the previous section, in fact an rNTP bound to the E site needs to

rotate in the catalytic center and bind to i + 1 register (equated to A site in this paragraph for the sake of

simplicity, but initial i + 1 binding site might be slightly distinct from A site, which has no importance

for our discussion) in order to be discriminated against. Because A and E sites are mutually exclusive

(both sites cannot be occupied simultaneously, notably because of shared MgB binding contact), let us

consider to simplify the problem that every time a rNTP binds to the E site, it then rotates to the A site,

and is expelled if it is the wrong rNTP or is incorporated if it is the right substrate. It follows that because

there are four different kinds of NTP, most of the time an incorrect NTP will bind to the E site, rotate to

the A site, and finally be expelled. Let us term this time window, where access of the correct NTP is

blocked, the “NTP access” window. The substrate concentrations reported in yeast by Nick McElhinny et al. [Nick McElhinny,

et al., 2010] make it possible to investigate the issue in more detail. NTP concentrations in yeast are the

following. rATP: 3.0 mM, rUTP: 1.7 mM, rGTP: 0.7 mM, rCTP: 0.5 mM, dTTP: 30 μM, dATP: 16 μM,

dCTP: 14 μM, dGTP: 12 μM.
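
As a quick check of what these concentrations imply, the following minimal sketch computes the overall rNTP/dNTP excess and the share of each rNTP in the rNTP pool. It uses only the yeast concentrations listed above; the variable names are illustrative, and treating the pool share as the chance that a given rNTP is the one encountered is a simplifying assumption (it ignores Pi and ribose compounds as well as dNTPs).

```python
# Yeast concentrations quoted above from [Nick McElhinny, et al., 2010], in mM.
rntp_mM = {"rATP": 3.0, "rUTP": 1.7, "rGTP": 0.7, "rCTP": 0.5}
dntp_mM = {"dTTP": 0.030, "dATP": 0.016, "dCTP": 0.014, "dGTP": 0.012}

total_rntp = sum(rntp_mM.values())
total_dntp = sum(dntp_mM.values())

# Overall excess of rNTPs over dNTPs.
rntp_to_dntp_ratio = total_rntp / total_dntp

# Share of each rNTP in the rNTP pool, and the complementary share,
# i.e. the fraction of the pool made of the three other rNTPs.
share = {name: conc / total_rntp for name, conc in rntp_mM.items()}
other_share = {name: 1.0 - s for name, s in share.items()}

print(f"rNTP/dNTP ratio: {rntp_to_dntp_ratio:.0f}")
for name in rntp_mM:
    print(f"{name}: share {share[name]:.1%}, other rNTPs {other_share[name]:.1%}")
```

Under this assumption, the complementary share for rGTP and rCTP is the quantity that is read as the fraction of time during which access of the cognate substrate is blocked by another rNTP.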

According to Traut’s average concentrations in mammalian cells, ribose compounds represent about

0.13 mM and Pi compounds 4.4 mM [Traut, et al., 1994]. These concentrations indicate

that, while dNTPs could be neglected, Pi and ribose compounds would often encounter NTPs in

solvent, if average mammalian concentrations apply more or less to yeast organisms. Next, let us

consider substrate competition at the E site and let us simplify the problem by only considering rNTPs.

ATPs represent ~51 % of polymerization substrates, UTPs: ~ 29 %, GTPs: ~12 %, CTPs: ~8 %. One

could then postulate that if GTP, or CTP is the next nucleotide to be added, 88 to 92% of the time, NTP

access will be blocked by an alternative substrate. In other words, for a matched GTP or CTP to enter

the active center, 88 to 92 % of the time, the cognate substrate would need to wait for insertion/expulsion

of the wrong NTP to occur. We can see that this dramatically discredits the secondary channel as a

plausible pathway. Batada et al. calculated the probability of successful diffusion by releasing one

substrate at a time from the entrance of the funnel and counted how many times the molecule eventually

binds to the E site. This assumption fails to include all the other molecules impeding the trajectory,

especially other substrates bound in the E site or misloaded rNTPs in the catalytic site. The authors

mention that “the rate of collisions resulting in binding may be further reduced by one or even two orders

of magnitude by the steric requirements for binding” and they apply this constraint in their final

estimated diffusion rate. Simply dividing the success rate by 10 or more does not constitute a rigorous

calculation method, all the more so because the authors do not specify the physical grounds for their

assumption. Let us recalculate Batada et al.’s diffusion probability, by using realistic rNTP substrate

concentrations. According to the requirements of the CH2-hypothesis, the rNTP needs to be oriented in

a specific way with the polyphosphate tail oriented ahead (towards the active center) and longitudinally

with the axis of the corridor. It follows that successful diffusion must be decreased by a steric clash

factor. Let us take into account this impediment by using the steric clash factor of 𝑥 ≤ 0.1 proposed by

Batada and colleagues. It follows that the upper limit of diffusion probabilities is given by:

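$$
\text{successful diffusion of rNTP} = \text{rate of rNTPs accessing E site} \times \text{concentration of rNTP} \times \text{probability of rNTP access not occupied by wrong rNTP} \times \text{steric clash factor}
$$

By using Batada et al.’s rate of rNTPs accessing the E site ($2 \times 10^{5}\ \text{s}^{-1}\,\text{M}^{-1}$), this equation can be

rewritten as:

$$
\text{successful diffusion of rNTP} = (2 \times 10^{5}) \times \text{concentration of rNTP} \times \text{probability of rNTP access not occupied by wrong rNTP} \times x, \qquad x \le 0.1
$$

Hence,

$$
\begin{aligned}
\text{successful diffusion of rATP} &= (2 \times 10^{5}) \times 0.0030 \times 0.51 \times x \le 30.60\ \text{rATP}\cdot\text{s}^{-1} \\
\text{successful diffusion of rUTP} &= (2 \times 10^{5}) \times 0.0017 \times 0.29 \times x \le 9.86\ \text{rUTP}\cdot\text{s}^{-1} \\
\text{successful diffusion of rGTP} &= (2 \times 10^{5}) \times 0.0007 \times 0.12 \times x \le 1.68\ \text{rGTP}\cdot\text{s}^{-1} \\
\text{successful diffusion of rCTP} &= (2 \times 10^{5}) \times 0.0005 \times 0.08 \times x \le 0.80\ \text{rCTP}\cdot\text{s}^{-1}
\end{aligned}
$$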

Now, in order to compare with the assumed in vivo RNAP II polymerization rate of ~10 rNTP·s⁻¹, let us

even further consider the very upper limit and assume that rNTPs are incorporated immediately after

binding or expelled instantly if non-cognate. Because DNA bases in the tDNA strand can be generally

considered as fairly evenly distributed in most organisms, the 10 rNTP·s⁻¹ rate in vivo can be simplified

to an incorporation rate of 2.5 s⁻¹ for each NTP. In this ideal model (incorporation delay

ignored, NTP rotation through corridor delay ignored), and assuming an NTP bound in the E site can

even rotate to the A site (which still requires direct evidence), 1 second is not enough to incorporate the

right number of GTPs or CTPs (at most 1.68 GTP·s⁻¹ and 0.80 CTP·s⁻¹, against the required 2.5 s⁻¹). Hence, although the NTP concentrations used in the calculation are

to be taken with care because intracellular compartmentalization processes could occur and represent an

unknown parameter, it appears unclear if the calculated diffusion probabilities are realistic.
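
The upper-limit estimate above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration of the calculation described in the text: it takes the rate of rNTPs accessing the E site from Batada et al., the yeast concentrations, the rounded pool shares and the upper bound x = 0.1 of the steric clash factor, and compares each result with the 2.5 s⁻¹ per-nucleotide requirement. Function and variable names are hypothetical.

```python
# Upper-limit rate of successful rNTP diffusion through the secondary channel,
# following the formula given in the text.
RATE_TO_E_SITE = 2e5        # s^-1 M^-1, rate of rNTPs accessing the E site [Batada, et al., 2004]
STERIC_FACTOR_MAX = 0.1     # upper bound of the steric clash factor (x <= 0.1)
REQUIRED_PER_NTP = 10 / 4   # s^-1, ~10 rNTP/s in vivo split evenly over the four bases

# Yeast concentrations in M and rounded rNTP pool shares used in the text.
conc_M = {"rATP": 0.0030, "rUTP": 0.0017, "rGTP": 0.0007, "rCTP": 0.0005}
share = {"rATP": 0.51, "rUTP": 0.29, "rGTP": 0.12, "rCTP": 0.08}


def upper_limit_rate(name):
    """Largest successful diffusion rate (s^-1) allowed by the formula for one rNTP."""
    return RATE_TO_E_SITE * conc_M[name] * share[name] * STERIC_FACTOR_MAX


rates = {name: upper_limit_rate(name) for name in conc_M}

for name, rate in rates.items():
    verdict = "meets" if rate >= REQUIRED_PER_NTP else "falls short of"
    print(f"{name}: <= {rate:.2f} s^-1, {verdict} the required {REQUIRED_PER_NTP} s^-1")
```

Because x enters as a plain multiplicative factor, any smaller steric clash factor scales all four rates down proportionally, so the shortfall for rGTP and rCTP can only grow.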

Another issue to be raised with their study is the following. When estimating diffusion impediment

induced by the electrostatic potential, they equated a successful diffusion with a NTP binding to the E

site. However, there is no experimental evidence that an NTP bound in the E site can rotate to the A

site. This has only been inferred but never been observed. If this unproven assumption is wrong, then a

matched substrate binding to the E site is not at all equivalent with a successful diffusion to the catalytic

center. At this stage an rNTP bound to the E site still needs to undergo an almost 180° rotation through

the narrow corridor and therefore the diffusional impairment induced by the corridor dimensions and

electrostatics is not yet fully accounted for. If such a rotation does not occur, then the probability of

diffusion from the E site to the A site is likely to be greatly reduced. Indeed, since the E site is located at

the first two thirds of the corridor, the full diffusional impairment induced by the corridor dimensions

and electrostatics is not accounted for. Furthermore, rotation from the E site to the A site seems

difficult to explain. When a matched NTP binds to the E site, MgB is temporarily bound to the pore wall.

Consequently, the MgB contribution to the repulsion is partially neutralized because it is anchored to

the wall, and serves only as a rotor. MgB is positively charged and therefore the rest of the NTP, which

remains in free motion and accounts for most of the total −2 charge of the NTP, still needs

to overcome the negative repulsion of the pore during the rotation. It could be hypothesized that MgB

temporarily screens the electromagnetic field lines of the potential allowing the NTP to rotate, yet this

seems far-fetched. There is no physical basis to explain how an NTP bound to the E site in an inverted

position would rotate. Furthermore, the fact that the crystallography experiments were able to capture

NTP in an inverted position in the E site could indicate that this architecture remained and that the NTP was

unable to rotate (discussed in more detail below). On the other hand, the fact that no NTPs pre-bound

in the main channel have been seen in crystallographic data could simply mean that they were not

immobilized long enough in that position. They could also be mistaken for paired bases.

The researchers claim that “delivery of NTPs by diffusion may be just sufficient to maintain the rate of

RNA synthesis”. However, for all the reasons mentioned above, it seems clear that the probability of

successful diffusion via the secondary channel is not sufficient at all to allow a physiological rate of

processive elongation. In short, because the calculated diffusion probability is already barely sufficient

and can be considered as the very upper limit, it seems that this study is in fact strong evidence against

the secondary channel theory. Their research has nevertheless yielded crucial information about

the restrictions imposed by the pore’s properties on diffusion. It is to be noted that the restrictions

imposed by the secondary channel are very likely to apply to the other RNAP species. For example, the

negatively charged residues of the corridor (Rpb1 D481, D483, D485, E486, E822, D826, E1074, and

Rpb2 E529, E836, D837) are absolutely conserved among yeast, M. jannaschii, C. elegans, Drosophila,

human and mouse. For the negatively charged residues that are directly adjacent to the pore: Rpb1 D356,

D526, Rpb2 D978 are conserved, Rpb1 E833, D1359 are highly conserved, Rpb1 E771 and Rpb2 D1100

are moderately conserved. Also, bacterial RNAPs display a conical secondary channel, which would

impose similar topological impairment, although the pathway is shorter.

In [Kireeva, et al., 2010], Kireeva, Burton et al. underline that the calculated diffusion rates from

Batada and colleagues are 50 times slower than the experimentally observed rates of the template-

specific NTP sequestration for human, yeast RNAP II and E. coli RNAP in [Foster, et al., 2001; Holmes,

et al., 2003; Nedialkov, et al., 2003; Zhang, et al., 2004; Kireeva, et al., 2008; Kireeva, et al., 2009].

According to the successful diffusion rates calculated above, notably for CTP, which already represent more

than the very upper limit, the discrepancy would be even worse. So even if template specificity can facilitate successful

diffusion in the CH2 paradigm (e.g., suppress non-template roadblocks at the E site and greatly reduce

diffusion competition), it seems hard to explain such a sequestration rate (i.e. successful catalytic loading

rate) with the restrictions imposed by the channel.

The second computational study ([Zhang, et al., 2015A]) seemingly eliminates the main

channel as a credible substrate pathway, on the grounds that it is favorable neither conformationally nor

electrostatically. The experiments carried out suffer from the following issues. First, the researchers

ran a pathway detection program, CAVER ([Chovancova, et al., 2012; Kozlikova, et al., 2014; Pavelka,

et al., 2016]), to identify cavity routes inside the enzyme. The resulting proposed substrate-accessible

zone within CH1 seems particularly implausible in light of the conformational results presented in chapter 5.

The work carried out in this thesis strongly refutes their conformational analysis. To run properly, the

CAVER program needs an initial starting pathway guess to be defined and it is possible that the authors

severely misused the computer tool. Second, the methodology of fitting NTPs directly into estimated

available empty areas (whose estimation is initially wrong anyway) is very questionable: it does not shed

any light on the diffusion process. Third, they reach the conclusion that an NTP fitted inside the

secondary channel experiences less repulsion than an NTP inside the main channel. However, the

diffusion impediments generated by the secondary channel theory do not concern the entire secondary

channel, but only a select area: the last narrow section, which is the corridor. There is of course plenty

of space in the first two thirds of the secondary channel, which appears to serve another purpose than

substrate loading (conic shape is ideal for expelling inorganic pyrophosphates, misloaded NTPs, large

area to accommodate TFIIS, etc.). Finally, their electrostatic analysis is not corroborated by the work

presented in chapter 5. It is possible that their detected main channel substrate route (perpendicular to

CH1, and appearing to wrap around the ntDNA strand) is indeed not favorable electrostatically

because it is too close to ntDNA. Alternative pathways, such as CH3C or CH3A, have not been taken

into account. Their claim that the secondary channel is electrostatically balanced is refuted by this thesis,

but also notably by [Batada, et al., 2004].

Now let us examine specifically the E site evidence. The argument in favor of the secondary channel

theory is the question of why NTPs would be observed bound in CH2, very close to the active site, if they load through

a different pathway, while no NTPs have been observed bound in the downstream channel in

crystallographic/Fourier electron density data. Several remarks can be made. First, binding in the E site

could represent a singular event, and not represent the normal reaction pathway. While biochemical and

rapid kinetic techniques could be more suitable for capturing the dynamic elongation process, the

experimental procedure used to generate enzyme crystals does not represent processive

elongation. The enzymatic complexes are soaked in a solution containing only one type of NTP, which

prevents sequential processive elongation from occurring. In short, it could be that the experimental conditions

do not allow normal processive elongation to occur and hence do not allow hypothetical normal loading

through the main channel. In other words, a possibility to be considered is that binding events to the E

site occur because the normal reaction pathway through the main channel is eliminated. Therefore, even

though diffusion through the secondary channel could be less favorable than loading through the main

channel during normal transcription rate conditions, it could become the default pathway in

crystallographic experimental conditions. It follows that even with the diffusional restrictions described

previously, if granted a sufficient amount of time, an NTP could very well successfully bind to the E site

rather than bind in the downstream bubble. Importantly, the E site is located near

the beginning of the corridor, hence diffusion to the E site concerns the most favorable route through

CH2, as the main impediment of the pathway occurs from the corridor onwards. Furthermore, Kireeva et al.

[Kireeva, et al., 2010] have suggested that in the experimental procedure used for generating crystals,

blocking chemistry at the i + 1 site (necessary to fix the i + 1 NTP) might disable substrate loading via

the main channel.

Now let us consider the possibility that loading through the tertiary channel and via the main channel

was not distorted. The study from Batada et al. is consistent with NTPs being able to bind to the E site,

even if the event is rare. One could then object that it would disturb the main channel theory pathway,

for example by preventing the incoming NTP-dNMP pair from binding to the A site. However, under real

in vivo conditions (e.g., presence of all types of rNTPs in the solvent buffer), binding to the E site could be

virtually permanently prevented by occupancy of the A site by the NTPs loaded from the main

channel. A possibility could be that in the fast state, NTPs never have time to bind to the E site, because

translocation could be locked forward and the E site could always be gated: a nucleotide is being

incorporated, which forbids access to the E site; then translocation brings a new NTP into the active center

before access to the E site is clear (e.g., because PPi is not yet released or because the RNA 3’ end gates binding

to the E site), which binds the next nucleotide to the A site and still forbids access. The cycle can resume,

and the E site will always be gated by the successive loading/incorporation of NTPs incrementally

translated from the main channel. If the loaded NTP is incorrect, then its expulsion would forbid access

to the E site, and rapid backtracking motion could prevent access. Finally, it could even be possible that

the enzymatic complex would support a few NTPs binding to the E site in normal transcription. The

requirement would then be that activity is not distorted. For example, RNAP could just wait for the NTP

to dissociate from the E site, or alternatively, the incoming NTP channeled from i + 2 to i + 1 position,

could expel the parasitic NTP bound to the E site, by competitive binding.

Other evidence has been proposed for loading via the secondary channel. In 2009, Erie and colleagues

found that mutating E. coli residue D675 led to a significant increase of misincorporations [Erie, et al.,

2009]. The authors suggested that the residue played a role in filtering substrate diffusing through the

secondary channel. However, the residue is located directly adjacent to the bridge helix (notably, β’ 772, 775

and 779), and within electrostatic interaction distance of the TL tip. So mutating the residue could very

well impair a key function. Studies on TL E1103G and bridge helix mutations have shown that the

domains affected fidelity, probably indirectly by affecting the bridge helix or directly by affecting TL

mobility. Hence, the D675 mutation does not prove anything. Otherwise, the negatively charged residue

might promote the electrostatic expulsion process: the secondary channel, and in particular the corridor,

can be seen as an electrostatic gun expelling negatively charged PPi molecules and negatively charged

misloaded NTPs, as described in chapter 5. It follows that removing the charged amino acid might

hinder the expulsion of misloaded NTPs, and hence indirectly promote transcription errors.

Concerning the microcin J25 evidence, let us show that it is very weak. First, concerning the residues

that bind the toxin molecule, the authors claim that “The side chains of the majority of implicated

residues are solvent accessible—directed into the lumen of the RNAP secondary channel or toward the

exterior of RNAP—and make no obvious interactions important for RNAP structure or function”

[Mukhopadhyay, et al., 2004]. However, this is an exaggerated statement. E. coli binding residues β’

775-777, 779, 780, 782-786, 789, 790 belong to BH, β’ 922, 926, 927, 930-933, 1136, 1137 belong to

TL, β’ 744, 748 belong to Floop and β 543-545 belong to FL2. Hence, insertion of microcin J25 would

notably directly interfere with two of the most important domains involved in the NAC (TL and BH).

In addition, the molecule could inhibit transcription by preventing the release of PPi. So not only

would microcin inhibit transcription activity by trapping the PPi molecule very near

the A site, which would completely disturb the active site geometry and electrostatics, but it also appears

to obviously restrict the conformational degrees of freedom of key domains for transcription such as

BH, TL and FL2. Furthermore, the fact that inhibition is partially overcome at high NTP concentration

does not seem very consistent with the assumption that it blocks substrate loading. If the molecule stays

in place, it is clear from direct inspection that no substrate should be able to bypass the molecule at all to

access the corridor (microcin almost perfectly seals off the secondary channel, leaving no room for the

passage of a molecule the size of an NTP).

Before concluding this discussion, the puzzling studies about the Brownian ratchet mechanism need to be

addressed. Substrate diffusion/loading and translocation are concepts that go hand in hand, because NTP

binding belongs to the more general translocation/transcription cycle. It is therefore not surprising that

these processes have almost always been considered in relation to each other. To study translocation,

the key process of transcription, it makes intuitive sense to pull on the nucleic frame and/or the enzyme

in a controlled manner. The single-molecule optical tweezers experiments serve that purpose by

attaching the extremities of DNA with an optical trap, and by exerting assisting or opposing force. The

basic concept underlying these studies is to try and fit a kinetic equation describing translocation

(including stepping distance, force, temperature, etc.) to experimental measurements, under different

conditions such as varying force, NTP concentration or nucleic translocation track, and validate in return

the axioms of the model. Although seemingly impressive and very accurate, this methodology can suffer

from the following limitations. The greatest flaw in the concept of fitting experimental measurements to a

kinetic model is that a model describing reality well is not thereby proven to be the reality. In other

words, good agreement between a kinetic fit and a model does not imply that all the starting assumptions

of the model are correct. For example, some researchers suggested that results supporting the main

channel theory were invalid because a secondary NTP binding site was not a necessary assumption to

their kinetic model: “we were able to obtain reproducible global fits with the two pawl model without

the need to introduce additional NTP binding sites.” [Bar-Nahum, et al., 2005], “this model does not

invoke additional NTP binding sites at different translocation states, allosteric NTP binding sites,

active/inactive conformational states” [Bai, et al., 2007], “the quality of this fit to our conceptually

simpler model indicates that a more complex model with two NTP binding sites is not necessary to

explain this data” [Maoileidigh, et al., 2011]. Not only do their kinetic fits suffer from limitations that will

be discussed below, but their data can actually be explained with an NTP binding to i + 2, with the only

distinction that it is not the initial binding of the NTP that rectifies the ratchet but only its loading in the

active site. Hence, the claim of these papers that the agreement of their kinetic equation

with their initial hypothesis that NTP binds directly to i + 1 suggests that the main channel theory is

incorrect: “It is reassuring that our model not only explains all the biochemical experiments presented

in the present paper but also provides a consistent and natural explanation of published kinetic data”

[Bar-Nahum, et al., 2005] and “our model does not invoke any hypothetical allosteric and/or template-

specific NTP binding sites other than i + 1 to explain the biphasic rate curves. Simply, under substrate-

limiting conditions, the F bridge has a higher probability to melt the 3’ end of the hybrid, thus facilitating

backtracking.” [Bar-Nahum, et al., 2005], is hasty reasoning. The authors offer no real explanation

as to why the existence of a secondary binding site is to be discarded, and no explanation at all for the

main channel theory kinetic data. There is no link between the BH (also referred to as the F bridge)

facilitating backtracking on particular occasions and pre-binding of NTPs in the downstream bubble

facilitating forward translocation. Furthermore, recent single molecule studies [Larson, et al., 2012;

Dangkulwanich, et al., 2013], offering much more balanced views, contradict quite directly the latter

views about the non-existence of a secondary binding site. Some experiments deriving kinetic

parameters from force-velocity relations should be regarded with caution, because they might involve

wrong starting assumptions such as rapid translocation equilibrium, which is very contested

[Dangkulwanich, et al., 2013]. Second, single-molecule studies do not always monitor translocation as

precisely as they seem. At non-subsaturating NTP concentrations, i.e. in normal processive elongation,

the resolution of the single-molecule experiments is only about three base pairs

[Maoileidigh, et al., 2011]. Also, kinetic fits can involve the averaging of normalized data, or multiple

independent fit parameters, hence erasing details and artificially improving the verification of the model

used. The study from [Dangkulwanich, et al., 2013] seems more general than previous attempts to

characterize the kinetics of translocation because their model does not assume translocation equilibrium,

ignores NTP binding rates in their initial equation assumptions, and treats forward and reverse

translocation with separate parameters. Their findings are in full agreement with the CH1 model,

namely post-translocation locked forward at non-subsaturating substrate concentrations and existence

of a secondary binding site independent of the translocation state.

The study from [Bar-Nahum, et al., 2005] mentioned in the previous paragraph poses another issue. The

authors find that when i + 2 NTP is supplemented (in EC34 therein), forward translocation is reduced,

and hence that the allosteric results supporting the CH1 model seem invalid. Let us try and explain their

experimental data with the following reasoning. If the presence of i + 2 NTP reduces EC fractions

belonging to the forward state, it means that somehow, there was a deleterious binding competition

effect between i + 1 and i + 2 NTPs. In the CH2 model, this competition only concerns successful

diffusion to i + 1. If their result is valid, namely less forward translocation in the presence of 0.5

mM GTP (matched to i + 1) and 0.5 mM ATP (matched to i + 2) than with 1 mM GTP alone, then

one just needs to replace one postulate: deleterious binding competition happened at i + 2 and not at i +

1, since any NTP that must load into the active site first needs to bind at the i + 2 position. Their experiment

seems far from invalidating the CH1 kinetic results: NTP chases done in a very controlled manner

and the production of precise substrate-saturation kinetic curves constitute a characterization method superior to

measuring EC fractions that are pre- or post-translocated.

None of the arguments in favor of the secondary channel seems solid, and all could be discarded, whereas the

main channel theory is supported by virtually undeniable proofs: the fact that NTPs can pre-bind in the

main channel and that the latter constitutes the default state is supported by much strong evidence

described in the main channel theory section. In particular, it appears impossible to explain the allosteric

effect of several downstream templated NTPs without accepting the fact that they must pre-bind to the

DNA template strand in the downstream bubble.

9. Concluding remarks

Further elements can be raised to shed some light on the controversy over the mode of substrate loading. A

possible explanation for an alternative function for the E site arose in 2007, when Toulokhonov et al.

proposed that a frayed RNA 3’ end could bind to the E site during the nonbacktracked pause state

[Toulokhonov, et al., 2007]. Alternatively, the E site could be rationalized by the fact that it simply

represents the MgB binding site (where the inverted NTP binds according to [Westover, et al., 2004A]).

In 2008, Weinzierl and colleagues [Tan, et al., 2008] conducted mutagenesis on bridge helix residues

and observed that some mutations led to increased transcription rates. Because the bridge helix is linked

to translocation and not to substrate loading, it seems to indicate that substrate loading is not rate limiting

and therefore this result seems inconsistent with the secondary channel theory. In conclusion, the

secondary channel theory is inconsistent with the results and observations presented in this review and

appears impossible.

Chapter 2

MD Methods

1. Introduction

The main channel theory seems to be the default mode of substrate loading during processive elongation.

However, little is still known about the loading details of RNAP substrates: “currently, no electrostatic

or diffusion modelling is available to indicate how NTPs might load through the main channel” [Kireeva,

et al., 2010]. Also, although this is scientifically questionable given the solidity of the kinetic

evidence, the common consensus is that the CH1 theory still requires “direct” evidence. It appears

therefore necessary to not only shed some light on how the diffusion process might occur, but also to

offer some additional evidence. MD is an ideal candidate for carrying out such work. Indeed, MD is

a revolutionary computational simulation method allowing unprecedented levels of inspection at the Å

level and from the femtosecond timescale onwards. It is possibly the best method to characterize the

dynamics of a biomolecular system [Meller, 2001; Frenkel, et al., 2002]. Because diffusion is an ultrafast

process, it makes sense to inspect it using a very precise method. By analogy, it may be no

coincidence that the most compelling evidence for diffusion so far has come from the kinetic assays,

which can capture ultrafast processes. Crystal structures render an atomic-precision image of

biomolecular systems, yet no time evolution is displayed. MD, using as starting input an X-ray

crystallography or NMR set of coordinates, can be seen as a tool that brings the static image to life. In this

section, we will be interested in MD philosophy and methodology, from the preparation of a static model

to advanced MD procedures that make it possible to extract mechanisms of the diffusion/loading process, which is

currently not well understood, and perhaps to further support the main channel theory. The procedures

have been fully automated and scripted, and are given in the appendices, to facilitate reproducibility of

the simulations. In order to achieve optimal computational power, simulations were run on NVIDIA

CUDA Graphics Processing Unit (GPU) based workstations assembled for this purpose. The simulated

system is S. cerevisiae RNAP II.

2. Metabolite pool

Choosing good metabolite concentrations is important to mimic physiological conditions in MD

simulations. For instance, they can play a crucial role in electrostatic mechanisms (e.g., shielding,

screening), affect the characteristics of the diffusional routes, and can also impact the overall stability

of the enzyme. In this sub-section, the focus is on the metabolites that are charged, and

particularly on those present in non-negligible proportions. All the concentrations are intracellular (whole

cell or cytoplasmic) and discussed for yeast S. cerevisiae. In general, measures derived from aerobic

glucose-limited chemostat experiments or in reasonable fit with in vivo-like conditions have been

selected over batch cultivation experiments. Concentrations expressed in μmol/gDW or mg/gDW are

converted to mM using the factor of 2.38 mL/gDW ([Theobald, et al., 1997; Hans, et al., 2001]), except

for values from [van Eunen, et al., 2010], where a 2.083 mL/gDW conversion factor is used by the

researchers (based on their measured culture dry weight of 3.6 g·L⁻¹ and 2.5 × 10¹¹ cells·L⁻¹, and an

assumed cell volume of 3 × 10⁻¹⁴ L). Charged amino acid metabolites seemingly present in the

intracellular environment in non-negligible amounts have been measured as Glu: 71-82 mM, Asp: 8-

9 mM, His: 2-2.5 mM, Lys: 1.7-1.9 mM ([Hans, et al., 2003; Canelas, et al., 2008A]), Arg: 6 mM ([Hans,

et al., 2003]). Realistic intracellular NTP substrate concentrations of ATP: 3 mM, CTP: 0.5 mM, GTP:

0.7 mM, and UTP: 1.7 mM have been measured [Nick McElhinny, et al., 2010]. The latter ≈ 6 mM

rNTP content is in reasonable agreement with Traut’s average concentrations in mammalian cells [Traut,

1994]. The ATP level is rather close to measurements giving intracellular ATP levels around 2.6-3.5

mM [Gonzales, et al., 2000; Canelas, et al., 2008B; Boer, 2009; Volkov, 2015; Magdenoska, et al.,

2015]. Intracellular phosphorus content is 304-320 mM [Graschopf, et al., 2001; van Eunen, et al., 2010].

According to [van Eunen, et al., 2010], most of these atoms are bound and form phosphate groups, which

is consistent with literature data fixing phosphate values around 7-43 mM [Lagunas, et al., 1983;

Theobald, et al., 1996; Gonzales, et al., 2000; Auesukaree, et al., 2004; Canelas, et al., 2008B; Zhang,

et al., 2015B]. Sulfur atoms amount to 44-45 mM intracellular concentration [Graschopf, et al., 2001;

van Eunen, et al., 2010]. Most are bound to glutathione, thus rendering a free sulfate content of about 5

mM [van Eunen, et al., 2010]. Ca2+ ion intracellular concentrations (1.9-2.2 mM, [Graschopf, et al., 2001;

van Eunen, et al., 2010]) result in an estimated 0.5 mM of free cations, as most of them are

compartmentalized in the vacuole [van Eunen, et al., 2010]. Total intracellular Mg2+ content is 51-55

mM [Graschopf, et al., 2001; van Eunen, et al., 2010]. However, only about 1-2 mM of free magnesium

is estimated [van Eunen, et al., 2010]. Indeed, most of the cations bind to anionic compounds such as

nucleic acids, NTPs, NDPs, polyphosphates, etc., or are stored in compartments, e.g. sequestered in

mitochondria and the endoplasmic reticulum [Romani, et al., 1992; Swaminathan, et al., 2003;

van Eunen, et al., 2010]. K+ can display intracellular variations of 50 to 300 mM depending on growth

phase and K+/Na+ extracellular ratio [Volkov, 2015]. However, some studies suggest that K+ can fall to 5 mM with

highly unfavorable K+/Na+ extracellular proportions, while others report a lower threshold

not much below 100 mM even when external potassium is severely scarce (reviewed in [Volkov,

2015]). Nevertheless, potassium concentrations appear to be fairly well balanced and much more

resilient to changes in environmental conditions than Na+, which depends more on initial extracellular

concentration conditions [Volkov, 2015]. A study published in [Kahm, et al., 2012] suggests that when

the external medium contains more than 1 mM of potassium, the latter cation reaches an internal cell

content plateau of 300 mM. K+ intracellular concentration is optimal around 200 to 300 mM

[Rodriguez-Navarro, 2000; Arino, et al., 2010], consistent with 208-340 mM concentrations from

literature data [Olz, et al., 1993; Sunder, et al., 1996; van Eunen, et al., 2010], and consistent with the

cation being the most abundant metabolite in yeast [Kahm, et al., 2012]. Although intracellular Na+

concentration can vary significantly depending on the growth conditions [Herrera, et al., 2013; Volkov,

2015], a substantial body of research (e.g., [Sychrova, et al., 2004; Arino, et al.,

2010; Ramos, et al., 2016]) has stressed that, in order to avoid sodium toxicity, the

intracellular proportion of K+ must be significantly higher than that of Na+. To avoid such a

detrimental effect, several mechanisms appear to greatly favor K+ influx over Na+ (e.g., the extreme

1000:1 K+/Na+ selectivity ratio of transporters [Matthius, et al., 1999], important Na+ efflux mechanisms,

and vacuolar compartmentalization [Montiel, et al., 2007]). Na+ is not even absolutely

necessary for S. cerevisiae growing in plenty of potassium [Camacho, et al., 1981]. Low levels of sodium

relative to potassium seem to be well in line with published data measuring 5-28 mM intracellular

concentrations [Olz, et al., 1993; Sunder, et al., 1996; Graschopf, et al., 2001; Kolacna, et al., 2005; van

Eunen, et al., 2010]. A 25 mM Na+ concentration has also been proposed as optimal for phosphate uptake

activation [Martinez, et al., 1998]. Although S. cerevisiae is a fungus,

a useful indication for its K+/Na+ ratio is the value of 20:1 found in animal

cells [Matthius, et al., 1999]. Concerning Cl- anions, S. cerevisiae requirements are very low [Rodriguez-

Navarro, 2000; Jennings, et al., 2008]. Consequently, the anion could be used solely to ensure charge

neutrality in our MD system, rather than for significant intrinsic physiological contribution. To

summarize the investigation, let us consider the following overall metabolite study. A team of nineteen

co-researchers attempted to facilitate the transfer of experimental enzyme kinetic data to systems

biology fields such as metabolic mathematical modelling, computational simulation, etc. [van Eunen, et

al., 2010]. In order to do so, the authors investigated the design of a defined, cell-free, in vivo-like enzyme

kinetic assay medium whose composition mimics as closely as possible in vivo physiological intracellular

cytosolic concentrations (and pH), while aiming for simplicity (i.e., minimizing the diversity of

metabolites). In other words, they aimed to define a standard assay medium for molecular biology

experiments, whose composition resembles the S. cerevisiae in vivo cytosolic metabolite pool. One

application, for instance, is to allow accurate kinetic mathematical modelling of the in vivo dynamics of

metabolic pathways under the most physiologically relevant intracellular conditions [van Eunen, et al., 2014].

The philosophy of their research aligns well with our MD metabolite investigation: setting up a

realistic S. cerevisiae intracellular metabolite pool. They propose the following composition: K+: 300 mM, Na+: 20 mM,

phosphates: 50 mM, sulfates: 5 mM, free Mg2+: 2 mM, Ca2+: 0.5 mM. The verification of

the latter medium (supplemented with NTP substrates) against cytosolic enzyme activity by kinetic

assay returns good Km values, and confirms its credibility as a good physiological fit. In addition, the

values are physiologically credible according to literature, and agree well with the elements discussed

previously. In the absence of extensive metabolite studies focused on the nucleus

itself, their values seem to be a good standard and initial guess for setting up realistic solvation box components in

the system to be simulated. The following distinction should however be made. Phosphates and

glutamates are the elements whose concentrations mainly differ from physiological measurements. We

propose that the MD simulations use a lower phosphate concentration, which appears more

in line with the literature, and which should not impact the system behavior because varying the

concentration (between 10 and 75 mM) does not seem to have an impact (cytosolic enzymatic activity

unchanged, [van Eunen, et al., 2010]). At pH = 7.0, phosphate is present as approximately

62% dihydrogen phosphate (H₂PO₄⁻) and 38% hydrogen phosphate (HPO₄²⁻). As far as the glutamate

molecules are concerned, van Eunen et al. used an unphysiologically high amount of them, above all

as an experimental convenience: they are naturally abundant in cells, and using higher concentrations to

balance the overall charge, instead of introducing another type of counter-ion, is convenient.

Adding Cl- counter-ions is trivial in an MD simulation; therefore, a glutamate concentration value that fits

the literature data better is chosen.
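
The two numerical steps used in this section (the 2.083 mL/gDW conversion factor and the phosphate speciation at pH 7.0) can be checked with a short script. This is only an illustrative sketch: the function names are hypothetical, and the second pKa of phosphoric acid (7.21) is an assumed textbook value that is not quoted in this thesis.

```python
# Illustrative check of the unit conversion and phosphate speciation.
# All function names are hypothetical helpers introduced for this sketch.

def cell_volume_per_dry_weight(dry_weight_g_per_l, cells_per_l, cell_volume_l):
    """Intracellular volume per gram of dry weight, in mL/gDW."""
    # mL of cell volume contained in one litre of culture
    total_cell_volume_ml = cells_per_l * cell_volume_l * 1000.0
    return total_cell_volume_ml / dry_weight_g_per_l


def umol_per_gdw_to_mm(amount_umol_per_gdw, ml_per_gdw):
    """umol/gDW divided by mL/gDW gives umol/mL, which is numerically mM."""
    return amount_umol_per_gdw / ml_per_gdw


def phosphate_fractions(ph, pka2=7.21):
    """Henderson-Hasselbalch split between H2PO4- and HPO4(2-).

    pka2 = 7.21 is an ASSUMED textbook value, not taken from the thesis.
    """
    ratio = 10.0 ** (ph - pka2)          # [HPO4 2-] / [H2PO4-]
    dihydrogen = 1.0 / (1.0 + ratio)
    return dihydrogen, 1.0 - dihydrogen


# Values quoted in the text: 3.6 g/L dry weight, 2.5e11 cells/L, 3e-14 L per cell
factor = cell_volume_per_dry_weight(3.6, 2.5e11, 3e-14)
h2po4, hpo4 = phosphate_fractions(7.0)

print(f"conversion factor: {factor:.3f} mL/gDW")
print(f"H2PO4-: {h2po4:.0%}, HPO4(2-): {hpo4:.0%}")
print(f"25 mM phosphate -> {25 * h2po4:.1f} mM H2PO4-, {25 * hpo4:.1f} mM HPO4(2-)")
```

With these inputs the script reproduces, to rounding, the conversion factor and the 62%/38% split quoted above.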

In summary, taking an updated version of van Eunen et al.’s standard intracellular concentrations and

Nick McElhinny et al.’s NTP content yields the following proposed MD solvent box metabolite content:

• NTPs: 5.9 mM

• K+: 300 mM

• Na+: 20 mM

• Glu, Arg, Lys, His, Asp: 80, 6, 2, 2.5, 8.5 mM

• Phosphates: 25 mM, i.e. H₂PO₄⁻ = 15.5 mM, HPO₄²⁻ = 9.5 mM

• Sulfates: 5 mM

• Mg2+: 2 mM

• Ca2+: 0.5 mM

• Cl-: concentration required to ensure charge neutrality
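
To give a feel for what these concentrations mean in a simulation box, the list above can be translated into molecule counts. The sketch below is illustrative only and rests on stated assumptions: the cubic box edge of 150 Å is hypothetical, the net charges assigned to each species are simplifications (NTP taken as -4, His as neutral), and the charge of the solute (protein and nucleic acids) is deliberately ignored.

```python
# Illustrative conversion of the proposed pool into molecule counts.
# The box size and the per-species charges are ASSUMPTIONS for this sketch.

AVOGADRO = 6.02214076e23  # 1/mol

# Proposed pool (mM), taken from the list above
POOL_MM = {
    "NTP": 5.9, "K+": 300.0, "Na+": 20.0,
    "Glu": 80.0, "Arg": 6.0, "Lys": 2.0, "His": 2.5, "Asp": 8.5,
    "H2PO4-": 15.5, "HPO4(2-)": 9.5, "SO4(2-)": 5.0,
    "Mg2+": 2.0, "Ca2+": 0.5,
}

# ASSUMED net charges per molecule (illustrative simplification)
CHARGE = {
    "NTP": -4, "K+": 1, "Na+": 1,
    "Glu": -1, "Arg": 1, "Lys": 1, "His": 0, "Asp": -1,
    "H2PO4-": -1, "HPO4(2-)": -2, "SO4(2-)": -2,
    "Mg2+": 2, "Ca2+": 2,
}


def box_volume_litres(edge_angstrom):
    """Volume of a cubic box in litres (1 A = 1e-9 dm, 1 dm^3 = 1 L)."""
    edge_dm = edge_angstrom * 1e-9
    return edge_dm ** 3


def molecule_counts(pool_mm, volume_l):
    """Nearest-integer number of molecules of each species in the box."""
    return {name: round(c * 1e-3 * AVOGADRO * volume_l)
            for name, c in pool_mm.items()}


def chloride_for_neutrality(counts, charge):
    """Cl- ions needed to cancel a net positive charge of the pool alone."""
    net = sum(counts[name] * charge[name] for name in counts)
    return max(net, 0)


counts = molecule_counts(POOL_MM, box_volume_litres(150.0))  # hypothetical box
n_cl = chloride_for_neutrality(counts, CHARGE)

for name, n in counts.items():
    print(f"{name:10s} {n:4d}")
print(f"{'Cl-':10s} {n_cl:4d}  (pool only, solute charge ignored)")
```

In a box of this size, 1 mM corresponds to roughly two molecules, so the low-abundance species (Ca2+, Lys, Mg2+) are represented by only a handful of copies, which is worth keeping in mind when interpreting their behavior.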

The proposed metabolite content is not perfect. Further investigations are required to refine it. Current

knowledge about free versus compartmentalized and/or bound metabolites, especially ions, is still

approximate. More importantly, the values discussed in this section correspond to cytosolic or whole-cell

total intracellular concentrations, which can differ not only from each other, but also from the

nuclear content. The study used as a standard for our proposed MD metabolite content ([van Eunen, et

al., 2010], but see also [van Eunen, et al., 2014]) concerns physiologically close intracellular concentrations

optimized for cytosolic enzymatic activity, not for nuclear enzymes. Therefore, the values could be

refined with further intracellular measurements, using procedures such as [Herrera, et al., 2013] to

isolate nuclear concentrations (a relevant piece of information for our investigation is, for example, that the

nuclear potassium content is 29% of the total concentration), and with enzyme kinetic assay proof testing

using protocols such as [van Eunen, et al., 2010]. This would be ideal to mimic the in vivo-like nuclear

metabolite environment of RNAP II.

3. Forcefields

Molecular Dynamics simulations rely on a set of parameters describing the physical interactions between

the atomic components. These parameters make it possible to model the time evolution of a set of atomic coordinates,

by calculating the resulting force applied to each atom. The parameters can be pictured as controlling

the protein degrees of freedom such as stretching (between 2 atoms), bending (3 atoms) and torsion (4

atoms). Non-bonded interactions include coulombic and vdw interactions (between 1 atom and all the

other atoms, although methods such as PME reduce the number of calculations required by

using Fourier reciprocal space). Additional parameters to be defined include the nature of a bond (single

or double), particle mass and partial charge.
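
For orientation, the terms listed above can be gathered into a generic fixed-charge potential energy function of the kind used by Amber-type forcefields. The expression below is a schematic sketch rather than the exact form of any specific parameter set; the symbols (force constants k_b and k_θ, equilibrium values b_0 and θ_0, torsional barrier V_n, periodicity n, phase γ, Lennard-Jones coefficients A_ij and B_ij, partial charges q_i) are notational assumptions introduced here.

```latex
U(\mathbf{r}) =
    \sum_{\text{bonds}} k_b \,(b - b_0)^2
  + \sum_{\text{angles}} k_\theta \,(\theta - \theta_0)^2
  + \sum_{\text{dihedrals}} \frac{V_n}{2}\,\bigl[1 + \cos(n\phi - \gamma)\bigr]
  + \sum_{i<j} \left[ \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}}
  + \frac{q_i q_j}{4\pi\varepsilon_0\, r_{ij}} \right]
```

The first three sums correspond to stretching, bending and torsion; the last sum contains the vdw and coulombic non-bonded terms, and the force on each atom is the negative gradient of U with respect to its coordinates.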

MD parameters are of the highest importance to correctly model a system, and notably so for diffusion

simulation, which involves a substrate bound to a highly charged metal ion. Studies comparing the

most popular forcefields (CHARMM, Amber, GROMOS, OPLS, etc.) have shown that simulated systems

could diverge substantially over time, which underlines how strongly subtle forcefield parameter differences

impact MD performance [Best, et al., 2008; Beauchamp, et al., 2008; Lange, et al., 2010; Cino, et al.,

2012; Lindorff-Larsen, et al., 2012; Piana, et al., 2014].

Finding adequate parameters was a significant challenge in this research project.

Preliminary simulations led to NTPs forming unphysiological clusters due to suboptimal magnesium

parameters. Other issues involved substrates not diffusing sufficiently and almost directly binding and

being trapped on the surface of the protein. Furthermore, it appeared important to add many charged

metabolites (which represents physiological conditions, see previous section) in order to address the

latter issue, notably by allowing electrostatic shielding of exposed charged residues. Therefore,

adequate metabolite parameters needed to be set up. In this section, we describe the

parameterization choices that were made.

Earlier simulations made use of Amber12, Amber14 and CHARMM27/36 parameters, which will

not be discussed further. For the latest results, presented in chapter 5, the Amber16 forcefield was chosen,

because it belongs to the family of the most popular and tested forcefields, because it is compatible with

a very wide range of molecular types, and because it is the only one allowing the use of the

12-6-4 vdw potential.

More precisely, for the parameters of DNA, RNA and protein amino acids, the latest choices

recommended by the Amber developers were taken: DNA.OL15, RNA.OL3 and ff14SB [Wang, et al., 2000;

Perez, et al., 2007; Zgarbova, et al., 2011; Krepl, et al., 2012; Zgarbova, et al., 2013; Zgarbova, et al.,

2015; Maier, et al., 2015].

As far as the rNTP substrates are concerned, the most widely recommended parameters for Amber were taken, namely

those of [Meagher, et al., 2003]. The set is relatively old, yet it has been used in recently published

research ([Duan, et al., 2014; Jiang, et al., 2015; Perez-Villa, et al., 2015]). A more recent set of

parameters exists for NTPs, perhaps improving the flexibility of the triphosphate moiety, yet it is only

intended for use with the CHARMM forcefield [Komuro, et al., 2014].

Physical modelling of the Mg2+ ion for MD has represented a severe challenge for experts, because it possesses

a high charge and singular vdw properties, and optimal parameters are very difficult to derive for

fixed-charge models. Several attempts were made to correctly parameterize the cation. Initially, the set from

[Aqvist, 1990] (default Amber 12 parameters) was tested and led to very unphysiological behaviors

such as heavy clustering. Then, the model from [Allner, et al., 2012] allowed a significant leap forward

in terms of simulation performance. A 2015 study ([Panteva, et al., 2015A]) compared seventeen Mg2+

forcefield models (including that of [Allner, et al., 2012]), and proposed that the optimal model was the one

from [Li, et al., 2014], where a third r⁻⁴ term is added to the Lennard-Jones potential, which makes it possible to

partially take polarizability into account. The latter model was further optimized for use with nucleic

acids [Panteva, et al., 2015B], and gave the best results with the TIP4PEW water model. Therefore, the

modified 12-6-4 set for nucleic acids ([Panteva, et al., 2015B]) was used, together with TIP4PEW water

([Horn, et al., 2004; Horn, et al., 2005]).

For the other monovalent and divalent ions (K+, Na+, Cl-, Ca2+), 12-6-4 parameters were also used and

are described in [Li, et al., 2014; Li, et al., 2015].

In order to simulate glutamate, aspartate, lysine, histidine and arginine metabolites, the zwitterion amino acid set from [Horn, 2014] was used.

Mass, bond, angle and non-bonded parameters for sulfate atom types S and O2, were taken from

Amber16 GLYCAM_06.dat parameter library file and partial charges were taken from [Cannon, et al.,

1994] (model “std 1”). Hydrogen and dihydrogen phosphate files were prepared by analogy with

[Homeyer, et al., 2005; Steinbrecher, et al., 2014].

When no existing parameter libraries were available, they were written with the following procedure. Mass, bond, angle, dihedral and non-bonded parameters were written in an Amber .dat file respecting the correct format. Then the relevant topology (.lib) file was prepared, by defining connectivity, bond nature, and partial charges with the LEaP module of Amber16 [Case, et al., 2016].

Lastly, the OpenMM library ([Friedrichs, et al., 2009; Pande, et al., 2010; Eastman, et al., 2010A; Eastman, et al., 2010B; Eastman, et al., 2013]) was chosen to run the simulations, because it is, to the author’s knowledge, the only existing MD tool able to run the 12-6-4 potential on GPU.


4. Accelerated MD simulations

aMD is an MD sampling technique that greatly accelerates the exploration of a reaction coordinate by biasing the potential energy landscape. When the potential energy falls below an energy threshold, a boost is added, which allows energetic barriers to be crossed much faster. The key advantage of aMD is that it partially overcomes two main limitations of conventional MD, namely timescale and stagnation within local potential energy basins.

In its original implementation [Hammelberg, et al., 2004], the aMD method adds an energetic boost to the dihedral component of the potential energy equation (describing the physical interaction between the elements composing an MD system). Torsional degrees of freedom are generally considered as the main components driving conformational changes, and indeed the dihedral boosting method has been shown to enhance sampling in protein simulations. The method was then extended to the total potential energy (i.e., a total boost is added to all the components of the forcefield), mainly to accelerate diffusive motion [de Oliveira, et al., 2006]. Because solvent molecules are very numerous, the total boost mainly affects the non-bonded component of the solvent atoms and hence mainly contributes to accelerating diffusion within a system. A dual boost method ([Hammelberg, et al., 2007]) combines the two preceding techniques, by adding energy to both the dihedral and the total potential, and is commonly used as the method of choice. It accelerates at the same time the exploration of space by the protein polypeptide chains (dihedral boost) and solvent diffusion (total boost).

The accuracy and functionality of the method have been extensively validated by a variety of studies ([Grant, et al., 2009; Bucher, et al., 2011; de Oliveira, et al., 2011; Markwick, et al., 2011; Lindert, et al., 2013; Kappel, et al., 2015; Song, et al., 2015; Miao, et al., 2016]). The method has made it possible to reach very long timescales, up to the millisecond range ([Markwick, et al., 2007; Pierce, et al., 2012]), and to enhance the modelling of experimental phenomena [Markwick, et al., 2011]. Recent developments of the method, boosting non-bonded terms and dihedrals separately, show great promise [Doshi, et al., 2014].

The aMD boost method relies on the following theoretical background.

When the potential energy is lower than a threshold energy parameter $E_b$, i.e. for $V(r) < E_b$, the modified potential $V^{*}(r)$ is defined by:

$$V^{*}(r) = V(r) + \Delta V(r)$$

where

$$\Delta V(r) = \frac{(E_b - V(r))^2}{E_b - V(r) + \alpha}$$

and where $\alpha$ is an acceleration parameter.


When the potential energy 𝑉(𝑟) of the system does not fall under an energy threshold 𝐸𝑏,

i.e. when 𝑉(𝑟) ≥ 𝐸𝑏, potential energy is kept untouched and ∆ 𝑉(𝑟) = 0.

The above modification of the potential energy surface results in a new force experienced by each atom

of the system.

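For an atom $i$ belonging to the system, the new force will be ([Hammelberg, et al., 2007; Markwick, et al., 2011]):

$$F_i^{*} = -\nabla_i \left[ V(\boldsymbol{r}) + \Delta V(\boldsymbol{r}) \right] = F_i \left[ \frac{\alpha^2}{(\alpha + E_b - V(\boldsymbol{r}))^2} \right]$$

where $F_i = -\nabla_i V(\boldsymbol{r})$ is the original force.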

At each step of the simulation, the unbiased potential is calculated, then the modified boost potential is

computed, which is then translated to a boost force assigned to the concerned force components

[Markwick, et al., 2011].

The boost force acting on a component of the forcefield (e.g. dihedral) can be expressed as:

$$F_{comp}^{*} = -\nabla V_{comp}(\boldsymbol{r}) \, \frac{\alpha_{comp}^2}{(\alpha_{comp} + \beta_{comp})^2} = F_{comp} \, \gamma_{comp}$$

where

$V_{comp}(\boldsymbol{r})$ is the component potential that is modified,

$\alpha_{comp}$ is an acceleration parameter,

$\beta_{comp} = E_b - V(\boldsymbol{r})$,

$F_{comp}$ is the unboosted force for the component,

$\gamma_{comp} = \dfrac{\alpha_{comp}^2}{(\alpha_{comp} + \beta_{comp})^2}$.

The overall force in the boosted system is then obtained as:

$$F^{*} = (F - F_{comp}) + F_{comp} \, \gamma_{comp}$$

where $F$ is the unboosted system total force [Lindert, et al., 2013].
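As an illustration of the boost equations of this subsection, the following minimal Python sketch evaluates the boost energy and the force scaling factor for a given potential energy. The function names are illustrative only and are not taken from the simulation scripts; the sketch is not the integrator actually used for the simulations.

```python
def amd_boost(v, e_b, alpha):
    """Return (delta_v, gamma) for a potential energy v (kcal/mol).

    delta_v is the boost energy added to v, and gamma is the factor by
    which the corresponding force is scaled. Above the threshold e_b the
    potential is left untouched (delta_v = 0, gamma = 1).
    """
    if v >= e_b:
        return 0.0, 1.0
    gap = e_b - v
    delta_v = gap * gap / (alpha + gap)
    gamma = (alpha / (alpha + gap)) ** 2
    return delta_v, gamma


def boosted_total_force(f_total, f_comp, gamma):
    """F* = (F - F_comp) + F_comp * gamma, applied per Cartesian component."""
    return [ft - fc + fc * gamma for ft, fc in zip(f_total, f_comp)]
```

The scaling factor can be checked numerically: it equals the derivative of the boosted potential with respect to the unbiased potential, so that a small alpha flattens the landscape more strongly than a large one.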


Although aMD does not require any a priori state of a system to be known or defined, the $\alpha$ and $E_b$ parameters have to be set, and require some fine tuning that can be challenging. The $E_b$ parameter controls the portion of the energy landscape that will be affected by the boost. $\alpha$ modifies the shape of the energy surface [Wang, et al., 2011A; Wang, et al., 2011B]. Both parameters impact the strength of the acceleration. A higher acceleration can be obtained by increasing $E_b$ or by decreasing $\alpha$.

In practice, each system has optimal $E_b$ and $\alpha$ parameters, and finding them usually requires some testing, for example keeping one parameter constant while varying the other one [Markwick, et al., 2011; Bucher, et al., 2011B].

The equations used to calculate the parameters are the following:

$$E_{dihed} = V_{dihed} + ctr \quad (\mathrm{kcal.mol^{-1}}), \qquad \alpha = 0.20 \times ctr$$

where $ctr$ is one of:

$ctr = 3 \text{ to } 5\ \mathrm{kcal.mol^{-1}} \times nb\_prot\_res$, [Markwick, et al., 2011; Miao, et al., 2016]

$ctr = 0.20\ \mathrm{kcal.mol^{-1}} \times nb\_prot\_atms$, [Markwick, et al., 2009; Wang, et al., 2011B]

$ctr = (0.3, 0.4 \text{ or } 0.5) \times V_{dihed}\ \mathrm{kcal.mol^{-1}}$, [Tikhonova, et al., 2013; Kappel, et al., 2015; Song, et al., 2015]

The most widely agreed energetic relations for the dihedral acceleration parameters, based on comparative analysis from several studies, are the $E_{dihed}$ and $\alpha$ formulas above, with:

$ctr = 3.5\ \mathrm{kcal.mol^{-1}} \times nb\_prot\_res$, [Lindert, et al., 2012]

As far as the total acceleration parameters are concerned, the values currently advised for most systems (based on comparative studies) are defined as:

$$E_{total} = V_{total} + ctr \quad (\mathrm{kcal.mol^{-1}}), \qquad \alpha = ctr, \qquad ctr = (0.16 \text{ or } 0.20)\ \mathrm{kcal.mol^{-1}} \times nb\_atms$$

[Hammelberg, et al., 2007; Markwick, et al., 2011; Kappel, et al., 2015; Miao, et al., 2016]

It is worth noting that the $\alpha$ parameter for the total acceleration is higher than that for the dihedral boost, notably in order not to distort the solvent too heavily.
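The relations above can be sketched as two small Python helpers. This is only an illustration: the function and argument names are hypothetical, and the dihedral and total energies are assumed here to be average values taken from an unbiased run, which is the usual convention but is not restated in this subsection.

```python
def dihedral_boost_params(v_dihed, n_prot_res, per_residue=3.5, alpha_factor=0.20):
    """Return (E_dihed, alpha_dihed) in kcal/mol.

    ctr = per_residue * n_prot_res, E_dihed = V_dihed + ctr and
    alpha = alpha_factor * ctr.
    """
    ctr = per_residue * n_prot_res
    return v_dihed + ctr, alpha_factor * ctr


def total_boost_params(v_total, n_atoms, per_atom=0.16):
    """Return (E_total, alpha_total) in kcal/mol, with ctr = per_atom * n_atoms."""
    ctr = per_atom * n_atoms
    return v_total + ctr, ctr
```

For example, a hypothetical protein of 100 residues with an average dihedral energy of 1000 kcal.mol-1 would give a threshold of 1350 kcal.mol-1 and an $\alpha$ of 70 kcal.mol-1 with the default coefficients.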


The aMD simulation procedure is the following.

First, the static model of the protein is prepared. The original RNAP II atomic coordinates used in this work are those of PDB#2E2H, whose structure resolution is described in [Wang, et al., 2006]. Missing loops are added with the Yasara-Structure software [Krieger, et al., 2002]. The initial model also has missing nucleic acid bases, which are added following the complex procedure outlined in chapter 3. All the subsequent steps are automated by a script: running the script in appendix 1 should (after a few adjustments, for example to match the sequence of the PDB file used) automatically perform all the tasks detailed below.

Second, the static model is “pre-minimized”. That is to say, minimization is first performed on a reduced (“expurgated”) system, in order to optimize the static model and to ease the work of the minimization algorithms, notably for the newly inserted nucleic templates. The static system of step 1 is further prepared by specifying N- and C-termini at the extremities of the subunits. The system is completed with missing heavy atoms, hydrogenated, neutralized with K+ ions, and solvated in a TIP4PEW water box ensuring a minimum solute-to-edge distance of 15 Å, with the LEaP module of Amber16 [Case, et al., 2016]. Then ten rounds of minimization 1 (min 1), each immediately followed by minimization 2 (min 2), are computed with the Amber16 Sander module ([Case, et al., 2016]) on Central Processing Unit (CPU). Min 1 consists of 1000 steps of steepest descent and 4000 steps of conjugate gradient, with a 500 kcal.mol-1 harmonic restraint on protein and nucleic residues and an electrostatic cutoff of 10 Å. Min 2 consists of 2500 steps of each algorithm without restraints. Amber16 is chosen over OpenMM for minimization because its algorithms performed better for this purpose (notably, they reduce the potential energy further).

Third, the refined static model of step 2 is prepared for simulation. The number of water molecules required to ensure a TIP4PEW water box with a buffer of 15 Å is calculated with LEaP. The number of metabolite molecules to be inserted is then calculated from this number of water molecules. The amount of Cl- is adjusted to ensure overall charge neutrality. Using the AddToBox module of Amber16, the metabolite molecules are inserted into the refined static model of step 2. Phosphate molecules are not added, because of simulation instabilities with the 12-6-4 potential (their parameter set worked fine otherwise, i.e. without the 12-6-4 potential). The system is then hydrogenated, and the simulation coordinate and parameter files are configured with LEaP. The simulation files are further processed with the Amber16 ParmEd module, in order to add the 12-6-4 potential Lennard-Jones matrix to the relevant molecules and to apply the nucleic acid modifications of [Panteva, et al., 2015B] by changing the polarization atom type of some nucleic atoms. Please refer to appendix 1 for a more detailed procedure.


Fourth, a first round of simulation is done, without the substrates, in order to give the metabolites enough time to relax and to improve the electrostatic configuration. The final system (as opposed to the expurgated system) is minimized using the same procedure as above. Then heating, velocity equilibration, box equilibration, and final equilibration are executed with OpenMM ([Friedrichs, et al., 2009; Pande, et al., 2010; Eastman, et al., 2010A; Eastman, et al., 2010B; Eastman, et al., 2013]) on GPU using the mixed CUDA precision model [Le Grand, et al., 2013], a Langevin integrator with a time step of 2 fs, a temperature of 300 K and a thermal coupling collision frequency of 1.0 ps-1, with Hydrogen bonds kept constrained and water molecules set to rigid. A PME non-bonded method with a cutoff distance of 8 Å, and a 10 kcal.mol-1 harmonic restraint on protein and nucleic atoms, are used for heating. A PME cutoff distance of 10 Å and a 50 kcal.mol-1 harmonic restraint on the DNA anchoring residues (extremities) are used otherwise. The system is heated for 20 ps. Velocity equilibration is run for 100 ps as NVT (constant moles, volume and temperature). Box equilibration is done for 20 ns as NPT (constant moles, pressure and temperature), by setting up a Monte Carlo barostat with a pressure of 1 bar. The system is then relaxed for 20 ns as NVT.

Fifth, the substrates are added. In order to account for an NTP influx corresponding to a total concentration of 5.9 mM, regardless of the rNTP type, 5.9 mM of GTP is chosen. It is to be noted that, as outlined in chapter 3, i + 2 and i + 4 are strategically mutated to cytosine, and consequently the i + 2 to i + 4 registers of tDNA (i + 3 is already cytosine in the original PDB structure) are available for pairing with an incoming GTP substrate. Water molecules are stripped from the last trajectory frame of the final 20 ns relaxation of round 1. A calculated amount of Cl- ions is also removed in order to ensure a neutral charge once 5.9 mM of NTPs of charge -2 are added. Then GTP molecules are inserted using the AddToBox module. The next steps are identical to round 1.
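The bookkeeping of this step can be sketched as follows. This is a hedged illustration rather than the script of appendix 1: it assumes that the number of solute molecules is derived from the number of water molecules by taking bulk water as approximately 55.5 mol/L, and the function names are hypothetical.

```python
WATER_MOLARITY = 55.5  # mol/L, approximate value for bulk water (assumption)


def molecules_for_concentration(conc_mM, n_waters):
    """Number of solute molecules giving roughly conc_mM in a box of n_waters."""
    return round(conc_mM * 1e-3 * n_waters / WATER_MOLARITY)


def chloride_to_remove(n_solute, solute_charge=-2):
    """Number of Cl- ions to strip so that adding anionic solutes keeps neutrality."""
    return n_solute * abs(solute_charge)
```

With a hypothetical box of 100000 water molecules, 5.9 mM corresponds to 11 GTP molecules, and 22 Cl- ions would be removed to compensate for their charge of -2.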

Finally, the actual aMD simulation is executed. Acceleration parameters are calculated using similar

equations as outlined in the introduction of this subsection, and are listed in chapter. Several run

durations have been performed. The simulation is configured with a DualBoost integrator using a time

step of 2 fs and the four acceleration parameters, an Andersen Thermostat using a 300 K temperature

bath and a collision frequency of 1.0 ps-1, PME non-bonded method with a cutoff of 8 Å, constrained

Hydrogen bonds, rigid water molecules, 50 kcal.mol-1 harmonic restraint on DNA anchoring residues,

and mixed CUDA GPU precision.


5. Steered MD simulations

Steered MD (sMD) is a simulation technique that biases a reaction-pathway coordinate by applying a “pulling” force to one or several atoms. It was devised by applying the concept of Atomic Force Microscopy (where a cantilever exerts a force on a biomolecule) to MD.

For the simple pulling of an atom along a direction, the force can be defined as:

$$Force\_sMD = k \left( (x - x_0)^2 + (y - y_0)^2 + (z - z_0)^2 \right)$$

where

$k$ is the force magnitude,

$x, y, z$ are the coordinates of the pulled atom,

$x_0, y_0, z_0$ are the coordinates towards which the force is exerted.
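If the expression above is read as a bias energy added to the potential (an assumption of this sketch, since MD engines commonly take custom terms as energy expressions), the force felt by the pulled atom is its negative gradient and points towards the target coordinates. The Python function below, whose name is illustrative, returns both quantities.

```python
def smd_bias(position, target, k):
    """Return (energy, force) of the quadratic pulling term.

    energy = k * |position - target|^2
    force  = -gradient = -2 * k * (position - target)
    """
    delta = [p - t for p, t in zip(position, target)]
    energy = k * sum(d * d for d in delta)
    force = [-2.0 * k * d for d in delta]
    return energy, force
```

The force vanishes at the target and grows linearly with the distance to it, so a distant checkpoint pulls harder than a near one for the same value of $k$.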

While aMD is rarely used to model diffusion, sMD is the most common method, due to the ease with which one can force a system to go through the desired pathway. The method however requires a priori information about what is going to happen (the direction of the pulling force) to be defined, whereas this is not required for aMD. Several flavors of sMD (e.g., velocity sMD, adaptive bias sMD) that can be seen as umbrella sampling techniques make it possible to extract information such as work or free-energy differences, which were judged of priority importance for this research project. Hence, classical force sMD has been performed.

Let us consider the sMD computer routines. The basic simulation trick that has been employed is that the sMD trajectory is divided into several checkpoints, which are defined by residue index. The advantage of this method is that it maximizes the portability of the results, with minimal user input. In other words, no direction has been defined by abstract coordinates, but by precise landmarks within the structure itself, thus greatly facilitating the reproducibility of the results. In addition, this strategy has made it possible to fully script and automate the procedure. For researchers wishing to reproduce the simulations, an example sMD trajectory script is provided in appendix 2.

The starting structure is the last frame of simulation round 1 presented above. It consists of an equilibrated metabolite and water box containing RNAP, where the system has been minimized, heated, velocity equilibrated, box equilibrated and further relaxed, without NTPs. Two Cl- ions are stripped from the PDB file to ensure that overall charge neutrality is respected once the sMD GTP substrate is added. Water is also removed. Then a GTP molecule is inserted strategically within an inner box surrounding checkpoint 0, the solvent accessible area lying in front of checkpoint 1. It is not placed directly at the checkpoint coordinates, but within a certain x, y, z threshold (hence the inner box) in order not to overlap with existing metabolites. This is done by extracting an inner box surrounding the checkpoint from the global PDB file, then by adding a GTP molecule to the inner box with the AddToBox Amber16 module, adjusting the x, y and z range correspondingly, and finally by copying the GTP back to the global PDB.

The system is then completed with missing heavy atoms, hydrogenated, and solvated in a TIP4PEW water box respecting a 15 Å minimal distance to the solute, with LEaP. The 12-6-4 potential including nucleic atom modifications is applied in the same fashion as outlined in section 3. Then, minimization, heating and velocity equilibration are performed as described in the previous section, with the distinction that velocity equilibration is run for 20 ps, and that mass constraints are used instead of harmonic restraints on the DNA anchoring residues (this minimizes the computational complexity of the forces at play for sMD). These steps are required even though the initial system was already relaxed, because when starting from any static model, without velocity information, it is necessary to bring the system back to the target temperature.

Next, the checkpoint loops are executed. For each checkpoint along the sMD trajectory, the ith checkpoint loop is repeated until a certain threshold distance has been reached, before switching to the next checkpoint. The threshold distances and details about the checkpoints used are listed in chapter 5.

In addition, an iteration check is computed within each checkpoint loop to terminate the execution of the loop after 2 ns if the trajectory has not converged, in order to avoid a memory crash.
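The control flow of the checkpoint loops can be sketched in Python as below. This is not the script of appendix 2: `step_fn` is a hypothetical stand-in for one short sMD segment, the chunk length is arbitrary, and stopping the whole run when one checkpoint does not converge is an assumption of the sketch.

```python
import math


def run_checkpoints(position, checkpoints, thresholds, step_fn,
                    chunk_ps=10.0, max_ps=2000.0):
    """Pull `position` through successive checkpoints.

    step_fn(position, target, chunk_ps) must return the new position after
    one simulation chunk. Returns (final_position, log), where each log
    entry is (checkpoint_index, elapsed_ps, reached).
    """
    log = []
    for index, (target, threshold) in enumerate(zip(checkpoints, thresholds)):
        elapsed = 0.0
        reached = math.dist(position, target) <= threshold
        # Repeat the loop until the threshold distance is reached,
        # or until the iteration check (max_ps, i.e. 2 ns by default) fires.
        while not reached and elapsed < max_ps:
            position = step_fn(position, target, chunk_ps)
            elapsed += chunk_ps
            reached = math.dist(position, target) <= threshold
        log.append((index, elapsed, reached))
        if not reached:
            break  # trajectory did not converge for this checkpoint
    return position, log
```

Because the targets are passed as coordinates, they would be recomputed from the residue landmarks of the current structure in a real run; here they are simply fixed points.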

Each checkpoint loop is run with a Langevin integrator, using a temperature of 300 K, a 1.0 ps-1 thermal coupling and a 2 fs time step, a PME non-bonded method with an 8 Å electrostatic cutoff, constrained Hydrogen bonds, rigid water, and mass constraints applied to the DNA extremities.

For sMD simulations through CH2, preliminary sMD pulls were applied on TL CA atoms of scRPB1

1082, 1087, 1088 and 1092 residues, and the latter residues were kept fixed during simulation, in order

to maintain the TL open.

Finally, sMD in combination with aMD has been tested, where the procedure is the same as for sMD,

except that the checkpoint loop runs with a DualBoost integrator and an Andersen thermostat, instead

of a Langevin integrator.

Preliminary work has been performed on PDB#5C4J (see chapter 5). The same procedures as listed in

this section were employed, with the distinction that CTP molecules were added in the system instead

of GTPs, and that CTP parameters provided by Prof. R. Amaro from UCSD were used.


Chapter 3

Elongation Complex Reconstruction


1. Introduction

As discussed in chapter 1, most of the crystal structures available for RNAP II do not contain a full

nucleic Elongation Complex. In our starting model: PDB#2E2H, ntDNA is not resolved after i + 5.

Several conventional and aMD simulations have been run (data not shown) on the incomplete structure

and the following observations have been made. The incomplete presence of ntDNA bases inside the

protein is problematic as the conformation of the nucleic Elongation Complex plays a critical role for

the diffusion of rNTPs. It significantly modifies the conformation and electrostatics of the CH1/CH3

channels, hence directly affecting the diffusion of substrates, it does not prevent DS register slippage,

and does not allow pre-binding at the right registers immediately downstream from the loading position. In addition to the aforementioned factors, the upstream portion of the ntDNA (after i + 5) seems important to stabilize the tDNA registers in a configuration that welcomes pre-binding substrates, notably by lowering tDNA, by minimizing parasitic backbone electrostatic repulsion with the substrates, and possibly by improving diffusion through stabilization of the nucleic acids. Furthermore, RNA bases are not present after i + 10, whereas a complex is considered elongation ready when the RNA strand consists of at least 13 bases. In simulations with the incomplete RNA chain, the strand took distorted conformations, bending towards the inside of the protein close to the inner DNA, instead of being directed towards the RNA exit channel.

Reconstructing a complete EC is also of high relevance to experimentally simulate translocation events,

which as presented in chapter 1 is linked to the loading of substrates, and consequently can shed some

light on the full diffusion/loading process. It is therefore of significant importance to reconstruct a

complete and physiologically adequate EC for RNAP, in order to carry out the characterization of

nucleotide diffusion/loading to a higher degree of precision and to optimize the scientific plausibility of

the experiments. The DNA extremities are to be kept fixed during simulation; consequently, the starting structure must be as good as possible, as restraints can prevent DNA from naturally relaxing into a more native-like conformation during simulation. In this chapter, we will investigate mathematical tools, the development of algorithms and their application, in order to recreate a complete EC.


2. 3D Rotation

Before proceeding to the investigation of the mathematical tools and the algorithms, let us first define

what strategy is to be employed. The goal is to add missing RNA and DNA bases in the RNAP initial

atomic coordinates. To do so, geometric information that is already present in the structure is to be used

to guess the shape of the overall DNA frame, and to add the missing bases incrementally. The guess need not be perfect, as minimizing the potential energy of the structure will optimize the geometry of the nucleic strands. However, the guess must be close enough for the minimization algorithms to run through, and to converge to a local minimum whose energy is not irrelevantly high. Once we know where to add missing bases, the next step is to insert them incrementally with the right atomic coordinates. In order to position an object in 3D space, two rotations are needed for the object to adopt the right orientation, and an additional translation operation is to be computed to complete the positioning.

Given an object in space to be aligned in a specific manner with a reference object, two consecutive rotation alignments are to be done. A rotation alignment between a vector of the reference object and a vector of the object to be aligned is defined by an axis that is normal to both vectors, and by the angle between the two vectors. The rotation about this axis by this angle can then be expressed mathematically as three successive rotations around the x, y and z axes (bringing the total number of rotations needed to align the object to six). This is defined as an axis-angle to Euler angle conversion.

Three methods are generally used to carry out such tasks, across a large variety of domains such as aeronautics (computing the heading, bank and roll of a plane), video games and graphical design (rotating and visualizing a 3D object). These methods are rotation matrices, quaternions, and Rodrigues’ rotations. Quaternions are a method of choice due to the limited number of operations required and the ease with which an entire 3D object can be manipulated at once.


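A quaternion is a four-dimensional representation of a rotation and is defined by:

$$q = a + b\boldsymbol{i} + c\boldsymbol{j} + d\boldsymbol{k}$$

where

$\boldsymbol{i}$, $\boldsymbol{j}$ and $\boldsymbol{k}$ are the fundamental quaternion units and satisfy $\boldsymbol{i}^2 = \boldsymbol{j}^2 = \boldsymbol{k}^2 = \boldsymbol{i}\boldsymbol{j}\boldsymbol{k} = -1$,

$a = \cos\left(\frac{angle}{2}\right)$,

$b = axis_x \sin\left(\frac{angle}{2}\right)$,

$c = axis_y \sin\left(\frac{angle}{2}\right)$,

$d = axis_z \sin\left(\frac{angle}{2}\right)$,

$angle$ is the angle of rotation.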
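As a minimal sketch of this definition (illustrative Python, not the Tcl code used with VMD), a quaternion can be built from an axis and an angle and then used to rotate a point through the product of the quaternion, the point written as a pure quaternion, and the conjugate quaternion. The function names are hypothetical.

```python
import math


def quaternion_from_axis_angle(axis, angle):
    """Return (a, b, c, d) for a rotation of `angle` radians about `axis`."""
    x, y, z = axis
    norm = math.sqrt(x * x + y * y + z * z)
    s = math.sin(angle / 2.0) / norm  # normalizes the axis on the fly
    return (math.cos(angle / 2.0), x * s, y * s, z * s)


def quat_mul(p, q):
    """Hamilton product of two quaternions given as (a, b, c, d)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
            a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
            a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
            a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2)


def quat_rotate(point, q):
    """Rotate a 3D point by the unit quaternion q (computes q * point * conj(q))."""
    a, b, c, d = q
    conj = (a, -b, -c, -d)
    w = quat_mul(quat_mul(q, (0.0, point[0], point[1], point[2])), conj)
    return list(w[1:])
```

Rotating the point (1, 0, 0) by 90 degrees about the z axis with these functions gives (0, 1, 0) up to rounding, and the same quaternion can be applied to every atom of an object.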

Deriving the quaternion equations in terms of Euler angles gives the following transformations, to be executed in the right order to express a 3D rotation by a given $angle$ around a unit axis of components $(x, y, z)$:

$$Rot_y = \mathrm{atan2}\left(y \sin(angle) - x z (1 - \cos(angle)),\ 1 - (y^2 + z^2)(1 - \cos(angle))\right)$$

$$Rot_z = \mathrm{asin}\left(x y (1 - \cos(angle)) + z \sin(angle)\right)$$

$$Rot_x = \mathrm{atan2}\left(x \sin(angle) - y z (1 - \cos(angle)),\ 1 - (x^2 + z^2)(1 - \cos(angle))\right)$$
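The three expressions can be checked numerically with the Python sketch below (illustrative names, unit axis assumed). It compares the matrix product of the elementary rotations about y, z and x, taken in that order, with the rotation matrix built directly from the axis and angle. The conversion becomes singular when the z rotation approaches plus or minus 90 degrees, a case this sketch does not treat.

```python
import math


def axis_angle_to_euler(axis, angle):
    """Return (rot_y, rot_z, rot_x) in radians for a unit axis and an angle."""
    x, y, z = axis
    s, c = math.sin(angle), math.cos(angle)
    t = 1.0 - c
    rot_y = math.atan2(y * s - x * z * t, 1.0 - (y * y + z * z) * t)
    rot_z = math.asin(x * y * t + z * s)
    rot_x = math.atan2(x * s - y * z * t, 1.0 - (x * x + z * z) * t)
    return rot_y, rot_z, rot_x


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def euler_to_matrix(rot_y, rot_z, rot_x):
    """Matrix product R_y(rot_y) * R_z(rot_z) * R_x(rot_x)."""
    cy, sy = math.cos(rot_y), math.sin(rot_y)
    cz, sz = math.cos(rot_z), math.sin(rot_z)
    cx, sx = math.cos(rot_x), math.sin(rot_x)
    r_y = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    r_z = [[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]]
    r_x = [[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]]
    return matmul3(matmul3(r_y, r_z), r_x)


def axis_angle_to_matrix(axis, angle):
    """Rotation matrix of a unit axis and an angle (Rodrigues' formula)."""
    x, y, z = axis
    s, c = math.sin(angle), math.cos(angle)
    t = 1.0 - c
    return [[c + x * x * t, x * y * t - z * s, x * z * t + y * s],
            [x * y * t + z * s, c + y * y * t, y * z * t - x * s],
            [x * z * t - y * s, y * z * t + x * s, c + z * z * t]]
```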


3. Illustrative case: adding a single nucleotide

Here, we will get familiarized with the algorithm principles by considering the case of adding a single

nucleotide. Let us consider the following illustrative case. A DNA strand is to be elongated by 1

nucleotide. The first step is to extend the strand with a sugar backbone.

a) Backbone extension

Two cases are to be considered:

In the first case, a nucleotide is to be added to a DNA strand in the 5’-3’ direction. Hence, O3’ atom of

the extremity nucleotide is to be bound to a new backbone to be inserted in the structure.

Figure 8: 5’-3’ direction of DNA extension. A thymine nucleotide is shown in CPK representation. The O3’ binding atom for extending the strand in the 5’-3’ direction is indicated in the dashed rectangle.


In the second case, a nucleotide is to be added to a DNA strand in the 3’-5’ direction. Hence, the P atom of the extremity nucleotide is to be bound to a new backbone to be inserted in the structure.

Figure 9: 3’-5’ direction of DNA extension. A thymine nucleotide is shown in CPK representation. P binding

atom for extending the strand in the 3’-5’ direction is indicated in the dashed rectangle.

To extend DNA by one nucleotide, both cases are dealt with using the same molecular template. This template consists of the standard backbone and extended sugar geometry, containing P, O1P, O2P, O5’, C5’, C4’, O4’, C3’, O3’, C2’ and C1’ atoms. The template also includes extra dummy atoms used to perform the extension alignment.

Figure 10: Backbone extension template for both the 5’-3’ and the 3’-5’ directions of DNA extension. The three anchoring atoms in the left dashed rectangle allow a new nucleotide to be attached in the 5’-3’ direction, while the dashed rectangle on the right contains anchoring atoms for extending DNA in the alternative direction.

If DNA is to be extended in the 5’-3’ direction, then in order to bind O3’ atom of the reference nucleotide

to a new backbone, C4’, C3’ and O3’ dummy atoms of the template are aligned with C4’, C3’ and O3’

atoms of the reference.


Figure 11: Nucleotide attachment to the DNA backbone host in the 5’-3’ direction. The atoms to be

superposed are indicated by the dashed rectangle.

In the same logic, if extension is pursued in the 3’-5’ direction, the C5’, O5’ and P dummy atoms of the template are to be aligned with the corresponding reference atoms. In both cases, a template backbone is aligned with three landmark atoms of the nucleotide at the extremity of the strand to be extended.

Figure 12: Nucleotide attachment to the DNA backbone host in the 3’-5’ direction. The atoms to be

superposed are indicated by the dashed rectangle.


4. Transformations

Now, we will illustrate how the adding transformations are done, and the corresponding algorithm lines. The algorithm is coded in two languages: Perl as the host code, which makes it convenient to manipulate files and sub-programs, and Tcl as the called program in order to communicate with VMD ([Humphrey, et al., 1996]) and perform the transformations. First, the reference nucleotide is extracted from the PDB file to be extended and written to a separate file. Then the DNA extension direction is determined. This is done by looking at the extremity atoms of the reference nucleotide, and checking whether they are bound or free. Atoms $a_1$, $a_2$ and $a_3$ of the reference structure, and atoms $b_1$, $b_2$ and $b_3$ of the template, which will be aligned as $b_1$ to $a_1$, $b_2$ to $a_2$ and $b_3$ to $a_3$, are defined. For example, if the DNA direction is 5’-3’, then the O3’ atom of the reference nucleotide will be unbound, and the C4’, C3’ and O3’ atoms are to be superposed between the two structures. The order of the atoms is also significant, with $a_1$ and $b_1$ serving as the central atoms for defining the transformation vectors (explained in more detail below). Then, in order to perform the alignments, the reference and template structures are translated to the origin of the coordinates. The reference structure is translated to the origin by the translation that brings atom $a_1$ to {0, 0, 0}. Before the translation is done, the original coordinates of $a_1$ are saved, in order to reset the position once the structures have been aligned. The same operation is done with the template structure. The dummy atoms are differentiated from the other atoms by an occupancy field value of 9.0 in the PDB file. Once the two structures are translated to the origin and hence superposed via $a_1$ and $b_1$, the coordinates of the atoms are extracted as $(A_x, A_y, A_z)$, $(B_x, B_y, B_z)$, $(C_x, C_y, C_z)$, $(E_x, E_y, E_z)$, $(F_x, F_y, F_z)$ and $(G_x, G_y, G_z)$ for atoms $a_1$, $a_2$, $a_3$, $b_1$, $b_2$ and $b_3$ respectively.
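The selection of dummy atoms through the occupancy field can be sketched as follows. The original implementation is in Perl and Tcl; this Python version is only an illustration with a hypothetical function name, and it relies on the fixed column layout of PDB ATOM records (atom name in columns 13-16, coordinates in columns 31-54, occupancy in columns 55-60).

```python
def split_dummy_atoms(pdb_lines, dummy_occupancy=9.0):
    """Separate dummy atoms from regular atoms of a template.

    Returns two dictionaries mapping atom names to (x, y, z) tuples:
    the first for atoms whose occupancy equals dummy_occupancy, the
    second for all other atoms. Two dictionaries are used because a
    dummy atom may carry the same name as a regular atom.
    """
    dummies, regular = {}, {}
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        name = line[12:16].strip()
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        occupancy = float(line[54:60])
        target = dummies if abs(occupancy - dummy_occupancy) < 1e-6 else regular
        target[name] = xyz
    return dummies, regular
```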

The first rotation is then performed in order to align the normal vectors defined by the three atoms of the reference and template structures respectively, and to bring the structures into the same plane. The normal vectors $\boldsymbol{n_1}$ and $\boldsymbol{n_2}$ are calculated as the cross products $\overrightarrow{a_1 a_2} \times \overrightarrow{a_1 a_3}$ and $\overrightarrow{b_1 b_2} \times \overrightarrow{b_1 b_3}$ respectively. The axis of rotation is given by the cross product of the two normal vectors, and the angle of rotation is calculated from the dot product of $\boldsymbol{n_1}$ and $\boldsymbol{n_2}$.

$$n_{1x} = (B_y - A_y)(C_z - A_z) - (B_z - A_z)(C_y - A_y)$$

$$n_{1y} = (B_z - A_z)(C_x - A_x) - (B_x - A_x)(C_z - A_z)$$

$$n_{1z} = (B_x - A_x)(C_y - A_y) - (B_y - A_y)(C_x - A_x)$$

$$n_{2x} = (F_y - E_y)(G_z - E_z) - (F_z - E_z)(G_y - E_y)$$

$$n_{2y} = (F_z - E_z)(G_x - E_x) - (F_x - E_x)(G_z - E_z)$$

$$n_{2z} = (F_x - E_x)(G_y - E_y) - (F_y - E_y)(G_x - E_x)$$


Let $(x, y, z)$ be the axis vector components, given by the normalized cross product $\boldsymbol{n_1} \times \boldsymbol{n_2}$:

$$x = n_{1y} n_{2z} - n_{2y} n_{1z}$$

$$y = n_{1z} n_{2x} - n_{2z} n_{1x}$$

$$z = n_{1x} n_{2y} - n_{2x} n_{1y}$$

$$norm = (x^2 + y^2 + z^2)^{0.5}$$

$$x = x / norm, \qquad y = y / norm, \qquad z = z / norm$$

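The angle of rotation is given by the dot product of the normalized normal vectors:

$$|\boldsymbol{n_1}| = (n_{1x}^2 + n_{1y}^2 + n_{1z}^2)^{0.5}$$

$$n_{1x} = n_{1x} / |\boldsymbol{n_1}|, \qquad n_{1y} = n_{1y} / |\boldsymbol{n_1}|, \qquad n_{1z} = n_{1z} / |\boldsymbol{n_1}|$$

$$|\boldsymbol{n_2}| = (n_{2x}^2 + n_{2y}^2 + n_{2z}^2)^{0.5}$$

$$n_{2x} = n_{2x} / |\boldsymbol{n_2}|, \qquad n_{2y} = n_{2y} / |\boldsymbol{n_2}|, \qquad n_{2z} = n_{2z} / |\boldsymbol{n_2}|$$

$$cos = n_{1x} n_{2x} + n_{1y} n_{2y} + n_{1z} n_{2z}$$

$$\theta = \mathrm{atan2}\left((1 - cos^2)^{0.5},\ cos\right)$$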
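The computation of the normal vectors, rotation axis and rotation angle can be condensed in the following Python sketch (illustrative names; the original code is written in Perl and Tcl). The sine of the angle is taken here as the norm of the cross product of the two unit normals, which for unit vectors equals the square root of one minus the squared cosine.

```python
import math


def sub(u, v):
    return [u[i] - v[i] for i in range(3)]


def dot(u, v):
    return sum(u[i] * v[i] for i in range(3))


def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def unit(u):
    n = math.sqrt(dot(u, u))
    return [c / n for c in u]


def rotation_axis_angle(a1, a2, a3, b1, b2, b3):
    """Axis (n1 x n2, normalized) and angle between the two triangle normals."""
    n1 = unit(cross(sub(a2, a1), sub(a3, a1)))
    n2 = unit(cross(sub(b2, b1), sub(b3, b1)))
    axis = cross(n1, n2)
    sin_t = math.sqrt(dot(axis, axis))
    cos_t = dot(n1, n2)
    theta = math.atan2(sin_t, cos_t)
    if sin_t > 1e-12:  # parallel normals have no defined axis
        axis = [c / sin_t for c in axis]
    return axis, theta
```

For a reference triangle lying in the xy plane and a template triangle lying in the xz plane, the function returns the x axis and an angle of 90 degrees.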

Finally, we can calculate the Euler angle rotation components (derived from quaternions). Because of the coordinate reference standards used in VMD, where the transformations are executed, the $x$, $y$ and $z$ rotation components are multiplied by -1.

$$s = \sin(\theta), \qquad c = \cos(\theta), \qquad t = 1 - \cos(\theta)$$

$$Rot_y = -\mathrm{atan2}\left(y s - x z t,\ 1 - (y^2 + z^2) t\right)$$

$$Rot_z = -\mathrm{asin}\left(x y t + z s\right)$$

$$Rot_x = -\mathrm{atan2}\left(x s - y z t,\ 1 - (x^2 + z^2) t\right)$$


Executing the above rotation angles around the y, z and x axes successively aligns the normal vectors (Figure 13).

Figure 13: Schematic diagram of the first rotation transformation to align a nucleotide backbone to be

incorporated on DNA 5’ end. The figures on the first row show the original out of plane orientation of the

template backbone, represented by the three atoms to be aligned b1, b2 and b3 with the reference atoms a1,

a2 and a3. Normal vectors of the template and reference structures are n2 and n1 respectively. The figures

on the second row depict the in-plane alignment of the template with the reference backbone after rotation

1.

For the two structures to share the same orientation, the vectors $\mathbf{t}_1 = \overrightarrow{a_1 a_3}$ and $\mathbf{t}_2 = \overrightarrow{b_1 b_3}$ are aligned through a second rotation transformation. The new coordinates of the template atoms (after rotation 1) are extracted, and the second rotation is computed. The new coordinates of the template are also used to check the precision of the first alignment: the 𝒏𝒆𝒘 𝒏𝟐 vector is calculated, and a parallelism score is computed as the dot product of 𝒏𝟏 and 𝒏𝒆𝒘 𝒏𝟐 (equal to 1 for perfectly parallel unit vectors). This is done only to verify the algorithm. The axis vector, rotation angle and Euler angle rotation components are calculated as for rotation 1. The second rotation is then executed (Figure 14).
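The two successive rotations and the parallelism check can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation used in this work: the rotations are applied directly with Rodrigues' formula instead of Euler angle components executed in VMD, both triplets are assumed to be already translated so that their first atom lies at the origin, and all function names are hypothetical.

```python
import math

def sub(a, b):
    return tuple(p - q for p, q in zip(a, b))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(v):
    mag = math.sqrt(dot(v, v))
    return tuple(c / mag for c in v)

def rotation_onto(src, dst):
    """Unit axis and angle rotating unit vector src onto unit vector dst."""
    axis = cross(src, dst)
    sin_t = math.sqrt(dot(axis, axis))
    cos_t = dot(src, dst)
    if sin_t < 1e-12:
        if cos_t < 0.0:
            raise ValueError("antiparallel vectors are not handled in this sketch")
        return (1.0, 0.0, 0.0), 0.0          # already aligned: no rotation
    return tuple(c / sin_t for c in axis), math.atan2(sin_t, cos_t)

def rotate(p, axis, theta):
    """Rodrigues rotation of point p about a unit axis through the origin."""
    c, s = math.cos(theta), math.sin(theta)
    kxp, kdp = cross(axis, p), dot(axis, p)
    return tuple(p[i] * c + kxp[i] * s + axis[i] * kdp * (1.0 - c)
                 for i in range(3))

def plane_normal(pts):
    """Unit normal of the plane through three points."""
    return unit(cross(sub(pts[1], pts[0]), sub(pts[2], pts[0])))

def align_triplet(a, b):
    """Align template atoms b onto reference atoms a (a[0] = b[0] = origin)."""
    n1 = plane_normal(a)
    axis, theta = rotation_onto(plane_normal(b), n1)
    b = [rotate(p, axis, theta) for p in b]           # rotation 1: normals
    score = dot(n1, plane_normal(b))                  # parallelism score
    t1, t2 = unit(sub(a[2], a[0])), unit(sub(b[2], b[0]))
    axis, theta = rotation_onto(t2, t1)
    b = [rotate(p, axis, theta) for p in b]           # rotation 2: t vectors
    return b, score
```

A parallelism score close to 1 indicates that rotation 1 brought the template plane onto the reference plane, so that rotation 2 only has to turn the template within that plane.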


Figure 14: Schematic diagram of the second rotation transformation to align a nucleotide backbone to be

incorporated on DNA 5’ end. The figures on the first row show the original orientation of the

template backbone, represented by the three atoms to be aligned b1, b2 and b3 with the reference atoms a1,

a2 and a3. Vectors of the template and reference structures to be aligned are t2 and t1 respectively. The

figures on the second row depict the alignment of the template with the reference backbone after rotation

2.

Finally, a translation transformation is applied so as to bind the template backbone. The new coordinates of the template are extracted, and the previous transformation (rotation 2) is assessed by checking how well the structures are superposed. Because both structures were initially translated to the origin via 𝑎1 and 𝑏1 respectively, and because the template geometry corresponds to the reference, the structures are superposed after the three previous transformations (translation to origin, rotation 1, rotation 2). They share the same orientation, and 𝑎1 and 𝑏1 are virtually perfectly superposed. However, the pairs 𝑎2, 𝑏2 and 𝑎3, 𝑏3 are not exactly superposed, as the template represents a standardized geometry and does not correspond exactly to the reference (which comes from the initial crystal coordinates). The final translation is calculated as the vector from atom 𝑏2 to the position of atom 𝑎2 of the reference structure before the initial transformation, i.e. its original position. It results in the superposition of the dummy atom O3’ with the reference atom O3’, and hence in the binding of the new backbone.
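The final translation amounts to a single vector subtraction followed by a shift of every template atom. A minimal sketch, with hypothetical function names and coordinates given as (x, y, z) tuples:

```python
def translation_vector(b2_aligned, a2_original):
    """Vector that moves the aligned template atom b2 onto the original a2."""
    return tuple(a - b for a, b in zip(a2_original, b2_aligned))

def translate(atoms, shift):
    """Apply the same shift to every atom of the template."""
    return [tuple(c + d for c, d in zip(atom, shift)) for atom in atoms]

# Example with made-up coordinates: the template sits at the origin after the
# two rotations, while the reference atom a2 was originally at (11.5, 4, -2).
template = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
shift = translation_vector(template[1], (11.5, 4.0, -2.0))
attached = translate(template, shift)
```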


Figure 15: Translation transformation attaching the aligned backbone to DNA 5’ end. The superposed atoms

resulting from translating the template O3’ dummy atom onto the DNA 5’ end O3’ atom are indicated by the

dashed rectangle.

b) Inserting the base group

Once the DNA strand has been extended with a new backbone, the next step is to attach a new base

group on the C1’ (host atom) of the backbone sugar.

Figure 16: DNA nucleotide and backbone references to attach a new base group on the 5’ end. The atom

shown in lime is the attachment point of a new base to the host reference backbone, while the nucleotide

indicated in grey is the extremity nucleotide reference.

The DNA direction is extracted and is used at a final stage to determine whether the base is to be laterally shifted by + or – 34.2 degrees (B-DNA consecutive base shift). The same strategy as above is employed, except that the atoms to be aligned are specified as follows. If the reference or template base type is G or A, then the atom types of 𝑎1, 𝑎2 and 𝑎3 (respectively 𝑏1, 𝑏2 and 𝑏3) are C2, C4 and C6 respectively. Alternatively, if the base type is T or C, then the atom types are taken in the order C2, C6 and C4. In doing so, the bases can be aligned properly. For example, when aligning G with A or G with G, C2, C4 and C6 are respectively superposed, whereas when aligning G with T or C, C2, C4 and C6 of G are respectively superposed with C2, C6 and C4 of T or C.
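The landmark selection rule can be written as a small lookup. The sketch below is illustrative; the function names are hypothetical and only the four DNA base types named in the text are handled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"T", "C"}

def landmark_atoms(base_type):
    """Atom names used as (1, 2, 3) landmarks for a given base type."""
    if base_type in PURINES:
        return ("C2", "C4", "C6")
    if base_type in PYRIMIDINES:
        return ("C2", "C6", "C4")
    raise ValueError("unsupported base type: " + str(base_type))

def landmark_pairs(reference_base, template_base):
    """Pairs (reference atom, template atom) to be superposed."""
    return list(zip(landmark_atoms(reference_base),
                    landmark_atoms(template_base)))
```

For instance, `landmark_pairs("G", "T")` pairs C4 of G with C6 of T, and C6 of G with C4 of T, as in the example given in the text.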


After performing the same steps as previously (insertion of the template and translations to origin, etc.),

rotation 1 is performed (Figure 17).

Figure 17: Schematic diagram of the first rotation transformation to align a nucleotide base group to be

incorporated on DNA 5’ end. The figures on the first row show the original orientation of the

template base group, represented by the three atoms to be aligned b1, b2 and b3 with the reference atoms

a1, a2 and a3. Normal vectors of the template and reference structures are n2 and n1 respectively. The

figures on the second row depict the in-plane alignment of the template with the reference base group after

rotation 1.


Then rotation 2 is performed. The only difference from the backbone alignment procedure is that the template base plane is tilted laterally (around its normal vector) by + or – 34.2 degrees relative to the plane of the reference base. The alignment angle is calculated as represented in Figure 18, but is incremented by +/- 34.2 degrees (not represented) to take the tilt into account.

Figure 18: Schematic diagram of the second rotation transformation to align a nucleotide base group to be

incorporated on DNA 5’ end. The figures on the first row show the original orientation of the

template base group, represented by the three atoms to be aligned b1, b2 and b3 with the reference atoms

a1, a2 and a3. Normal vectors of the template and reference structures are n2 and n1 respectively. The

figures on the second row depict the in-plane alignment of the template with the reference base group after

rotation 2.


Finally, the base is attached to the sugar by computing the translation of the template dummy atom C1’ to the backbone host attaching atom C1’, as represented in Figure 19.

Figure 19: Schematic diagram of the translation transformation attaching a new base group to DNA 5’ end

backbone. The template base group is shown in silver, while the reference nucleotide is in grey. A: Position

of the aligned base group resulting from rotations 1 and 2. The translation target is represented by the

atom colored in lime. B: Position of the base group attached to DNA after translation transformation.


5. Principle application: constructing a complete EC

The missing nucleotides are represented in Figure 20 and listed in Tables 3 and 4.

Figure 20: Schematic diagram of missing nucleotides in PDB#2E2H. The upstream and downstream

bubbles are indicated. tDNA, ntDNA and RNA are in light blue, cyan and lime ribbon representation

respectively. The red dashed rectangles represent the register rank to be extended, except for tDNA i-5

where the register is indicated for positional comparison with ntDNA. RNA exit channel is indicated by the

green arrow.


Register (i +/-)    RNA strand
                    Type    Index
   0                A       18
  -1                G       17
  -2                G       16
  -3                A       15
  -4                G       14
  -5                A       13
  -6                G       12
  -7                C       11
  -8                U       10
  -9                A        9
 -10                C        8
 -11                U        7
 -12                A        6
 -13                G        5
 -14                C        4
 -15                G        3
 -16                G        2
 -17                U        1

Table 3: RNA nucleotides to be added. RNA strand nucleotide types and register ranks are indicated.

Numbers in green indicate existing nucleotides, while red indexes indicate the nucleotides to be added. 5’-

3’ direction is given by the ascending index order. RNA registers are listed from the downstream to the

upstream direction.


Register (i +/-)    T strand (D*)       NT strand (D*)
                    Type     Index      Type    Index
  21                G        19         C       96
  20                T        20         A       95
  19                A        21         T       94
  18                C        22         G       93
  17                T        23         A       92
  16                A        24         T       91
  15                C        25         G       90
  14                C        26         G       89
  13                G        27         C       88
  12                A        28         T       87
  11                T        29         A       86
  10                A        30         T       85
   9                A        31         T       84
   8                G        32         C       83
   7                C        33         G       82
   6                A        34         T       81
   5                G        35         C       80
   4                A *C     36         G       79
   3                C        37         G       78
   2                G *C     38         G       77
   1                C        39         G       76
   0                T        40         A       75
  -1                C        41         G       74
  -2                C        42         G       73
  -3                T        43         A       72
  -4                C        44         G       71
  -5                T        45         A       70
  -6                C        46         G       69
  -7                G        47         C       68
  -8                A        48         T       67
  -9                T        49         A       66
 -10                G        50         C       65
 -11                A        51         T       64
 -12                T        52         A       63
 -13                C        53         G       62
 -14                A        54         T       61
 -15                T        55         A       60
 -16                C        56         G       59
 -17                T        57         A       58

Table 4: DNA nucleotides to be added. tDNA and ntDNA nucleotide types and register ranks are indicated.

Numbers in green indicate existing nucleotides, while red indexes indicate the nucleotides to be added. 5’-

3’ direction is given by the ascending index order. DNA registers are listed from the downstream to the

upstream direction. The purple letters (preceded by an asterisk) indicate the existing nucleotides to be mutated to cytosine to allow

GTP substrate pre-binding in MD simulations.


We begin by reconstructing the DNA belonging to the downstream bubble. Instead of adding only the missing nucleotides, the whole double helix from i + 21 to i + 5 is inserted, so that a single superposition is sufficient. A perfect B-DNA double helix, whose sequence corresponds to Table 4, is constructed with the nab tool of the Amber package. The perfect helix consists of chain M (5’-3’, resid 1 to 16) and chain O (3’-5’, resid 17 to 2; starting at 2 instead of 1 in order to include the P atom of the first residue, the strand direction being 3’-5’). Then, using a modification of the backbone algorithm, three atoms of the DNA template are aligned with three landmarks belonging to the original structure. The three reference atoms of PDB#2E2H are resid 9, chain T, atom P; resid 13, chain T, atom O3’; and resid 2, chain N, atom C3’. The logic for choosing these landmarks is that they belong to the backbone, and that they are distant from each other, which reduces noise, but not too far from the center of the protein, in order to reduce the uncertainty due to crystal packing distortion. Two of the landmarks are close to the binding register (i + 5), and one of them, 13:O3’, is the binding atom. Landmark 13:O3’ of the reference and 16:O3’ of the template are superposed so as to bind the refitted and extended downstream helix directly to i + 5. Several landmark combinations were tested, and the one giving the best superposition score was retained.

Registers covered: i + 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6

Structure    Chain    Anchoring atoms
2e2h         T        9:P, 13:O3’
2e2h         N        2:P
nab          M        12:P, 16:O3’
nab          O        2:P

Table 5: Alignment of an entire template helix to three reference anchoring points. The template helix atoms

to be superposed are indicated in the « nab » rows, while the reference anchoring atoms are listed in the

« 2e2h » rows. The atoms indicated in blue, red and purple, are to be respectively superposed.


Figure 21: Comparison fit between initial downstream tDNA structure and superposed extended helix.

Template and initial structure tDNA strands are represented in green and light blue respectively. Reference

nucleotide attaching the extended template is shown in yellow.

Figure 22: Comparison fit between initial downstream ntDNA structure and superposed extended helix.

Template and initial structure ntDNA strands are represented in green and cyan respectively. Reference

nucleotide attaching the extended template is shown in yellow.


Figure 23: Visualization of downstream DNA reconstruction. tDNA, ntDNA and protein walls are shown as

grey, light blue and cyan surfaces respectively.

The tDNA and ntDNA i + 5 registers are kept from the original structure, so that the missing nucleotides to be added are ntDNA registers i + 4 to i – 17 and tDNA registers i – 10 to i – 17. The extension is split into two procedures: extension of ntDNA from i + 4 to i – 9, and extension of both the t and ntDNA strands from i – 10 to i – 17. The reason for this split is that the chains reanneal at i – 10.

The next step of the reconstruction procedure is to add the missing ntDNA nucleotides from i + 4 to i – 9. The goal is to fit the extra segment in the protein, such that it connects to i + 5 at one extremity and approaches i – 9 at the other for reannealing. The ntDNA i + 5 position lacks the P, O1P and O2P atoms because it is at the 5’ extremity. Hence, in order to extend the strand in the 5’ direction, a dummy nucleotide is superposed with i + 5, and the coordinates of its P, O1P and O2P atoms are copied into the reference structure. Then, the task of adding nucleotides from i + 4 to i – 9 is pursued. This task is a special case compared to standard geometrical fitting, because the conformation of the ntDNA is not canonical and diverges far from a B-DNA helix. Indeed, the path of the strand is not helical, because it is distorted inside the protein before it reassociates with the t strand outside of the enzyme.

In order to solve this problem, let us focus on the basic requirement. Ignoring clashes with the protein, the first requirement is that a DNA strand of 14 nucleotides has to be inserted such that it starts at the extremity of i + 5 and ends approximately in front of, and one base before, tDNA register i – 9. To satisfy this requirement, the following trick is applied. The tDNA segment running from i + 4 to i – 9, which is provided in the initial structure, almost perfectly covers the distance between the two aforementioned landmarks over a length of 14 bases. Because the t and ntDNA i + 4 nucleotides face each other, as do the i – 9 registers, the preceding nucleotides must be close to each other for the DNA to reanneal at i – 10. Thus the trick is to use the tDNA structure from i + 4 to i – 9 as the starting template guess for the extension of the nt strand, and to insert it between the landmarks so as to cover the nucleotide distance ranging from the extremity of ntDNA i + 5 to the front of tDNA i – 9. Furthermore, it is to be noted that tDNA roughly follows an elbow-shaped path; hence, for ntDNA to run between roughly the same starting and ending points, the elbow-shaped structure of tDNA is to be inverted. In addition, this places ntDNA in the right 3’-5’ direction. Figure 24 describes this concept.

Figure 24: Initial fitting of upstream ntDNA. Initial tDNA region to match template upstream ntDNA is

indicated in the left figure by a dashed circle. The right figure displays the initial insertion of upstream

ntDNA (dashed circle) to fit corresponding tDNA from the start and end association areas. Refitted tDNA

and ntDNA derived previously are indicated in light blue and cyan, while template upstream ntDNA is

represented in yellow.

Fitting the nt strand using the inverted path of the t strand gives the result displayed in Figure 25, where the path of the strand has very few vdw clashes with the protein.


Figure 25: Visualization of the initial fitting of ntDNA template relative to the enzymatic structure. Existing

tDNA and ntDNA strands are in light blue and cyan respectively, fitted ntDNA is in yellow and protein walls

are in grey. A: Side view. B: Front view.


After the initial insertion, a few adjustments are made to improve the path of the strand, minimizing vdw contacts and orienting the strand between the starting and ending landmarks, especially for the segment that binds at i + 4. These are done manually in VMD, by closely supervising the structure. The optimized geometry is depicted in Figures 26 and 27 below.

Figure 26: Second fitting of upstream ntDNA. The path of the template is modified to connect to

downstream ntDNA around register i + 4. Refitted tDNA and ntDNA derived previously are indicated in light

blue and cyan, while template upstream ntDNA is represented in yellow.


Figure 27: Visualization of the second fitting of ntDNA template relative to the enzymatic structure. Existing

tDNA and ntDNA strands are in light blue and cyan respectively, fitted ntDNA is in yellow and protein walls

are in grey. A: Side view. B: Front view.

The adjusted geometry appears in reasonable agreement with Andreacka et al.’s fluorescent probing of

ntDNA [Andreacka, et al., 2009].


Then, ntDNA bases are mutated into the right types, using the same procedures used for the base group

alignment and the result is displayed in Figure 28.

Figure 28: Mutation of ntDNA template nucleotides to match Table 4 sequence. The nucleic acid strand

used for insertion geometry alignment is modified to the wanted sequence. The mutated base groups (blue)

are aligned to the groups to be replaced, belonging to the reference strand (light blue).


Then, the t and ntDNA i – 10 to i – 17 and RNA i – 9 to i – 17 portions are inserted using the same procedures as for the sugar and base insertions explained previously, and are manually adjusted in VMD to minimize vdw contacts and optimize their path, such that the exiting upstream DNA helix is approximately helical and the RNA is extruded through the RNA exit channel.

Figure 29: Fitting of missing RNA nucleotides. The initial RNA strand (lime) is prolonged by aligned

template nucleotides forming the yellow strand. A: Enzyme-free view. B: Visualization of the extension

relative to the protein (grey).


Finally, the potential energy is minimized by running ten rounds of minimization 1 (10 kcal·mol⁻¹ restraint on protein residues) and minimization 2 (whole system minimized), in order to refine the geometry of the nucleic acid frame, and notably to create the correct bonding distances.

Figure 30: vdw representation of the full nucleic complex before potential energy minimization.

Figure 31: vdw representation of the full nucleic complex after potential energy minimization.


6. Closing remarks

Several mathematical methods have been investigated in the biosciences field to characterize the helix geometry occurring in nanostructures [McLahan, 1979; Aqvist, 1986; Kahn, 1988; Christopher, et al., 1996; Lu, et al., 2003; Dalton, et al., 2003; Lee, et al., 2007; Enkbayar, et al., 2008; Kumar, et al., 2012; Bansal, et al., 2012]. In this section, helix geometry was recreated by fitting an optimal template to three landmark atoms using 3D rotations. The advantage of the method presented here is that its minimum requirement as starting data is three atoms that are not necessarily consecutive but whose registers are known, whereas other methods require at least four consecutive atoms. Nevertheless, it is worth inspecting the alternative procedures outlined in the above references in order to identify what could be optimized.

Observation of the outcome of the EC reconstruction shows that, while refining the potential energy (minimization) works, the nucleic acids from i - 10 to i - 17 seem to have a merely satisfactory conformation, because the strands do not form a well-defined double helix. Indeed, potential energy refinement is a very efficient method, but relies on algorithms that can get stuck in local minima that are too high. For example, minimizing the same structure presented in this section, but with many surrounding metabolites and with the 12-6-4 potential, did not allow a satisfactory minimization of the nucleic acids, because the initial system was too far from relaxation. Hence, not only to further refine the structure before minimization, but also to port the method presented here to a more complex system, the mathematical refinement methods referenced above could be investigated.

For example, for segments adopting a conventional shape, investigating mathematical tools for extracting helical parameters could be of interest. In particular, the non-linear optimization procedure presented in [Enkbayar, et al., 2008] seems to be the best tool so far to derive, notably, a helix axis, and could be used to refine the positioning of missing nucleotides from the information present in the atomic coordinates of the initial crystal structure. This method works in three steps.

First, the function 𝑓1 is minimized, i.e. the seven variables (two vectors, one radius) for which the “energy” of the function is minimal are calculated. The second step is to calculate the helix pitch. Then the function 𝑓2 is minimized over its eleven parameters (two 3D vectors, two 3D points, one radius) simultaneously, using 𝑃, 𝑟, 𝒂 and 𝒐 from steps 1 and 2 as the starting guess.

Where:

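\[ f_1(r, \mathbf{a}, \mathbf{o}) = \sum_{i}^{N} \left( \left| \mathbf{x}_i - \mathbf{o} - (\mathbf{x}_i \cdot \mathbf{a})\,\mathbf{a} \right| - r \right)^2 , \]

\[ f_2(P, r, \mathbf{a}, \mathbf{o}, t_0) = \sum_{i}^{N} \left| \mathbf{x}_i - \left( \mathbf{o} + \mathbf{a} P t - r\left( \mathbf{v}\cos(t) + \mathbf{w}\sin(t) \right) \right) \right|^2 \]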

And:

𝑟 is the helix radius,

𝒂 is the helix axis direction vector,

𝒐 is the perpendicular vector from the coordinate origin (0,0,0) to the starting of the helix axis,

𝒙𝒊 is the ith data point vector (vector from the origin to the ith point belonging to the helix),

𝑃 is the helix pitch,

𝒗 is a unit vector perpendicular to 𝒂,

𝒘 is a unit vector perpendicular to 𝒂 and 𝒗,

𝑡 is an independent variable representing the rotation angle around 𝒂,

𝑡0 is the first data point (the first point lying at the beginning of the helix verifies 𝒐 + 𝒂𝑃𝑡 −

𝑟(𝒗𝑐𝑜𝑠(𝑡) + 𝒘𝑠𝑖𝑛(𝑡)) = 𝑡0).
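To make the role of 𝑓1 concrete, the sketch below evaluates it on a synthetic helix generated from the parametric form used in 𝑓2. It only evaluates the objective and does not perform the non-linear minimization; the function names and the numerical values are illustrative assumptions, and 𝒐 is chosen perpendicular to 𝒂 as in the definition above.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def helix_points(o, a, v, w, r, pitch, angles):
    """Points o + a*P*t - r*(v*cos(t) + w*sin(t)) for each angle t."""
    return [tuple(o[k] + a[k] * pitch * t
                  - r * (v[k] * math.cos(t) + w[k] * math.sin(t))
                  for k in range(3))
            for t in angles]

def f1(points, r, a, o):
    """Sum of squared deviations of the point-to-axis distances from r."""
    total = 0.0
    for x in points:
        xa = dot(x, a)
        radial = [x[k] - o[k] - xa * a[k] for k in range(3)]
        total += (math.sqrt(dot(radial, radial)) - r) ** 2
    return total

# Synthetic helix: axis along z, radius 10, 20 points.
a, v, w = (0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
o = (1.0, 2.0, 0.0)
pts = helix_points(o, a, v, w, 10.0, 0.54, [0.6 * i for i in range(20)])
exact = f1(pts, 10.0, a, o)      # vanishes for the true parameters
biased = f1(pts, 11.0, a, o)     # each point contributes (10 - 11)^2 = 1
```

With the true axis, origin vector and radius the function vanishes, whereas an error of 1 on the radius adds 1 per data point, which is the behaviour a minimizer would exploit.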


Chapter 4

Advanced Characterization of the Diffusional Pathways


1. Introduction

For advanced characterization of the diffusion process, the meaningful parameters to be extracted can be divided into two categories: the conformational contribution and the long-range interaction contribution. In this section, a novel algorithm for extracting the diffusive cross section areas along pathways, together with other useful parameters, is presented. We then focus our investigation on how to characterize non-bonded phenomena.

2. Geometric pathway analysis

2.1. Introduction

In order to characterize how the geometry of the pathways impacts nucleotide diffusion, parameters of particular interest include the pathway axis (which defines a protein-free central trajectory) and the cross section area. There exist tools such as CAVER 3.0 [Chovancova, et al., 2012; Kozlikova, et al., 2014; Pavelka, et al., 2016], PoreWalker [Pellegrini-Calace, et al., 2009] and MolAxis [Yaffe, et al., 2008] that propose automated analysis of pathways in proteins. However, these tools are based on algorithms that either function poorly or do not allow extraction of a physically sound cross section area. It is therefore necessary to investigate how to express the parameters of the channels mathematically in a rigorous manner, in order to be able to state, for example, that CH3 is wider than CH2 and hence offers greater accessibility.

The task of mathematically expressing the parameters of protein pathways in space is not straightforward. The shape of the pathways in proteins generally does not exhibit a canonical geometry but can be very irregular. Furthermore, defining a surface or a volume with atoms poses an additional issue: the true dimensions of an object composed of atoms are not derived directly from the coordinates of the atomic centers; rather, the true shape is given by the electromagnetic contour, which can be represented by the van der Waals (vdw) radius. Let us consider this issue more closely. Let us assume that an arbitrary pathway lies in space, and let an axis a traverse the pathway. Let us assume that the cross section area is to be calculated at a given point along the axis. The task has the following difficulties. First, the diffusive cross section is only defined by the inner surface contour, thus only the atoms whose vdw spheres are closest to the inside of the pathway are to be taken into consideration (Figure 33). Second, investigating the lateral component of the pathway in two dimensions with a cross section plane is not satisfactory: because of the vdw radius, atoms that lie just in front of or behind the plane also affect the inner cross section area of the plane (Figure 34). In other words, because of the vdw radius, atoms must be represented as spheres, and hence a third, depth dimension is at play that affects the lateral contour (Figure 32).


Figure 32: Schematic diagram of the main dimensions of a pathway.

Combining these two issues means that, for any lateral direction, only the atomic sphere that is laterally closest to the inside of the pathway, and that lies within a certain longitudinal vdw threshold, contributes to the inner dimensions of the channel. An important fact to underline is that extracting only the interlining atomic contributions allows calculation of the correct axial geometric center, whereas including extra atoms in the atomic selection can severely bias the calculation (Figure 33).

Figure 33: Schematic diagram of a pathway cross section layer. Spheres in cyan represent the cross section

selection of atoms of a channel. Left: geometric center of the pathway (blue) is erroneous if not excluding

the outer-lining atoms. Right: geometric center is correct when excluding outer-lining atoms (red).


There is also another problem to solve: defining the right axial direction. Defining the diffusive axis as a single straight line across a pore is erroneous, because the lateral width of an irregular cavity is to be defined as the biggest lateral void dimension along the pathway. Let us take the following example. If the axis of a given pathway is defined as a straight line running from the start to the end of the structure, then the cross section area, defined in the plane perpendicular to that axis, will not characterize the real accessibility. It is more accurate to define a readjusted axis along the pathway, so that it is orthogonal to the lateral contour offering the biggest accessibility (Figure 34).

Figure 34: Pathway axis of an irregular channel. Left: if the diffusive cross section is defined as the plane

(red rectangle) orthogonal to a fixed axis (solid arrow) from the start to the end of the pathway, then the

cross-sectional area will be erroneous. Right: Correct non-fixed pathway axis defining diffusive cross section

areas.


2.2. Principle of the algorithm

The main issues expressed above allow the task to be defined more precisely. Hereafter, the main principles of an algorithm that solves this task are explained. A way to tackle the issue is to imagine that one is looking axially towards a pore (Figure 35).

Figure 35: Schematic diagram of the visualization through a pathway. Figures on the left indicate the

visualization direction towards a pathway represented as a tube lying in space. Figures on the right indicate

variation of the void space projected in front. A: out of axis direction. B: axial direction.

The visualization angle displaying the biggest opening gives the correct accessibility direction. To develop this concept further, let the eye be replaced by a plane onto which the pathway points lying immediately in front are projected. One way to define the best accessibility direction is as the plane direction for which the projected points have the biggest minimal atomic distance to the inner contour center, compared with the other directions (Figure 36). This is in fact a simplification, as the best accessibility direction will be assessed with a radius in 3D and not only from the projections above. A more precise definition is that the best accessibility direction is the projection for which the contour geometric center has the biggest minimal distance to any other atom of the pathway.
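The selection criterion can be illustrated with a short sketch. It assumes that the contour points of each tested direction have already been projected into 2D plane coordinates; the function names and the dictionary layout are hypothetical, and vdw radii are ignored here for simplicity.

```python
import math

def geometric_center(contour):
    """Mean position of a list of 2D contour points."""
    n = len(contour)
    return (sum(p[0] for p in contour) / n, sum(p[1] for p in contour) / n)

def min_distance_to_center(contour):
    """Smallest distance from the contour geometric center to a contour point."""
    cx, cy = geometric_center(contour)
    return min(math.hypot(p[0] - cx, p[1] - cy) for p in contour)

def best_direction(contours):
    """Label of the tested direction with the biggest minimal distance."""
    return max(contours, key=lambda label: min_distance_to_center(contours[label]))

# Two tested directions: "A" sees a squashed opening, "B" a round one.
contours = {
    "A": [(3.0, 0.0), (-3.0, 0.0), (0.0, 1.0), (0.0, -1.0)],
    "B": [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)],
}
winner = best_direction(contours)
```

Direction "A" has the larger maximal width but a minimal distance of only 1, so "B" is retained, which reflects the idea that accessibility is limited by the narrowest lateral clearance.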


Figure 36: Projection of pathway points onto a tested direction. Figures on the left represent a tested axial

direction of a pathway with a plane. Figures on the right correspond to the projection of the atoms belonging

to a channel (grey) and lying immediately in front of the plane, onto the tested plane. Optimal axial direction

(B) gives a minimal distance to the interlining cluster of atoms center greater than the wrong tested direction

(A).


The algorithm starts from an initial direction defined by a pathway start guess point and a pathway end guess point. This is the only user input required, i.e. 6 values (the x, y and z coordinates of the two guesses). It would theoretically be possible to require no user input by generating the guess points automatically, for example by detecting the borders between protein and protein-free regions using mathematical convolutions. An even simpler way would be to map the entire protein with a series of adjacent spheres. The spheres that contain fewer atoms than a threshold value are then selected as void cavities, and void cavities that are adjacent to each other are grouped to define a linked void area, or in other words a pathway. This extra complexity is however unnecessary for our investigation.

The axis is scanned by tilting around the initial direction. Each scanned direction is represented by a vector and the initial point. For every direction, the points that belong to a 3 Å window in front are projected onto the plane defined by the direction vector and the initial point (Figure 37).
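The window selection and projection can be sketched as follows. This is a minimal illustration with a hypothetical function name; the window is taken as the slab of points whose depth along the scan direction lies between 0 and 3 Å, and any lateral restriction of the selection is left out.

```python
import math

def project_window(points, origin, direction, window=3.0):
    """Project onto the scan plane the points lying within `window` in front.

    The plane passes through `origin` and is normal to `direction`.
    """
    mag = math.sqrt(sum(c * c for c in direction))
    n = tuple(c / mag for c in direction)
    projected = []
    for p in points:
        # Signed depth of the point along the scan direction.
        depth = sum((p[k] - origin[k]) * n[k] for k in range(3))
        if 0.0 <= depth <= window:
            projected.append(tuple(p[k] - depth * n[k] for k in range(3)))
    return projected

atoms = [(1.0, 1.0, 1.0),    # 1 A in front: kept
         (1.0, 0.0, 5.0),    # 5 A in front: outside the window
         (0.0, 1.0, -1.0)]   # behind the plane: discarded
in_plane = project_window(atoms, (0.0, 0.0, 0.0), (0.0, 0.0, 2.0))
```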


Figure 37: Axis scan. Starting from an initial direction (black arrow), an axis scan is performed by rotating

a test vector (red dashed arrow) about the initial direction. A: Generation of the scan directions. B: For

each scan direction (red dashed arrow), atoms belonging to a cylindric region in the direction of the scan

are extracted.


Next, the contour of the projection is extracted by selecting only the inner atoms (Figure 39). This is

done by rotating around the scan direction vector and analyzing the contour by single dials (Figure 38).
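A simplified version of the dial scan is sketched below. It assumes that the atoms have already been projected into 2D coordinates centered on the scan axis, uses dials of fixed angular width starting at angle zero (rather than a starting dial derived from the closest atom), and ranks atoms by the distance from the axis to their vdw surface. The function name and these simplifications are assumptions made for illustration.

```python
import math

def interlining_atoms(atoms, dial_deg=30.0):
    """Keep, for each angular dial, the atom closest to the axis.

    atoms: list of (x, y, vdw_radius) in the projection plane, axis at (0, 0).
    """
    n_dials = int(round(360.0 / dial_deg))
    best = {}
    for x, y, radius in atoms:
        angle = math.degrees(math.atan2(y, x)) % 360.0
        dial = int(angle // dial_deg) % n_dials
        edge = math.hypot(x, y) - radius     # axis-to-vdw-surface distance
        if dial not in best or edge < best[dial][0]:
            best[dial] = (edge, (x, y, radius))
    return [best[d][1] for d in sorted(best)]

# Outer-lining atoms hide behind inner-lining atoms in each of the four dials.
atoms = [(4.0, 4.0, 1.0), (-4.0, 4.0, 1.0), (-4.0, -4.0, 1.0), (4.0, -4.0, 1.0),
         (1.5, 1.5, 1.0), (-1.5, 1.5, 1.0), (-1.5, -1.5, 1.0), (1.5, -1.5, 1.0)]
inner = interlining_atoms(atoms, dial_deg=90.0)
```

Only the four inner atoms are returned, so that a geometric center computed from the selection is not biased by the outer-lining atoms.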

Figure 38: Contour scan. For any given scan direction (black arrow), the pathway contour is scanned

around the axis by dial increments. The first figure shows the starting dial (blue) of the contour, calculated

from the closest atom to the axis (black point) displaying a certain vdw radius (red circle). The second figure

displays the atom extraction performed for the dial, and the purple, green and orange atomic points are

selected. The third figure indicates the selection of the dial atom that is closest to the axis (interlining atom).

The dial is then incremented to scan a new angular region (purple dial).


Figure 39: Interlining atoms extraction. Left: atomic selection (grey points) before performing the contour

scan. Right: interlining atomic selection computed by the dial calculations.

Then, the contours (of the scanned axes) are assessed against each other. The contour that has the biggest minimal distance to its geometric center is selected: the new axial direction forward has been found. The second part of the algorithm uses a similar approach, but scans the pathway by tilting a virtual sphere about the previously detected pathway axis point, and selects the virtual sphere whose center has the biggest minimal radius compared to the other virtual spheres scanned. For the start of the pathway, the first point is the geometric center of the winning contour projection along the winning scanned direction. The second part of the algorithm also uses a “fixed axis” principle. A fixed axis is defined by the start and end guess points (see previous paragraphs) and allows the algorithm to run across the channel from roughly the start guess to the end guess, without exploring sub-pathways of the main channel, for example by going backwards. This is done by setting up a two-step virtual sphere scan of 45 degrees maximum around the fixed axis, such that the scan does not go backwards (i.e., deviate by more than 90 degrees). A second trick is employed to prevent the pathway exploration from escaping the channel, and consists in defining an outer tube around the fixed axis. This allows us to compute the best curvature of the inner pathway axis without escaping from the outer tube, and is done by defining virtual atoms in the outer tube. Finally, the algorithm advances along the pathway by starting a new virtual sphere scan forward.
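One step of the virtual sphere scan can be sketched as below, under simplifying assumptions: the candidate sphere centers are supplied by the caller, a single angular limit of 45 degrees around the fixed axis is applied (the two-step scan and the outer tube of virtual atoms are not reproduced), and the clearance of a center is taken as its smallest distance to any atomic vdw surface. All names are hypothetical.

```python
import math

def sub(a, b):
    return tuple(p - q for p, q in zip(a, b))

def unit(v):
    mag = math.sqrt(sum(c * c for c in v))
    return tuple(c / mag for c in v)

def clearance(center, atoms):
    """Radius of the biggest sphere at `center` touching no vdw surface.

    atoms: list of (x, y, z, vdw_radius).
    """
    return min(math.dist(center, atom[:3]) - atom[3] for atom in atoms)

def next_axis_point(current, candidates, atoms, fixed_axis, max_deg=45.0):
    """Candidate with the biggest clearance within the cone around the fixed axis."""
    f = unit(fixed_axis)
    allowed = []
    for c in candidates:
        step = unit(sub(c, current))
        cos_a = max(-1.0, min(1.0, sum(p * q for p, q in zip(step, f))))
        if math.degrees(math.acos(cos_a)) <= max_deg:
            allowed.append(c)
    return max(allowed, key=lambda c: clearance(c, atoms))

# Ring of four atoms around the z axis; the backward candidate is rejected
# even though it is the most open one.
ring = [(2.0, 0.0, 1.0, 1.0), (-2.0, 0.0, 1.0, 1.0),
        (0.0, 2.0, 1.0, 1.0), (0.0, -2.0, 1.0, 1.0)]
candidates = [(0.0, 0.0, 1.0), (0.5, 0.0, 1.0), (0.0, 0.0, -1.0)]
chosen = next_axis_point((0.0, 0.0, 0.0), candidates, ring, (0.0, 0.0, 1.0))
```

If no candidate falls inside the cone, `max` raises a `ValueError`; a complete implementation would need to handle that case explicitly.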


2.3. Detailed description of the algorithm

2.3.1. Refine starting point

a) Scan axis

The first step, the scan of the axis, is done by assigning to three arrays the respective x, y and z coordinates of a point projected from the starting point 𝐴 along the tested direction. To do so, the initial direction vector (hereafter named 𝒏) is rotated laterally and vertically in 5-degree increments, covering a spherical scan of -35 to +35 degrees. First, the initial point 𝐴 is projected 1 Å along 𝒏 and a point called 𝑁𝑖𝑛𝑖 (N initial) is set. Each tested scan direction 𝒏𝒔𝒄𝒂𝒏 is represented by a new position of point 𝑁𝑖𝑛𝑖 in space, specified by two points: 𝑁𝑝𝑟𝑜𝑗 (N projected), the lateral shift of point 𝑁𝑖𝑛𝑖, and 𝑁𝑝𝑟𝑜𝑗𝑃 (N projected prime), the vertical shift of 𝑁𝑝𝑟𝑜𝑗, which hence represents the combination of the lateral and vertical shifts. To understand how this represents a new direction (for example a lateral shift of -30 degrees and a vertical shift of +5 degrees), consider that vector (𝐴, 𝑁𝑝𝑟𝑜𝑗𝑃) is the vector 𝒏 starting from 𝐴 but pointing in a new direction, given by the projection of point 𝐴 along that vector, i.e. point 𝑁𝑝𝑟𝑜𝑗𝑃. To rotate 𝒏 laterally and vertically, i.e. to rotate point 𝑁𝑖𝑛𝑖, two vectors are defined: 𝝎, a vector orthogonal to 𝒏, and 𝝍, a vector orthogonal to both 𝒏 and 𝝎.
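The set of tested directions can be enumerated as pairs of lateral and vertical shifts. The sketch below assumes that every lateral shift is combined with every vertical shift, which is one reading of the spherical scan described above; the function names are hypothetical.

```python
def scan_offsets(limit_deg=35, step_deg=5):
    """Angular shifts from -limit to +limit in regular increments."""
    return list(range(-limit_deg, limit_deg + 1, step_deg))

def scan_grid(limit_deg=35, step_deg=5):
    """All (lateral shift, vertical shift) pairs in degrees."""
    offsets = scan_offsets(limit_deg, step_deg)
    return [(lateral, vertical) for lateral in offsets for vertical in offsets]

offsets = scan_offsets()
grid = scan_grid()
```

With the values used here, each scan line contains 15 positions, i.e. 14 increments of 5 degrees from -35 to +35 degrees.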

Let the initial vector 𝒏 be specified by initial point 𝐴(𝐴𝑥, 𝐴𝑦, 𝐴𝑧), pointing towards point

𝐵(𝐵𝑥, 𝐵𝑦, 𝐵𝑧).

Let 𝑁𝑥,𝑁𝑦,𝑁𝑧 be the parameters of unit vector 𝒏:

𝑁𝑥 = 𝐵𝑥 − 𝐴𝑥, 𝑁𝑦 = 𝐵𝑦 − 𝐴𝑦, 𝑁𝑧 = 𝐵𝑧 − 𝐴𝑧

The vector magnitude is: $N_{norm} = (N_x^2 + N_y^2 + N_z^2)^{0.5}$

𝑁𝑥 = 𝑁𝑥/𝑁𝑛𝑜𝑟𝑚, 𝑁𝑦 = 𝑁𝑦/𝑁𝑛𝑜𝑟𝑚, 𝑁𝑧 = 𝑁𝑧/𝑁𝑛𝑜𝑟𝑚

Note that the following terminology is used. When the same variable occurs on the left and the right of

an equation, the left variable corresponds to the new value of the right variable and overwrites it.

A vector 𝝎 orthogonal to 𝒏 verifies

𝑑𝑜𝑡(𝒏,𝝎) = 0

Hence 𝝎(𝑊𝑥,𝑊𝑦,𝑊𝑧) = (0, −𝑁𝑧,𝑁𝑦) is orthogonal to 𝒏 and is set.

Unit vector parameters are given by:

$W_{norm} = (W_x^2 + W_y^2 + W_z^2)^{0.5}$

𝑊𝑥 = 𝑊𝑥/𝑊𝑛𝑜𝑟𝑚, 𝑊𝑦 = 𝑊𝑦/𝑊𝑛𝑜𝑟𝑚, 𝑊𝑧 = 𝑊𝑧/𝑊𝑛𝑜𝑟𝑚

A vector 𝝍(𝑌𝑥, 𝑌𝑦, 𝑌𝑧) that is orthogonal to both 𝒏 and 𝝎 is given by 𝝍 = 𝑐𝑟𝑜𝑠𝑠(𝒏,𝝎):

𝑌𝑥 = 𝑁𝑦 ∗ 𝑊𝑧 − 𝑁𝑧 ∗ 𝑊𝑦

𝑌𝑦 = 𝑁𝑧 ∗ 𝑊𝑥 − 𝑁𝑥 ∗ 𝑊𝑧

𝑌𝑧 = 𝑁𝑥 ∗ 𝑊𝑦 − 𝑁𝑦 ∗ 𝑊𝑥

Unit vector components are calculated as:

$Y_{norm} = (Y_x^2 + Y_y^2 + Y_z^2)^{0.5}$

𝑌𝑥 = 𝑌𝑥/𝑌𝑛𝑜𝑟𝑚, 𝑌𝑦 = 𝑌𝑦/𝑌𝑛𝑜𝑟𝑚, 𝑌𝑧 = 𝑌𝑧/𝑌𝑛𝑜𝑟𝑚
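The construction of the three unit vectors above can be sketched as follows. This is a minimal illustration, not the thesis code: the function names (`unit`, `dot`, `cross`, `scan_frame`) are hypothetical, and the explicit error raised when 𝒏 lies along the x axis (where (0, −𝑁𝑧, 𝑁𝑦) vanishes) is an assumption added here, since the text does not say how that case is handled.

```python
import math

def unit(v):
    """Return v scaled to unit length (raises on a zero-length vector)."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return tuple(c / norm for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def scan_frame(A, B):
    """Build unit vectors n (A towards B), omega and psi as defined in the text."""
    n = unit(tuple(b - a for a, b in zip(A, B)))
    # omega = (0, -Nz, Ny) is orthogonal to n; it is the zero vector when n
    # is parallel to the x axis, in which case unit() raises (assumption).
    omega = unit((0.0, -n[2], n[1]))
    psi = unit(cross(n, omega))
    return n, omega, psi
```

For example, `scan_frame((1.0, 2.0, 3.0), (4.0, 6.0, 5.0))` returns three mutually orthogonal unit vectors.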

The unshifted position of vector 𝒏 is represented by the 1 Å projection of point 𝐴 along 𝒏, and is point

𝑁𝑖𝑛𝑖(𝑁𝑖𝑛𝑖_𝑥, 𝑁𝑖𝑛𝑖_𝑦, 𝑁𝑖𝑛𝑖_𝑧):

𝑁𝑖𝑛𝑖_𝑥 = 𝐴𝑥 + 𝑁𝑥, 𝑁𝑖𝑛𝑖_𝑦 = 𝐴𝑦 + 𝑁𝑦, 𝑁𝑖𝑛𝑖_𝑧 = 𝐴𝑧 + 𝑁𝑧

The vertical scan is done by rotating a point 𝑁𝑝𝑟𝑜𝑗(𝑁𝑝𝑟𝑜𝑗_𝑥, 𝑁𝑝𝑟𝑜𝑗_𝑦, 𝑁𝑝𝑟𝑜𝑗_𝑧) 14 times around 𝝎 (in order to cover the -35 to +35 degrees range in 5 degrees increments); each rotated image is the point 𝑁𝑝𝑟𝑜𝑗𝑃. 𝑁𝑝𝑟𝑜𝑗 corresponds to the current lateral scan position, and initially 𝑁𝑝𝑟𝑜𝑗(𝑁𝑝𝑟𝑜𝑗_𝑥, 𝑁𝑝𝑟𝑜𝑗_𝑦, 𝑁𝑝𝑟𝑜𝑗_𝑧) = 𝑁𝑖𝑛𝑖(𝑁𝑖𝑛𝑖_𝑥, 𝑁𝑖𝑛𝑖_𝑦, 𝑁𝑖𝑛𝑖_𝑧).

The rotation of point 𝑁𝑝𝑟𝑜𝑗(𝑁𝑝𝑟𝑜𝑗_𝑥, 𝑁𝑝𝑟𝑜𝑗_𝑦, 𝑁𝑝𝑟𝑜𝑗_𝑧) is calculated with the matrix for a rotation of angle 𝑡𝑒𝑡𝑎1 around the axis of direction 𝝎(𝑊𝑥,𝑊𝑦,𝑊𝑧) going through point 𝐴(𝐴𝑥, 𝐴𝑦, 𝐴𝑧).

Let:

𝑠 = sin(𝑡𝑒𝑡𝑎1) , 𝑐 = cos(𝑡𝑒𝑡𝑎1) , 𝑡 = 1 − 𝑐

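Let us apply the transformation matrix to point 𝑁𝑝𝑟𝑜𝑗:

$\mathit{mat1x} = (A_x (Y_y^2 + Y_z^2) - Y_x (A_y Y_y + A_z Y_z - Y_x \mathit{Nproj\_x} - Y_y \mathit{Nproj\_y} - Y_z \mathit{Nproj\_z})) \cdot t$

$\mathit{mat2x} = \mathit{Nproj\_x} \cdot c + (-A_z Y_y + A_y Y_z - Y_z \mathit{Nproj\_y} + Y_y \mathit{Nproj\_z}) \cdot s$

$\mathit{NprojP\_x} = \mathit{mat1x} + \mathit{mat2x}$

$\mathit{mat1y} = (A_y (Y_x^2 + Y_z^2) - Y_y (A_x Y_x + A_z Y_z - Y_x \mathit{Nproj\_x} - Y_y \mathit{Nproj\_y} - Y_z \mathit{Nproj\_z})) \cdot t$

$\mathit{mat2y} = \mathit{Nproj\_y} \cdot c + (A_z Y_x - A_x Y_z + Y_z \mathit{Nproj\_x} - Y_x \mathit{Nproj\_z}) \cdot s$

$\mathit{NprojP\_y} = \mathit{mat1y} + \mathit{mat2y}$

$\mathit{mat1z} = (A_z (Y_x^2 + Y_y^2) - Y_z (A_x Y_x + A_y Y_y - Y_x \mathit{Nproj\_x} - Y_y \mathit{Nproj\_y} - Y_z \mathit{Nproj\_z})) \cdot t$

$\mathit{mat2z} = \mathit{Nproj\_z} \cdot c + (-A_y Y_x + A_x Y_y - Y_y \mathit{Nproj\_x} + Y_x \mathit{Nproj\_y}) \cdot s$

$\mathit{NprojP\_z} = \mathit{mat1z} + \mathit{mat2z}$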
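The rotation formulas above have the standard form for rotating a point about a line that passes through 𝐴 with a unit direction vector. A minimal sketch of that formula is given below; the function name `rotate_about_axis` and its argument order are hypothetical and not taken from the thesis code. The `axis` argument is assumed to be already normalized.

```python
import math

def rotate_about_axis(p, a, axis, angle_rad):
    """Rotate point p by angle_rad about the line through a with unit direction axis.

    The three returned components follow the mat1 + mat2 decomposition of the
    text: a term in t = 1 - cos, plus a term in cos and a term in sin.
    """
    x, y, z = p
    ax, ay, az = a
    u, v, w = axis
    s, c = math.sin(angle_rad), math.cos(angle_rad)
    t = 1.0 - c
    proj = u * x + v * y + w * z  # axis . p, shared by the three mat1 terms
    px = (ax * (v * v + w * w) - u * (ay * v + az * w - proj)) * t \
        + x * c + (-az * v + ay * w - w * y + v * z) * s
    py = (ay * (u * u + w * w) - v * (ax * u + az * w - proj)) * t \
        + y * c + (az * u - ax * w + w * x - u * z) * s
    pz = (az * (u * u + v * v) - w * (ax * u + ay * v - proj)) * t \
        + z * c + (-ay * u + ax * v - v * x + u * y) * s
    return (px, py, pz)
```

As a sanity check, rotating the point (2, 1, 0) by 90 degrees about the z direction through (1, 1, 0) gives (1, 2, 0), and the distance to the axis is preserved.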

After each vertical -35 to +35 degrees scan, i.e. after 14 vertical rotations, the scan is rotated laterally in order to update the lateral position 𝑁𝑝𝑟𝑜𝑗 from which the next vertical rotations are performed: the vertical scan is restarted in order to cover a new vertical region. The lateral shift is performed 14 times. Hence, taking into account that the first vertical scan does not require a lateral rotation, the total number of tilts is 14 * 14 = 196, which covers a forward spherical scan of -35 to +35 degrees.

Lateral rotation coordinates 𝑁𝑝𝑟𝑜𝑗(𝑁𝑝𝑟𝑜𝑗_𝑥, 𝑁𝑝𝑟𝑜𝑗_𝑦, 𝑁𝑝𝑟𝑜𝑗_𝑧) of point 𝑁𝑖𝑛𝑖 around 𝝍 are given by:

𝑠 = sin(𝑡𝑒𝑡𝑎2) , 𝑐 = cos(𝑡𝑒𝑡𝑎2) , 𝑡 = 1 − 𝑐, where 𝑡𝑒𝑡𝑎2 is the lateral angle increment.

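$\mathit{mat1x} = (A_x (W_y^2 + W_z^2) - W_x (A_y W_y + A_z W_z - W_x \mathit{Nini\_x} - W_y \mathit{Nini\_y} - W_z \mathit{Nini\_z})) \cdot t$

$\mathit{mat2x} = \mathit{Nini\_x} \cdot c + (-A_z W_y + A_y W_z - W_z \mathit{Nini\_y} + W_y \mathit{Nini\_z}) \cdot s$

$\mathit{Nproj\_x} = \mathit{mat1x} + \mathit{mat2x}$

$\mathit{mat1y} = (A_y (W_x^2 + W_z^2) - W_y (A_x W_x + A_z W_z - W_x \mathit{Nini\_x} - W_y \mathit{Nini\_y} - W_z \mathit{Nini\_z})) \cdot t$

$\mathit{mat2y} = \mathit{Nini\_y} \cdot c + (A_z W_x - A_x W_z + W_z \mathit{Nini\_x} - W_x \mathit{Nini\_z}) \cdot s$

$\mathit{Nproj\_y} = \mathit{mat1y} + \mathit{mat2y}$

$\mathit{mat1z} = (A_z (W_x^2 + W_y^2) - W_z (A_x W_x + A_y W_y - W_x \mathit{Nini\_x} - W_y \mathit{Nini\_y} - W_z \mathit{Nini\_z})) \cdot t$

$\mathit{mat2z} = \mathit{Nini\_z} \cdot c + (-A_y W_x + A_x W_y - W_y \mathit{Nini\_x} + W_x \mathit{Nini\_y}) \cdot s$

$\mathit{Nproj\_z} = \mathit{mat1z} + \mathit{mat2z}$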


Each scanned direction is assigned into three arrays 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑥, 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑦, 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑧,

containing the respective coordinate components of the 1 Å projection of point 𝐴 along a scan direction.

To simplify, let the ensemble of points 𝑁𝑝𝑟𝑜𝑗 and 𝑁𝑝𝑟𝑜𝑗𝑃 (depicting the lateral and vertical rotations)

be 𝑁𝑟𝑜𝑡 (N rotated). A rotation direction rank i is recorded, such that 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑥[𝑖], 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑦[𝑖],

𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑧[𝑖], contain the respective x, y, z coordinates of 1 Å projection of point 𝐴 along the 𝑖th

scan direction, and 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑥[𝑖] = 𝑁𝑟𝑜𝑡[𝑖]_𝑥, 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑦[𝑖] = 𝑁𝑟𝑜𝑡[𝑖]_𝑦, 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑧[𝑖] =

𝑁𝑟𝑜𝑡[𝑖]_𝑧.

b) Project points

For each scan direction, the next step is to project the coordinates of the centers of the pathway atoms lying in front of the scanned direction onto the plane 𝐷𝐼𝑅 defined by point 𝐴 and the scan direction. Plane 𝐷𝐼𝑅 is hereafter also called plane 1. The protein atoms that belong to a cylinder of radius 20 Å and length 3 Å in front of plane 1 are projected onto plane 1. Cylinder atoms are the points that lie between plane 1 and plane 2 (plane 𝐷𝐼𝑅2, which is 3 Å ahead of plane 1), and which are at a distance of at most 20 Å from the axis going from 𝐴 to the 3 Å projection of 𝐴 along the scanned direction.

Plane 𝐷𝐼𝑅 is defined by 𝐴 and 𝒏𝒔𝒄𝒂𝒏, where 𝑁𝑥,𝑁𝑦,𝑁𝑧 are the new parameters of vector 𝒏𝒔𝒄𝒂𝒏.

𝑁𝑥 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑥[𝑖] − 𝐴𝑥, 𝑁𝑦 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑦[𝑖] − 𝐴𝑦, 𝑁𝑧 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑧[𝑖] − 𝐴𝑧

𝒏𝒔𝒄𝒂𝒏 need not be normalized, because the ith 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠 point is already at 1 Å from 𝐴.

Plane 𝐷𝐼𝑅2 is defined by the 3 Å projection of plane DIR along 𝒏𝒔𝒄𝒂𝒏, where:

𝐴𝑃 is the 3 Å projection of 𝐴 along 𝒏𝒔𝒄𝒂𝒏

𝒏𝒑(𝑁𝑃𝑥,𝑁𝑃𝑦,𝑁𝑃𝑧) is a normal vector of the plane going through point 𝐴𝑃(𝐴𝑃𝑥, 𝐴𝑃𝑦, 𝐴𝑃𝑧), and

pointing towards 𝐴, verifying 𝒏𝒑 = −𝒏𝒔𝒄𝒂𝒏.

𝐴𝑃𝑥 = 𝐴𝑥 + 3 ∗ 𝑁𝑥

𝐴𝑃𝑦 = 𝐴𝑦 + 3 ∗ 𝑁𝑦

𝐴𝑃𝑧 = 𝐴𝑧 + 3 ∗ 𝑁𝑧

𝑁𝑃𝑥 = −𝑁𝑥

𝑁𝑃𝑦 = −𝑁𝑦

𝑁𝑃𝑧 = −𝑁𝑧


Let an atom that belongs to the protein structure be defined by the point

𝑎𝑡𝑜𝑚(𝑎𝑡𝑜𝑚_𝑥, 𝑎𝑡𝑜𝑚_𝑦, 𝑎𝑡𝑜𝑚_𝑧).

Let a vector 𝒖(𝑈𝑥, 𝑈𝑦, 𝑈𝑧) go from 𝐴 to 𝑎𝑡𝑜𝑚.

𝑈𝑥 = 𝑎𝑡𝑜𝑚_𝑥 − 𝐴𝑥, 𝑈𝑦 = 𝑎𝑡𝑜𝑚_𝑦 − 𝐴𝑦, 𝑈𝑧 = 𝑎𝑡𝑜𝑚_𝑧 − 𝐴𝑧

Let a vector 𝒗(𝑉𝑥, 𝑉𝑦, 𝑉𝑧) go from 𝐴𝑃 to 𝑎𝑡𝑜𝑚.

𝑉𝑥 = 𝑎𝑡𝑜𝑚_𝑥 − 𝐴𝑃𝑥, 𝑉𝑦 = 𝑎𝑡𝑜𝑚_𝑦 − 𝐴𝑃𝑦, 𝑉𝑧 = 𝑎𝑡𝑜𝑚_𝑧 − 𝐴𝑃𝑧

An atom that lies between the two planes verifies 𝑑𝑜𝑡_1 = 𝑑𝑜𝑡(𝒏𝒔𝒄𝒂𝒏, 𝒖) > 0 and 𝑑𝑜𝑡_2 = 𝑑𝑜𝑡(𝒏𝒑, 𝒗) > 0:

𝑑𝑜𝑡_1 = 𝑁𝑥 ∗ 𝑈𝑥 + 𝑁𝑦 ∗ 𝑈𝑦 + 𝑁𝑧 ∗ 𝑈𝑧

𝑑𝑜𝑡_2 = 𝑁𝑃𝑥 ∗ 𝑉𝑥 + 𝑁𝑃𝑦 ∗ 𝑉𝑦 + 𝑁𝑃𝑧 ∗ 𝑉𝑧

An atom whose distance to the axis going through 𝐴 and 𝐴𝑃 is, in addition, less than or equal to 20 Å belongs to the forward cylinder of length 3 Å and radius 20 Å, where the distance is calculated as:

𝑟𝑎𝑑𝑖𝑢𝑠 = |𝑐𝑟𝑜𝑠𝑠(𝒖, 𝒗)| / |𝑨 − 𝑨𝑷|

Let 𝒘 = 𝑨 − 𝑨𝑷:

𝑊𝑥 = 𝐴𝑥 − 𝐴𝑃𝑥, 𝑊𝑦 = 𝐴𝑦 − 𝐴𝑃𝑦, 𝑊𝑧 = 𝐴𝑧 − 𝐴𝑃𝑧

$|A - AP| = W_{norm} = (W_x^2 + W_y^2 + W_z^2)^{0.5}$

Let 𝑐𝑟𝑜𝑠𝑠(𝒖, 𝒗) = 𝑐𝑟𝑜𝑠𝑠_𝑥, 𝑐𝑟𝑜𝑠𝑠_𝑦, 𝑐𝑟𝑜𝑠𝑠_𝑧

𝑐𝑟𝑜𝑠𝑠_𝑥 = 𝑈𝑦 ∗ 𝑉𝑧 − 𝑈𝑧 ∗ 𝑉𝑦

𝑐𝑟𝑜𝑠𝑠_𝑦 = 𝑈𝑧 ∗ 𝑉𝑥 − 𝑈𝑥 ∗ 𝑉𝑧

𝑐𝑟𝑜𝑠𝑠_𝑧 = 𝑈𝑥 ∗ 𝑉𝑦 − 𝑈𝑦 ∗ 𝑉𝑥

$|cross(u, v)| = \mathit{cross\_norm} = (\mathit{cross\_x}^2 + \mathit{cross\_y}^2 + \mathit{cross\_z}^2)^{0.5}$

𝑟𝑎𝑑𝑖𝑢𝑠 = 𝑐𝑟𝑜𝑠𝑠_𝑛𝑜𝑟𝑚/𝑊𝑛𝑜𝑟𝑚

Each atom that verifies 𝑑𝑜𝑡_1 > 0, 𝑑𝑜𝑡_2 > 0, and 𝑟𝑎𝑑𝑖𝑢𝑠 ≤ 20 is projected onto plane 1.

Let 𝑡_𝑝𝑟𝑜𝑗 be the projection parameter, such that the projection of the atom onto plane 1, defined by 𝐴(𝐴𝑥, 𝐴𝑦, 𝐴𝑧) and 𝒏𝒔𝒄𝒂𝒏(𝑁𝑥,𝑁𝑦, 𝑁𝑧), is given by 𝑎𝑡𝑜𝑚 + 𝑡_𝑝𝑟𝑜𝑗 * 𝒏𝒔𝒄𝒂𝒏. 𝑡_𝑝𝑟𝑜𝑗 verifies:

$\mathit{t\_proj} = (N_x A_x - N_x \mathit{atom\_x} + N_y A_y - N_y \mathit{atom\_y} + N_z A_z - N_z \mathit{atom\_z}) / (N_x^2 + N_y^2 + N_z^2)$

Let 𝑎𝑡𝑜𝑚_𝑝𝑟𝑜𝑗(𝑎𝑡𝑜𝑚_𝑝𝑟𝑜𝑗_𝑥, 𝑎𝑡𝑜𝑚_𝑝𝑟𝑜𝑗_𝑦, 𝑎𝑡𝑜𝑚_𝑝𝑟𝑜𝑗_𝑧) be 𝑎𝑡𝑜𝑚(𝑎𝑡𝑜𝑚_𝑥, 𝑎𝑡𝑜𝑚_𝑦, 𝑎𝑡𝑜𝑚_𝑧)

projected onto plane 1:

𝑎𝑡𝑜𝑚_𝑝𝑟𝑜𝑗_𝑥 = 𝑎𝑡𝑜𝑚_𝑥 + 𝑡_𝑝𝑟𝑜𝑗 ∗ 𝑁𝑥

𝑎𝑡𝑜𝑚_𝑝𝑟𝑜𝑗_𝑦 = 𝑎𝑡𝑜𝑚_𝑦 + 𝑡_𝑝𝑟𝑜𝑗 ∗ 𝑁𝑦

𝑎𝑡𝑜𝑚_𝑝𝑟𝑜𝑗_𝑧 = 𝑎𝑡𝑜𝑚_𝑧 + 𝑡_𝑝𝑟𝑜𝑗 ∗ 𝑁𝑧

The atom belonging to the cylinder selection and projected onto the plane defined by the scanned

direction and 𝐴 is stored by respective x, y, z components into arrays 𝑝𝑟𝑜𝑗_𝑥, 𝑝𝑟𝑜𝑗_𝑦, 𝑝𝑟𝑜𝑗_𝑧.

Its vdw radius is also stored in array 𝑝𝑟𝑜𝑗_𝑣𝑑𝑤, as follows. If the atom is a hydrogen, carbon, nitrogen, oxygen, phosphorus, sulfur, magnesium or zinc atom, then its vdw radius is set to 1.20, 1.70, 1.55, 1.52, 1.80, 1.80, 1.73 or 1.39 Å respectively. It follows that each scanned direction is associated with an ensemble of projected points and their corresponding vdw radii.
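The cylinder selection and plane projection of procedure b) can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the function name `project_cylinder_atoms`, the `(element, position)` input layout and the upper-case element keys of `VDW_RADII` are hypothetical; the radii, the 3 Å length and the 20 Å radius are those given in the text.

```python
import math

# vdw radii (in Angstrom) listed in the text; the key spelling is an assumption.
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52,
             "P": 1.80, "S": 1.80, "MG": 1.73, "ZN": 1.39}

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def project_cylinder_atoms(atoms, A, n_scan, length=3.0, max_radius=20.0):
    """Select atoms in the forward cylinder and project them onto plane 1.

    atoms: iterable of (element, (x, y, z)); n_scan: 1 Angstrom scan vector.
    Returns a list of (projected_position, vdw_radius).
    """
    AP = tuple(a + length * n for a, n in zip(A, n_scan))
    w_norm = norm(sub(A, AP))
    selected = []
    for element, pos in atoms:
        u, v = sub(pos, A), sub(pos, AP)
        dot_1 = dot(n_scan, u)        # > 0: in front of plane 1
        dot_2 = -dot(n_scan, v)       # np = -n_scan; > 0: behind plane 2
        if dot_1 <= 0.0 or dot_2 <= 0.0:
            continue
        radius = norm(cross(u, v)) / w_norm   # distance to the axis (A, AP)
        if radius > max_radius:
            continue
        t_proj = -dot_1 / dot(n_scan, n_scan)
        proj = tuple(p + t_proj * n for p, n in zip(pos, n_scan))
        selected.append((proj, VDW_RADII[element]))
    return selected
```

With 𝐴 at the origin and a scan direction along z, a carbon at (1, 2, 1.5) is kept and projected to (1, 2, 0), while atoms behind plane 1, beyond plane 2 or further than 20 Å from the axis are discarded.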

c) Scan contour

Once a projection map is assigned to each scanned axis, one proceeds to the scan of the projection

contour. The goal is to assign to each projection map a unique contour, such that only the relevant

interlining atoms are selected.

First point 𝐵𝑃 (B prime) is defined as a second point (in addition to 𝐴), to delineate a direction axis.

𝐵𝑃𝑥 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑥[𝑖], 𝐵𝑃𝑦 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑦[𝑖], 𝐵𝑃𝑧 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑧[𝑖]

Then, the point closest to 𝐴 is selected, so that the circular scan starts from a first point belonging to the contour. To perform this selection, the distance between 𝐴 and each atom belonging to the projection is stored into an array called 𝑑𝑖𝑠𝑡. Then, the index of the first occurrence of the minimal value stored in the 𝑑𝑖𝑠𝑡 array is retrieved.

It is worth underlining that 𝐴 represents the point from which a direction axis is pointing. Hence, for

each scanned axis, the distance between the projected points and 𝐴, is equivalent to the distance between

the non-projected points and the axis.

For each projected atom, appearing at iteration 𝑐𝑜𝑢𝑛𝑡, the lateral distance is given by:

𝑑𝑖𝑠𝑡_𝑥 = 𝑝𝑟𝑜𝑗_𝑥[𝑐𝑜𝑢𝑛𝑡] − 𝐴𝑥

𝑑𝑖𝑠𝑡_𝑦 = 𝑝𝑟𝑜𝑗_𝑦[𝑐𝑜𝑢𝑛𝑡] − 𝐴𝑦

𝑑𝑖𝑠𝑡_𝑧 = 𝑝𝑟𝑜𝑗_𝑧[𝑐𝑜𝑢𝑛𝑡] − 𝐴𝑧

$\mathit{dist} = (\mathit{dist\_x}^2 + \mathit{dist\_y}^2 + \mathit{dist\_z}^2)^{0.5} - \mathit{proj\_vdw}[\mathit{count}]$

Where −𝑝𝑟𝑜𝑗_𝑣𝑑𝑤[𝑐𝑜𝑢𝑛𝑡] accounts for the deduction of the vdw radius.

Let 𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥 be the index of the first iteration of the minimal value stored in 𝑑𝑖𝑠𝑡 array and let 𝑀𝐼𝑁

be the point corresponding to the first atom belonging to the contour:

𝑀𝐼𝑁𝑥 = 𝑝𝑟𝑜𝑗_𝑥[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]

𝑀𝐼𝑁𝑦 = 𝑝𝑟𝑜𝑗_𝑦[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]

𝑀𝐼𝑁𝑧 = 𝑝𝑟𝑜𝑗_𝑧[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]

Also, because the projected atom coordinates were stored into the 𝑝𝑟𝑜𝑗_𝑥, 𝑝𝑟𝑜𝑗_𝑦, and 𝑝𝑟𝑜𝑗_𝑧 arrays in the same order as in the 𝑝𝑟𝑜𝑗_𝑣𝑑𝑤 array, the corresponding vdw radius of atom 𝑀𝐼𝑁 is:

𝑣𝑑𝑤_𝑎𝑡𝑚 = 𝑝𝑟𝑜𝑗_𝑣𝑑𝑤[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]

The four values of the first contour atom (x, y, z coordinates, vdw radius) are then stored into 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑥,

𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑦, 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑧, and 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤 arrays respectively.

The latter arrays store information about the contour, correspond to a refined state of the projection map, and will later be used, notably to assess the right axial direction.

Then the actual 360 deg. contour scan is performed, starting from the atom selected above, and using

the projection map and the tested axis direction. This is done in six steps that are outlined below.

The calculation that will be performed starts from the previous atom belonging to the contour, and

searches for atoms belonging to a 2 Å lateral window from that atom.

In order to cover a full 360 deg. circular scan, each increment of the scan is done successively in the

same circular direction: positive angle.


Step 1

The angle required to cover a lateral range of 2 Å is calculated.

Let radius be the distance to 𝐴, and 𝑀𝐼𝑁 be the previous contour atom from which the next dial is to be

scanned.

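$\mathit{radius} = ((\mathit{MINx} - A_x)^2 + (\mathit{MINy} - A_y)^2 + (\mathit{MINz} - A_z)^2)^{0.5}$

Let 𝑡𝑒𝑡𝑎 be the angle required to cover a lateral range of 2 Å and 𝑙𝑎𝑡_𝑤𝑖𝑛𝑑𝑜𝑤 = 2 Å. Applying the law of cosines to the isosceles triangle with two sides equal to 𝑟𝑎𝑑𝑖𝑢𝑠 and a base equal to 𝑙𝑎𝑡_𝑤𝑖𝑛𝑑𝑜𝑤 gives:

$\mathit{cos} = (2 \cdot \mathit{radius}^2 - \mathit{lat\_window}^2) / (2 \cdot \mathit{radius}^2)$

$\mathit{teta} = atan2((1 - \mathit{cos}^2)^{0.5}, \mathit{cos})$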
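The dial angle of Step 1 can be checked numerically with the short sketch below. The function name `dial_angle` is hypothetical. The sine passed to `atan2` is taken as the square root of 1 − cos², and the clamping of the cosine to −1 (used when the lateral window exceeds twice the radius, so that no such chord exists) is an assumption added here; the text does not describe that case.

```python
import math

def dial_angle(radius, lat_window=2.0):
    """Angle (radians) subtending a chord of length lat_window at distance radius."""
    if radius <= 0.0:
        raise ValueError("radius must be positive")
    cos_t = (2.0 * radius ** 2 - lat_window ** 2) / (2.0 * radius ** 2)
    # Assumption: if the window is wider than the diameter, saturate at 180 deg.
    cos_t = max(-1.0, cos_t)
    sin_t = math.sqrt(1.0 - cos_t ** 2)
    return math.atan2(sin_t, cos_t)
```

For a contour atom at 2 Å from 𝐴 and a 2 Å window, the triangle is equilateral and the function returns 60 degrees (π/3).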

Step 2

Second, the borders of the dial are calculated, in order to extract atoms belonging to the dial (i.e.

belonging to that particular circular region).

To do so, two planes are to be defined.

One dial border is represented by the plane going through atom 𝑀𝐼𝑁 (dial start) and 𝐴, with normal

vector orthogonal to the axis.

Plane 1:

Let 𝑾(𝑊𝑥,𝑊𝑦,𝑊𝑧) be the axis vector,

𝑊𝑥 = 𝐵𝑃𝑥 − 𝐴𝑥, 𝑊𝑦 = 𝐵𝑃𝑦 − 𝐴𝑦, 𝑊𝑧 = 𝐵𝑃𝑧 − 𝐴𝑧

Let 𝑼(𝑈𝑥, 𝑈𝑦, 𝑈𝑧) be the vector between point 𝐴 and point 𝑀𝐼𝑁,

𝑈𝑥 = 𝑀𝐼𝑁𝑥 − 𝐴𝑥, 𝑈𝑦 = 𝑀𝐼𝑁𝑦 − 𝐴𝑦, 𝑈𝑧 = 𝑀𝐼𝑁𝑧 − 𝐴𝑧

Plane 1 is defined by normal vector 𝒏𝟏(𝑛1_𝑥, 𝑛1_𝑦, 𝑛1_𝑧) going through point 𝐴, where,

𝒏𝟏 = 𝑐𝑟𝑜𝑠𝑠(𝑾,𝑼)

𝑛1_𝑥 = 𝑊𝑦 ∗ 𝑈𝑧 − 𝑊𝑧 ∗ 𝑈𝑦

𝑛1_𝑦 = 𝑊𝑧 ∗ 𝑈𝑥 − 𝑊𝑥 ∗ 𝑈𝑧

𝑛1_𝑧 = 𝑊𝑥 ∗ 𝑈𝑦 − 𝑊𝑦 ∗ 𝑈𝑥


The upper border of the dial can be expressed as the plane going through atom 𝑀𝐼𝑁 prime (𝑎𝑡𝑜𝑚𝑃), where 𝑎𝑡𝑜𝑚𝑃 is the image of 𝑀𝐼𝑁 by the rotation of angle 𝑡𝑒𝑡𝑎 around axis (𝐴, 𝐵𝑃) (the scan axis), with a normal vector orthogonal to the axis but pointing in the opposite angular direction to that of plane 1.

To find the upper border of the dial, the positive direction rotation of point 𝑀𝐼𝑁 around axis 𝑾 is

calculated. Let 𝑎𝑡𝑜𝑚𝑃(𝑎𝑡𝑜𝑚𝑃𝑥, 𝑎𝑡𝑜𝑚𝑃𝑦, 𝑎𝑡𝑜𝑚𝑃𝑧) be the rotation image of 𝑀𝐼𝑁 around 𝑾, going

through point 𝐴 with an angle of 𝑡𝑒𝑡𝑎 (angle calculated above corresponding to a lateral window of 2

Å).

The rotation image is calculated as follows:

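$W_{norm} = (W_x^2 + W_y^2 + W_z^2)^{0.5}$

$W_x = W_x / W_{norm}, \quad W_y = W_y / W_{norm}, \quad W_z = W_z / W_{norm}$

$s = \sin(\mathit{teta}), \quad c = \cos(\mathit{teta}), \quad t = 1 - c$

$\mathit{mat1x} = (A_x (W_y^2 + W_z^2) - W_x (A_y W_y + A_z W_z - W_x \mathit{MINx} - W_y \mathit{MINy} - W_z \mathit{MINz})) \cdot t$

$\mathit{mat2x} = \mathit{MINx} \cdot c + (-A_z W_y + A_y W_z - W_z \mathit{MINy} + W_y \mathit{MINz}) \cdot s$

$\mathit{atomPx} = \mathit{mat1x} + \mathit{mat2x}$

$\mathit{mat1y} = (A_y (W_x^2 + W_z^2) - W_y (A_x W_x + A_z W_z - W_x \mathit{MINx} - W_y \mathit{MINy} - W_z \mathit{MINz})) \cdot t$

$\mathit{mat2y} = \mathit{MINy} \cdot c + (A_z W_x - A_x W_z + W_z \mathit{MINx} - W_x \mathit{MINz}) \cdot s$

$\mathit{atomPy} = \mathit{mat1y} + \mathit{mat2y}$

$\mathit{mat1z} = (A_z (W_x^2 + W_y^2) - W_z (A_x W_x + A_y W_y - W_x \mathit{MINx} - W_y \mathit{MINy} - W_z \mathit{MINz})) \cdot t$

$\mathit{mat2z} = \mathit{MINz} \cdot c + (-A_y W_x + A_x W_y - W_y \mathit{MINx} + W_x \mathit{MINy}) \cdot s$

$\mathit{atomPz} = \mathit{mat1z} + \mathit{mat2z}$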

Let 𝑿(𝑋𝑥, 𝑋𝑦, 𝑋𝑧) be the vector between point 𝐴 and point 𝑎𝑡𝑜𝑚𝑃,

𝑋𝑥 = 𝑎𝑡𝑜𝑚𝑃𝑥 − 𝐴𝑥, 𝑋𝑦 = 𝑎𝑡𝑜𝑚𝑃𝑦 − 𝐴𝑦, 𝑋𝑧 = 𝑎𝑡𝑜𝑚𝑃𝑧 − 𝐴𝑧

Plane 2 is defined by normal vector 𝒏𝟐(𝑛2_𝑥, 𝑛2_𝑦, 𝑛2_𝑧) going through point 𝐴, where,


𝒏𝟐 = −𝑐𝑟𝑜𝑠𝑠(𝑾,𝑿)

𝑛2_𝑥 = 𝑊𝑧 ∗ 𝑋𝑦 − 𝑊𝑦 ∗ 𝑋𝑧

𝑛2_𝑦 = 𝑊𝑥 ∗ 𝑋𝑧 − 𝑊𝑧 ∗ 𝑋𝑥

𝑛2_𝑧 = 𝑊𝑦 ∗ 𝑋𝑥 − 𝑊𝑥 ∗ 𝑋𝑦

Step 3

Then the atoms that belong to the dial are extracted. This is done by checking if they lie in-between

plane 1 and plane 2.

For each 𝑐𝑜𝑢𝑛𝑡 rank of the projection map array, each atom is represented by:

𝑎𝑡𝑜𝑚_𝑥 = 𝑝𝑟𝑜𝑗_𝑥[𝑐𝑜𝑢𝑛𝑡], 𝑎𝑡𝑜𝑚_𝑦 = 𝑝𝑟𝑜𝑗_𝑦[𝑐𝑜𝑢𝑛𝑡], 𝑎𝑡𝑜𝑚_𝑧 = 𝑝𝑟𝑜𝑗_𝑧[𝑐𝑜𝑢𝑛𝑡]

𝑎𝑡𝑜𝑚_𝑣𝑑𝑤 = 𝑝𝑟𝑜𝑗_𝑣𝑑𝑤[𝑐𝑜𝑢𝑛𝑡]

An atom of the projection map belongs to the dial if it lies in-between plane 1 and plane 2, hence if

$\mathit{dot\_1} = dot(\boldsymbol{n_1}, \overrightarrow{(A, \mathit{atom})}) > 0$

And $\mathit{dot\_2} = dot(\boldsymbol{n_2}, \overrightarrow{(A, \mathit{atom})}) > 0$

Calculation:

𝑑𝑜𝑡_1 = 𝑛1_𝑥 ∗ (𝑎𝑡𝑜𝑚_𝑥 − 𝐴𝑥) + 𝑛1_𝑦 ∗ (𝑎𝑡𝑜𝑚_𝑦 − 𝐴𝑦) + 𝑛1_𝑧 ∗ (𝑎𝑡𝑜𝑚_𝑧 − 𝐴𝑧)

𝑑𝑜𝑡_2 = 𝑛2_𝑥 ∗ (𝑎𝑡𝑜𝑚_𝑥 − 𝐴𝑥) + 𝑛2_𝑦 ∗ (𝑎𝑡𝑜𝑚_𝑦 − 𝐴𝑦) + 𝑛2_𝑧 ∗ (𝑎𝑡𝑜𝑚_𝑧 − 𝐴𝑧)

Because of limited floating-point accuracy, the selection criterion is actually set to 𝑑𝑜𝑡_1 > 0.1 and 𝑑𝑜𝑡_2 > 0.1. Otherwise, the algorithm can for example detect 𝑀𝐼𝑁 as belonging to the inside of the dial (e.g. 𝑑𝑜𝑡_1 = 0.000000000001647), when it is just outside (𝑑𝑜𝑡_1 = 0), which has the effect of selecting a second time an atom that was already selected.

Therefore, each projection atom verifying 𝑑𝑜𝑡_1 > 0.1 and 𝑑𝑜𝑡_2 > 0.1 is selected as belonging to the dial. Its coordinates (𝑎𝑡𝑜𝑚_𝑥, 𝑎𝑡𝑜𝑚_𝑦, 𝑎𝑡𝑜𝑚_𝑧), radius (distance to 𝐴) and vdw (𝑎𝑡𝑜𝑚_𝑣𝑑𝑤) values are stored into the following arrays: 𝑑𝑖𝑎𝑙_𝑥, 𝑑𝑖𝑎𝑙_𝑦, 𝑑𝑖𝑎𝑙_𝑧, 𝑑𝑖𝑎𝑙_𝑟𝑎𝑑𝑖𝑢𝑠, 𝑑𝑖𝑎𝑙_𝑣𝑑𝑤.

𝑑𝑖𝑎𝑙_𝑟𝑎𝑑𝑖𝑢𝑠 value is calculated as:

$\mathit{radius} = ((\mathit{atom\_x} - A_x)^2 + (\mathit{atom\_y} - A_y)^2 + (\mathit{atom\_z} - A_z)^2)^{0.5} - \mathit{atom\_vdw}$
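Steps 2 and 3 (dial borders and dial membership) can be sketched as follows. The names `dial_normals` and `atoms_in_dial` and the `(position, vdw)` input layout are hypothetical; the 0.1 tolerance is the one quoted in the text. Note that, as in the text, the normals are not normalized, so the tolerance scales with their length.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dial_normals(A, BP, MIN, atomP):
    """Normals of plane 1 (through MIN) and plane 2 (through atomP)."""
    W, U, X = sub(BP, A), sub(MIN, A), sub(atomP, A)
    n1 = cross(W, U)
    n2 = tuple(-c for c in cross(W, X))
    return n1, n2

def atoms_in_dial(proj_atoms, A, n1, n2, tol=0.1):
    """Return (position, radius, vdw) for projected atoms between both planes."""
    dial = []
    for pos, vdw in proj_atoms:
        d = sub(pos, A)
        if dot(n1, d) > tol and dot(n2, d) > tol:
            radius = math.sqrt(dot(d, d)) - vdw   # vdw-corrected distance to A
            dial.append((pos, radius, vdw))
    return dial
```

With 𝐴 at the origin, the axis along z, 𝑀𝐼𝑁 at 0 degrees and 𝑎𝑡𝑜𝑚𝑃 at 60 degrees, an atom at 30 degrees is kept, whereas 𝑀𝐼𝑁 itself and an atom at 90 degrees are rejected.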


Step 4

The dial (i.e., the projection map atoms belonging to that particular angular area) is processed, in order

to extract the atom that is the closest to 𝐴, and therefore corresponds to the relevant interlining atom.

Let 𝑀𝐼𝑁𝑛𝑒𝑤(𝑀𝐼𝑁𝑛𝑒𝑤_𝑥,𝑀𝐼𝑁𝑛𝑒𝑤_𝑦,𝑀𝐼𝑁𝑛𝑒𝑤_𝑧) be that atom, which corresponds to the next atom

belonging to the contour (the first atom being 𝑀𝐼𝑁).

Let 𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥 be the index of the first iteration of the minimal value stored in 𝑑𝑖𝑎𝑙_𝑟𝑎𝑑𝑖𝑢𝑠 array,

𝑀𝐼𝑁𝑛𝑒𝑤 is:

𝑀𝐼𝑁𝑛𝑒𝑤_𝑥 = 𝑑𝑖𝑎𝑙_𝑥[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]

𝑀𝐼𝑁𝑛𝑒𝑤_𝑦 = 𝑑𝑖𝑎𝑙_𝑦[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]

𝑀𝐼𝑁𝑛𝑒𝑤_𝑧 = 𝑑𝑖𝑎𝑙_𝑧[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]

𝑀𝐼𝑁𝑛𝑒𝑤_𝑣𝑑𝑤 = 𝑑𝑖𝑎𝑙_𝑣𝑑𝑤[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]

These four values are stored in the following ith contour arrays: 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑥, 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑦, 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑧,

𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤, because the new minimum value (that is to say minimal radius) of the dial belongs to

the pathway contour for the current axis direction being tested. In other words, the atom that has been

extracted through step 1 to 4 corresponds to an inner contour atom, and is consequently stored in the

contour array.

Step 5

So as to keep track of the angular region processed (amount of dial region covered), the angle between the previously calculated contour atom and the new contour atom is calculated. When the sum of the dial angles processed reaches 360 degrees, the contour has been scanned in its entirety.

Let 𝑸(𝑄𝑥, 𝑄𝑦, 𝑄𝑧) be the vector going from 𝐴 to 𝑀𝐼𝑁, and 𝑸𝒏𝒆𝒘(𝑄𝑛𝑒𝑤_𝑥, 𝑄𝑛𝑒𝑤_𝑦, 𝑄𝑛𝑒𝑤_𝑧) be the

vector going from 𝐴 to 𝑀𝐼𝑁_𝑛𝑒𝑤.

The dial region covered is calculated as the angle between 𝑸 and 𝑸𝒏𝒆𝒘 and is represented by 𝑟𝑜𝑡_𝑠ℎ𝑖𝑓𝑡.

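$Q_x = \mathit{MINx} - A_x, \quad Q_y = \mathit{MINy} - A_y, \quad Q_z = \mathit{MINz} - A_z$

$Q_{norm} = (Q_x^2 + Q_y^2 + Q_z^2)^{0.5}$

$Q_x = Q_x / Q_{norm}, \quad Q_y = Q_y / Q_{norm}, \quad Q_z = Q_z / Q_{norm}$

$\mathit{Qnew\_x} = \mathit{MINnew\_x} - A_x, \quad \mathit{Qnew\_y} = \mathit{MINnew\_y} - A_y, \quad \mathit{Qnew\_z} = \mathit{MINnew\_z} - A_z$

$\mathit{Qnew\_norm} = (\mathit{Qnew\_x}^2 + \mathit{Qnew\_y}^2 + \mathit{Qnew\_z}^2)^{0.5}$

$\mathit{Qnew\_x} = \mathit{Qnew\_x} / \mathit{Qnew\_norm}, \quad \mathit{Qnew\_y} = \mathit{Qnew\_y} / \mathit{Qnew\_norm}, \quad \mathit{Qnew\_z} = \mathit{Qnew\_z} / \mathit{Qnew\_norm}$

$\mathit{cos} = dot(Q, Qnew) = Q_x \cdot \mathit{Qnew\_x} + Q_y \cdot \mathit{Qnew\_y} + Q_z \cdot \mathit{Qnew\_z}$

$\mathit{rot\_shift} = atan2((1 - \mathit{cos}^2)^{0.5}, \mathit{cos})$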
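The angle bookkeeping of Step 5 can be illustrated with the sketch below. The names `angle_between` and `total_dial_angle` are hypothetical. The angle computed this way is unsigned (between 0 and 180 degrees), so the sketch assumes, as the text does, that successive contour atoms are always visited in the same angular direction.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return tuple(c / n for c in v)

def angle_between(q, q_new):
    """Unsigned angle (radians) between vectors q and q_new."""
    cos_t = dot(unit(q), unit(q_new))
    cos_t = max(-1.0, min(1.0, cos_t))   # guard against rounding outside [-1, 1]
    return math.atan2(math.sqrt(1.0 - cos_t * cos_t), cos_t)

def total_dial_angle(A, contour_points):
    """Sum rot_shift over successive contour atoms (MIN, then MINnew, ...)."""
    rot_shift_inc = 0.0
    for MIN, MINnew in zip(contour_points, contour_points[1:]):
        rot_shift_inc += angle_between(sub(MIN, A), sub(MINnew, A))
    return rot_shift_inc
```

For four contour atoms placed at 0, 90, 180 and 270 degrees around 𝐴, followed by a return to the first atom, the accumulated angle is 2π, which is the stopping condition of the contour scan.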

Step 6

Finally, the dial is incremented one step forward and the calculations above are repeated until the total

dial region covered has reached 360 degrees.

To increment the next starting dial point, 𝑀𝐼𝑁 of next stage is 𝑀𝐼𝑁𝑛𝑒𝑤 of previous dial, hence:

𝑀𝐼𝑁𝑥 = 𝑀𝐼𝑁𝑛𝑒𝑤_𝑥

𝑀𝐼𝑁𝑦 = 𝑀𝐼𝑁𝑛𝑒𝑤_𝑦

𝑀𝐼𝑁𝑧 = 𝑀𝐼𝑁𝑛𝑒𝑤_𝑧

To keep track of the total dial angle processed, 𝑟𝑜𝑡_𝑠ℎ𝑖𝑓𝑡 dial values are summed into 𝑟𝑜𝑡_𝑠ℎ𝑖𝑓𝑡_𝑖𝑛𝑐

at each stage, with:

𝑟𝑜𝑡_𝑠ℎ𝑖𝑓𝑡_𝑖𝑛𝑐 = 𝑟𝑜𝑡_𝑠ℎ𝑖𝑓𝑡_𝑖𝑛𝑐 + 𝑟𝑜𝑡_𝑠ℎ𝑖𝑓𝑡

If no atom has been detected inside the dial, the procedures above are repeated with a larger lateral window.

This is automated as follows: if the size of the 𝑑𝑖𝑎𝑙_𝑥 array is zero (which means that 𝑑𝑖𝑎𝑙_𝑦, 𝑑𝑖𝑎𝑙_𝑧, and 𝑑𝑖𝑎𝑙_𝑣𝑑𝑤 are also empty arrays), then the next dial is incremented with:

𝑙𝑎𝑡_𝑤𝑖𝑛𝑑𝑜𝑤 = 𝑙𝑎𝑡_𝑤𝑖𝑛𝑑𝑜𝑤 + 2

d) Analyze contour

Executing procedures a) to c) yields an ith contour map (whose information is stored into the contour arrays) for each ith scanned direction. In order to compare the scanned directions with each other, procedure d) further characterizes the contours (one contour for each direction) by assigning to each contour the smallest distance between its van der Waals geometric center and the surrounding atoms. This serves two purposes at once, since the van der Waals geometric center of the winning contour is also used to obtain the pathway center, the pathway minimal radius at the corresponding longitudinal region along the pathway, and the cross section area.

First, let us calculate the van der Waals geometric center.

For each 𝑐𝑜𝑢𝑛𝑡 atom rank of the contour array, let:

𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑥 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑥[𝑐𝑜𝑢𝑛𝑡]

𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑦 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑦[𝑐𝑜𝑢𝑛𝑡]

𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑧 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑧[𝑐𝑜𝑢𝑛𝑡]

𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤[𝑐𝑜𝑢𝑛𝑡]

To take the atom vdw radius into the geometric center calculation, the following relation is applied, and

is given as an illustration in the x dimension only:

$\mathit{contour\_vdw\_weighted\_x} = \mathit{contour\_x} + \overrightarrow{\mathit{vdw\_unit\_vector}} \cdot \mathit{contour\_vdw}$

Where $\overrightarrow{\mathit{vdw\_unit\_vector}}$ is the unit vector of $\overrightarrow{(A_x - \mathit{contour\_x})}$, i.e. the unit vector pointing from the contour atom towards 𝐴.

It is important to underline that while 𝐴 represents the initial center of the scan axis, it does not represent

the geometric center.

Calculation:

Let 𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑥, 𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑦, 𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑧 be the parameters of $\overrightarrow{\mathit{vdw\_unit\_vector}}$ before normalization:

𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑥 = 𝐴𝑥 − 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑥, 𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑦 = 𝐴𝑦 − 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑦, 𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑧 = 𝐴𝑧 −

𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑧

$\mathit{vdw\_norm} = (\mathit{vdw\_vector\_x}^2 + \mathit{vdw\_vector\_y}^2 + \mathit{vdw\_vector\_z}^2)^{0.5}$

With $\overrightarrow{\mathit{contour\_vdw\_vector}} = \overrightarrow{\mathit{vdw\_unit\_vector}} \cdot \mathit{contour\_vdw}$:

𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑥 = (𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑥/𝑣𝑑𝑤_𝑛𝑜𝑟𝑚) ∗ 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤

𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑦 = (𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑦/𝑣𝑑𝑤_𝑛𝑜𝑟𝑚) ∗ 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤

𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑧 = (𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑧/𝑣𝑑𝑤_𝑛𝑜𝑟𝑚) ∗ 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤

Then summing the different contour atom contributions into 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟 components renders:

𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑥 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑥 + 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑥 + 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑥

𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑦 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑦 + 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑦 + 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑦

𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑧 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑧 + 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑧 + 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑣𝑑𝑤_𝑣𝑒𝑐𝑡𝑜𝑟_𝑧


Finally, the ith contour interlining atoms vdw geometric center is given by:

𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑥 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑥/𝑛𝑏_𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑎𝑡𝑜𝑚𝑠

𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑦 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑦/𝑛𝑏_𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑎𝑡𝑜𝑚𝑠

𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑧 = 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑧/𝑛𝑏_𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑎𝑡𝑜𝑚𝑠

The latter values are then stored in three arrays, 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑥, 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑦 and 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑧, that will be used in procedure e). The vdw geometric center is not a center of mass (i.e. “COM”), but the 𝐶𝑂𝑀 terminology is used for convenience.
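The vdw-weighted geometric center described above can be sketched as follows. The function name `vdw_geometric_center` and the `(position, vdw)` input layout are hypothetical. Each contour atom center is shifted by its vdw radius towards 𝐴 before averaging; raising an error for an atom located exactly on 𝐴 is an assumption added here.

```python
import math

def vdw_geometric_center(contour, A):
    """Mean of the contour atom positions, each shifted by its vdw radius towards A.

    contour: list of ((x, y, z), vdw_radius); A: scan axis start point.
    """
    if not contour:
        raise ValueError("empty contour")
    total = [0.0, 0.0, 0.0]
    for pos, vdw in contour:
        vec = [a - p for a, p in zip(A, pos)]        # vdw_vector: atom -> A
        vdw_norm = math.sqrt(sum(c * c for c in vec))
        if vdw_norm == 0.0:
            raise ValueError("contour atom coincides with A")
        for k in range(3):
            total[k] += pos[k] + (vec[k] / vdw_norm) * vdw
    nb_contour_atoms = len(contour)
    return tuple(c / nb_contour_atoms for c in total)
```

For two atoms of radius 1 Å at x = 4 and x = −2 with 𝐴 at the origin, the surfaces facing 𝐴 are at x = 3 and x = −1, so the returned center is (1, 0, 0).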

Before proceeding to step e), the minimal surrounding atom distance to 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟 is calculated.

To do so, the atoms that lie within 20 Å of 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟 are extracted.

The following selection criterion is calculated, where 𝑎𝑡𝑜𝑚(𝑎𝑡𝑜𝑚_𝑥, 𝑎𝑡𝑜𝑚_𝑦, 𝑎𝑡𝑜𝑚_𝑧) is an atom belonging to the structure, 𝑎𝑡𝑜𝑚_𝑣𝑑𝑤 is its van der Waals radius, 𝑑𝑖𝑠𝑡(𝑑𝑖𝑠𝑡_𝑥, 𝑑𝑖𝑠𝑡_𝑦, 𝑑𝑖𝑠𝑡_𝑧) is the vector from 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟 to the atom and 𝑟𝑎𝑑𝑖𝑢𝑠_𝑐ℎ𝑒𝑐𝑘 is the vdw weighted distance to 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟.

𝑑𝑖𝑠𝑡_𝑥 = 𝑎𝑡𝑜𝑚_𝑥 − 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑥

𝑑𝑖𝑠𝑡_𝑦 = 𝑎𝑡𝑜𝑚_𝑦 − 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑦

𝑑𝑖𝑠𝑡_𝑧 = 𝑎𝑡𝑜𝑚_𝑧 − 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟_𝑧

$\mathit{radius\_check} = (\mathit{dist\_x}^2 + \mathit{dist\_y}^2 + \mathit{dist\_z}^2)^{0.5} - \mathit{atom\_vdw}$

To accelerate the selection, only the atoms lying within 20 Å of 𝑐𝑜𝑛𝑡𝑜𝑢𝑟_𝑐𝑒𝑛𝑡𝑒𝑟 (i.e., verifying

𝑟𝑎𝑑𝑖𝑢𝑠_𝑐ℎ𝑒𝑐𝑘 ≤ 20) are stored into an array: 𝑝𝑟𝑜𝑗_𝑟𝑎𝑑𝑖𝑢𝑠_𝑐ℎ𝑒𝑐𝑘.

Let 𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥 be the index of the first iteration of the minimal value stored in

𝑝𝑟𝑜𝑗_𝑟𝑎𝑑𝑖𝑢𝑠_𝑐ℎ𝑒𝑐𝑘 array, and let 𝑚𝑖𝑛 be the minimal contour distance to center:

𝑚𝑖𝑛 = 𝑝𝑟𝑜𝑗_𝑟𝑎𝑑𝑖𝑢𝑠_𝑐ℎ𝑒𝑐𝑘[𝑚𝑖𝑛_𝑖𝑛𝑑𝑒𝑥]

Finally, the ith contour minimal radius is stored into 𝑝𝑟𝑜𝑗_𝑟𝑎𝑑𝑖𝑢𝑠_𝑚𝑖𝑛 array.


e) Choose the best pathway axis and calculate pathway parameters

Procedures a) to d) are repeated for each tested pathway direction (scanned axis). For each ith scanned axis, we now have a corresponding minimum radius (stored at ith rank in 𝑝𝑟𝑜𝑗_𝑟𝑎𝑑𝑖𝑢𝑠_𝑚𝑖𝑛) and a van der Waals reweighted geometric center (stored at ith rank in 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑥, 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑦, 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑧).

The scanned direction whose contour has the largest minimal radius (compared to those of the other scanned directions) is the winning axis and is selected.

Let 𝑚𝑎𝑥_𝑖𝑛𝑑𝑒𝑥 be the index of the first iteration of the maximal value stored in

𝑝𝑟𝑜𝑗_𝑟𝑎𝑑𝑖𝑢𝑠_𝑚𝑖𝑛 array, and let 𝐶𝑂𝑀(𝐶𝑂𝑀𝑥, 𝐶𝑂𝑀𝑦, 𝐶𝑂𝑀𝑧) be the geometric center of the winning

contour projected onto plane DIR (i.e. from the starting point of the scan):

𝐶𝑂𝑀𝑥 = 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑥[max _𝑖𝑛𝑑𝑒𝑥]

𝐶𝑂𝑀𝑦 = 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑦[max _𝑖𝑛𝑑𝑒𝑥]

𝐶𝑂𝑀𝑧 = 𝑝𝑟𝑜𝑗_𝐶𝑂𝑀𝑧[max _𝑖𝑛𝑑𝑒𝑥]

It follows that the winning axis is given by:

𝑁𝑥 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑥[max _𝑖𝑛𝑑𝑒𝑥]

𝑁𝑦 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑦[max _𝑖𝑛𝑑𝑒𝑥]

𝑁𝑧 = 𝑠𝑐𝑎𝑛_𝑎𝑥𝑖𝑠_𝑧[max _𝑖𝑛𝑑𝑒𝑥]
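The selection of the winning axis amounts to a "largest of the minima" search, which can be sketched as below. The function name `choose_best_axis` and the use of lists of tuples (instead of three separate coordinate arrays) are hypothetical simplifications.

```python
def choose_best_axis(proj_radius_min, proj_COM, scan_axis):
    """Return (max_index, COM, N) for the direction with the largest minimal radius.

    proj_radius_min: minimal radius of each scanned contour;
    proj_COM, scan_axis: per-direction (x, y, z) tuples, in the same order.
    """
    if not proj_radius_min:
        raise ValueError("no scanned direction")
    # list.index returns the first occurrence of the maximum, as in the text.
    max_index = proj_radius_min.index(max(proj_radius_min))
    return max_index, proj_COM[max_index], scan_axis[max_index]
```

When two directions share the same largest minimal radius, the first one scanned is kept.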


2.3.2. Virtual sphere scan

To advance along the pathway with a constant step size, the second part of the algorithm performs a virtual sphere scan. Starting from the previous point belonging to the pathway axis (the refined starting point at the beginning of the pathway axis generation), a spherical region is scanned along the fixed axis, using a principle very similar to that of section 2.3.1. For each scan axis, a sphere center is defined as the projection of point 𝐴 along the axis, with a projection step (sphere_step) equal to the previous minimum pathway radius + 3 Å. The sphere scan is thus close enough to stay within the pathway resolution, but far enough to capture information about the longitudinal spread of the pathway. For each sphere projected over a sphere_step distance in the tested scanned direction, the minimum distance to the surrounding atoms is stored into an array. The sphere with the largest minimal radius (relative to the other scans) is selected, which refines the main pathway direction. A second sphere scan is then performed, starting from the winning direction of the first sphere scan, in order to resolve the inner irregularities of the channel and to generate a final pathway point belonging to the computed pathway axis. The calculations for the virtual sphere scans are similar to those described previously and will not be specified here. The double virtual sphere scan method (Figure 40) generates pathway axes with great precision (Figure 41).

Figure 40: Virtual sphere scan method. The first scan keeps the search within the main longitudinal spread direction of the pathway. The tested directions and corresponding virtual spheres are shown as purple arrows and circles respectively. The winning direction, i.e. the one whose spherical area has the largest minimal radius to the surrounding atoms among the spherical regions tested, is represented in blue. The second scan resolves the inner irregularities and details of the pathway. The tested directions are represented as cyan arrows. The final winning spherical region is represented in white. The main and sub pathways are represented as a large and a small grey tube respectively.


Figure 41: Virtual sphere scan pathway axis detection. The double virtual sphere scan method precisely generates the axis of a very irregular pathway. The protein channel cross section is represented as a grey surface. The inner contour is very complex, includes turns of almost 90 degrees and periodically displays very small void areas (e.g., pathway exit on the right). The computed axis is represented as a series of red spheres.


2.3.3. Walk forward along pathway axis

All the calculations above have been done for the first step along the pathway, and the best pathway direction has been found. The final step is to increment the scan forward, so as to advance along the pathway axis. Before moving forward along the pathway, the whole scan is repeated once, starting from a re-adjusted position. In other words, the procedures above are repeated with 𝐴 (initial pathway start guess) replaced by 𝐶𝑂𝑀 (re-adjusted pathway start center) and with 𝐵 (point towards which the initial pathway axis guess is pointing) replaced by the projection of 𝐶𝑂𝑀 along 𝒏:

𝑁𝑥 = 𝑁𝑥 − 𝐴𝑥, 𝑁𝑦 = 𝑁𝑦 − 𝐴𝑦, 𝑁𝑧 = 𝑁𝑧 − 𝐴𝑧

$N_{norm} = (N_x^2 + N_y^2 + N_z^2)^{0.5}$

𝑁𝑥 = 𝑁𝑥/𝑁𝑛𝑜𝑟𝑚, 𝑁𝑦 = 𝑁𝑦/𝑁𝑛𝑜𝑟𝑚, 𝑁𝑧 = 𝑁𝑧/𝑁𝑛𝑜𝑟𝑚

New 𝐴 and 𝐵 points are given by:

𝐴𝑥 = 𝐶𝑂𝑀𝑥, 𝐴𝑦 = 𝐶𝑂𝑀𝑦, 𝐴𝑧 = 𝐶𝑂𝑀𝑧

𝐵𝑥 = 𝐶𝑂𝑀𝑥 + 3 ∗ 𝑁𝑥, 𝐵𝑦 = 𝐶𝑂𝑀𝑦 + 3 ∗ 𝑁𝑦, 𝐵𝑧 = 𝐶𝑂𝑀𝑧 + 3 ∗ 𝑁𝑧

Then, for all the subsequent walks along the pathway, 𝐴 is set to the 2 Å projection of 𝐶𝑂𝑀 along 𝑁 (in order to walk along the winning axis) and 𝐵 is set to the 4 Å (arbitrary value) projection of 𝐶𝑂𝑀 along 𝑁 (𝐵 is only used to characterize the direction, and hence could be projected at any distance along 𝑁).

New 𝐴 and 𝐵 points (for forward shifted scan) are given by:

𝐴𝑥 = 𝐶𝑂𝑀𝑥 + 2 ∗ 𝑁𝑥, 𝐴𝑦 = 𝐶𝑂𝑀𝑦 + 2 ∗ 𝑁𝑦, 𝐴𝑧 = 𝐶𝑂𝑀𝑧 + 2 ∗ 𝑁𝑧

𝐵𝑥 = 𝐶𝑂𝑀𝑥 + 4 ∗ 𝑁𝑥, 𝐵𝑦 = 𝐶𝑂𝑀𝑦 + 4 ∗ 𝑁𝑦, 𝐵𝑧 = 𝐶𝑂𝑀𝑧 + 4 ∗ 𝑁𝑧


2.3.4. Convert COM map to distance bins

In order to split the pathway axis into steps at fixed distances from the binding position, independently of the pathway axis length (which varies in time as the pathway conformation changes) and hence independently of the simulation frame, the following procedure is employed. The calculation details will not be specified.

First, each 𝐶𝑂𝑀𝑖 point (defines the pathway axis, derived previously) is projected onto the fixed axis

(see previously), which serves as an invariable reference for the different simulation frames. Let the

projected 𝐶𝑂𝑀𝑖 points be 𝐶𝑂𝑀𝑃𝑖. And let the fixed axis run from points 𝐿1 to 𝐿2. Second, the fixed

axis is divided into 1 Å steps beginning from 𝐿1 and ending at 𝐿2 (corresponding to the target successful

binding position). Third, to each fixed axis step is assigned a lower and an upper bound 𝐶𝑂𝑀 point

(segment of the pathway axis defined by two consecutive 𝐶𝑂𝑀 points), by comparing its position to the

𝐶𝑂𝑀𝑃𝑖 points (initial 𝐶𝑂𝑀𝑖 points that have been projected onto the fixed axis). Fourth, the distance

steps are “reprojected” onto their corresponding 𝐶𝑂𝑀 axis (lower and upper bound 𝐶𝑂𝑀). This is done

by calculating the intersection between the plane defined by the ith step and the normal vector (𝐿1, 𝐿2),

and the assigned 𝐶𝑂𝑀 axis (consecutive 𝐶𝑂𝑀 points selected). Let the final reprojected points

(corresponding to a fixed distance step along (𝐿1, 𝐿2), and belonging to the pathway axis) be

𝐶𝑂𝑀_𝑆𝑇𝐸𝑃𝑖.
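Since the calculation details are not specified, the following is only one possible reading of the four steps above, given as a sketch. The function name `com_to_distance_bins` is hypothetical, and two choices are assumptions: a fixed-axis step is assigned to the first pair of consecutive 𝐶𝑂𝑀 points whose projections bracket it, and a step with no bracketing pair is skipped. The plane/segment intersection reduces to a linear interpolation along the segment.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def com_to_distance_bins(com_points, L1, L2, step=1.0):
    """Re-sample the pathway axis at fixed distances along the fixed axis (L1, L2)."""
    axis = sub(L2, L1)
    length = math.sqrt(dot(axis, axis))
    d = tuple(c / length for c in axis)
    # Scalar position of each projected COM point (COMP_i) along the fixed axis.
    s_com = [dot(sub(c, L1), d) for c in com_points]
    com_steps = []
    n_steps = int(length / step + 1e-9)
    for k in range(n_steps + 1):
        s = k * step
        for i in range(len(com_points) - 1):
            lo, hi = s_com[i], s_com[i + 1]
            if lo != hi and min(lo, hi) <= s <= max(lo, hi):
                # Intersection of the plane (normal d, at distance s from L1)
                # with the segment COM_i -> COM_i+1.
                frac = (s - lo) / (hi - lo)
                p, q = com_points[i], com_points[i + 1]
                com_steps.append(tuple(a + frac * (b - a) for a, b in zip(p, q)))
                break
    return com_steps
```

For a fixed axis running from (0, 0, 0) to (4, 0, 0) and a bent pathway axis (0, 0, 0), (2, 2, 0), (4, 0, 0), the 1 Å steps are re-projected to (0, 0, 0), (1, 1, 0), (2, 2, 0), (3, 1, 0) and (4, 0, 0).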


2.3.5. Calculate cross section area

A first approximation of the cross section area of a 2D shape is:

$\mathit{Area} = \pi \cdot \mathit{r\_mean}^2$

Where 𝑟_𝑚𝑒𝑎𝑛 is the mean distance of the contour atoms to the geometrical center.

Testing this formula on a square or a 6-branched star shape returns an area accurate to within +/-17 % and +/-19 % respectively.

A more accurate method, explained below, is to sum the local areas formed by the atoms, successively

around the contour, i.e. to sum the areas per dial.

Figure 42: Cross section area calculation. The cross section area is computed by summing the area

contribution of the successive dials. The first three dials are represented in green, purple and orange

respectively.

Hence the formula used to calculate the cross section area is:

$\mathit{Area} = \sum_{i=1}^{n} \frac{r1_i \cdot r2_i \cdot \sin(\mathit{teta}_i)}{2}$

Where $r1_i$ and $r2_i$ are the radii of the ith and (i+1)th atoms belonging to the contour, and $\mathit{teta}_i$ is the angle between $r1_i$ and $r2_i$.

To perform this calculation, the pathway is processed again by looping through each 𝐶𝑂𝑀_𝑆𝑇𝐸𝑃𝑖 point. The algorithm performs computations similar to the previous dial calculations. In order to calculate an estimated diffusion area in pathways containing holes, when no atom is detected inside a dial, the previous atom of the contour is rotated to define a virtual contour atom.
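The dial-by-dial area summation can be sketched as follows. The function name `cross_section_area` is hypothetical. Two simplifications are assumptions of this sketch: the contour is closed by pairing the last atom with the first one, and the radii are plain center-to-point distances with no vdw correction.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def cross_section_area(contour, center):
    """Sum the triangle areas r1 * r2 * sin(teta) / 2 over successive contour points."""
    vecs = [sub(p, center) for p in contour]
    area = 0.0
    for i in range(len(vecs)):
        v1, v2 = vecs[i], vecs[(i + 1) % len(vecs)]   # wrap around (assumption)
        r1, r2 = norm(v1), norm(v2)
        teta = math.atan2(norm(cross(v1, v2)), dot(v1, v2))
        area += r1 * r2 * math.sin(teta) / 2.0
    return area
```

For the four corners of a square of side 2 centered on the origin, each dial contributes an area of 1 and the function returns the exact area, 4.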


3. Electrostatic analysis

Three types of forces govern the diffusion: the Brownian random molecular water motion, the non-bonded interactions and hydrogen bonds.

Given that the hydrophobic non-bonded interactions are indirectly taken into account and given that hydrogen bonds represent a special case of electrostatic interaction, the long-range non-bonded interactions are described by the electrostatic and the van der Waals potentials between atoms i and j:

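$U_{non\text{-}bonded} = U_{electrostatics} + U_{vdW} = \frac{q_i q_j}{4\pi\epsilon_0 r_{ij}} + \varepsilon \left[ \left( \frac{R_{min,ij}}{r_{ij}} \right)^{12} - 2 \left( \frac{R_{min,ij}}{r_{ij}} \right)^{6} \right]$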

vdw forces are not straightforward to characterize and are dominated by Coulombic interactions over long distances. Consequently, emphasis is placed on Coulombic electrostatics to characterize the force guiding or impeding substrate access along the pathways.

More precisely, to characterize how favorable a pathway is for substrate diffusion, the central diffusion pathway computed with the methodology of the previous section is used. The series of pathway 𝐶𝑂𝑀_𝑆𝑇𝐸𝑃𝑖 points represents the successive positions of a substrate along the pathway. The long-range electrostatics acting on an rNTP substrate inside a channel is then characterized by calculating the Coulombic interaction between a point of charge -2, representing the substrate at the ith position along the pathway axis, and the charges of the protein atoms. If an NTP at position 𝐶𝑂𝑀_𝑆𝑇𝐸𝑃𝑖 is represented by the point 𝐶𝑂𝑀𝑖_𝑁𝑇𝑃, and if j and i are the protein atom and NTP indexes respectively, the force on the 𝐶𝑂𝑀𝑖_𝑁𝑇𝑃 charge due to the protein charges is given by:

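$\boldsymbol{F}(\mathit{COMi\_NTP}) = \left( \frac{q_{NTP}}{4\pi\varepsilon_r\varepsilon_0} \right) \cdot \sum_{j=1}^{n} \frac{q_j}{|\boldsymbol{r}_{ji}|^2} \hat{\boldsymbol{r}}_{ji}$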

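Using a protein dielectric constant of 74, an NTP charge of -2, and converting to SI units (elementary charge in Coulombs and Angstroms in meters), the equation can be rewritten as:

$\boldsymbol{F}(\mathit{COMi\_NTP}) = \frac{-2}{4 \cdot \pi \cdot 74 \cdot 8.854187817 \cdot 10^{-12}} \cdot \frac{(1.6021766208 \cdot 10^{-19})^2}{10^{-20}} \cdot \sum_{j=1}^{n} \frac{q_j}{|\boldsymbol{r}_{ji}|^2} \hat{\boldsymbol{r}}_{ji}$

$\boldsymbol{F}(\mathit{COMi\_NTP}) = -6.2353446 \cdot 10^{-10} \cdot \sum_{j=1}^{n} \frac{q_j}{|\boldsymbol{r}_{ji}|^2} \hat{\boldsymbol{r}}_{ji}$

where $q_j$ is expressed in units of the elementary charge, $|\boldsymbol{r}_{ji}|$ in Angstroms and the force in Newtons.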
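The numerical prefactor and the force summation can be reproduced with the sketch below. The names `coulomb_prefactor` and `force_on_ntp` and the `(position, charge)` input layout are hypothetical. The vector $\boldsymbol{r}_{ji}$ is taken here to point from protein atom j to the NTP point, which is an assumption about the sign convention; with it, a positive protein charge attracts the negatively charged NTP.

```python
import math

EPS0 = 8.854187817e-12        # vacuum permittivity, F/m
E_CHARGE = 1.6021766208e-19   # elementary charge, C
ANGSTROM = 1e-10              # m

def coulomb_prefactor(q_ntp=-2.0, eps_r=74.0):
    """Prefactor in N for charges in units of e and distances in Angstrom."""
    return q_ntp * E_CHARGE ** 2 / (4.0 * math.pi * eps_r * EPS0 * ANGSTROM ** 2)

def force_on_ntp(com_ntp, protein_atoms, q_ntp=-2.0, eps_r=74.0):
    """Coulomb force (N) on the NTP point charge due to the protein charges.

    protein_atoms: iterable of ((x, y, z) in Angstrom, partial charge in e).
    """
    k = coulomb_prefactor(q_ntp, eps_r)
    force = [0.0, 0.0, 0.0]
    for pos, q_j in protein_atoms:
        r_ji = [c - p for c, p in zip(com_ntp, pos)]   # atom j -> NTP (assumption)
        dist = math.sqrt(sum(c * c for c in r_ji))
        scale = k * q_j / dist ** 2
        for axis in range(3):
            force[axis] += scale * r_ji[axis] / dist
    return tuple(force)
```

`coulomb_prefactor()` evaluates to approximately −6.2353446e-10, the constant of the last equation, and a single +1 charge located 10 Å from the NTP produces a force of about 6.2e-12 N directed towards that charge.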


Chapter 5

Results and Discussion


1. Introduction

Diffusion is a critical step to provide substrates to molecular machines. One can think of substrate

loading as being mainly stochastic and random in nature. However, a cell is orchestrated in a very precise

manner, and in living organisms, nature has provided advanced and sometimes complex solutions to

control substrate input such as precisely shaped pathways, or elaborate electrostatic filtration. As such,

diffusion, and the biomolecular properties underlying its behavior, can be seen as being part of more

general cellular programming. In RNA synthesis, substrate delivery can be seen as the most elementary

step of elongation.

In this section, we will present new results about substrate diffusion and loading into RNAP. We will

attempt to characterize the diffusion process and check if simulation results are in accordance with the

main channel theory presented in chapter 1. The following questions will be discussed. What are the

diffusion pathways leading to the DS bubble or the catalytic center? Are there conformationally or

electrostatically suitable routes and do they compare favorably to CH2? How does NTP loading fit into a rationalized, more general model of the enzymatic translocation cycle?


2. Simulation summary

Trajectories derived from five aMD and six sMD simulations are listed in this subsection. aMD

simulations are summarized in Table 6 below.

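aMD simulation name | time | A1 | A2 | A3 | A4 | nb of protein residues | nb of water mol. | total nb of atoms
aMD1 | 20 ns | 3.5 | 0.20 | 0.50 | 0.50 | 3795 | 159600 | 707874
 | 50 ns | 3.5 | 0.20 | 0.20 | 0.20 | 3795 | 159600 | 707874
aMD2 | 50 ns | 3.5 | 0.20 | 0.50 | 0.50 | 3795 | 159600 | 707874
aMD3 | 50 ns | 4.5 | 0.20 | 0.20 | 0.20 | 3795 | 159600 | 707874
aMD4 | 50 ns | 3.5 | 0.20 | 0.20 | 0.20 | 3795 | 159600 | 707874
aMD5 | 80 ns | 3.5 | 0.20 | 0.20 | 0.20 | 3795 | 159600 | 707874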

Table 6: aMD simulation summary. Acceleration parameters are calculated from A1, A2, A3 and A4 as (all energies in kcal.mol-1):

\[
\begin{aligned}
E_{dihed} &= V_{dihed} + A_1 \times nb\_prot\_res,\\
\alpha_{dihed} &= A_2 \times (A_1 \times nb\_prot\_res),\\
E_{total} &= V_{total} + A_3 \times nb\_atms,\\
\alpha_{total} &= A_4 \times nb\_atms
\end{aligned}
\]
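The four relations of the caption can be evaluated directly. The following minimal Python sketch does so for the aMD5 coefficients and the system size of Table 6; the function name is hypothetical, and the potential energies V_dihed and V_total are placeholder inputs, since their values are not given in this section.

```python
def amd_parameters(v_dihed, v_total, nb_prot_res, nb_atms, a1, a2, a3, a4):
    """Return the aMD threshold energies and alpha factors (kcal/mol)."""
    e_dihed = v_dihed + a1 * nb_prot_res
    alpha_dihed = a2 * (a1 * nb_prot_res)
    e_total = v_total + a3 * nb_atms
    alpha_total = a4 * nb_atms
    return {
        "E_dihed": e_dihed,
        "alpha_dihed": alpha_dihed,
        "E_total": e_total,
        "alpha_total": alpha_total,
    }


if __name__ == "__main__":
    # A1..A4 and system size from Table 6 (aMD5); the two energies are placeholders.
    params = amd_parameters(
        v_dihed=0.0, v_total=0.0,      # hypothetical values, to be replaced
        nb_prot_res=3795, nb_atms=707874,
        a1=3.5, a2=0.20, a3=0.20, a4=0.20,
    )
    for name, value in params.items():
        print(name, value)
```

With zero placeholder energies, the printed threshold values reduce to the offsets added on top of V_dihed and V_total.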

In addition, Force-Distance relationships are generated from the six sMD trajectories outlined in Table

7, specifying the structural checkpoints used along the sMD pull across a pathway, and the magnitude

of the pulling force.


Simulation | Landmarks (LD) | Checkpoints (CK)
sMD CH2 0.3, sMD CH2 0.4 | LD1A, Rpb1 728:NZ; LD1B, Rpb1 1300:NZ; LD2A, Rpb1 1360:CA; LD2B, Rpb1 620:CA; LD3, Rpb1 476:CA; LD4, Rpb2 837:CB; LD5, tDNA i + 1:N3 | CK1, 2.5 Å from m(LD1A, LD1B); CK2, 7 Å from m(LD2A, LD2B); CK3, 6 Å from LD3; CK4, 5 Å from LD4; CK5, 3 Å from LD5
sMD CH3C 0.075 | LD1A, Rpb1 1222:CG; LD1B, Rpb5 118:CG; LD2, Rpb1 1278:CG; LD3, tDNA i + 2:N3 | CK1, 4 Å from m(LD1A, LD1B); CK2, 7 Å from LD2; CK3, 3 Å from LD3
sMD CH3A 0.3 | LD1A, Rpb1 728:NZ; LD1B, Rpb1 1300:NZ; LD2A, Rpb1 716:CG; LD2B, Rpb1 1092:CE; LD3A, Rpb1 1113:CB; LD3B, Rpb1 773:CD; LD4A, Rpb1 1112:CE; LD4B, Rpb2 509:CA; LD5, tDNA i + 2:N3 | CK1, 2.5 Å from m(LD1A, LD1B); CK4, 2.5 Å from m(LD4A, LD4B); CK5, 3 Å from LD5
sMD-aMD CH3C 0.04 | LD1A, Rpb5 91:CD; LD1B, tDNA i-21:C5'; LD2, Rpb1 1247:CB; LD3, Rpb1 771:CG; LD4, tDNA i + 2:N3 | CK2, 7 Å from LD2; CK3, 7 Å from LD3; CK4, 3 Å from LD4
sMD CH3B 0.15, sMD CH3B 0.3 | LD1A, Rpb1 702:CG; LD1B, Rpb1 1274:CZ; LD1C, Rpb9 92:NH2; LD2A, Rpb1 702:CG; LD2B, Rpb1 1274:CZ; LD3, Rpb9 50:CB; LD4, tDNA i + 2:N3; LD5, Rpb1 771:CG; LD6, tDNA i + 2:N3 | CK1, 2.5 Å from 25 Å projection of m(LD1A, LD1B) along LD1C, m(LD1A, LD1B); CK2, 2.5 Å from m(LD2A, LD2B); CK3, 12 Å from LD3; CK4, no checkpoint distance, pulled for 50 ps towards LD4; CK5, 6.5 Å from LD5; CK6, 3 Å from LD6
sMD CH4 0.075 | LD1, NA; LD2, tDNA i + 2:N1 | CK1, manually positioned at entrance of CH4; CK2, 3 Å from LD2

Table 7: sMD simulation summary. For each sMD simulation, the corresponding pathway is indicated after the “sMD” instance in the title, followed by the pulling force in kcal.mol-1.Å-2. When a landmark is calculated as the midpoint between two points, the notation m(A,B) is used. LD stands for landmark x, y, z coordinates. sMD trajectories are divided into sub-paths; the pull switches to the next sub-path at a checkpoint (CK), i.e. once the pulled molecule comes within a threshold distance of the LD point (the distance in Å given in the table above before the LD point). The simulated system and pulled molecule are 2E2H and GTP respectively for all the sMD runs, except for sMD CH4 0.075 where the system is PDB#5C4J and the pulled molecule is CTP.
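The checkpoint switching described in this caption can be sketched as follows. This is a minimal Python illustration of the logic only, not the script used for the simulations: the function names, the coordinates and the data layout are hypothetical, and the pulled molecule is reduced to a single point.

```python
import math


def midpoint(a, b):
    """m(A,B): midpoint between two landmark coordinates."""
    return tuple((ai + bi) / 2.0 for ai, bi in zip(a, b))


def distance(a, b):
    """Euclidean distance between two points (Å)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def next_target_index(position, targets, thresholds, current):
    """Switch to the next landmark once the pulled molecule is within the
    checkpoint threshold of the current one; otherwise keep the current target."""
    is_last = current >= len(targets) - 1
    if not is_last and distance(position, targets[current]) <= thresholds[current]:
        return current + 1
    return current


if __name__ == "__main__":
    # Hypothetical coordinates (Å): first target is a midpoint, second a single landmark.
    targets = [midpoint((0.0, 0.0, 0.0), (2.0, 0.0, 0.0)), (10.0, 0.0, 0.0)]
    thresholds = [2.5, 3.0]  # checkpoint distances, same order as the targets
    print(next_target_index((5.0, 0.0, 0.0), targets, thresholds, 0))
    print(next_target_index((3.0, 0.0, 0.0), targets, thresholds, 0))
```

Keeping the targets and thresholds as two parallel lists mirrors the LD and CK columns of the table.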

Finally, an algorithm has been developed (see the previous section for explanations) and is executed to extract the pathway axis, cross-section area, minimal radius and electrostatic force experienced by a virtual NTP point charge of -2 along the diffusional pathway. In order to characterize the electrostatic force in an informative fashion, i.e. as a propensity to travel through a pathway, magnitude and orientation are combined into one single value by projecting the Coulombic interaction vector between the virtual NTP point charge and the protein atoms onto the diffusional axis generated by the scanning algorithm. The latter is not rectilinear (a single axis) but is represented by successive 1 Å long pathway axes, where each 1 Å axis runs from one pathway center to the next; these centers are referred to as pathway COMs for simplicity.
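The projection step can be illustrated with a short Python sketch. The function names are hypothetical, the force vectors are assumed to have been computed beforehand at each pathway COM, and the sign convention (positive values pointing towards the next COM) is an assumption made for the example.

```python
import math


def unit_axis(com_a, com_b):
    """Unit vector of the local pathway axis running from one COM to the next."""
    delta = [b - a for a, b in zip(com_a, com_b)]
    norm = math.sqrt(sum(c * c for c in delta))
    return [c / norm for c in delta]


def projected_force(force, com_i, com_next):
    """Scalar projection of a force vector onto the local pathway axis."""
    axis = unit_axis(com_i, com_next)
    return sum(f * a for f, a in zip(force, axis))


def project_along_pathway(forces, coms):
    """One signed value per axis segment; forces[i] is the force at coms[i]."""
    return [
        projected_force(forces[i], coms[i], coms[i + 1])
        for i in range(len(coms) - 1)
    ]


if __name__ == "__main__":
    # Hypothetical pathway with a right-angle turn and arbitrary force vectors.
    coms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
    forces = [(2.0, 0.0, 0.0), (3.0, -4.0, 0.0), (0.0, 0.0, 0.0)]
    print(project_along_pathway(forces, coms))
```

A single signed number per segment is what allows the force to be displayed as one heatmap value at each position along the pathway.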


3. Results

3.1. Diffusional zones

Many authors have proposed that substrates have no access to the main channel. However, it appears impossible to rationalize how downstream templated NTPs could promote the translocation sliding degrees of freedom, and consequently help expel a misloaded NTP or accelerate the active site delivery and/or isomerization of a correct NTP, without binding to DS registers. Consistent with the results presented in the main channel theory section, which seem to indicate that substrates can access the main channel, several pathways have been identified in the RNAP structure and appear to offer substrate delivery capabilities to the main channel. In addition to the CH2 pathway widely discussed in the literature, five channels leading to the DS bubble have been identified. Altogether, the possible diffusion routes are the following.

CH2 comprises the funnel and a narrow corridor. Sequence of CH2 is, scRPB1: 350, 352, 446-448, 450,

451, 453, 454, 472-477, 479-486, 513, 515-525, 528, 532, 533, 535-538, 588-605, 616-628, 631, 632,

635, 693, 696, 697, 702-739, 743-758, 760, 764-769, 772-774, 819-824, 826-828, 831, 832, 878-888,

946-962, 1025, 1071, 1074, 1075, 1078-1097, 1100, 1113, 1115-1117, 1119, 1281-1291, 1298-1309,

1326, 1328-1330, 1342, 1345, 1346, 1349-1351, 1353-1366, 1368; scRPB2: 529-531, 533, 763, 765,

766, 769, 772, 773, 776, 835, 836, 837, 977, 979, 985-987, 1013, 1016, 1018-1021, 1095-1097, 1102;

scRPB5: 147-149, 151, 200-204.

The sequence of the corridor is, scRPB1: 350, 352, 446 - 448, 450, 451, 453, 454, 472 - 477, 479 - 486,

515, 520 - 525, 528, 623, 624, 750 - 753, 819-824, 826, 827, 1074, 1075, 1078 – 1086; scRPB2: 529 -

531, 533, 763, 765, 766, 769, 772, 773, 776, 836, 837, 977, 979, 985-987, 1018 - 1021, 1095-1097,

1102.


Figure 43: CH2 and corridor pathways. Residues lining the corridor section of CH2 are shown in white, the

remaining part of CH2 is shown in blue. The protein and the RNA 3’ end are colored in grey and lime

respectively.

A complex channel is branched in four parts and will be referred to as CH3. CH3A/B channel runs from

two openings near the funnel of CH2, directly to the downstream bubble near registers i + 2 to i + 4.

CH3A is formed by a hole in the funnel of CH2. CH3B is adjacent to CH3A, and is formed by a hole

lying near the exterior of the enzyme rather than the funnel. CH3A seems to correspond to the “pore 2”

pathway described briefly by Cramer et al. in [Cramer, et al., 2000], but has apparently not been referred

to since.

CH3A/B is composed of the following residues, scRPB1: 700-712, 715, 716, 768-784, 787-791, 796,

797, 814, 815, 817, 819, 826, 827, 829, 835, 837, 840, 1076, 1080 1089, 1089-1116, 1132, 1134-1136,

1138-1141, 1144-1146, 1148, 1198, 1200-1207, 1269, 1274, 1277-1284, 1307-1312, 1329-1334, 1351,

1354, 1355, 1357, 1358, 1381, 1383-1387; scRPB2: 218, 224-241, 254-264, 267, 308, 309, 312, 313,

381, 386-400, 404, 501-517, 535, 699; scRPB9: 44, 46, 48-53, 87, 89-94, 96, 113-120. CH3A opening

is, scRPB1: 705-708, 712, 713, 716, 717, 719, 720, 769, 771-774, 1089-1097, 1100, 1113, 1115, 1117,

1281, 1283, 1285, 1287, 1307, 1309, 1328, 1330, 1350, 1351, 1354, 1357, 1358. CH3B opening is,

scRPB1: 700-706, 708-710, 1132, 1134-1136, 1138-1141, 1144-1146, 1148, 1198, 1200-1207, 1269,

1274, 1277-1279, 1281-1284; scRPB2: 263, 264, 267, 308, 309, 312, 313; scRPB9: 44, 46, 48-53, 90,

92-94, 96, 113-120.


Figure 44: CH3 channel view from CH2. Residues lining opening A of CH3 are shown in green, opening B

leading to CH3 and CH3 are indicated in pink. CH2 is indicated in blue and the protein and nucleic acid

atoms are represented as grey lines.

Figure 45: Side view of CH3. CH3, CH2, downstream tDNA and ntDNA are shown in blue, pink, light blue

and cyan respectively. Protein atoms are represented as grey lines.


CH3C joins CH3A/B on the other side of the protein wall, further away from the funnel, and is shaped as a tube that is open on one side, towards DS DNA, along one third of its length. Hence CH3C is a sub-channel of CH1 and partly envelops DS DNA. In addition to CH3C, two additional channels lie in the CH1 area: CH3D runs below and perpendicularly to DS DNA and joins CH3C, and CH4 is a passage that goes under a loop formed by ntDNA to enter the pre-binding i + 2 to i + 4 zone from the direction opposite to CH3A/B.

The sequences are the following. CH3C: scRPB1: 829, 832, 833, 836, 837, 840, 1095, 1096, 1099, 1100,

1102, 1103, 1105-1114, 1140-1142, 1144, 1145, 1215, 1216, 1218-1224, 1242-1263, 1265-1267, 1269-

1272, 1275-1280, 1309-1315, 1317, 1318, 1329, 1331, 1333, 1334, 1336-1338, 1381-1383, 1385-1387;

scRPB2: 224, 226-234, 237, 239, 255, 257, 259-268, 270, 277, 278, 279, 396-399, 504-511; scRPB5: 5,

7, 8, 11, 112-119, 121, 122, 136-140; ntDNA i-20 to i-4, t strand i-20 to i-2. CH3D: scRPB1: 118-141,

143-147, 860-862, 1393, 1394; scRPB5: 140, 171, 173, 175- 194, 213-215. CH4: scRPB1: 306-316,

ntDNA i + 4 to i + 10.

Figure 46: Side view of CH3C, CH3D and CH4, relative to CH2. CH2, CH3C, CH3D, CH4, tDNA and ntDNA are shown in blue, yellow, red, orange, light blue and cyan respectively. The represented CH4 surface includes ntDNA registers i + 4 to i + 10. The rest of the protein is indicated in grey.


Figure 47: Front view of CH3C, CH3D and CH4. CH3C, CH3D, CH4, tDNA, ntDNA and RNA are shown in yellow, red, orange, light blue, cyan and lime respectively. The protein is indicated in grey.

Figure 48: Side view of CH3C, CH3D and CH4, relative to CH4. CH3C, CH3D, CH4, tDNA, ntDNA and RNA are shown in yellow, red, orange, light blue, cyan and lime respectively. The protein is indicated in grey.


Figure 49: Bottom view of CH3D entrance to CH3. CH3D, CH3C, tDNA and the overall enzyme are visible

as red, yellow, cyan and grey surfaces respectively.


3.2. CH2 Analysis

Conformational analysis

Let us start our investigation with the well-known secondary channel. The pathway algorithm detected

the following COM axis across the channel.

Figure 50: Front, side and back view of CH2 pathway axis. The pathway is represented in grey surface.

Virtual atoms filling holes in the pathway surface are indicated as silver points along the contour, thereby

drawing a closed diffusive channel. The computed pathway axis is represented as a red Gaussian trajectory.


The pathway axis is indicated as successive red spheres in the first figure to better visualize trajectory along

the funnel.

The path is characterized by two main directions: a first segment from the funnel opening to the entrance of the corridor, followed by a change of direction across the corridor leading to the active site.

The cross-sectional areas and minimal radii along this COM axis are given below. It is to be noted that the path displays a dramatic reduction of the diffusive area when entering the corridor (the cross-section area heatmap agrees well with that of the minimal radius).

Figure 51: CH2 minimal radius along diffusional path heatmap. Time against Distance to Binding against

Minimal Radius along the pathway is plotted. The simulation trajectory is aMD5.


Figure 52: CH2 cross section area along diffusional path heatmap. Time against Distance to Binding against

Cross Section Area along the pathway is plotted. The simulation trajectory is aMD5.


Electrostatic analysis

The favorable or impeding electrostatic contribution is characterized by the projection, along the COM axis, of the Coulombic electrostatic interaction between the protein atoms and a virtual point of charge -2 representing an NTP. The heatmap below displays this information and further weakens the case for CH2 being a favorable diffusive channel from the corridor section onwards.

Figure 53: CH2 Electrostatic NTP interaction along diffusional path heatmap. Time against NTP

experienced Electrostatic Force projected along channel axis against Cross Section Area along the pathway

is plotted. The simulation trajectory is aMD5.


Force-Distance relationship

To further test how favorable the pathway is to diffusion, several pulling forces were applied to a GTP molecule along the checkpoints presented in Table 7. The nucleotide triphosphate required a force of 0.4 kcal.mol-1.Å-2 to cross the corridor, while a force of 0.3 kcal.mol-1.Å-2 led to the substrate halting its diffusion. Furthermore, the most favorable conditions were used, with the TL maintained open with restraints.

Figure 54: CH2 force-distance plot. The simulation trajectories are sMD CH2 0.3 and sMD CH2 0.4.

Substrate/Metabolite diffusion analysis

In aMD2 and aMD4 simulations, glutamate molecules diffused through the funnel to the entrance of

the corridor, then quickly diffused away, confirming that the corridor is not suitable to accommodate

negatively charged molecules.


3.3. CH3A Analysis


Conformational analysis

CH3A is an interesting opening, because its access can be completely gated or greatly expanded. The pore leads directly to DS DNA around i + 2 to i + 3. Let us first consider the parameters restricting the channel. The opening appears to be gated by the TL when the latter is in the extreme open conformation. For example, the PDB#5C4J crystal structure shows an initial complete gating of CH3A. However, preliminary simulations of 5C4J seem to indicate that the TL quickly retracts slightly from CH3A, reducing its gating (data not shown). Also, CH3A access seems to be shielded when TFIIS binds (chapter 1).

Figure 55: Front and side view of TL closing of opening CH3A. TL, opening A and protein walls are

indicated in grey, red and green respectively. RNAP structure is PDB#5C4J [Barnes, et al., 2015].

In the aMD1 to aMD5 simulations, CH3A globally maintains a large opening. The entrance expands stochastically, resulting in periodic merging with CH3B, thereby forming one single opening: CH3A/B. The pathway algorithm was run on aMD1, where the access displays a very large void surface and was virtually merged with CH3B during the entire 70 ns simulation. The figures below display the COM axis trajectory and a conformation in which CH3A is merged with CH3B.

Figure 56: Front, side and back view of CH3A pathway axis. The pathway is represented in grey surface.




The conformation along the channel can be characterized by the following heatmaps.

Figure 57: CH3A minimal radius along diffusional path heatmap. Time against Distance to Binding against Minimal Radius along the pathway is plotted.


Figure 58: CH3A cross section area along diffusional path heatmap. Time against Distance to Binding

against Cross Section Area along the pathway is plotted. The simulation trajectory is aMD5.



Electrostatic analysis

Although the accessible area is very large, the pathway is unfavorable to NTP diffusion due to an electrostatic force repelling an NTP away from the diffusional path leading to a potential pre-binding site.

Figure 59: CH3A Electrostatic NTP interaction along diffusional path heatmap. Time against NTP





Force-Distance relationship

Several forces were tested, and a pulling magnitude of 0.3 kcal.mol-1.Å-2 was required to overcome the negative electrostatic potential.

Figure 60: CH3A force-distance plot. The simulation trajectory is sMD CH3A 0.2.


Substrate/Metabolite diffusion analysis

In the aMD1 and aMD2 simulations, a glutamate and an aspartate metabolite, respectively, diffused completely across the channel. This seems to indicate that the pathway is more favorable than CH2, where no metabolite was able to get past the E site near the entrance of the corridor.


3.4. CH3B Analysis


Conformational analysis

CH3B is also an interesting pathway, because it displays a very precisely shaped narrow pore running from an opening on the outside of the enzyme, adjacent to the CH3A opening that belongs to the CH2 funnel area, and does not seem to be affected by the TL conformational switch or TFIIS binding. The pathway algorithm generated a COM trajectory axis displaying a mean minimal radius along the diffusive path of only about 3 Å.

Figure 61: Front, side and back view of CH3B pathway axis. The pathway is represented in grey surface.




Figure 62: CH3B minimal radius along diffusional path heatmap. Time against Distance to Binding against Minimal Radius along the pathway is plotted.


Figure 63: CH3B cross section area along diffusional path heatmap. Time against Distance to Binding against Cross Section Area along the pathway is plotted.




Electrostatic analysis

According to the electrostatic calculations, CH3B is not favorable to NTP accommodation.

Figure 64: CH3B Electrostatic NTP interaction along diffusional path heatmap. Time against NTP





Force-Distance relationship

Although a negative electrostatic potential lies across the channel, a relatively low pulling force of 0.15 kcal.mol-1.Å-2 was able to make a GTP molecule diffuse almost successfully. A force of 0.3 kcal.mol-1.Å-2 made the substrate diffuse very quickly and compared favorably to the same pulling force applied in the CH3A case, which seems to indicate that the channel is more favorable than both CH2 and CH3A.

Figure 65: CH3B force-distance plot. The simulation trajectories are sMD CH3B 0.15 and sMD CH3B 0.3.


Substrate/Metabolite diffusion analysis

In the aMD2 simulation, a zwitterionic glutamate amino acid diffused through the channel. More importantly, in aMD5, a GTP molecule bound at the entrance of the channel and remained at that position during the simulation, which seems to indicate that there is no energy barrier to access the very beginning of the pathway.


Figure 66: GTP bound at CH3B entrance. GTP and bound MgB ion are indicated as red CPK drawing and

pink sphere respectively. Protein surface, tDNA and ntDNA are shown in grey, light blue and cyan

respectively.


3.5. CH3C Analysis


Conformational analysis

CH3C is an intriguing pathway because, although it lies next to DS DNA, it remains at a distance from the nucleic acid helix during most of the simulation time. The solvent-accessible cavity widens in the first few ns of the simulations, meaning that in the initial crystal structure atomic coordinates, crystal packing forces might partially hide the pathway. The last fourth of the channel seems to be gated by scRPB2: 204-206. Nevertheless, the latter residues are folded away most of the time, hence not impeding accessibility in the last section of the channel. In the 80 ns long aMD5 simulation, scRPB2: 204-206 were always folded away.

Figure 67: Longitudinal view through CH3C. Gating residues near the end of the pathway, protein surface, tDNA and ntDNA are shown in lime, grey, light blue and cyan respectively.


A diffusive COM axis has been detected, and is presented below.

Figure 68: Side view of CH3C pathway axis. The pathway is represented in grey surface. Virtual atoms

filling holes in the pathway surface are indicated as silver points along the contour, thereby drawing a closed

diffusive channel. The computed pathway axis is represented as a red Gaussian trajectory.


Over the 80 ns of aMD, large accessible dimensions are observed. Although this is not obvious from the minimal radius along the COM axis, the accessibility is better evidenced by the cross-section area heatmap.

Figure 69: CH3C minimal radius along diffusional path heatmap. Time against Distance to Binding against Minimal Radius along the pathway is plotted.


Figure 70: CH3C cross section area along diffusional path heatmap. Time against Distance to Binding against Cross Section Area along the pathway is plotted.




Electrostatic analysis

CH3C seems to be electrostatically suitable to accommodate NTP substrates, although an energetic barrier lies at the very beginning.

Figure 71: CH3C Electrostatic NTP interaction along diffusional path heatmap. Time against NTP





Force-Distance relationship

sMD simulations compare very advantageously with the alternative pathways: a pulling force of only 0.075 kcal.mol-1.Å-2 allowed fast diffusion of the substrate to binding. Also, an sMD simulation using the aMD boost sampling method allowed a virtually complete diffusion (stopping a few ångströms away from binding, probably because the trajectory would require a few adjustments) with a force that can be considered almost negligible: 0.04 kcal.mol-1.Å-2.

Figure 72: CH3C force-distance plot. The simulation trajectories are sMD CH3C 0.075 and sMD CH3C

0.04 aMD.



Substrate/Metabolite diffusion analysis

Around 65 ns of the aMD5 simulation, a GTP molecule initiated diffusion across CH3C. The substrate then inserted further into the channel. The base group stuck to the protein walls, preventing the molecule from pursuing its diffusion quickly, which appeared to be due to suboptimal NTP base group parameters. The nucleic acid forcefield modifications for use with the 12-6-4 potential from [Panteva, et al., 2015B] were then applied to the NTP, and the molecule unbound and continued a quick diffusion across the channel. The simulated diffusion could constitute an unbiased (as compared to sMD, where a force that biases the reaction coordinate is applied), partially successful diffusion. The NTP is bound to an additional positively charged metabolite: an extra Mg2+ ion. This could help cross the small energetic barrier that seems to lie (over 3 to 4 Å) at the beginning of the pathway.

Figure 73: NTP diffusion through CH3C state 1. A substrate approaches CH3C around time step 66 ns of

aMD5 simulation. The GTP molecule (red) bound to two Mg2+ (pink spheres) is shown. The protein surface,

tDNA and ntDNA are indicated in grey, light blue and cyan respectively.


Figure 74: NTP diffusion through CH3C state 2. The substrate initiates diffusion around time step 66.5 ns

of aMD5 simulation. The GTP molecule (red) bound to two Mg2+ (pink spheres) is shown. The protein

surface, tDNA and ntDNA are indicated in grey, light blue and cyan respectively.

Figure 75: NTP diffusion through CH3C state 3. The substrate continues diffusion inside CH3C around

time step 80 ns of aMD5 simulation. The GTP molecule (red) bound to two Mg2+ (pink spheres) is shown.

The protein surface, tDNA and ntDNA are indicated in grey, light blue and cyan respectively.


Figure 76: NTP diffusion through CH3C state 4. [Panteva, et al., 2015B] parameters are switched on and

the substrate diffuses along half of CH3C pathway (aMD5-prolonged time step 85.5 ns). The GTP molecule

(red) bound to two Mg2+ (pink spheres) is shown. The protein surface, tDNA and ntDNA are indicated in

grey, light blue and cyan respectively.

In addition to the NTP loading depicted above, glutamate molecules diffused completely along the channel, arriving near the DS DNA pre-binding area in aMD2, 3, 4 and 5. The diffusion occurred very quickly (0.5 ns up to 2 ns), significantly faster than any metabolite travel observed in the alternative pathways. This seems to corroborate both the sMD and the electrostatic analyses indicating that CH3C is the favorable access for NTP loading to the pre-binding registers.


3.6. CH3D Analysis

Preliminary analysis

In aMD simulations, NTPs appeared to be strongly repelled from the entrance of CH3D. Therefore, the other channels were tested in priority and CH3D has not been thoroughly investigated. In the aMD2 simulation, a GTP substrate travelled to the entrance of the channel before diffusing away.

Figure 77: NTP diffusion at CH3D entrance. The GTP molecule and its bound Mg2+ ion are shown in red

and pink respectively. The protein surface, tDNA and ntDNA are indicated in grey, light blue and cyan

respectively.


3.7. CH4 Analysis

Preliminary analysis

The CH4 opening seems to be created mainly by the ntDNA upstream section from i + 4 to i + 10. aMD simulations were performed with a reconstructed EC displaying a merely satisfactory ntDNA upstream conformation. Therefore, CH4 has not been thoroughly examined, because a non-optimal initial conformation can bias the entire simulation behavior, all the more so because the extremities of the DNA have to be kept immobile with restraints, which does not necessarily allow the structure to recover from a potentially non-optimal initial conformation. A complete transcription bubble (PDB#5C4J) has been published recently and provides an adequate structure to investigate CH4. The investigation of CH4 has therefore only been started by the author and is currently in progress.

Preliminary results seem to show that the access is favorable to substrate diffusion (Figure 78 below).

In addition, i + 2 register appears to orientate most of the time towards CH4, which may be consistent

with the channel being the most favorable NTP loading route. In aMD2 simulation, a glutamate molecule

diffused inside the pre-binding cavity via CH4.

Figure 78: CH4 force-distance plot. The simulation trajectory is sMD CH4 0.075.


3.8. Misloading recovery investigation

We have discussed in chapter 1 hypotheses about how misloading recovery could occur in the CH1 model. The CH2 model appears at first glance more straightforward for proposing a misloading recovery mechanism. If NTP substrates load via CH2 and a wrong NTP is isomerized in the catalytic site and subsequently expelled by the TL induced-fit mechanism, a new NTP can simply travel again via CH2 and bind if correct. However, the issue is much subtler in light of the CH1 theory. If an erroneous NTP has bound to the DS registers and has been wrongly loaded to the active site, then expulsion of the NTP via CH2 leaves, as the only option for recovery, an obligatory repositioning of the i + 1 tDNA register inside the DS bubble. i + 1 could simply rotate toward CH1 to allow NTP reloading. In other words, the EC may not necessarily need to be fully pre-translocated to recover from misloading. However, the latter phenomenon most likely represents a short off-pathway time window, when i + 1 stochastically shifts toward the DS bubble. On the other hand, a full pre-translocation of the EC would allow i + 1 to position more permanently in the DS bubble and hence would represent the on-pathway recovery state. It therefore appears interesting to investigate the pre-translocation mechanism, because it allows the details of the critical misloading recovery process to be refined in a more general CH1 model.

aMD3, with a higher acceleration boost on the dihedral component of the forcefield potential, captured a complete pre-translocation event. Analysis of the interplay between the enzymatic domains raises the following observations. During the pre-translocation motion, the BH applies a force against the free i + 1 register by bending towards the catalytic site. In contrast to the post-translocation motion following incorporation of an NTP, the latter register is not immobilized, because it is unbound. The aMD3 simulation shows that when the BH starts bending and exerting pressure on the free i + 1 nucleotide, the force is absorbed by the DNA, which begins to bend, and the force is telescoped to the i + 2 register, which undergoes an almost 180 degree shift. i + 2 flips and pushes against the Switch 2 domain (SW2), resulting in a net motion of RNAP towards the RNA 3’ end. While the BH bending continues, the i + 1 base flips as well and stacks briefly against i + 2 in an inverted position, thereby assisting the push against the Switch 2 domain while further freeing the catalytic cavity. Finally, i + 1 and i + 2 return to a non-inverted position and stabilize in the DS bubble: RNAP has pre-translocated. This mechanism is fascinating for several reasons. First, the enzyme uses the push against i + 1 indirectly. It does not move away from i + 1, as could be intuitively assumed; rather, the induced force is telescoped behind the initial pushing direction of the BH, to i + 2, which pushes against SW2. Second, it is very interesting to note that RNAP utilizes the exact same initial mechanical domain motion to carry out sliding on DNA in two opposite directions. The key is that the same force applied by the BH is not decoupled in the same way depending on whether the EC is in the pre-translocated or the post-translocated geometry, resulting in two net motions in opposite directions. The BH does not push in the opposite direction from post-translocation to drive pre-translocation.


Figure 79: Pre-translocation protein re-adjustments occurring near the active site. RNA, tDNA and BH are

shown in lime, light blue and red respectively. i + 1 and i + 2 nucleotides are indicated in yellow and orange

vdw representation respectively. A: The complex is fully post-translocated. B: BH bends and initiates a push

against i + 1 resulting in the flipping of i + 2 register. C: Downstream displacement of the enzymatic complex

is occurring, BH approaches RNA 3’ end and i + 2 register is joined by i + 1 in an inverted position. i + 1

has left the catalytic cavity. D and E: i + 1 switches to the other side of BH, while i + 2 returns to a non-inverted position.

Figure 80: Mechanistic basis for pre-translocation. RNA, tDNA, BH, i + 1 nucleotide, i + 2 nucleotide, Switch

1 (scRPB1: 1384-1407) and Switch 2 (scRPB1: 326-345) domains are represented in lime, light blue, red,

yellow, orange, blue and mauve respectively. A and B: while flipping into an inverted position, i + 2 applies

a push against Switch 2 domain. C: i + 1 transiently assists i + 2 pushing against Switch 2 domain, before

being channeled downstream.


4. Discussion

The gallery structure running through RNAP is very intricate. In addition to CH2, five channels have been identified. Some of them are branched, involving overlapping areas, and some constitute sub-pathways of larger channels (e.g., CH3C). In all the simulations, melting of registers i + 2 to i + 4 has been observed, which allows substrate pre-binding in the DS bubble. This could potentially occur in PDB#5C4J. Diffusion across the different channels has been reasonably investigated (see the next subsection for future research to be undertaken), which makes it possible to gauge how NTP diffusion-friendly a given pathway may be. More importantly, it makes it possible to test the CH1 model against the CH2 loading theory. In all the investigations carried out, CH2 appears to be the worst option for substrate accommodation. Not only do Figures 51 and 52 show that the corridor section of CH2 is conformationally very constricted, being even virtually completely closed for an important fraction of the time, but CH2 also tested the least favorably when applying a pull to force an NTP through the corridor. Indeed, 0.4 kcal.mol-1.Å-2 was required, while 0.3, 0.15, 0.075/0.04 and 0.075 kcal.mol-1.Å-2 were sufficient for travel via CH3A, CH3B, CH3C and CH4 respectively. The electrostatic analysis, corroborated by the free glutamate metabolite diffusion observations, indicates that the corridor section of CH2 appears more suitable for exit diffusion and less suitable for substrate entry to the catalytic center. In addition, the sMD pull through CH2 involved artificially maintaining the TL wide open: without this operation, the case would most likely be worse. In contrast, the CH3C and CH4 pathways, leading directly to a pre-binding site in the DS bubble, appear to be very credible routes of substrate diffusion and loading.

Importantly, in an unbiased reaction-coordinate aMD simulation (aMD5) using realistic metabolite

concentrations, physiological temperature and a complete transcription bubble, a partly successful

diffusion via CH3C is observed. The NTP has travelled through about half the pathway. It seems that

coordination of the incoming substrate with an additional Mg2+ ion is beneficial to diffusion and helps

traverse the energetic barrier lying at the entrance. To favor penetration through the channel and to limit sticking

to protein walls, the [Panteva, et al., 2015B] nucleic acid forcefield parameters were switched on for the

GTP (modified 12-6-4 vdw potential for phosphate oxygen and nitrogen N7 atoms). However, in other

simulations, using the latter parameters on the substrates led to an increase in NTP stacking

aggregation. In other words, the forcefield modification parameter set from [Panteva,

et al., 2015B] reduced the problem of NTPs sticking unphysiologically at the entrance of CH3C,

yet the same parameters led to other complications such as an increase in NTP aggregation. This

underlines how complex and subtle the parameterization choices can be. sMD simulations sampled a

successful diffusion with only a small biasing force of 0.075 and even 0.04 kcal·mol⁻¹·Å⁻².

Conformational analysis shows that there is sufficient space remaining in time to accommodate diffusive

substrates. CH3C seems to be periodically gated near the end of the pathway, which has been observed

in some simulations for a short amount of time, but not in aMD5. It is hypothesized that the occasional

gating does impede substrate loading. Electrostatically, Figure 71 seems to indicate that an incoming


NTP would only experience an energetic barrier for about 3 to 4 Å at the entrance. It is interesting to

note that the substrate tends to straighten up upon approaching CH3C entry, and then undergo a rotation

of the polyphosphate tail bound to two Mg2+ ions in the direction of the channel. This mechanism could

involve alignment of the NTP dipole moment with the local electrostatic field, could involve

ionic screening of the electrostatic field by MgB, or could simply place the more positively charged

part ahead. This phenomenon could permit diffusional attack along CH3C and help overcome the small

negative barrier. The observation that glutamate molecules loaded through the pathway at great speed

in aMD2, 3, 4 and 5 adds credibility to CH3C being an input channel. CH3C seems overall favorable

for substrate input: accessibility dimensions are wide, electrostatic configuration is globally neutral or

assisting.

Only one aMD simulation captured a partly successful diffusion across CH3C. The most likely

explanation is that not enough simulation time was sampled overall. If for the sake of the argument we

assume that a physiological diffusion is very rapid and should be observed in a few nanoseconds, several

hypotheses can be put forward as to why a complete successful diffusion via CH3C has not been

observed in the five relatively short aMD simulations. A first assumption is that forcefield parameters

are suboptimal. In particular, the parameters of the NTP base moiety seem questionable. In aMD5

simulation, the NTP base group tends to stick against protein walls and slow down diffusion via CH3C.

Furthermore, it has been observed that even with the adoption of the 12-6-4 potential, NTPs still tend

to stick to protein surface walls and to periodically form aggregates by stacking interactions. It is

possible that correctly modelling diffusion that involves nucleic acids and nucleic-acid-like NTPs

would require the use of a polarizable forcefield. It has indeed been suggested that polarizable forcefields

are required to correctly model a system containing nucleic acids [Baker, et al., 2011; Lindert, et al.,

2013]. There is also the issue of the NTP bound highly charged Mg2+ ion parameters, which may still

not be optimal despite the 12-6-4 vdw potential. Hence, it is possible that aMD simulations did not allow

diffusion to converge adequately. Now, let us assume the possibility that the forcefield parameters were

relatively correct, but that the short timescales accessible in MD simulations (the aMD boost only increases

diffusion by about 3-fold) did not allow sufficient sampling and that substrates did not explore the

optimal pathway fast enough. It might take time for a NTP to be positioned randomly at a favorable

diffusion entry window through CH3C and hence it was only observed in one simulation.

Although CH4 has not been fully investigated at this stage, preliminary analysis seems to indicate that

the pathway is a very credible route of NTP loading to the DS bubble as well. It might even represent

the default mode of substrate loading, since i + 2 register appears to favor orientation towards CH4 (data

not shown).

Several hypotheses can be raised about how downstream pre-bound substrates can be stabilized in the

DS bubble in time, until their loading into the catalytic cavity. One assumption is that stacking


interactions between the adjacent NTP-dNMP pairs or involving DS DNA nucleotides in CH1 might

help their hybridization integrity to resist thermal fluctuations. Another hypothesis is that FL2,

contacting directly ntDNA i + 2 register, may help stabilize DS DNA and indirectly the pre-bound rNTP

at tDNA i + 2 position. In [Kireeva, et al., 2011], the authors propose that in addition to playing a role in

promoting the isomerization of the active site, FL2 might contribute to the resilience of DS DNA to

thermal fluctuations.

Concerning the electrostatic analyses performed, the following limitations may be noted. First, the true

electrostatic configuration of a NTP-MgB substrate consists in the distribution of partial charges in

space, and modelling the molecule as a simple point of charge -2 along a diffusive path is a

simplification. This might erase details about the spatial positioning of the NTP relative to the protein

structure during diffusion, positioning which may help optimize diffusional attack along a given pathway. Second,

vdw interactions have not been taken into account in the calculations and might affect the diffusion

characteristics of the channels. Finally, a NTP might undergo coordination with protein walls, by

temporarily binding to the enzyme surface. The stochastic tilting of the protein region coordinating

the NTP could then help push the substrate through a pathway section. The latter phenomenon could

contribute to crossing small energetic barriers, notably the one lying at the entrance of CH3C.
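
The point-charge simplification discussed above can be made concrete with a short sketch. The code below is not the electrostatic pipeline used in this work: it is a minimal illustration, assuming a plain Coulomb sum with a uniform dielectric constant, of how a probe of charge -2 experiences an energy profile along a path. The function names, the toy charges, the toy path and the dielectric value of 80 are all hypothetical. The factor 332.06 is the usual conversion constant giving energies in kcal/mol for charges in units of e and distances in Å.

```python
import math

COULOMB_KCAL = 332.06  # kcal*A/(mol*e^2), standard Coulomb conversion factor


def point_charge_energy(probe_pos, probe_charge, site_charges, dielectric=80.0):
    """Coulomb energy (kcal/mol) of a point probe in a set of fixed point charges.

    site_charges is a list of ((x, y, z), charge) tuples, positions in A.
    A uniform dielectric is an assumption made for this sketch only.
    """
    energy = 0.0
    for position, charge in site_charges:
        r = math.dist(probe_pos, position)
        energy += COULOMB_KCAL * probe_charge * charge / (dielectric * r)
    return energy


def energy_profile(path, probe_charge, site_charges, dielectric=80.0):
    """Evaluate the probe energy at every point of a diffusion path."""
    return [point_charge_energy(p, probe_charge, site_charges, dielectric)
            for p in path]


# Toy system (hypothetical): one negative and one positive site lining a
# straight 10 A path, probed by a charge of -2 standing in for NTP-MgB.
toy_sites = [((0.0, 5.0, 0.0), -1.0), ((10.0, 5.0, 0.0), +1.0)]
toy_path = [(float(x), 0.0, 0.0) for x in range(0, 11, 2)]
profile = energy_profile(toy_path, -2.0, toy_sites)

for point, e in zip(toy_path, profile):
    print(f"x = {point[0]:4.1f} A  E = {e:+.3f} kcal/mol")
```

In this toy case the probe is repelled near the negative site and attracted near the positive one. A single point cannot capture the orientation of the polyphosphate tail or of the base, which is precisely the limitation noted above.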

As far as the misloading recovery investigation is concerned, simulation results show that when pre-

translocation occurs to rescue an unbound i + 1 register, the latter register quickly repositions at i + 2

position inside the DS bubble where it becomes available for pairing via CH3C. Both the literature (e.g.,

[Dangkulwanich, et al., 2013]) and the observation of a rapid pre-translocation event in the absence of an i

+ 1 NTP support the following argument. The EC necessarily oscillates if the i + 1 position is unbound, that is,

if NTPs are not loaded immediately after the previous nucleotide incorporation. Since the EC does

not seem to oscillate in normal on-pathway elongation, NTPs must be pre-bound

in normal elongation. In other words, it appears that the only way to prevent rapid spontaneous

pre-translocation (which does not seem to occur in fast elongation) is to have the EC

locked immediately from one incorporation event to the next, which implies that the next NTP is already

pre-bound to i + 2 and fixes the EC as soon as the transition between two

incorporations takes place. In addition to all its conceptual drawbacks, the CH2 model does not resolve

the latter issue, whereas the CH1 pre-binding mechanism fits perfectly.

In summary, a general model of substrate delivery, linked to translocation, is proposed in the figures

hereafter.


Figure 81: Schematic representation of EC-RNAP coordination with substrate diffusion trajectory. The

figure depicts a NTP, that is complementary to the i + 2 binding site accessible in the downstream bubble,

reaching the CH3 (via CH3C) or CH4 side of the DS DNA helix pre-binding region. RNAP is represented

as a grey train sliding along a DNA frame. tDNA (upper strand) and ntDNA (bottom strand) are represented

as chains of connecting lozenges. RNA strand is constituted of connecting stars, and is extruded through the

RNA exit channel. NTPs are shown as triangles. The cyan, orange, blue and purple colors represent

indistinctly the four bases or NTP types possible. CH1, CH2, CH3, CH4, Switch 2 domain (SW2) and BH

are indicated. The MgA/MgB binding sites are represented by the metallic border fixing NTP number 4 in

the active site. The enzymatic process is simplified by shortening the real length of the nucleic acids, by

representing the downstream binding region by only one available register: only i + 2 is considered and i +

3/4 are ignored, and by separating radically CH1, CH2 and CH3, from each other for visualization purposes.

Also, in reality CH3 and CH4 reach the DS bubble from different directions and are not juxtaposed to each

other.



Figure 82: Schematic representation of on-pathway state 1. While i + 1 NTP is undergoing catalysis in the

active site, i + 2 substrate diffuses via CH4 and binds to i + 2. Until the NTP in the active site has

undergone the chemical reaction incorporating it into the RNA transcript, the EC is notably immobilized by i

+ 1 NTP binding to MgB and MgA sites.

Figure 83: Schematic representation of on-pathway state 2. i + 1 NTP is incorporated at RNA 3’end and

PPi-MgB (represented by a small silver ball) is expelled through CH2. MgB site interaction is eliminated,

MgA site interaction is loosened. RNAP is free to move forwards, but not backwards due to the steric block

induced by the RNA 3’ end.


Figure 84: Schematic representation of on-pathway state 3. BH bends and applies a force against RNA 3’end

initiating post-translocation.

Figure 85: Schematic representation of on-pathway state 4. RNAP has undergone post-translocation along

the DNA frame, resetting the nucleotide addition cycle one increment forward. i + 2 NTP is now at i + 1

position. A new NTP diffuses through CH4 and binds to i + 2.


Figure 86: Schematic representation of off-pathway state 1. A wrong NTP has been loaded into the active

site (through wrong pre-binding to i + 2 and subsequent loading to the catalytic center).

Figure 87: Schematic representation of off-pathway state 2. The mismatched NTP is expelled through CH2

via TL induced fit mechanism (second layer of nucleotide discrimination).


Figure 88: Schematic representation of off-pathway state 3. BH bends and initiates a push against the free

i + 1 base. tDNA i + 2 nucleotide flips around and initiates a strong push against Switch 2 domain.

Figure 89: Schematic representation of off-pathway state 4. BH bending is further decoupled as a force

pushing against Switch domain 2 via the flipping of i + 2 and i + 1 bases, thereby driving pre-translocation.


Figure 90: Schematic representation of off-pathway state 5. Resulting from the force applied against Switch

2 domain, RNAP pre-translocates. i register position of the RNA-tDNA hybrid enters the catalytic cavity. i

+ 1 tDNA register repositions at i + 2 location in the downstream channel, where it is available for binding

a new (matched) NTP. RNAP EC has been reset one step backwards.

Figure 91: Schematic representation of off-pathway state 6. BH bends and applies a force against the hybrid,

thereby initiating post-translocation.


Figure 92: Schematic representation of off-pathway state 7. RNAP has post-translocated, i + 2 NTP has

loaded into the active site. i + 1 register is now bound to the right NTP. RNAP EC has been rescued.


5. Future Works

A RNAP structure containing a complete EC has been recently published (PDB#5C4J, [Barnes, et al.,

2015]). Repeating all the work presented in this section with this system is proposed as a priority

for future work, because the path of the nucleic acid strands is optimal compared to a reconstructed EC.

Because the CH1 theory is very controversial, it is a good idea to use the least debatable starting

system possible. Comparison with the reconstructed EC shows that the DNA positions are almost

identical, with a fine distinction in the ntDNA trajectory between registers i + 4 and i - 11. Work has been

initiated with 5C4J, where the ntDNA adopts a slightly different conformation, which could improve

diffusion via CH4 and possibly via CH3C. Preliminary simulations appear to confirm the availability of

DS registers, where i + 3 to i + 2 are often in the melted state and i + 4 is in transient association.

Future tasks are also to be pursued with sMD. Accessibility of the pathways can change drastically in

time. Hence, executing sMD runs from different starting pathway conformations could better

characterize how diffusion-friendly a pathway may be. aMD parameters that were used in combination

with sMD (sMD CH3C 0.04) were very aggressive, and repeating sMD-aMD runs with a moderate

acceleration could be more suitable. For example, a total boost acceleration that is too high can distort

the solvent. Overall, more aMD parameters and sMD forces are to be tested in future research.

Furthermore, sMD, electrostatic, cross section area and minimal radius analyses are to be carried out for

CH3D and CH4, which have not been fully investigated.

A number of options can be explored to improve the sampling of substrate loading in MD simulations.

Raising the temperature does not seem adequate, as diffusion is a subtle process, and increasing the

thermal energy could modify for example the conformation of the channels. Modelling the NTP as a

sphere could address the issue of substrates sticking to the protein walls, yet details of the

diffusion process would be lost. A more promising approach could be to repeat the aMD simulations

using only CTPs. A hypothesis is that CTP would diffuse faster because it is smaller than GTP, and would

have an enhanced chance of binding successfully because it forms a G-C base pair, which is

stronger than an A-T pair. At this stage, aMD simulations with PDB#5C4J and 5.9 mM CTPs have

been started. Another line of future work could consist in providing the solvent box with higher Mg2+

concentrations, because the binding of a second magnesium ion to the NTP substrate balances the overall

electrostatic potential of the molecule. In the partly successful aMD run, the NTP diffusing through

CH3C is bound to an additional Mg2+ ion. This could, however, raise new issues. There are, for example,

question marks about the possibility of catalysis of a loaded NTP coordinated to a second magnesium

ion. A strategy to increase the probability of simulating a complete successful diffusion, without

biasing the reaction coordinate (as is done in sMD), could be to increase NTP concentrations. Long-lived

unproductive stacking aggregations were observed in simulations with a concentration of 5.9 mM.

Hence, multiplying the number of rNTPs in the solvent box could at first glance appear detrimental.


Nevertheless, such an inconvenience seems to be greatly reduced in preliminary simulations with an

alternative set of NTP parameters (provided by Prof. R. Amaro from UCSD) applied to CTP molecules,

without using [Panteva, et al., 2015B] modifications, and with aMD3 acceleration parameters (larger

dihedral boost). Hence, future works could consider adding more substrates into the simulation box,

with the use of well-reasoned parameters, and with a high dihedral boost. Next, glutamate and sulfate

metabolites displayed a tendency to bind to MgB in aMD simulations, thereby increasing the negative

potential of the substrate. Adjusting the metabolite content, for example by reducing the glutamate and

sulfate concentrations, might increase the probability of sampling successful diffusions. Finally, Markov

State modelling could be explored, where several short aMD simulations (e.g., 20 ns) are run to map the

reaction-coordinate probability distribution.
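
The Markov State modelling idea mentioned above can be sketched in its simplest form as follows. This is an illustrative toy, not a validated workflow: it assumes that each short trajectory has already been discretized into a small number of states, and the three state labels (0 = bulk solvent, 1 = channel entrance, 2 = pre-binding site), the toy trajectories and the function names are all hypothetical. Transitions are counted at a fixed lag, the count matrix is row-normalized into a transition matrix, and the stationary distribution is obtained by repeated propagation. The reweighting of aMD frames and the validation of the lag time, which a real study would require, are deliberately left out.

```python
def count_transitions(trajectories, n_states, lag=1):
    """Count transitions i -> j observed at the given lag in all trajectories."""
    counts = [[0.0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for a, b in zip(traj[:-lag], traj[lag:]):
            counts[a][b] += 1.0
    return counts


def row_normalize(counts):
    """Turn a count matrix into a row-stochastic transition matrix."""
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("a state has no outgoing transitions")
        matrix.append([c / total for c in row])
    return matrix


def stationary_distribution(transition, n_iter=1000):
    """Estimate the stationary distribution by propagating a uniform vector."""
    n = len(transition)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        p = [sum(p[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return p


# Two hypothetical discretized trajectories
# (0 = bulk solvent, 1 = channel entrance, 2 = pre-binding site).
toy_trajectories = [
    [0, 0, 1, 0, 0, 1, 1, 2, 2, 1, 0],
    [1, 1, 2, 2, 2, 1, 0, 0, 1, 2, 2],
]

T = row_normalize(count_transitions(toy_trajectories, n_states=3))
pi = stationary_distribution(T)
print("transition matrix:", T)
print("stationary distribution:", pi)
```

The appeal of the approach for this system is that many short, independent runs can be pooled, so that no single trajectory needs to capture a complete diffusion event.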

Using a polarizable forcefield such as AMOEBA [Shi, et al., 2013] could be necessary to correctly

model a system containing nucleic acids and highly charged substrates. At this stage, such forcefields

are still in development and lack an important range of parameters for metabolites and nucleic acids.

Also, using a polarizable forcefield increases simulation time by about 10-fold. Developments in the

electronics industry, in particular increasingly powerful GPUs, might allow sufficient sampling

time in the future.

Additional analyses to be performed could include examining the dipole alignment of the NTP with the

local electrostatic field, which could reduce the diffusive degrees of freedom, and monitoring the water

flow across the channel, which could be partly directional and impact input/output of substrates.
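
As a rough indication of how the dipole alignment analysis could be set up, the sketch below computes a dipole vector from partial charges and the cosine of its angle with a field vector. It is only an assumption-laden illustration: the function names and the two-charge toy molecule are hypothetical, and because a NTP carries a net charge its dipole depends on the chosen origin, taken here as the geometric centre of the atoms. A field vector at the NTP position would have to come from a separate electrostatic calculation.

```python
import math


def dipole_moment(atoms):
    """Dipole vector (e*A) of a set of ((x, y, z), charge) atoms.

    The origin is the geometric centre of the atoms; for a molecule with a
    net charge this choice is a convention, assumed here for illustration.
    """
    n = len(atoms)
    centre = [sum(pos[k] for pos, _ in atoms) / n for k in range(3)]
    return [sum(q * (pos[k] - centre[k]) for pos, q in atoms) for k in range(3)]


def alignment_cosine(dipole, field):
    """Cosine of the angle between the dipole and the field (+1 = aligned)."""
    dot = sum(d * f for d, f in zip(dipole, field))
    norm = math.hypot(*dipole) * math.hypot(*field)
    if norm == 0:
        raise ValueError("dipole and field must be non-zero vectors")
    return dot / norm


# Hypothetical two-site molecule: negative end at the origin, positive end 2 A along x.
toy_molecule = [((0.0, 0.0, 0.0), -1.0), ((2.0, 0.0, 0.0), +1.0)]
mu = dipole_moment(toy_molecule)

cos_parallel = alignment_cosine(mu, (1.0, 0.0, 0.0))
cos_perpendicular = alignment_cosine(mu, (0.0, 1.0, 0.0))
cos_antiparallel = alignment_cosine(mu, (-1.0, 0.0, 0.0))
print(mu, cos_parallel, cos_perpendicular, cos_antiparallel)
```

Tracking this cosine over the frames in which a substrate approaches a channel entrance would show whether the orientation is random or biased by the field.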


6. Conclusions

The substrate diffusion and loading mechanism to the active site of RNAP has many fundamental

implications concerning matters such as nucleotide discrimination, translocation and the general

sequential orchestration of the enzyme. The molecular architecture of the enzyme is very complex and

structural characteristics have been overlooked, such as the existence of several additional pathways

connecting the inside of the enzyme to the solvent. The secondary channel has been erroneously

considered as being the only unobstructed path of substrate diffusion. We propose that the widespread

CH2 theory about nucleotide triphosphate diffusion should be rejected, because the evidence

supporting the theory does not withstand scrutiny. The channel does not seem suitable, either

conformationally or electrostatically, to accommodate rapid input of substrates. Moreover, fast diffusion

through the pathway is not supported by aMD and sMD simulations. The pathway raises conceptual

issues, such as bottleneck roadblocking, where successive substrates must halt in front of a narrow

section, the “corridor”, until wrong alternative substrates bound at the E site or the A site diffuse away,

before eventually loading into the active site to check whether they match the DNA base to be

transcribed. The alternative main channel model, on the other hand, initially proposed on the basis of

kinetics experiments whose evidence was sometimes overlooked, is fully supported by the research

presented in this thesis. An aMD simulation, using realistic conditions, such as a full nucleic acid EC

and physiological concentration of metabolites, captured the initial diffusion process of a nucleotide

travelling through an alternative channel, termed the tertiary channel and leading to a pre-binding region

in the main channel. In particular, a specific potential loading path through the tertiary channel, CH3C,

is supported by conformational, electrostatic and sMD analysis. An alternative pathway, CH4, has been

identified, and also seems to be a credible route of substrate diffusion to CH1. The following general

mechanism of NTP loading is proposed. Nucleotide substrates diffuse via CH3C or CH4. The last quarter

of the CH3C path is sometimes stochastically gated by scRPB2: 204-206, in which case incoming NTPs

would temporarily halt in CH3 bubble adjacent to DNA or diffuse away until a favorable time window

occurs. They then reach a pre-binding region where i + 2 and i + 3 tDNA registers are predominantly

melted and i + 4 is sometimes available. They bind to the latter registers if they are complementary or

diffuse away and exit the protein. Stacking interactions between multiple pre-bound substrates, between

NTP-dNMPs and DS DNA or interaction of FL2 with ntDNA i + 2 position, might facilitate their

stabilization in the DS bubble. Finally, the pre-bound substrates are loaded sequentially into the active

site, when post-translocation advances the enzymatic complex one tDNA base forward to incorporate

the next nucleotide. Although CH2 does not seem to serve the function of substrate input, we propose

that it is an excellent output pathway, where misloaded substrates and the by-product of the elongation

reaction are expelled. Additional functions of the secondary channel are a TF binding site (TFIIS for

eukaryotic RNAP II and GreA/B for bacterial RNAP), a possible transient binding site for RNA during

pause-arrest, and a site for RNA backtracking. Subsidiary conclusions are the following. NTP loading is


not rate limiting at non-subsaturating concentrations because CH3C/CH4 allows fast substrate input,

and most importantly because, while the i + 1 NTP undergoes incorporation, DS registers have a

substantial time window to bind substrates, without impacting the on-pathway kinetics. The latter

considerations would be consistent with the very high elongation speeds measured in studies. The first layer of

nucleotide discrimination is performed directly in the downstream bubble, prior to NTP loading, and the

catalytic site only concerns the second layer of selection, notably involving the TL induced fit

mechanism. We complete the model of substrate loading by suggesting that misloading recovery is

performed in three steps: the mismatched substrate at the i + 1 register is expelled through CH2, the enzyme then

pre-translocates via a BH-induced nucleotide flip against the Switch 2 domain, and the register is reset for

base-pairing in the downstream bubble, where it becomes available again to a CH3C/CH4 diffusing

rNTP. Finally, we note that the main channel model has several fundamental implications concerning

the manner in which translocation, the central mechanism underlying elongation, proceeds. The standard

Brownian ratchet model is most likely partly incorrect, in that the EC does not necessarily oscillate.

Immediate loading of pre-bound nucleotide via the main channel during translocation is perfectly in line

with a model of forward translocation locking during normal elongation, which is consistent with recent

studies indicating that translocation would not oscillate when substrates are supplemented at sufficient

concentrations. RNAP can be seen as a factory chain where substrates are lined up inside the enzyme

before undergoing catalysis. The enzymatic machine orchestrating genetic transcription truly is a

masterpiece of engineering.


References

Abbondanzieri, E., et al., Direct observation of base-pair stepping by RNA polymerase, Nature, Vol.

438, 460-465 (2005)

Allner, O., et al., Magnesium Ion−Water Coordination and Exchange in Biomolecular Simulations,

J. Chem. Theory Comput., Vol. 8, 1493−1502 (2012)

Andrecka, J., et al., Nano positioning system reveals the course of upstream and nontemplate DNA

within the RNA polymerase II elongation complex, Nucleic Acids Research, Vol. 37, 1–7 (2009)

Aqvist, J., A Simple Way to Calculate the Axis of an α-Helix, Computers & Chemistry, Vol. 10, 97-99

(1986)

Aqvist, J., Ion-Water Potentials Derived from Free Energy Perturbation Simulations, J. Phys. Chem.,

Vol. 94, 8021-8024 (1990)

Arino, J., et al., Alkali Metal Cation Transport and Homeostasis in Yeasts, Microbiol. Mol. Biol. Rev.,

Vol. 74, 95–120 (2010)

Armache, K.-J., et al., Architecture of initiation-competent 12-subunit RNA polymerase II, PNAS, Vol.

100, 6964–6968 (2003)

Auesukaree, C., et al., Intracellular Phosphate Serves as a Signal for the Regulation of the PHO Pathway

in Saccharomyces cerevisiae, J. Biol. Chem., Vol. 279, 17289–17294 (2004)

Bai, L., et al., Sequence-dependent Kinetic Model for Transcription Elongation by RNA Polymerase, J.

Mol. Biol., Vol. 344, 335-349 (2004)

Bai, L., et al., Mechanochemical Kinetics of Transcription Elongation, Physical Review Letters, Vol.

98, 068103-1-068103-4 (2007)

Baker, C., M., et al., Development of CHARMM polarizable force field for nucleic acid bases based on

the classical Drude oscillator model, J. Phys. Chem. B, Vol. 115, 580-596 (2011)

Bansal, M., et al., HELANAL: A Program to Characterize Helix Geometry in Proteins, Journal of

Biomolecular Structure & Dynamics, Vol. 17, 811-819 (2012)

Bar-Nahum, G., et al., A Ratchet Mechanism of Transcription Elongation and Its Control, Cell, Vol.

120, 183-193 (2005)

Barnes, C., O., et al., Crystal Structure of a Transcribing RNA Polymerase II Complex Reveals a

Complete Transcription Bubble, Molecular Cell, Vol. 59, 258–269 (2015)

Batada, N., et al., Diffusion of nucleoside triphosphates and role of the entry site to the RNA polymerase

II active center, PNAS, Vol. 101, 17361-17364 (2004)

Beauchamp, K., A., et al., Are Protein Force Fields Getting Better? A Systematic Benchmark on 524

Diverse NMR Measurements, J. Chem. Theory Comput., Vol. 8, 1409-1414 (2012)

Belogurov, G., A., et al., Transcription inactivation through local refolding of the RNA polymerase

structure, Nature, Vol. 457, 332-336 (2009)


Best, R., B., et al., Are Current Molecular Dynamics Force Fields too Helical?, Biophysical Journal:

Biophysical Letters, Vol. 95, L07-L09 (2008)

Bochkareva, A., et al., Factor-independent transcription pausing caused by recognition of the RNA–

DNA hybrid sequence, The EMBO Journal, Vol. 31, 630–639 (2012)

Boer, V., M., Growth-limiting Intracellular Metabolites in Yeast Growing under Diverse Nutrient

Limitations, Molecular Biology of the Cell, Vol. 21, 198–211 (2010)

Brueckner, F., Cramer, P., Structural basis of transcription inhibition by α-amanitin and implications for

RNA polymerase II translocation, Nature Structural & Molecular Biology, Vol. 15, 811-816 (2008)

Brueckner, F., et al., A movie of the RNA polymerase nucleotide addition cycle, Current Opinion in

Structural Biology, Vol. 19, 294-299 (2009)

Bucher, D., et al., Accessing a Hidden Conformation of the Maltose Binding Protein Using Accelerated

Molecular Dynamics, PLoS Computational Biology, Vol. 7, e1002034 (2011A)

Bucher, D., et al., On the Use of Accelerated Molecular Dynamics to Enhance Configurational Sampling

in Ab Initio Simulations, J. Chem. Theory Comput., Vol. 7, 890–897 (2011B)

Burton, Z., F., et al., NTP-driven translocation and regulation of downstream template opening by multi-

subunit RNA polymerases, Biochem. Cell Biol., Vol. 83, 486–496 (2005)

Bushnell, D., A., et al., Structural basis of transcription: α-Amanitin–RNA polymerase II cocrystal at

2.8 Å resolution, PNAS, Vol. 99, 1218–1222 (2002)

Bushnell, D., A., Kornberg, R., D., Complete, 12-subunit RNA polymerase II at 4.1-Å resolution:

Implications for the initiation of transcription, PNAS, Vol. 100, 6969–6973 (2003)

Bushnell, D., A., et al., Structural Basis of Transcription: An RNA Polymerase II-TFIIB Cocrystal at

4.5 Angstroms, Science, Vol. 303, 983-988 (2004)

Camacho, M., et al., Potassium requirements of Saccharomyces cerevisiae, Current Microbiology, Vol.

6, 295-299 (1981)

Canelas, A., B., et al., Leakage-free rapid quenching technique for yeast metabolomics, Metabolomics,

Vol. 4, 226–239 (2008A)

Canelas, A., B., et al., Determination of the cytosolic free NAD/ NADH ratio in Saccharomyces

cerevisiae under steady-state and highly dynamic conditions, Biotechnol Bioeng, Vol. 100, 734–743

(2008B)

Cannon, W., R., et al., Sulfate Anion in Water: Model Structural, Thermodynamic, and Dynamic

Properties, J. Phys. Chem., Vol. 98, 6225-6230 (1994)

Case, D., A., et al., AMBER 2016, University of California, San Francisco (2016)

Cheung, A., C., M., Cramer, P., Structural basis of RNA polymerase II backtracking, arrest and

reactivation, Nature, Vol. 471, 249-253 (2011)

Cheung, A., C., M., Cramer, P., A Movie of RNA Polymerase II Transcription, Cell, Vol. 149, 1431-

1437 (2012)


Christopher, J., A., Swanson, R., et al., Algorithms for Finding the Axis of a Helix: Fast Rotational and

Parametric Least-Squares Methods, Computers Chem., Vol. 20, 339-345 (1996)

Chovancova, E., et al., CAVER 3.0: A Tool for the Analysis of Transport Pathways in Dynamic Protein

Structures, PLoS Computational Biology, Vol. 8, e1002708 (2012)

Cino, E., A., et al., Comparison of Secondary Structure Formation Using 10 Different Force Fields in

Microsecond Molecular Dynamics Simulations, J. Chem. Theory Comput., Vol. 8, 2725-2740 (2012)

Conaway, R., C., et al., TFIIS and GreB: Two Like-Minded Transcription Elongation Factors with

Sticky Fingers, Cell, Vol. 114, 272-274 (2003)

Cramer, P., et al., Architecture of RNA Polymerase II and Implications for the Transcription

Mechanism, Science, Vol. 288, 640-649 (2000)

Cramer, P., et al., Structural Basis of Transcription: RNA Polymerase II at 2.8 Ångstrom Resolution,

Science, Vol. 292, 1863-1876 (2001)

Da, L.-T., et al., Dynamics of Pyrophosphate Ion Release and Its Coupled Trigger Loop Motion from

Closed to Open State in RNA Polymerase II, J. Am. Chem. Soc., Vol. 134, 2399−2406 (2011)

Da, L.-T., et al., A Two-State Model for the Dynamics of the Pyrophosphate Ion Release in Bacterial

RNA Polymerase, PLOS Computational Biology, Vol. 9, 1-9 (2013)

Dalton, J., A., R., et al., Calculation of helix packing angles in protein structures, Bioinformatics, Vol.

19, 1298-1299 (2003)

Damsma, G., E., et al., Mechanism of transcriptional stalling at cisplatin-damaged DNA, Nature

Structural & Molecular Biology, Vol. 14, 1127-1133 (2007)

Dangkulwanich, M., et al., Complete dissection of transcription elongation reveals slow translocation

of RNA polymerase II in a linear ratchet mechanism, eLife, Vol. 2, 1-22 (2013)

Davenport, R., J., et al., Single-Molecule Study of Transcriptional Pausing and Arrest by E. coli RNA

Polymerase, Science, Vol. 287, 2497-2500 (2000)

de Oliveira, C., A., F., et al., On the Application of Accelerated Molecular Dynamics to Liquid Water

Simulations, J. Phys. Chem. B, Vol. 110, 22695-22701 (2006)

de Oliveira, C., A., F., et al., Large-Scale Conformational Changes of Trypanosoma cruzi Proline

Racemase Predicted by Accelerated Molecular Dynamics Simulation, PLoS Computational Biology,

Vol. 7, e1002178 (2011)

Domecq, C., et al., Site-directed mutagenesis, purification and assay of Saccharomyces cerevisiae RNA

polymerase II, Protein Expression and Purification, Vol. 69, 83-90 (2010)

Doshi, U., Hamelberg, D., Achieving Rigorous Accelerated Conformational Sampling in Explicit

Solvent, J. Phys. Chem. Lett., Vol. 5, 1217-1224 (2014)

Duan, B., et al., A Critical Residue Selectively Recruits Nucleotides for T7 RNA Polymerase

Transcription Fidelity Control, Biophysical Journal, Vol. 107, 2130-2140 (2014)

Eastman, P., et al., OpenMM 4: A Reusable, Extensible, Hardware Independent Library for High

Performance Molecular Simulation, J. Chem. Theory Comput., Vol. 9, 461-469 (2013)


Eastman, P., Pande, V., S., Constant Constraint Matrix Approximation: A Robust, Parallelizable

Constraint Method for Molecular Simulations, J. Chem. Theory Comput., Vol. 6, 434-437 (2010A)

Eastman, P., Pande, V., S., Efficient Nonbonded Interactions for Molecular Dynamics on a Graphics

Processing Unit, J. Comput. Chem., Vol. 31, 1268–1272 (2010B)

Enkhbayar, P., Damdinsuren, S., et al., HELFIT: Helix fitting by a total least squares method,

Computational Biology and Chemistry, Vol. 32, 307-310 (2008)

Erie, D., A., Kennedy, S., R., Forks, pincers, and triggers: the tools for nucleotide incorporation and

translocation in multi-subunit RNA polymerases, Current Opinion in Structural Biology, Vol. 19, 708-

714 (2009)

Eun, C., et al., Molecular Dynamics Simulation Study of Conformational Changes of Transcription

Factor TFIIS during RNA Polymerase II Transcriptional Arrest and Reactivation, PLOS ONE, Vol. 9,

1-8 (2014)

Feig, M., Burton, Z., F., RNA Polymerase II with Open and Closed Trigger Loops: Active Site

Dynamics and Nucleic Acid Translocation, Biophysical Journal, Vol. 99, 2577-2586 (2010)

Foster, J., E., et al., Allosteric Binding of Nucleoside Triphosphates to RNA Polymerase Regulates

Transcription Elongation, Cell, Vol. 106, 243–252 (2001)

Fouqueau, T., et al., The RNA polymerase trigger loop functions in all three phases of the transcription

cycle, Nucleic Acids Research, Vol. 41, 7048-7059 (2013)

Frenkel, D., Smit, B., Understanding Molecular Simulation, From Algorithms to Applications,

Academic Press, San Diego, USA (2002)

Friedrichs, M., S., et al., Accelerating Molecular Dynamic Simulation on Graphics Processing Units, J.

Comput. Chem., Vol. 30, 864-872 (2009)

Fu, J., et al., Yeast RNA Polymerase II at 5 Å Resolution, Cell, Vol. 98, 799–810 (1999)

Gnatt, A., L., et al., Structural Basis of Transcription: An RNA Polymerase II Elongation Complex at

3.3 Å Resolution, Science, Vol. 292, 1876-1882 (2001)

Gong, X., et al., Dynamic Error Correction and Regulation of Downstream Bubble Opening by Human

RNA Polymerase II, Molecular Cell, Vol. 18, 461–470 (2005)

Gonzalez, B., et al., Dynamic in vivo 31P nuclear magnetic resonance study of Saccharomyces cerevisiae

in glucose-limited chemostat culture during the aerobic-anaerobic shift, Yeast, Vol. 16, 483-497 (2000)

Grant, B., J., et al., Ras Conformational Switching: Simulating Nucleotide-Dependent Conformational

Transitions with Accelerated Molecular Dynamics, PLoS Computational Biology, Vol. 5, e1000325

(2009)

Graschopf, A., et al., The Yeast Plasma Membrane Protein Alr1 Controls Mg2+ Homeostasis and Is

Subject to Mg2+-dependent Control of Its Synthesis and Degradation, The Journal of Biological

Chemistry, Vol. 276, 16216-16222 (2001)

Greive, S., J., von Hippel, P. H., Thinking Quantitatively About Transcriptional Regulation, Nature

Reviews Molecular Cell Biology, Vol. 6, 221-232 (2005)

Guajardo, R., Sousa, R., A Model for the Mechanism of Polymerase Translocation, J. Mol. Biol., Vol.

265, 8-19 (1997)

Guo, Q., Sousa, R., Translocation by T7 RNA Polymerase: A Sensitively Poised Brownian Ratchet, J.

Mol. Biol., Vol. 358, 241-254 (2006)

Hamelberg, D., et al., Accelerated molecular dynamics: A promising and efficient simulation method

for biomolecules, The Journal of Chemical Physics, Vol. 120, 11919-11929 (2004)

Hamelberg, D., et al., Sampling of slow diffusive conformational transitions with accelerated molecular

dynamics, The Journal of Chemical Physics, Vol. 127, 155102-155110 (2007)

Hans, M., A., et al., Quantification of intracellular amino acids in batch cultures of Saccharomyces

cerevisiae, Appl Microbiol Biotechnol, Vol. 56, 776–779 (2001)

Hans, M., A., et al., Free Intracellular Amino Acid Pools During Autonomous Oscillations in

Saccharomyces cerevisiae, Biotechnology and Bioengineering, Vol. 82, 143-151 (2003)

Hein, P., P., et al., RNA Transcript 3′-Proximal Sequence Affects Translocation Bias of RNA

Polymerase, Biochemistry, Vol. 50, 7002-7014 (2011)

Herbert, K., M., et al., Sequence-Resolved Detection of Pausing by Single RNA Polymerase Molecules,

Cell, Vol. 125, 1083–1094 (2006)

Herrera, R., et al., Subcellular potassium and sodium distribution in Saccharomyces cerevisiae wild-

type and vacuolar mutants, Biochem. J., Vol. 454, 525-532 (2013)

Holmes, S., F., Erie, D. A., Downstream DNA Sequence Effects on Transcription Elongation: Allosteric

Binding Of Nucleoside Triphosphates Facilitates Translocation Via A Ratchet Motion, J. Biol. Chem.,

Vol. 278, 35597-35608 (2003)

Holmes, S., F., et al., Kinetic Investigation of Escherichia coli RNA Polymerase Mutants That Influence

Nucleotide Discrimination and Transcription Fidelity, J. Biol. Chem., Vol. 281, 18677-18683 (2006)

Homeyer, N., et al., AMBER force-field parameters for phosphorylated amino acids in different

protonation states: phosphoserine, phosphothreonine, phosphotyrosine, and phosphohistidine, J Mol

Model, Vol. 12, 281-289 (2006)

Horn, H., W., et al., Development of an improved four-site water model for biomolecular simulations:

TIP4P-Ew, J. Chem. Phys., Vol. 120, 9665-9678 (2004)

Horn, H., W., et al., Characterization of the TIP4P-Ew water model: Vapor pressure and boiling point,

J. Chem. Phys., Vol. 123, 194504 (2005)

Horn, A., H., C., A consistent force field parameter set for zwitterionic amino acid residues, J Mol

Model, Vol. 20, 2478-2491 (2014)

Humphrey, W., et al., VMD-Visual Molecular Dynamics, J. Molec. Graphics, Vol. 14, 33-38 (1996)

Imashimizu, M., et al., Intrinsic Translocation Barrier as an Initial Step in Pausing by RNA Polymerase

II, J. Mol. Biol., Vol. 425, 697-712 (2013)

Jennings, M., L., Cui J., Chloride homeostasis in Saccharomyces cerevisiae: high affinity influx, V-

ATPase-dependent sequestration, and identification of a candidate Cl− sensor, J. Gen. Physiol., Vol.

131, 379-391 (2008)

Jiang, Y., et al., Refined Dummy Atom of Mg2+ by Simple Parameter Screening Strategy with Revised

Experimental Solvation Free Energy, J. Chem. Inf. Model., Vol. 55, 2575-2586 (2015)

Kahm, M., et al., Potassium Starvation in Yeast: Mechanisms of Homeostasis Revealed by Mathematical

Modeling, PLoS Computational Biology, Vol. 8, e1002548 (2012)

Kahn, P., C., Defining the Axis of a Helix, Computers Chem., Vol. 13, 185-189 (1988)

Kaplan, C., D., et al., The RNA Polymerase II Trigger Loop Functions in Substrate Selection and Is

Directly Targeted by α-Amanitin, Molecular Cell, Vol. 30, 547-556 (2008)

Kaplan, C., D., et al., Dissection of Pol II Trigger Loop Function and Pol II Activity–Dependent Control

of Start Site Selection In Vivo, PLoS Genetics, Vol. 8, 1-17 (2012)

Kappel, K., et al., Accelerated molecular dynamics simulations of ligand binding to a muscarinic G-

protein-coupled receptor, Quarterly Reviews of Biophysics, Vol. 48, 479-487 (2015)

Kashkina, E., et al., Multisubunit RNA Polymerases Melt Only a Single DNA Base Pair Downstream

of the Active Site, J. Biol. Chem., Vol. 282, 21578-21582 (2007)

Kennedy, S., Erie, D., Templated nucleoside triphosphate binding to a noncatalytic site on RNA

polymerase regulates transcription, PNAS, Vol. 108, 6079-6084 (2011)

Kettenberger, H., et al., Architecture of the RNA Polymerase II-TFIIS Complex and Implications for

mRNA Cleavage, Cell, Vol. 114, 347–357 (2003)

Kettenberger, H., et al., Complete RNA Polymerase II Elongation Complex Structure and Its

Interactions with NTP and TFIIS, Molecular Cell, Vol. 16, 955–965 (2004)

Kettenberger, H., et al., Structure of an RNA polymerase II-RNA inhibitor complex elucidates

transcription regulation by noncoding RNAs, Nature Structural & Molecular Biology, Vol. 13, 44-48

(2006)

Kireeva, M., L., et al., Nature of the Nucleosomal Barrier to RNA Polymerase II, Molecular Cell, Vol.

18, 97-108 (2005)

Kireeva, M., L., et al., Transient Reversal of RNA Polymerase II Active Site Closing Controls Fidelity

of Transcription Elongation, Molecular Cell, Vol. 30, 557-566 (2008)

Kireeva, M., L., et al., Millisecond phase kinetic analysis of elongation catalyzed by human, yeast and

Escherichia coli RNA polymerase, Methods, Vol. 48, 333-345 (2009)

Kireeva, M., L., et al., Translocation by multi-subunit RNA polymerases, Biochimica et Biophysica

Acta, Vol. 1799, 389-401 (2010)

Kireeva, M., L., et al., Interaction of RNA Polymerase II Fork Loop 2 with Downstream Non-template

DNA Regulates Transcription Elongation, J. Biol. Chem., Vol. 286, 30898-30910 (2011)

Kireeva, M., L., et al., Molecular dynamics and mutational analysis of the catalytic and translocation

cycle of RNA polymerase, BMC Biophysics, Vol. 5, 11.1-11.18 (2012)

Kirkegaard, K., et al., Mapping of single-stranded regions in duplex DNA at the sequence level: Single-strand-specific

cytosine methylation in RNA polymerase-promoter complexes, Proc. Natl. Acad. Sci., Vol. 80, 2544-2548 (1983)

Kolacna, L., et al., New phenotypes of functional expression of the mKir2.1 channel in potassium efflux-

deficient Saccharomyces cerevisiae strains, Yeast, Vol. 22, 1315-1323 (2005)

Komissarova, N., Kashlev, M., RNA Polymerase Switches between Inactivated and Activated States By

Translocating Back and Forth along the DNA and the RNA, J. Biol. Chem., Vol. 272, 15329-15338

(1997A)

Komissarova, N., Kashlev, M., Transcriptional arrest: Escherichia coli RNA polymerase translocates

backward, leaving the 3’ end of the RNA intact and extruded, Proc. Natl. Acad. Sci., Vol. 94, 1755-

1760 (1997B)

Komuro, Y., et al., CHARMM Force-Fields with Modified Polyphosphate Parameters Allow Stable

Simulation of the ATP-Bound Structure of Ca2+-ATPase, J. Chem. Theory Comput., Vol. 10,

4133−4142 (2014)

Korzheva, N., et al., A Structural Model of Transcription Elongation, Science, Vol. 289, 619-625 (2000)

Kozlikova, et al., CAVER Analyst 1.0: graphic tool for interactive visualization and analysis of tunnels

and channels in protein structures, Bioinformatics, Vol. 30, 2684-2685 (2014)

Krepl, M., et al., Reference simulations of noncanonical nucleic acids with different chi variants of the

AMBER force field: Quadruplex DNA, quadruplex RNA, and Z-DNA, J. Chem. Theory Comp., Vol.

8, 2506–2520 (2012)

Krieger, E., et al., Increasing the precision of comparative models with YASARA NOVA-a

self-parameterizing force field, Proteins, Vol. 47, 393-402 (2002)

Kumar, P., Bansal, M., HELANAL-Plus: a web server for analysis of helix geometry in protein

structures, Journal of Biomolecular Structure and Dynamics, Vol. 30, 773-783 (2012)

Landick, R., NTP-entry routes in multi-subunit RNA polymerases, Trends in Biochemical Sciences,

Vol. 30, 651-654 (2005)

Lange, O., F., et al., Scrutinizing Molecular Mechanics Force Fields on the Submicrosecond Timescale

with NMR Data, Biophysical Journal, Vol. 99, 647-655 (2010)

Langelier, M.-F., et al., The highly conserved glutamic acid 791 of Rpb2 is involved in the binding of

NTP and Mg(B) in the active center of human RNA polymerase II, Nucleic Acids Research, Vol. 33,

2629–2639 (2005)

Larson, M., H., et al., Trigger loop dynamics mediate the balance between the transcriptional fidelity

and speed of RNA polymerase II, PNAS, Vol. 109, 6555-6560 (2012)

Le Grand, S., et al., SPFP: Speed without compromise—A mixed precision model for GPU accelerated

molecular dynamics simulations, Computer Physics Communications, Vol. 184, 374-380 (2013)

Lee, H., S., et al., QHELIX: A Computational Tool for the Improved Measurement of Inter-Helical

Angles in Proteins, Protein J, Vol. 26, 556-561 (2007)

Li, P., et al., Systematic Parameterization of Monovalent Ions Employing the Nonbonded Model, J.

Chem. Theory Comput., Vol. 11, 1645-1657 (2015)

Li, P., Merz Jr., K., M., Taking into Account the Ion-induced Dipole Interaction in the Nonbonded

Model of Ions, J Chem Theory Comput, Vol. 10, 289-297 (2014)

Lindert, S., et al., Dynamics and Calcium Association to the N-Terminal Regulatory Domain of Human

Cardiac Troponin C: A Multiscale Computational Study, J. Phys. Chem. B, Vol. 116, 8449-8459 (2012)

Lindert, S., et al., Accelerated Molecular Dynamics Simulations with the AMOEBA Polarizable Force

Field on Graphics Processing Units, J. Chem. Theory Comput., Vol. 9, 4684−4691 (2013)

Lindorff-Larsen, K., et al., Systematic Validation of Protein Force Fields against Experimental Data,

PLoS ONE, Vol. 7, e32131: 6 (2012)

Lu, X.-J., Olson, W., K., 3DNA: a software package for the analysis, rebuilding and visualization of

three-dimensional nucleic acid structures, Nucleic Acids Research, Vol. 31, 5108-5121 (2003)

Maathuis, F., J., M., Amtmann, A., K+ Nutrition and Na+ Toxicity: The Basis of Cellular K+/Na+ Ratios,

Annals of Botany, Vol. 84, 123-133 (1999)

Magdenoska, O., et al., Quantifying intracellular metabolites in yeast using a matrix with minimal

interference from naturally occurring analytes, Anal. Biochem., Vol. 487, 17-26 (2015)

Maier, J., et al., ff14SB: Improving the Accuracy of Protein Side Chain and Backbone Parameters from

ff99SB, J. Chem. Theory Comput., Vol. 11, 3696–3713 (2015)

Malagon, F., et al., Mutations in the Saccharomyces cerevisiae RPB1 Gene Conferring Hypersensitivity

to 6-Azauracil, Genetics, Vol. 172, 2201-2209 (2006)

Malinen, A., M., et al., Active site opening and closure control translocation of multisubunit RNA

polymerase, Nucleic Acids Research, Vol. 40, 7442-7451 (2012)

Maoileidigh, D., O., et al., A Unified Model of Transcription Elongation: What Have We Learned from

Single-Molecule Experiments?, Biophysical Journal, Vol. 100, 1-10 (2011)

Markwick, P., R., L., et al., Exploring Multiple Timescale Motions in Protein GB3 Using Accelerated

Molecular Dynamics and NMR Spectroscopy, J. Am. Chem. Soc., Vol. 129, 4724-4730 (2007)

Markwick, P., R., L., et al., Toward a Unified Representation of Protein Structural Dynamics in Solution,

J. Am. Chem. Soc., Vol. 131, 16968-16975 (2009)

Markwick, P., R., L., McCammon, J., A., Studying functional dynamics in bio-molecules using

accelerated molecular dynamics, Phys. Chem. Chem. Phys., Vol. 13, 20053-20065 (2011)

Martinez, P., Persson, B., L., Identification, cloning and characterization of a derepressible Na+-coupled

phosphate transporter in Saccharomyces cerevisiae, Mol. Gen. Genet., Vol. 258, 628-638 (1998)

Martinez-Rucobo, F., Cramer, P., Structural basis of transcription elongation, Biochimica et Biophysica

Acta, Vol. 1829, 9-19 (2013)

McLachlan, A., D., Gene Duplication in the Structural Evolution of Chymotrypsin, J. Mol. Biol., Vol.

128, 49-79 (1979)

Meagher, K., L., et al., Development of polyphosphate parameters for use with the AMBER force field,

Journal of Comp. Chemistry, Vol. 24, 1016-1025 (2003)

Meller, J., Molecular Dynamics, Encyclopedia of Life Sciences, Nature Publishing Group (2001)

Meyer, P., A., et al., Phasing RNA Polymerase II Using Intrinsically Bound Zn Atoms: An Updated

Structural Model, Structure, Vol. 14, 973-982 (2006)

Miao, Y., et al., General trends of dihedral conformational transitions in a globular protein, Proteins,

Vol. 84, 501-514 (2016)

Miropolskaya, N., et al., Interplay between the trigger loop and the F loop during RNA polymerase

catalysis, Nucleic Acids Research, Vol. 42, 544-552 (2014)

Montiel, V., Ramos, J., Intracellular Na+ and K+ distribution in Debaryomyces hansenii. Cloning and

expression in Saccharomyces cerevisiae of DhNHX1, FEMS Yeast Res, Vol. 7, 102-109 (2007)

Mukhopadhyay, J., et al., Antibacterial Peptide Microcin J25 Inhibits Transcription by Binding within

and Obstructing the RNA Polymerase Secondary Channel, Molecular Cell, Vol. 14, 739-751 (2004)

Naryshkina, T., et al., The Role of the Largest RNA Polymerase Subunit Lid Element in Preventing the

Formation of Extended RNA-DNA Hybrid, J. Mol. Biol., Vol. 361, 634-643 (2006)

Nedialkov, Y., A., et al., NTP-driven Translocation by Human RNA Polymerase II, J. Biol. Chem., Vol.

278, 18303-18312 (2003)

Nedialkov, Y., A., et al., RNA polymerase stalls in a post-translocated register and can hyper-

translocate, Transcription, Vol. 3, 260-269 (2012)

Nick McElhinny, S., A., et al., Abundant ribonucleotide incorporation into DNA by yeast replicative

polymerases, PNAS, Vol. 107, 4949-4954 (2010)

Nierman, W., C., Chamberlin, M. J., The Effect of Low Substrate Concentrations on the Extent of

Productive RNA Chain Initiation from T7 Promoters A1 and A2 by Escherichia coli RNA Polymerase,

The Journal of Biological Chemistry, Vol. 255, 4495-4500 (1980)

Nudler, E., et al., The RNA–DNA Hybrid Maintains the Register of Transcription by Preventing

Backtracking of RNA Polymerase, Cell, Vol. 89, 33-41 (1997)

Nudler, E., RNA Polymerase Active Center: The Molecular Engine of Transcription, Annu. Rev.

Biochem., Vol. 78, 335-361 (2009)

Nudler, E., RNA Polymerase Backtracking in Gene Regulation and Genome Instability, Cell, Vol. 149,

1438-1443 (2012)

Olz, R., et al., Energy Flux and Osmoregulation of Saccharomyces cerevisiae Grown in Chemostats

under NaCl Stress, Journal of Bacteriology, Vol. 175, 2205-2213 (1993)

Oster, G., Darwin’s motors, Nature, Vol. 417, p.25 (2002)

Palangat, M., Landick, R., Roles of RNA:DNA Hybrid Stability, RNA Structure, and Active Site

Conformation in Pausing by Human RNA Polymerase II, J. Mol. Biol., Vol. 311, 265-282 (2001)

Pande, V., S., Eastman, P., OpenMM: A Hardware-Independent Framework for Molecular Simulations,

Computing in Science & Engineering, Vol. 12, 34-39 (2010)

Panteva, M., T., et al., Comparison of Structural, Thermodynamic, Kinetic and Mass Transport

Properties of Mg2+ Ion Models Commonly used in Biomolecular Simulations, Journal of Computational

Chemistry, Vol. 36, 970-982 (2015A)

Panteva, M., T., et al., Force Field for Mg2+, Mn2+, Zn2+, and Cd2+ Ions That Have Balanced Interactions

with Nucleic Acids, J. Phys. Chem. B, Vol. 119, 15460-15470 (2015B)

Pavelka, A., et al., CAVER: Algorithms for Analyzing Dynamics of Tunnels in Macromolecules,

Transactions on Computational Biology and Bioinformatics, Vol. 13, 505-517 (2016)

Pellegrini-Calace, P., et al., PoreWalker: A Novel Tool for the Identification and Characterization of

Channels in Transmembrane Proteins from Their Three-Dimensional Structure, PLoS Computational Biology, Vol. 5, e1000440

(2009)

Perez, A., et al., Refinement of the AMBER Force Field for Nucleic Acids: Improving the Description

of alpha/gamma Conformers, Biophys. J., Vol. 92, 3817-3829 (2007)

Perez-Villa, A., et al., ATP dependent NS3 helicase interaction with RNA: insights from molecular

simulations, Nucleic Acids Research, Vol. 43, 1-10 (2015)

Piana, S., et al., Assessing the accuracy of physical models used in protein-folding simulations:

quantitative evidence from long molecular dynamics simulations, Current Opinion in Structural

Biology, Vol. 24, 98-105 (2014)

Pierce, L., C., T., et al., Routine Access to Millisecond Time Scale Events with Accelerated Molecular

Dynamics, J. Chem. Theory Comput., Vol. 8, 2997-3002 (2012)

Ramos, J., et al., Yeast Membrane Transport, Advances in Experimental Biology and Medicine, ISBN

978-3-319-25304-6, p. 206 (2016)

Rodriguez-Navarro, A., Potassium transport in fungi and plants, Biochimica et Biophysica Acta, Vol.

1469, 1-30 (2000)

Romani, A., Scarpa A., Regulation of Cell Magnesium, Archives of Biochemistry and Biophysics, Vol.

298, 1-12 (1992)

Saeki, H., Svejstrup, J., Q., Stability, Flexibility, and Dynamic Interactions of Colliding RNA

Polymerase II Elongation Complexes, Molecular Cell, Vol. 35, 191-205 (2009)

Santangelo T., J., Roberts, J., W., Forward Translocation Is the Natural Pathway of RNA Release at an

Intrinsic Terminator, Molecular Cell, Vol. 14, 117-126 (2004)

Semenova, E., et al., Structure-Activity Analysis of Microcin J25: Distinct Parts of the Threaded Lasso

Molecule Are Responsible for Interaction with Bacterial RNA Polymerase, J. Bacteriol., Vol. 187, 3859-

3863 (2005)

Shaevitz, J., W., et al., Backtracking by single RNA polymerase molecules observed at near-base-pair

resolution, Nature, Vol. 426, 684-687 (2003)

Shi, Y., et al., Polarizable Atomic Multipole-Based AMOEBA Force Field for Proteins, J. Chem. Theory

Comput., Vol. 9, 4046-4063 (2013)

Sigel, H., Griesser, R., Nucleoside 5’-triphosphates: self-association, acid–base, and metal ion-binding

properties in solution, Chem. Soc. Rev., Vol. 34, 875-900 (2005)

Silva, D.-A., et al., Millisecond dynamics of RNA polymerase II translocation at atomic resolution,

PNAS, 1-6 (2014)

Sims III, R., J., et al., Elongation by RNA polymerase II: the short and long of it, Genes Dev., Vol. 18,

2437-2468 (2004)

Song, J., et al., Functional Loop Dynamics of the Streptavidin-Biotin Complex, Scientific Reports, Vol.

5, 7906: 10 (2015)

Sosunov, V., et al., Unified two-metal mechanism of RNA synthesis and degradation by RNA

polymerase, The EMBO Journal, Vol. 22, 2234-2244 (2003)

Stano, N., M., et al., The +2 NTP Binding Drives Open Complex Formation in T7 RNA Polymerase, J.

Biol. Chem., Vol. 277, 37292-37300 (2002)

Steinbrecher, T., et al., Revised AMBER parameters for bioorganic phosphates, J Chem Theory

Comput., Vol. 8, 4405-4412 (2012)

Steitz, T., A mechanism for all polymerases, Nature, Vol. 391, 231-232 (1998)

Sunder, S., et al., Regulation of intracellular level of Na+, K+ and glycerol in Saccharomyces cerevisiae

under osmotic stress, Molecular and Cellular Biochemistry, Vol. 158, 121-124 (1996)

Svetlov, et al., Discrimination against Deoxyribonucleotide Substrates by Bacterial RNA Polymerase,

J. Biol. Chem., Vol. 279, 38087-38090 (2004)

Swaminathan, R., Magnesium Metabolism and its Disorders, Clin Biochem Rev, Vol. 24, 47-66 (2003)

Sychrova, H., Yeast as a Model Organism to Study Transport and Homeostasis of Alkali Metal Cations,

Physiol. Res., Vol. 53, S91-S98 (2004)

Sydow, J. F., Cramer, P., RNA polymerase fidelity and transcriptional proofreading, Current Opinion

in Structural Biology, Vol. 19, 732-739 (2009A)

Sydow, J. F., et al., Structural Basis of Transcription: Mismatch-Specific Fidelity Mechanisms and

Paused RNA Polymerase II with Frayed RNA, Molecular Cell, Vol. 34, 710-721 (2009B)

Tadigotla, V., R., et al., Thermodynamic and kinetic modeling of transcriptional pausing, PNAS, Vol.

103, 4439-4444 (2006)

Tahirov, T., H., et al., Structure of a T7 RNA polymerase elongation complex at 2.9Å resolution, Nature,

Vol. 420, 43-50 (2002)

Tan, L., et al., Bridge helix and trigger loop perturbations generate superactive RNA polymerases,

Journal of Biology, Vol. 7, 40.1-40.15 (2008)

Temiakov, D., et al., Structural Basis for Substrate Selection by T7 RNA Polymerase, Cell, Vol. 116,

381-391 (2004)

Temiakov, D., et al., Structural Basis of Transcription Inhibition by Antibiotic Streptolydigin, Molecular

Cell, Vol. 19, 655-666 (2005)

Theobald, U., et al., Determination of In-vivo Cytoplasmic Orthophosphate Concentration in Yeast,

Biotechnology Techniques, Vol. 10, 297-302 (1996)

Theobald, U., et al., In Vivo Analysis of Metabolic Dynamics in Saccharomyces cerevisiae: I.

Experimental Observations, Biotechnology and Bioengineering, Vol. 55, 305-316 (1997)

Tikhonova, I., G., et al., Simulations of Biased Agonists in the β2 Adrenergic Receptor with Accelerated

Molecular Dynamics, Biochemistry, Vol. 52, 5593-5603 (2013)

Toulokhonov, I., et al., A Central Role of the RNA Polymerase Trigger Loop in Active-Site

Rearrangement during Transcriptional Pausing, Molecular Cell, Vol. 27, 406-419 (2007)

Traut, T., W., Physiological concentrations of purines and pyrimidines, Molecular and Cellular

Biochemistry, Vol. 140, 1-22 (1994)

van Eunen, K., Bakker, B., M., The importance and challenges of in vivo-like enzyme kinetics,

Perspectives in Science, Vol. 1, 126-130 (2014)

van Eunen, K., et al., Measuring enzyme activities under standardized in vivo-like conditions for

systems biology, FEBS Journal, Vol. 277, 749-760 (2010)

Vassylyev, D., G., et al., Crystal structure of a bacterial RNA polymerase holoenzyme at 2.6 Å

resolution, Nature, Vol. 417, 712-719 (2002)

Vassylyev, D., G., et al., Structural basis for transcription elongation by bacterial RNA polymerase,

Nature, Vol. 448, 157-164 (2007A)

Vassylyev, D., G., et al., Structural basis for substrate loading in bacterial RNA polymerase, Nature,

Vol. 448, 163-169 (2007B)

Vassylyev, D., G., Elongation by RNA polymerase: a race through roadblocks, Current Opinion in

Structural Biology, Vol. 19, 691-700 (2009)

Volkov, V., Quantitative description of ion transport via plasma membrane of yeast and small cells,

Front. Plant Sci., Vol. 6, art. 425 (2015)

Wang, H.-Y., et al., Force Generation in RNA Polymerase, Biophysical Journal, Vol. 74, 1186-1202

(1998)

Wang, J., et al., How Well Does a Restrained Electrostatic Potential (RESP) Model Perform in

Calculating Conformational Energies of Organic and Biological Molecules?, Journal of Computational

Chemistry, Vol. 21, 1049-1074 (2000)

Wang, H., Oster, G., Ratchets, power strokes, and molecular motors, Appl. Phys. A, Vol. 75, 315-323

(2002)

Wang, D., et al., Structural basis of transcription: role of the trigger loop in substrate specificity and

catalysis, Cell, Vol. 127, 941-954 (2006)

Wang, D., et al., Structural Basis of Transcription: Backtracked RNA Polymerase II at 3.4 Angstrom

Resolution, Science, Vol. 324, 1203-1206 (2009)

Wang, Y., et al., Enhanced Lipid Diffusion and Mixing in Accelerated Molecular Dynamics, J. Chem.

Theory Comput., Vol. 7, 3199-3207 (2011A)

Wang, Y., et al., Implementation of accelerated molecular dynamics in NAMD, Computational Science

& Discovery, Vol. 4, 015002: 10 (2011B)

Wang, B., et al., Computational Simulation Strategies for Analysis of Multisubunit RNA Polymerases,

Chem. Rev., Vol. 113, 8546-8566 (2013)

Weinzierl, R., O., J., Nanomechanical constraints acting on the catalytic site of cellular RNA

polymerases, Biochem. Soc. Trans., Vol. 38, 428-432 (2010A)

Weinzierl, R., O., J., The nucleotide addition cycle of RNA polymerase is controlled by two molecular

hinges in the Bridge Helix domain, BMC Biology, Vol. 8, 134.1-134.15 (2010B)

Weinzierl, R., O., J., The Bridge Helix of RNA Polymerase Acts as a Central Nanomechanical

Switchboard for Coordinating Catalysis and Substrate Movement, Archaea, Vol. 2011, 608385.1-

608385.7 (2011)

Weixlbaumer, A., et al., Structural Basis of Transcriptional Pausing in Bacteria, Cell, Vol. 152, 431-441

(2013)

Westover, K., D., et al., Structural Basis of Transcription: Nucleotide Selection by Rotation in the RNA

Polymerase II Active Center, Cell, Vol. 119, 481-489 (2004A)

Westover, K., D., et al., Structural Basis of Transcription: Separation of RNA from DNA by RNA

Polymerase II, Science, Vol. 303, 1014-1016 (2004B)

Woo, H.-J., et al., Molecular dynamics studies of the energetics of translocation in model T7 RNA

polymerase elongation complexes, Proteins, Vol. 73, 1021-1036 (2008)

Xie, P., A dynamic model for processive transcription elongation and backtracking long pauses by

multisubunit RNA polymerases, Proteins, Vol. 80, 2020–2034 (2012)

Xiong, Y., Burton, Z., A Tunable Ratchet Driving Human RNA Polymerase II Translocation Adjusted

by Accurately Templated Nucleoside Triphosphates Loaded at Downstream Sites and by Elongation

Factors, The Journal of Biological Chemistry, Vol. 282, 36582-36592 (2007)

Yaffe, E., et al., MolAxis: a server for identification of channels in macromolecules, Nucleic Acids

Research, Vol. 36, W210-W215 (2008)

Yu, J., Oster, G., A Small Post-Translocation Energy Bias Aids Nucleotide Selection in T7 RNA

Polymerase Transcription, Biophysical Journal, Vol. 102, 532-541 (2012)

Yuzenkova, Y., et al., Stepwise mechanism for transcription fidelity, BMC Biology, Vol. 8, art. 54

(2010)

Zaychikov, A., et al., Translocation of the Escherichia coli transcription complex observed in the

registers 11 to 20: "Jumping" of RNA polymerase and asymmetric expansion and contraction of the

"transcription bubble", PNAS, Vol. 92, 1739-1743 (1995)

Zenkin, N., et al., Transcript-Assisted Transcriptional Proofreading, Science, Vol. 313, 518-520 (2006)

Zgarbova, M., et al., Refinement of the Cornell et al. Nucleic Acids Force Field Based on Reference

Quantum Chemical Calculations of Glycosidic Torsion Profiles, J. Chem. Theory Comput., Vol. 7, 2886-

2902 (2011)

Zgarbova, M., et al., Toward improved description of DNA backbone: Revisiting epsilon and zeta torsion

force field parameters, J. Chem. Theory Comput., Vol. 9, 2339-2354 (2013)

Zgarbova, M., et al., Refinement of the Sugar-Phosphate Backbone Torsion Beta for AMBER Force

Fields Improves the Description of Z- and B-DNA, J. Chem. Theory Comput., Vol. 11, 5723-5736

(2015)

Zhang, G., et al., Crystal Structure of Thermus aquaticus Core RNA Polymerase at 3.3 Å Resolution, Cell,

Vol. 98, 811-824 (1999)

Zhang, C., et al., Combinatorial Control of Human RNA Polymerase II (RNAP II) Pausing and

Transcript Cleavage by Transcription Factor IIF, Hepatitis δ Antigen, and Stimulatory Factor II, J. Biol.

Chem., Vol. 278, 50101-50111 (2003)

Zhang, C., Burton, Z., Transcription Factors IIF and IIS and Nucleoside Triphosphate Substrates as

Dynamic Probes of the Human RNA Polymerase II Mechanism, J. Mol. Biol., Vol. 342, 1085-1099

(2004)

Zhang, J., et al., Role of the RNA polymerase trigger loop in catalysis and pausing, Nature Structural &

Molecular Biology, Vol. 17, 99-105 (2010)

Zhang, Y., et al., Structural Basis of Transcription Initiation, Science, Vol. 338, 1076-1080 (2012)

Zhang, L., et al., Structural Model of RNA Polymerase II Elongation Complex with Complete

Transcription Bubble Reveals NTP Entry Routes, PLOS Computational Biology, Vol. 11, e1004354

(2015A)

Zhang, J., et al., A Fast Sensor For in Vivo Quantification of cytosolic Phosphate in Saccharomyces

Cerevisiae, Biotechnology and Bioengineering, Vol. 112, 1033-1046 (2015B)

Appendix 1: aMD simulation procedure

use File::Slurp;
use Math::Round;
use strict;
use autodie;
use warnings qw(all);
use Statistics::Descriptive;
use List::Util qw( min max );

$ENV{PYTHONPATH} = "/home/ng/amber16/lib/python2.7/site-packages";
$ENV{OPENMM_CUDA_COMPILER} = "/usr/local/cuda-8.0/bin/nvcc";
$ENV{LD_LIBRARY_PATH} = "/usr/local/cuda-7.5/lib64:/lib";
$ENV{PATH} = "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin";
$ENV{AMBERHOME} = "/home/ng/amber16";

my $dna1;
my $dna2;
my $dna3;
my $dna4;
my $dna5;
my $dna6;
my $dna7;
my $dna8;

###############################################################################
##Preliminary notice
###############################################################################
##We start with 2e2h structure,
##with pdb file cleaned up
##i.e. only keep ATOM, HETATM, TER and END lines,
##with extended nucleic acid frame,
##with gtp in A site removed,
##with missing loops added,
##and with C and NTER added,
##the structure is also pre-minimized (see chapter 2)
###############################################################################
##END Preliminary notice
###############################################################################

###############################################################################
##Execute first dummy leap run
###############################################################################
#Note:
#Execute first Leap run on initial structure (without metabolites), only hydrogenize and solvate
#generate hydrogenated structure>struct-ini-hydro.pdb
my $outfile="leap-1.scrpt";
open (FILE2, "> $outfile") || die;
print (FILE2 "source leaprc.protein.ff14SB\n");
print (FILE2 "source leaprc.DNA.OL15\n");
print (FILE2 "source leaprc.RNA.OL3\n");
print (FILE2 "loadoff atomic_ions.lib\n");

print (FILE2 "loadamberparams frcmod.ions1lm_1264_tip4pew\n");
print (FILE2 "loadamberparams frcmod.ions234lm_1264_tip4pew\n");
print (FILE2 "loadoff solvents.lib\n");
print (FILE2 "loadamberparams frcmod.tip4pew\n");
print (FILE2 "sys = loadpdb 2e2h-pre-minimized.pdb\n");
print (FILE2 "saveamberparm sys 2e2h-pre-minimized.prmtop 2e2h-pre-minimized.inpcrd\n");
print (FILE2 "savepdb sys out-leap1-1.pdb\n");
print (FILE2 "solvatebox sys TIP4PEWBOX 15.0\n");
print (FILE2 "saveamberparm sys 2e2h-pre-minimized-solv.prmtop 2e2h-pre-minimized-solv.inpcrd\n");
print (FILE2 "savepdb sys out-leap1-2.pdb\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "/home/ng/amber16/bin/xleap -s -f leap-1.scrpt > out-leap1.out";
system($cmd);
#keep file:
rename "leap.log", "leap-1.log";
###############################################################################
##END Execute first leap run
###############################################################################

###############################################################################
##Extract number of protein residues and extract T and N strand anchors
###############################################################################
my @pdb_input = read_file("out-leap1-1.pdb") or die;
my $array_size=scalar @pdb_input;
############## EXTRACT DNA anchors and protein atom index range
my $down_dna_anchor;
my $up_dna_anchor;
my $down=0;
my $up=0;
my $y=2;
my $down_dna_anchor_first_id;
my $down_dna_anchor_last_id;
my $up_dna_anchor_first_id;
my $up_dna_anchor_last_id;
my $down_dna_anchor_first_id_chain1;
my $down_dna_anchor_last_id_chain1;
my $down_dna_anchor_first_id_chain2;
my $down_dna_anchor_last_id_chain2;
my $up_dna_anchor_first_id_chain1;
my $up_dna_anchor_last_id_chain1;
my $up_dna_anchor_first_id_chain2;
my $up_dna_anchor_last_id_chain2;
my $last_protein_id;
my $last_protein_res_id;
my $AA_prec = 0;
my $AA_foll = 0;
for (my $count=0; $count<$array_size; $count++) {
    my $line= @pdb_input[$count];
    my $trigger_TER=substr($line, 0, 3);
    ############ Detect TER ##################


    if ($trigger_TER eq "TER") {
        #EXTRACT DNA ANCHORS:
        my $line_prec= $pdb_input[$count-1];
        my $RP=substr($line_prec, 17, 2);
        my $RP_alt=substr($line_prec, 18, 2);
        my $line_foll= $pdb_input[$count+1];
        my $RF=substr($line_foll, 17, 2);
        my $RF_alt=substr($line_foll, 18, 2);
        if (($RF eq "DA") or ($RF eq "DG") or ($RF eq "DC") or ($RF eq "DT")){
            $down_dna_anchor = "on";
        }
        if (($RF_alt eq "DA") or ($RF_alt eq "DG") or ($RF_alt eq "DC") or ($RF_alt eq "DT")){
            $down_dna_anchor = "on";
        }
        if (($RP eq "DA") or ($RP eq "DG") or ($RP eq "DC") or ($RP eq "DT")){
            $up_dna_anchor = "on";
        }
        if (($RP_alt eq "DA") or ($RP_alt eq "DG") or ($RP_alt eq "DC") or ($RP_alt eq "DT")){
            $up_dna_anchor = "on";
        }
        ## EXTRACT down_dna_anchor
        if ($down_dna_anchor eq "on") {
            ##Go forwards on residue length
            my $line_DNA_segment_start=$pdb_input[$count+1];
            #get first atom index (for first chain x=1, for second chain x=2)
            $down_dna_anchor_first_id = substr($line_DNA_segment_start, 6, 5);
            my $resid_DNA_segment_start=substr($line_DNA_segment_start, 22, 4);
            for (my $c=1; $c<$y; $c++) {
                my $line_DNA_segment= $pdb_input[$count+$c];
                my $resid_DNA_segment=substr($line_DNA_segment, 22, 4);
                if ($resid_DNA_segment==$resid_DNA_segment_start) {
                    $y++;
                    $down_dna_anchor_last_id=substr($line_DNA_segment, 6, 5);
                }
            }
            $down_dna_anchor = "off";
            $down++;
        }
        ## END EXTRACT down_dna_anchor
        ## EXTRACT up_dna_anchor
        if ($up_dna_anchor eq "on") {
            ##Go backwards on residue length
            my $line_DNA_segment_start=$pdb_input[$count-1];
            #get last atom index (for first chain x=1, for second chain x=2)
            $up_dna_anchor_last_id = substr($line_DNA_segment_start, 6, 5);
            my $resid_DNA_segment_start=substr($line_DNA_segment_start, 22, 4);
            for (my $c=1; $c<$y; $c++) {
                my $line_DNA_segment= $pdb_input[$count-$c];
                my $resid_DNA_segment=substr($line_DNA_segment, 22, 4);
                if ($resid_DNA_segment==$resid_DNA_segment_start) {
                    $y++;
                    $up_dna_anchor_first_id=substr($line_DNA_segment, 6, 5);
                }
            }


            $up_dna_anchor = "off";
            $up++;
        }
        ## END EXTRACT up_dna_anchor
        #EXTRACT PROTEIN atom index range:
        #if the residue preceding the TER is an AA,
        #but the next residue is not an AA
        #then we have reached the end of the protein atoms
        my $RP=substr($line_prec, 17, 3);
        my $RF=substr($line_foll, 17, 3);
        #residue names treated as amino acids (same list as before, duplicates removed)
        my %is_aa = map { $_ => 1 } qw(
            ALA ARG ASH ASN ASP CYM CYS CYX GLN GLU GLY HID HIE HIP HYP
            ILE LEU LYN LYS MET PHE PRO SER THR TRP TYR VAL
        );
        if ($is_aa{$RP}){
            $AA_prec = 1;
        }
        if ($is_aa{$RF}){
            $AA_foll = 1;
        }
        if (($AA_prec == 1) and ($AA_foll == 0)){
            $last_protein_id = substr($line_prec, 6, 5);
            $last_protein_id = $last_protein_id - 1;
            $last_protein_res_id = substr($line_prec, 22, 4);
        }
        $AA_prec = 0;
        $AA_foll = 0;
        #END EXTRACT PROTEIN atom index range
    }
    ############ END Detect TER ##################
    if ($down == 1) {
        $down_dna_anchor_first_id_chain1 = $down_dna_anchor_first_id-1;
        $down_dna_anchor_last_id_chain1 = $down_dna_anchor_last_id-1;
    }
    if ($down == 2) {
        $down_dna_anchor_first_id_chain2 = $down_dna_anchor_first_id-1;
        $down_dna_anchor_last_id_chain2 = $down_dna_anchor_last_id-1;
    }
    if ($up == 1) {
        $up_dna_anchor_first_id_chain1 = $up_dna_anchor_first_id-1;
        $up_dna_anchor_last_id_chain1 = $up_dna_anchor_last_id-1;
    }
    if ($up == 2) {
        $up_dna_anchor_first_id_chain2 = $up_dna_anchor_first_id-1;
        $up_dna_anchor_last_id_chain2 = $up_dna_anchor_last_id-1;
    }


}
############## END of line loop and END EXTRACT DNA anchors
print "chain1_dna_anchors are: $down_dna_anchor_first_id_chain1 to $down_dna_anchor_last_id_chain1 $up_dna_anchor_first_id_chain1 to $up_dna_anchor_last_id_chain1\n";
print "chain2_dna_anchors are: $down_dna_anchor_first_id_chain2 to $down_dna_anchor_last_id_chain2 $up_dna_anchor_first_id_chain2 to $up_dna_anchor_last_id_chain2\n";
print "protein index range is: 1 to $last_protein_id\n";
print "protein resid range is: 1 to $last_protein_res_id\n";
#Assumption: $dna1..$dna8 are declared above but never assigned in this listing;
#they are taken here to be the four 0-based anchor ranges just extracted,
#which is what the anchor restraints in the OpenMM inputs below require
$dna1 = $down_dna_anchor_first_id_chain1;
$dna2 = $down_dna_anchor_last_id_chain1;
$dna3 = $up_dna_anchor_first_id_chain1;
$dna4 = $up_dna_anchor_last_id_chain1;
$dna5 = $down_dna_anchor_first_id_chain2;
$dna6 = $down_dna_anchor_last_id_chain2;
$dna7 = $up_dna_anchor_first_id_chain2;
$dna8 = $up_dna_anchor_last_id_chain2;
################################################################################
##END Extract number of protein residues and extract T and N strand anchors
################################################################################

################################################################################
##Extract number of water molecules
################################################################################
my @pdb_input = read_file("leap-1.log") or die;
my $array_size=scalar @pdb_input;
my $wat;
my $trigger_solvate;
my $trigger_wat=0;
my $trigger_wat_line;
my @line_handle;
for (my $count=0; $count<$array_size; $count++) {
    $trigger_solvate=substr($pdb_input[$count], 0, 9);
    $trigger_wat_line=substr($pdb_input[$count], 2, 5);
    if ($trigger_solvate eq "> solvate") {
        $trigger_wat=1;
    }
    if (($trigger_wat_line eq "Added") and ($trigger_wat == 1)){
        @line_handle = split ( /\s+/, $pdb_input[$count] );
        $wat = $line_handle[2];
        $trigger_wat = 0;
    }
}
print "\nwat is *$wat*\n";
################################################################################
##END Extract number of water molecules
################################################################################

################################################################################
##Extract water box size
################################################################################
#Get water dims:
my $outfile="scr-box.vmd";
open (FILE2, "> $outfile") || die;
print (FILE2 "set outFile out-box.txt\n");
print (FILE2 "set out [open \$outFile w]\n");
print (FILE2 "set mol [mol new out-leap1-2.pdb type pdb]\n");
print (FILE2 "set sel [atomselect top water]\n");


print (FILE2 "set minmax [measure minmax \$sel]\n");
print (FILE2 "set b [split \$minmax { }]\n");
print (FILE2 "set xmin [lindex \$b 0]\n");
print (FILE2 "set xmin [string trim \$xmin \"{\"]\n");
print (FILE2 "set ymin [lindex \$b 1]\n");
print (FILE2 "set zmin [lindex \$b 2]\n");
print (FILE2 "set zmin [string trim \$zmin \"}\"]\n");
print (FILE2 "set xmax [lindex \$b 3]\n");
print (FILE2 "set xmax [string trim \$xmax \"{\"]\n");
print (FILE2 "set ymax [lindex \$b 4]\n");
print (FILE2 "set zmax [lindex \$b 5]\n");
print (FILE2 "set zmax [string trim \$zmax \"}\"]\n");
print (FILE2 "set xdim [expr \$xmax - \$xmin]\n");
print (FILE2 "set ydim [expr \$ymax - \$ymin]\n");
print (FILE2 "set zdim [expr \$zmax - \$zmin]\n");
print (FILE2 "puts \$out \"\$xdim\"\n");
print (FILE2 "puts \$out \"\$ydim\"\n");
print (FILE2 "puts \$out \"\$zdim\"\n");
print (FILE2 "exit\n");
close (FILE2);
my $cmd = "vmd -dispdev text -nt -e scr-box.vmd";
system($cmd);
unlink "scr-box.vmd";
################################################################################
##END Extract water box size
################################################################################

################################################################################
##Read box size
################################################################################
my $count=0;
my $x_box;
my $y_box;
my $z_box;
my @pdb_input_ini = read_file("out-box.txt") or die;
$x_box= $pdb_input_ini[0];
$x_box =~ s/^\s+|\s+$//g;
$y_box= $pdb_input_ini[1];
$y_box =~ s/^\s+|\s+$//g;
$z_box= $pdb_input_ini[2];
$z_box =~ s/^\s+|\s+$//g;
$x_box= sprintf "%.3f", $x_box;
$y_box= sprintf "%.3f", $y_box;
$z_box= sprintf "%.3f", $z_box;
print "x_box is *$x_box*\n";
print "y_box is *$y_box*\n";
print "z_box is *$z_box*\n";
################################################################################
##END Read box size
################################################################################

################################################################################


##AMEND pdb, with water box size, and OXT atoms removed
################################################################################
#Note:
#With box size, Prepare AddtoBox ready pdb file out-0.pdb >addtobox-ready-struct-ini-hydro.pdb,
#and remove OXT atoms from pdb file (for second leap run)
my $x_pdb;
my $y_pdb;
my $z_pdb;
my $size_x= length($x_box);
if ($size_x == 6){
    $x_pdb= " ". $x_box;
}
if ($size_x == 7){
    $x_pdb= $x_box;
}
my $size_y= length($y_box);
if ($size_y == 6){
    $y_pdb= " ". $y_box;
}
if ($size_y == 7){
    $y_pdb= $y_box;
}
my $size_z= length($z_box);
if ($size_z == 6){
    $z_pdb= " ". $z_box;
}
if ($size_z == 7){
    $z_pdb= $z_box;
}
my $sel_total = "CRYST1 " . $x_pdb . " " . $y_pdb . " " . $z_pdb . " 90.00 90.00 90.00 1" . "\n";
my @pdb_input = read_file("out-leap1-1.pdb") or die;
my @pdb_output;
my $array_size=scalar @pdb_input;
my $outfile= "out-ini.pdb";
my $count_update_output=0;
$pdb_output[0] = $sel_total;
########## LINE LOOP
for (my $count=1; $count<$array_size; $count++) {
    my $line= $pdb_input[$count];
    my $atom=substr($line, 13, 3);
    if ($atom ne "OXT"){
        $pdb_output[$count+$count_update_output] = $line;
    }
    if ($atom eq "OXT"){
        $count_update_output--;
    }
}
########## END of LINE LOOP
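#Illustrative sketch (not part of the original workflow): the PDB format
#defines CRYST1 as a fixed-column record (a, b, c written as %9.3f; alpha,
#beta, gamma as %7.2f; space group in columns 56-66; Z in columns 67-70),
#so the record assembled by hand above could also be produced with a single
#sprintf. The helper name cryst1_record and the "P 1" space group are
#assumptions made for this example only; the helper is not called anywhere.
sub cryst1_record {
    my ($len_a, $len_b, $len_c) = @_;
    #orthorhombic box: all three angles are 90 degrees
    return sprintf("CRYST1%9.3f%9.3f%9.3f%7.2f%7.2f%7.2f %-11s%4d\n",
                   $len_a, $len_b, $len_c, 90, 90, 90, "P 1", 1);
}
#usage (hypothetical): my $sel_total_alt = cryst1_record($x_box, $y_box, $z_box);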


# print to file
open (FILE, "> $outfile");
for (my $count_2=0; $count_2<$array_size+$count_update_output; $count_2++) {
    print (FILE $pdb_output[$count_2]);
}
close (FILE);
################################################################################
##END AMEND pdb, with water box size, and OXT atoms removed
################################################################################

################################################################################
##CALCULATE metabolite amounts
################################################################################
#Note:
#Calculate the number of each ion and metabolite needed (and of gtps for later)
#0.5 mM Ca2+:
my $nb_Ca=($wat/55)*0.0005;
$nb_Ca=nearest (1, $nb_Ca);
#2 mM Mg2+:
my $nb_Mg=($wat/55)*0.002;
$nb_Mg=nearest (1, $nb_Mg);
#5 mM sulfate (SUL):
my $nb_S=($wat/55)*0.005;
$nb_S=nearest (1, $nb_S);
#20 mM Na+:
my $nb_Na=($wat/55)*0.02;
$nb_Na=nearest (1, $nb_Na);
#2 mM Lys:
my $nb_ZK=($wat/55)*0.002;
$nb_ZK=nearest (1, $nb_ZK);
#2.5 mM His:
my $nb_ZHE=($wat/55)*0.0025;
$nb_ZHE=nearest (1, $nb_ZHE);
#6 mM Arg:
my $nb_ZR=($wat/55)*0.006;
$nb_ZR=nearest (1, $nb_ZR);
#8.5 mM Asp:
my $nb_ZD=($wat/55)*0.0085;
$nb_ZD=nearest (1, $nb_ZD);
#80 mM Glu:
my $nb_ZE=($wat/55)*0.08;
$nb_ZE=nearest (1, $nb_ZE);
#300 mM K+:
my $nb_K=($wat/55)*0.3;
$nb_K=nearest (1, $nb_K);
#And calculations for phase two (later),
#when the gtps are added in a metabolite
#relaxed solvent bath:
#5.9 mM NTPs:
my $nb_gtp=($wat/55)*0.0059;
$nb_gtp=nearest (1, $nb_gtp);
#number of Cl- to be removed later:
my $del_Cl=$nb_gtp*2;
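#Worked example of the count formula used above (illustrative sketch, not
#part of the original workflow). Liquid water is about 55 mol/L, so a box
#holding $wat water molecules has a volume of roughly $wat/(55*N_A) litres,
#and a molar concentration C then corresponds to N = C*V*N_A = ($wat/55)*C
#solute particles. The helper name count_for_molarity is hypothetical and
#the helper is not called anywhere; it rounds with int(x + 0.5), which is
#only valid for non-negative values.
sub count_for_molarity {
    my ($n_wat, $molar) = @_;
    return int(($n_wat / 55) * $molar + 0.5);
}
#e.g. for a box of 110000 waters:
#  count_for_molarity(110000, 0.002) gives 4   (2 mM)
#  count_for_molarity(110000, 0.3)   gives 600 (300 mM)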


print "nb_Ca is *$nb_Ca*\n";
print "nb_Mg is *$nb_Mg*\n";
print "nb_S is *$nb_S*\n";
print "nb_Na is *$nb_Na*\n";
print "nb_ZK is *$nb_ZK*\n";
print "nb_ZHE is *$nb_ZHE*\n";
print "nb_ZR is *$nb_ZR*\n";
print "nb_ZD is *$nb_ZD*\n";
print "nb_ZE is *$nb_ZE*\n";
print "nb_K is *$nb_K*\n";
print "nb_gtp is *$nb_gtp*\n";
print "del_Cl is *$del_Cl*\n";
################################################################################
##END CALCULATE metabolite amounts
################################################################################

################################################################################
##ADD first round of metabolite to solvent box
################################################################################
#Note:
#Execute AddToBox on out-ini.pdb to add first round of metabolites (not the gtps yet)
my $nb_protein_res = $last_protein_res_id;
my $cmd = "/home/ng/amber16/bin/AddToBox -c out-ini.pdb -a Ca.pdb -na $nb_Ca -o out2.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out2.pdb -a MG.pdb -na $nb_Mg -o out3.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out3.pdb -a SUL.pdb -na $nb_S -o out4.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out4.pdb -a Na+.pdb -na $nb_Na -o out7.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out7.pdb -a ZK.pdb -na $nb_ZK -o out8.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out8.pdb -a ZHE.pdb -na $nb_ZHE -o out9.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out9.pdb -a ZR.pdb -na $nb_ZR -o out10.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out10.pdb -a ZD.pdb -na $nb_ZD -o out11.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";


system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out11.pdb -a ZE.pdb -na $nb_ZE -o out12.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out12.pdb -a K+.pdb -na $nb_K -o out13.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
################################################################################
##END ADD first round of metabolite to solvent box
################################################################################

################################################################################
##EXECUTE second dummy Leap run
################################################################################
#Note:
#Run second Leap run, with required param files, to hydrogenate the metabolites,
#and get unbalanced charge
my $outfile="leap-2.scrpt";
open (FILE2, "> $outfile") || die;
print (FILE2 "source leaprc.protein.ff14SB\n");
print (FILE2 "source leaprc.DNA.OL15\n");
print (FILE2 "source leaprc.RNA.OL3\n");
print (FILE2 "loadoff atomic_ions.lib\n");
print (FILE2 "loadamberparams frcmod.ions1lm_1264_tip4pew\n");
print (FILE2 "loadamberparams frcmod.ions234lm_1264_tip4pew\n");
print (FILE2 "loadoff zaa-new.off\n");
print (FILE2 "loadoff SUL.lib\n");
print (FILE2 "loadamberparams frcmod.sul\n");
print (FILE2 "sys = loadpdb out13.pdb\n");
print (FILE2 "charge sys\n");
print (FILE2 "setBox sys vdw 1.0\n");
print (FILE2 "set sys box {$x_box $y_box $z_box}\n");
print (FILE2 "saveamberparm sys out13.prmtop out13.inpcrd\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "/home/ng/amber16/bin/xleap -s -f leap-2.scrpt > out-leap2.out";
system($cmd);
#keep file:
rename "leap.log", "leap-2.log";
################################################################################
##END EXECUTE second dummy Leap run
################################################################################

################################################################################
##EXTRACT unbalanced charge
################################################################################
my @pdb_input = read_file("leap-2.log") or die;
my $array_size=scalar @pdb_input;
my $charge;
my $done_charge=0;
my @line_handle;


for (my $count=0; $count<$array_size; $count++) {
    my $trigger_charge=substr($pdb_input[$count], 0, 8);
    if (($trigger_charge eq "> charge") and ($done_charge == 0)){
        @line_handle = split ( /\s+/, $pdb_input[$count+1] );
        $charge = $line_handle[3];
        $charge=nearest (1, $charge);
        $done_charge=1;
    }
}
print "\ncharge is *$charge*\n";
################################################################################
##END EXTRACT unbalanced charge
################################################################################

################################################################################
##CALCULATE number of Cl- required to neutralise the system
################################################################################
#Note:
#Calculate number of Cl- required to neutralise the system now
#and for later when the gtps will be added
my $nb_Cl=$charge;
print "nb_Cl is *$nb_Cl*\n";
################################################################################
##END CALCULATE number of Cl- required to neutralise the system
################################################################################

################################################################################
##ADD Cl- and water molecules to solvent box
################################################################################
my $cmd = "/home/ng/amber16/bin/AddToBox -c out13.pdb -a Cl-.pdb -na $nb_Cl -o out14.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
my $cmd = "/home/ng/amber16/bin/AddToBox -c out14.pdb -a WAT.pdb -na $wat -o out15.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
################################################################################
##END ADD Cl- and water molecules to solvent box
################################################################################

################################################################################
##EXECUTE non-dummy Leap run
################################################################################
#Note:
#Execute third Leap run to generate the simulation ready amber inpcrd and prmtop files
my $outfile="leap-3.scrpt";
open (FILE2, "> $outfile") || die;
print (FILE2 "source leaprc.protein.ff14SB\n");


print (FILE2 "source leaprc.DNA.OL15\n");
print (FILE2 "source leaprc.RNA.OL3\n");
print (FILE2 "loadoff atomic_ions.lib\n");
print (FILE2 "loadamberparams frcmod.ions1lm_1264_tip4pew\n");
print (FILE2 "loadamberparams frcmod.ions234lm_1264_tip4pew\n");
print (FILE2 "loadoff solvents.lib\n");
print (FILE2 "loadamberparams frcmod.tip4pew\n");
print (FILE2 "loadoff zaa-new.off\n");
print (FILE2 "loadoff SUL.lib\n");
print (FILE2 "loadamberparams frcmod.sul\n");
print (FILE2 "sys = loadpdb out15.pdb\n");
print (FILE2 "setBox sys vdw 1.0\n");
print (FILE2 "set sys box {$x_box $y_box $z_box}\n");
print (FILE2 "saveamberparm sys out15.prmtop out15.inpcrd\n");
print (FILE2 "savepdb sys out15-leap.pdb\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "/home/ng/amber16/bin/xleap -s -f leap-3.scrpt > out-leap3.out";
system($cmd);
#keep file:
rename "leap.log", "leap-3.log";
#Apply C-4 term to Lennard-Jones potential
#and apply Panteva 2015 m1264 refined set
my $outfile="parmed.in";
open (FILE2, "> $outfile") || die;
print (FILE2 "setOverwrite True\n");
print (FILE2 "change AMBER_ATOM_TYPE :A*,DA*\@N7 NAMG\n");
print (FILE2 "change AMBER_ATOM_TYPE :G*,DG*\@N7 NGMG\n");
print (FILE2 "change AMBER_ATOM_TYPE :*\@OP* OPMG\n");
print (FILE2 "addLJType @\%NAMG\n");
print (FILE2 "addLJType @\%NGMG\n");
print (FILE2 "addLJType @\%OPMG\n");
print (FILE2 "add12_6_4 :ZN watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :ZN\n");
print (FILE2 "add12_6_4 :MG watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :MG\n");
print (FILE2 "add12_6_4 :Na+ watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :Na+\n");
print (FILE2 "add12_6_4 :Cl- watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :Cl-\n");
print (FILE2 "add12_6_4 :CA watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :CA\n");
print (FILE2 "add12_6_4 :K+ watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :K+\n");
print (FILE2 "outparm out15-parmed.prmtop out15-parmed.inpcrd\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "/home/ng/amber16/bin/parmed -i parmed.in -p out15.prmtop -c out15.inpcrd >out-parmed.txt";
system($cmd);
################################################################################
##END EXECUTE non-dummy Leap run
################################################################################

################################################################################
##EXECUTE first round of simulations
################################################################################


##################### MIN
my $outfile="min1.in";
open (FILE2, "> $outfile") || die;
print (FILE2 "2e2h: initial minimisation solvent + ions\n");
print (FILE2 " &cntrl\n");
print (FILE2 " imin = 1,\n");
print (FILE2 " ntmin = 2,\n");
print (FILE2 " maxcyc = 5000,\n");
print (FILE2 " ncyc = 1000,\n");
print (FILE2 " ntb = 1,\n");
print (FILE2 " ntr = 1,\n");
print (FILE2 " cut = 10.0\n");
print (FILE2 " /\n");
print (FILE2 "Hold the protein fixed\n");
print (FILE2 "500.0\n");
print (FILE2 "RES 1 $nb_protein_res\n");
print (FILE2 "END\n");
print (FILE2 "END\n");
close (FILE2);
my $outfile="min2.in";
open (FILE2, "> $outfile") || die;
print (FILE2 "2e2h: initial minimisation whole system\n");
print (FILE2 " &cntrl\n");
print (FILE2 " imin = 1,\n");
print (FILE2 " ntmin = 2,\n");
print (FILE2 " maxcyc = 5000,\n");
print (FILE2 " ncyc = 2500,\n");
print (FILE2 " ntb = 1,\n");
print (FILE2 " ntr = 0,\n");
print (FILE2 " cut = 10.0\n");
print (FILE2 " /\n");
print (FILE2 "END\n");
close (FILE2);
my $cmd = "/home/ng/amber16/bin/sander -O -i min1.in -o min1-p.out -p out15-parmed.prmtop -c out15-parmed.inpcrd -r min1-p.rst -ref out15-parmed.inpcrd";
system($cmd);
my $cmd = "/home/ng/amber16/bin/sander -O -i min2.in -o min2-p.out -p out15-parmed.prmtop -c min1-p.rst -r min2-p.rst";
system($cmd);
################################################################################
##EXECUTE next preliminary steps with OPENMM
################################################################################
##################### HEAT (MD1) 20 ps
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");


print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");
print (FILE2 "\n");
print (FILE2 "parm = load_file('out15-parmed.prmtop', 'min2-p.rst')\n");
print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=10*angstroms, constraints=HBonds, rigidWater=True)\n");
print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n");
#Add restraints to protein
#(the Python loop body below is indented explicitly so that the generated script parses)
print (FILE2 "force_res_prot = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n");
print (FILE2 "force_res_prot.addGlobalParameter(\"k\", 10.0*kilocalories_per_mole/angstroms**2)\n");
print (FILE2 "force_res_prot.addPerParticleParameter(\"x0\")\n");
print (FILE2 "force_res_prot.addPerParticleParameter(\"y0\")\n");
print (FILE2 "force_res_prot.addPerParticleParameter(\"z0\")\n");
print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n");
print (FILE2 "    if (i <= $last_protein_id):\n");
print (FILE2 "        force_res_prot.addParticle(i, atom_crd.value_in_unit(nanometers))\n");
print (FILE2 "system.addForce(force_res_prot)\n");
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");
print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n");
print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");
print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 100, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('md1-p.nc', 5000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('md1-p.rst7', 10000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
print (FILE2 "simulation.step(10000)\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-md1-p.txt";
system($cmd);
##################### EQ-VEL (MD2-eq) 100 ps
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");
print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");
print (FILE2 "\n");
print (FILE2 "parm = load_file('out15-parmed.prmtop', 'md1-p.rst7.10000')\n");
print (FILE2 "inpcrd = AmberInpcrdFile('md1-p.rst7.10000')\n");


print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n");
print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n");
#Add restraints to anchors
#(the Python loop body below is indented explicitly so that the generated script parses)
print (FILE2 "force_anchors = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n");
print (FILE2 "force_anchors.addGlobalParameter(\"k\", 50.0*kilocalories_per_mole/angstroms**2)\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"x0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"y0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"z0\")\n");
print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n");
print (FILE2 "    if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n");
print (FILE2 "        force_anchors.addParticle(i, atom_crd.value_in_unit(nanometers))\n");
print (FILE2 "system.addForce(force_anchors)\n");
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");
print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n");
print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");
print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n");
print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 1000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True, volume=True, density=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('md2-eq-p.nc', 10000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('md2-eq-p.rst7', 50000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
print (FILE2 "simulation.step(50000)\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-md2-eq-p.txt";
system($cmd);
##################### EQ-BOX (MD2-sim1) 20 ns
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");
print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");
print (FILE2 "\n");
print (FILE2 "parm = load_file('out15-parmed.prmtop', 'md2-eq-p.rst7.50000')\n");
print (FILE2 "inpcrd = AmberInpcrdFile('md2-eq-p.rst7.50000')\n");
print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n");


print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n");
print (FILE2 "system.addForce(MonteCarloBarostat(1*bar, 300*kelvin))\n");
#Add restraints to anchors
#(the Python loop body below is indented explicitly so that the generated script parses)
print (FILE2 "force_anchors = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n");
print (FILE2 "force_anchors.addGlobalParameter(\"k\", 50.0*kilocalories_per_mole/angstroms**2)\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"x0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"y0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"z0\")\n");
print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n");
print (FILE2 "    if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n");
print (FILE2 "        force_anchors.addParticle(i, atom_crd.value_in_unit(nanometers))\n");
print (FILE2 "system.addForce(force_anchors)\n");
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");
print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n");
print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");
print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n");
print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 250000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True, volume=True, density=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('md2-sim1-p.nc', 250000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('md2-sim1-p.rst7', 1000000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
print (FILE2 "simulation.step(10000000)\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-md2-sim1-p.txt";
system($cmd);
##################### EQ-VEL2 (MD2-sim2) 20 ns
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");
print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");
print (FILE2 "\n");
print (FILE2 "parm = load_file('out15-parmed.prmtop', 'md2-sim1-p.rst7.10000000')\n");
print (FILE2 "inpcrd = AmberInpcrdFile('md2-sim1-p.rst7.10000000')\n");
print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n");


print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n"); #Add restraints to anchors print (FILE2 "force_anchors = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n"); print (FILE2 "force_anchors.addGlobalParameter(\"k\", 50.0*kilocalories_per_mole/angstroms**2)\n"); print (FILE2 "force_anchors.addPerParticleParameter(\"x0\")\n"); print (FILE2 "force_anchors.addPerParticleParameter(\"y0\")\n"); print (FILE2 "force_anchors.addPerParticleParameter(\"z0\")\n"); print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n"); print (FILE2 " if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n"); print (FILE2 " force_anchors.addParticle(i, atom_crd.value_in_unit(nanometers))\n"); print (FILE2 "system.addForce(force_anchors)\n"); print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n"); print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n"); print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n"); print (FILE2 "simulation.context.setPositions(parm.positions)\n"); print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n"); print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n"); print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 250000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True, volume=True, density=True))\n"); print (FILE2 "simulation.reporters.append(NetCDFReporter('md2-sim2-p-rst.nc', 250000, crds=True))\n"); print (FILE2 "restrt = RestartReporter('md2-sim2-p-rst.rst7', 1000000, parm.ptr('natom'), netcdf=True)\n"); print (FILE2 "simulation.reporters.append(restrt)\n"); print (FILE2 "simulation.step(1000000)\n"); print (FILE2 "quit\n"); close (FILE2); my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-md2-sim2-p-rst.txt"; system($cmd); 
################################################################################
##END EXECUTE next preliminary steps with OPENMM
################################################################################

################################################################################
##EXTRACT LAST FRAME
################################################################################
#Note:
##image the trajectory back inside the periodic box
##extract last frame, strip the water and convert to PDB
#NB:
##One can strip the water directly because,
##in contrast to phase 1, one does not need to extract the
##water box dimensions: the simulation routine has already
##written this information to the CRYST line
##of the PDB file
$ENV{LD_LIBRARY_PATH} = "/home/ng/amber16/lib";
my $outfile="scr-frame.vmd";


open (FILE2, "> $outfile") || die;
print (FILE2 "set outFile out-frame.txt\n");
print (FILE2 "set out [open \$outFile w]\n");
print (FILE2 "set mol [mol new out15-parmed.prmtop]\n");
print (FILE2 "mol addfile md2-sim2-p.nc waitfor all molid \$mol\n");
print (FILE2 "set n [molinfo top get numframes]\n");
print (FILE2 "puts \$out \"\$n\"\n");
print (FILE2 "exit\n");
close (FILE2);
my $cmd = "vmd -dispdev text -nt -e scr-frame.vmd";
system($cmd);
#The first line of out-frame.txt holds the number of frames
my @pdb_input_ini = read_file("out-frame.txt") or die;
my $last_frame= $pdb_input_ini[0];
$last_frame =~ s/^\s+|\s+$//g;
print "last_frame is *$last_frame*\n";
my $outfile="autoimage.ptraj";
open (FILE2, "> $outfile") || die;
print (FILE2 "trajin md2-sim2-imaged-p.nc $last_frame $last_frame 1\n");
print (FILE2 "strip :WAT\n");
print (FILE2 "trajout md2-sim2-imaged-stripped-p.pdb\n");
close (FILE2);
my $cmd = "/home/ng/amber16/bin/cpptraj out15-parmed.prmtop < autoimage.ptraj > out-ptraj3.txt";
system($cmd);
$ENV{LD_LIBRARY_PATH} = "/usr/local/cuda-7.5/lib64:/lib";
################################################################################
##END EXTRACT LAST FRAME
################################################################################

################################################################################
##Extract water box size
################################################################################
#The first line of the PDB file is the CRYST line; fields 1-3 are the box lengths
my @pdb_input_ini = read_file("md2-sim2-imaged-stripped-p.pdb") or die;
my $line_cryst= $pdb_input_ini[0];
my @line_cryst_handle = split ( /\s+/, $line_cryst );
my $x_box= $line_cryst_handle[1];
my $y_box= $line_cryst_handle[2];
my $z_box= $line_cryst_handle[3];
print "\nx_box is *$x_box*\n";
print "y_box is *$y_box*\n";
print "z_box is *$z_box*\n";
################################################################################
##END Extract water box size
################################################################################
################################################################################
##AMEND PDB
################################################################################
#Note:


##add again NCTER atoms removed by simulation routine
##and remove OXT atoms, because they are not supported by xLeap
##and remove del_Cl amount of Cl- ions
my @pdb_input = read_file("md2-sim2-imaged-stripped-p.pdb") or die;
my @pdb_output;
my $array_size=scalar @pdb_input;
my $outfile= "md2-sim2-imaged-stripped-amended-p.pdb";
##Copy PDB file:
for (my $count=0; $count<$array_size; $count++) {
    $pdb_output[$count] = $pdb_input[$count];
}
############## UPDATE NCTER lines
my $update_NTER;
my $update_CTER;
my $y=2;
##Residue names treated as protein residues when classifying a TER line
my %is_protein_res = map { $_ => 1 } qw(ALA ARG ASH ASN ASP CYM CYS CYX GLN GLU GLY HID HIE HIP HYP ILE LEU LYN LYS MET PHE PRO SER THR TRP TYR VAL);
for (my $count=0; $count<$array_size; $count++) {
    my $line= $pdb_input[$count];
    my $trigger_TER=substr($line, 0, 3);
    ############ Detect TER ##################
    if ($trigger_TER eq "TER") {
        #Detect if TER event occurs:
        #1/ at the start of the protein
        #2/ in between two protein segments
        #3/ at the end of the protein
        #To do so, look at the residue type preceding and following TER line:
        my $line_prec= $pdb_input[$count-1];
        my $RP=substr($line_prec, 17, 3);
        my $line_foll= $pdb_input[$count+1];
        my $RF=substr($line_foll, 17, 3);
        if ($is_protein_res{$RP}) {
            $update_CTER = "on";
        }
        if ($is_protein_res{$RF}) {
            $update_NTER = "on";
        }
        ## UPDATE NTER
        if ($update_NTER eq "on") {
            ##Go forwards on residue length
            my $line_NTER_segment_start=$pdb_input[$count+1];
            my $resid_NTER_segment_start=substr($line_NTER_segment_start, 22, 4);


            for (my $c=1; $c<$y; $c++) {
                my $line_NTER_segment= $pdb_input[$count+$c];
                my $resid_NTER_segment=substr($line_NTER_segment, 22, 4);
                if ($resid_NTER_segment==$resid_NTER_segment_start) {
                    $y++;
                    #Prefix the residue name with N (column 17 of the PDB line)
                    my $sel1=substr($line_NTER_segment, 0, 16);
                    my $sel2="N";
                    my $sel3=substr($line_NTER_segment, 17, 63);
                    my $sel_tot=$sel1 . $sel2 . $sel3;
                    $pdb_output[$count+$c]="$sel_tot\n";
                }
            }
            $update_NTER = "off";
        }
        ## END UPDATE NTER
        ## UPDATE CTER
        if ($update_CTER eq "on") {
            ##Go backwards on residue length
            my $line_CTER_segment_start= $pdb_input[$count-1];
            my $resid_CTER_segment_start=substr($line_CTER_segment_start, 22, 4);
            for (my $c=1; $c<$y; $c++) {
                my $line_CTER_segment= $pdb_input[$count-$c];
                my $resid_CTER_segment=substr($line_CTER_segment, 22, 4);
                if ($resid_CTER_segment==$resid_CTER_segment_start) {
                    $y++;
                    #Prefix the residue name with C (column 17 of the PDB line)
                    my $sel1=substr($line_CTER_segment, 0, 16);
                    my $sel2= "C";
                    my $sel3=substr($line_CTER_segment, 17, 63);
                    my $sel_tot = $sel1 . $sel2 . $sel3;
                    $pdb_output[$count-$c] = "$sel_tot\n";
                }
            }
            $update_CTER = "off";
        }
        ## END UPDATE CTER
    }
    ############ END Detect TER ##################
}
############## END of line loop and END UPDATE NCTER lines
# print to file
open (FILE, "> $outfile") || die;
for (my $count_2=0; $count_2<$array_size; $count_2++) {
    print (FILE $pdb_output[$count_2]);
}
close (FILE);
#Remove some Cl- to balance the NTPs to be injected
#Note: $del_Cl is calculated in phase 1
my $done=0;
my $update_count=0;
my $count_update_output=0;
my @pdb_input = read_file("md2-sim2-imaged-stripped-amended-p.pdb") or die;
my @pdb_output;
my $array_size=scalar @pdb_input;
my $outfile= "md2-sim2-imaged-stripped-amended2-p.pdb";
for (my $count=0; $count<$array_size; $count++) {


    my $line= $pdb_input[$count];
    my $trigger_Cl=substr($line, 17, 3);
    my $atom=substr($line, 13, 3);
    if ($done==0){
        if ($trigger_Cl eq "Cl-") {
            #then remove twice the number of lines corresponding to
            #nb of Cl to be removed in order to account for TER
            $update_count=$del_Cl*2;
            $done=1;
        }
    }
    if ($atom ne "OXT"){
        $pdb_output[$count+$count_update_output] = $pdb_input[$count+$update_count];
    }
    if ($atom eq "OXT"){
        $count_update_output--;
    }
}
# print to file
open (FILE, "> $outfile") || die;
for (my $count_2=0; $count_2<($array_size - $update_count); $count_2++) {
    print (FILE $pdb_output[$count_2]);
}
close (FILE);
################################################################################
##END AMEND PDB
################################################################################

################################################################################
##INJECT NTPs
################################################################################
my $cmd = "/home/ng/amber16/bin/AddToBox -c md2-sim2-imaged-stripped-amended2-p.pdb -a gtp.pdb -na $nb_gtp -o out16.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
################################################################################
##END INJECT NTPs
################################################################################

################################################################################
##AMEND NTPs
################################################################################
#Note:
#MgB is supplemented directly with GTP (same resid for AddToBox)
#Hence now MgB residues are to be specified in their own resid
my $count=0;
my $update_resid=0;
my $count_update_output=0;
my @pdb_input = read_file("out16.pdb") or die;
my @pdb_output;


my $array_size=scalar @pdb_input;
my $outfile= "out16-amended.pdb";
########### LINE LOOP
for (my $count=0; $count<$array_size; $count++) {
    my $line= $pdb_input[$count];
    my $resid=substr($line, 22, 7);
    my $resname=substr($line, 17, 3);
    my $atom=substr($line, 13, 3);
    $pdb_output[$count+$count_update_output] = "$line";
    #MG atom of a gtp residue: rename residue to MG, renumber it and close it with TER
    if (($resname eq "gtp") and ($atom eq "MG ")){
        my $sel1=substr($line, 0, 17);
        my $sel2="MG ";
        my $sel3=substr($line, 20, 2);
        my $sel4=$resid+$update_resid;
        my $sel5=substr($line, 26, 40);
        $pdb_output[$count+$count_update_output] = $sel1 . $sel2 . $sel3 . $sel4 . $sel5;
        $update_resid++;
        $pdb_output[$count+$count_update_output+1] = "TER \n";
        $count_update_output++;
    }
    #Other gtp atoms: shift the resid by the number of MG residues created so far
    if (($resname eq "gtp") and ($atom ne "MG ")){
        my $sel1=substr($line, 0, 22);
        my $sel2=$resid+$update_resid;
        my $sel3=substr($line, 26, 40);
        $pdb_output[$count+$count_update_output] = $sel1 . $sel2 . $sel3;
    }
}
open (FILE, "> $outfile") || die;
for (my $count_2=0; $count_2<$array_size+$count_update_output; $count_2++) {
    print (FILE $pdb_output[$count_2]);
}
close (FILE);
################################################################################
##END AMEND NTPs
################################################################################

################################################################################
##ADD WATER AGAIN
################################################################################
my $cmd = "/home/ng/amber16/bin/AddToBox -c out16-amended.pdb -a WAT.pdb -na $wat -o out17.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1";
system($cmd);
################################################################################
##END ADD WATER AGAIN
################################################################################

################################################################################
##EXECUTE second non-dummy Leap run
################################################################################


#Note:
##Execute Leap run,
#to generate simulation ready files
#and to be able to count nb of atoms used for the upcoming aMD run
my $outfile="leap-4.scrpt";
open (FILE2, "> $outfile") || die;
print (FILE2 "source leaprc.protein.ff14SB\n");
print (FILE2 "source leaprc.DNA.OL15\n");
print (FILE2 "source leaprc.RNA.OL3\n");
print (FILE2 "loadoff atomic_ions.lib\n");
print (FILE2 "loadamberparams frcmod.ions1lm_1264_tip4pew\n");
print (FILE2 "loadamberparams frcmod.ions234lm_1264_tip4pew\n");
print (FILE2 "loadoff solvents.lib\n");
print (FILE2 "loadamberparams frcmod.tip4pew\n");
print (FILE2 "loadoff zaa-new.off\n");
print (FILE2 "loadoff SUL.lib\n");
print (FILE2 "loadamberparams frcmod.sul\n");
print (FILE2 "loadamberprep gtp.prep\n");
print (FILE2 "loadamberparams frcmod.gtp\n");
print (FILE2 "sys = loadpdb out17.pdb\n");
print (FILE2 "setBox sys vdw 1.0\n");
print (FILE2 "set sys box {$x_box $y_box $z_box}\n");
print (FILE2 "saveamberparm sys out17.prmtop out17.inpcrd\n");
print (FILE2 "savepdb sys out17-leap.pdb\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "/home/ng/amber16/bin/xleap -s -f leap-4.scrpt > out-leap4.out";
system($cmd);
#keep file:
rename "leap.log", "leap-4.log";
#Apply r-4 term to Lennard-Jones potential
#and apply Panteva 2015 m1264 refined set
my $outfile="parmed.in";
open (FILE2, "> $outfile") || die;
print (FILE2 "setOverwrite True\n");
print (FILE2 "change AMBER_ATOM_TYPE :A*,DA*\@N7 NAMG\n");
print (FILE2 "change AMBER_ATOM_TYPE :G*,DG*\@N7 NGMG\n");
print (FILE2 "change AMBER_ATOM_TYPE :*\@OP* OPMG\n");
print (FILE2 "addLJType @\%NAMG\n");
print (FILE2 "addLJType @\%NGMG\n");
print (FILE2 "addLJType @\%OPMG\n");
print (FILE2 "add12_6_4 :ZN watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :ZN\n");
print (FILE2 "add12_6_4 :MG watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :MG\n");
print (FILE2 "add12_6_4 :Na+ watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :Na+\n");
print (FILE2 "add12_6_4 :Cl- watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :Cl-\n");
print (FILE2 "add12_6_4 :CA watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :CA\n");
print (FILE2 "add12_6_4 :K+ watermodel TIP4PEW\n");
print (FILE2 "printLJMatrix :K+\n");
print (FILE2 "outparm out17-parmed.prmtop out17-parmed.inpcrd\n");
print (FILE2 "quit\n");
close (FILE2);


my $cmd = "/home/ng/amber16/bin/parmed -i parmed.in -p out17.prmtop -c out17.inpcrd >out-parmed2.txt";
system($cmd);
################################################################################
##END EXECUTE second non-dummy Leap run
################################################################################

################################################################################
##EXTRACT NB ATOMS (for aMD)
################################################################################
#The atom serial number is read from the second-to-last line of the PDB file
my @pdb_input=read_file("out17-leap.pdb") or die;
my $array_size=scalar @pdb_input;
my $nb_atoms=substr($pdb_input[$array_size-2], 6, 6);
print "\nnb_atoms is *$nb_atoms*\n";
################################################################################
##END EXTRACT NB ATOMS (for aMD)
################################################################################

################################################################################
##EXECUTE second round of simulations
################################################################################
##################### MIN
my $cmd = "/home/ng/amber16/bin/sander -O -i min1.in -o min1.out -p out17-parmed.prmtop -c out17-parmed.inpcrd -r min1.rst -ref out17-parmed.inpcrd";
system($cmd);
my $cmd = "/home/ng/amber16/bin/sander -O -i min2.in -o min2.out -p out17-parmed.prmtop -c min1.rst -r min2.rst";
system($cmd);
################################################################################
##EXECUTE next preliminary steps with OPENMM
################################################################################
##################### HEAT (MD1) 20 ps
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");
print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");
print (FILE2 "\n");
print (FILE2 "parm = load_file('out17-parmed.prmtop', 'min2.rst')\n");


print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n");
print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n");
#Add restraints to protein
print (FILE2 "force_res_prot = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n");
print (FILE2 "force_res_prot.addGlobalParameter(\"k\", 10.0*kilocalories_per_mole/angstroms**2)\n");
print (FILE2 "force_res_prot.addPerParticleParameter(\"x0\")\n");
print (FILE2 "force_res_prot.addPerParticleParameter(\"y0\")\n");
print (FILE2 "force_res_prot.addPerParticleParameter(\"z0\")\n");
print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n");
print (FILE2 "    if (i <= $last_protein_id):\n");
print (FILE2 "        force_res_prot.addParticle(i, atom_crd.value_in_unit(nanometers))\n");
print (FILE2 "system.addForce(force_res_prot)\n");
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");
print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n");
print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");
print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 100, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('md1.nc', 5000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('md1.rst7', 10000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
print (FILE2 "simulation.step(10000)\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-md1.txt";
system($cmd);

##################### EQ-VEL (MD2-eq) 100 ps
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");
print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");
print (FILE2 "\n");
print (FILE2 "parm = load_file('out17-parmed.prmtop', 'md1.rst7.10000')\n");
print (FILE2 "inpcrd = AmberInpcrdFile('md1.rst7.10000')\n");
print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n");
print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n");
#Add restraints to anchors


print (FILE2 "force_anchors = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n");
print (FILE2 "force_anchors.addGlobalParameter(\"k\", 50.0*kilocalories_per_mole/angstroms**2)\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"x0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"y0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"z0\")\n");
print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n");
print (FILE2 "    if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n");
print (FILE2 "        force_anchors.addParticle(i, atom_crd.value_in_unit(nanometers))\n");
print (FILE2 "system.addForce(force_anchors)\n");
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");
print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n");
print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");
print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n");
print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 1000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True, volume=True, density=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('md2-eq.nc', 10000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('md2-eq.rst7', 50000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
print (FILE2 "simulation.step(50000)\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-md2-eq.txt";
system($cmd);

##################### EQ-BOX (MD2-sim1) 20 ns
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");
print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");
print (FILE2 "\n");
print (FILE2 "parm = load_file('out17-parmed.prmtop', 'md2-eq.rst7.50000')\n");
print (FILE2 "inpcrd = AmberInpcrdFile('md2-eq.rst7.50000')\n");
print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n");
print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n");
print (FILE2 "system.addForce(MonteCarloBarostat(1*bar, 300*kelvin))\n");
#Add restraints to anchors
print (FILE2 "force_anchors = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n");


print (FILE2 "force_anchors.addGlobalParameter(\"k\", 50.0*kilocalories_per_mole/angstroms**2)\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"x0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"y0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"z0\")\n");
print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n");
print (FILE2 "    if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n");
print (FILE2 "        force_anchors.addParticle(i, atom_crd.value_in_unit(nanometers))\n");
print (FILE2 "system.addForce(force_anchors)\n");
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");
print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n");
print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");
print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n");
print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 250000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True, volume=True, density=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('md2-sim1.nc', 250000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('md2-sim1.rst7', 1000000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
print (FILE2 "simulation.step(10000000)\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-md2-sim1.txt";
system($cmd);

##################### EQ-VEL2 (MD2-sim2) 20 ns
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");
print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");
print (FILE2 "\n");
print (FILE2 "def forcegroupify(system):\n");
print (FILE2 "    forcegroups = {}\n");
print (FILE2 "    for i in range(system.getNumForces()):\n");
print (FILE2 "        force = system.getForce(i)\n");
print (FILE2 "        force.setForceGroup(i)\n");
print (FILE2 "        forcegroups[force] = i\n");
print (FILE2 "    return forcegroups\n");
print (FILE2 "\n");
print (FILE2 "parm = load_file('out17-parmed.prmtop', 'md2-sim1.rst7.10000000')\n");
print (FILE2 "inpcrd = AmberInpcrdFile('md2-sim1.rst7.10000000')\n");


print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n");
print (FILE2 "fgrps=forcegroupify(system)\n");
print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n");
#Add restraints to anchors
print (FILE2 "force_anchors = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n");
print (FILE2 "force_anchors.addGlobalParameter(\"k\", 50.0*kilocalories_per_mole/angstroms**2)\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"x0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"y0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"z0\")\n");
print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n");
print (FILE2 "    if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n");
print (FILE2 "        force_anchors.addParticle(i, atom_crd.value_in_unit(nanometers))\n");
print (FILE2 "system.addForce(force_anchors)\n");
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");
print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n");
print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");
print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n");
print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter('out-openmm-md2-sim2.txt', 250000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True, volume=True, density=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('md2-sim2.nc', 250000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('md2-sim2.rst7', 1000000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
##Simulate in checkpoints in order to supervise data
#40 * 250000 = 10000000 steps = 20 ns
print (FILE2 "for i in range (40):\n");
print (FILE2 "    simulation.step(250000)\n");
#print total potential energy
print (FILE2 "    y = simulation.context.getState(getEnergy=True).getPotentialEnergy()\n");
print (FILE2 "    y = y/4.184\n");
print (FILE2 "    print(\"ET =\", y)\n");
#print dihedral potential energy
print (FILE2 "    x = simulation.context.getState(getEnergy=True,groups=4).getPotentialEnergy()\n");
print (FILE2 "    x = x/4.184\n");
print (FILE2 "    print(\"Ed =\", x)\n");
print (FILE2 "simulation.saveState('md2-sim2.rst7')\n");
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "~/anaconda3/bin/python openmm-input.py >out-Ep.txt";
system($cmd);
################################################################################
##END EXECUTE next preliminary steps with OPENMM
################################################################################


################################################################################
##EXTRACT aMD parameters
################################################################################
my $median_EPtot;
my $median_DIHED;
my $count=0;
my @md_output = read_file("out-Ep.txt", chomp => 1) or die;
my @EPtot_output; #define array for total potential energy values
my @DIHED_output; #define array for dihedral energy values
my $array_size=scalar @md_output;
my @line_handle;
my $Ep;
my $type;
#Each energy line reads "ET = <value> ..." or "Ed = <value> ...":
#field 0 is the label and field 2 the value (in kcal/mol)
for (my $count=0; $count<$array_size; $count++) {
    @line_handle = split ( /\s+/, $md_output[$count] );
    $type = $line_handle[0];
    $Ep = $line_handle[2];
    if ($type eq "ET"){
        push(@EPtot_output,($Ep));
    }
    if ($type eq "Ed"){
        push(@DIHED_output,($Ep));
    }
}
#computes basic statistics on data
print "\nEPtot statistics:\n";
my $EPtot_stat=Statistics::Descriptive::Full->new();
$EPtot_stat->add_data(@EPtot_output);
my $median=$EPtot_stat->median();
$median_EPtot= round($median);
print "\nMedian value chosen for EPtot aMD parameter calculation is: $median_EPtot\n\n";
print "\nDIHED statistics:\n";
my $DIHED_stat=Statistics::Descriptive::Full->new();
$DIHED_stat->add_data(@DIHED_output);
my $median=$DIHED_stat->median();
$median_DIHED= round($median);
print "\nMedian value chosen for DIHED aMD parameter calculation is: $median_DIHED\n\n";
print "**********************************************************************";
print "\n\tCalculating parameters for aMD simulation\n";
print "\talpha factor:\t\t0.20\n\tnumber of residues:\t$nb_protein_res\n\tnumber of atoms:\t$nb_atoms\n\tDIHED:\t\t\t$median_DIHED\n\tEPtot:\t\t\t$median_EPtot\t\n\n";
#All aMD parameters are converted from kcal/mol to kJ/mol (factor 4.184)
print "Boosting DIHEDRAL potential:";
my $energy_contribution=$nb_protein_res*(3.5*4.184);
print "\n\tenergy contribution (3.5kcal/mol/residue) =\t$energy_contribution";
my $alphaD=round($energy_contribution*0.20);
print "\n\talphaD \t (rounded) =\t\t\t\t$alphaD";
my $EthreshD=round($energy_contribution+($median_DIHED*4.184));
print "\n\tEthreshD (rounded) =\t\t\t\t$EthreshD";
print "\n\nBoosting EPtot potential:";
my $alphaP=round($nb_atoms*(0.20*4.184));
print "\n\talphaP \t (rounded) =\t\t\t\t$alphaP";


my $EthreshP=round(($median_EPtot*4.184)+$alphaP);
print "\n\tEthreshP (rounded) =\t\t\t\t$EthreshP";
################################################################################
##END EXTRACT aMD parameters
################################################################################

################################################################################
##EXECUTE aMD
################################################################################
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");
print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");
print (FILE2 "\n");
print (FILE2 "def forcegroupify(system):\n");
print (FILE2 "    forcegroups = {}\n");
print (FILE2 "    for i in range(system.getNumForces()):\n");
print (FILE2 "        force = system.getForce(i)\n");
print (FILE2 "        force.setForceGroup(i)\n");
print (FILE2 "        forcegroups[force] = i\n");
print (FILE2 "    return forcegroups\n");
print (FILE2 "\n");
print (FILE2 "parm = load_file('out17-parmed.prmtop', 'md2-sim2.rst7.10000000')\n");
print (FILE2 "inpcrd = AmberInpcrdFile('md2-sim2.rst7.10000000')\n");
print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n");
print (FILE2 "fgrps=forcegroupify(system)\n");
print (FILE2 "integrator = DualAMDIntegrator(2*femtoseconds, 2, $alphaP, $EthreshP, $alphaD, $EthreshD)\n");
print (FILE2 "system.addForce(AndersenThermostat(300*kelvin, 1.0/picosecond))\n");
#Add restraints to anchors
print (FILE2 "force_anchors = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n");
print (FILE2 "force_anchors.addGlobalParameter(\"k\", 50.0*kilocalories_per_mole/angstroms**2)\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"x0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"y0\")\n");
print (FILE2 "force_anchors.addPerParticleParameter(\"z0\")\n");
print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n");
print (FILE2 "    if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n");
print (FILE2 "        force_anchors.addParticle(i, atom_crd.value_in_unit(nanometers))\n");
print (FILE2 "system.addForce(force_anchors)\n");
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");


print (FILE2 "test = Platform.getPluginLoadFailures()\n"); print (FILE2 "print(\"test-platform is\", test)\n"); print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n"); print (FILE2 "test = Platform.getPluginLoadFailures()\n"); print (FILE2 "print(\"test-platform is\", test)\n"); print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n"); print (FILE2 "simulation.context.setPositions(parm.positions)\n"); print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n"); print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n"); print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 250000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True, volume=True, density=True))\n"); print (FILE2 "simulation.reporters.append(NetCDFReporter('aMD2.nc', 250000, crds=True))\n"); print (FILE2 "restrt = RestartReporter('aMD2.rst7', 250000, parm.ptr('natom'), netcdf=True)\n"); print (FILE2 "simulation.reporters.append(restrt)\n"); print (FILE2 "simulation.step(50000000)\n"); print (FILE2 "quit\n"); close (FILE2); my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-aMD.txt"; system($cmd); ################################################################################ ##END EXECUTE aMD ################################################################################ exit;
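The potential-energy threshold passed to the aMD integrator in the script above is the median EPtot converted from kcal/mol to kJ/mol (factor 4.184), plus alphaP, rounded to the nearest integer. The arithmetic can be sketched in Python as follows. The function names and sample values are hypothetical, and the sketch assumes that alphaP is already expressed in kJ/mol and that rounding goes half away from zero, as Perl's Math::Round does.

```python
import math

KCAL_TO_KJ = 4.184  # conversion factor used in the Perl script


def round_half_away(value):
    """Round to the nearest integer, halves away from zero (assumed Math::Round behaviour)."""
    return int(math.copysign(math.floor(abs(value) + 0.5), value))


def amd_threshold_kj(median_epot_kcal, alpha_kj):
    """Hypothetical helper: boost threshold = median energy (converted to kJ/mol) + alpha."""
    return round_half_away(median_epot_kcal * KCAL_TO_KJ + alpha_kj)


# Illustrative values only, not taken from the simulations
print(amd_threshold_kj(-100.0, 10.0))
```

Python's built-in round() uses banker's rounding, which is why an explicit helper is used here to mirror the Perl behaviour.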


Appendix 2: sMD simulation procedure
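The listing that follows starts by placing an entry checkpoint (CK0) outside the protein: it takes the midpoint L1 of two landmark atoms, builds the unit vector pointing from landmark L0 to L1, and moves 15 Å past L1 along that vector. A cube extending 8 Å on each side of CK0 is then used to insert the GTP. The geometry can be sketched in Python as below; the function names and sample coordinates are hypothetical and serve only to illustrate the calculation.

```python
import math


def midpoint(a, b):
    """Midpoint of two 3D points given as (x, y, z) tuples."""
    return tuple((p + q) / 2.0 for p, q in zip(a, b))


def checkpoint_beyond(l0, l1, offset=15.0):
    """Point located `offset` past l1 along the unit vector l0 -> l1."""
    vec = [q - p for p, q in zip(l0, l1)]
    norm = math.sqrt(sum(c * c for c in vec))
    if norm == 0.0:
        raise ValueError("l0 and l1 coincide; direction is undefined")
    return tuple(q + offset * c / norm for q, c in zip(l1, vec))


def box_edges(centre, half_width=8.0):
    """(min, max) edges of a cube centred on `centre`, one pair per axis."""
    return tuple((c - half_width, c + half_width) for c in centre)


# Illustrative coordinates in angstroms, not taken from the structure
l1 = midpoint((10.0, 0.0, 0.0), (14.0, 0.0, 0.0))
ck0 = checkpoint_beyond((2.0, 0.0, 0.0), l1)
print(ck0)
print(box_edges(ck0))
```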

use File::Slurp;
use Math::Round;
use autodie;
use warnings qw(all);

# Environment for Amber16, OpenMM and CUDA
$ENV{PYTHONPATH} = "/home/ng/amber16/lib/python2.7/site-packages";
$ENV{OPENMM_CUDA_COMPILER} = "/usr/local/cuda-8.0/bin/nvcc";
$ENV{LD_LIBRARY_PATH} = "/usr/local/cuda-7.5/lib64:/lib";
$ENV{PATH} = "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin";
$ENV{AMBERHOME} = "/home/ng/amber16";

# Number of water molecules to add, number of protein residues, box dimensions
my $wat=159600;
my $nb_protein_res=3795;
my $x_box=168.024;
my $y_box=187.235;
my $z_box=170.807;
my $last_protein_id=61866;

# Atom index ranges of the restrained anchors
my $dna1=585;
my $dna2=615;
my $dna3=1777;
my $dna4=1809;
my $dna5=1810;
my $dna6=1839;
my $dna7=3028;
my $dna8=3058;

# Working variables used while parsing PDB lines
my $line;
my $trigger_Cl;
my $trigger_MG;
my $atom;
my $resid;
my $atom_type;
my $trigger_gtp;
my $resname;

# Landmark coordinates
my $L0x; my $L0y; my $L0z;
my $L1Ax; my $L1Ay; my $L1Az;
my $L1Bx; my $L1By; my $L1Bz;
my $L2x; my $L2y; my $L2z;
my $L3x; my $L3y; my $L3z;
my $L4x; my $L4y; my $L4z;
my $L1x; my $L1y; my $L1z;

# Landmark atom indexes
my $L1A_id;


my $L1B_id; my $L2_id; my $L3_id; my $L4_id; my $gtp_first_id; my $smd_atom_id; my $gtp_last_id; my $last_step; my $exit; ################################################################################ ##EXTRACT LAST FRAME ################################################################################ $ENV{LD_LIBRARY_PATH} = "/home/ng/amber16/lib"; my $outfile="scr-frame.vmd"; open (FILE2, "> $outfile") || die; print (FILE2 "set outFile out-frame.txt\n"); print (FILE2 "set out [open \$outFile w]\n"); print (FILE2 "set mol [mol new out17-parmed.prmtop]\n"); print (FILE2 "mol addfile aMD2-rst.nc waitfor all molid \$mol\n"); print (FILE2 "set n [molinfo top get numframes]\n"); print (FILE2 "puts \$out \"\$n\"\n"); print (FILE2 "exit\n"); close (FILE2); my $cmd = "vmd -dispdev text -nt -e scr-frame.vmd"; system($cmd); my @pdb_input_ini = read_file("out-frame.txt") or die; my $last_frame= @pdb_input_ini[0]; $last_frame =~ s/^\s+|\s+$//g; print "last_frame is *$last_frame*\n"; my $outfile="autoimage.ptraj"; open (FILE2, "> $outfile") || die; print (FILE2 "trajin aMD2-rst.nc $last_frame $last_frame 1\n"); print (FILE2 "strip :WAT\n"); print (FILE2 "strip :gtp\n"); print (FILE2 "trajout frame.pdb\n"); close (FILE2); my $cmd = "/home/ng/amber16/bin/cpptraj out17-parmed.prmtop < autoimage.ptraj > out-ptraj3.txt"; system($cmd); $ENV{LD_LIBRARY_PATH} = "/usr/local/cuda-7.5/lib64:/lib"; ################################################################################ ##END EXTRACT LAST FRAME ################################################################################ ################################################################################################# #################################### PRELIMINARY PROCEDURES ############################### #################################################################################################


my @pdb_input = read_file("frame.pdb") or die;
my @pdb_output;
my $array_size=scalar @pdb_input;
my $outfile= "frame-out.pdb";

##Copy PDB file:
for (my $count=0; $count < $array_size; $count++) {
    $pdb_output[$count] = $pdb_input[$count];
}

############## UPDATE NCTER lines
my $update_NTER = "off";
my $update_CTER = "off";
my $y=2;

#Residue names recognised as protein residues
my %is_protein_res = map { $_ => 1 } qw(ALA ARG ASH ASN ASP CYM CYS CYX GLN GLU GLY HID HIE HIP HYP ILE LEU LYN LYS MET PHE PRO SER THR TRP TYR VAL);

for (my $count=0; $count < $array_size; $count++) {
    my $line= $pdb_input[$count];
    my $trigger_TER=substr($line, 0, 3);
    ############ Detect TER ##################
    if (($trigger_TER eq "TER") or ($trigger_TER eq "CRY")){
        #Detect if TER event occurs:
        #1/ at the start of the protein
        #2/ in between two protein segments
        #3/ at the end of the protein
        #To do so, look at the residue type preceding and following TER line:
        my $line_prec= $pdb_input[$count-1];
        my $RP=substr($line_prec, 17, 3);
        my $line_foll= $pdb_input[$count+1];
        my $RF=substr($line_foll, 17, 3);
        #A protein residue before TER marks a C-terminus
        if ($is_protein_res{$RP}) {
            $update_CTER = "on";
        }
        #A protein residue after TER marks an N-terminus
        if ($is_protein_res{$RF}) {
            $update_NTER = "on";
        }
        ## UPDATE NTER
        if ($update_NTER eq "on") {
            ##Go forwards on residue length
            my $line_NTER_segment_start=$pdb_input[$count+1];
            my $resid_NTER_segment_start=substr($line_NTER_segment_start, 22, 4);
            for (my $c=1; $c < $y; $c++) {
                my $line_NTER_segment= $pdb_input[$count+$c];
                $line_NTER_segment =~ s/\s*$//;
                my $resid_NTER_segment=substr($line_NTER_segment, 22, 4);


if ($resid_NTER_segment==$resid_NTER_segment_start) { $y++; my $sel1=substr($line_NTER_segment, 0, 16); my $sel2="N"; my $sel3=substr($line_NTER_segment, 17, 62); my $sel_tot=$sel1 . $sel2 . $sel3; @pdb_output[$count+$c]="$sel_tot\n"; } } $update_NTER = "off" } ## END UPDATE NTER ## UPDATE CTER if ($update_CTER eq "on") { ##Go backwards on residue length my $line_CTER_segment_start= @pdb_input[$count-1]; my $resid_CTER_segment_start=substr($line_CTER_segment_start, 22, 4); for (my $c=1; $c<$y; $c++) { my $line_CTER_segment= @pdb_input[$count-$c]; $line_CTER_segment =~ s/\s*$//; my $resid_CTER_segment=substr($line_CTER_segment, 22, 4); if ($resid_CTER_segment==$resid_CTER_segment_start) { $y++; my $sel1=substr($line_CTER_segment, 0, 16); my $sel2= "C"; my $sel3=substr($line_CTER_segment, 17, 62); my $sel_tot = $sel1 . $sel2 . $sel3; @pdb_output[$count-$c] = "$sel_tot\n"; } } $update_CTER = "off" } ## END UPDATE CTER } ############ END Detect TER ################## } ############## END of line loop and END UPDATE NCTER lines # print to file open (FILE, "> $outfile"); for (my $count_2=0; $count_2<$array_size; $count_2++) { print (FILE"@pdb_output[$count_2]"); } close (FILE); #Remove 2 Cl- to balance the NTPs to be injected #and remove OXT atoms my $done=0; my $update_count=0; my $update_count2=0; my $count_update_output=0; my @pdb_input = read_file("frame-out.pdb") or die; my @pdb_output; my $array_size=scalar @pdb_input; my $outfile= "out1.pdb"; my $del_Cl=32; for (my $count=0; $count<$array_size; $count++) {


$line= @pdb_input[$count]; $trigger_Cl=substr($line, 17, 3); $trigger_MG=substr(@pdb_input[$count+$update_count], 22, 4); $atom=substr($line, 13, 3); if ($done==0){ if ($trigger_Cl eq "Cl-") { #then remove twice the number of lines corresponding to #nb of Cl to be removed in order to account for TER $update_count=2*$del_Cl; $done=1; } } #$trigger_MG < 5565 to remove MgB atoms if (($atom ne "OXT") and ($trigger_MG < 5565)){ @pdb_output[$count+$count_update_output] = @pdb_input[$count+$update_count]; } if ($atom eq "OXT"){ $count_update_output--; } } # print to file open (FILE, "> $outfile"); for (my $count_2=0; $count_2<($array_size - $update_count); $count_2++) { print (FILE"@pdb_output[$count_2]"); } close (FILE); #Specify cubic region defined by the projection of a point #outside of the protein, in front of landmark 1 my @pdb_input = read_file("out1.pdb") or die; my $array_size=scalar @pdb_input; for (my $count=0; $count<$array_size; $count++) { $line= @pdb_input[$count]; $resid=substr($line, 22, 4); $atom_type=substr($line, 12, 4); if (($resid == 1317) and ($atom_type eq ' CG ')){ $L1Ax=substr($line, 31, 7); $L1Ay=substr($line, 39, 7); $L1Az=substr($line, 47, 7); } if (($resid == 3126) and ($atom_type eq ' CG ')){ $L1Bx=substr($line, 31, 7); $L1By=substr($line, 39, 7); $L1Bz=substr($line, 47, 7); } if (($resid == 76) and ($atom_type eq ' CG ')){ $L0x=substr($line, 31, 7); $L0y=substr($line, 39, 7); $L0z=substr($line, 47, 7); } } #Checkpoint 0 calculation : $L1x=($L1Ax+$L1Bx)/2;


$L1y=($L1Ay+$L1By)/2;
$L1z=($L1Az+$L1Bz)/2;

#Unit vector pointing from landmark 0 to landmark 1
my $vec_x= $L1x - $L0x;
my $vec_y= $L1y - $L0y;
my $vec_z= $L1z - $L0z;
my $norm=sqrt($vec_x*$vec_x+$vec_y*$vec_y+$vec_z*$vec_z);
$vec_x=$vec_x/$norm;
$vec_y=$vec_y/$norm;
$vec_z=$vec_z/$norm;

#Checkpoint 0 lies 15 angstroms beyond landmark 1 along that vector
my $CK0x=$L1x+15*$vec_x;
my $CK0y=$L1y+15*$vec_y;
my $CK0z=$L1z+15*$vec_z;
print "CK0 is $CK0x $CK0y $CK0z\n";

#Format the checkpoint coordinates for later use in PDB lines
my $x= sprintf "%.3f", $CK0x;
my $y= sprintf "%.3f", $CK0y;
my $z= sprintf "%.3f", $CK0z;
if ($x < 100){ $x= " ". $x; }
if ($y < 100){ $y= " ". $y; }
if ($z < 100){ $z= " ". $z; }

#Extract inner box dimensions:
#edges:
my $edge1x=$CK0x-8;
my $edge2x=$CK0x+8;
my $edge1y=$CK0y-8;
my $edge2y=$CK0y+8;
my $edge1z=$CK0z-8;
my $edge2z=$CK0z+8;
#dimensions:
my $range_x=$edge2x-$edge1x;
my $range_y=$edge2y-$edge1y;
my $range_z=$edge2z-$edge1z;
#format dimensions for later use:
$range_x= sprintf "%.3f", $range_x;
$range_y= sprintf "%.3f", $range_y;
$range_z= sprintf "%.3f", $range_z;
if ($range_x < 100){ $range_x= " ". $range_x; }
if ($range_y < 100){ $range_y= " ". $range_y; }
if ($range_z < 100){ $range_z= " ". $range_z; }
my $cryst_line = "CRYST1 " . $range_x . " " . $range_y . " " . $range_z . " 90.00 90.00 90.00 1" . "\n";
print "cryst_line is $cryst_line\n";
print "edge x are $edge1x $edge2x\n";


print "edge y are $edge1y $edge2y\n"; print "edge z are $edge1z $edge2z\n"; my $x1= sprintf "%.3f", $edge1x; my $y1= sprintf "%.3f", $edge1y; my $z1= sprintf "%.3f", $edge1z; my $x2= sprintf "%.3f", $edge2x; my $y2= sprintf "%.3f", $edge2y; my $z2= sprintf "%.3f", $edge2z; if ($x1 < 100){ $x1= " ". $x1; } if ($x2 < 100){ $x2= " ". $x2; } if ($y1 < 100){ $y1= " ". $y1; } if ($y2 < 100){ $y2= " ". $y2; } if ($z1 < 100){ $z1= " ". $z1; } if ($z2 < 100){ $z2= " ". $z2; } print "ATOM 68743 Cl- Cl- 9999 $x $y $z 1.00 0.00\n"; print "ATOM 68743 Cl- Cl- 9999 $x1 $y1 $z1 1.00 0.00\n"; print "ATOM 68743 Cl- Cl- 9999 $x1 $y1 $z2 1.00 0.00\n"; print "ATOM 68743 Cl- Cl- 9999 $x1 $y2 $z1 1.00 0.00\n"; print "ATOM 68743 Cl- Cl- 9999 $x1 $y2 $z2 1.00 0.00\n"; print "ATOM 68743 Cl- Cl- 9999 $x2 $y1 $z1 1.00 0.00\n"; print "ATOM 68743 Cl- Cl- 9999 $x2 $y1 $z2 1.00 0.00\n"; print "ATOM 68743 Cl- Cl- 9999 $x2 $y2 $z1 1.00 0.00\n"; print "ATOM 68743 Cl- Cl- 9999 $x2 $y2 $z2 1.00 0.00\n"; #Extract inner box from pdb: my @pdb_input = read_file("out1.pdb") or die; my @pdb_output; my $array_size=scalar @pdb_input; my $outfile= "box.pdb"; my $count_output=2; my $atom_x; my $atom_y; my $atom_z; @pdb_output[0] = "\n"; @pdb_output[1] = $cryst_line; for (my $count=0; $count<$array_size; $count++) { $atom_x=substr($pdb_input[$count], 31, 7); $atom_y=substr($pdb_input[$count], 39, 7); $atom_z=substr($pdb_input[$count], 47, 7); if (($atom_x >= $edge1x) and ($atom_x <= $edge2x) and ($atom_y >= $edge1y) and ($atom_y <= $edge2y) and ($atom_z >= $edge1z) and ($atom_z <= $edge2z)){ @pdb_output[$count_output] = $pdb_input[$count]; $count_output++; }


} #if no atoms in the inner box, create artificially some reference if ($count_output==2) { @pdb_output[3] = "ATOM 68743 Cl- Cl- 9999 $x$y$z 1.00 0.00\n"; $count_output++; } # print to file open (FILE, "> $outfile"); for (my $count_2=0; $count_2<($count_output + 1); $count_2++) { print (FILE"@pdb_output[$count_2]"); } close (FILE); #Add gtp in the inner box my $cmd = "/home/ng/amber16/bin/AddToBox -c box.pdb -a gtp.pdb -na 1 -o out-box.pdb -P 0 -RP 2.0 -RW 3.0 -G 0.1 -V 1"; system($cmd); #Copy gtp in the global pdb my @pdb_input1 = read_file("out1.pdb") or die; my @pdb_input2 = read_file("out-box.pdb") or die; my @pdb_output; my $array_size1=scalar @pdb_input1; my $array_size2=scalar @pdb_input2; my $outfile= "out2.pdb"; my $i=1; for (my $count=0; $count<$array_size1-1; $count++) { @pdb_output[$count]=$pdb_input1[$count]; } @pdb_output[$array_size1]="TER\n"; for (my $count=0; $count<$array_size2; $count++) { $trigger_gtp=substr($pdb_input2[$count], 17, 3); if ($trigger_gtp eq 'gtp') { @pdb_output[$array_size1+$i]="$pdb_input2[$count]"; $i++; } } @pdb_output[$array_size1+$i]="END\n"; # print to file open (FILE, "> $outfile"); for (my $count_2=0; $count_2<($array_size1+$i+1); $count_2++) { print (FILE"@pdb_output[$count_2]"); } close (FILE); #Amend gtp my $count=0; my $update_resid=0; my $count_update_output=0; my @pdb_input = read_file("out2.pdb") or die; my @pdb_output; my $array_size=scalar @pdb_input; my $outfile= "out3.pdb"; ########### LINE LOOP


for (my $count=0; $count<$array_size; $count++) { $line= @pdb_input[$count]; $resid=9998; $resname=substr($line, 17, 3); $atom=substr($line, 13, 3); @pdb_output[$count+$count_update_output] = "$line"; if (($resname eq "gtp") and ($atom eq "MG ")){ my $sel1=substr($line, 0, 17); my $sel2="MG "; my $sel3=substr($line, 20, 2); my $sel4=$resid+$update_resid; my $sel5=substr($line, 26, 40); @pdb_output[$count+$count_update_output] = $sel1 . $sel2 . $sel3 . $sel4 . $sel5; $update_resid++; @pdb_output[$count+$count_update_output+1] = "TER \n"; $count_update_output++; } if (($resname eq "gtp") and ($atom ne "MG ")){ my $sel1=substr($line, 0, 22); my $sel2=$resid+$update_resid; my $sel3=substr($line, 26, 40); @pdb_output[$count+$count_update_output] = $sel1 . $sel2 . $sel3; } } open (FILE, "> $outfile"); for (my $count_2=0; $count_2<$array_size+$count_update_output; $count_2++) { print (FILE"@pdb_output[$count_2]"); } close (FILE); #Add water: my $cmd = "/home/ng/amber16/bin/AddToBox -c out3.pdb -a WAT.pdb -na $wat -o out4.pdb -P $nb_protein_res -RP 2.0 -RW 3.0 -G 0.1 -V 1"; system($cmd); my $outfile="leap.scrpt"; open (FILE2, "> $outfile") || die; print (FILE2 "source leaprc.protein.ff14SB\n"); print (FILE2 "source leaprc.DNA.OL15\n"); print (FILE2 "source leaprc.RNA.OL3\n"); print (FILE2 "loadoff atomic_ions.lib\n"); print (FILE2 "loadamberparams frcmod.ions1lm_1264_tip4pew\n"); print (FILE2 "loadamberparams frcmod.ions234lm_1264_tip4pew\n"); print (FILE2 "loadoff solvents.lib\n"); print (FILE2 "loadamberparams frcmod.tip4pew\n"); print (FILE2 "loadoff zaa-new.off\n"); print (FILE2 "loadoff SUL.lib\n"); print (FILE2 "loadamberparams frcmod.sul\n"); print (FILE2 "loadamberprep gtp.prep\n"); print (FILE2 "loadamberparams frcmod.gtp\n"); print (FILE2 "sys = loadpdb out4.pdb\n"); print (FILE2 "charge sys\n");


print (FILE2 "setBox sys vdw 1.0\n"); print (FILE2 "set sys box {$x_box $y_box $z_box}\n"); print (FILE2 "saveamberparm sys out4.prmtop out4.inpcrd\n"); print (FILE2 "savepdb sys out4-leap.pdb\n"); print (FILE2 "quit\n"); close (FILE2); my $cmd = "/home/ng/amber16/bin/xleap -s -f leap.scrpt > out-leap.out"; system($cmd); #keep file: rename "leap.log", "leap-1.log"; #Apply r-4 term to Lennard-Jones potential #and apply Panteva 2015 m1264 refined set my $outfile="parmed.in"; open (FILE2, "> $outfile") || die; print (FILE2 "setOverwrite True\n"); print (FILE2 "change AMBER_ATOM_TYPE :A*,DA*\@N7 NAMG\n"); print (FILE2 "change AMBER_ATOM_TYPE :G*,DG*\@N7 NGMG\n"); print (FILE2 "change AMBER_ATOM_TYPE :*\@OP* OPMG\n"); print (FILE2 "addLJType @\%NAMG\n"); print (FILE2 "addLJType @\%NGMG\n"); print (FILE2 "addLJType @\%OPMG\n"); print (FILE2 "add12_6_4 :ZN watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :ZN\n"); print (FILE2 "add12_6_4 :MG watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :MG\n"); print (FILE2 "add12_6_4 :Na+ watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :Na+\n"); print (FILE2 "add12_6_4 :Cl- watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :Cl-\n"); print (FILE2 "add12_6_4 :CA watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :CA\n"); print (FILE2 "add12_6_4 :K+ watermodel TIP4PEW\n"); print (FILE2 "printLJMatrix :K+\n"); print (FILE2 "outparm out4-parmed.prmtop out4-parmed.inpcrd\n"); print (FILE2 "quit\n"); close (FILE2); my $cmd = "/home/ng/amber16/bin/parmed -i parmed.in -p out4.prmtop -c out4.inpcrd >out-parmed.txt"; system($cmd); #EXTRACT LANDMARK INDEX #extract smd atom, landmarks, and gtp indexes my @pdb_input = read_file("out4-leap.pdb") or die; my $array_size=scalar @pdb_input; for (my $count=0; $count<$array_size; $count++) { $line= @pdb_input[$count]; $atom_type=substr($line, 12, 4); $resid=substr($line, 22, 4); $trigger_gtp=substr($line, 17, 3); if (($resid == 1317) and ($atom_type eq ' CG ')){ $L1A_id=substr($line, 6, 6) - 
1; } if (($resid == 3126) and ($atom_type eq ' CG ')){ $L1B_id=substr($line, 6, 6) - 1; } if (($resid == 1373) and ($atom_type eq ' CG ')){ $L2_id=substr($line, 6, 6) - 1;


} if (($resid == 38) and ($atom_type eq ' N3 ')){ $L3_id=substr($line, 6, 6) - 1; } if (($trigger_gtp eq 'gtp') and ($atom_type eq ' O1G')){ $gtp_first_id=substr($line, 6, 6) - 1; } if (($trigger_gtp eq 'gtp') and ($atom_type eq ' N1 ')){ $smd_atom_id=substr($line, 6, 6) - 1; } if (($trigger_gtp eq 'gtp') and ($atom_type eq 'HO\'2')){ $gtp_last_id=substr($line, 6, 6) - 1; } } print "L1A id is $L1A_id\n"; print "L1B id is $L1B_id\n"; print "L2 id is $L2_id\n"; print "L3 id is $L3_id\n"; print "gtp_first_id is $gtp_first_id\n"; print "smd_atom_id is $smd_atom_id\n"; print "gtp_last_id is $gtp_last_id\n"; ################################################################################################# #################################### END PRELIMINARY PROCEDURES ############################### ################################################################################################# ################################################################################################# #################################### PREPARE THE SYSTEM ############################### ################################################################################################# #Minimize the system my $outfile="min1.in"; open (FILE2, "> $outfile") || die; print (FILE2 "2e2h: initial minimisation solvent + ions\n"); print (FILE2 " &cntrl\n"); print (FILE2 " imin = 1,\n"); print (FILE2 " ntmin = 2,\n"); print (FILE2 " maxcyc = 5000,\n"); print (FILE2 " ncyc = 1000,\n"); print (FILE2 " ntb = 1,\n"); print (FILE2 " ntr = 1,\n"); print (FILE2 " cut = 10.0\n"); print (FILE2 " /\n"); print (FILE2 "Hold the protein fixed\n"); print (FILE2 "500.0\n"); print (FILE2 "RES 1 $nb_protein_res\n"); print (FILE2 "END\n"); print (FILE2 "END\n");


close (FILE2); my $outfile="min2.in"; open (FILE2, "> $outfile") || die; print (FILE2 "2e2h: initial minimisation whole system\n"); print (FILE2 " &cntrl\n"); print (FILE2 " imin = 1,\n"); print (FILE2 " ntmin = 2,\n"); print (FILE2 " maxcyc = 5000,\n"); print (FILE2 " ncyc = 2500,\n"); print (FILE2 " ntb = 1,\n"); print (FILE2 " ntr = 0,\n"); print (FILE2 " cut = 10.0\n"); print (FILE2 " /\n"); print (FILE2 "END\n"); close (FILE2); my $cmd = "/home/ng/amber16/bin/sander -O -i min1.in -o min1.out -p out4-parmed.prmtop -c out4-parmed.inpcrd -r min1.rst -ref out4-parmed.inpcrd"; system($cmd); my $cmd = "/home/ng/amber16/bin/sander -O -i min2.in -o min2.out -p out4-parmed.prmtop -c min1.rst -r min2.rst"; system($cmd); ##################### HEAT (MD1) 20 ps my $outfile="openmm-input.py"; open (FILE2, "> $outfile") || die; print (FILE2 "from __future__ import absolute_import\n"); print (FILE2 "from simtk.openmm import CustomIntegrator\n"); print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n"); print (FILE2 "from simtk.openmm import CustomExternalForce\n"); print (FILE2 "from simtk.openmm.app import *\n"); print (FILE2 "from simtk.openmm import *\n"); print (FILE2 "from simtk.unit import *\n"); print (FILE2 "from sys import stdout\n"); print (FILE2 "from parmed import load_file\n"); print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n"); print (FILE2 "from parmed.openmm.reporters import RestartReporter\n"); print (FILE2 "\n"); print (FILE2 "parm = load_file('out4-parmed.prmtop', 'min2.rst')\n"); print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n"); print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n"); #Add restraints to protein and gtp print (FILE2 "force_res_prot = CustomExternalForce(\"k*periodicdistance(x, y, z, x0, y0, z0)^2\")\n"); print (FILE2 "force_res_prot.addGlobalParameter(\"k\", 
10.0*kilocalories_per_mole/angstroms**2)\n"); print (FILE2 "force_res_prot.addPerParticleParameter(\"x0\")\n"); print (FILE2 "force_res_prot.addPerParticleParameter(\"y0\")\n"); print (FILE2 "force_res_prot.addPerParticleParameter(\"z0\")\n"); print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n"); print (FILE2 " if ((i <= $last_protein_id) or (i >= $gtp_first_id and i <= $gtp_last_id)):\n"); print (FILE2 " force_res_prot.addParticle(i, atom_crd.value_in_unit(nanometers))\n"); print (FILE2 "system.addForce(force_res_prot)\n");


print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n"); print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n"); print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n"); print (FILE2 "simulation.context.setPositions(parm.positions)\n"); print (FILE2 "simulation.reporters.append(StateDataReporter(stdout, 100, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True))\n"); print (FILE2 "simulation.reporters.append(NetCDFReporter('md1.nc', 5000, crds=True))\n"); print (FILE2 "restrt = RestartReporter('md1.rst7', 10000, parm.ptr('natom'), netcdf=True)\n"); print (FILE2 "simulation.reporters.append(restrt)\n"); print (FILE2 "simulation.step(10000)\n"); print (FILE2 "quit\n"); close (FILE2); my $cmd = "~/anaconda3/bin/python openmm-input.py >out-openmm-md1.txt"; system($cmd); ##################### EQ VEL (MD2-sim1) 20 ps my $outfile="openmm-input.py"; open (FILE2, "> $outfile") || die; print (FILE2 "from __future__ import absolute_import\n"); print (FILE2 "from simtk.openmm import CustomIntegrator\n"); print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n"); print (FILE2 "from simtk.openmm import CustomExternalForce\n"); print (FILE2 "from simtk.openmm.app import *\n"); print (FILE2 "from simtk.openmm import *\n"); print (FILE2 "from simtk.unit import *\n"); print (FILE2 "from sys import stdout\n"); print (FILE2 "from parmed import load_file\n"); print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n"); print (FILE2 "from parmed.openmm.reporters import RestartReporter\n"); print (FILE2 "\n"); print (FILE2 "parm = load_file('out4-parmed.prmtop', 'md1.rst7.10000')\n"); print (FILE2 "inpcrd = AmberInpcrdFile('md1.rst7.10000')\n"); print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n"); #Add constraints to anchors and gtp-MG print (FILE2 "for i, atom in 
enumerate(parm.atoms):\n"); print (FILE2 " if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8) or (i >= $gtp_first_id and i <= $gtp_last_id):\n"); print (FILE2 " system.setParticleMass(i, 0*dalton)\n"); print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n"); print (FILE2 "system.addForce(MonteCarloBarostat(1*bar, 300*kelvin))\n"); print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n"); print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n"); print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n"); print (FILE2 "simulation.context.setPositions(parm.positions)\n"); print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n"); print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n"); print (FILE2 "simulation.reporters.append(StateDataReporter('out-openmm-md2-sim1.txt', 1000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True, volume=True, density=True))\n"); print (FILE2 "simulation.reporters.append(NetCDFReporter('md2-sim1.nc', 250000, crds=True))\n");


print (FILE2 "restrt = RestartReporter('md2-sim1.rst7', 10000, parm.ptr('natom'), netcdf=True)\n"); print (FILE2 "simulation.reporters.append(restrt)\n"); print (FILE2 "simulation.step(10000)\n"); print (FILE2 "positions = simulation.context.getState(getPositions=True).getPositions()\n"); #print coordinates for first checkpoint: print (FILE2 "for i, atom_crd in enumerate(positions):\n"); print (FILE2 " if (i == $L1A_id):\n"); print (FILE2 " coords = atom_crd.value_in_unit(angstroms)\n"); print (FILE2 " x = coords[0]\n"); print (FILE2 " y = coords[1]\n"); print (FILE2 " z = coords[2]\n"); print (FILE2 " print(x)\n"); print (FILE2 " print(y)\n"); print (FILE2 " print(z)\n"); print (FILE2 " if (i == $L1B_id):\n"); print (FILE2 " coords = atom_crd.value_in_unit(angstroms)\n"); print (FILE2 " x = coords[0]\n"); print (FILE2 " y = coords[1]\n"); print (FILE2 " z = coords[2]\n"); print (FILE2 " print(x)\n"); print (FILE2 " print(y)\n"); print (FILE2 " print(z)\n"); print (FILE2 "quit\n"); close (FILE2); my $cmd = "~/anaconda3/bin/python openmm-input.py >out-md2-sim1-coords.txt"; system($cmd); ################################################################################################# #################################### END PREPARE THE SYSTEM ############################### ################################################################################################# ################################################################################################# #################################### FIRST CHECKPOINT SMD ############################### ################################################################################################# #First, calculate checkpoint coordinates my @pdb_input = read_file("out-md2-sim1-coords.txt") or die; $L1Ax=$pdb_input[0]; $L1Ax =~ s/^\s+|\s+$//g; $L1Ay=$pdb_input[1]; $L1Ay =~ s/^\s+|\s+$//g; $L1Az=$pdb_input[2]; $L1Az =~ s/^\s+|\s+$//g; $L1Bx=$pdb_input[3]; $L1Bx =~ s/^\s+|\s+$//g; $L1By=$pdb_input[4]; $L1By =~ 
s/^\s+|\s+$//g; $L1Bz=$pdb_input[5]; $L1Bz =~ s/^\s+|\s+$//g;


my $CK1x=($L1Ax+$L1Bx)/2; my $CK1y=($L1Ay+$L1By)/2; my $CK1z=($L1Az+$L1Bz)/2; print "CK1 is $CK1x $CK1y $CK1z\n"; $CK1x= sprintf "%.3f", $CK1x; $CK1y= sprintf "%.3f", $CK1y; $CK1z= sprintf "%.3f", $CK1z; my $k=0.075; ##################### SMD my $outfile="openmm-input.py"; open (FILE2, "> $outfile") || die; print (FILE2 "from __future__ import absolute_import\n"); print (FILE2 "from simtk.openmm import CustomIntegrator\n"); print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n"); print (FILE2 "from simtk.openmm import CustomExternalForce\n"); print (FILE2 "from simtk.openmm.app import *\n"); print (FILE2 "from simtk.openmm import *\n"); print (FILE2 "from simtk.unit import *\n"); print (FILE2 "from sys import stdout\n"); print (FILE2 "from parmed import load_file\n"); print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n"); print (FILE2 "from parmed.openmm.reporters import RestartReporter\n"); print (FILE2 "\n"); print (FILE2 "parm = load_file('out4-parmed.prmtop', 'md2-sim1.rst7.10000')\n"); print (FILE2 "inpcrd = AmberInpcrdFile('md2-sim1.rst7.10000')\n"); print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n"); #Add constraints to anchors print (FILE2 "for i, atom in enumerate(parm.atoms):\n"); print (FILE2 " if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n"); print (FILE2 " system.setParticleMass(i, 0*dalton)\n"); print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n"); print (FILE2 "integrator.setConstraintTolerance(0.0000001)\n"); #End Add constraints to anchors #Add sMD force print (FILE2 "force_smd = CustomExternalForce(\"k*((x-xd)^2+(y-yd)^2+(z-zd)^2)\")\n"); print (FILE2 "force_smd.addGlobalParameter(\"k\", $k*kilocalories_per_mole/angstroms**2)\n"); print (FILE2 "force_smd.addGlobalParameter(\"xd\", $CK1x*angstroms)\n"); print 
(FILE2 "force_smd.addGlobalParameter(\"yd\", $CK1y*angstroms)\n"); print (FILE2 "force_smd.addGlobalParameter(\"zd\", $CK1z*angstroms)\n"); print (FILE2 "force_smd.addPerParticleParameter(\"xd\")\n"); print (FILE2 "force_smd.addPerParticleParameter(\"yd\")\n"); print (FILE2 "force_smd.addPerParticleParameter(\"zd\")\n"); print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n"); print (FILE2 " if (i == $smd_atom_id):\n"); print (FILE2 " force_smd.addParticle(i, atom_crd.value_in_unit(nanometers))\n"); print (FILE2 "system.addForce(force_smd)\n"); #End Add sMD force print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n"); print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n"); print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");


print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n");
print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter('out-openmm-smd1.txt', 1000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('smd1.nc', 50000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('smd1.rst7', 50000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
#### CHECKPOINT LOOP
#math.sqrt is used in the checkpoint loop
print (FILE2 "import math\n");
print (FILE2 "smd_loop = 0\n");
print (FILE2 "it_check = 1\n");
print (FILE2 "stop_check = 0\n");
print (FILE2 "step = 0\n");
print (FILE2 "while smd_loop < it_check:\n");
print (FILE2 "    simulation.step(12500)\n");
print (FILE2 "    positions = simulation.context.getState(getPositions=True).getPositions()\n");
print (FILE2 "    for i, atom_crd in enumerate(positions):\n");
print (FILE2 "        if (i == $smd_atom_id):\n");
print (FILE2 "            coords_smd = atom_crd.value_in_unit(angstroms)\n");
print (FILE2 "            x_smd = coords_smd[0]\n");
print (FILE2 "            y_smd = coords_smd[1]\n");
print (FILE2 "            z_smd = coords_smd[2]\n");
print (FILE2 "            x = x_smd - $CK1x\n");
print (FILE2 "            y = y_smd - $CK1y\n");
print (FILE2 "            z = z_smd - $CK1z\n");
print (FILE2 "            dist = math.sqrt(x*x+y*y+z*z)\n");
print (FILE2 "    step = step + 12500\n");
print (FILE2 "    print(\"step =\", step)\n");
print (FILE2 "    print(\"dist =\", dist)\n");
print (FILE2 "    smd_loop += 1\n");
print (FILE2 "    it_check += 1\n");
print (FILE2 "    stop_check += 1\n");
#synchronise stop check with trajectory writing (4 x 12500 = 50000 steps)
print (FILE2 "    if (stop_check == 4):\n");
print (FILE2 "        stop_check = 0\n");
print (FILE2 "        if (dist < 4):\n");
print (FILE2 "            it_check = 0\n");
#avoid an infinite loop in case of a stuck state
print (FILE2 "    if (smd_loop == 80):\n");
print (FILE2 "        it_check = 0\n");
print (FILE2 "        print(\"!!SMD stopped: state not reached after 2 ns!!\")\n");
#print coordinates for second checkpoint and rst step:
print (FILE2 "for i, atom_crd in enumerate(positions):\n");
print (FILE2 "    if (i == $L2_id):\n");
print (FILE2 "        coords = atom_crd.value_in_unit(angstroms)\n");
print (FILE2 "        x = coords[0]\n");
print (FILE2 "        y = coords[1]\n");
print (FILE2 "        z = coords[2]\n");
print (FILE2 "        print(x)\n");
print (FILE2 "        print(y)\n");
print (FILE2 "        print(z)\n");
print (FILE2 "print(step)\n");
#### END CHECKPOINT LOOP
print (FILE2 "quit\n");
close (FILE2);


my $cmd = "~/anaconda3/bin/python openmm-input.py >out-smd1-dist.txt";
system($cmd);
#################################################################################################
#################################### END FIRST CHECKPOINT SMD ##############################
#################################################################################################

#################################################################################################
#################################### SECOND CHECKPOINT SMD ###############################
#################################################################################################
#First, calculate checkpoint coordinates
my @pdb_input = read_file("out-smd1-dist.txt") or die;
my $array_size = scalar @pdb_input;
$L2x=$pdb_input[$array_size-4];
$L2x =~ s/^\s+|\s+$//g;
$L2y=$pdb_input[$array_size-3];
$L2y =~ s/^\s+|\s+$//g;
$L2z=$pdb_input[$array_size-2];
$L2z =~ s/^\s+|\s+$//g;
$last_step=$pdb_input[$array_size-1];
$last_step =~ s/^\s+|\s+$//g;
#EXIT if previous step not converged:
$exit=substr($pdb_input[$array_size-8], 0, 2);
if ($exit eq '!!') {
    exit;
}
my $CK2x=$L2x;
my $CK2y=$L2y;
my $CK2z=$L2z;
print "CK2 is $CK2x $CK2y $CK2z\n";
$CK2x= sprintf "%.3f", $CK2x;
$CK2y= sprintf "%.3f", $CK2y;
$CK2z= sprintf "%.3f", $CK2z;
##################### SMD
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");
print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");


print (FILE2 "\n");
print (FILE2 "parm = load_file('out4-parmed.prmtop', 'smd1.rst7.$last_step')\n");
print (FILE2 "inpcrd = AmberInpcrdFile('smd1.rst7.$last_step')\n");
print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n");
#Add constraints to anchors
print (FILE2 "for i, atom in enumerate(parm.atoms):\n");
print (FILE2 "    if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n");
print (FILE2 "        system.setParticleMass(i, 0*dalton)\n");
print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n");
print (FILE2 "integrator.setConstraintTolerance(0.0000001)\n");
#End Add constraints to anchors
#Add sMD force
print (FILE2 "force_smd = CustomExternalForce(\"k*((x-xd)^2+(y-yd)^2+(z-zd)^2)\")\n");
print (FILE2 "force_smd.addGlobalParameter(\"k\", $k*kilocalories_per_mole/angstroms**2)\n");
print (FILE2 "force_smd.addGlobalParameter(\"xd\", $CK2x*angstroms)\n");
print (FILE2 "force_smd.addGlobalParameter(\"yd\", $CK2y*angstroms)\n");
print (FILE2 "force_smd.addGlobalParameter(\"zd\", $CK2z*angstroms)\n");
print (FILE2 "force_smd.addPerParticleParameter(\"xd\")\n");
print (FILE2 "force_smd.addPerParticleParameter(\"yd\")\n");
print (FILE2 "force_smd.addPerParticleParameter(\"zd\")\n");
print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n");
print (FILE2 "    if (i == $smd_atom_id):\n");
print (FILE2 "        force_smd.addParticle(i, atom_crd.value_in_unit(nanometers))\n");
print (FILE2 "system.addForce(force_smd)\n");
#End Add sMD force
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");
print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n");
print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");
print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n");
print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter('out-openmm-smd2.txt', 1000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('smd2.nc', 50000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('smd2.rst7', 50000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
#### CHECKPOINT LOOP
print (FILE2 "smd_loop = 0\n");
print (FILE2 "it_check = 1\n");
print (FILE2 "stop_check = 0\n");
print (FILE2 "step = 0\n");
print (FILE2 "while smd_loop < it_check:\n");
print (FILE2 "    simulation.step(12500)\n");
print (FILE2 "    positions = simulation.context.getState(getPositions=True).getPositions()\n");
print (FILE2 "    for i, atom_crd in enumerate(positions):\n");
print (FILE2 "        if (i == $smd_atom_id):\n");
print (FILE2 "            coords_smd = atom_crd.value_in_unit(angstroms)\n");
print (FILE2 "            x_smd = coords_smd[0]\n");


print (FILE2 "            y_smd = coords_smd[1]\n");
print (FILE2 "            z_smd = coords_smd[2]\n");
print (FILE2 "            x = x_smd - $CK2x\n");
print (FILE2 "            y = y_smd - $CK2y\n");
print (FILE2 "            z = z_smd - $CK2z\n");
print (FILE2 "            dist = math.sqrt(x*x+y*y+z*z)\n");
print (FILE2 "    step = step + 12500\n");
print (FILE2 "    print(\"step =\", step)\n");
print (FILE2 "    print(\"dist =\", dist)\n");
print (FILE2 "    smd_loop += 1\n");
print (FILE2 "    it_check += 1\n");
print (FILE2 "    stop_check += 1\n");
#synchronise stop check with traj writing
print (FILE2 "    if (stop_check == 4):\n");
print (FILE2 "        stop_check = 0\n");
print (FILE2 "        if (dist < 7):\n");
print (FILE2 "            it_check = 0\n");
#avoid infinite loop, in case of stuck state
print (FILE2 "    if (smd_loop == 80):\n");
print (FILE2 "        it_check = 0\n");
print (FILE2 "        print(\"!!SMD stopped: state not reached after 2 ns!!\")\n");
#print coordinates for third checkpoint and last step:
print (FILE2 "for i, atom_crd in enumerate(positions):\n");
print (FILE2 "    if (i == $L3_id):\n");
print (FILE2 "        coords = atom_crd.value_in_unit(angstroms)\n");
print (FILE2 "        x = coords[0]\n");
print (FILE2 "        y = coords[1]\n");
print (FILE2 "        z = coords[2]\n");
print (FILE2 "        print(x)\n");
print (FILE2 "        print(y)\n");
print (FILE2 "        print(z)\n");
print (FILE2 "print(step)\n");
#### END CHECKPOINT LOOP
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "~/anaconda3/bin/python openmm-input.py >out-smd2-dist.txt";
system($cmd);
#################################################################################################
#################################### END SECOND CHECKPOINT SMD ##############################
#################################################################################################

#################################################################################################
#################################### THIRD CHECKPOINT SMD ###############################
#################################################################################################
#First, calculate checkpoint coordinates
my @pdb_input = read_file("out-smd2-dist.txt") or die;
my $array_size = scalar @pdb_input;
$L3x=$pdb_input[$array_size-4];
$L3x =~ s/^\s+|\s+$//g;
$L3y=$pdb_input[$array_size-3];


$L3y =~ s/^\s+|\s+$//g;
$L3z=$pdb_input[$array_size-2];
$L3z =~ s/^\s+|\s+$//g;
$last_step=$pdb_input[$array_size-1];
$last_step =~ s/^\s+|\s+$//g;
#EXIT if previous step not converged:
$exit=substr($pdb_input[$array_size-8], 0, 2);
if ($exit eq '!!') {
    exit;
}
my $CK3x=$L3x;
my $CK3y=$L3y;
my $CK3z=$L3z;
print "CK3 is $CK3x $CK3y $CK3z\n";
$CK3x= sprintf "%.3f", $CK3x;
$CK3y= sprintf "%.3f", $CK3y;
$CK3z= sprintf "%.3f", $CK3z;
##################### SMD
my $outfile="openmm-input.py";
open (FILE2, "> $outfile") || die;
print (FILE2 "from __future__ import absolute_import\n");
print (FILE2 "from simtk.openmm import CustomIntegrator\n");
print (FILE2 "from simtk.unit import kilojoules_per_mole, is_quantity\n");
print (FILE2 "from simtk.openmm import CustomExternalForce\n");
print (FILE2 "from simtk.openmm.app import *\n");
print (FILE2 "from simtk.openmm import *\n");
print (FILE2 "from simtk.unit import *\n");
print (FILE2 "from sys import stdout\n");
print (FILE2 "from parmed import load_file\n");
print (FILE2 "from parmed.openmm import StateDataReporter, NetCDFReporter\n");
print (FILE2 "from parmed.openmm.reporters import RestartReporter\n");
print (FILE2 "\n");
print (FILE2 "parm = load_file('out4-parmed.prmtop', 'smd2.rst7.$last_step')\n");
print (FILE2 "inpcrd = AmberInpcrdFile('smd2.rst7.$last_step')\n");
print (FILE2 "system = parm.createSystem(nonbondedMethod=PME, nonbondedCutoff=8*angstroms, constraints=HBonds, rigidWater=True)\n");
#Add constraints to anchors
print (FILE2 "for i, atom in enumerate(parm.atoms):\n");
print (FILE2 "    if (i >= $dna1 and i <= $dna2) or (i >= $dna3 and i <= $dna4) or (i >= $dna5 and i <= $dna6) or (i >= $dna7 and i <= $dna8):\n");
print (FILE2 "        system.setParticleMass(i, 0*dalton)\n");
print (FILE2 "integrator = LangevinIntegrator(300*kelvin, 1.0/picosecond, 2*femtoseconds)\n");
print (FILE2 "integrator.setConstraintTolerance(0.0000001)\n");
#End Add constraints to anchors
#Add sMD force
print (FILE2 "force_smd = CustomExternalForce(\"k*((x-xd)^2+(y-yd)^2+(z-zd)^2)\")\n");
print (FILE2 "force_smd.addGlobalParameter(\"k\", $k*kilocalories_per_mole/angstroms**2)\n");
print (FILE2 "force_smd.addGlobalParameter(\"xd\", $CK3x*angstroms)\n");
print (FILE2 "force_smd.addGlobalParameter(\"yd\", $CK3y*angstroms)\n");
print (FILE2 "force_smd.addGlobalParameter(\"zd\", $CK3z*angstroms)\n");
print (FILE2 "force_smd.addPerParticleParameter(\"xd\")\n");
print (FILE2 "force_smd.addPerParticleParameter(\"yd\")\n");
print (FILE2 "force_smd.addPerParticleParameter(\"zd\")\n");


print (FILE2 "for i, atom_crd in enumerate(parm.positions):\n");
print (FILE2 "    if (i == $smd_atom_id):\n");
print (FILE2 "        force_smd.addParticle(i, atom_crd.value_in_unit(nanometers))\n");
print (FILE2 "system.addForce(force_smd)\n");
#End Add sMD force
print (FILE2 "platform = Platform.getPlatformByName('CUDA')\n");
print (FILE2 "properties = {'CudaDeviceIndex': '0', 'CudaPrecision': 'mixed'}\n");
print (FILE2 "simulation = Simulation(parm.topology, system, integrator, platform, properties)\n");
print (FILE2 "simulation.context.setPositions(parm.positions)\n");
print (FILE2 "simulation.context.setPeriodicBoxVectors(*inpcrd.boxVectors)\n");
print (FILE2 "simulation.context.setVelocities(inpcrd.velocities)\n");
print (FILE2 "simulation.reporters.append(StateDataReporter('out-openmm-smd3.txt', 1000, step=True, potentialEnergy=True, temperature=True, kineticEnergy=True, totalEnergy=True))\n");
print (FILE2 "simulation.reporters.append(NetCDFReporter('smd3.nc', 50000, crds=True))\n");
print (FILE2 "restrt = RestartReporter('smd3.rst7', 50000, parm.ptr('natom'), netcdf=True)\n");
print (FILE2 "simulation.reporters.append(restrt)\n");
#### CHECKPOINT LOOP
print (FILE2 "smd_loop = 0\n");
print (FILE2 "it_check = 1\n");
print (FILE2 "stop_check = 0\n");
print (FILE2 "step = 0\n");
print (FILE2 "while smd_loop < it_check:\n");
print (FILE2 "    simulation.step(12500)\n");
print (FILE2 "    positions = simulation.context.getState(getPositions=True).getPositions()\n");
print (FILE2 "    for i, atom_crd in enumerate(positions):\n");
print (FILE2 "        if (i == $smd_atom_id):\n");
print (FILE2 "            coords_smd = atom_crd.value_in_unit(angstroms)\n");
print (FILE2 "            x_smd = coords_smd[0]\n");
print (FILE2 "            y_smd = coords_smd[1]\n");
print (FILE2 "            z_smd = coords_smd[2]\n");
print (FILE2 "            x = x_smd - $CK3x\n");
print (FILE2 "            y = y_smd - $CK3y\n");
print (FILE2 "            z = z_smd - $CK3z\n");
print (FILE2 "            dist = math.sqrt(x*x+y*y+z*z)\n");
print (FILE2 "    step = step + 12500\n");
print (FILE2 "    print(\"step =\", step)\n");
print (FILE2 "    print(\"dist =\", dist)\n");
print (FILE2 "    smd_loop += 1\n");
print (FILE2 "    it_check += 1\n");
print (FILE2 "    stop_check += 1\n");
#synchronise stop check with traj writing
print (FILE2 "    if (stop_check == 4):\n");
print (FILE2 "        stop_check = 0\n");
print (FILE2 "        if (dist < 3):\n");
print (FILE2 "            it_check = 0\n");
#avoid infinite loop, in case of stuck state
print (FILE2 "    if (smd_loop == 80):\n");
print (FILE2 "        it_check = 0\n");
print (FILE2 "        print(\"!!SMD stopped: state not reached after 2 ns!!\")\n");
#### END CHECKPOINT LOOP
print (FILE2 "quit\n");
close (FILE2);
my $cmd = "~/anaconda3/bin/python openmm-input.py >out-smd3-dist.txt";


system($cmd);
#################################################################################################
#################################### END THIRD CHECKPOINT SMD ##############################
#################################################################################################
exit;
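
The stopping rule shared by the three checkpoint loops above can be summarised in a short, simulator-independent sketch. It is an illustration only, not part of the thesis scripts: the function name `run_checkpoint_loop` and its arguments `advance`, `get_position`, `target` and `cutoff` are hypothetical, and the sketch treats a target reached on the final iteration as converged, which is a simplification of the generated loop. The defaults mirror the values used in the scripts (12500 steps per iteration, a distance test every 4 iterations, at most 80 iterations, i.e. 1,000,000 steps or 2 ns at a 2 fs time step).

```python
import math


def run_checkpoint_loop(advance, get_position, target, cutoff,
                        steps_per_iter=12500, check_every=4, max_iters=80):
    """Advance a simulation until the steered atom is within `cutoff` of `target`.

    advance(n)      -- hypothetical callable that runs n integration steps
    get_position()  -- hypothetical callable returning (x, y, z) of the steered atom
    target, cutoff  -- checkpoint coordinates and distance threshold (same length unit)

    Returns (total_steps, last_distance, reached).
    """
    step = 0
    dist = float("inf")
    for iteration in range(1, max_iters + 1):
        advance(steps_per_iter)
        x, y, z = get_position()
        dx, dy, dz = x - target[0], y - target[1], z - target[2]
        dist = math.sqrt(dx * dx + dy * dy + dz * dz)
        step += steps_per_iter
        # Test the distance only every `check_every` iterations, so that a stop
        # coincides with a trajectory/restart write (every 50000 steps here).
        if iteration % check_every == 0 and dist < cutoff:
            return step, dist, True
    # Iteration cap reached without satisfying the distance criterion.
    return step, dist, False


# Toy usage: a "simulation" that moves the atom 1 unit closer per iteration.
position = [10.0, 0.0, 0.0]


def toy_advance(n_steps):
    position[0] -= 1.0


result = run_checkpoint_loop(toy_advance, lambda: tuple(position),
                             target=(0.0, 0.0, 0.0), cutoff=4.0)
print(result)
```

Passing the integrator and the position lookup as callables keeps the termination logic testable without a GPU or an OpenMM installation; in the generated scripts these roles are played by `simulation.step` and `simulation.context.getState(getPositions=True)`.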
