Algorithms for Bioinformatics
Protein Folding 1

Disulfide Bonds

• Two cysteines in close proximity will form a covalent bond.
• Called a disulfide bond, disulfide bridge, or dicysteine bond.
• Significantly stabilizes tertiary structure.


Determining Protein Structure

• There are on the order of 100,000 distinct proteins in the human proteome.
• 3D structures have been determined for 14,000 proteins, from all organisms.
    • Includes duplicates with different ligands bound, etc.
• Coordinates are determined by X-ray crystallography.


X-Ray Crystallography

[Figure: crystal, ~0.5 mm]

• The crystal is a mosaic of millions of copies of the protein.
• As much as 70% is solvent (water)!
• May take months (and a “green” thumb) to grow.


X-Ray diffraction

• Image is averaged over:
    • Space (many copies)
    • Time (of the diffraction experiment)


Electron Density Maps

• Resolution is dependent on the quality/regularity of the crystal.
• R-factor is a measure of “leftover” electron density.
• Solvent fitting
• Refinement


The Protein Data Bank

ATOM      1  N   ALA E   1      22.382  47.782 112.975  1.00 24.09      3APR 213
ATOM      2  CA  ALA E   1      22.957  47.648 111.613  1.00 22.40      3APR 214
ATOM      3  C   ALA E   1      23.572  46.251 111.545  1.00 21.32      3APR 215
ATOM      4  O   ALA E   1      23.948  45.688 112.603  1.00 21.54      3APR 216
ATOM      5  CB  ALA E   1      23.932  48.787 111.380  1.00 22.79      3APR 217
ATOM      6  N   GLY E   2      23.656  45.723 110.336  1.00 19.17      3APR 218
ATOM      7  CA  GLY E   2      24.216  44.393 110.087  1.00 17.35      3APR 219
ATOM      8  C   GLY E   2      25.653  44.308 110.579  1.00 16.49      3APR 220
ATOM      9  O   GLY E   2      26.258  45.296 110.994  1.00 15.35      3APR 221
ATOM     10  N   VAL E   3      26.213  43.110 110.521  1.00 16.21      3APR 222
ATOM     11  CA  VAL E   3      27.594  42.879 110.975  1.00 16.02      3APR 223
ATOM     12  C   VAL E   3      28.569  43.613 110.055  1.00 15.69      3APR 224
ATOM     13  O   VAL E   3      28.429  43.444 108.822  1.00 16.43      3APR 225
ATOM     14  CB  VAL E   3      27.834  41.363 110.979  1.00 16.66      3APR 226
ATOM     15  CG1 VAL E   3      29.259  41.013 111.404  1.00 17.35      3APR 227
ATOM     16  CG2 VAL E   3      26.811  40.649 111.850  1.00 17.03      3APR 228

http://www.rcsb.org/pdb/
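Each ATOM record above lists an atom serial number, atom name, residue name, chain, residue number, x/y/z coordinates, occupancy and temperature factor. A minimal Perl sketch of reading one record follows. It assumes whitespace-separated fields as they appear on this slide; real PDB files are fixed-column, so a robust parser would slice by column instead. The sub name `parse_atom` and the hash keys are illustrative, not part of any library.

```perl
use strict;
use warnings;

# Hypothetical helper: split one ATOM record into named fields.
# Assumes whitespace-separated fields, as in the records on the slide.
sub parse_atom {
    my ($line) = @_;
    my @f = split ' ', $line;
    return unless @f >= 11 && $f[0] eq 'ATOM';
    return {
        serial  => $f[1],    # atom serial number
        name    => $f[2],    # atom name (N, CA, C, O, CB, ...)
        resname => $f[3],    # three-letter residue name
        chain   => $f[4],    # chain identifier
        resseq  => $f[5],    # residue number
        x       => $f[6],
        y       => $f[7],
        z       => $f[8],
        occ     => $f[9],    # occupancy
        bfactor => $f[10],   # temperature factor
    };
}

my $atom = parse_atom("ATOM 2 CA ALA E 1 22.957 47.648 111.613 1.00 22.40 3APR 214");
print "$atom->{name} of $atom->{resname} $atom->{resseq}: ",
      "x=$atom->{x} y=$atom->{y} z=$atom->{z}\n";
```

Collecting only the records whose name is CA gives the alpha-carbon trace of the backbone.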


A Peek at Protein Function

• Serine proteases – cleave other proteins
    • Catalytic Triad: ASP, HIS, SER


Cleaving the peptide bond


Three Serine Proteases

• Chymotrypsin – Cleaves the peptide bond on the carboxyl side of aromatic (ring) residues: Trp, Phe, Tyr; and large hydrophobic residues: Met.
• Trypsin – Cleaves after Lys (K) or Arg (R)
    • Positive charge
• Elastase – Cleaves after small residues: Gly, Ala, Ser, Cys
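The three specificities can be written as a small lookup table. The sketch below cuts a one-letter sequence after every residue in the set listed for each protease. It is a simplified rule taken directly from this slide; real enzymes are also influenced by neighbouring residues. The names `%cleaves_after` and `digest` are illustrative.

```perl
use strict;
use warnings;

# Residues after which each protease cleaves, as listed on the slide.
my %cleaves_after = (
    chymotrypsin => 'WFYM',   # Trp, Phe, Tyr, Met
    trypsin      => 'KR',     # Lys, Arg
    elastase     => 'GASC',   # Gly, Ala, Ser, Cys
);

# Hypothetical helper: return the fragments produced by cutting a
# sequence on the carboxyl side of every matching residue.
sub digest {
    my ($protease, $seq) = @_;
    my $set = $cleaves_after{$protease} or die "unknown protease: $protease";
    # Zero-width lookbehind: split at each position preceded by a listed residue.
    return split /(?<=[$set])/, $seq;
}

print join('|', digest('trypsin', 'AAVIKYGCAL')), "\n";
```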


Specificity Binding Pocket


The Protein Folding Problem

• Central question of molecular biology: “Given a particular sequence of amino acid residues (primary structure), what will the tertiary/quaternary structure of the resulting protein be?”
• Input: AAVIKYGCAL…
• Output: φ1ψ1, φ2ψ2, … = backbone conformation (no side chains yet)


Protein Folding – Biological perspective

• “Central dogma”: Sequence specifies structure
• Denature – to “unfold” a protein back to random coil configuration
    • β-mercaptoethanol – breaks disulfide bonds
    • Urea or guanidine hydrochloride – denaturant
    • Also heat or pH
• Anfinsen’s experiments
    • Denatured ribonuclease
    • Spontaneously regained enzymatic activity
    • Evidence that it re-folded to native conformation


Folding intermediates

• Levinthal’s paradox – Consider a 100 residue protein. If each residue can take only 3 positions, there are 3^100 ≈ 5 × 10^47 possible conformations.
    • If it takes 10^-13 s to convert from 1 structure to another, exhaustive search would take 1.6 × 10^27 years!
• Folding must proceed by progressive stabilization of intermediates
    • Molten globules – most secondary structure formed, but much less compact than “native” conformation.
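The Levinthal numbers can be checked in a few lines. This sketch uses the slide's values (100 residues, 3 positions each, 10^-13 s per conversion). The year length of 365.25 days is an assumption made here for the conversion.

```perl
use strict;
use warnings;

# Levinthal estimate with the slide's numbers.
my $residues        = 100;      # chain length
my $states          = 3;        # positions available to each residue
my $seconds_per_try = 1e-13;    # time to convert between two structures
my $seconds_per_yr  = 365.25 * 24 * 3600;   # assumed year length

# 3**100 exceeds integer range, so Perl evaluates it in floating point.
my $conformations = $states ** $residues;
my $years = $conformations * $seconds_per_try / $seconds_per_yr;

printf "conformations: %.1e\n", $conformations;
printf "search time:   %.1e years\n", $years;
```

Both printed values agree with the slide's figures to the precision given there.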


Forces driving protein folding

• It is believed that hydrophobic collapse is a key driving force for protein folding
    • Hydrophobic core
    • Polar surface interacting with solvent
• Minimum volume (no cavities)
• Disulfide bond formation stabilizes tertiary structure
• Hydrogen bonds
• Polar and electrostatic interactions


Folding help

• Proteins are, in fact, only marginally stable
    • Native state is typically only 5 to 10 kcal/mole more stable than the unfolded form
• Many proteins help in folding
    • Protein disulfide isomerase – catalyzes shuffling of disulfide bonds
    • Chaperones – break up aggregates and (in theory) unfold misfolded proteins


The Hydrophobic Core

• Hemoglobin A is the protein in red blood cells (erythrocytes) responsible for binding oxygen.
• The mutation E6V in the β chain places a hydrophobic Val on the surface of hemoglobin.
• The resulting “sticky patch” causes hemoglobin S to agglutinate (stick together) and form fibers which deform the red blood cell and do not carry oxygen efficiently.
• Sickle cell anemia was the first identified molecular disease.


Sickle Cell Anemia

Sequestering hydrophobic residues in the protein core protects proteins from hydrophobic agglutination.


Computational Problems in Protein Folding

• Two key questions:
    • Evaluation – how can we tell a correctly folded protein from an incorrectly folded protein?
        • H-bonds, electrostatics, hydrophobic effect, etc.
        • Derive a function, see how well it does on “real” proteins
    • Optimization – once we get an evaluation function, can we optimize it?
        • Simulated annealing/Monte Carlo
        • EC (evolutionary computing)
        • Heuristics
        • We’ll talk more about these methods later…


Fold Optimization

• Simple lattice models (HP-models)
    • Two types of residues: hydrophobic and polar
    • 2-D or 3-D lattice
    • The only force is hydrophobic collapse
    • Score = number of HH contacts


Scoring Lattice Models

• H/P model scoring: count noncovalent hydrophobic interactions.
• Sometimes:
    • Penalize for buried polar or surface hydrophobic residues
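The basic HP score can be sketched as follows: place the chain on a 2-D lattice, then count pairs of H residues that sit on neighbouring lattice points without being neighbours in the chain. The sub name `hp_score` and the convention that the first residue sits at (0,0), so there is one fewer move than residues, are assumptions made for this example. The optional penalties for buried polar or surface hydrophobic residues are not included.

```perl
use strict;
use warnings;

# Hypothetical helper: score a 2-D HP fold.
#   $seq   : string of H and P, one letter per residue
#   $moves : absolute moves (U, D, L, R), one fewer than residues;
#            the first residue is placed at (0,0) (assumed convention)
# Returns the number of noncovalent H-H contacts, or undef on a bump.
sub hp_score {
    my ($seq, $moves) = @_;
    die "need one move fewer than residues"
        unless length($moves) == length($seq) - 1;
    my %step = (U => [0, 1], D => [0, -1], L => [-1, 0], R => [1, 0]);
    my @res = split //, $seq;
    my ($x, $y) = (0, 0);
    my %at  = ("0,0" => 0);              # "x,y" -> residue index
    my @pos = ([0, 0]);
    for my $m (split //, $moves) {
        $x += $step{$m}[0];
        $y += $step{$m}[1];
        return if exists $at{"$x,$y"};   # bump: cell already occupied
        $at{"$x,$y"} = scalar @pos;
        push @pos, [$x, $y];
    }
    my $score = 0;
    for my $i (0 .. $#pos) {
        next unless $res[$i] eq 'H';
        # Look only right and up so that each pair is counted once.
        for my $d ([1, 0], [0, 1]) {
            my $j = $at{ ($pos[$i][0] + $d->[0]) . "," . ($pos[$i][1] + $d->[1]) };
            next unless defined $j;
            next if abs($i - $j) == 1;   # covalent chain neighbours
            $score++ if $res[$j] eq 'H';
        }
    }
    return $score;
}

my $score = hp_score("HPPH", "RUL");
print "score: $score\n";
```

In the usage line the four residues form a unit square, so the two H residues at the chain ends touch once.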


What can we do with lattice models?

• For smaller polypeptides, exhaustive search can be used
    • Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process
• For larger chains, other optimization and search methods must be used
    • Greedy, branch and bound
    • Evolutionary computing, simulated annealing
    • Graph theoretical methods
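For a short chain, exhaustive search is a small recursive enumeration. The sketch below tries every self-avoiding 2-D walk and keeps the one with the most H-H contacts. The sub name `best_fold` and the convention of placing the first residue at (0,0) are assumptions for this example. The number of walks grows roughly as 4^(n-1), which is why this is only practical for small n.

```perl
use strict;
use warnings;

# Exhaustive search over all self-avoiding 2-D folds of a short HP chain.
# Returns (best score, one best fold as a string of absolute moves).
sub best_fold {
    my ($seq) = @_;
    my @res  = split //, $seq;
    my %step = (U => [0, 1], D => [0, -1], L => [-1, 0], R => [1, 0]);
    my %at   = ("0,0" => 0);                 # "x,y" -> residue index
    my ($best, $best_moves) = (-1, "");

    my $extend;
    $extend = sub {
        my ($i, $x, $y, $moves, $score) = @_;
        if ($i == $#res) {                   # every residue has been placed
            ($best, $best_moves) = ($score, $moves) if $score > $best;
            return;
        }
        for my $m (sort keys %step) {
            my ($nx, $ny) = ($x + $step{$m}[0], $y + $step{$m}[1]);
            next if exists $at{"$nx,$ny"};   # bump: skip this branch
            # Contacts the new residue makes with earlier H residues,
            # excluding its covalent neighbour $i.
            my $gain = 0;
            if ($res[$i + 1] eq 'H') {
                for my $d (values %step) {
                    my $j = $at{ ($nx + $d->[0]) . "," . ($ny + $d->[1]) };
                    $gain++ if defined $j && $j != $i && $res[$j] eq 'H';
                }
            }
            $at{"$nx,$ny"} = $i + 1;
            $extend->($i + 1, $nx, $ny, $moves . $m, $score + $gain);
            delete $at{"$nx,$ny"};           # backtrack
        }
    };
    $extend->(0, 0, 0, "", 0);
    undef $extend;                           # break the closure's self-reference
    return ($best, $best_moves);
}

my ($score, $fold) = best_fold("HPPHPPH");
print "best score $score with fold $fold\n";
```

Branch and bound would add a test inside the loop that abandons a branch once its score plus the best possible remaining gain cannot beat `$best`.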


Learning from Lattice Models

• The “hydrophobic zipper” effect
• Ken Dill ~ 1997


Representing a lattice model

• Absolute directions
    • UURRDLDRRU
• Relative directions
    • LFRFRRLLFFL
    • Advantage: immediate reversals (UD or RL in absolute directions) cannot occur
    • Only three directions: L, R, F
• What about bumps? LFRRR
    • Bad score
    • Use a better representation
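Decoding relative moves needs a current heading as well as a position. A minimal sketch follows, assuming the first residue sits at (0,0) with the heading initially "up"; the slides do not fix these conventions, and `decode_relative` is an illustrative name. Whatever the initial heading, LFRRR returns to the cell reached by its first move, which is the bump mentioned above.

```perl
use strict;
use warnings;

# Hypothetical decoder for relative moves (L, F, R).
# Assumed convention: first residue at (0,0), initial heading "up";
# each letter turns the heading (F keeps it) and then steps once.
# Returns (reference to list of "x,y" positions, number of bumps).
sub decode_relative {
    my ($moves) = @_;
    my @dirs = ([0, 1], [1, 0], [0, -1], [-1, 0]);   # up, right, down, left
    my %turn = (L => -1, F => 0, R => 1);
    my ($x, $y, $h) = (0, 0, 0);
    my %seen = ("0,0" => 1);
    my @path = ("0,0");
    my $bumps = 0;
    for my $m (split //, $moves) {
        $h = ($h + $turn{$m}) % 4;       # Perl's % is non-negative for a positive modulus
        $x += $dirs[$h][0];
        $y += $dirs[$h][1];
        $bumps++ if $seen{"$x,$y"}++;    # cell was already visited
        push @path, "$x,$y";
    }
    return (\@path, $bumps);
}

for my $fold ("LFRFRRLLFFL", "LFRRR") {
    my ($path, $bumps) = decode_relative($fold);
    print "$fold: $bumps bump(s), ends at $path->[-1]\n";
}
```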


Preference-order representation

• Each position has two “preferences”
    • If it can’t have either of the two, it will take the “least favorite” path if possible
• Example: {LR},{FL},{RL},{FR},{RL},{RL},{FR},{RF}
• Can still cause bumps: {LF},{FR},{RL},{FL},{RL},{FL},{RF},{RL},{FL}
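One possible reading of the preference-order scheme is sketched below: at each position try the first preference, then the second, then the unlisted direction, and report a bump only when all three cells are taken. The conventions (first residue at (0,0), initial heading "up") and the name `decode_preferences` are assumptions; the slides do not spell them out. Under these assumptions the first example on the slide decodes cleanly and the second gets stuck at its last gene.

```perl
use strict;
use warnings;

# Hypothetical decoder for the preference-order representation.
# Each gene is a two-letter string such as "LR": first and second choice
# among the relative moves L, F, R; the unlisted move is the last resort.
# Returns (reference to path, index of the first gene that could not be
# placed without a bump, or -1 if every gene was placed).
sub decode_preferences {
    my @genes = @_;
    my @dirs = ([0, 1], [1, 0], [0, -1], [-1, 0]);   # up, right, down, left
    my %turn = (L => -1, F => 0, R => 1);
    my ($x, $y, $h) = (0, 0, 0);
    my %seen = ("0,0" => 1);
    my @path = ("0,0");
    for my $g (0 .. $#genes) {
        my @pref = split //, $genes[$g];
        # Append whichever of L, F, R the gene does not mention.
        push @pref, grep { index($genes[$g], $_) < 0 } qw(L F R);
        my $placed = 0;
        for my $m (@pref) {
            my $nh = ($h + $turn{$m}) % 4;
            my ($nx, $ny) = ($x + $dirs[$nh][0], $y + $dirs[$nh][1]);
            next if $seen{"$nx,$ny"};    # occupied: try the next preference
            ($x, $y, $h) = ($nx, $ny, $nh);
            $seen{"$x,$y"} = 1;
            push @path, "$x,$y";
            $placed = 1;
            last;
        }
        return (\@path, $g) unless $placed;   # all three cells taken
    }
    return (\@path, -1);
}

my ($path, $stuck) = decode_preferences(qw(LR FL RL FR RL RL FR RF));
print "placed ", scalar(@$path), " residues, stuck = $stuck\n";
```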


“Decoding” the representation

• The optimizer works on the representation, but to score, we have to “decode” into a structure that lets us check for bumps and score.
• Example: How many bumps in: URDDLLDRURU?
• We can do it on graph paper
    • Start at 0,0
    • Fill in the graph
• In PERL we use a two-dimensional array


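A two-dimensional array in PERL

    use strict;
    use warnings;

    my $configuration = "URDDLLDRURU";   # absolute moves
    my $sequence      = "HPPHHPHPHHH";   # one residue per move
    my @grid;
    # Mark every cell empty; indices run 0..100 so that (50,50) is the centre.
    foreach my $i (0..100) {
        foreach my $j (0..100) {
            $grid[$i][$j] = "empty";
        }
    }
    # Start in the middle of the grid: starting at 0 would let L and D
    # moves produce negative indices, which Perl counts from the array's end.
    my $x = 50;
    my $y = 50;
    my @moves    = split(//, $configuration);
    my @residues = split(//, $sequence);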


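Setting up the grid

    my $bumps = 0;
    foreach my $move (@moves) {
        my $residue = shift(@residues);
        # Compare strings with eq; a single = would assign and always be true.
        if ($move eq "U") { $y++; }
        if ($move eq "D") { $y--; }
        if ($move eq "R") { $x++; }
        if ($move eq "L") { $x--; }
        if (defined $grid[$x][$y] && $grid[$x][$y] ne "empty") {
            $bumps++;                    # BUMP! the cell is already occupied
        } else {
            $grid[$x][$y] = $residue;
        }
    }
    print "bumps: $bumps\n";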


More realistic models

• Higher resolution lattices (45° lattice, etc.)
• Off-lattice models
    • Local moves
    • Optimization/search methods and representations
        • Greedy search
        • Branch and bound
        • EC, Monte Carlo, simulated annealing, etc.


The Other Half of the Picture

• Now that we have a more realistic off-lattice model, we need a better energy function to evaluate a conformation (fold).
• Theoretical force field:
    G = G_van der Waals + G_h-bonds + G_solvent + G_coulomb
• Empirical force fields
    • Start with a database
    • Look at neighboring residues – similar to known protein folds?


Threading: Fold recognition

• Given:
    • Sequence: IVACIVSTEYDVMKAAR…
    • A database of molecular coordinates
• Map the sequence onto each fold
• Evaluate
    • Objective 1: improve scoring function
    • Objective 2: folding


Secondary Structure Prediction

AGVGTVPMTAYGNDIQYYGQVT…
A-VGIVPM-AYGQDIQY-GQVT…
AG-GIIP--AYGNELQ--GQVT…
AGVCTVPMTA---ELQYYG--T…

AGVGTVPMTAYGNDIQYYGQVT…
----hhhHHHHHHhhh--eeEE…


Secondary Structure Prediction

• Easier than folding
    • Current algorithms can predict secondary structure with 70-80% accuracy
• Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222.
    • Based on frequencies of occurrence of residues in helices and sheets
• PhD – Neural network based
    • Uses a multiple sequence alignment
    • Rost & Sander, Proteins, 1994, 19, 55-72


Chou-Fasman Parameters

Name           Abbrv  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine        A      142    83     66     0.060  0.076   0.035   0.058
Arginine       R       98    93     95     0.070  0.106   0.099   0.085
Aspartic Acid  D      101    54    146     0.147  0.110   0.179   0.081
Asparagine     N       67    89    156     0.161  0.083   0.191   0.091
Cysteine       C       70   119    119     0.149  0.050   0.117   0.128
Glutamic Acid  E      151    37     74     0.056  0.060   0.077   0.064
Glutamine      Q      111   110     98     0.074  0.098   0.037   0.098
Glycine        G       57    75    156     0.102  0.085   0.190   0.152
Histidine      H      100    87     95     0.140  0.047   0.093   0.054
Isoleucine     I      108   160     47     0.043  0.034   0.013   0.056
Leucine        L      121   130     59     0.061  0.025   0.036   0.070
Lysine         K      114    74    101     0.055  0.115   0.072   0.095
Methionine     M      145   105     60     0.068  0.082   0.014   0.055
Phenylalanine  F      113   138     60     0.059  0.041   0.065   0.065
Proline        P       57    55    152     0.102  0.301   0.034   0.068
Serine         S       77    75    143     0.120  0.139   0.125   0.106
Threonine      T       83   119     96     0.086  0.108   0.065   0.079
Tryptophan     W      108   137     96     0.077  0.013   0.064   0.167
Tyrosine       Y       69   147    114     0.082  0.065   0.114   0.125
Valine         V      106   170     50     0.062  0.048   0.028   0.053


Chou-Fasman Algorithm

• Identify α-helices
    • Find 4 out of 6 contiguous amino acids that have P(a) > 100
    • Extend the region until 4 amino acids with P(a) < 100 are found
    • Compute ΣP(a) and ΣP(b); if the region is > 5 residues and ΣP(a) > ΣP(b), identify it as a helix
• Repeat for β-sheets [use P(b)]
• If an α and a β region overlap, the overlapping region is predicted according to ΣP(a) and ΣP(b)
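The helix steps can be sketched in Perl using the P(a) and P(b) columns of the parameter table. Two simplifications here are assumptions rather than part of the slides: extension runs to the right only, and "4 amino acids with P(a) < 100" is read as four contiguous residues. The sub name `first_helix` is illustrative.

```perl
use strict;
use warnings;

# P(a) and P(b) from the Chou-Fasman parameter table.
my %Pa = (A=>142, R=>98, D=>101, N=>67, C=>70, E=>151, Q=>111, G=>57, H=>100, I=>108,
          L=>121, K=>114, M=>145, F=>113, P=>57, S=>77, T=>83, W=>108, Y=>69, V=>106);
my %Pb = (A=>83, R=>93, D=>54, N=>89, C=>119, E=>37, Q=>110, G=>75, H=>87, I=>160,
          L=>130, K=>74, M=>105, F=>138, P=>55, S=>75, T=>119, W=>137, Y=>147, V=>170);

# Hypothetical helper: find the first helix by nucleation and extension.
# Returns (start, end, sum of P(a), sum of P(b)) or an empty list.
sub first_helix {
    my ($seq) = @_;
    my @r = split //, $seq;
    for my $s (0 .. $#r - 5) {
        # Nucleation: at least 4 of 6 contiguous residues with P(a) > 100.
        my $formers = grep { $Pa{$_} > 100 } @r[$s .. $s + 5];
        next if $formers < 4;
        # Extension: grow until the next four residues all have P(a) < 100.
        my $e = $s + 5;
        while ($e < $#r) {
            my $last  = $e + 4 > $#r ? $#r : $e + 4;
            my @next4 = @r[$e + 1 .. $last];
            last if @next4 == 4 && !grep { $Pa{$_} >= 100 } @next4;
            $e++;
        }
        my ($sum_a, $sum_b) = (0, 0);
        for my $res (@r[$s .. $e]) {
            $sum_a += $Pa{$res};
            $sum_b += $Pb{$res};
        }
        # Accept only regions longer than 5 residues that favour helix.
        return ($s, $e, $sum_a, $sum_b) if $e - $s + 1 > 5 && $sum_a > $sum_b;
    }
    return;
}

my $seq = "CAENKLDHVRGPTCILFMTWYNDGP";
my ($s, $e, $sa, $sb) = first_helix($seq);
print substr($seq, $s, $e - $s + 1), ": sum P(a) = $sa, sum P(b) = $sb\n";
```

The same routine with the roles of `%Pa` and `%Pb` swapped would give a first pass at β-sheets.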


Chou-Fasman, cont’d

• Identify hairpin turns:
    • P(t) = f(i) of the residue × f(i+1) of the next residue × f(i+2) of the following residue × f(i+3) of the residue at position (i+3)
    • Predict a hairpin turn starting at positions where:
        • P(t) > 0.000075
        • The average P(turn) for the four residues > 100
        • P(a) < P(turn) > P(b) for the four residues
• Accuracy 60-65%
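The turn test is a product of four table entries plus three averages. A sketch follows, with the parameter table copied into a hash; the name `turn_at` is illustrative, and the third condition is read here as a comparison of the four-residue averages, which is an assumption.

```perl
use strict;
use warnings;

# Per residue: [P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3)]
my %cf = (
    A => [142,  83,  66, 0.060, 0.076, 0.035, 0.058],
    R => [ 98,  93,  95, 0.070, 0.106, 0.099, 0.085],
    D => [101,  54, 146, 0.147, 0.110, 0.179, 0.081],
    N => [ 67,  89, 156, 0.161, 0.083, 0.191, 0.091],
    C => [ 70, 119, 119, 0.149, 0.050, 0.117, 0.128],
    E => [151,  37,  74, 0.056, 0.060, 0.077, 0.064],
    Q => [111, 110,  98, 0.074, 0.098, 0.037, 0.098],
    G => [ 57,  75, 156, 0.102, 0.085, 0.190, 0.152],
    H => [100,  87,  95, 0.140, 0.047, 0.093, 0.054],
    I => [108, 160,  47, 0.043, 0.034, 0.013, 0.056],
    L => [121, 130,  59, 0.061, 0.025, 0.036, 0.070],
    K => [114,  74, 101, 0.055, 0.115, 0.072, 0.095],
    M => [145, 105,  60, 0.068, 0.082, 0.014, 0.055],
    F => [113, 138,  60, 0.059, 0.041, 0.065, 0.065],
    P => [ 57,  55, 152, 0.102, 0.301, 0.034, 0.068],
    S => [ 77,  75, 143, 0.120, 0.139, 0.125, 0.106],
    T => [ 83, 119,  96, 0.086, 0.108, 0.065, 0.079],
    W => [108, 137,  96, 0.077, 0.013, 0.064, 0.167],
    Y => [ 69, 147, 114, 0.082, 0.065, 0.114, 0.125],
    V => [106, 170,  50, 0.062, 0.048, 0.028, 0.053],
);

# Hypothetical helper: test for a hairpin turn starting at position $i.
# Returns (is_turn, P(t), mean P(a), mean P(b), mean P(turn)).
sub turn_at {
    my ($seq, $i) = @_;
    my @r = split //, substr($seq, $i, 4);
    return unless @r == 4;
    # P(t): f(i) of the first residue times f(i+1) of the second, and so on.
    my $pt = 1;
    $pt *= $cf{ $r[$_] }[3 + $_] for 0 .. 3;
    my @avg = (0, 0, 0);                 # mean P(a), P(b), P(turn)
    for my $res (@r) {
        $avg[$_] += $cf{$res}[$_] / 4 for 0 .. 2;
    }
    my $is_turn = ($pt > 0.000075 && $avg[2] > 100
                   && $avg[2] > $avg[0] && $avg[2] > $avg[1]) ? 1 : 0;
    return ($is_turn, $pt, @avg);
}

my ($is_turn, $pt, $pa, $pb, $pturn) = turn_at("CAENKLDHVRGPTCILFMTWYNDGP", 8);
printf "VRGP: turn=%d P(t)=%.6f avg P(a)=%.2f P(b)=%.2f P(turn)=%.2f\n",
    $is_turn, $pt, $pa, $pb, $pturn;
```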


Chou-Fasman Example

• CAENKLDHVRGPTCILFMTWYNDGP
• CAENKL – Potential helix (4 of 6 residues have P(a) > 100; C and N do not)
    • Residues with P(a) < 100: RNCGPSTY
    • Extend: When we reach RGPT, we must stop
    • CAENKLDHV: ΣP(a) = 972, ΣP(b) = 843
    • Declare alpha helix
• Identifying a hairpin turn
    • VRGP: P(t) = 0.000085
    • Average P(turn) = 113.25
    • Avg P(a) = 79.5, Avg P(b) = 98.25
