Chapter 17 Prediction, Engineering and Design of Protein Structures


Protein Engineering vs. Protein Design

• Protein Engineering: Mutating gene(s) to modify an existing protein.
  – Capability exists
  – Many examples can be found

• Protein Design: Designing an entire protein from scratch to serve a specific purpose.
  – Unlikely until we can reliably predict folding from sequence
  – Levinthal’s Paradox: Why we cannot test random combinations
  – We can predict 2° structure, but prediction of 3° structure will require a shortcut (e.g., energy considerations, kinetics, etc.)

Prediction of Secondary Structure from Sequence

• PDBSum (EMBL-EBI): http://www.ebi.ac.uk/pdbsum/
• Jpred: http://www.compbio.dundee.ac.uk/www-jpred/
• PredictProtein: https://www.predictprotein.org/

• Either upload a FASTA sequence file or load a new/existing sequence

• Based on propensity of certain AA’s to form specific structures, or stereochemical considerations (compactness & hydrophobicity related to known tertiary structures), but all are related to extensive analyses of sequences and the applications of scoring matrices


FASTA format: versatile and compact, with one header line (beginning with ">") followed by a string of nucleotides or amino acids in the single-letter code
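To make the FASTA layout concrete, here is a minimal Python sketch of a reader. The function name `parse_fasta` and the demo records are hypothetical, and real files may need extra handling (comments, duplicate headers) that this sketch ignores.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text (illustrative only)."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]          # header line, without the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


demo = ">seq1 demo\nACDE\nFGH\n>seq2\nKLM\n"
print(parse_fasta(demo))
```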

Pairwise Alignment

• Potential relationships between proteins or nucleic acids can be explored by comparing 2 or more sequences of amino acids or nucleotides.

• Difficult to do visually.
• Computer algorithms help us by:
  – Accelerating the comparison process
  – Allowing for “gaps” or indels in sequences (i.e., insertions, deletions)
  – Identifying substituted amino acids that are structurally or functionally similar (e.g., D and E).

Pevsner, Bioinformatics and Functional Genomics, 2009

One way to do this is with BLAST (Basic Local Alignment Search Tool)

• Allows rapid sequence comparison of a query sequence against a database.
• The BLAST algorithm is fast, accurate, and web-accessible.
• BLAST lets the user select from a variety of scoring matrices to evaluate sequence relatedness.

BLAST is…
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• A tool that supports analysis of DNA and protein databases

NCBI key features: BLAST

[Figure: 3CLN]

BLAST

BLAST allows the user to search a sequence (the query) against millions of sequences in the NCBI database (the target).

Global alignments (e.g., Needleman-Wunsch) would be time-consuming and computationally intensive for this amount of data.

BLAST is designed for local alignment, not global alignment. This allows for faster searches and can match subsets of proteins (e.g., domains).

[Figure: C-terminal domain of CaM (from 3cln.pdb), with the two Ca2+ sites, numbered coordinating positions, and the F helices labeled.]

BLAST Output from DB Search

Graphic Summary includes conserved domains, when applicable.


BLAST Output from DB Search

Graphic Summary includes the distribution of BLAST hits, color-coded by bit score. Higher scores are related to higher sequence identity.

Sequence Analyses: RNA

• Codons (3 RNA bases in sequence) determine each amino acid that will build the protein expressed

• Many amino acids are encoded by more than 1 codon (change in 3rd base). Change of single base may not be significant.

Comparing protein sequences

• Comparing protein sequences is usually more informative than comparing nucleotide sequences.
  – Changing the base at the 3rd position in a codon does not always change the AA (Ex: Both UUU and UUC encode phenylalanine)
  – Different AAs may share similar chemical properties (Ex: hydrophobic residues A, V, L, I)
  – Relationships between related but mismatched AAs in sequence analysis can be accounted for using scoring systems (matrices).
  – Protein sequence comparisons can ID sequence homologies from proteins sharing a common ancestor as far back as 1 × 10^9 years ago (vs. 600 × 10^6 for DNA).

Amino acids by similar biophysical properties

http://kimwootae.com.ne.kr/apbiology/chap2.htm



These have useful fluorescent properties







Sequence Identity and Similarity

• Identity: How closely two sequences match one another.
  – Unlike homology, identity can be measured quantitatively

• Similarity: Pairs of residues that are structurally or functionally related (conservative substitutions).


>lcln|28245 3CLN:A|PDBID|CHAIN|SEQUENCE
Length=148

Score = 268 bits (684), Expect = 3e-97, Method: Compositional matrix adjust.
Identities = 130/148 (88%), Positives = 143/148 (97%), Gaps = 0/148 (0%)

Query  1    AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60
            A+QLTEEQIAEFKEAF+LFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
Sbjct  1    ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60

Query  61   GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE  120
            GTIDFPEFL++MARKMK+ DSEEE+ EAF+VFD+DGNG ISAAELRHVMTNLGEKLTD+E
Sbjct  61   GTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEE  120

Query  121  VDEMIREADIDGDGHINYEEFVRMMVSK  148
            VDEMIREA+IDGDG +NYEEFV+MM +K
Sbjct  121  VDEMIREANIDGDGQVNYEEFVQMMTAK  148

88% of the aligned positions contain identical amino acids (Identities). This increases to 97% (Positives) when you also count positions where the amino acids differ but have similar properties.
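As a rough illustration of how the "Identities" figure is obtained, the sketch below counts identical columns in two already-aligned sequences. The function name `percent_identity` is hypothetical, and "Positives" is not computed because that needs a substitution matrix.

```python
def percent_identity(query, subject):
    """Percent of alignment columns holding the same residue (gaps never count)."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return 100.0 * identical / len(query)


# Last block of the 3CLN alignment (columns 121-148): 22 of 28 columns match.
q = "VDEMIREADIDGDGHINYEEFVRMMVSK"
s = "VDEMIREANIDGDGQVNYEEFVQMMTAK"
print(round(percent_identity(q, s), 1))
```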

Sequence Homology

• Homology: Two sequences are homologous if they share a common ancestor.
• No “degrees of homology”: only homologous or not
• Almost always share similar 3D structure
  – Ex. myoglobin and beta globin
  – Sequences can change significantly over time, but 3D structure changes more slowly


Beta-globin subunit of adult hemoglobin (2H35.pdb, in blue), superimposed over myoglobin (3RGK.pdb, in red). These sequences probably separated 600 million years ago.

Percent Identity and Homology

• For an alignment of 70 amino acids, 40% sequence identity is a reasonable threshold for homology.

• Above 20% (more than 70 amino acids) may indicate homology.

• Below 20% probably indicates chance alignment.


Orthologs and Paralogs

• Orthologs: Homologous sequences in different species that arose from a common ancestral gene during speciation.
  – Ex. Humans and rats diverged around 80 million years ago, which is when divergence of their myoglobin genes occurred.
  – Orthologs frequently have similar biological functions.
    • Human and rat myoglobin (oxygen storage and transport in muscle)
    • Human and rat CaM

• Paralogs: Homologous sequences that arose by a mechanism such as gene duplication.
  – Within the same organism/species
    • Ex. Myoglobin and beta globin are paralogs
  – Have distinct but related functions.


Conservative Substitutions in Matrices

Scoring may also vary based on conservative substitutions of amino acids: i.e., amino acids with similar properties will not lose as many points as AAs with very different properties.

• Basic AAs: K, R, H
• Acidic AAs: D, E
• Hydroxylated AAs: S, T
• Hydrophobic AAs: G, A, V, L, I, M, F, P, W, Y


These relationships would be considered when calculating “Positives” in BLAST alignment.

Dayhoff Model: Building a Scoring Matrix

• In 1978, Margaret Dayhoff provided one of the first models of a scoring matrix
• Model was based on rules by which evolutionary changes occur in proteins
• Catalogued 1000s of proteins and considered which specific amino acid substitutions occurred when 2 homologous proteins were aligned
• Assumes substitution patterns in closely-related proteins can be extrapolated to more distantly-related proteins
• An accepted point mutation (PAM) is an AA replacement accepted by natural selection
• Based on observed mutations, not necessarily on related AA properties
• Probable mutations are rewarded, while unlikely mutations are penalized
• Scores for comparison of 2 residues (i, j) are based on the following equation:

$$S_{i,j} = 10 \log_{10}\left(\frac{q_{i,j}}{p_i}\right)$$

Here, $q_{i,j}$ is the probability of an observed substitution (from the mutation probability matrix), while $p_i$ is the likelihood of observing the replacement AA (i) as a result of chance (from the normalized frequency of AA table).


PAM250 Mutation Probability Matrix

Columns: original AA. Rows: replacement AA.

         A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
Ala A   13   6   9   9   5   8   9  12   6   8   6   7   7   4  11  11  11   2   4   9
Arg R    3  17   4   3   2   5   3   2   6   3   2   9   4   1   4   4   3   7   2   2
Asn N    4   4   6   7   2   5   6   4   6   3   2   5   3   2   4   5   4   2   3   3
Asp D    5   4   8  11   1   7  10   5   6   3   2   5   3   1   4   5   5   1   2   3
Cys C    2   1   1   1  52   1   1   2   2   2   1   1   1   1   2   3   2   1   4   2
Gln Q    3   5   5   6   1  10   7   3   7   2   3   5   3   1   4   3   3   1   2   3
Glu E    5   4   7  11   1   9  12   5   6   3   2   5   3   1   4   5   5   1   2   3
Gly G   12   5  10  10   4   7   9  27   5   5   4   6   5   3   8  11   9   2   3   7
His H    2   5   5   4   2   7   4   2  15   2   2   3   2   2   3   3   2   2   3   2
Ile I    3   2   2   2   2   2   2   2   2  10   6   2   6   5   2   3   4   1   3   9
Leu L    6   4   4   3   2   6   4   3   5  15  34   4  20  13   5   4   6   6   7  13
Lys K    6  18  10   8   2  10   8   5   8   5   4  24   9   2   6   8   8   4   3   5
Met M    1   1   1   1   0   1   1   1   1   2   3   2   6   2   1   1   1   1   1   2
Phe F    2   1   2   1   1   1   1   1   3   5   6   1   4  32   1   2   2   4  20   3
Pro P    7   5   5   4   3   5   4   5   5   3   3   4   3   2  20   6   5   1   2   4
Ser S    9   6   8   7   7   6   7   9   6   5   4   7   5   3   9  10   9   4   4   6
Thr T    8   5   6   6   4   5   5   6   4   6   4   6   5   3   6   8  11   2   3   6
Trp W    0   2   0   0   0   0   0   0   1   0   1   0   0   1   0   1   0  55   1   0
Tyr Y    1   1   2   1   3   1   1   1   3   2   2   1   2  15   1   2   2   3  31   2
Val V    7   4   4   4   4   4   4   5   4  15  10   4  10   5   5   5   7   2   4  17

Think of these values as percentages (columns sum to approximately 100). For example, there is an 18% (0.18) probability of R being replaced by K. This probability matrix needs to be converted into a scoring matrix.

http://www.icp.ucl.ac.be/~opperd/private/pam250.html

Normalized Frequencies of Amino Acids

Ala 0.096    Asn 0.042
Gly 0.090    Pro 0.041
Lys 0.085    Ile 0.035
Leu 0.085    His 0.034
Val 0.078    Arg 0.034
Thr 0.062    Gln 0.032
Ser 0.057    Tyr 0.030
Asp 0.053    Cys 0.025
Glu 0.053    Met 0.012
Phe 0.045    Trp 0.012

Note: How often a given amino acid appears in a protein (determined by empirical analyses)

Purpose of PAM Matrices

• Derive a scoring system to determine relatedness of 2 sequences.

• PAM mutation probability matrix must be converted to a scoring matrix (log odds matrix).

PAM250 Log-Odds Matrix

Cys C   12
Ser S    0   2
Thr T   -2   1   3
Pro P   -3   1   0   6
Ala A   -2   1   1   1   2
Gly G   -3   1   0  -1   1   5
Asn N   -4   1   0  -1   0   0   2
Asp D   -5   0   0  -1   0   1   2   4
Glu E   -5   0   0  -1   0   0   1   3   4
Gln Q   -5  -1  -1   0   0  -1   1   2   2   4
His H   -3  -1  -1   0  -1  -2   2   1   1   3   6
Arg R   -4   0  -1   0  -2  -3   0  -1  -1   1   2   8
Lys K   -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
Met M   -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
Ile I   -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
Leu L   -8  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   8
Val V   -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
Phe F   -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Tyr Y    0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
Trp W   -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
         C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W

This is the PAM250 scoring matrix, calculated as follows (and rounded to the nearest integer):

$$S_{i,j} = 10 \log_{10}\left(\frac{q_{i,j}}{p_i}\right)$$

where $q_{i,j}$ is the mutation probability and $p_i$ is the normalized frequency of the replacement amino acid.
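The log-odds conversion can be checked numerically with a short Python sketch. The function name `pam_log_odds` is hypothetical; the inputs are read from the PAM250 mutation probability matrix and the normalized frequency table in these slides.

```python
import math


def pam_log_odds(q_ij, p_i):
    """10 * log10 of (observed substitution probability / chance frequency)."""
    return 10 * math.log10(q_ij / p_i)


# R replaced by K: q = 0.18 (matrix), p(Lys) = 0.085 (frequency table)
score_rk = round(pam_log_odds(0.18, 0.085))
# W retained as W: q = 0.55 (matrix), p(Trp) = 0.012 (frequency table)
score_ww = round(pam_log_odds(0.55, 0.012))
print(score_rk, score_ww)  # 3 and 17, the R/K and W/W entries of the log-odds matrix
```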


Pairwise Alignment and Homology

PAM Value   Distance (%)
80          50
100         60
200         75
250         85   <- Twilight zone
300         92

Think of the PAM value as the total number of mutations. This includes multiple mutations over time at a single position. Currently, we accept that once the percent distance reaches ~85%, homology is indeterminate. PAM250 works best for more distantly related protein sequences.

http://www.icp.ucl.ac.be/~opperd/private/pam.html

Seq1 AGDFWYGGDGEYLLV
Seq2 AGQFWYGGEGEKLLV
Seq3 AGEFWYGGEGEKLLV

Seq1 and Seq2 are separated by 3 PAM units, while Seq1 and Seq3 are separated by 4 PAM units

Practical Lessons from the Dayhoff Model

Less mutable amino acids likely play more important structural and functional roles

Mutable amino acids fulfill functions that can be filled by other amino acids with similar properties

Common substitutions tend to require only a single nucleotide change in codon

Amino acids that can be created from more than 1 codon are more likely to be created as a substitute (See p. 63, textbook)

Changes to sequence that do not alter structure and function of protein likely to be more tolerated in nature


BLOSUM62 Scoring Matrix

• BLOck SUbstitution Matrix
• By Henikoff and Henikoff (1992)
• Default scoring matrix for pairwise alignment of sequences using BLAST (local alignments)
• Based on empirical observations of distantly-related proteins organized into blocks

A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -1   1   1  -2  -1  -3  -2   5
M  -1  -2  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

In BLOSUM62, the sequences within each block are grouped at a 62% identity threshold.

General Trends in Scoring Matrices

Divergence                                   BLOSUM      PAM
Less divergent (e.g., human vs. chimp)       BLOSUM90    PAM30
In between                                   BLOSUM62    PAM120
More divergent (e.g., human vs. bacteria)    BLOSUM45    PAM250

Choose a matrix that is consistent with the level of sequence identity you are investigating. For example, if you are looking at or for more closely related sequences, use BLOSUM90. If you are not sure, use BLOSUM62.

Sequence Alignments: General Concepts

• Global Alignment: Tries to match the entire length of the sequences.

• Local Alignment: Tries to find the best-scoring section that matches.

Both are examples of dynamic programming: precise but slow

Global Alignment

Input: two sequences over the same alphabet (either nucleotide or amino acid sequences)

Output: The alignment of the sequences

Example:
• GADEGYFGPVILAADGEVA and GGAEGDYFGPAIAEGEVA
• A possible alignment might look like this (the figure marks 2 insertions, 3 deletions, and 2 mutations):

-GADEG-YFGPVILAADGEVA
GGA-EGDYFGPAI--AEGEVA

Each position is scored independently:
• Match: +1
• Mismatch: -1
• Insertions or deletions (gaps): -2

The alignment score is the sum of the position scores

Global Alignment – A Simple Scoring Scheme

-GADEG-YFGPVILAADGEVA
GGA-EGDYFGPAI--AEGEVA
Global Alignment Score: (14 × (+1)) + (5 × (-2)) + (2 × (-1)) = 2

-----GADEG-YFGPVILAADGEVA---
DLGNVGA-EGDYFGPAI--AEGEVARPL
Global Alignment Score: (14 × (+1)) + (12 × (-2)) + (2 × (-1)) = -12

-----GADEG-YFGPVILAADGEVA---
dlgnvGA-EGDYFGPAI--AEGEVArpl
Local Alignment Score: (14 × (+1)) + (4 × (-2)) + (2 × (-1)) = 4
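The three scores above can be reproduced with a small Python sketch of the simple scheme (match +1, mismatch -1, gap -2). The function name `score_alignment` is hypothetical, and the local score is obtained here simply by slicing off the unaligned flanks by hand.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Sum independent per-column scores of two aligned, equal-length strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap        # insertion or deletion
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


short_score = score_alignment("-GADEG-YFGPVILAADGEVA", "GGA-EGDYFGPAI--AEGEVA")
top = "-----GADEG-YFGPVILAADGEVA---"
bottom = "DLGNVGA-EGDYFGPAI--AEGEVARPL"
global_score = score_alignment(top, bottom)
# Local score: ignore the 5 leading and 3 trailing flank columns.
local_score = score_alignment(top[5:25], bottom[5:25])
print(short_score, global_score, local_score)
```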

Matrices and Gap Costs

Query Length   Substitution Matrix   Gap Costs
<35            PAM-30                (9,1)
35-50          PAM-70                (10,1)
50-85          BLOSUM-80             (10,1)
>85            BLOSUM-62             (10,1)

The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps. Gapped BLAST and PSI-BLAST use "affine gap costs" which charge the score -a for the existence of a gap, and the score -b for each residue in the gap. Thus a gap of k residues receives a total score of -(a+bk); specifically, a gap of length 1 receives the score -(a+b).

Your total raw score for the alignment is reduced when you introduce gaps into the query sequence.

Calculate the score in BLOSUM-62 for a gap with 7 residues…
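One way to work the exercise is the sketch below, which applies the affine formula -(a + bk). The function name `affine_gap_score` is hypothetical, and the defaults (a, b) = (10, 1) are taken from the BLOSUM-62 row of the table on this slide; other gap-cost settings would give a different number.

```python
def affine_gap_score(k, a=10, b=1):
    """Score of one gap of k residues: -a for existence, -b per residue."""
    return -(a + b * k)


print(affine_gap_score(7))  # -(10 + 1*7) = -17
print(affine_gap_score(1))  # -(10 + 1)   = -11
```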

http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/

Global Sequence Alignments

• Global Alignment: Entire sequence of each protein or DNA.
• Needleman and Wunsch (1970)
• Reduces the problem to a series of smaller alignments on a residue-by-residue basis.
• How this approach works:
  1. Set up a matrix
  2. Score the matrix
  3. ID the optimal alignment
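The three steps can be sketched in Python as follows. This is a minimal illustration using the simple +1/-1/-2 scheme from the earlier slide with a linear gap penalty, not a substitution matrix or affine gaps; the function name `needleman_wunsch` is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # Step 1: set up the matrix; first row/column hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Step 2: score the matrix cell by cell from the three neighbouring cells.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Step 3: trace back from the bottom-right corner to recover one optimum.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, row1, row2 = needleman_wunsch("GADEGYFGPVILAADGEVA", "GGAEGDYFGPAIAEGEVA")
print(best)
print(row1)
print(row2)
```

The optimal score is at least 2, because the hand-made alignment on the earlier slide already scores 2 under the same scheme.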

Local Sequence Alignment

• Local Alignment: Highest-scoring matching regions (subsets) between 2 sequences.

• Smith and Waterman Algorithm (1981)
• Scoring is similar to global alignment
  1. Set up a matrix
  2. Score the matrix
     • No negative values allowed: If negative values are the only choices, then the cell defaults to zero (0).
     • Mismatches and gaps at ends score 0.
  3. ID the optimal alignment

• More sensitive but much slower than heuristic methods (FASTA, BLAST)
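A minimal Python sketch of the scoring step is shown below, again assuming the simple +1/-1/-2 scheme rather than a substitution matrix. The function name `smith_waterman_score` is hypothetical, and only the best local score is returned (no traceback).

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score; cells are floored at zero."""
    best = 0
    prev = [0] * (len(b) + 1)           # first row is all zeros
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)       # first column is all zeros
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,                    # no negative values allowed
                          prev[j - 1] + sub,    # extend diagonally
                          prev[j] + gap,        # gap in b
                          curr[j - 1] + gap)    # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


local_best = smith_waterman_score("GADEGYFGPVILAADGEVA",
                                  "DLGNVGAEGDYFGPAIAEGEVARPL")
print(local_best)
```

Because unaligned ends cost nothing, the flanking residues of the longer sequence do not pull the score down as they do in the global case.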

Heuristic (word or k-tuple based) algorithms

• Uses the initial query to make reasonable guesses about sequence alignments, then evaluates those considered “most likely”

• Alignment is then extended until:
  – One of the sequences ends
  – Score falls below some threshold

• In BLAST, the search depends on word size

KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)

[Figure labels: "Hit!" at the matching word, with "extend" on either side.]

FASTA (Pearson and Lippman 1988)

• Combines the Smith and Waterman algorithm with a word (k-tup) search: a faster, heuristic approach

• Query sequence is divided into small words (usually k=2 for proteins)
  – Words are used to initially compare and match sequences
  – If words are located on the same diagonal, the surrounding region is then selected for analysis

Seq 1 FYGKLHMEGD
Seq 2 FWGKLHMEGSNE

Seq 1 Search words (k-tup = 2): FY YG GK KL LH HM ME EG GD
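The word-splitting and same-diagonal idea can be sketched in Python as below. This illustrates only the seeding step, not the FASTA program itself; the function names `ktuples` and `diagonal_hits` are hypothetical.

```python
from collections import Counter


def ktuples(seq, k=2):
    """All overlapping words of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diagonal_hits(seq1, seq2, k=2):
    """Count shared words per diagonal (offset = position in seq2 - position in seq1)."""
    positions = {}
    for i, word in enumerate(ktuples(seq1, k)):
        positions.setdefault(word, []).append(i)
    hits = Counter()
    for j, word in enumerate(ktuples(seq2, k)):
        for i in positions.get(word, []):
            hits[j - i] += 1
    return hits


print(ktuples("FYGKLHMEGD"))
print(dict(diagonal_hits("FYGKLHMEGD", "FWGKLHMEGSNE")))
```

For the two example sequences, the six shared words (GK, KL, LH, HM, ME, EG) all fall on diagonal 0, which is the region a FASTA-style search would then examine more closely.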

http://www.incogen.com/bioinfo_tutorials/Bioinfo-Lecture_2-pairwise-align.html

Statistical Measures of Algorithms

• Objective of alignment algorithms is to maximize sensitivity and specificity of alignments.
• Sensitivity: Measure of how well the algorithm correctly predicts sequences that are related.
• Specificity: Measure of how well the algorithm correctly predicts sequences that are unrelated.

TP: Positive identified as positive
FP: Negative identified as positive
TN: Negative identified as negative
FN: Positive identified as negative
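In terms of the four counts above, the standard definitions are sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). A short Python sketch follows; the counts used are made-up illustrative numbers, not data from these slides.

```python
def sensitivity(tp, fn):
    """Fraction of truly related sequences that were reported as related."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of truly unrelated sequences that were reported as unrelated."""
    return tn / (tn + fp)


# Hypothetical search: 8 true hits found, 2 missed; 9 non-hits rejected, 1 false hit.
print(sensitivity(tp=8, fn=2), specificity(tn=9, fp=1))
```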

Relationships between biological sequences

• Biological sequences tend to occur in families
  – These may be related genes within an organism (paralogs) or between species (orthologs)
  – Presumably derived from a common ancestor
• Nucleotides corresponding to coding regions are typically less well conserved than proteins due to degeneracy of the genetic code
  – More difficult to align

Sequences evolve faster than structures, but homologous sequences tend to retain similar structure and function (e.g., rat vs. human CaM)

Multiple sequence alignments

• Homology can be observed through multiple sequence alignments (MSA)

• MSA: 3 or more protein (or nucleic acid) sequences that are partially or completely aligned

• Homologous residues are aligned in columns across the length of the sequences

1exr_A  -EQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 59
1N0Y_A  AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 60
3cln_   ----TEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN 56
           :************:*******************************************

1exr_A  GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE 119
1N0Y_A  GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE 120
3cln_   GTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEE 116
        *********::******: *****: ***:***:**** *******************:*

1exr_A  VDEMIREADIDGDGHINYEEFVRMMVS- 146
1N0Y_A  VDEMIREADIDGDGHINYEEFVRMMVSK 148
3cln_   VDEMIREANIDGDGQVNYEEFVQMMTA- 143
        ********:*****::******:**.:

Multiple sequence alignments

• MSAs are powerful because they can reveal relationships between 2 sequences that can only be observed through their relationships with a third sequence

[Figure: Seq 1 and Seq 2 shown first without an apparent relationship, then aligned together with Seq 3. Extracted sequence text: AVGYDFGEKMLSGADDWLVGERADLTGAEIDE / AVGYDFGEKMLSGA--DDWLVGYDRADK-LTGAE-DD-LVG-ERAD--LTGAEIDE-]

How are MSAs determined?

MSAs can be determined based on:

• Presence of highly-conserved residues such as cysteine
• Conserved motifs and domains
• Conserved features of protein secondary structure
• Regions showing consistent patterns of insertions or deletions

[Figure: C-terminal domain of CaM (from 3cln.pdb), showing conserved 2° structure (α-helices).]

ClustalW Output for CD2 Protein

Figure labels 1-5:
1. Color coding indicates AA property class
2. * Indicates 100% conserved over entire alignment
3. : Conservative mutations
4. . Less conservative mutations
5. [blank] gap or least conserved mutations

Statistical Analysis of PDB Data: Ca2+ vs. Pb2+

Ligand distribution (%):

Ca: EF-Hand
  SC         65.3  (Asp 29.7, Glu 26.6, Asn 6.1, Ser 2.6, Thr 0.3, Gln 0.0)
  Carbonyl   21.4
  HOH        13.3

Ca: Non-EF-Hand
  SC         42.9  (Asp 24.5, Glu 10.4, Asn 4.3, Gln 1.3, Ser 1.3, Thr 1.0, Tyr 0.1)
  HOH        33.1
  Carbonyl   23.9

Pb: Ligand Distribution
  SC O       61.0  (Glu 38.4, Asp 20.3, Asn 1.1, Thr 0.6, Gln 0.6)
  HOH        20.3
  S           7.3
  Carbonyl    5.6
  SC N        5.1
  MC N        0.6

(Kirberger, Wang et al. 2008; Kirberger and Yang 2008; Glusker et al. 1998)

[Figure: metal (M) and ligand (L) coordination diagrams showing pentagonal bipyramidal geometry and holo- and hemi-directed geometries.]

Develop Algorithms/Programs to Address Specific Problems

• Identify calcium-binding proteins by matching patterns of known calcium-binding sites in sequences.

Descriptive ID / Sequence Pattern

Prosite PS00018 (EF-Hand):
  D-X-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-X(2)-[DE]-[LIVMFYW]

Yang (Pattern 1):
  EFH Helix E: X-{DNQ}-X-X-{GP}-{ENSPQ}-X-X-{DQRP}
  EFH Loop: [DNS]-X-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-X(2)-[ED]
  EFH Helix F: [FLMYVIW]-X-X-{NPS}-{DNEQ}-X(3)

Yang (Pattern 2) YY00018:
  X(1)-{DNQ}-X(2)-{GP}-{ENSPQ}-X(2)-{DQRP}-[DNS]-X(1)-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-X(2)-[ED]-[FLMYVIW]-X(2)-{NPS}-{DNEQ}-X(3)
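Patterns written in this PROSITE-style syntax can be turned into regular expressions and scanned against a sequence. The Python sketch below handles only the elements used in the patterns above (X, [allowed], {forbidden}, and (n) repeats); the function name `prosite_to_regex` is hypothetical, and the test sequence is the first 60 residues of the CaM query from the BLAST alignment earlier in the slides.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        core, _, repeat = element.partition("(")
        repeat = repeat.rstrip(")")          # "2" for X(2), "" if no repeat
        if core.upper() == "X":
            regex = "."                      # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"  # any residue except those listed
        else:
            regex = core                     # [allowed set] or a single letter
        if repeat:
            regex += "{" + repeat + "}"
        parts.append(regex)
    return "".join(parts)


PS00018 = ("D-X-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-"
           "[DENQSTAGC]-X(2)-[DE]-[LIVMFYW]")
cam_start = "AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN"

hit = re.search(prosite_to_regex(PS00018), cam_start)
print(hit.group(), hit.start())  # first EF-hand loop of CaM
```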

Protein Engineering by Rational Design

1. Computer-aided design – May include statistical & structural parameters

2. Site-directed mutagenesis – changing one or more nucleotides in a plasmid to change an AA in the protein

3. Transformation – Alteration of a bacterial cell through introduction of exogenous genetic material

4. Protein expression – Manufacturing the protein(s)

5. Protein purification – Separating the target protein from other biomolecules

6. Biochemical testing

Engineered Proteins: Therapy

• Abatacept: Fusion protein composed of the Fc region of the immunoglobulin IgG1 fused to the extracellular domain of CTLA-4.

• Abatacept binds to the CD80 and CD86 molecules, and inhibits T cell activation by blocking the signal from the antigen-presenting cell. This prevents the immune response.

• Developed by Bristol-Myers Squibb and is licensed in the United States for the treatment of rheumatoid arthritis.

Engineered Proteins: Research

Ca.CD2 is a protein engineered by Dr. Jenny Yang’s research group at Georgia State University. Cell Adhesion Molecule CD2 was modified by insertion of a calcium binding site. The binding site was observed to bind calcium selectively over other mono- and di-valent biological metals, and to bind several other metals including lanthanum and terbium, while still retaining the ability to bind its natural target molecule.

The objectives of this research were to see if a metal binding site could be engineered into a small protein without significantly altering the protein, to study an isolated calcium binding site, and to develop a model for the development of proteins with specific functions.

Design of a calcium-binding protein with desired structure in a cell adhesion molecule, JACS, 2005.

[Figure: 1T6W]

Engineered Proteins: Research

GFP and other FPs, fused to other proteins, have found a variety of uses in cellular and tissue imaging.

http://www.conncoll.edu/ccacad/zimmer/GFP-ww/prasher.html

Mutations to GFP produce different colors

http://zeiss-campus.magnet.fsu.edu/articles/probes/jellyfishfps.html

The availability of different FP colors has also enabled researchers to develop methods to probe whether two proteins are within a distance of less than 10 nm of each other using the phenomenon of Förster (or fluorescence) resonance energy transfer (FRET) (Förster 1948). FRET is the distance- and orientation-dependent radiationless transfer of excitation energy from a donor fluorophore to an acceptor chromophore.

Protein Design Algorithms

• Two major classes:
  • Exact algorithms (e.g., dead-end elimination) provide optimal solutions but have long run times
  • Heuristic algorithms (e.g., Monte Carlo) have faster run times but may not provide optimal solutions.

DEE Algorithm (Exact)

• Dead-end elimination (DEE): Compares all possible side chain rotamers on a fixed protein backbone and removes those that cannot be part of the global minimum energy conformation (GMEC).

• DEE cannot guarantee convergence. If, after a certain number of iterations, DEE cannot remove any more rotamers, then either rotamers have to be merged or another search algorithm must be used to search the remaining search space. In such cases, dead-end elimination acts as a pre-filtering algorithm to reduce the search space.
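As a toy illustration of the elimination idea, the Python sketch below applies the original DEE criterion: rotamer r at a position is removed if some alternative t exists whose worst-case energy is still lower than the best-case energy of r. All names (`dead_end_elimination`, `toy_rotamers`, the energy values) are hypothetical and are not taken from any design package.

```python
def dead_end_elimination(rotamers, self_energy, pair_energy):
    """Iteratively prune rotamers that cannot be part of the GMEC (toy version)."""
    alive = {pos: list(rots) for pos, rots in rotamers.items()}

    def pair(i, r, j, s):
        # Pair energies are stored once per pair of positions, in either order.
        return pair_energy.get(((i, r), (j, s)),
                               pair_energy.get(((j, s), (i, r)), 0.0))

    changed = True
    while changed:                      # repeat until no more rotamers can be removed
        changed = False
        for i in alive:
            for r in list(alive[i]):
                best_case_r = self_energy[(i, r)] + sum(
                    min(pair(i, r, j, s) for s in alive[j]) for j in alive if j != i)
                for t in alive[i]:
                    if t == r:
                        continue
                    worst_case_t = self_energy[(i, t)] + sum(
                        max(pair(i, t, j, s) for s in alive[j]) for j in alive if j != i)
                    if best_case_r > worst_case_t:
                        alive[i].remove(r)   # r is a dead end
                        changed = True
                        break
    return alive


# Toy system: two positions with two rotamers each (arbitrary energy units).
toy_rotamers = {0: ["a", "b"], 1: ["c", "d"]}
toy_self = {(0, "a"): 0, (0, "b"): 10, (1, "c"): 0, (1, "d"): 1}
toy_pair = {((0, "a"), (1, "c")): 0, ((0, "a"), (1, "d")): 1,
            ((0, "b"), (1, "c")): 2, ((0, "b"), (1, "d")): 0}
print(dead_end_elimination(toy_rotamers, toy_self, toy_pair))
```

In this toy case rotamer d survives the first pass and is only eliminated after b has been removed, which shows why the pruning is iterated.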

https://www.cs.duke.edu/brd/papers/Proteins12/

Branch and Bound Algorithms (Exact)

• The protein design conformational space can be represented as a tree, where the protein residues are ordered in an arbitrary way, and the tree branches at each of the rotamers in a residue. Branch and bound algorithms use this representation to efficiently explore the conformation tree: At each branching, branch and bound algorithms bound the conformation space and explore only the promising branches.

• Tests multiple conformational changes (global changes), retaining the lowest energy conformations. Can be a very slow process.
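The tree search can be sketched in Python on a toy rotamer problem. The bound used here (energy so far plus the lowest self energy of each unassigned position) is only valid under the stated assumption that all pair energies are non-negative; the function name `branch_and_bound` and the energy values are hypothetical.

```python
def branch_and_bound(rotamers, self_energy, pair_energy):
    """Return (lowest energy, rotamer choice per position) for a toy system."""
    positions = sorted(rotamers)
    # Optimistic contribution of each unassigned position (assumes pair energies >= 0).
    min_self = {p: min(self_energy[(p, r)] for r in rotamers[p]) for p in positions}
    best = [float("inf"), None]

    def added_energy(assignment, pos, rot):
        # Self energy plus pair energies with every rotamer already placed.
        e = self_energy[(pos, rot)]
        for q, s in assignment:
            e += pair_energy[((q, s), (pos, rot))]
        return e

    def search(depth, assignment, energy):
        if depth == len(positions):          # leaf: a complete conformation
            if energy < best[0]:
                best[0], best[1] = energy, [r for _, r in assignment]
            return
        bound = energy + sum(min_self[p] for p in positions[depth:])
        if bound >= best[0]:
            return                           # prune: branch cannot beat the best so far
        pos = positions[depth]
        for rot in rotamers[pos]:            # branch on each rotamer at this residue
            search(depth + 1, assignment + [(pos, rot)],
                   energy + added_energy(assignment, pos, rot))

    search(0, [], 0.0)
    return best[0], best[1]


bb_rotamers = {0: ["a", "b"], 1: ["c", "d"]}
bb_self = {(0, "a"): 0, (0, "b"): 10, (1, "c"): 0, (1, "d"): 1}
bb_pair = {((0, "a"), (1, "c")): 0, ((0, "a"), (1, "d")): 1,
           ((0, "b"), (1, "c")): 2, ((0, "b"), (1, "d")): 0}
print(branch_and_bound(bb_rotamers, bb_self, bb_pair))
```

Here the whole subtree under rotamer b is pruned, because its partial energy (10) already exceeds the best complete conformation found (0).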

Monte Carlo and Simulated Annealing Algorithm (Heuristic)

• A starting structure is needed for a molecular dynamics calculation, which is generated from all constraints for the molecular structure, such as bond-lengths and bond-angles.

• This starting structure may be any conformation such as an extended strand or an already folded protein.

• Starting at theoretical high temperatures (meaning energy put into system) approximately 20 different random, simulated protein folds are allowed to “cool” to lowest localized energies, to observe folding.

• These results are used for another set of iterations with different input parameters, until the energy can no longer be minimized (ideally, the global energy minimum has been reached, although a heuristic search cannot guarantee this).
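The "heat, then cool" idea can be sketched with a generic simulated annealing loop in Python. This is a toy under stated assumptions: the energy is a one-dimensional function rather than a protein force field, moves are accepted with the Metropolis rule, and every name and parameter value (`simulated_annealing`, `toy_energy`, the cooling schedule) is hypothetical.

```python
import math
import random


def simulated_annealing(energy, neighbor, start, t_start=10.0, t_end=0.01,
                        cooling=0.95, steps_per_t=20, seed=0):
    """Return the lowest-energy state seen while cooling from t_start to t_end."""
    rng = random.Random(seed)
    state, e = start, energy(start)
    best, best_e = state, e
    t = t_start
    while t > t_end:
        for _ in range(steps_per_t):
            cand = neighbor(state, rng)
            cand_e = energy(cand)
            # Always accept downhill moves; accept uphill moves with
            # probability exp(-dE / T), which shrinks as the system cools.
            if cand_e <= e or rng.random() < math.exp(-(cand_e - e) / t):
                state, e = cand, cand_e
                if e < best_e:
                    best, best_e = state, e
        t *= cooling
    return best, best_e


def toy_energy(x):
    return (x - 3) ** 2          # single minimum at x = 3


def toy_neighbor(x, rng):
    return x + rng.choice([-1, 1])  # small random move


sa_state, sa_energy = simulated_annealing(toy_energy, toy_neighbor, start=20)
print(sa_state, sa_energy)
```

Running the loop from several random starting points, as the slide describes for roughly 20 simulated folds, reduces the chance of reporting a local rather than global minimum.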

Simons, JMB, 1997
