Protein Folding Pathway Prediction Supervised by Prof. Ibrahim M.El-Henawy Dr. Ahmed H.Kamal Dr. Hisham Al-Shishiny by Haitham Ahmad Gamal


Protein Folding Pathway Prediction

Supervised by

Prof. Ibrahim M.El-Henawy

Dr. Ahmed H.Kamal

Dr. Hisham Al-Shishiny

by

Haitham Ahmad Gamal


• Problem Statement

• Motivation

• Approach

• Previous Work

• Biological Background

• What Affects Folding

• Why is it difficult

• Data Set

• Methodology (the 4 stages)

• Hypothesis (formally stated)

• Results

• Conclusion


Proteins are the most vital agents in living bodies.

Their function is what concerns scientists

Hydrophobicity → 3D Structure → Function

Much effort has gone into structure prediction, with limited success.

Results are either:

• premature, due to the huge conformational search space,

• or insufficiently accurate, due to simplifications.


Knowing how a protein folds enables us to understand how it functions.

With this level of understanding we can affect a protein either by enhancement or by suppression.

Drugs can be built to affect certain proteins directly or through other proteins interacting with the protein under investigation.


The approach used in this study is a statistical, machine learning approach. We try using this approach to answer the previous questions.

Clustering Distribution Fitting


In our study we are not developing a prediction algorithm. We are testing a hypothesis that can improve several types of prediction algorithms.

Prediction algorithms/techniques can be classified based on different criteria:

• Ab initio / Homology

• On-lattice / Off-lattice

• Heuristic / Statistics

• Protein-based / Subsequence-based

Our study fits in the coloured classes across all these criteria.


For single-chain proteins, the tertiary structure is the minimum free energy structure of the protein.


The function of a protein depends on its 3D structure rather than directly on its primary structure.

The most influential factor in protein folding (especially for globular proteins) is the hydrophobicity of the constituent amino acids.

Amino acids range from hydrophilic (for example those with charged side chains, which are soluble) to hydrophobic (for example those with aromatic side chains, which are insoluble).

A ranking of the 20 standard amino acids by hydrophobicity is called a hydrophobicity scale.
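As an illustration of what such a scale looks like, the sketch below uses the Kyte-Doolittle hydropathy values. This is an assumption: the slides do not say which scale the study used, and the threshold of 0 and the helper name `hp_pattern` are hypothetical choices made for this example only.

```python
# Kyte-Doolittle hydropathy values, one commonly used scale.
# ASSUMPTION: the slides do not name the scale used in the study.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hp_pattern(sequence, threshold=0.0):
    """Map a one-letter amino-acid sequence to an H/P string.

    'H' marks a hydrophobic residue (value above the threshold),
    'P' a hydrophilic (polar) one. The threshold is an illustrative choice.
    """
    return "".join(
        "H" if KYTE_DOOLITTLE[residue] > threshold else "P"
        for residue in sequence
    )
```

For instance, `hp_pattern("MKVLA")` labels lysine (K) as hydrophilic and the other four residues as hydrophobic.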


An exact simulation of the folding of a short peptide may take months on a supercomputer.

The number of possible conformations is huge:

$20^{l}$, such that $l$ is the length of the peptide.
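To give a feel for how quickly this estimate grows, the short sketch below evaluates the slide's own figure of 20 raised to the peptide length. The function name and the sample lengths are hypothetical and only serve the illustration.

```python
def conformation_count(length, states_per_residue=20):
    """Size of the search space under the slide's estimate of 20**length."""
    return states_per_residue ** length


# Print the estimate for a few short peptide lengths.
for length in (5, 10, 20):
    print(length, conformation_count(length))
```

Each additional residue multiplies the count by 20, so exhaustive enumeration is hopeless even for short peptides.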

Scientists proved that solving the problem for the HP model (simplified model) is NP-Complete.

Current technologies cannot keep pace with this God created miracle.


A collection of more than 1000 proteins is taken randomly from the SCOP protein database.

Each SCOP entry (file) represents one protein with all its features, including its exact atom coordinates.

Angles are extracted using the three-dimensional coordinates of each Cα atom.


Angle Extraction

Chopping to Subsequences

K-means Clustering

Distribution Fitting


[Figure: an annotated atom record showing the fields that are read: atom serial number, residue name, residue sequence number, X coordinate, Y coordinate and Z coordinate. The same fields are read for the 3rd residue, the 4th residue, the 5th residue, and so on until the end.]
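The field extraction described here can be sketched as follows, assuming the entries are stored as PDB-style text files with fixed-width ATOM records. The slides do not state the file format, so this is an assumption, and `read_ca_atoms` is a hypothetical helper name.

```python
def read_ca_atoms(pdb_lines):
    """Return (serial, residue_name, residue_number, (x, y, z)) per C-alpha atom.

    ASSUMPTION: input lines follow the fixed-width PDB ATOM record layout.
    """
    atoms = []
    for line in pdb_lines:
        # Keep only ATOM records whose atom name is CA (the alpha carbon).
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        serial = int(line[6:11])          # columns 7-11: atom serial number
        residue_name = line[17:20].strip()  # columns 18-20: residue name
        residue_number = int(line[22:26])   # columns 23-26: residue sequence number
        xyz = (
            float(line[30:38]),           # columns 31-38: X coordinate
            float(line[38:46]),           # columns 39-46: Y coordinate
            float(line[46:54]),           # columns 47-54: Z coordinate
        )
        atoms.append((serial, residue_name, residue_number, xyz))
    return atoms
```

The function accepts any iterable of lines, so an open file object can be passed directly. Entries with alternate locations or multiple chains would need extra handling that this sketch leaves out.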


The angle formed by each three consecutive Cα atoms is called the angle θ.

As shown in the figure, an angle is calculated at every Cα atom that has both a predecessor and a successor, so a protein of length L yields L − 2 angles.

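[Figure: a backbone trace with the angles θ1, θ2, θ3, … marked at consecutive Cα atoms; the atoms Cαi-1, Cαi and Cαi+1 are each labelled with their (x, y, z) coordinates.]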

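Let a be the vector from Cαi to Cαi-1: $a = C\alpha_{i-1} - C\alpha_{i}$

Let b be the vector from Cαi to Cαi+1: $b = C\alpha_{i+1} - C\alpha_{i}$

θ is the angle between a and b at Cαi. It can then be calculated using the cosine law:

$$\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$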
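As a minimal sketch of this calculation, the code below computes θ from the coordinates of three consecutive Cα atoms through the dot product of a and b. The function names are hypothetical, and reporting the angle in degrees is a choice made for this example.

```python
import math


def angle_at(prev_ca, ca, next_ca):
    """Angle in degrees at `ca` between the directions to its two neighbours."""
    a = [p - c for p, c in zip(prev_ca, ca)]  # vector from Cα(i) to Cα(i-1)
    b = [n - c for n, c in zip(next_ca, ca)]  # vector from Cα(i) to Cα(i+1)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_theta))


def backbone_angles(ca_coords):
    """Angles at every interior Cα atom of a chain of (x, y, z) tuples."""
    return [
        angle_at(ca_coords[i - 1], ca_coords[i], ca_coords[i + 1])
        for i in range(1, len(ca_coords) - 1)
    ]
```

A chain of L atoms gives L − 2 angles, because the first and last atoms each lack one neighbour.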


After the angles of all the proteins are extracted, each protein sequence is divided into subsequences of length n.

A subsequence must contain an odd number of residues.

A sliding window technique is used to chop the whole protein sequence into pieces.

The value of n is crucial in our study as will be shown in the results section.


Let's take n = 5 as an example.

[Figure: the residues aa0 to aa8 in a row, with the backbone angles Θ0, Θ1, Θ2, … marked along the chain.]

The first subsequence runs from aa0 to aa4, and the effect of this subsequence on the central angle Θ1 is what concerns us in this study.

Similarly, the effect of each following subsequence, running in general from aai to aai+n-1, on the measurement of the central angle Θi+floor(n/2)-1 is studied.
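The sliding-window chopping and the central-angle index can be sketched as below. The code assumes that `angles[k]` holds Θk and that Θk is the angle at residue aa(k+1), which is the indexing implied by the formula Θi+floor(n/2)-1. The function name is hypothetical.

```python
def windows_with_central_angle(residues, angles, n):
    """Pair every length-n window of residues with its central angle.

    ASSUMPTION: angles[k] is the angle at residues[k + 1], so a chain of
    L residues comes with L - 2 angles.
    """
    if n < 3 or n % 2 == 0:
        raise ValueError("n must be an odd number of at least 3")
    pairs = []
    for i in range(len(residues) - n + 1):
        centre = i + n // 2 - 1  # index of the angle at the window's middle residue
        pairs.append((residues[i:i + n], angles[centre]))
    return pairs
```

With nine residues and n = 5 this yields five windows, the first paired with Θ1 and the last with Θ5.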


Since hydrophobicity is the main factor affecting protein folding, the centroids were determined accordingly.

The choice of centroids is meant to cover all the possible hydrophobicity patterns of a subsequence of length n.

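Let's take n = 3 as an example.

[Figure: the initial centroids for n = 3, ranging from the all-hydrophilic pattern to the all-hydrophobic pattern. Legend: hydrophobic, hydrophilic.]

The number of initial centroids is $2^{n}$.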
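A minimal sketch of how such initial centroids could be enumerated follows. It assumes a binary encoding (0 for hydrophilic, 1 for hydrophobic) and a squared Euclidean distance for the assignment step. The slides give neither the encoding nor the distance, so both are illustrative choices, as are the function names and the ordering of the centroids.

```python
from itertools import product


def initial_centroids(n, hydrophilic=0, hydrophobic=1):
    """All 2**n hydrophobicity patterns of a length-n subsequence."""
    return list(product((hydrophilic, hydrophobic), repeat=n))


def nearest_centroid(point, centroids):
    """Index of the closest centroid (the assignment step of k-means)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )
```

For n = 3 this gives 8 centroids, from `(0, 0, 0)` (all hydrophilic) to `(1, 1, 1)` (all hydrophobic). A full k-means run would alternate this assignment step with recomputing each centroid as the mean of its members.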


Both the clustered and the unclustered data are compared, using the Kolmogorov-Smirnov test, against 66 continuous probability distributions, which are:

Beta, Burr, Burr (4P), Cauchy, Chi-Squared, Chi-Squared (2P), Dagum, Dagum (4P), Erlang, Erlang (3P), Error, Error Function, Exponential, Exponential (2P), Fatigue Life, Fatigue Life (3P), Frechet, Frechet (3P), Gamma, Gamma (3P), Gen. Extreme Value, Gen. Gamma, Gen. Gamma (4P), Gen. Logistic, Gen. Pareto, Gumbel Max, Gumbel Min, Hypersecant, Inv. Gaussian, Inv. Gaussian (3P), Johnson SB, Johnson SU, Kumaraswamy, Laplace, Levy, Levy (2P), Log-Gamma, Log-Logistic, Log-Logistic (3P), Log-Pearson 3, Logistic, Lognormal, Lognormal (3P), Nakagami, Normal, Pareto, Pareto 2, Pearson 5, Pearson 5 (3P), Pearson 6, Pearson 6 (4P), Pert, Phased Bi-Exponential, Phased Bi-Weibull, Power Function, Rayleigh, Rayleigh (2P), Reciprocal, Rice, Student's t, Triangular, Uniform, Wakeby, Weibull and Weibull (3P).
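The one-sample Kolmogorov-Smirnov statistic behind such a comparison can be sketched with the standard library alone. The two candidate distributions and their fixed parameters below are hypothetical stand-ins for the full list. In practice each distribution's parameters would first be fitted to the data, a step this sketch omits.

```python
import math


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def uniform_cdf(x, low=0.0, high=1.0):
    return min(1.0, max(0.0, (x - low) / (high - low)))


def best_fit(sample, candidates):
    """Return (name, D) for the candidate CDF with the smallest KS statistic."""
    scored = ((name, ks_statistic(sample, cdf)) for name, cdf in candidates.items())
    return min(scored, key=lambda item: item[1])
```

A call such as `best_fit(data, {"Uniform": uniform_cdf, "Normal": normal_cdf})` ranks the candidates by their statistic. A smaller value means the fitted curve stays closer to the data.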


n = 3

Distribution | Centroids in this distribution (i = Ci)
Burr | 1, 4
Burr(4p) | 7
Gen. Extreme Value | 6
Gen. Pareto | 2, 3, 5
Johnson SB | 0


n = 5

Distribution | Centroids in this distribution (i = Ci)
Dagum(4p) | 0, 5, 7, 19
Gumbel Min. | 1, 2, 3, 17, 20
Gen. Extreme Value | 4, 32
Burr(4p) | 6, 8, 10, 11, 14, 18, 21, 22, 23, 24, 27, 30, 31
Weibull(3p) | 9, 12, 13, 15, 16, 25, 26, 28, 29


n = 7

Distribution | Centroids in this distribution (i = Ci)
Weibull(3p) | 3, 21, 79
Burr(4p) | 20, 32, 40, 60, 67, 71, 74, 75, 83, 85, 105
Dagum | 4, 80
Dagum(4p) | 41, 90
Gen. Gamma(4p) | 69, 84, 106
Gen. Logistic | 2, 6, 7, 9, 12, 14, 15, 19, 33, 34, 35, 36, 37, 45, 46, 47, 49, 79, 87, 89, 94, 95, 107, 117, 125
Gumbel Min. | 66
Log-Logistic | 42, 116, 118
Wakeby | 1, 5, 8, 10, 11, 13, 16, 17, 18, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 38, 39, 43, 44, 48, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 61, 62, 63, 64, 65, 68, 70, 72, 73, 76, 77, 78, 81, 82, 86, 88, 91, 92, 93, 96, 98, 99, 100, 101, 102, 103, 104, 108, 109, 110, 111, 112, 113, 114, 115, 119, 120, 121, 122, 123, 124, 126, 127


Tricky: KS-statistic values alone are not enough for a complete interpretation.

n | KS statistic for unclustered data | KS statistic for clustered data
n = 3 | 0.09041 | 0.0937
n = 5 | 0.012 | 0.0243
n = 7 | 0.013 | 0.0202


n | No. of rejected values for unclustered data | No. of rejected values for clustered data
n = 3 | All 5 values | All 5 values
n = 5 | All 5 values | 2.94
n = 7 | All 5 values | Zero

The number of rejected critical values shows that the fits of the unclustered data are spurious.

The number of tested critical values is 5.
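Counting rejections against several critical values can be sketched as follows. The five significance levels and the large-sample approximation of the critical value are assumptions, since the slides list neither. A fit is counted as rejected at a given level when its KS statistic exceeds the critical value for that level.

```python
import math

# ASSUMPTION: five commonly tabulated significance levels; the slides only
# say that five critical values were tested.
ALPHAS = (0.2, 0.1, 0.05, 0.02, 0.01)


def ks_critical_value(alpha, n):
    """Large-sample approximation of the one-sample KS critical value."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) / math.sqrt(n)


def rejected_count(d, n, alphas=ALPHAS):
    """Number of significance levels at which a fit with statistic d is rejected."""
    return sum(1 for alpha in alphas if d > ks_critical_value(alpha, n))
```

Because the critical value shrinks like one over the square root of the sample size, a very large sample can be rejected at every level even with a small statistic. This may be one reason why the statistic alone is not enough to judge a fit.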


For the clustered data, the KS statistic shows that the larger the value of n, the better the fit.


Looking deeper at the rejected-value test, all 5 test values are rejected for n = 3, while n = 7 gives zero rejected values, which supports our hypothesis.


The results indicate a direct relationship between the hydrophobicity of the residues of a subsequence (local neighbours) and the measurements of the backbone angles. Classifying a subsequence into one of the available clusters gives good insight into the angle measurements and, consequently, the structure of the subsequence.

The length of the subsequence is also an influential factor in predicting angle measurements. Longer subsequences achieve better fits to one of the standard continuous probability distributions.


These results can be used to guide the search process in a complete protein structure prediction algorithm.

The local angle-hydrophobicity relationship can be combined with heuristic techniques such as genetic algorithms to restrict the initial population to statistically familiar conformations.

Approximations of our results can be applied to crystalline-lattice protein models such as the cuboctahedron lattice model, which allows several possible angles: 60°, 90°, 120° and 180°.
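One simple way to approximate a measured angle on such a lattice is to snap it to the nearest allowed value. The sketch below is only an illustration of that idea; the function name and the nearest-value rule are assumptions, not a method described in the slides.

```python
# Angles allowed by the lattice model mentioned above, in degrees.
LATTICE_ANGLES = (60.0, 90.0, 120.0, 180.0)


def snap_to_lattice(angle_degrees, allowed=LATTICE_ANGLES):
    """Return the allowed lattice angle closest to the measured angle."""
    return min(allowed, key=lambda a: abs(a - angle_degrees))
```

For example, a measured angle of 110° would be approximated by 120°.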

It is possible to investigate applying the same approach to subsequences longer than 7 residues while trying to minimize the required processing time.


Title

A CENTRAL-3-RESIDUES-BASED CLUSTERING APPROACH FOR STUDYING THE EFFECT OF HYDROPHOBICITY ON PROTEIN BACKBONE ANGLES

Authors

Prof. Ibrahim M.El-Henawy, Dr. Ahmed H.Kamal, Dr. Hisham Al-Shishiny, Haitham Gamal

Has been published in Egyptian Computer Science Journal (ECS Journal), ISSN-1110-2586, Volume 32, Number 1, May, 2009
