Data Mining for Protein Structure Prediction

Mohammed J. Zaki
SPIDER Data Mining Project: Scalable, Parallel and Interactive Data Mining and Exploration at RPI
http://www.cs.rpi.edu/~zaki

Outline of the Talk

- How do proteins form?
- Protein folding problem
- Contact map mining
- Using HMMs based on local motifs
- Mining “physical” dense frequent patterns (non-local motifs)
- Future directions
  - Heuristic rules
  - Folding pathways

How do Proteins Form?

Building blocks of biological systems
- DNA (nucleotides, 4 types): information carrier/encoder
- RNA: bridge from DNA to protein
- Protein (amino acids, 20 types): action molecules

Processes
- Replication of DNA
- Transcription of gene (DNA) to messenger RNA (mRNA)
- Splicing of non-coding regions of the genes (introns)
- Translation of mRNA into proteins
- Folding of proteins into 3D structure
- Biochemical or structural functions of proteins

Protein Folding Problem

Protein Structures

Primary structure
- Un-branched polymer
- 20 side chains (residues or amino acids)

Higher order structures
- Secondary: local (consecutive) in sequence
- Tertiary: 3D fold of one polypeptide chain
- Quaternary: chains packing together

Amino Acid

Polypeptide Chain

Torsion Angles

The Protein Folding Problem

Contact Map Mining

Contact Map

- Amino acids Ai and Aj are in contact if their 3D distance is less than a threshold (7 Å)
- Sequence separation is given as |i-j|
- Contact map C is an N x N matrix, where
  - C(i,j) = 1 if Ai and Aj are in contact
  - C(i,j) = 0 otherwise
- Consider all pairs with |i-j| >= 4
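The contact-map definition can be sketched in a few lines of Python. This is a minimal illustration, not the talk's code. It assumes one 3D coordinate per residue (the slides do not say which atom is used), and the function names `contact_map` and `candidate_pairs` are hypothetical.

```python
import math

def contact_map(coords, threshold=7.0):
    """Build an N x N 0/1 matrix: C[i][j] = 1 if residues i and j are
    closer than `threshold` (in angstroms). The diagonal is left at 0."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1  # contact maps are symmetric
    return cmap

def candidate_pairs(n, min_separation=4):
    """Yield pairs (i, j), i < j, with sequence separation |i - j| >= min_separation."""
    for i in range(n):
        for j in range(i + min_separation, n):
            yield (i, j)

# Toy chain (made-up coordinates): six residues spaced 3 angstroms apart on a line.
toy_coords = [(3.0 * k, 0.0, 0.0) for k in range(6)]
toy_map = contact_map(toy_coords)
toy_pairs = list(candidate_pairs(len(toy_coords)))
```

In this toy chain, residues up to two positions apart are in contact (3 and 6 angstroms), and only three pairs satisfy the |i-j| >= 4 filter.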

Protein 2igd: 3D Structure

[Figure: 3D structure with labeled anti-parallel beta sheets, parallel beta sheets, and alpha helix]

Contact Map (2igd PDB)

[Figure: contact map with axes "Amino Acid Ai" and "Amino Acid Aj"; labeled regions: anti-parallel beta sheets, alpha helix, parallel beta sheets]

How much information in Amino Acids Alone: Classification Problem

- A pair of amino acids (Ai,Aj) is an instance
- The class: C (1) or NC (0), i.e., contact or non-contact
- Highly skewed class distribution
  - 1.7% C and 98.3% NC; 300K C vs 17.3M NC
- Features for each instance
  - Ai and Aj
  - Class: C or NC

Predicting Protein Contacts

Predict contacts for new sequence

     A  D  T  S  K  C  P
A       0  1  0  0  1  0
D          0  1  0  0  1
T             1  0  0  0
S                0  1  0
K                   0  1
C                      0
P

Ai  Aj  F1  Fn
A   T   ..  ..
A   C   ..  ..
D   S   ..  ..
D   P   ..  ..
T   S   ..  ..
S   C   ..  ..
K   P   ..  ..

Classification via Association Mining

- Association mining good for skewed data
- Mining: mine frequent itemsets in C data (Dc)
  - P(X | Dc) = Frequency(X | Dc) / |Dc|
- Counting: find P(X | Dnc)
- Pruning
  - Likelihood of a contact: r = P(X|Dc) / P(X|Dnc)
  - Prune pattern X if the ratio r of contact to non-contact probability is less than some threshold
  - i.e., keep only the patterns highly predictive of contacts
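The counting and pruning steps can be illustrated with a small sketch. It assumes the candidate patterns have already been mined, since the frequent-itemset miner itself is not shown in the slides. Each instance is represented as a set of feature items. The names `support_fraction` and `prune_patterns`, the item strings, and the handling of patterns never seen in Dnc (infinite ratio) are illustrative assumptions.

```python
def support_fraction(pattern, instances):
    """P(X | D): fraction of instances that contain every item of the pattern."""
    if not instances:
        return 0.0
    hits = sum(1 for inst in instances if pattern <= inst)
    return hits / len(instances)

def prune_patterns(patterns, contacts, non_contacts, min_ratio):
    """Keep patterns whose ratio r = P(X|Dc) / P(X|Dnc) is at least min_ratio."""
    kept = {}
    for pattern in patterns:
        p_c = support_fraction(pattern, contacts)
        p_nc = support_fraction(pattern, non_contacts)
        # Assumption: a pattern that never occurs in Dnc gets an infinite ratio.
        ratio = float("inf") if p_nc == 0 else p_c / p_nc
        if p_c > 0 and ratio >= min_ratio:
            kept[pattern] = ratio
    return kept

# Toy data (made up): each instance is the item set of one residue pair.
d_c = [frozenset({"Ai=C", "Aj=C"}), frozenset({"Ai=C", "Aj=V"}),
       frozenset({"Ai=L", "Aj=V"})]
d_nc = [frozenset({"Ai=K", "Aj=E"}), frozenset({"Ai=C", "Aj=E"}),
        frozenset({"Ai=L", "Aj=V"}), frozenset({"Ai=K", "Aj=V"})]
candidates = [frozenset({"Ai=C"}), frozenset({"Aj=V"}), frozenset({"Ai=K"})]
kept = prune_patterns(candidates, d_c, d_nc, min_ratio=2.0)
```

Here only the pattern {Ai=C} survives: it occurs in 2/3 of the contact instances but only 1/4 of the non-contact instances.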

Testing Phase

- 90-10 split into training and testing
- 2.4 million pairs, with 36K contacts (1.5%)
- Evidence calculation:
  - Find matching patterns P for each instance
  - Compute cumulative frequency in C and NC
    - Sc = Sum of frequency (X | Dc) where X in P
    - Snc = Sum of frequency (X | Dnc) where X in P
  - Compute evidence: ratio of Sc / Snc
- Prediction: sort instances on evidence
  - Predict top PR fraction as contacts
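A rough sketch of the evidence calculation and the top-PR prediction step follows. The pattern frequencies are made-up toy numbers. The function names and the treatment of instances with no non-contact support (infinite evidence) are assumptions, not details given in the slides.

```python
def evidence(instance, pattern_counts):
    """Evidence = Sc / Snc, summed over all patterns X contained in the instance.

    pattern_counts maps pattern -> (frequency in Dc, frequency in Dnc).
    """
    s_c = s_nc = 0
    for pattern, (freq_c, freq_nc) in pattern_counts.items():
        if pattern <= instance:  # pattern matches this instance
            s_c += freq_c
            s_nc += freq_nc
    if s_c == 0:
        return 0.0
    # Assumption: no non-contact support at all means unbounded evidence.
    return float("inf") if s_nc == 0 else s_c / s_nc

def predict_contacts(instances, pattern_counts, prediction_ratio):
    """Return indices of the top `prediction_ratio` fraction of instances by evidence."""
    scores = [evidence(inst, pattern_counts) for inst in instances]
    order = sorted(range(len(instances)), key=lambda k: scores[k], reverse=True)
    n_predicted = int(round(prediction_ratio * len(instances)))
    return order[:n_predicted]

# Toy pattern frequencies and test instances (illustrative values only).
toy_counts = {
    frozenset({"Ai=C"}): (40, 10),
    frozenset({"Aj=V"}): (30, 30),
    frozenset({"Aj=C"}): (20, 5),
}
toy_instances = [
    frozenset({"Ai=C", "Aj=C"}),
    frozenset({"Ai=K", "Aj=V"}),
    frozenset({"Ai=C", "Aj=V"}),
    frozenset({"Ai=K", "Aj=E"}),
]
predicted = predict_contacts(toy_instances, toy_counts, prediction_ratio=0.5)
```

With a prediction ratio of 0.5, the two instances with the highest evidence are labeled as contacts.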

Experiments

- 794 proteins from Protein Data Bank
- Distinct structures (< 25% similarity)
- Longest: 907, smallest: 35 amino acids
- 90-10 split for training-testing
- Total pairs: 20 million (> 2.5 GB)
- Contacts: 330 thousand (1.6%)
- Highly uneven class distribution

Evaluation Metrics

- Na: set of all pairs
- Na*: all pairs with positive evidence
- Ntc: true contacts in test data
- Ntc*: true contacts with positive evidence
- Npc: predicted contacts
- Ntpc: correctly predicted contacts
- Accuracy = Ntpc / Npc
- Coverage = Ntpc / Ntc
- Prediction Ratio (PR): Ntc*/Na*
- Random Predictor Accuracy: Ntc/Na
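As a small worked example, accuracy and coverage can be computed from sets of residue pairs. The function names are hypothetical and the pair sets are toy values. The random-predictor figure uses the test-set counts quoted on the Testing Phase slide (36K contacts in 2.4 million pairs).

```python
def accuracy_and_coverage(predicted, true_contacts):
    """Accuracy = Ntpc / Npc and Coverage = Ntpc / Ntc for sets of (i, j) pairs."""
    predicted = set(predicted)
    true_contacts = set(true_contacts)
    n_tpc = len(predicted & true_contacts)  # correctly predicted contacts
    accuracy = n_tpc / len(predicted) if predicted else 0.0
    coverage = n_tpc / len(true_contacts) if true_contacts else 0.0
    return accuracy, coverage

def random_predictor_accuracy(n_true_contacts, n_all_pairs):
    """Expected accuracy of guessing at random: Ntc / Na."""
    return n_true_contacts / n_all_pairs

# Toy prediction: 4 predicted pairs, 3 true contacts, 2 in common.
toy_predicted = {(0, 4), (1, 6), (2, 7), (3, 9)}
toy_true = {(0, 4), (2, 7), (5, 9)}
toy_accuracy, toy_coverage = accuracy_and_coverage(toy_predicted, toy_true)

# Baseline from the test-set counts quoted in the slides.
baseline = random_predictor_accuracy(36_000, 2_400_000)
```

At the crossover point reported in the results, the number of predicted contacts equals the number of true contacts, so accuracy and coverage coincide.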

Results (Amino Acids; All Lengths)

Crossover: 7% accuracy and 7% coverage; 2 times over Random

Results (Amino Acids; by length)

- 1-100: 12% accuracy (A) and coverage (C)
- 100-170: 6% A and C
- 170-300: 4.5% A and C
- 300+: 2% A and C

Using HMMs based on Local Motifs to Improve Classification

An HMM for Local Predictions

- HMMSTR (Chris Bystroff, Biology, RPI)
- Build a library of short sequences that tend to fold uniquely across protein families: the I-Sites Library
- Treat each motif as a Markov chain
- Merge the motifs into a global HMM for local structure prediction

Training the HMM

- Build I-sites Library
  - Short sequence motifs (3 to 19)
  - Exhaustive clustering of sequences
  - Non-redundant PDB dataset (< 25% similarity)
- Build an HMM
  - Each of 262 motifs is a chain of Markov states
  - Each state has sequence and structure for one position
  - Merge I-sites motifs hierarchically to get one global HMM for all the motifs

HMM Output

- Total of 282 states in the HMM
- Each state produces or “emits”:
  - Amino acid profile (20 probability values)
  - Secondary structure (D) (helix, strand or loop)
  - Backbone angles (R) (11 dihedral angle symbols)
  - Finer structural context (C) (10 context symbols)

I-Sites Motifs (Initiation Sites)

[Figure panels: Beta Hairpin Beta to Alpha Helix C-Cap]

Data Format and Preparation

- Take the 794 PDB proteins
- Compute optimal alignment to HMM
  - Find best state sequence for the observed acids
  - Output probability distribution of a residue over all the 282 HMM states
- Integrate the 3 datasets
  - Alignment probability distribution (Nx282)
  - Amino acid and context information (D, R, C)
  - Contact map (NxN)

HMMSTR Output (per Protein)

PDB Name: 153l_
Sequence Length: 185

Position: 1
Residue: R
Coordinates: 0.0, -73.2, 177.6
AA Profile (20 values): 0.0 1.0 0.0
HMMSTR State Probabilities (282 values): 0.0 0.7 0.3 0.0
Distances (185 values): 0 3 5 18 15 13
...
Position: 185
Residue: Y
Coordinates: -88.7, 0.0, 0.0
AA Profile (20 values): 0.0 0.4 0.6 0.0
HMMSTR State Probabilities (282 values): 0.0 0.2 0.5 0.3 0.0
Distances (185 values): 15 13 10 5 3 0

Adding Features from HMMSTR

- The class: C (1) or NC (0)
  - Highly skewed class distribution
  - Approx 1.5% C and 98.5% NC
- Features for each instance
  - Ai Aj Di Dj Ri Rj Ci Cj
  - Profile: pi1 pi2 … pi20 pj1 pj2 … pj20
  - HMM States: qi1 qi2 .. qi282 qj1 qj2 .. qj282
  - Class: C or NC

HMM and AA + (R,D,C); All Lengths

Left Crossover: 19% accuracy and coverage; 5.3 times over Random

Right Crossover (+RDC): 17% accuracy and coverage; 5 times over Random

HMM + AA + R,D,C (by length)

- 1-100: 30% accuracy (A) and coverage (C)
- 100-170: 17% A and C
- 170-300: 10% A and C
- 300+: 6% A and C

Predicted Contact Map (2igd)

Summary of Classification Results

- Challenging prediction problem
  - In essence, we have to predict a contact matrix for a new protein
- Hybrid HMM/Associations approach
- Best results to date: 19% overall accuracy/coverage, 30% for short proteins
  - 14.4% accuracy (Fariselli, Casadio '99; NN)
  - 13% accuracy (Thomas et al. '96)
  - Short proteins: 26% (Olmea, Valencia '97)

Mining “Physical” Dense Frequent Patterns (non-local motifs)

Characterizing Physical, Protein-like Contact Maps

- A very small subset of all contact maps code for physically possible proteins (self-avoiding, globular chains)
- A contact map must:
  - Satisfy geometric constraints
  - Represent low-energy structure
- What are the typical non-local interactions?
  - Frequent dense 0/1 submatrices in contact maps
  - 3-step approach: 1) data generation, 2) dense pattern mining, and 3) mapping to structure space

Dense Pattern Mining

- 12,524 protein-like 60 residue structures
  - Use HMMSTR to generate protein-like sequences
  - Use ROSETTA to generate their structures
    - Monte Carlo fragment insertion (from I-sites library)
    - Up to 5 possible low-energy structures retained
- Frequent 2D Pattern Mining
  - Use WxW sliding window; W window size
  - Measure density under each window
  - (N-W)^2 / 2 possible windows per N length protein
  - Look for “minimum density”; scale away from diag
  - Try different window sizes
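The sliding-window scan can be sketched as below. This is an illustrative simplification: it uses a fixed minimum density (the slide's scaling of the threshold with distance from the diagonal is not modeled), and it assumes only windows whose top-left corner lies to the right of the diagonal are scanned. The function names are hypothetical.

```python
def window_density(cmap, top, left, w):
    """Number of contacts (1s) in the w x w window with top-left cell (top, left)."""
    return sum(cmap[r][c]
               for r in range(top, top + w)
               for c in range(left, left + w))

def dense_windows(cmap, w, min_density):
    """Yield (top, left, submatrix) for each scanned window with enough contacts."""
    n = len(cmap)
    for top in range(n - w + 1):
        # Assumption: scan only windows starting to the right of the diagonal.
        for left in range(top + 1, n - w + 1):
            if window_density(cmap, top, left, w) >= min_density:
                sub = tuple(tuple(cmap[r][left:left + w])
                            for r in range(top, top + w))
                yield top, left, sub

# Toy 6 x 6 symmetric contact map with three contacts: (0,3), (0,4), (1,3).
toy_cmap = [[0] * 6 for _ in range(6)]
for i, j in [(0, 3), (0, 4), (1, 3)]:
    toy_cmap[i][j] = toy_cmap[j][i] = 1

toy_dense = list(dense_windows(toy_cmap, w=2, min_density=2))
```

Each dense submatrix returned here would then be counted across all proteins to find the frequent patterns.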

Counting Dense Patterns

- Naïve approach: for W=5, N=60 there are 1485 windows per protein; total 15 million possible windows for 12,524 proteins
- Test if two submatrices are equal
  - Linear search: O(P x W^2) with P current dense patterns
  - Hash based: O(W^2)
- Our approach: 2-level hashing, O(W) time

Pattern (WxW Submatrix) Encoding

Encode submatrix as string (W integers)

Submatrix   Integer Value
00000       0
01100       12
01000       8
01000       8
00000       0

Concatenated String: 0.12.8.8.0
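The encoding can be reproduced with a short sketch: each row is read as a binary number (leftmost bit most significant, which matches the values shown), and the row integers are joined with dots. The function names are hypothetical.

```python
def encode_submatrix(sub):
    """Encode each 0/1 row as an integer, reading the row as a binary number."""
    return [int("".join(str(bit) for bit in row), 2) for row in sub]

def string_id(sub):
    """Join the W row integers with dots to form the pattern's string ID."""
    return ".".join(str(v) for v in encode_submatrix(sub))

# The 5 x 5 submatrix from the slide.
slide_pattern = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
slide_values = encode_submatrix(slide_pattern)
slide_id = string_id(slide_pattern)
```

For the slide's submatrix this yields the row integers 0, 12, 8, 8, 0 and the string `0.12.8.8.0`.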

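Two-level Hashing

String ID: $\mathrm{StringID}(M) = v_1 . v_2 . \cdots . v_W$, where $v_i$ is the integer value of row $i$ of submatrix $M$

Level 1 (approximate): $h_1(M) = \sum_{i=1}^{W} v_i$

Level 2 (exact): $h_2(M) = \mathrm{StringID}(M)$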
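One way to picture the two hash levels is a dictionary of dictionaries: the first level buckets patterns by the sum of their row integers, and the second level distinguishes patterns within a bucket by their exact string ID. This is a minimal sketch under that reading of the slide. The class name `TwoLevelPatternCounter` and its methods are hypothetical, and it does not reproduce the original implementation or its stated O(W) cost.

```python
from collections import defaultdict

class TwoLevelPatternCounter:
    """Count submatrix patterns using an approximate and an exact hash level."""

    def __init__(self):
        # Level 1 key: sum of row integers. Level 2 key: dotted string ID.
        self.buckets = defaultdict(dict)

    @staticmethod
    def _keys(row_values):
        h1 = sum(row_values)                       # approximate: collisions possible
        h2 = ".".join(str(v) for v in row_values)  # exact: unique per pattern
        return h1, h2

    def add(self, row_values):
        """Record one occurrence of the pattern given by its W row integers."""
        h1, h2 = self._keys(row_values)
        bucket = self.buckets[h1]
        bucket[h2] = bucket.get(h2, 0) + 1

    def support(self, row_values):
        """Number of occurrences recorded for this exact pattern."""
        h1, h2 = self._keys(row_values)
        return self.buckets.get(h1, {}).get(h2, 0)

counter = TwoLevelPatternCounter()
counter.add([0, 12, 8, 8, 0])
counter.add([0, 12, 8, 8, 0])
counter.add([8, 12, 8, 0, 0])   # same level-1 sum (28), different string ID
counter.add([0, 0, 0, 0, 1])
```

Two patterns with the same row sum share a level-1 bucket, and the level-2 string ID keeps their counts separate.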

Binding Patterns to Proteins Sequence and Structure

Using window size, W=5
StringID: 0.12.8.8.0, Support = 170

00000
01100
01000
01000
00000

Occurrences:

pdb-name  (X,Y)  X_sequence  Y_sequence  Interaction
1070.0    52,30  ILLKN       TFVRI       alpha::beta
1145.0    51,13  VFALH       GFHIA       alpha::strand
1251.2    42,6   EVCLR       GSKFG       alpha::strand
1312.0    54,11  HGYDE       ATFAK       alpha::beta
1732.0    49,6   HRFAK       KELAG       alpha::beta
2895.0    49,7   SRCLD       DTIYY       alpha::beta
...

Frequent Dense Local Patterns

Submatrix:
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1
0 0 0 0 0 0 1 1
0 0 0 0 0 1 1 1
0 0 0 0 1 1 1 0
Frequency: 2.0%
Physical phenomenon: antiparallel beta sheet secondary structure

Submatrix:
0 1 1 1 0 0 0 0
1 1 1 0 0 0 0 0
1 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
Frequency: 2.2%
Physical phenomenon: antiparallel beta sheet secondary structure

Submatrix:
1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0
0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1
Frequency: 1.9%
Physical phenomenon: parallel beta sheet secondary structure

Frequent Dense Non-Local Patterns

[Figure panels: Alpha – Alpha; Alpha – Beta Sheet]

Frequent Dense Non-Local Patterns

[Figure panels: Alpha – Beta Turn; Beta Sheet – Beta Turn]

Future Directions

Mining Physicality Rules

- Comprehensive list of non-local motifs
  - I-sites library catalogs local motifs
- Mining heuristic rules for “physicality”
  - Based on simple geometric constraints
  - Rules governing contacts and non-contacts
    - Parallel Beta Sheets: If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0
    - Anti-parallel Beta Sheets: If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0
    - Alpha Helices: If C(i,i+4) = 1, C(i,j) = 1, and C(i+4,j) = 1, then C(i+2,j) = 0
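The three rules can be turned into a simple checker that lists where a candidate contact map violates them. This is a sketch under assumptions the slides do not state: out-of-range indices are treated as non-contacts, the beta-sheet rules are scanned only for i < j (the map is symmetric), and the helper names are hypothetical.

```python
def empty_map(n):
    """Return an n x n contact map with no contacts."""
    return [[0] * n for _ in range(n)]

def set_contact(cmap, i, j):
    """Mark residues i and j as in contact (symmetric)."""
    cmap[i][j] = cmap[j][i] = 1

def rule_violations(cmap):
    """Return (rule_name, i, j) for every position where a rule is violated."""
    n = len(cmap)

    def c(a, b):
        # Assumption: indices outside the map count as non-contacts.
        return cmap[a][b] if 0 <= a < n and 0 <= b < n else 0

    violations = []
    for i in range(n):
        for j in range(i + 1, n):
            # Parallel: C(i,j) and C(i+2,j+2) forbid C(i,j+2) and C(i+2,j).
            if c(i, j) and c(i + 2, j + 2) and (c(i, j + 2) or c(i + 2, j)):
                violations.append(("parallel", i, j))
            # Anti-parallel: C(i,j+2) and C(i+2,j) forbid C(i,j) and C(i+2,j+2).
            if c(i, j + 2) and c(i + 2, j) and (c(i, j) or c(i + 2, j + 2)):
                violations.append(("anti-parallel", i, j))
        for j in range(n):
            # Alpha helix: C(i,i+4), C(i,j), C(i+4,j) forbid C(i+2,j).
            if c(i, i + 4) and c(i, j) and c(i + 4, j) and c(i + 2, j):
                violations.append(("alpha-helix", i, j))
    return violations

# Toy map: a parallel ladder (0,6), (2,8) plus a forbidden cross contact (0,8).
toy = empty_map(10)
for pair in [(0, 6), (2, 8), (0, 8)]:
    set_contact(toy, *pair)
toy_violations = rule_violations(toy)
```

A checker like this could be used to filter or repair predicted contact maps, although the slides list rule mining only as a future direction.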

Heuristic Rules of Physicality

[Figure: two diagrams labeled with residues i, i+2, j, and j+2]

Anti-parallel Beta Sheets: If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0

Parallel Beta Sheets: If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0

Protein Folding Pathways

- Rules for pathways in contact map space
  - Pathway is time-ordered sequence of contacts
  - Condensation rule: new contact within Smax
    - U(i,j) <= Smax; U(i,j) is unfolded residues from i to j
- Pathway prediction is complementary to structure prediction

Contact Map Folding Pathways