54
Data Mining for Protein Structure Prediction Mohammed J. Zaki SPIDER Data Mining Project: Scalable, Parallel and Interactive Data Mining and Exploration at RPI http://www.cs.rpi.edu/~zaki

Presentation (PowerPoint File) - UCLA | Institute For Pure and

  • Upload
    tommy96

  • View
    418

  • Download
    1

Embed Size (px)

Citation preview

Page 1: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Data Mining for Protein Structure Prediction

Mohammed J. Zaki

SPIDER Data Mining Project: Scalable, Parallel and Interactive

Data Mining and Exploration at RPI

http://www.cs.rpi.edu/~zaki

Page 2: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Outline of the Talk

How do proteins form? Protein folding problem Contact map mining Using HMMs based on local motifs Mining “physical” dense frequent patterns

(non-local motifs) Future directions

Heuristic rules Folding pathways

Page 3: Presentation (PowerPoint File) - UCLA | Institute For Pure and

How do Proteins Form?

Page 4: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Building Blocks of Biological Systems DNA (nucleotides, 4 types): information carrier/encoder RNA: bridge from DNA to protein Protein (amino acids, 20 types): action molecules.

Processes Replication of DNA Transcription of gene (DNA) to messenger RNA (mRNA)

Splicing of non-coding regions of the genes (introns) Translation of mRNA into proteins Folding of proteins into 3D structure Biochemical or structural functions of proteins

How do Proteins Form?

Page 5: Presentation (PowerPoint File) - UCLA | Institute For Pure and
Page 6: Presentation (PowerPoint File) - UCLA | Institute For Pure and
Page 7: Presentation (PowerPoint File) - UCLA | Institute For Pure and
Page 8: Presentation (PowerPoint File) - UCLA | Institute For Pure and
Page 9: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Protein Folding Problem

Page 10: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Protein Structures

Primary structure Un-branched polymer 20 side chains (residues or amino acids)

Higher order structures Secondary: local (consecutive) in sequence Tertiary: 3D fold of one polypeptide chain Quaternary: Chains packing together

Page 11: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Amino Acid

Page 12: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Polypeptide Chain

Page 13: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Torsion Angles

Page 14: Presentation (PowerPoint File) - UCLA | Institute For Pure and

The Protein Folding Problem

Page 15: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Contact Map Mining

Page 16: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Contact Map

Amino acids Ai and Aj are in contact if their 3D distance is less than threshold (7å)

Sequence separation is given as |i-j| Contact map C is an N x N matrix, where

C(i,j) = 1 if Ai and Aj are in contact C(i,j) = 0 otherwise

Consider all pairs with |i-j| >= 4

Page 17: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Protein 2igd: 3D Structure

Anti-parallel Beta Sheets

Parallel Beta Sheets

Alpha Helix

Page 18: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Contact Map (2igd PDB)Anti-parallel Beta Sheets

Alpha Helix

Parallel Beta Sheets

Amino Acid Ai

Am

ino

Aci

d A

j

Page 19: Presentation (PowerPoint File) - UCLA | Institute For Pure and

How much information in Amino Acids Alone: Classification Problem

A pair of amino acids (Ai,Aj) is an instance The class: C (1) or NC (0), i.e., contact or

non-contact Highly skewed class distribution

1.7% C and 98.3% NC; 300K C vs 17,3M NC Features for each instance

Ai and Aj Class: C or NC

Page 20: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Predicting Protein Contacts

Predict contacts for new sequence

A D T S K C PA 0 1 0 0 1 0D 0 1 0 0 1T 1 0 0 0S 0 1 0K 0 1C 0P

Ai Aj F1 FnA T .. ..A C .. ..D S .. ..D P .. ..T S .. ..S C .. ..K P .. ..

Page 21: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Classification via Association Mining Association mining good for skewed data Mining: Mine frequent itemsets in C data (Dc)

P(X | Dc) = Frequency(X | Dc) / |Dc| Counting: find P(X | Dnc) Pruning

Likelihood of a contact: r = P(X|Dc) / P(X|Dnc) Prune pattern X if ratio r of contact to non-contact

probability is less then some threshold i.e., keep only the patterns highly predictive of

contacts

Page 22: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Testing Phase 90-10 split into training and testing 2.4 million pairs, with 36K contacts (1.5%) Evidence calculation:

Find matching patterns P for each instance Compute cumulative frequency in C and NC

Sc = Sum of frequency (X | Dc) where X in P Snc = Sum of frequency (X | Dnc) where X in P

Compute evidence: ratio of Sc / Snc Prediction: Sort instances on evidence

Predict top PR fraction as contacts

Page 23: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Experiments

794 Proteins from Protein Data Bank Distinct structures (< 25% similarity) Longest: 907, Smallest: 35 amino acids 90-10 split for training-testing Total pairs: 20 million (> 2.5 GB) Contacts: 330 thousand (1.6%) Highly uneven class distribution

Page 24: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Evaluation Metrics Na: set of all pairs Na*: all pairs with positive evidence Ntc: true contacts in test data Ntc*: true contacts with positive evidence Npc: predicted contacts Ntpc: correctly predicted contacts Accuracy = Ntpc / Npc Coverage = Ntpc / Ntc Prediction Ratio (PR): Ntc*/Na* Random Predictor Accuracy: Ntc/Na

Page 25: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Results (Amino Acids; All Lengths)

Crossover: 7% accuracy and 7% coverage; 2 times over Random

Page 26: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Results (Amino Acids; by length)

1-100: 12% accuracy(A) and coverage (C); 100-170: 6% A and C

170-300: 4.5% A and C; 300+: 2% A and C

Page 27: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Using HMMs based on Local Motifs to Improve Classification

Page 28: Presentation (PowerPoint File) - UCLA | Institute For Pure and

An HMM for Local Predictions HMMSTR (Chris Bystroff, Biology, RPI) Build a library of short sequences that tend to

fold uniquely across protein families: the I-Sites Library

Treat each motif as a Markov chain Merge the motifs into a global HMM for local

structure prediction

Page 29: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Training the HMM Build I-sites Library

Short sequence motifs (3 to 19) Exhaustive clustering of sequences Non-redundant PDB dataset (< 25% similarity)

Build an HMM Each of 262 motifs is a chain of Markov states Each state has sequence and structure for one

position Merge I-sites motifs hierarchically to get one global

HMM for all the motifs

Page 30: Presentation (PowerPoint File) - UCLA | Institute For Pure and

HMM Output

Total of 282 States in the HMM Each state produces or “emits”:

Amino acid profile (20 probability values) Secondary structure (D) (helix, strand or loop) Backbone angles (R) (11 dihedral angle symbols) Finer structural context (C) (10 context symbols)

Page 31: Presentation (PowerPoint File) - UCLA | Institute For Pure and

I-Sites Motifs (Initiation Sites)Beta Hairpin Beta to Alpha Helix C-Cap

Page 32: Presentation (PowerPoint File) - UCLA | Institute For Pure and
Page 33: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Data Format and Preparation Take the 794 PDB proteins Compute optimal alignment to HMM

Find best state sequence for the observed acids Output probability distribution of a residue over all

the 282 HMM states Integrate the 3 datasets

Alignment probability distribution (Nx282) Amino acid and context information (D, R, C) Contact map (NxN)

Page 34: Presentation (PowerPoint File) - UCLA | Institute For Pure and

HMMSTR Output (per Protein)PDB Name: 153l_

Sequence Length: 185

Position: 1Residue: RCoordinates: 0.0, -73.2, 177.6AA Profile (20 values): 0.0 1.0 0.0HMMSTR State Probabilities (282 values):0.0 0.7 0.3 0.0Distances (185 values): 0 3 5 18 15 13...Position: 185Residue: YCoordinates: -88.7 , 0.0, 0.0AA Profile (20 values): 0.0 0.4 0.6 0.0HMMSTR State Probabilities (282 values):0.0 0.2 0.5 0.3 0.0Distances (185 values): 15 13 10 5 3 0

Page 35: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Adding features from HMMSTR The class: C (1) or NC (0)

Highly skewed class distribution Approx 1.5% C and 98.5% NC

Features for each instance Ai Aj Di Dj Ri Rj Ci Cj Profile: pi1 pi2 … pi20 pj1 pj2 … pj20 HMM States: qi1 qi2 .. qi282 qj1 qj2 .. qj282 Class: C or NC

Page 36: Presentation (PowerPoint File) - UCLA | Institute For Pure and

HMM and AA + (R,D,C) ; All Lengths

Left Crossover: 19% accuracy and coverage; 5.3 times over Random

Right Crossover (+RDC): 17% accuracy and coverage; 5 times over Random

Page 37: Presentation (PowerPoint File) - UCLA | Institute For Pure and

HMM + AA + R,D,C (by length)

1-100: 30% accuracy(A) and coverage (C); 100-170: 17% A and C

170-300: 10% A and C; 300+: 6% A and C

Page 38: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Predicted Contact Map (2igd)

Page 39: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Summary of Classification Results Challenging prediction problem In essence, we have to predict a contact

matrix for a new protein Hybrid HMM/Associations approach Best results to-date: 19% overall

accuracy/coverage, 30% for short proteins 14.4% Accuracy (Fariselli, Casadio ‘99; NN) 13% Accuracy (Thomas et al ‘96) Short proteins: 26% (Olmea, Valencia, ‘97)

Page 40: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Mining “Physical” Dense Frequent Patterns (non-local motifs)

Page 41: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Characterizing Physical, Protein-like Contact Maps A very small subset of all contact maps code

for physically possible proteins (self-avoiding, globular chains)

A contact map must: Satisfy geometric constraints Represent low-energy structure

What are the typical non-local interactions? Frequent dense 0/1 submatrices in contact maps 3-step approach: 1) data generation, 2) dense

pattern mining, and 3) mapping to structure space

Page 42: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Dense Pattern Mining 12,524 protein-like 60 residue structures

Use HMMSTR to generate protein-like sequences Use ROSETTA to generate their structures

Monte Carlo fragment insertion (from I-sites library) Up to 5 possible low-energy structures retained

Frequent 2D Pattern Mining Use WxW sliding window; W window size Measure density under each window (N-W)^2 / 2 possible windows per N length protein Look for “minimum density”; scale away from diag Try different window sizes

Page 43: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Counting Dense Patterns

Naïve Approach:for W=5, N=60 there are 1485 windows per protein. Total 15 Million possible windows for 12,524 proteins Test if two submatrices are equal

Linear search: O(P x W^2) with P current dense patterns Hash based: O(W^2)

Our Approach: 2-level Hashing O(W) time

Page 44: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Pattern (WxW Submatrix) Encoding Encode submatrix as string (W integers)

Submatrix Integer Value

00000 0

01100 12

01000 8

01000 8

00000 0

Concatenated String: 0.12.8.8.0

Page 45: Presentation (PowerPoint File) - UCLA | Institute For Pure and

String ID (M) =

Level 1 (approximate):

Level2 (exact) : h2 (M) = StringID (M)

Two-level Hashing

W

iivMh

1)(1

Wvvv ...... 21

Page 46: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Binding Patterns to Proteins Sequence and Structure Using window size, W=5 StringID:0.12.8.8.0, Support = 170

0000001100010000100000000

Occurrences:pdb-name (X,Y) X_sequence Y_sequence Interaction1070.0 52,30 ILLKN TFVRI alpha::beta1145.0 51,13 VFALH GFHIA alpha::strand1251.2 42,6 EVCLR GSKFG alpha::strand1312.0 54,11 HGYDE ATFAK alpha::beta1732.0 49,6 HRFAK KELAG alpha::beta2895.0 49,7 SRCLD DTIYY alpha::beta...

Page 47: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Frequent Dense Local PatternsSubmatrix 0 0 0 0 0 0

0 00 0 0 0 0 0

0 00 0 0 0 0 0

0 00 0 0 0 0 0

0 00 0 0 0 0 0

0 10 0 0 0 0 0

1 10 0 0 0 0 1

1 10 0 0 0 1 1

1 0

0 1 1 1 0 0 0 0

1 1 1 0 0 0 0 0

1 1 0 0 0 0 0 0

1 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

1 0 0 0 0 0 0 0

0 1 0 0 0 0 0 0

0 0 1 0 0 0 0 0

0 0 0 1 0 0 0 0

0 0 0 0 1 0 0 0

0 0 0 0 0 1 0 0

0 0 0 0 0 0 1 0

0 0 0 0 0 0 0 1

Frequency 2.0% 2.2% 1.9%

Physical Phenomenon

antiparallel beta sheet secondary structure

antiparallel beta sheet secondary structure

parallel beta sheet secondary structure

Page 48: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Frequent Dense Non-Local Patterns

Alpha – Alpha Alpha – Beta Sheet

Page 49: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Frequent Dense Non-Local Patterns

Alpha – Beta Turn Beta Sheet – Beta Turn

Page 50: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Future Directions

Page 51: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Mining Physicality Rules

Comprehensive list of non-local motifs I-sites library catalogs local motifs

Mining heuristic rules for “physicality” Based on simple geometric constraints Rules governing contacts and non-contacts

Parallel Beta Sheets: If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0

Anti-parallel Beta Sheets: If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0

Alpha Helices: If C(i,i+4) = 1, C(i,j) = 1, and C(i+4,j) = 1, then C(i+2,j) = 0

Page 52: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Heuristic Rules of Physicality

i

i+2 j

j+2

If C(i,j+2) = 1 and C(i+2,j) = 1, then C(i,j) = 0 and C(i+2,j+2) = 0

If C(i,j) = 1 and C(i+2,j+2) = 1, then C(i,j+2) = 0 and C(i+2,j) = 0

Parallel Beta SheetsAnti-parallel Beta Sheets

i

i+2

j

j+2

Page 53: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Protein Folding Pathways

Rules for Pathways in Contact Map Space Pathway is time-ordered sequence of contacts Condensation rule: New contact within Smax

U(i,j) <= Smax; U(i,j) is unfolded residues from i to j Pathway prediction is complementary to structure

prediction

Page 54: Presentation (PowerPoint File) - UCLA | Institute For Pure and

Contact Map Folding Pathways