64
Techniques for Improved Probabilistic Inference in Protein-Structure Determination via X-Ray Crystallography Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

  • Upload
    tegan

  • View
    44

  • Download
    0

Embed Size (px)

DESCRIPTION

Techniques for Improved Probabilistic Inference in Protein-Structure Determination via X-Ray Crystallography. Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011. Protein-Structure Determination. Proteins essential to cellular function Structural support - PowerPoint PPT Presentation

Citation preview

Page 1: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Techniques for Improved Probabilistic Inference

in Protein-Structure Determination

via X-Ray CrystallographyAmeet Soni

Department of Computer SciencesDoctoral DefenseAugust 10, 2011

Page 2: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Protein-Structure Determination

2

Proteins essential to cellular function Structural support Catalysis/enzymatic activity Cell signaling

Protein structures determine function

X-ray crystallography main technique for determining structures

X-ray; 88.1%

NMR; 11.3%

Other; 0.6%

Page 3: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Sequences vs Structure Growth

3

Page 4: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Task Overview4

Given A protein sequence Electron-density map

(EDM) of protein

Do Automatically produce a

protein structure that Contains all atoms Is physically feasible

SAVRVGLAIM...

Page 5: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

5

Using biochemical domain knowledge and enhanced algorithms for probabilistic inference will produce more accurate and more complete protein structures.

Thesis Statement

Page 6: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Challenges & Related Work6

1 Å 2 Å 3 Å 4 Å

Our Method: ACMI

ARP/wARPTEXTAL & RESOLVE

Resolution is a

property of the protein

Higher Resolution : Better Image Quality

Page 7: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Outline7

Background and Motivation ACMI Roadmap and My Contributions Inference in ACMI Guided Belief Propagation Probabilistic Ensembles in ACMI (PEA) Conclusions and Future Directions

Page 8: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Outline8

Background and Motivation ACMI Roadmap and My Contributions Inference in ACMI Guided Belief Propagation Probabilistic Ensembles in ACMI (PEA) Conclusions and Future Directions

Page 9: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

ACMI Roadmap(Automated Crystallographic Map Interpretation)9

Perform Local Match Apply Global Constraints Sample Structure

Phase 1 Phase 2 Phase 3

prior probability of

each AA’s location

posterior probabilityof each AA’s location

all-atom protein structures

bk

bk-1

bk+1*1…M

Page 10: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Analogy: Face Detection10

Phase 1 Find Nose Find Eyes Find Mouth

Phase 2 Combine and Apply Constraints

Phase 3 Infer Face

Page 11: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Phase 1: Local Match Scores11

General CS area: 3D shape matching/object recognition

Given: EDM, sequenceDo: For each amino acid in the sequence, score its match to every location in the EDM

My Contributions Spherical-harmonic decompositions for local match

[DiMaio, Soni, Phillips, and Shavlik, BIBM 2007] {Ch. 7} Filtering methods using machine learning

[DiMaio, Soni, Phillips, and Shavlik, IJDMB 2009] {Ch. 7} Structural homology using electron density [Ibid.] {Ch.

7}

Page 12: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Phase 2: Apply Global Constraints

12

General CS area: Approximate probabilistic inference

Given: Sequence, Phase 1 scores, constraintsDo: Posterior probability for each amino acid’s

3D location given all evidence

My Contributions Guided belief propagation using domain knowledge

[Soni, Bingman, and Shavlik, ACM BCB 2010] {Ch. 5} Residual belief propagation in ACMI [Ibid.] {Ch. 5} Probabilistic ensembles for improved inference

[Soni and Shavlik, ACM BCB 2011] {Ch. 6}

Page 13: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Phase 3: Sample Protein Structure

13

General CS area: Statistical sampling

Given: Sequence, EDM, Phase 2 posteriorsDo: Sample all-atom protein structure(s)

My Contributions Sample protein structures using particle filters

[DiMaio, Kondrashov, Bitto, Soni, Bingman, Phillips, Shavlik, Bioinformatics 2007] {Ch. 8}

Informed sampling using domain knowledge [Unpublished elsewhere] {Ch. 8}

Aggregation of probabilistic ensembles in sampling[Ibid. ACM BCB 2011] {Ch. 6}

Page 14: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Comparison to Related Work[DiMaio, Kondrashov, Bitto, Soni, Bingman, Phillips, and Shavlik, Bioinformatics 2007]

14

[Ch. 8 of dissertation]

Page 15: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Outline15

Background and Motivation ACMI Roadmap and My Contributions Inference in ACMI Guided Belief Propagation Probabilistic Ensembles in ACMI (PEA) Conclusions and Future Directions

Page 16: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

ACMI Roadmap16

Perform Local Match Apply Global Constraints Sample Structure

Phase 1 Phase 2 Phase 3

prior probability of

each AA’s location

posterior probabilityof each AA’s location

all-atom protein structures

bk

bk-1

bk+1*1…M

Page 17: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Phase 2 – Probabilistic Model

17

ACMI models the probability of all possible traces using a pairwise Markov Random Field (MRF)

LEU4 SER5GLY2 LYS3ALA1

Page 18: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Size of Probabilistic Model18

# nodes: ~1,000# edges:

~1,000,000

Page 19: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Approximate Inference19

Best structure intractable to calculateie, we cannot infer the underlying structure analytically

Phase 2 uses Loopy Belief Propagation (BP) to approximate solution Local, message-passing scheme Distributes evidence among nodes

Convergence not guaranteed

Page 20: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Example: Belief Propagation20

LYS31 LEU32

mLYS31→LEU32

pLEU32pLYS31

Page 21: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Example: Belief Propagation21

LYS31 LEU32

mLEU32→LEU31

pLEU32pLYS31

Page 22: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Shortcomings of Phase 222

Inference is very difficult ~106 possible locations for each amino acid ~100-1000s of amino acids in one protein Evidence is noisy O(N2) constraints

Solutions are approximate,room for improvement

Page 23: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Outline23

Background and Motivation ACMI Roadmap and My Contributions Inference in ACMI Guided Belief Propagation Probabilistic Ensembles in ACMI (PEA) Conclusions and Future Directions

Page 24: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Best case: wasted resources Worst case: poor information is excessive influence

Message Scheduling [ACM-BCB 2010]{Ch. 5}

24

SERLYSALA

Key design choice: message-passing schedule When BP is approximate, ordering affects

solution[Elidan et al, 2006]

Phase 2 uses a naïve, round-robin schedule

Page 25: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Using Domain Knowledge25

Biochemist insight: well-structured regions of protein correlate with strong features in density map eg, helices/strands have stable conformations

Disordered regions are more difficult to detect

General idea: prioritize what order messages are sent using expert knowledge eg, disordered amino acids receive less priority

Page 26: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Guided Belief Propagation26

Page 27: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Related Work27

Assumption: messages with largest change in value are more useful

Residual Belief Propagation [Elidan et al, UAI 2006] Calculates residual factor for each node

Each iteration, highest-residual node passes messages

General BP technique

Page 28: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Experimental Methodology28

Our previous technique: naive, round robin (ORIG)

My new technique: Guidance using disorder prediction (GUIDED) Disorder prediction using DisEMBL [Linding et

al, 2003] Prioritize residues with high stability (ie, low

disorder)

Residual factor (RESID) [Elidan et al, 2006]

Page 29: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Experimental Methodology29

Run whole ACMI pipeline Phase 1: Local amino-acid finder (prior

probabilities) Phase 2: Either ORIG, GUIDED, RESID Phase 3: Sample all-atom structures from

Phase 2 results

Test set of 10 poor-resolution electron-density maps From UW Center for Eukaryotic Structural

Genomics Deemed the most difficult of a large set of

proteins

Page 30: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Phase 2 Accuracy: Percentile Rank30

x P(x)

A 0.10

B 0.30

C 0.35

D 0.20

E 0.05

Truth 100%60%Truth

Page 31: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Phase 2 Marginal Accuracy31

Page 32: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Protein-Structure Results Do these better marginals produce more

accurate protein structures?

RESID fails to produce structures in Phase 3 Marginals are high in entropy (28.48 vs 5.31) Insufficient sampling of correct locations

32

Page 33: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Phase 3 Accuracy:Correctness and Completeness

33

Correctness akin to precision – percent of predicted structure that is accurate

Completeness akin to recall – percent of true structure predicted accurately

Truth Model A Model B

Page 34: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Protein-Structure Results34

Page 35: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Outline35

Background and Motivation ACMI Roadmap and My Contributions Inference in ACMI Guided Belief Propagation Probabilistic Ensembles in ACMI (PEA) Conclusions and Future Directions

Page 36: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Ensembles: the use of multiple models to improve predictive performance

Tend to outperform best single model [Dietterich ‘00] eg, 2010 Netflix prize

Ensemble Methods [ACM-BCB 2011]{Ch. 6}

36

Page 37: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Phase 2: Standard ACMI37

Protocol

MRF

P(bk)

message-scheduler: how ACMI sends messages

Page 38: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Phase 2: Ensemble ACMI38

Protocol 1

MRF

Protocol 2

Protocol C

P1(bk)

P2(bk)

PC(bk)

Page 39: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Probabilistic Ensembles in ACMI (PEA)39

New ensemble framework (PEA) Run inference multiple times, under

different conditions Output: multiple, diverse, estimates of each

amino acid’s location

Phase 2 now has several probability distributions for each amino acid, so what? Need to aggregate distributions in Phase 3

Page 40: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

ACMI Roadmap40

Perform Local Match Apply Global Constraints Sample Structure

Phase 1 Phase 2 Phase 3bk

bk-1

bk+1*1…M

prior probability of

each AA’s location

posterior probabilityof each AA’s location

all-atom protein structures

Page 41: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Place next backbone atom

Backbone Step (Prior Work)41

(1) Sample bk from empirical Ca- Ca- Ca pseudoangle distribution

bk-1b'k

bk-2

????

?

Page 42: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Place next backbone atom

Backbone Step (Prior Work)42

0.25…

bk-1

bk-2

(2) Weight each sample by its Phase 2 computed marginal

b'k0.20

0.15

Page 43: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Place next backbone atom

Backbone Step (Prior Work)43

0.25…

bk-1

bk-2

(3) Select bk with probability proportional to sample weight

b'k0.20

0.15

Page 44: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Backbone Step for PEA44

bk-1

bk-2

b'k0.23 0.15 0.04

PC(b'k)P2(b'k)P1(b'k)

? Aggregator

w(b'k)

Page 45: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Backbone Step for PEA: Average

45

bk-1

bk-2

b'k0.23 0.15 0.04

PC(b'k)P2(b'k)P1(b'k)

? AVG

0.14

Page 46: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Backbone Step for PEA: Maximum

46

bk-1

bk-2

b'k0.23 0.15 0.04

PC(b'k)P2(b'k)P1(b'k)

? MAX

0.23

Page 47: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Backbone Step for PEA: Sample

47

bk-1

bk-2

b'k0.23 0.15 0.04

PC(b'k)P2(b'k)P1(b'k)

? SAMP

0.15

Page 48: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Recap of ACMI (Prior Work)48

Prot

ocol

P(bk)

0.25

bk-1

bk-2

0.20

0.15

Phase 2 Phase 3

Page 49: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Prot

ocol

Prot

ocol

Recap of PEA49

Prot

ocol

bk-1

bk-2

0.14

0.26

0.05

Phase 2 Phase 3Ag

greg

ato

r

Page 50: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Results: Impact of Ensemble Size

50

Page 51: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Experimental Methodology51

PEA (Probabilistic Ensembles in ACMI) 4 ensemble components Aggregators: AVG, MAX, SAMP

ACMI ORIG – standard ACMI (prior work) EXT – run inference 4 times as long BEST – test best of 4 PEA components

Page 52: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Phase 2 Results: PEA vs ACMI

52

*p-value < 0.01

Page 53: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Protein-Structure Results: PEA vs ACMI53

*p-value < 0.05

Page 54: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Protein-Structure Results: PEA vs ACMI54

Page 55: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Outline55

Background and Motivation ACMI Roadmap and My Contributions Inference in ACMI Guided Belief Propagation Probabilistic Ensembles in ACMI (PEA) Conclusions and Future Directions

Page 56: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

My Contributions56

Perform Local Match Apply Global Constraints Sample Structure

• Local matching with spherical harmonics

• First-pass filtering

• Machine-learning search filter

• Structural homology detection

• Guided BP using domain knowledge

• Residual BP in ACMI

• Probabilistic Ensembles in ACMI

• All-atom structure sampling using particle filters

• Incorporating domain knowledge into sampling

• Aggregation of ensemble estimates

Page 57: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Overall Conclusions57

ACMI is the state-of-the-art method for determining protein structures in low-quality images

Broader implications Phase 1: Shape Matching, Signal Processing,

Search Filtering Phase 2: Graphical models, Statistical Inference Phase 3: Sampling, Video Tracking

Structural biology is a good example of a challenging probabilistic inference problem Guiding BP and PEA are general solutions

Page 58: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

UCH37 [PDB 3IHR]58

E. S. Burgie et al. Proteins: Structure, Function, and Bioinformatics. In-Press

Page 59: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Further Work on ACMI59

Advanced Filtering in Phase 1 Generalize Guided BP

Requires domain knowledge priority function

Generalize PEA Learning; Compare to other approaches

More structures (membrane proteins) Domain knowledge in Phase 3 scoring

Page 60: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Future Work60

Inference in complex domains Non-independent data Combining multiple object types Relations among data sets

Biomedical applications Medical diagnosis Brain imaging Cancer screening Health record analysis

Page 61: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Acknowledgements61

Advisor: Jude Shavlik

Committee: George Phillips, David Page, Mark Craven, Vikas Singh

Collaborators: Frank DiMaio and Sriraam Natarajan, Craig Bingman, Sethe Burgie, Dmitry Kondrashov

Funding: NLM R01-LM008796, NLM Training Grant T15- LM007359, NIH PSI Grant GM074901

Practice Talk Attendees: Craig, Trevor, Deborah, Debbie, Aubrey

ML Group

Page 62: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Acknowledgements62

Friends: Nick, Amy, Nate, Annie, Greg, Ila, 2*(Joe and Heather), Dana, Dave, Christine, Emily, Matt, Jen,

Mike, Angela, Scott, Erica, and others

Family: Bharat, Sharmistha, Asha, Ankoor, and EmilyDale, Mary, Laura, and Jeff

Page 63: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Thank you!

Page 64: Ameet Soni Department of Computer Sciences Doctoral Defense August 10, 2011

Publications•A. Soni and J. Shavlik, “Probabilistic ensembles for improved inference in protein-

structure determination,” in Proceedings of the ACM International Conference on Bioinformatics and Computational Biology, 2011

•A. Soni, C. Bingman, and J. Shavlik, “Guiding belief propagation using domain knowledge for protein-structure determination,” in Proceedings of ACM International Conference on Bioinformatics and Computational Biology, 2010.

•E. S. Burgie, C. A. Bingman, S. L. Grundhoefer, A. Soni, and G. N. Phillips, Jr., “Structural characterization of Uch37 reveals the basis of its auto-inhibitory mechanism.” Proteins: Structure, Function, and Bioinformatics, In-Press. PDB ID: 3IHR.

•F. DiMaio, A. Soni, G. N. Phillips, and J. Shavlik, “Spherical-harmonic decomposition for molecular recognition in electron-density maps,” International Journal of Data Mining and Bioinformatics, 2009.

•F. DiMaio, A. Soni, and J. Shavlik, “Machine learning in structural biology: Interpreting 3D protein images,” in Introduction to Machine Learning and Bioinformatics, ed. Sushmita Mitra, Sujay Datta, Theodore Perkins, and George Michailidis, Ch. 8. 2008.

•F. DiMaio, A. Soni, G. N. Phillips, and J. Shavlik, “Improved methods for template matching in electron-density maps using spherical harmonics,” in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, 2007.

•F. DiMaio, D. Kondrashov, E. Bitto, A. Soni, C. Bingman, G. Phillips, and J. Shavlik, “Creating protein models from electron-density maps using particle-filtering methods,” Bioinformatics, 2007.

64