Lectures 2 – Oct 3, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022 Introduction to Probabilistic Models for Computational Biology 1

Introduction to Probabilistic Models for Computational Biology



Page 1: Introduction to Probabilistic Models for Computational Biology

Lectures 2 – Oct 3, 2011
CSE 527 Computational Biology, Fall 2011

Instructor: Su-In Lee
TA: Christopher Miles

Monday & Wednesday 12:00-1:20
Johnson Hall (JHN) 022

Introduction to Probabilistic Models for Computational Biology

Page 2: Introduction to Probabilistic Models for Computational Biology

Review: Gene Regulation

[Figure: a DNA sequence containing four genes, their RNA transcripts, and the resulting proteins, linked into a genetic regulatory network. Labels: gene, transcription, translation, “Gene Expression”, RNA degradation, gene regulation, and a switch (“transcription factor binding site”).]

DNA: AGATATGTGGATTGTTAGGATTTATGCGCGTCAGTGACTACGCATGTTACGCACCTACGACTAGGTAATGATTGATC

RNA                 Protein
AUGUGGAUUGUU        MWIV
AUGCGCGUC           MRV
AUGUUACGCACCUAC     MLRTY
AUGAUUGAU           MID

Genes regulate each other's expression and activity.
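The transcription and translation steps on this slide can be checked with a few lines of Python. This is only a sketch: the codon table below is deliberately partial (it holds just the codons that occur in the four transcripts), and the names `CODON_TABLE`, `transcribe`, and `translate` are illustrative, not part of the lecture.

```python
# Partial codon table: only the codons that occur in the slide's four transcripts.
CODON_TABLE = {
    "AUG": "M", "UGG": "W", "AUU": "I", "GUU": "V", "CGC": "R",
    "GUC": "V", "UUA": "L", "ACC": "T", "UAC": "Y", "GAU": "D",
}

def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace every T with U."""
    return dna.replace("T", "U")

def translate(rna: str) -> str:
    """Translate an mRNA string codon by codon (no stop-codon handling)."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna), 3))

DNA = ("AGATATGTGGATTGTTAGGATTTATGCGCGTCAGTGACTACGCATGTTACGCACC"
       "TACGACTAGGTAATGATTGATC")
# The four gene segments, written as DNA; each one occurs in the sequence above.
GENES = ["ATGTGGATTGTT", "ATGCGCGTC", "ATGTTACGCACCTAC", "ATGATTGAT"]

proteins = [translate(transcribe(gene)) for gene in GENES]
print(proteins)  # ['MWIV', 'MRV', 'MLRTY', 'MID']
```

The output matches the protein labels on the slide (MWIV, MRV, MLRTY, MID).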

Page 3: Introduction to Probabilistic Models for Computational Biology

Review: Variations in the DNA

[Figure: the gene regulation diagram from the previous slide (DNA, RNA transcripts, proteins, genetic regulatory network) with individual nucleotides altered and several positions marked with an X. Label: “Single nucleotide polymorphism (SNP)”.]

Sequence variations perturb the regulatory network.

Page 4: Introduction to Probabilistic Models for Computational Biology

Outline
  • Probabilistic models in biology
      • Model selection problems
  • Mathematical foundations
      • Bayesian networks
      • Probabilistic Graphical Models: Principles and Techniques, Koller & Friedman, The MIT Press
  • Learning from data
      • Maximum likelihood estimation
      • Expectation and maximization

Page 5: Introduction to Probabilistic Models for Computational Biology

Example 1
  • How are a nucleotide change in DNA, blood pressure, and heart disease related?
  • There can be several “models”…

[Figure: alternative candidate models, separated by “OR”, each a graph over the three variables DNA alteration, Blood pressure, and Heart disease.]

Page 6: Introduction to Probabilistic Models for Computational Biology

Example 2
  • How do genes A, B, and C regulate each other's expression levels (mRNA levels)?
  • There can be several models…

[Figure: three candidate regulatory network structures over genes A, B, and C, separated by “OR ?”.]

Page 7: Introduction to Probabilistic Models for Computational Biology

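[Figure: expression data matrix with rows Gene A, Gene B, Gene C and columns Exp 1, Exp 2, …, Exp N (N instances), next to three candidate network structures over A, B, and C labeled Model I, Model II, and Model III, separated by “OR ?”.]

  • Are there statistical dependencies between the expression levels of genes A, B, and C?
  • Model selection: choose the model with the highest probability of being true given the data
      • argmax_x P(model x is true | Data)
  • Probabilistic graphical models
      • A graphical representation of statistical dependencies.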
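The model selection criterion can be expanded with Bayes' rule. The notation $M_x$ for "model x is true" and $\mathcal{D}$ for the data is an assumption introduced here for brevity; the slide writes the same quantity in words.

```latex
\[
\hat{x}
  = \arg\max_x P(M_x \mid \mathcal{D})
  = \arg\max_x \frac{P(\mathcal{D} \mid M_x)\, P(M_x)}{P(\mathcal{D})}
  = \arg\max_x P(\mathcal{D} \mid M_x)\, P(M_x)
\]
```

The last step holds because $P(\mathcal{D})$ does not depend on $x$. If the prior over models is taken to be uniform, the choice reduces to comparing the likelihoods $P(\mathcal{D} \mid M_x)$.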

Page 8: Introduction to Probabilistic Models for Computational Biology

Outline
  • Probabilistic models in biology
      • Model selection problem
  • Mathematical foundations
      • Bayesian networks
  • Learning from data
      • Maximum likelihood estimation
      • Expectation and maximization

Page 9: Introduction to Probabilistic Models for Computational Biology

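Probability Theory Review
  • Assume random variables A and B with Val(A) = {a1, a2, a3}, Val(B) = {b1, b2}
  • Conditional probability
      • Definition: $P(A \mid B) = P(A, B) / P(B)$
      • Chain rule: $P(A, B) = P(A)\, P(B \mid A)$
      • Bayes' rule: $P(A \mid B) = P(B \mid A)\, P(A) / P(B)$
  • Probabilistic independence: $P(A, B) = P(A)\, P(B)$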
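A small numeric check of the rules named on this slide, using the slide's value sets Val(A) = {a1, a2, a3} and Val(B) = {b1, b2}. The joint table is made up for illustration; only the relationships between the computed quantities matter.

```python
from itertools import product
from math import isclose

A_VALS = ["a1", "a2", "a3"]
B_VALS = ["b1", "b2"]

# Hypothetical joint distribution P(A, B); the numbers are invented and sum to 1.
joint = {
    ("a1", "b1"): 0.10, ("a1", "b2"): 0.20,
    ("a2", "b1"): 0.15, ("a2", "b2"): 0.25,
    ("a3", "b1"): 0.05, ("a3", "b2"): 0.25,
}
pairs = list(product(A_VALS, B_VALS))

# Marginals, obtained by summing out the other variable.
p_a = {a: sum(joint[a, b] for b in B_VALS) for a in A_VALS}
p_b = {b: sum(joint[a, b] for a in A_VALS) for b in B_VALS}

# Conditional probability by definition: P(A|B) = P(A,B) / P(B), and likewise P(B|A).
p_a_given_b = {(a, b): joint[a, b] / p_b[b] for a, b in pairs}
p_b_given_a = {(a, b): joint[a, b] / p_a[a] for a, b in pairs}

# Chain rule: P(A,B) = P(A) P(B|A).
chain_ok = all(isclose(joint[a, b], p_a[a] * p_b_given_a[a, b]) for a, b in pairs)
# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B).
bayes_ok = all(
    isclose(p_a_given_b[a, b], p_b_given_a[a, b] * p_a[a] / p_b[b]) for a, b in pairs
)
# Independence would require P(A,B) = P(A) P(B) for every pair of values.
independent = all(isclose(joint[a, b], p_a[a] * p_b[b]) for a, b in pairs)

print(chain_ok, bayes_ok, independent)  # True True False
```

With these particular numbers A and B are dependent: P(a1, b1) = 0.10, whereas P(a1) P(b1) = 0.3 × 0.3 = 0.09.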

Page 10: Introduction to Probabilistic Models for Computational Biology

Probabilistic Representation
  • Joint distribution P over {x1, …, xn}
      • If each xi is binary, the joint distribution has $2^n - 1$ independent entries
  • If the x's are independent
      • P(x) = p(x1) … p(xn), which needs only n parameters

Page 11: Introduction to Probabilistic Models for Computational Biology

Conditional Parameterization
  • The Diabetes example
      • Genetic risk (G), Diabetes (D)
      • Val(G) = {g1, g0}, Val(D) = {d1, d0}
  • P(G,D) = P(G) P(D|G)
      • P(G): prior distribution
      • P(D|G): conditional probability distribution (CPD)

[Figure: network with Genetic risk as the parent of Diabetes.]
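As a sketch of how the conditional parameterization is used, the snippet below builds the joint P(G,D) from a prior and a CPD and then applies Bayes' rule. All numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from math import isclose

# Hypothetical prior P(G) and CPD P(D | G); these values are not from the lecture.
p_g = {"g1": 0.2, "g0": 0.8}
p_d_given_g = {
    "g1": {"d1": 0.30, "d0": 0.70},
    "g0": {"d1": 0.05, "d0": 0.95},
}

# Joint distribution via P(G, D) = P(G) P(D | G).
joint = {
    (g, d): p_g[g] * p_d_given_g[g][d]
    for g in p_g
    for d in ("d1", "d0")
}

# Marginal P(d1), then the posterior P(g1 | d1) by Bayes' rule.
p_d1 = joint["g1", "d1"] + joint["g0", "d1"]
p_g1_given_d1 = joint["g1", "d1"] / p_d1

print(round(p_d1, 4), round(p_g1_given_d1, 4))  # 0.1 0.6
```

Three numbers (one for P(G), two for P(D|G)) specify the whole joint, the same count as the $2^2 - 1 = 3$ independent entries of the joint table, so for two variables the conditional form changes the representation but not the size.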

Page 12: Introduction to Probabilistic Models for Computational Biology

Naïve Bayes Model - Example
  • Elaborating the diabetes example
      • Genetic risk (G), Diabetes (D), Hypertension (H)
      • Val(G) = {g1, g0}, Val(D) = {d1, d0}, Val(H) = {h1, h0}
      • The joint distribution P(G,D,H) has 8 entries (7 independent parameters)
  • If D and H are independent given G
      • P(G,D,H) = P(G) P(D|G) P(H|G)
      • 5 independent parameters; more compact than the joint

[Figure: network with Genetic risk as the parent of Diabetes and Hypertension.]

Page 13: Introduction to Probabilistic Models for Computational Biology

Naïve Bayes Model
  • A class C where Val(C) = {c1, …, ck}
  • Finding variables x1, …, xn
  • Naïve Bayes assumption
      • The findings are conditionally independent given the individual's class.
      • The model factorizes as: $P(C, x_1, \dots, x_n) = P(C) \prod_{i=1}^{n} P(x_i \mid C)$
  • The Diabetes example
      • Class: Genetic risk; findings: Diabetes, Hypertension

Page 14: Introduction to Probabilistic Models for Computational Biology

Naïve Bayes Model - Example
  • Medical diagnosis system
      • Class C: disease
      • Findings X: symptoms
  • Computing the confidence: $P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$
  • Drawbacks
      • Strong assumptions
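A minimal sketch of how a naïve Bayes diagnosis system could compute its confidence in each disease, assuming the posterior is obtained by multiplying the class prior with the per-finding conditionals and normalizing. The disease name, symptoms, and probabilities are all hypothetical, as is the function name `naive_bayes_posterior`.

```python
from math import isclose, prod

def naive_bayes_posterior(prior, cpds, findings):
    """Return P(C | findings) under the naive Bayes factorization.

    prior:    {class_value: P(class_value)}
    cpds:     {finding_name: {class_value: {finding_value: P(value | class)}}}
    findings: {finding_name: observed_value}
    """
    # Unnormalized score per class: P(c) * prod_i P(x_i | c).
    unnormalized = {
        c: prior[c] * prod(cpds[name][c][value] for name, value in findings.items())
        for c in prior
    }
    total = sum(unnormalized.values())
    return {c: score / total for c, score in unnormalized.items()}

# Hypothetical two-class example with two binary symptoms.
prior = {"flu": 0.1, "healthy": 0.9}
cpds = {
    "fever": {"flu": {True: 0.8, False: 0.2}, "healthy": {True: 0.05, False: 0.95}},
    "cough": {"flu": {True: 0.7, False: 0.3}, "healthy": {True: 0.10, False: 0.90}},
}

posterior = naive_bayes_posterior(prior, cpds, {"fever": True, "cough": True})
print(posterior)
```

Here the unnormalized scores are 0.1 × 0.8 × 0.7 = 0.056 for "flu" and 0.9 × 0.05 × 0.10 = 0.0045 for "healthy", so the posterior for "flu" is 0.056 / 0.0605. The strong-assumption drawback shows up directly: if fever and cough are correlated within a class, multiplying their conditionals counts the same evidence twice.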

Page 15: Introduction to Probabilistic Models for Computational Biology

Bayesian Network
  • Directed acyclic graph (DAG)
      • Node: a random variable
      • Edge: direct influence of one node on another
  • The Diabetes example revisited
      • Genetic risk (G), Diabetes (D), Hypertension (H)
      • Val(G) = {g1, g0}, Val(D) = {d1, d0}, Val(H) = {h1, h0}

[Figure: network over Genetic risk, Diabetes, and Hypertension.]

Page 16: Introduction to Probabilistic Models for Computational Biology

Bayesian Network Semantics
  • A Bayesian network structure G is a directed acyclic graph whose nodes represent random variables X1, …, Xn.
      • $\mathrm{Pa}_{X_i}$: parents of Xi in G
      • $\mathrm{NonDescendants}_{X_i}$: variables in G that are not descendants of Xi
  • G encodes the following set of conditional independence assumptions, called the local Markov assumptions, and denoted by $I_L(G)$:
      • For each variable Xi: $(X_i \perp \mathrm{NonDescendants}_{X_i} \mid \mathrm{Pa}_{X_i})$

[Figure: example DAG over nodes x1, …, x11.]
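The local Markov assumptions can be enumerated mechanically from a graph. The sketch below assumes the diabetes network with Genetic risk (G) as the parent of both Diabetes (D) and Hypertension (H); the function names and the dictionary encoding of the graph are illustrative choices.

```python
def descendants(children, node):
    """All nodes reachable from `node` by following directed edges."""
    seen = set()
    stack = list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen

def local_markov(parents):
    """Map each node X to (non-descendants of X excluding its parents, parents of X).

    `parents` maps every node to the set of its parents. Parents are themselves
    non-descendants, but conditioning on them makes that part of the statement
    trivial, so they are left out of the first component.
    """
    children = {node: set() for node in parents}
    for child, node_parents in parents.items():
        for parent in node_parents:
            children[parent].add(child)
    result = {}
    for node in parents:
        non_descendants = set(parents) - descendants(children, node) - {node}
        result[node] = (non_descendants - parents[node], set(parents[node]))
    return result

# Assumed structure: G -> D and G -> H.
diabetes_network = {"G": set(), "D": {"G"}, "H": {"G"}}

for node, (independent_of, given) in sorted(local_markov(diabetes_network).items()):
    print(node, "is independent of", sorted(independent_of), "given", sorted(given))
```

For this graph the only non-trivial statements are that D is independent of H given G, and H is independent of D given G, which is exactly the naïve Bayes assumption.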

Page 17: Introduction to Probabilistic Models for Computational Biology

The Genetics Example
  • Variables
      • B: blood type (a phenotype)
      • G: genotype of the gene that encodes a person's blood type; <A,A>, <A,B>, <A,O>, <B,B>, <B,O>, <O,O>
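The slide lists the variables but not the CPD P(B | G). As a sketch, assuming standard ABO inheritance (alleles A and B are codominant, O is recessive), the CPD is deterministic and can be written as a lookup. The helper name `blood_type` is hypothetical.

```python
GENOTYPES = [("A", "A"), ("A", "B"), ("A", "O"), ("B", "B"), ("B", "O"), ("O", "O")]
BLOOD_TYPES = ["A", "B", "AB", "O"]

def blood_type(genotype):
    """Phenotype for a genotype, assuming A and B are codominant and O is recessive."""
    expressed = set(genotype) - {"O"}   # O only shows when no A or B allele is present
    if not expressed:
        return "O"
    return "".join(sorted(expressed))   # {"A"} -> "A", {"A", "B"} -> "AB"

# Deterministic CPD P(B | G): probability 1 on the implied blood type, 0 elsewhere.
cpd_b_given_g = {
    g: {b: 1.0 if blood_type(g) == b else 0.0 for b in BLOOD_TYPES}
    for g in GENOTYPES
}

for g in GENOTYPES:
    print(g, "->", blood_type(g))
```

Six genotypes map onto four blood types, so observing the phenotype B leaves the genotype G only partly determined, for example blood type A is consistent with both <A,A> and <A,O>.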

Page 18: Introduction to Probabilistic Models for Computational Biology

Bayesian Network Joint Distribution
  • Let G be a Bayesian network graph over the variables X1, …, Xn. We say that a distribution P factorizes according to G if P can be expressed as:
      • $P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_{X_i})$
  • A Bayesian network is a pair (G, P) where P factorizes over G, and where P is specified as a set of CPDs associated with G's nodes.

Page 19: Introduction to Probabilistic Models for Computational Biology

The Student Example
  • More complex scenario
      • Course difficulty (D), quality of the recommendation letter (L), Intelligence (I), SAT (S), Grade (G)
      • Val(D) = {easy, hard}, Val(L) = {strong, weak}, Val(I) = {i1, i0}, Val(S) = {s1, s0}, Val(G) = {g1, g2, g3}
  • The joint distribution has 2 · 2 · 2 · 2 · 3 = 48 entries, so it requires 47 independent parameters

Page 20: Introduction to Probabilistic Models for Computational Biology

The Student Bayesian network
  • Joint distribution
      • P(I,D,G,S,L) = P(I) P(D) P(G|I,D) P(S|I) P(L|G)

[Figure: the Student Bayesian network, from Koller & Friedman.]
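To see how much the network structure saves, the sketch below counts independent parameters for the Student example. It assumes the structure from Koller & Friedman (I and D are parents of G, I is the parent of S, G is the parent of L); the function name is illustrative.

```python
from math import prod

# Number of values of each variable, as listed in the Student example.
cardinality = {"D": 2, "I": 2, "G": 3, "S": 2, "L": 2}
# Assumed structure (Koller & Friedman): I -> G, D -> G, I -> S, G -> L.
parents = {"D": [], "I": [], "G": ["I", "D"], "S": ["I"], "L": ["G"]}

def num_independent_parameters(cardinality, parents):
    """Sum over nodes of (|Val(X)| - 1) free values per joint parent assignment."""
    return sum(
        (cardinality[x] - 1) * prod(cardinality[p] for p in parents[x])
        for x in cardinality
    )

full_joint = prod(cardinality.values()) - 1                     # explicit joint table
network = num_independent_parameters(cardinality, parents)      # factorized form

print(full_joint, network)  # 47 15
```

The per-node counts are 1 for D, 1 for I, 8 for G (2 free values for each of 4 parent assignments), 2 for S, and 3 for L.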

Page 21: Introduction to Probabilistic Models for Computational Biology

Parameter Estimation
  • Assumptions
      • Fixed network structure
      • Fully observed instances of the network variables: D = {d[1], …, d[M]}
  • Maximum likelihood estimation (MLE)!
      • “Parameters” of the Bayesian network
  • For example, {i0, d1, g1, l0, s0}

[Figure: from Koller & Friedman.]
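With a fixed structure and fully observed data, the maximum likelihood estimate of each CPD entry is a relative frequency: the count of the child value together with a parent assignment, divided by the count of that parent assignment. The sketch below estimates P(S | I) from four invented instances; the function name `mle_cpd` and the data are hypothetical.

```python
from collections import Counter
from math import isclose

def mle_cpd(data, child, parents):
    """Estimate P(child | parents) from fully observed instances by counting.

    data: list of dicts mapping variable name to observed value.
    Returns {parent_assignment_tuple: {child_value: estimated probability}}.
    """
    joint_counts = Counter()
    parent_counts = Counter()
    for instance in data:
        assignment = tuple(instance[p] for p in parents)
        joint_counts[assignment, instance[child]] += 1
        parent_counts[assignment] += 1
    cpd = {}
    for (assignment, value), count in joint_counts.items():
        cpd.setdefault(assignment, {})[value] = count / parent_counts[assignment]
    return cpd

# Hypothetical fully observed instances, restricted to I and S for brevity.
data = [
    {"I": "i0", "S": "s0"},
    {"I": "i0", "S": "s0"},
    {"I": "i0", "S": "s1"},
    {"I": "i1", "S": "s1"},
]

cpd_s_given_i = mle_cpd(data, "S", ["I"])
print(cpd_s_given_i)
```

Child values never observed with a given parent assignment are simply absent from the result, which corresponds to an estimate of zero; with small samples such as this one that is a known weakness of plain MLE.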

Page 22: Introduction to Probabilistic Models for Computational Biology

Outline
  • Probabilistic models in biology
      • Model selection problem
  • Mathematical foundations
      • Bayesian networks
  • Learning from data
      • Maximum likelihood estimation
      • Expectation and maximization

Page 23: Introduction to Probabilistic Models for Computational Biology

Acknowledgement

Profs. Daphne Koller & Nir Friedman, “Probabilistic Graphical Models”