
Page 1

David Heckerman, Microsoft Research

Graphical Models

Page 2

Overview

Intro to graphical models
– Application: Data exploration

– Dependency networks → undirected graphs

– Directed acyclic graphs (“Bayes nets”)

Applications
– Clustering

– Evolutionary history/phylogeny

Page 3

Using classification/regression for data exploration

Decision tree:
[tree: split on gender; male and young → p(cust)=0.2, male and old → p(cust)=0.7, female → p(cust)=0.8]

Logistic regression:
log p(cust)/(1-p(cust)) = 0.2 - 0.3*male + 0.4*old

Neural network:
[network diagram]

Page 4

Using classification/regression for data exploration

Decision tree:
[tree: split on gender; male and young → p(cust)=0.2, male and old → p(cust)=0.7, female → p(cust)=0.8]

Logistic regression:
log p(cust)/(1-p(cust)) = 0.2 - 0.3*male + 0.4*old

Neural network:
[network diagram]

Each model estimates p(target | inputs)

Page 5

Conditional independence

Decision tree:
[tree: split on gender; male and young → p(cust)=0.2, male and old → p(cust)=0.7, female → p(cust)=0.8]

p(cust | gender, age, month born) = p(cust | gender, age)

p(target | all inputs) = p(target | some inputs)

Page 6

Learning conditional independence from data: Model selection

Cross validation (see the sketch below)
Bayesian methods
Penalized likelihood
Minimum description length
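To make the first of these concrete, here is a minimal sketch of model selection by cross validation (my illustration, not from the talk: it assumes scikit-learn and invents synthetic binary data; the tree depth controls how much conditional dependence the model can encode):

```python
# Minimal sketch: pick a decision-tree depth by cross validation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 3))      # three binary inputs
p = 0.2 + 0.5 * X[:, 0] * X[:, 1]           # target depends on first two only
y = rng.random(1000) < p

best_depth, best_score = None, -np.inf
for depth in [1, 2, 3, 5, 8]:
    score = cross_val_score(
        DecisionTreeClassifier(max_depth=depth), X, y, cv=5
    ).mean()
    if score > best_score:
        best_depth, best_score = depth, score
print(f"selected depth: {best_depth} (cv accuracy {best_score:.3f})")
```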

Page 7

Using classification/regression for data exploration

Suppose you have thousands of variables and you’re not sure about the interactions among those variables

Build a classification/regression model for each variable, using the rest of the variables as inputs

Page 8

Example with three variables X, Y, and Z

Target: X, Inputs: Y, Z → [tree: split on Y → p(x|y=0), p(x|y=1)]

Target: Y, Inputs: X, Z → [tree: split on X; X=1 → p(y|x=1); X=0 → split on Z → p(y|x=0,z=0), p(y|x=0,z=1)]

Target: Z, Inputs: X, Y → [tree: split on Y → p(z|y=0), p(z|y=1)]

Page 9

Summarize the trees with a single graph

[the same three trees: X’s tree splits on Y; Y’s tree splits on X, then on Z when X=0; Z’s tree splits on Y]

[graph: X ↔ Y ↔ Z]

Page 10

Dependency Network

Build a classification/regression model for every variable given the other variables as inputs

Construct a graph where
– Nodes correspond to variables

– There is an arc from X to Y if X helps to predict Y

The graph along with the individual classification/regression models is a “dependency network” (Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000)
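A minimal sketch of this construction (my own illustration, not the authors’ code from the 2000 paper; it assumes scikit-learn, invents data, and reads “X helps to predict Y” as “X gets nonzero importance in Y’s tree”):

```python
# Sketch: learn a dependency network over binary variables.
# An arc X -> Y is drawn when X receives nonzero importance in Y's tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def dependency_network(data, names, max_depth=2):
    arcs = []
    for j, target in enumerate(names):
        inputs = [k for k in range(len(names)) if k != j]
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(data[:, inputs], data[:, j])
        for imp, k in zip(tree.feature_importances_, inputs):
            if imp > 0:
                arcs.append((names[k], target))
    return arcs

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 2000)
x = (rng.random(2000) < np.where(y == 1, 0.9, 0.2)).astype(int)  # X depends on Y
z = (rng.random(2000) < np.where(y == 1, 0.8, 0.1)).astype(int)  # Z depends on Y
data = np.column_stack([x, y, z])
# Expected: arcs between X and Y and between Y and Z (spurious arcs possible).
print(dependency_network(data, ["X", "Y", "Z"]))
```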

Page 11

Example: TV viewing

          Age   Show1   Show2   Show3
viewer 1   73     y       n       n
viewer 2   16     n       y       y    …
viewer 3   35     n       n       n
etc.

~400 shows, ~3000 viewers

Nielsen data: 2/6/95-2/19/95

Goal: exploratory data analysis (acausal)

Page 12
Page 13

A bit of history

Julian Besag (and others) invented dependency networks (under another name) in the mid 1970s

But they didn’t like them, because they could be inconsistent

Page 14

A consistent dependency network

[the three trees from the earlier example, with conditional distributions that are all consistent with a single joint distribution p(x, y, z)]

[graph: X ↔ Y ↔ Z]

Page 15

An inconsistent dependency network

[the same three trees, but with conditional distributions that cannot all be derived from any single joint distribution]

[graph: X ↔ Y ↔ Z]

Page 16

A bit of history

Julian Besag (and others) invented dependency networks (under the name “Markov graphs”) in the mid 1970s

But they didn’t like them, because they could be inconsistent

So they used a property of consistent dependency networks to develop a new characterization of them

Page 17

Conditional independence

[the three trees and the graph X ↔ Y ↔ Z from the earlier example]

X ⊥ Z | Y

Page 18

Conditional independence in a dependency network

Each variable is independent of all other variables given its immediate neighbors

Page 19

Hammersley-Clifford Theorem (Besag 1974)

Given a set of variables which has a positive joint distribution

Where each variable is independent of all other variables given its immediate neighbors in some graph G

It follows that

$$p(\mathbf{x}) = \prod_{i=1}^{N} f_i(\mathbf{c}_i)$$

where $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_N$ are the maximal cliques in the graph G and the $f_i$ are “clique potentials”

Page 20

Example

[undirected graph: X — Y — Z]

$$p(x, y, z) = f_1(x, y)\, f_2(y, z)$$
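A quick numeric check of this factorization (my sketch, not from the talk; the potentials are arbitrary positive tables and need not be normalized, and numpy is assumed):

```python
# Sketch: joint over binary X, Y, Z from clique potentials f1(x,y), f2(y,z).
import numpy as np

f1 = np.array([[2.0, 1.0], [1.0, 3.0]])   # f1[x, y], arbitrary positive values
f2 = np.array([[1.0, 4.0], [2.0, 1.0]])   # f2[y, z]

joint = np.einsum("xy,yz->xyz", f1, f2)   # unnormalized p(x, y, z)
joint /= joint.sum()                      # normalize

# X and Z should be independent given Y (no clique contains both):
for y in (0, 1):
    cond = joint[:, y, :] / joint[:, y, :].sum()   # p(x, z | y)
    px = cond.sum(axis=1, keepdims=True)           # p(x | y)
    pz = cond.sum(axis=0, keepdims=True)           # p(z | y)
    assert np.allclose(cond, px * pz)
```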

Page 21

Consistent dependency networks: Directed arcs not needed

[the dependency network X ↔ Y ↔ Z becomes the undirected graph X — Y — Z]

$$p(x, y, z) = f_1(x, y)\, f_2(y, z)$$

Page 22

A bit of history

Julian Besag (and others) invented dependency networks (under the name “Markov graphs”) in the mid 1970s

But they didn’t like them, because they could be inconsistent

So they used a property of consistent dependency networks to develop a new characterization of them

“Markov random fields” aka “undirected graphs” were born

Page 23

Inconsistent dependency networks aren’t that bad

They are *almost consistent* because each classification/regression model is learned from the same data set (can be formalized)

They are easy to learn from data (build a separate classification/regression model for each variable)

Conditional distributions (e.g., trees) are easier to understand than clique potentials

Page 24

Inconsistent dependency networks aren’t that bad

They are *almost consistent* because each classification/regression model is learned from the same data set (can be formalized)

They are easy to learn from data (build a separate classification/regression model for each variable)

Conditional distributions (e.g., trees) are easier to understand than clique potentials

Over the last decade, they have proven to be a very useful tool for data exploration

Page 25

Shortcomings of undirected graphs

Lack a generative story (e.g., Latent Dirichlet Allocation)
Lack a causal story

[causal graph: cold → sore throat, cough; lung cancer → cough, weight loss]

Page 26

Solution: Build trees in some order

1. Target: X, Inputs: none → p(x)

2. Target: Y, Inputs: X → [tree: split on X → p(y|x=0), p(y|x=1)]

3. Target: Z, Inputs: X, Y → [tree: split on Y → p(z|y=0), p(z|y=1)]

Page 27

Solution: Build trees in some order

1. Target: X, Inputs: none → p(x)

2. Target: Y, Inputs: X → [tree: split on X → p(y|x=0), p(y|x=1)]

3. Target: Z, Inputs: X, Y → [tree: split on Y → p(z|y=0), p(z|y=1)]

[graph: X → Y → Z]

Page 28

Some orders are better than others

Random orders
Greedy search
Monte-Carlo methods

[two candidate graphs, from the orders X, Y, Z and X, Z, Y]

Page 29

Joint distribution is easy to obtain

[the ordered trees from the previous slide: p(x); p(y|x=0), p(y|x=1); p(z|y=0), p(z|y=1)]

[graph: X → Y → Z]

$$p(x)\, p(y \mid x)\, p(z \mid x, y) = p(x, y, z)$$
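In code, with invented CPT numbers (a sketch; note that the learned tree for Z splits only on Y, so here p(z|x,y) reduces to p(z|y)):

```python
# Sketch: assemble the joint p(x, y, z) = p(x) p(y|x) p(z|y) for X -> Y -> Z.
import numpy as np

px = np.array([0.6, 0.4])                    # p(x)
py_x = np.array([[0.8, 0.2], [0.3, 0.7]])    # py_x[x, y] = p(y|x)
pz_y = np.array([[0.9, 0.1], [0.25, 0.75]])  # pz_y[y, z] = p(z|y)

joint = np.einsum("x,xy,yz->xyz", px, py_x, pz_y)
assert np.isclose(joint.sum(), 1.0)          # a proper joint distribution
```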

Page 30

Directed Acyclic Graphs (aka Bayes Nets)

Many inventors: Wright 1921; Good 1961; Howard & Matheson 1976; Pearl 1982

$$p(x_1, \ldots, x_n) = \prod_i p(x_i \mid x_1, \ldots, x_{i-1}) = \prod_i p(x_i \mid \mathrm{parents}(x_i))$$

Page 31

The power of graphical models

Easy to understand
Useful for adding prior knowledge to an analysis (e.g., causal knowledge)
The conditional independencies they express make inference more computationally efficient

Page 32

Inference

[the ordered trees and graph X → Y → Z from the earlier slide]

What is p(z|x=1)?

Page 33

Inference: Example

$$p(z) = \sum_{w,x,y} p(w)\, p(x \mid w)\, p(y \mid x)\, p(z \mid y)$$

[graph: W → X → Y → Z]

Page 34

Inference: Example (“Elimination Algorithm”)

$$p(z) = \sum_{w,x,y} p(w)\, p(x \mid w)\, p(y \mid x)\, p(z \mid y) = \sum_{w,x} p(w)\, p(x \mid w) \sum_{y} p(y \mid x)\, p(z \mid y)$$

[graph: W → X → Y → Z]

Page 35

Inference: Example (“Elimination Algorithm”)

$$\begin{aligned} p(z) &= \sum_{w,x,y} p(w)\, p(x \mid w)\, p(y \mid x)\, p(z \mid y) \\ &= \sum_{w,x} p(w)\, p(x \mid w) \sum_{y} p(y \mid x)\, p(z \mid y) \\ &= \sum_{w} p(w) \sum_{x} p(x \mid w) \sum_{y} p(y \mid x)\, p(z \mid y) \end{aligned}$$

[graph: W → X → Y → Z]
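The same computation as code (a sketch with invented CPTs; numpy assumed). Each matrix product pushes one sum inward, exactly as in the derivation above:

```python
# Sketch: variable elimination for p(z) on the chain W -> X -> Y -> Z.
import numpy as np

pw   = np.array([0.5, 0.5])                   # p(w)
px_w = np.array([[0.9, 0.1], [0.4, 0.6]])     # px_w[w, x] = p(x|w)
py_x = np.array([[0.7, 0.3], [0.2, 0.8]])     # py_x[x, y] = p(y|x)
pz_y = np.array([[0.6, 0.4], [0.1, 0.9]])     # pz_y[y, z] = p(z|y)

# Brute force: build the full joint, then sum over w, x, y.
joint = np.einsum("w,wx,xy,yz->wxyz", pw, px_w, py_x, pz_y)
pz_brute = joint.sum(axis=(0, 1, 2))

# Elimination: sum out one variable at a time, innermost first.
m_y = py_x @ pz_y            # m_y[x, z] = sum_y p(y|x) p(z|y)
m_x = px_w @ m_y             # m_x[w, z] = sum_x p(x|w) m_y[x, z]
pz  = pw @ m_x               # p(z)      = sum_w p(w)   m_x[w, z]

assert np.allclose(pz, pz_brute)
print(pz)
```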

Page 36

Inference

Inference is also important because it is the E step of the EM algorithm (when learning with missing data and/or hidden variables)

Exact methods for inference that exploit conditional independence are well developed (e.g., Shachter; Lauritzen & Spiegelhalter; Dechter)

Exact methods fail when there are many cycles in the graph

– MCMC (e.g., Geman and Geman 1984)
– Loopy propagation (e.g., Murphy et al. 1999)
– Variational methods (e.g., Jordan et al. 1999)

Page 37

Applications of Graphical Models

DAGs and UGs:
– Data exploration
– Density estimation
– Clustering

UGs:
– Spatial processes

DAGs:
– Expert systems
– Causal discovery

Page 38

Applications

Clustering
Evolutionary history/phylogeny

Page 39

Clustering

User   Sequence
1      frontpage news travel travel
2      news news news news news
3      frontpage news frontpage news frontpage
4      news news
5      frontpage news news travel travel travel
6      news weather weather weather weather weather
7      news health health business business business
8      frontpage sports sports sports weather
Etc.

Millions of users per day

Example: msnbc.com

Goal: understand what is and isn’t working on the site

Page 40

Solution

[diagram: data → clustering → user clusters]

• Cluster users based on their behavior on the site
• Display the clusters somehow

Page 41

Generative model for clustering (e.g., AutoClass, Cheeseman & Stutz 1995)

[graph: Cluster → 1st page, Cluster → 2nd page, Cluster → 3rd page; Cluster is discrete and hidden]

Page 42

Sequence Clustering (Cadez, Heckerman, Meek, & Smyth, 2000)

[graph: Cluster → every page, plus arcs 1st page → 2nd page → 3rd page; Cluster is discrete and hidden]
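A sketch of the inference this model calls for (my own illustration with invented parameters, as the graph suggests a mixture of first-order Markov chains): given a cluster, a page sequence has a Markov-chain likelihood, and Bayes’ rule yields the cluster responsibilities.

```python
# Sketch: p(cluster | sequence) under a mixture of first-order Markov chains.
import numpy as np

prior = np.array([0.5, 0.5])                  # p(cluster)
init  = np.array([[0.8, 0.1, 0.1],            # p(first page | cluster)
                  [0.1, 0.8, 0.1]])           # pages: frontpage, news, travel
trans = np.array([np.full((3, 3), 1 / 3),     # cluster 0: uniform transitions
                  np.eye(3) * 0.7 + 0.1])     # cluster 1: "sticky" transitions

def responsibilities(seq):
    logp = np.log(prior) + np.log(init[:, seq[0]])
    for a, b in zip(seq, seq[1:]):
        logp += np.log(trans[:, a, b])        # p(next page | page, cluster)
    p = np.exp(logp - logp.max())
    return p / p.sum()

print(responsibilities([1, 1, 1, 1]))         # repetitive surfer -> sticky cluster
```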

Page 43

Learning parameters (with missing data)

Principles:
– Find the parameters that maximize the (log) likelihood of the data
– Find the parameters whose posterior probability is a maximum
– Find distributions for quantities of interest by averaging over the unknown parameters

Gradient methods or the EM algorithm are typically used for the first two

Page 44

Expectation-Maximization (EM) algorithm (Dempster, Laird, Rubin 1977)

Initialize parameters (e.g., at random)

Expectation step: compute probabilities for the values of the unobserved variable using the current values of the parameters and the incomplete data [THIS IS INFERENCE]; reinterpret the data as a set of fractional cases based on these probabilities

Maximization step: choose parameters so as to maximize the log likelihood of the fractional data

Parameters will converge to a local maximum of log p(data)

Page 45

E-step

Suppose the cluster model has 2 clusters, and that

p(cluster=1 | case, current params) = 0.7
p(cluster=2 | case, current params) = 0.3

Then, write

q(case) = 0.7 log p(case, cluster=1 | params) + 0.3 log p(case, cluster=2 | params)

Do this for each case and then find the parameters that maximize Q = Σ_case q(case). These parameters also maximize the log likelihood.
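The same arithmetic as code (a sketch for a two-cluster mixture of Bernoullis, with everything invented for illustration): the E-step turns each case into fractional cases, and the M-step re-estimates the parameters from the weighted counts.

```python
# Sketch: one EM iteration for a 2-cluster Bernoulli mixture over binary data.
import numpy as np

rng = np.random.default_rng(2)
data = rng.integers(0, 2, size=(500, 4)).astype(float)

mix = np.array([0.5, 0.5])                    # p(cluster)
theta = rng.random((2, 4))                    # p(x_d = 1 | cluster)

# E step (this is inference): responsibilities r[case, cluster].
like = (theta[None] ** data[:, None]) * ((1 - theta[None]) ** (1 - data[:, None]))
r = mix * like.prod(axis=2)
r /= r.sum(axis=1, keepdims=True)

# M step: maximize the expected log likelihood of the fractional data.
mix = r.mean(axis=0)
theta = (r.T @ data) / r.sum(axis=0)[:, None]
```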

Page 46

Demo: SQL Server 2005

User   Sequence
1      frontpage news travel travel
2      news news news news news
3      frontpage news frontpage news frontpage
4      news news
5      frontpage news news travel travel travel
6      news weather weather weather weather weather
7      news health health business business business
8      frontpage sports sports sports weather
Etc.

Example: msnbc.com

Page 47
Page 48

Sequence clustering

Other applications at Microsoft:
– Analyze how people use programs (e.g., Office)
– Analyze web traffic for intruders (anomaly detection)

Page 49

Computational biology applications

Evolutionary history/phylogeny
Vaccine for AIDS

Page 50

Evolutionary History/Phylogeny (Jojic, Jojic, Meek, Geiger, Siepel, Haussler, and Heckerman 2004)

[Figure: reconstructed mammalian phylogeny over ~50 species (horses, rhinos, seals, whales, primates, rodents, bats, marsupials, etc.), grouped into clades labeled Perissodactyla, Carnivora, Cetartiodactyla, Rodentia 1 and 2, Hedgehogs, Primates, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, and Lagomorpha + Scandentia]

Page 51

Probabilistic Model of Evolution

[tree: two hidden ancestor nodes with observed leaves species1, species2, species3]

Page 52

Learning phylogeny from data

For a given tree, find max likelihood parameters

Search over structures to find the best likelihood (penalized to avoid overfitting)

Page 53

Strong simplifying assumption

[Figure: a separate copy of the tree model (hidden nodes h, observed nodes x) for each of nucleotide position 1, nucleotide position 2, …, nucleotide position N]

– Evolution at each DNA nucleotide is independent
– EM is computationally efficient
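Under this assumption the per-site likelihoods simply multiply. A minimal sketch (my illustration with invented numbers; a single hidden ancestor with three observed species, i.e., a star tree rather than the full phylogeny):

```python
# Sketch: log likelihood of aligned sequences when sites evolve independently.
import numpy as np

bases = "ACGT"
root = np.full(4, 0.25)                       # p(ancestor base)
emit = np.eye(4) * 0.8 + 0.05                 # p(species base | ancestor base)

def site_loglik(column):
    # Sum out the hidden ancestor: sum_h p(h) * prod_s p(x_s | h).
    per_h = root * np.prod([emit[:, bases.index(c)] for c in column], axis=0)
    return np.log(per_h.sum())

alignment = ["AAC", "CCC", "GGA", "TTT"]      # one string per site, 3 species
total = sum(site_loglik(col) for col in alignment)   # sites multiply
print(total)
```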

Page 54

Relaxing the assumption

Each substitution depends on the substitution at the previous position

This structure captures context-specific effects during evolution

EM is computationally intractable

[Figure: the per-position tree models again, now with arcs linking each position’s nodes to those at the previous position]

Page 55

Variational approximation for inference

$$\ln p(o \mid \theta) = \ln \sum_h p(h, o \mid \theta) = \ln \sum_h q(h \mid o, \theta)\, \frac{p(h, o \mid \theta)}{q(h \mid o, \theta)} \;\ge\; \sum_h q(h \mid o, \theta) \ln \frac{p(h, o \mid \theta)}{q(h \mid o, \theta)}$$

Lower bound good enough for EM-like algorithm
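A numeric sanity check of the bound (my sketch with a tiny discrete h and arbitrary numbers): any q gives a lower bound on ln p(o|θ), with equality when q is the true posterior.

```python
# Sketch: the variational lower bound for a discrete hidden variable h.
import numpy as np

p_ho = np.array([0.3, 0.1, 0.2])              # p(h, o | theta) for one fixed o
log_po = np.log(p_ho.sum())                   # exact ln p(o | theta)

def bound(q):
    # sum_h q(h) ln [ p(h, o | theta) / q(h) ]
    return np.sum(q * (np.log(p_ho) - np.log(q)))

q_arbitrary = np.array([0.6, 0.2, 0.2])
q_posterior = p_ho / p_ho.sum()

assert bound(q_arbitrary) <= log_po           # Jensen's inequality
assert np.isclose(bound(q_posterior), log_po) # tight at the posterior
```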

Page 56

Two simple q distributions

[Figure: q factorized as a product of trees and q factorized as a product of chains, over the hidden (h) and observed (o) nodes]

Page 57
Page 58

Things I didn’t have time to talk about

Factor graphs, mixed graphs, etc.
Relational learning: PRMs, plates, PERs
Bayesian methods for learning
Scalability
Causal modeling
Variational methods
Non-parametric distributions

Page 59

To learn more

Main conferences:
– Uncertainty in Artificial Intelligence (UAI)
– Neural Information Processing Systems (NIPS)