Graphical Models
David Heckerman, Microsoft Research
Overview
- Intro to graphical models
  - Application: data exploration
  - Dependency networks → undirected graphs
  - Directed acyclic graphs ("Bayes nets")
- Applications
  - Clustering
  - Evolutionary history/phylogeny
Using classification/regression for data exploration

Goal: model p(target | inputs).

Decision tree:
[tree splitting on gender (male/female), then age (young/old), with leaves p(cust)=0.2, p(cust)=0.7, p(cust)=0.8]

Logistic regression:
log p(cust)/(1 - p(cust)) = 0.2 - 0.3*male + 0.4*old

Neural network:
[network diagram]
Conditional independence

Decision tree:
[same tree as above: split on gender, then age; leaves p(cust)=0.2, 0.7, 0.8]

The tree encodes conditional independencies, e.g.

p(cust | gender, age, month born) = p(cust | gender, age)

More generally: p(target | all inputs) = p(target | some inputs)
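A minimal sketch (not from the talk; assumes numpy, with entirely made-up data) of checking such an independence empirically by comparing the two conditionals:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: cust depends on gender and age, but not on birth month.
gender = rng.integers(0, 2, n)          # 0 = male, 1 = female
age = rng.integers(0, 2, n)             # 0 = young, 1 = old
month = rng.integers(0, 12, n)          # birth month, irrelevant by construction
p_cust = np.where(gender == 0, np.where(age == 0, 0.2, 0.7), 0.8)
cust = rng.random(n) < p_cust

# Compare p(cust | gender, age, month) with p(cust | gender, age).
mask = (gender == 0) & (age == 1)
full = cust[mask & (month == 3)].mean()   # additionally condition on one birth month
partial = cust[mask].mean()
print(f"p(cust | male, old, March) ~= {full:.3f}")
print(f"p(cust | male, old)        ~= {partial:.3f}  (close, as expected)")
```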
Learning conditional independence from data: model selection
- Cross validation
- Bayesian methods
- Penalized likelihood
- Minimum description length
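A minimal sketch of the first option, cross validation, used to choose tree complexity; scikit-learn and the toy data are my own choices, not the talk's:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy binary inputs: gender, age, and an irrelevant third variable
X = rng.integers(0, 2, size=(2000, 3))
y = (rng.random(2000) < np.where(X[:, 0] == 0, 0.2 + 0.5 * X[:, 1], 0.8)).astype(int)

# Deeper trees can split on the irrelevant input; CV should prefer depth 2.
for depth in (1, 2, 3, 5):
    score = cross_val_score(DecisionTreeClassifier(max_depth=depth), X, y, cv=5).mean()
    print(f"max_depth={depth}: mean CV accuracy {score:.3f}")
```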
Using classification/regression for data exploration
- Suppose you have thousands of variables and you're not sure about the interactions among those variables
- Build a classification/regression model for each variable, using the rest of the variables as inputs
- Example with three variables X, Y, and Z (trees below)
[Three decision trees:
 target X, inputs Y,Z: split on Y → p(x|y=0), p(x|y=1)
 target Z, inputs X,Y: split on Y → p(z|y=0), p(z|y=1)
 target Y, inputs X,Z: split on X, then Z → p(y|x=0,z=0), p(y|x=0,z=1), p(y|x=1)]
Summarize the trees with a single graph

[Same three trees as above, summarized as the graph X ↔ Y ↔ Z]
Dependency Network
- Build a classification/regression model for every variable given the other variables as inputs
- Construct a graph where
  - nodes correspond to variables
  - there is an arc from X to Y if X helps to predict Y
- The graph along with the individual classification/regression models is a "dependency network" (Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000)
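A minimal sketch of the construction, under my own assumptions (scikit-learn decision trees, with "X helps to predict Y" read as nonzero feature importance):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy binary data with a chain of dependence: X -> Y -> Z
x = rng.integers(0, 2, 5000)
y = (rng.random(5000) < np.where(x == 1, 0.9, 0.1)).astype(int)
z = (rng.random(5000) < np.where(y == 1, 0.8, 0.2)).astype(int)
data = np.column_stack([x, y, z])
names = ["X", "Y", "Z"]

arcs = []
for t in range(data.shape[1]):          # one classification model per target variable
    inputs = [i for i in range(data.shape[1]) if i != t]
    tree = DecisionTreeClassifier(max_depth=2).fit(data[:, inputs], data[:, t])
    for i, imp in zip(inputs, tree.feature_importances_):
        if imp > 0:                     # input helps predict the target -> draw an arc
            arcs.append((names[i], names[t]))
print("Dependency network arcs:", arcs)
```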
Example: TV viewing

Nielsen data, 2/6/95-2/19/95: ~400 shows, ~3000 viewers

           Age  Show1  Show2  Show3  ...
viewer 1    73    y      n      n
viewer 2    16    n      y      y
viewer 3    35    n      n      n
etc.

Goal: exploratory data analysis (acausal)
A bit of history
- Julian Besag (and others) invented dependency networks (under another name) in the mid 1970s
- But they didn't like them, because they could be inconsistent
A consistent dependency network

[The three trees above, with conditionals that all agree with a single joint distribution; graph X ↔ Y ↔ Z]
An inconsistent dependency network

[The same three trees, but with conditionals that cannot all come from any single joint distribution]
A bit of history
- Julian Besag (and others) invented dependency networks (under the name "Markov graphs") in the mid 1970s
- But they didn't like them, because they could be inconsistent
- So they used a property of consistent dependency networks to develop a new characterization of them
Conditional independence

[Three trees / graph X ↔ Y ↔ Z, as above]

X ⊥ Z | Y
Conditional independence in a dependency network
- Each variable is independent of all other variables given its immediate neighbors
Hammersley-Clifford Theorem (Besag 1974)
- Given a set of variables which has a positive joint distribution,
- where each variable is independent of all other variables given its immediate neighbors in some graph G,
- it follows that

  p(x) = ∏_{i=1}^{N} f_i(c_i)

  where c_1, c_2, ..., c_N are the maximal cliques in the graph G and the f_i are "clique potentials".
Example

X – Y – Z

p(x,y,z) = f_1(x,y) f_2(y,z)
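A small numeric check of this factorization (my own toy potentials, assuming numpy): normalize f_1·f_2 into a joint and verify that X ⊥ Z | Y holds:

```python
import numpy as np

f1 = np.array([[1.0, 2.0], [3.0, 4.0]])   # f1(x, y), any positive values
f2 = np.array([[5.0, 1.0], [2.0, 6.0]])   # f2(y, z)

# p(x,y,z) proportional to f1(x,y) * f2(y,z)
p = np.einsum("xy,yz->xyz", f1, f2)
p /= p.sum()

# Check X _|_ Z | Y: p(x,z|y) should equal p(x|y) p(z|y) for each y
for y in (0, 1):
    joint_y = p[:, y, :] / p[:, y, :].sum()          # p(x,z | y)
    prod = np.outer(joint_y.sum(1), joint_y.sum(0))  # p(x|y) p(z|y)
    print(f"y={y}: max deviation {np.abs(joint_y - prod).max():.2e}")
```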
Consistent dependency networks: directed arcs not needed

X ↔ Y ↔ Z   becomes   X – Y – Z

p(x,y,z) = f_1(x,y) f_2(y,z)
A bit of history
- Julian Besag (and others) invented dependency networks (under the name "Markov graphs") in the mid 1970s
- But they didn't like them, because they could be inconsistent
- So they used a property of consistent dependency networks to develop a new characterization of them
- "Markov Random Fields" aka "undirected graphs" were born
Inconsistent dependency networks aren't that bad
- They are *almost consistent*, because each classification/regression model is learned from the same data set (this can be formalized)
- They are easy to learn from data (build a separate classification/regression model for each variable)
- Conditional distributions (e.g., trees) are easier to understand than clique potentials
- Over the last decade, they have proven to be a very useful tool for data exploration
Shortcomings of undirected graphs
- Lack a generative story (e.g., Latent Dirichlet Allocation)
- Lack a causal story

[Example DAG: cold and lung cancer as causes, with children sore throat, cough, weight loss]
Solution: Build trees in some order

1. Target: X, inputs: none → p(x)
2. Target: Y, inputs: X → split on X: p(y|x=0), p(y|x=1)
3. Target: Z, inputs: X,Y → split on Y: p(z|y=0), p(z|y=1)

Resulting graph: X → Y → Z
Some orders are better than others
- Random orders
- Greedy search
- Monte-Carlo methods

X → Y → Z   vs.   X → Z → Y
Joint distribution is easy to obtain

1. Target: X, inputs: none → p(x)
2. Target: Y, inputs: X → p(y|x=0), p(y|x=1)
3. Target: Z, inputs: X,Y → p(z|y=0), p(z|y=1)

X → Y → Z

p(x) p(y|x) p(z|x,y) = p(x,y,z)
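A minimal sketch (numpy, toy numbers of my own) of assembling the joint from the ordered conditionals; here Z's tree splits only on Y, so p(z|x,y) reduces to p(z|y):

```python
import numpy as np

p_x = np.array([0.6, 0.4])                      # p(x)
p_y_x = np.array([[0.8, 0.2], [0.3, 0.7]])      # p(y|x), rows indexed by x
p_z_y = np.array([[0.9, 0.1], [0.25, 0.75]])    # p(z|y), rows indexed by y

# p(x,y,z) = p(x) p(y|x) p(z|x,y), with p(z|x,y) = p(z|y) here
joint = np.einsum("x,xy,yz->xyz", p_x, p_y_x, p_z_y)
print("sums to 1:", np.isclose(joint.sum(), 1.0))
print("p(x=1, y=0, z=1) =", joint[1, 0, 1])
```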
Directed Acyclic Graphs (aka Bayes Nets)

Many inventors: Wright 1921; Good 1961; Howard & Matheson 1976; Pearl 1982

p(x_1, ..., x_n) = ∏_i p(x_i | x_1, ..., x_{i-1}) = ∏_i p(x_i | parents(x_i))
The power of graphical models
- Easy to understand
- Useful for adding prior knowledge to an analysis (e.g., causal knowledge)
- The conditional independencies they express make inference more computationally efficient
Inference

1. Target: X, inputs: none → p(x)
2. Target: Y, inputs: X → p(y|x=0), p(y|x=1)
3. Target: Z, inputs: X,Y → p(z|y=0), p(z|y=1)

X → Y → Z

What is p(z|x=1)?
Inference: Example

W → X → Y → Z

p(z) = Σ_{w,x,y} p(w) p(x|w) p(y|x) p(z|y)
Inference: Example ("Elimination Algorithm")

W → X → Y → Z

p(z) = Σ_{w,x,y} p(w) p(x|w) p(y|x) p(z|y)
     = Σ_{w,x} p(w) p(x|w) Σ_y p(y|x) p(z|y)
Inference: Example ("Elimination Algorithm")

W → X → Y → Z

p(z) = Σ_{w,x,y} p(w) p(x|w) p(y|x) p(z|y)
     = Σ_{w,x} p(w) p(x|w) Σ_y p(y|x) p(z|y)
     = Σ_w p(w) Σ_x p(x|w) Σ_y p(y|x) p(z|y)
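A sketch of the elimination idea on this chain (toy tables of my own, assuming numpy): pushing the sums inward replaces one big sum over w, x, y with a sequence of small local sums, and both forms give the same p(z):

```python
import numpy as np

rng = np.random.default_rng(0)
def random_cpt(rows):                       # random conditional table p(col | row)
    t = rng.random((rows, 2))
    return t / t.sum(axis=1, keepdims=True)

p_w = np.array([0.5, 0.5])
p_x_w, p_y_x, p_z_y = random_cpt(2), random_cpt(2), random_cpt(2)

# Brute force: p(z) = sum_{w,x,y} p(w) p(x|w) p(y|x) p(z|y)
brute = np.einsum("w,wx,xy,yz->z", p_w, p_x_w, p_y_x, p_z_y)

# Elimination, as on the slide: sum out y first, then x, then w
g_xz = p_y_x @ p_z_y        # sum_y p(y|x) p(z|y)   -> table over (x, z)
g_wz = p_x_w @ g_xz         # sum_x p(x|w) g(x,z)   -> table over (w, z)
p_z = p_w @ g_wz            # sum_w p(w) g(w,z)
print(np.allclose(brute, p_z), p_z)
```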
Inference
- Inference is also important because it is the E step of the EM algorithm (when learning with missing data and/or hidden variables)
- Exact methods for inference that exploit conditional independence are well developed (e.g., Shachter; Lauritzen & Spiegelhalter; Dechter)
- Exact methods fail when there are many cycles in the graph; approximate alternatives include
  - MCMC (e.g., Geman and Geman 1984)
  - Loopy propagation (e.g., Murphy et al. 1999)
  - Variational methods (e.g., Jordan et al. 1999)
Applications of Graphical Models

DAGs and UGs: data exploration, density estimation, clustering
UGs: spatial processes
DAGs: expert systems, causal discovery
Applications
- Clustering
- Evolutionary history/phylogeny
Clustering

Example: msnbc.com (millions of users per day)

User  Sequence
1     frontpage news travel travel
2     news news news news news
3     frontpage news frontpage news frontpage
4     news news
5     frontpage news news travel travel travel
6     news weather weather weather weather weather
7     news health health business business business
8     frontpage sports sports sports weather
etc.

Goal: understand what is and isn't working on the site
Solution
- Cluster users based on their behavior on the site
- Display clusters somehow

[diagram: data → clustering → user clusters]
Generative model for clustering (e.g., AutoClass, Cheeseman & Stutz 1995)

[Discrete hidden Cluster node with arcs to observed nodes 1st page, 2nd page, 3rd page, ...]
Sequence Clustering (Cadez, Heckerman, Meek, & Smyth, 2000)

[Discrete hidden Cluster node over a chain of page variables 1st page → 2nd page → 3rd page → ..., with the Cluster node pointing at every page variable]
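A minimal sketch (mine, not the paper's code) of the resulting likelihood: each cluster carries its own first-order Markov chain over pages, and a user's sequence is scored under the mixture. All numbers here are hypothetical:

```python
import numpy as np

pages = {"frontpage": 0, "news": 1, "travel": 2}
# Hypothetical parameters for 2 clusters: mixture weights, initial and transition probs
weights = np.array([0.6, 0.4])
init = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]])            # init[c, page]
trans = np.array([[[0.1, 0.6, 0.3],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.2, 0.7]],
                  [[0.3, 0.6, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.4, 0.4]]])         # trans[c, from, to]

def sequence_likelihood(seq):
    """p(seq) = sum_c p(c) p(first|c) prod_t p(page_t | page_{t-1}, c)"""
    s = [pages[p] for p in seq]
    per_cluster = init[:, s[0]].copy()
    for a, b in zip(s, s[1:]):
        per_cluster *= trans[:, a, b]
    return weights @ per_cluster

print(sequence_likelihood(["frontpage", "news", "travel", "travel"]))
```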
Learning parameters (with missing data)

Principles:
- Find the parameters that maximize the (log) likelihood of the data
- Find the parameters whose posterior probability is a maximum
- Find distributions for quantities of interest by averaging over the unknown parameters

Gradient methods or the EM algorithm are typically used for the first two
Expectation-Maximization (EM) algorithm (Dempster, Laird, Rubin 1977)
- Initialize parameters (e.g., at random)
- Expectation step: compute probabilities for values of the unobserved variable using the current values of the parameters and the incomplete data [THIS IS INFERENCE]; reinterpret the data as a set of fractional cases based on these probabilities
- Maximization step: choose parameters so as to maximize the log likelihood of the fractional data
- The parameters will converge to a local maximum of log p(data)
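A compact sketch of the two steps for a 2-cluster mixture of independent binary variables (a toy model of my own, assuming numpy); the E step is exactly the inference step, and the M step refits the parameters to the fractional cases:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy binary data generated from two hidden clusters
true_p = np.array([[0.9, 0.1, 0.8], [0.2, 0.7, 0.3]])
z = rng.integers(0, 2, 500)
data = (rng.random((500, 3)) < true_p[z]).astype(float)

pi = np.array([0.5, 0.5])                        # initialize parameters
p = rng.random((2, 3))
for _ in range(50):
    # E step (inference): responsibility of each cluster for each case
    lik = np.prod(np.where(data[:, None, :] == 1, p, 1 - p), axis=2) * pi
    resp = lik / lik.sum(axis=1, keepdims=True)  # fractional cases
    # M step: maximize expected log likelihood of the fractional data
    pi = resp.mean(axis=0)
    p = (resp.T @ data) / resp.sum(axis=0)[:, None]
print("mixture weights:", np.round(pi, 2))
print("cluster Bernoulli params:\n", np.round(p, 2))
```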
E-step

Suppose the cluster model has 2 clusters, and that

  p(cluster=1 | case, current params) = 0.7
  p(cluster=2 | case, current params) = 0.3

Then, write

  q(case) = 0.7 log p(case, cluster=1 | params) + 0.3 log p(case, cluster=2 | params)

Do this for each case and then find the parameters that maximize Q = Σ_case q(case). These parameters also increase the log likelihood.
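The same arithmetic in a few lines (case probabilities from the slide; the complete-data probabilities are invented for illustration):

```python
import numpy as np

# p(cluster | case, current params) from the E step, as on the slide
r = np.array([0.7, 0.3])
# Hypothetical complete-data probabilities p(case, cluster | params)
p_case_cluster = np.array([0.12, 0.05])

q_case = float(r @ np.log(p_case_cluster))   # 0.7 log p(...,1) + 0.3 log p(...,2)
print(f"q(case) = {q_case:.3f}")             # summed over cases this gives Q
```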
Demo: SQL Server 2005

Example: msnbc.com [same user-sequence data as above]
Sequence clustering

Other applications at Microsoft:
- Analyze how people use programs (e.g., Office)
- Analyze web traffic for intruders (anomaly detection)
Computational biology applications
- Evolutionary history/phylogeny
- Vaccine for AIDS
Evolutionary History/Phylogeny (Jojic, Jojic, Meek, Geiger, Siepel, Haussler, and Heckerman 2004)

[Phylogenetic tree over ~50 mammals (Donkey, Horse, Indian rhino, ..., Opossum, Platypus), grouped into clades: Perissodactyla, Carnivora, Cetartiodactyla, Rodentia 1, Hedgehogs, Rodentia 2, Primates, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia]
Probabilistic Model of Evolution

[Tree-structured graphical model: hidden ancestor nodes with observed leaves species1, species2, species3, ...]
Learning phylogeny from data
- For a given tree, find maximum likelihood parameters
- Search over structures to find the best likelihood (penalized to avoid overfitting)
Strong simplifying assumption
- Evolution at each DNA nucleotide is independent
- EM is computationally efficient

[One copy of the tree, with hidden nodes h and observed leaves x, per nucleotide position 1, 2, ..., N; no edges between positions]
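A tiny illustration of what independence buys (hypothetical per-site likelihoods): the log likelihood of the whole alignment is just the sum of per-site terms, so EM can treat each position's tree separately:

```python
import numpy as np

# Hypothetical p(site column | tree, params) for six independent sites
site_lik = np.array([0.12, 0.08, 0.30, 0.05, 0.22, 0.11])

# Independence across nucleotide positions => log p(alignment) = sum of site terms
log_lik = np.log(site_lik).sum()
print(f"log p(alignment) = {log_lik:.3f}")
```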
Relaxing the assumption
- Each substitution depends on the substitution at the previous position
- This structure captures context-specific effects during evolution
- EM is computationally intractable
[The same per-position trees, now with edges coupling each node to the corresponding node at the previous nucleotide position]
Variational approximation for inference

ln p(o|θ) = ln Σ_h p(o,h|θ)
          = ln Σ_h q(h|o) [ p(o,h|θ) / q(h|o) ]
          ≥ Σ_h q(h|o) ln [ p(o,h|θ) / q(h|o) ]

Lower bound good enough for an EM-like algorithm
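A numeric sketch of the bound (toy numbers of my own, assuming numpy): it holds for any q over the hidden variable and is tight when q(h) = p(h|o):

```python
import numpy as np

p_oh = np.array([0.3, 0.1, 0.05])     # p(o, h) for one observation o, hidden h in {0,1,2}
log_p_o = np.log(p_oh.sum())          # ln p(o)

def bound(q):
    return float(q @ (np.log(p_oh) - np.log(q)))   # sum_h q(h) ln[p(o,h)/q(h)]

q_bad = np.array([1 / 3, 1 / 3, 1 / 3])
q_opt = p_oh / p_oh.sum()             # exact posterior p(h|o)
print(f"ln p(o)          = {log_p_o:.4f}")
print(f"bound, uniform q = {bound(q_bad):.4f}  (lower)")
print(f"bound, q=p(h|o)  = {bound(q_opt):.4f}  (tight)")
```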
Two simple q distributions
- Product of trees
- Product of chains

[Diagrams: q factored as a product of per-position trees over hidden (h) and observed (o) nodes, or as a product of chains]
Things I didn’t have time to talk aboutThings I didn’t have time to talk about
Factor graphs, mixed graphs, etc.Factor graphs, mixed graphs, etc. Relational learning: PRMs, Plates, PERsRelational learning: PRMs, Plates, PERs Bayesian methods for learningBayesian methods for learning ScalabilityScalability Causal modelingCausal modeling Variational methodsVariational methods Non-parametric distributionsNon-parametric distributions
To learn more

Main conferences:
- Uncertainty in Artificial Intelligence (UAI)
- Neural Information Processing Systems (NIPS)