Graphical Models
David Heckerman, Microsoft Research
Overview
- Intro to graphical models
  - Application: data exploration
  - Dependency networks → undirected graphs
  - Directed acyclic graphs ("Bayes nets")
- Applications
  - Clustering
  - Evolutionary history/phylogeny
Using classification/regression for data exploration

Goal: model p(target | inputs).

Decision tree:
[tree splitting on gender (male/female), then age (young/old), with leaves p(cust)=0.2, p(cust)=0.7, p(cust)=0.8]

Logistic regression:
log p(cust)/(1 - p(cust)) = 0.2 - 0.3*male + 0.4*old

Neural network:
[network diagram]
Conditional independence

Decision tree:
[same tree as above: split on gender, then age; leaves p(cust)=0.2, 0.7, 0.8]

The tree encodes conditional independencies, e.g.

p(cust | gender, age, month born) = p(cust | gender, age)

More generally: p(target | all inputs) = p(target | some inputs)
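A minimal sketch (not from the talk; assumes numpy, with entirely made-up data) of checking such an independence empirically by comparing the two conditionals:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: cust depends on gender and age, but not on birth month.
gender = rng.integers(0, 2, n)          # 0 = male, 1 = female
age = rng.integers(0, 2, n)             # 0 = young, 1 = old
month = rng.integers(0, 12, n)          # birth month, irrelevant by construction
p_cust = np.where(gender == 0, np.where(age == 0, 0.2, 0.7), 0.8)
cust = rng.random(n) < p_cust

# Compare p(cust | gender, age, month) with p(cust | gender, age).
mask = (gender == 0) & (age == 1)
full = cust[mask & (month == 3)].mean()   # additionally condition on one birth month
partial = cust[mask].mean()
print(f"p(cust | male, old, March) ~= {full:.3f}")
print(f"p(cust | male, old)        ~= {partial:.3f}  (close, as expected)")
```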
Learning conditional independence from data: model selection
- Cross validation
- Bayesian methods
- Penalized likelihood
- Minimum description length
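A minimal sketch of the first option, cross validation, used to choose tree complexity; scikit-learn and the toy data are my own choices, not the talk's:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy binary inputs: gender, age, and an irrelevant third variable
X = rng.integers(0, 2, size=(2000, 3))
y = (rng.random(2000) < np.where(X[:, 0] == 0, 0.2 + 0.5 * X[:, 1], 0.8)).astype(int)

# Deeper trees can split on the irrelevant input; CV should prefer depth 2.
for depth in (1, 2, 3, 5):
    score = cross_val_score(DecisionTreeClassifier(max_depth=depth), X, y, cv=5).mean()
    print(f"max_depth={depth}: mean CV accuracy {score:.3f}")
```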
Using classification/regression for data exploration
- Suppose you have thousands of variables and you're not sure about the interactions among those variables
- Build a classification/regression model for each variable, using the rest of the variables as inputs
- Example with three variables X, Y, and Z (trees below)
[Three decision trees:
 target X, inputs Y,Z: split on Y → p(x|y=0), p(x|y=1)
 target Z, inputs X,Y: split on Y → p(z|y=0), p(z|y=1)
 target Y, inputs X,Z: split on X, then Z → p(y|x=0,z=0), p(y|x=0,z=1), p(y|x=1)]
Summarize the trees with a single graph

[Same three trees as above, summarized as the graph X ↔ Y ↔ Z]
Dependency Network
- Build a classification/regression model for every variable given the other variables as inputs
- Construct a graph where
  - nodes correspond to variables
  - there is an arc from X to Y if X helps to predict Y
- The graph along with the individual classification/regression models is a "dependency network" (Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000)
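A minimal sketch of the construction, under my own assumptions (scikit-learn decision trees, with "X helps to predict Y" read as nonzero feature importance):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy binary data with a chain of dependence: X -> Y -> Z
x = rng.integers(0, 2, 5000)
y = (rng.random(5000) < np.where(x == 1, 0.9, 0.1)).astype(int)
z = (rng.random(5000) < np.where(y == 1, 0.8, 0.2)).astype(int)
data = np.column_stack([x, y, z])
names = ["X", "Y", "Z"]

arcs = []
for t in range(data.shape[1]):          # one classification model per target variable
    inputs = [i for i in range(data.shape[1]) if i != t]
    tree = DecisionTreeClassifier(max_depth=2).fit(data[:, inputs], data[:, t])
    for i, imp in zip(inputs, tree.feature_importances_):
        if imp > 0:                     # input helps predict the target -> draw an arc
            arcs.append((names[i], names[t]))
print("Dependency network arcs:", arcs)
```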
Example: TV viewing

Nielsen data, 2/6/95-2/19/95: ~400 shows, ~3000 viewers

           Age  Show1  Show2  Show3  ...
viewer 1    73    y      n      n
viewer 2    16    n      y      y
viewer 3    35    n      n      n
etc.

Goal: exploratory data analysis (acausal)
A bit of history
- Julian Besag (and others) invented dependency networks (under another name) in the mid 1970s
- But they didn't like them, because they could be inconsistent
A consistent dependency network

[The three trees above, with conditionals that all agree with a single joint distribution; graph X ↔ Y ↔ Z]
An inconsistent dependency network

[The same three trees, but with conditionals that cannot all come from any single joint distribution]
A bit of history
- Julian Besag (and others) invented dependency networks (under the name "Markov graphs") in the mid 1970s
- But they didn't like them, because they could be inconsistent
- So they used a property of consistent dependency networks to develop a new characterization of them
Conditional independence

[Three trees / graph X ↔ Y ↔ Z, as above]

X ⊥ Z | Y
Conditional independence in a dependency network
- Each variable is independent of all other variables given its immediate neighbors
Hammersley-Clifford Theorem (Besag 1974)
- Given a set of variables which has a positive joint distribution,
- where each variable is independent of all other variables given its immediate neighbors in some graph G,
- it follows that

  p(x) = ∏_{i=1}^{N} f_i(c_i)

  where c_1, c_2, ..., c_N are the maximal cliques in the graph G and the f_i are "clique potentials".
Example

X – Y – Z

p(x,y,z) = f_1(x,y) f_2(y,z)
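A small numeric check of this factorization (my own toy potentials, assuming numpy): normalize f_1·f_2 into a joint and verify that X ⊥ Z | Y holds:

```python
import numpy as np

f1 = np.array([[1.0, 2.0], [3.0, 4.0]])   # f1(x, y), any positive values
f2 = np.array([[5.0, 1.0], [2.0, 6.0]])   # f2(y, z)

# p(x,y,z) proportional to f1(x,y) * f2(y,z)
p = np.einsum("xy,yz->xyz", f1, f2)
p /= p.sum()

# Check X _|_ Z | Y: p(x,z|y) should equal p(x|y) p(z|y) for each y
for y in (0, 1):
    joint_y = p[:, y, :] / p[:, y, :].sum()          # p(x,z | y)
    prod = np.outer(joint_y.sum(1), joint_y.sum(0))  # p(x|y) p(z|y)
    print(f"y={y}: max deviation {np.abs(joint_y - prod).max():.2e}")
```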
Consistent dependency networks: directed arcs not needed

X ↔ Y ↔ Z   becomes   X – Y – Z

p(x,y,z) = f_1(x,y) f_2(y,z)
A bit of history
- Julian Besag (and others) invented dependency networks (under the name "Markov graphs") in the mid 1970s
- But they didn't like them, because they could be inconsistent
- So they used a property of consistent dependency networks to develop a new characterization of them
- "Markov Random Fields" aka "undirected graphs" were born
Inconsistent dependency networks aren't that bad
- They are *almost consistent*, because each classification/regression model is learned from the same data set (this can be formalized)
- They are easy to learn from data (build a separate classification/regression model for each variable)
- Conditional distributions (e.g., trees) are easier to understand than clique potentials
- Over the last decade, they have proven to be a very useful tool for data exploration
Shortcomings of undirected graphs
- Lack a generative story (e.g., Latent Dirichlet Allocation)
- Lack a causal story

[Example DAG: cold and lung cancer as causes, with children sore throat, cough, weight loss]
Solution: Build trees in some order

1. Target: X, inputs: none → p(x)
2. Target: Y, inputs: X → split on X: p(y|x=0), p(y|x=1)
3. Target: Z, inputs: X,Y → split on Y: p(z|y=0), p(z|y=1)

Resulting graph: X → Y → Z
Some orders are better than others
- Random orders
- Greedy search
- Monte-Carlo methods

X → Y → Z   vs.   X → Z → Y
Joint distribution is easy to obtain

1. Target: X, inputs: none → p(x)
2. Target: Y, inputs: X → p(y|x=0), p(y|x=1)
3. Target: Z, inputs: X,Y → p(z|y=0), p(z|y=1)

X → Y → Z

p(x) p(y|x) p(z|x,y) = p(x,y,z)
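A minimal sketch (numpy, toy numbers of my own) of assembling the joint from the ordered conditionals; here Z's tree splits only on Y, so p(z|x,y) reduces to p(z|y):

```python
import numpy as np

p_x = np.array([0.6, 0.4])                      # p(x)
p_y_x = np.array([[0.8, 0.2], [0.3, 0.7]])      # p(y|x), rows indexed by x
p_z_y = np.array([[0.9, 0.1], [0.25, 0.75]])    # p(z|y), rows indexed by y

# p(x,y,z) = p(x) p(y|x) p(z|x,y), with p(z|x,y) = p(z|y) here
joint = np.einsum("x,xy,yz->xyz", p_x, p_y_x, p_z_y)
print("sums to 1:", np.isclose(joint.sum(), 1.0))
print("p(x=1, y=0, z=1) =", joint[1, 0, 1])
```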
Directed Acyclic Graphs (aka Bayes Nets)

Many inventors: Wright 1921; Good 1961; Howard & Matheson 1976; Pearl 1982

p(x_1, ..., x_n) = ∏_i p(x_i | x_1, ..., x_{i-1}) = ∏_i p(x_i | parents(x_i))
The power of graphical models
- Easy to understand
- Useful for adding prior knowledge to an analysis (e.g., causal knowledge)
- The conditional independencies they express make inference more computationally efficient
Inference

1. Target: X, inputs: none → p(x)
2. Target: Y, inputs: X → p(y|x=0), p(y|x=1)
3. Target: Z, inputs: X,Y → p(z|y=0), p(z|y=1)

X → Y → Z

What is p(z|x=1)?
Inference: Example

W → X → Y → Z

p(z) = Σ_{w,x,y} p(w) p(x|w) p(y|x) p(z|y)
Inference: Example ("Elimination Algorithm")

W → X → Y → Z

p(z) = Σ_{w,x,y} p(w) p(x|w) p(y|x) p(z|y)
     = Σ_{w,x} p(w) p(x|w) Σ_y p(y|x) p(z|y)
Inference: Example ("Elimination Algorithm")

W → X → Y → Z

p(z) = Σ_{w,x,y} p(w) p(x|w) p(y|x) p(z|y)
     = Σ_{w,x} p(w) p(x|w) Σ_y p(y|x) p(z|y)
     = Σ_w p(w) Σ_x p(x|w) Σ_y p(y|x) p(z|y)
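A sketch of the elimination idea on this chain (toy tables of my own, assuming numpy): pushing the sums inward replaces one big sum over w, x, y with a sequence of small local sums, and both forms give the same p(z):

```python
import numpy as np

rng = np.random.default_rng(0)
def random_cpt(rows):                       # random conditional table p(col | row)
    t = rng.random((rows, 2))
    return t / t.sum(axis=1, keepdims=True)

p_w = np.array([0.5, 0.5])
p_x_w, p_y_x, p_z_y = random_cpt(2), random_cpt(2), random_cpt(2)

# Brute force: p(z) = sum_{w,x,y} p(w) p(x|w) p(y|x) p(z|y)
brute = np.einsum("w,wx,xy,yz->z", p_w, p_x_w, p_y_x, p_z_y)

# Elimination, as on the slide: sum out y first, then x, then w
g_xz = p_y_x @ p_z_y        # sum_y p(y|x) p(z|y)   -> table over (x, z)
g_wz = p_x_w @ g_xz         # sum_x p(x|w) g(x,z)   -> table over (w, z)
p_z = p_w @ g_wz            # sum_w p(w) g(w,z)
print(np.allclose(brute, p_z), p_z)
```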
Inference
- Inference is also important because it is the E step of the EM algorithm (when learning with missing data and/or hidden variables)
- Exact methods for inference that exploit conditional independence are well developed (e.g., Shachter; Lauritzen & Spiegelhalter; Dechter)
- Exact methods fail when there are many cycles in the graph; approximate alternatives include
  - MCMC (e.g., Geman and Geman 1984)
  - Loopy propagation (e.g., Murphy et al. 1999)
  - Variational methods (e.g., Jordan et al. 1999)
Applications of Graphical Models

DAGs and UGs: data exploration, density estimation, clustering
UGs: spatial processes
DAGs: expert systems, causal discovery
Applications
- Clustering
- Evolutionary history/phylogeny
Clustering

Example: msnbc.com (millions of users per day)

User  Sequence
1     frontpage news travel travel
2     news news news news news
3     frontpage news frontpage news frontpage
4     news news
5     frontpage news news travel travel travel
6     news weather weather weather weather weather
7     news health health business business business
8     frontpage sports sports sports weather
etc.

Goal: understand what is and isn't working on the site
Solution
- Cluster users based on their behavior on the site
- Display clusters somehow

[diagram: data → clustering → user clusters]
Generative model for clustering (e.g., AutoClass, Cheeseman & Stutz 1995)

[Discrete hidden Cluster node with arcs to observed nodes 1st page, 2nd page, 3rd page, ...]
Sequence Clustering (Cadez, Heckerman, Meek, & Smyth, 2000)

[Discrete hidden Cluster node over a chain of page variables 1st page → 2nd page → 3rd page → ..., with the Cluster node pointing at every page variable]
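A minimal sketch (mine, not the paper's code) of the resulting likelihood: each cluster carries its own first-order Markov chain over pages, and a user's sequence is scored under the mixture. All numbers here are hypothetical:

```python
import numpy as np

pages = {"frontpage": 0, "news": 1, "travel": 2}
# Hypothetical parameters for 2 clusters: mixture weights, initial and transition probs
weights = np.array([0.6, 0.4])
init = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]])            # init[c, page]
trans = np.array([[[0.1, 0.6, 0.3],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.2, 0.7]],
                  [[0.3, 0.6, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.4, 0.4]]])         # trans[c, from, to]

def sequence_likelihood(seq):
    """p(seq) = sum_c p(c) p(first|c) prod_t p(page_t | page_{t-1}, c)"""
    s = [pages[p] for p in seq]
    per_cluster = init[:, s[0]].copy()
    for a, b in zip(s, s[1:]):
        per_cluster *= trans[:, a, b]
    return weights @ per_cluster

print(sequence_likelihood(["frontpage", "news", "travel", "travel"]))
```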
Learning parameters (with missing data)

Principles:
- Find the parameters that maximize the (log) likelihood of the data
- Find the parameters whose posterior probability is a maximum
- Find distributions for quantities of interest by averaging over the unknown parameters

Gradient methods or the EM algorithm are typically used for the first two
Expectation-Maximization (EM) algorithm (Dempster, Laird, Rubin 1977)
- Initialize parameters (e.g., at random)
- Expectation step: compute probabilities for values of the unobserved variable using the current values of the parameters and the incomplete data [THIS IS INFERENCE]; reinterpret the data as a set of fractional cases based on these probabilities
- Maximization step: choose parameters so as to maximize the log likelihood of the fractional data
- The parameters will converge to a local maximum of log p(data)
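A compact sketch of the two steps for a 2-cluster mixture of independent binary variables (a toy model of my own, assuming numpy); the E step is exactly the inference step, and the M step refits the parameters to the fractional cases:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy binary data generated from two hidden clusters
true_p = np.array([[0.9, 0.1, 0.8], [0.2, 0.7, 0.3]])
z = rng.integers(0, 2, 500)
data = (rng.random((500, 3)) < true_p[z]).astype(float)

pi = np.array([0.5, 0.5])                        # initialize parameters
p = rng.random((2, 3))
for _ in range(50):
    # E step (inference): responsibility of each cluster for each case
    lik = np.prod(np.where(data[:, None, :] == 1, p, 1 - p), axis=2) * pi
    resp = lik / lik.sum(axis=1, keepdims=True)  # fractional cases
    # M step: maximize expected log likelihood of the fractional data
    pi = resp.mean(axis=0)
    p = (resp.T @ data) / resp.sum(axis=0)[:, None]
print("mixture weights:", np.round(pi, 2))
print("cluster Bernoulli params:\n", np.round(p, 2))
```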
E-step

Suppose the cluster model has 2 clusters, and that

  p(cluster=1 | case, current params) = 0.7
  p(cluster=2 | case, current params) = 0.3

Then, write

  q(case) = 0.7 log p(case, cluster=1 | params) + 0.3 log p(case, cluster=2 | params)

Do this for each case and then find the parameters that maximize Q = Σ_case q(case). These parameters also increase the log likelihood.
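The same arithmetic in a few lines (case probabilities from the slide; the complete-data probabilities are invented for illustration):

```python
import numpy as np

# p(cluster | case, current params) from the E step, as on the slide
r = np.array([0.7, 0.3])
# Hypothetical complete-data probabilities p(case, cluster | params)
p_case_cluster = np.array([0.12, 0.05])

q_case = float(r @ np.log(p_case_cluster))   # 0.7 log p(...,1) + 0.3 log p(...,2)
print(f"q(case) = {q_case:.3f}")             # summed over cases this gives Q
```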
Demo: SQL Server 2005

Example: msnbc.com [same user-sequence data as above]
Sequence clustering

Other applications at Microsoft:
- Analyze how people use programs (e.g., Office)
- Analyze web traffic for intruders (anomaly detection)
Computational biology applications
- Evolutionary history/phylogeny
- Vaccine for AIDS
Evolutionary History/Phylogeny (Jojic, Jojic, Meek, Geiger, Siepel, Haussler, and Heckerman 2004)

[Phylogenetic tree over ~50 mammals (Donkey, Horse, Indian rhino, ..., Opossum, Platypus), grouped into clades: Perissodactyla, Carnivora, Cetartiodactyla, Rodentia 1, Hedgehogs, Rodentia 2, Primates, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia]
Probabilistic Model of Evolution

[Tree-structured graphical model: hidden ancestor nodes with observed leaves species1, species2, species3, ...]
Learning phylogeny from data
- For a given tree, find maximum likelihood parameters
- Search over structures to find the best likelihood (penalized to avoid overfitting)
Strong simplifying assumption
- Evolution at each DNA nucleotide is independent
- EM is computationally efficient

[One copy of the tree, with hidden nodes h and observed leaves x, per nucleotide position 1, 2, ..., N; no edges between positions]
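A tiny illustration of what independence buys (hypothetical per-site likelihoods): the log likelihood of the whole alignment is just the sum of per-site terms, so EM can treat each position's tree separately:

```python
import numpy as np

# Hypothetical p(site column | tree, params) for six independent sites
site_lik = np.array([0.12, 0.08, 0.30, 0.05, 0.22, 0.11])

# Independence across nucleotide positions => log p(alignment) = sum of site terms
log_lik = np.log(site_lik).sum()
print(f"log p(alignment) = {log_lik:.3f}")
```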
Relaxing the assumption
- Each substitution depends on the substitution at the previous position
- This structure captures context-specific effects during evolution
- EM is computationally intractable
[The same per-position trees, now with edges coupling each node to the corresponding node at the previous nucleotide position]
Variational approximation for inference

ln p(o|θ) = ln Σ_h p(o,h|θ)
          = ln Σ_h q(h|o) [ p(o,h|θ) / q(h|o) ]
          ≥ Σ_h q(h|o) ln [ p(o,h|θ) / q(h|o) ]

Lower bound good enough for an EM-like algorithm
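A numeric sketch of the bound (toy numbers of my own, assuming numpy): it holds for any q over the hidden variable and is tight when q(h) = p(h|o):

```python
import numpy as np

p_oh = np.array([0.3, 0.1, 0.05])     # p(o, h) for one observation o, hidden h in {0,1,2}
log_p_o = np.log(p_oh.sum())          # ln p(o)

def bound(q):
    return float(q @ (np.log(p_oh) - np.log(q)))   # sum_h q(h) ln[p(o,h)/q(h)]

q_bad = np.array([1 / 3, 1 / 3, 1 / 3])
q_opt = p_oh / p_oh.sum()             # exact posterior p(h|o)
print(f"ln p(o)          = {log_p_o:.4f}")
print(f"bound, uniform q = {bound(q_bad):.4f}  (lower)")
print(f"bound, q=p(h|o)  = {bound(q_opt):.4f}  (tight)")
```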
Two simple q distributions
- Product of trees
- Product of chains

[Diagrams: q factored as a product of per-position trees over hidden (h) and observed (o) nodes, or as a product of chains]
Things I didn’t have time to talk aboutThings I didn’t have time to talk about
Factor graphs, mixed graphs, etc.Factor graphs, mixed graphs, etc. Relational learning: PRMs, Plates, PERsRelational learning: PRMs, Plates, PERs Bayesian methods for learningBayesian methods for learning ScalabilityScalability Causal modelingCausal modeling Variational methodsVariational methods Non-parametric distributionsNon-parametric distributions
To learn more

Main conferences:
- Uncertainty in Artificial Intelligence (UAI)
- Neural Information Processing Systems (NIPS)