Learning Bayesian Network Structure from Massive Datasets:
The ``Sparse Candidate'' Algorithm
Nir Friedman, Dana Pe'er, Iftach Nachman
Institute of Computer Science
Hebrew University, Jerusalem
.
Learning Bayesian Network Structure (Complete Data)
Set a scoring function that evaluates networks
Find the highest-scoring network
This optimization problem is NP-hard [Chickering]
use heuristic search
[Figure: Data is fed to an Inducer, which outputs a Bayesian network over the variables B, E, R, A, C]
.
Our Contribution
We suggest a new heuristic:
Builds on simple ideas
Easy to implement
Can be combined with existing heuristic search procedures
Reduces learning time significantly
We also gain some insight into the complexity of the learning problem
.
Learning Bayesian Network Structure: Score
Various variants of scores; we focus here on the Bayesian score
[Cooper & Herskovits; Heckerman, Geiger & Chickering]
Key property for search: the score decomposes:

Score(G : D) = Σ_i Score(X_i : Pa_i^G | N(X_i, Pa_i^G))

where N(X_i, Pa_i^G) is the vector of counts of joint values of X_i and its parents in the data
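Decomposability means the score can be evaluated family by family from the count vectors. As an illustration (using the decomposable log-likelihood score rather than the BDe score from the slide; the toy dataset, variable names, and dict-of-rows representation are invented for this sketch):

```python
from collections import Counter
from math import log

def family_loglik(data, child, parents):
    """Log-likelihood contribution of one family (child given its parents),
    computed from the joint counts N(child, parents): sum over joint values
    of n * log(n / n_parents)."""
    joint = Counter((row[child], tuple(row[p] for p in parents)) for row in data)
    par = Counter(tuple(row[p] for p in parents) for row in data)
    return sum(n * log(n / par[pa]) for (_, pa), n in joint.items())

def score(data, structure):
    """A decomposable score is a sum of per-family terms."""
    return sum(family_loglik(data, x, pa) for x, pa in structure.items())

# Toy dataset over three binary variables, with B = A xor C
data = [{"A": a, "B": a ^ b, "C": b} for a in (0, 1) for b in (0, 1)] * 5
s_good = score(data, {"A": (), "B": ("A", "C"), "C": ()})
s_bad = score(data, {"A": (), "B": (), "C": ()})
```

Because B is a deterministic function of A and C in this data, the structure that gives B both parents scores strictly higher.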
.
Heuristic Search in Learning Networks
Search over network structures
Standard operations: add, delete, reverse an arc
Need to check acyclicity
Use standard search methods in this space: greedy hill climbing, simulated annealing, etc.
[Figure: a three-node network over A, B, C, with the moves Add A -> B, Reverse B -> C, and Remove B -> C shown]
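The greedy hill-climbing variant of this search can be sketched as follows. The scoring function here is an invented stand-in (it rewards arcs in a fixed target set); a real implementation would plug in a data score such as BDe:

```python
import itertools

def is_acyclic(structure):
    """Kahn-style cycle check on the {child: set(parents)} representation."""
    pending = {x: set(ps) for x, ps in structure.items()}
    while pending:
        roots = [x for x, ps in pending.items() if not ps]
        if not roots:
            return False          # every remaining node has a parent: cycle
        for x in roots:
            del pending[x]
        for ps in pending.values():
            ps.difference_update(roots)
    return True

def neighbors(structure):
    """Structures one operation away: add, delete, or reverse a single arc."""
    for y, x in itertools.permutations(structure, 2):
        s = {v: set(ps) for v, ps in structure.items()}
        if y in structure[x]:
            s[x].discard(y)
            yield s               # delete y -> x
            r = {v: set(ps) for v, ps in s.items()}
            r[y].add(x)
            yield r               # reverse y -> x into x -> y
        else:
            s[x].add(y)
            yield s               # add y -> x

def greedy_hill_climb(score, start):
    """Repeatedly move to the best-scoring acyclic neighbor until stuck."""
    current, best = start, score(start)
    while True:
        cand = max((n for n in neighbors(current) if is_acyclic(n)),
                   key=score, default=None)
        if cand is None or score(cand) <= best:
            return current, best
        current, best = cand, score(cand)

# Invented toy score: reward arcs in a target set, penalize all others
target = {("A", "B"), ("B", "C")}
def toy_score(s):
    arcs = {(p, x) for x, ps in s.items() for p in ps}
    return len(arcs & target) - len(arcs - target)

final, best = greedy_hill_climb(toy_score, {"A": set(), "B": set(), "C": set()})
```

Starting from the empty network, the climber adds the two target arcs one at a time and then stops, since every remaining move lowers the toy score.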
.
Computational Problem
Cost of evaluating a single move:
Collecting the counts N(x_i, pa_i) is O(M) (M = number of examples)
Using caching we can save some of these computations
Number of possible moves:
There are O(N^2) possible moves (N = number of variables)
After performing a move, O(N) new moves must be evaluated
Total: each iteration of greedy hill climbing costs O(MN)
Most of the time is spent evaluating irrelevant moves
.
Idea #1: Restrict to Few Candidates
For each X, select a small set of candidates C(X)
Consider arcs Y -> X only if Y is in C(X)
Example: C(A) = {B}, C(B) = {A}, C(C) = {A, B}
so B -> A and A -> C are allowed, but A -> C's reverse C -> A and the arc C -> B are not
If we restrict to k candidates for each variable:
only O(kN) possible moves for each network
in greedy hill climbing, only O(k) new moves to evaluate in each iteration
Cost of each iteration is O(Mk)
.
How to Select Candidates?
Simple proposal: rank candidates by mutual information with X

I(X; Y) = E[ log ( P(X,Y) / (P(X) P(Y)) ) ] = H(X) - H(X | Y)

This measures how many bits we can save in the encoding of X if we take Y into account
Select the top k ranking variables for C(X)
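This selection step admits a direct sketch: estimate I(X;Y) from counts for every pair and keep the top k per variable. The toy dataset below (where B copies A and C is independent) is invented for illustration:

```python
from collections import Counter
from math import log

def mutual_information(data, x, y):
    """Empirical I(X;Y) = sum_{x,y} p(x,y) log p(x,y)/(p(x)p(y)), in nats."""
    n = len(data)
    jxy = Counter((row[x], row[y]) for row in data)
    px = Counter(row[x] for row in data)
    py = Counter(row[y] for row in data)
    return sum(c / n * log(c * n / (px[a] * py[b]))
               for (a, b), c in jxy.items())

def select_candidates(data, variables, k):
    """C(X): the k variables with the highest mutual information with X."""
    return {x: sorted((y for y in variables if y != x),
                      key=lambda y: mutual_information(data, x, y),
                      reverse=True)[:k]
            for x in variables}

# Toy data: B copies A exactly; C varies independently of both
data = [{"A": a, "B": a, "C": c} for a in (0, 1) for c in (0, 1)] * 3
cands = select_candidates(data, ["A", "B", "C"], 1)
```

Here A and B pick each other as candidates, since their mutual information is log 2 while every pair involving C has mutual information 0.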
.
Effect of Candidate Number on Search
[Figure: score (BDe/M) vs. time (sec), curves for plain HC and for HC with k = 5, 10, 15 candidates; the C+L point and the empty network are also marked]
Text domain with 100 vars, 10,000 instances
Computation of all pairwise statistics
.
Problems with Candidate Selection
Fragment of the "alarm" network
[Figure: alarm-network fragment with nodes HR, ERRCAUTER, HRBP, HREKG, HRSAT, CATECHOL, SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, MINVOL, PVSAT, FIO2, ANAPHYLAXIS, INSUFFANESTH, TPR, LVFAILURE, ERRLOWOUTPUT, STROKEVOLUME, HYPOVOLEMIA, BP, CO, INTUBATION, PULMEMBOLUS, PAP, SHUNT]
.
Idea #2: Iteratively Improve Candidates
Once we have a partial understanding of the domain, we can use it to select new candidates:
"current" parents + most promising candidates given the current structure
Example: if INTUBATION is a parent of SHUNT, then MINVOL is less informative about SHUNT
[Figure: alarm-network fragment with nodes SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, MINVOL, PVSAT, FIO2, INTUBATION, PULMEMBOLUS, PAP, SHUNT]
.
Comparing Potential Candidates
Intuition: X should be Markov shielded by its parents Pa_X
Shielding: use conditional information. Does adding Y to X's parents improve prediction?

I(X; Y | Pa_X) = H(X | Pa_X) - H(X | Y, Pa_X)

I(X; Y | Pa_X) = 0 iff X is independent of Y given Pa_X
Score: use the difference in score

S(X; Y | Pa_X) = Score(X | Y, Pa_X) - Score(X | Pa_X)

Use Score(X | Y) as an estimate of -H(X | Y) in the generating distribution
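The shielding measure can be estimated from counts. A minimal sketch (the chain data W -> Z -> X is invented; Z should fully shield X from W):

```python
from collections import Counter
from math import log

def conditional_mi(data, x, y, pa):
    """Empirical I(X;Y | Pa) in nats, from the identity
    I(X;Y|Z) = sum p(x,y,z) log [ p(x,y,z) p(z) / (p(x,z) p(y,z)) ]."""
    n = len(data)
    c_xyz = Counter((row[x], row[y], tuple(row[p] for p in pa)) for row in data)
    c_xz = Counter((row[x], tuple(row[p] for p in pa)) for row in data)
    c_yz = Counter((row[y], tuple(row[p] for p in pa)) for row in data)
    c_z = Counter(tuple(row[p] for p in pa) for row in data)
    return sum(c / n * log(c * c_z[z] / (c_xz[(a, z)] * c_yz[(b, z)]))
               for (a, b, z), c in c_xyz.items())

# Toy chain: X copies Z, Z copies W
data = [{"W": w, "Z": w, "X": w} for w in (0, 1)] * 5
unshielded = conditional_mi(data, "X", "W", [])     # plain I(X;W), here log 2
shielded = conditional_mi(data, "X", "W", ["Z"])    # 0: Z shields X from W
```

So W looks highly informative about X on its own, but contributes nothing once the current parent Z is conditioned on, which is exactly the signal used to prune candidates.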
.
“Alarm” example revisited
[Figure: the full alarm-network fragment, with nodes SAO2, HRBP, HREKG, HRSAT, HR, ERRCAUTER, CATECHOL, EXPCO2, ARTCO2, VENTALV, VENTLUNG, ANAPHYLAXIS, MINVOL, PVSAT, FIO2, INSUFFANESTH, TPR, LVFAILURE, ERRLOWOUTPUT, STROKEVOLUME, HYPOVOLEMIA, BP, CO, INTUBATION, PULMEMBOLUS, PAP, SHUNT]
.
Alternative Criterion: Discrepancy
Idea: measure how well the network models the joint P(X,Y)
We can improve this prediction by making X a candidate parent of Y
Natural definition:

d(X; Y | B) = E[ log ( P(X,Y) / P_B(X,Y) ) ] = KL( P(X,Y) || P_B(X,Y) )

Note: if P_B(X,Y) = P(X) P(Y), then d(X; Y | B) = I(X; Y)
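A minimal sketch of the discrepancy, assuming the network's pairwise joint P_B(X,Y) is supplied as a table (in practice it would come from inference in B; the numbers below are invented):

```python
from math import log

def discrepancy(p_joint, pb_joint):
    """d(X;Y|B) = KL( P(X,Y) || P_B(X,Y) ), summed over joint value pairs."""
    return sum(p * log(p / pb_joint[xy]) for xy, p in p_joint.items() if p > 0)

# Empirical joint over two correlated binary variables
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
# A network with no path between X and Y predicts the product of marginals,
# so the discrepancy reduces to the mutual information I(X;Y)
pb_indep = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
d_indep = discrepancy(p, pb_indep)
d_perfect = discrepancy(p, p)      # a perfect model has zero discrepancy
```

A large d(X;Y|B) flags Y as a promising candidate parent for X, since the current network is modeling their joint behavior poorly.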
.
Text with 100 words
[Figure: score (BDe/M) vs. time (sec) for Greedy HC and for the Disc, Score, and Shld candidate-selection measures with k = 15]
.
Text with 200 words
[Figure: score (BDe/L) vs. time (sec) for Greedy HC and for the Disc, Score, and Shld candidate-selection measures with k = 15]
.
Cell Cycle (800 vars)
[Figure: score (BDe/L) vs. time (sec) for Greedy HC and for the Disc, Score, and Shld candidate-selection measures with k = 20; an inset zooms in on the 4,000-8,000 sec range]
.
Complexity of Structure Learning
Without restriction of the candidate sets:
Restricting |Pa_i| <= 1: the problem is easy [Chow & Liu; Heckerman et al.]
No restriction: the problem is NP-hard [Chickering], even when restricting |Pa_i| <= 2
We do not know of interesting intermediate problems
Such behavior is often called the "exponential cliff"
.
Complexity with Small Candidate Sets
In each iteration, we solve an optimization problem:
Given candidate sets C(X_1), ..., C(X_N), find the best-scoring network that respects these candidates
Is this problem easier than unconstrained structure learning?
.
Complexity with Small Candidate Sets
Theorem: if |C(X_i)| > 1, finding the best-scoring structure is NP-hard
But... the complexity grows gradually:
There is a parameter c such that the time complexity is
exponential in c
linear in N
Fix d: there is a polynomial procedure that can solve all instances with c < d
Similar situation in inference: exponential in the size of the largest clique in the triangulated graph, linear in N
.
Complexity Proof Outline
In fact, the algorithm is motivated by inference
Define the "candidate graph", where Y -> X if Y is in C(X)
Then create a clique tree (moralize & triangulate)
We then define a dynamic programming algorithm for constructing the best-scoring structure
Messages assign values to the different orderings of the variables in a separator
The ordering ensures acyclicity of the network

[Figure: clique tree with cliques {A,B,E,F}, {A,B,C,D}, {B,E,G} and separators {A,B}, {B,E}; example messages:

Order  Score        Order  Score
AB     -18.5        EB     -4.7
BA     -13.2        BE     -12.1
]
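This is not the clique-tree message-passing procedure of the proof; as a simpler illustration of the constrained problem itself, here is a brute-force dynamic program over subsets of variables (exponential in N, unlike the clique-tree algorithm, which is exponential only in c). It finds the best acyclic structure respecting the candidate sets by trying each variable as the "last one placed". The toy local score and candidate sets are invented:

```python
from itertools import combinations

def best_structure(variables, candidates, local_score):
    """Exact best DAG respecting candidate sets, by DP over subsets:
    best(S) = max over X in S of best(S \\ {X}) plus the best family
    score for X, with parents drawn from candidates[X] already in S."""
    var = list(variables)
    idx = {v: i for i, v in enumerate(var)}
    best = {0: (0.0, {})}                  # subset bitmask -> (score, structure)
    for size in range(1, len(var) + 1):
        for subset in combinations(range(len(var)), size):
            mask = sum(1 << i for i in subset)
            entry = None
            for i in subset:               # try X_i as the last variable placed
                x = var[i]
                prev_score, prev_struct = best[mask ^ (1 << i)]
                avail = [y for y in candidates[x]
                         if idx[y] != i and (mask >> idx[y]) & 1]
                fam_score, fam_pa = max(
                    ((local_score(x, ps), ps)
                     for r in range(len(avail) + 1)
                     for ps in combinations(avail, r)),
                    key=lambda t: t[0])
                total = prev_score + fam_score
                if entry is None or total > entry[0]:
                    struct = dict(prev_struct)
                    struct[x] = set(fam_pa)
                    entry = (total, struct)
            best[mask] = entry
    return best[(1 << len(var)) - 1]

# Invented toy local score: reward each family matching a target network
target = {"B": ("A",), "C": ("B",)}
def local_score(x, ps):
    return 1.0 if ps == target.get(x, ()) else 0.0

cands = {"A": [], "B": ["A", "C"], "C": ["B"]}
score, struct = best_structure(["A", "B", "C"], cands, local_score)
```

Acyclicity is guaranteed for free: each variable's parents are drawn only from variables placed before it, so the DP implicitly searches over orderings, just as the separator orderings do in the clique-tree version.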
.
Future Work
Quadratic cost of candidate selection:
The initial step requires O(N^2) pairwise statistics
Can we select candidates by looking at a smaller number, e.g., O(N log N), of pairwise statistics?
Choice of the number of candidates:
We used a fixed number of candidates
Can we decide on the candidate number more intelligently?
Deal with variables that have large in+out degree
Combine candidates with PDAG search
.
Summary
Heuristic for structure search:
Incorporates understanding of BNs into blind search
Drastically reduces the size of the search space
Faster search that requires fewer statistics
Empirical evaluation:
We present evaluation on several datasets
Variants of the algorithm were used in
[Boyen, Friedman & Koller] for temporal models with SEM
[Friedman, Getoor, Koller & Pfeffer] for relational models
Complexity analysis:
A computational subproblem where structure search might be tractable even beyond trees