Information Bottleneck EM
School of Engineering & Computer Science
The Hebrew University, Jerusalem, Israel
Gal Elidan and Nir Friedman
Problem: no closed-form solution for ML estimation → use Expectation Maximization (EM)
Problem: EM gets stuck in inferior local maxima → common remedies: random restarts, deterministic annealing, simulated annealing
Learning with Hidden Variables
Input: DATA — instances over X1 … XN, with the hidden variable T unobserved
Output: a model P(X,T)
[Figure: likelihood as a function of the parameters, with multiple local maxima]
Our approach: EM + information regularization for learning parameters
[Figure: naive Bayes structure — hidden T as parent of X1, X2, X3]
Learning Parameters (fully observed case)
Input: DATA — M complete instances over X1 … XN
Output: a model P(X)
Empirical distribution: Q(x1, x2, x3) = #(x1, x2, x3) / M
Parametrization of P (for the structure X1 → X2, X1 → X3):
P(X1) = Q(X1), P(X2|X1) = Q(X2|X1), P(X3|X1) = Q(X3|X1)
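A minimal sketch of this counting step for discrete data stored as a list of tuples; the function and variable names are illustrative, not from the talk:

```python
from collections import Counter

def empirical_distribution(data):
    """Q(x1, ..., xN) = #(x1, ..., xN) / M, where M is the number of instances."""
    M = len(data)
    counts = Counter(map(tuple, data))
    return {assignment: c / M for assignment, c in counts.items()}

# Example: three binary variables, M = 4 instances
data = [(0, 1, 1), (0, 1, 1), (1, 0, 1), (0, 0, 0)]
Q = empirical_distribution(data)
print(Q[(0, 1, 1)])  # 0.5
```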
Learning with Hidden Variables
Input: DATA — instances 1 … M over X1 … XN, with the values of T unobserved; Y denotes the instance ID
Empirical distribution Q(X,T) = ?
Desired structure: the hidden T as parent of X1, X2, X3
EM iterations: use a guess of the parametrization for P
Empirical distribution: Q(X,T,Y) = Q(X,Y) · Q(T|Y)
The EM Algorithm
Input: for each instance ID y, a (soft) completion of the value of T
E-Step: generate the empirical distribution by setting Q(T|y) = P(T | x[y])
M-Step: maximize P via argmax_P E_Q[log P(T,X)]
EM is equivalent to optimizing a functional of Q and P; each step increases the value of the functional
EM Functional [Neal and Hinton, 1998]:
F[Q,P] = E_Q[log P(X,T)] + H_Q(T|Y)
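A minimal sketch of these two steps for a naive Bayes model with one hidden class variable T and binary observed X_i, assuming tabular CPDs; all function and variable names here are illustrative, not from the talk:

```python
import numpy as np

def e_step(X, prior, cpds):
    """Q(t|y) = P(t | x[y]): posterior of T for each instance y."""
    # log P(t, x[y]) = log P(t) + sum_i log P(x_i[y] | t)
    log_joint = np.log(prior)[None, :] + \
        X @ np.log(cpds).T + (1 - X) @ np.log(1 - cpds).T   # shape (M, |T|)
    log_joint -= log_joint.max(axis=1, keepdims=True)
    Q = np.exp(log_joint)
    return Q / Q.sum(axis=1, keepdims=True)

def m_step(X, Q):
    """argmax_P E_Q[log P(T, X)]: expected-count re-estimation of the CPDs."""
    prior = Q.mean(axis=0)                                   # P(T = t)
    cpds = (Q.T @ X) / Q.sum(axis=0)[:, None]                # P(X_i = 1 | T = t)
    return prior, np.clip(cpds, 1e-6, 1 - 1e-6)

# X: M instances of N binary features; T has 2 states
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5)).astype(float)
prior, cpds = np.full(2, 0.5), rng.uniform(0.3, 0.7, size=(2, 5))
for _ in range(50):                                          # EM iterations
    Q = e_step(X, prior, cpds)
    prior, cpds = m_step(X, Q)
```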
Information Bottleneck EM
Target: L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
where F[Q,P] is the EM target and I_Q(T;Y) is the information between the hidden variable and the instance ID
In the rest of the talk: understanding this objective, and how to use it to learn better models
Information Regularization
Motivating idea:
• Fit the training data: set T to be the instance ID so that it "predicts" X
• Generalization: "forget" the ID and keep only the essence of X
Objective: (a lower bound of) the likelihood of P vs. compression of the instance ID
L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
Parameter-free regularization of Q [Tishby et al., 1999]
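The compression term can be computed directly from Q. A minimal sketch, assuming Q(T|Y) is stored as an M×|T| array with uniform Q(y) = 1/M; the function names are illustrative:

```python
import numpy as np

def information_T_Y(Q_t_given_y):
    """I_Q(T;Y) = sum_y Q(y) sum_t Q(t|y) log [Q(t|y) / Q(t)], with Q(y) = 1/M."""
    M = Q_t_given_y.shape[0]
    Q_t = Q_t_given_y.mean(axis=0)                       # marginal Q(t)
    eps = 1e-12                                          # guard against log(0)
    terms = Q_t_given_y * (np.log(Q_t_given_y + eps) - np.log(Q_t + eps))
    return terms.sum() / M

def ib_em_objective(gamma, F_QP, Q_t_given_y):
    """L_IB-EM = gamma * F[Q,P] - (1 - gamma) * I_Q(T;Y)."""
    return gamma * F_QP - (1 - gamma) * information_T_Y(Q_t_given_y)
```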
Clustering example (γ = 0): total compression
[Figure: data points 1-11 collapse into a single cluster]
L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
At γ = 0 only the compression measure matters, not the EM target.
Clustering example (γ = 1): total preservation — T becomes the instance ID
[Figure: data points 1-11, one cluster per point]
L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
At γ = 1 only the EM target matters, not the compression measure.
Clustering example (intermediate γ = ?): the desired solution with |T| = 2
[Figure: data points 1-11 grouped into two coherent clusters]
L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
At an intermediate γ, the tradeoff between the EM target and the compression measure yields the desired clustering.
Information Bottleneck EM
Formal equivalence with the Information Bottleneck:
at γ = 1, EM and the Information Bottleneck coincide [generalizing a result of Slonim and Weiss for the univariate case]
L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y), where F[Q,P] is the EM functional
Information Bottleneck EM
Formal equivalence with the Information Bottleneck:
the maximum of L_IB-EM over Q(T|Y) is obtained when
Q(t|y) = (1/Z(y,γ)) · Q(t)^(1−γ) · P(t|x[y])^γ
where P(t|x[y]) is the prediction of T using P, Q(t) is the marginal of T in Q, and Z(y,γ) is a normalization constant.
The IB-EM Algorithm (for fixed γ)
Iterate until convergence:
• E-Step: maximize L_IB-EM by optimizing Q, setting Q(t|y) = (1/Z(y,γ)) · Q(t)^(1−γ) · P(t|x[y])^γ
• M-Step: maximize L_IB-EM by optimizing P via argmax_P E_Q[log P(T,X)] (same as the standard M-step)
Each step improves L_IB-EM; the algorithm is guaranteed to converge.
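A minimal sketch of this E-step fixed point for a single hidden variable, reusing the kind of model posterior computed in the earlier EM sketch; the helper names are illustrative:

```python
import numpy as np

def ib_e_step(P_t_given_x, gamma, n_iters=20):
    """Iterate Q(t|y) = (1/Z(y,gamma)) * Q(t)^(1-gamma) * P(t|x[y])^gamma to a fixed point.

    P_t_given_x: M x |T| array of model posteriors P(t | x[y]).
    """
    M, K = P_t_given_x.shape
    Q = np.full((M, K), 1.0 / K)                 # start from the uniform completion
    for _ in range(n_iters):
        Q_t = Q.mean(axis=0)                     # marginal Q(t), with Q(y) = 1/M
        Q = (Q_t[None, :] ** (1 - gamma)) * (P_t_given_x ** gamma)
        Q /= Q.sum(axis=1, keepdims=True)        # Z(y, gamma) normalization
    return Q
```

At γ = 1 this reduces to the standard E-step Q(t|y) = P(t|x[y]); at γ = 0 it pushes Q(t|y) towards the marginal Q(t), i.e. total compression.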
Information Bottleneck EM (recap)
Target: L_IB-EM = γ·F[Q,P] − (1−γ)·I_Q(T;Y)
where F[Q,P] is the EM target and I_Q(T;Y) is the information between the hidden variable and the instance ID
In the rest of the talk: how to use this objective to learn better models
Continuation
Follow the ridge of L_IB-EM from the easy optimum at γ = 0 towards the hard problem at γ = 1
[Figure: L_IB-EM as a surface over (Q, γ), with a ridge running from γ = 0 to γ = 1]
Continuation
Recall: if Q is a local maximum of L_IB-EM, then for all t and y
G_{t,y}(Q, γ) ≡ Q(t|y) − (1/Z(y,γ)) · Q(t)^(1−γ) · P(t|x[y])^γ = 0
We want to follow a path in (Q, γ) space along which G_{t,y}(Q, γ) = 0 for all t and y, so that Q remains a local maximum for all γ.
Continuation Step
1. Start at (Q, γ) such that G_{t,y}(Q, γ) = 0
2. Compute the gradient of G
3. Choose the step direction from the gradient (one numerical realization is sketched below)
4. Take a step in the desired direction
[Figure: a step along the ridge in (Q, γ) space, starting from the γ = 0 optimum]
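One way to realize steps 2-3 numerically is via the implicit function theorem: along the ridge {G(Q, γ) = 0}, a tangent direction satisfies ∂G/∂Q · dQ + ∂G/∂γ · dγ = 0. The sketch below uses finite-difference Jacobians and a flattened Q; this is only one possible realization under those assumptions, not necessarily the talk's exact construction, and all names are illustrative:

```python
import numpy as np

def continuation_direction(G, Q, gamma, eps=1e-6):
    """Tangent to the ridge {G(Q, gamma) = 0}: solve dG/dQ . dQ = -dG/dgamma . dgamma.

    G: function mapping (flat Q vector, gamma) -> flat vector of residuals G_{t,y}.
    Returns a unit direction (dQ, dgamma) in (Q, gamma) space, with dgamma > 0.
    """
    q = Q.ravel()
    g0 = G(q, gamma)
    n = q.size
    # Finite-difference Jacobian of G with respect to Q
    J_Q = np.empty((g0.size, n))
    for i in range(n):
        q_pert = q.copy()
        q_pert[i] += eps
        J_Q[:, i] = (G(q_pert, gamma) - g0) / eps
    # Finite-difference derivative of G with respect to gamma
    J_gamma = (G(q, gamma + eps) - g0) / eps
    # For a unit move in gamma, correct Q so as to stay on the ridge
    dQ = np.linalg.lstsq(J_Q, -J_gamma, rcond=None)[0]
    direction = np.concatenate([dQ, [1.0]])
    return direction / np.linalg.norm(direction)
```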
Staying on the ridge
Potential problem: the direction is tangent to the path, so a finite step can drift off the ridge and miss the optimum
Solution: use EM steps to regain the path
[Figure: trajectory in (Q, γ) space drifting off the ridge and being pulled back by EM steps]
The IB-EM Algorithm
Set γ = 0 (start at the easy solution)
Iterate until γ = 1 (the EM solution is reached):
• Iterate (stay on the ridge):
  E-Step: maximize L_IB-EM by optimizing Q
  M-Step: maximize L_IB-EM by optimizing P
• γ-Step (follow the ridge): compute the gradient and direction, then take the step by changing γ and Q
[Figure: the resulting trajectory in (Q, γ) space from γ = 0 to γ = 1]
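Putting the pieces together, a schematic of this outer loop, assuming helpers along the lines of those sketched above (ib_e_step, an M-step, a model-posterior routine, and a step-size rule); all names are illustrative and the convergence tests are simplified:

```python
def ib_em(X, init_params, model_posterior, m_step, choose_step):
    """Schematic IB-EM outer loop: follow the ridge from gamma = 0 to gamma = 1."""
    gamma, params = 0.0, init_params
    while gamma < 1.0:
        # Stay on the ridge: alternate E- and M-steps at the current gamma
        for _ in range(25):                       # in practice: until convergence
            P_t_given_x = model_posterior(X, params)
            Q = ib_e_step(P_t_given_x, gamma)     # fixed point of Q(t|y)
            params = m_step(X, Q)
        # Follow the ridge: increase gamma (and adjust Q along the tangent direction)
        gamma = min(1.0, gamma + choose_step(Q, gamma))
    return params, Q
```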
Calibrating the step size
Potential problem:
• step size too small → too slow
• step size too large → overshoot the target → inferior solution
Use the change in I(T;Y):
• I(T;Y) measures the compression of the instance ID; when I(T;Y) rises, more of the data is captured
• Non-parametric: involves only Q
• Can be bounded: I(T;Y) ≤ log2 |T|
[Figure: I(T;Y) as a function of γ — a naive uniform spacing of γ values is too sparse in the "interesting" area where I(T;Y) changes rapidly]
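A minimal sketch of one such adaptive rule, halving the γ step whenever the induced jump in I(T;Y) exceeds a fraction of its log |T| bound. The threshold, the helper arguments, and the halving policy are illustrative assumptions, not the talk's exact calibration scheme:

```python
import numpy as np

def calibrated_step(Q, gamma, trial_step, information_T_Y, recompute_Q,
                    max_jump=0.05):
    """Shrink the gamma step until the induced change in I(T;Y) is small enough.

    max_jump is a fraction of the bound I(T;Y) <= log |T|
    (natural log here, to match information_T_Y above; log2 |T| in bits).
    """
    K = Q.shape[1]
    bound = np.log(K)
    I_old = information_T_Y(Q)
    step = trial_step
    while step > 1e-4:
        Q_new = recompute_Q(gamma + step)         # re-solve the E-step at the trial gamma
        if abs(information_T_Y(Q_new) - I_old) <= max_jump * bound:
            return step, Q_new
        step /= 2                                 # overshoot: halve the step and retry
    return step, recompute_Q(gamma + step)
```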
The Stock Dataset [Boyen et al., 1999]
Naive Bayes model; daily changes of 20 NASDAQ stocks; 1213 training and 303 test instances
[Figure: I(T;Y) and train likelihood as functions of γ for IB-EM vs. the best of the EM runs; markers indicate the evaluated values of γ]
• IB-EM outperforms the best of the EM solutions
• I(T;Y) follows the changes of the likelihood
• Continuation approximately follows the region of change
Multiple Hidden Variables
We want to learn a model with many hidden variables (T_1, …, T_K)
Naive approach: potentially exponential in the number of hiddens
Variational approximation: use a factorized form (Mean Field):
Q(T|Y) = Q(T_1, …, T_K | Y) ≈ ∏_i Q(T_i | Y)
L_IB-EM = (Variational EM) − (1−γ)·(Regularization) [Friedman et al., 2002]
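A minimal sketch of the factorized representation, showing why the joint posterior over all hidden variables never has to be materialized; array shapes and names are illustrative assumptions:

```python
import numpy as np

def factored_posterior(Q_factors, y, assignment):
    """Mean-field Q(t_1, ..., t_K | y) = prod_i Q_i(t_i | y).

    Q_factors: list of K arrays, each M x |T_i|, one per hidden variable.
    assignment: one value t_i per hidden variable.
    """
    p = 1.0
    for Q_i, t_i in zip(Q_factors, assignment):
        p *= Q_i[y, t_i]
    return p

# 3 hidden binary variables, 5 instances: only K small tables are stored
rng = np.random.default_rng(0)
Q_factors = []
for _ in range(3):
    q = rng.uniform(size=(5, 2))
    Q_factors.append(q / q.sum(axis=1, keepdims=True))
print(factored_posterior(Q_factors, y=0, assignment=(1, 0, 1)))
```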
The USPS Digits dataset (400 samples, 21 hidden variables)
[Figure: test log-loss per instance vs. percentage of random runs — Mean Field EM (1 min/run), single IB-EM (27 min), exact EM (25 min/run)]
• Superior to all Mean Field EM runs; running time ≈ a single exact EM run
• 3/50 EM runs reach the IB-EM result: EM needs ×17 the time for similar results
• Offers good value for your time!
The Yeast Stress Response dataset
173 experiments (variables), 6152 genes (samples), 25 hidden variables
[Figure: model structure with hidden variables over 5-24 experiments; test log-loss per instance vs. percentage of random runs — Mean Field EM (~0.5 hours/run), IB-EM (~6 hours), exact EM (>60 hours)]
• Superior to all Mean Field EM runs; an order of magnitude faster than exact EM
• Effective when the exact solution becomes intractable!
Summary
• New framework for learning hidden variables
• Formal relation of the Information Bottleneck and EM
• Continuation for bypassing local maxima
• Flexible: structure / variational approximation

Future Work
• Learn the optimal γ ≤ 1 for better generalization
• Explore other approximations of Q(T|Y)
• Model selection: learning cardinality and enriching structure
Relation to Weight Annealing [Elidan et al., 2002]
Weight Annealing: init temp = hot; iterate until temp = cold — perturb the instance weights W proportionally to temp, use the reweighted Q_W and optimize, then cool down
Similarities: both change the empirical Q and morph towards the EM solution
Differences: IB-EM uses information regularization and continuation; WA requires a cooling policy; WA is applicable to a wider range of problems
Relation to Deterministic Annealing
Deterministic Annealing: init temp = hot; iterate until temp = cold — "insert" entropy proportional to temp into the model, optimize the noisy model, then cool down
Similarities: both use an information measure and morph towards the EM solution
Differences: DA is parameterization dependent; IB-EM uses continuation; DA requires a cooling policy; DA is applicable to a wider range of problems