Exponentiated Gradient versus Gradient Descent for Linear Predictors
Jyrki Kivinen and Manfred Warmuth
Presented By: Maitreyi N
Linear Predictors
A good linear predictor will satisfy the bound:

Loss_L(A, S) = O( inf_{u∈U} Loss_L(u, S) )

The bound can be improved to:

Loss_L(A, S) = (1 + o(1)) · inf_{u∈U} Loss_L(u, S)

where o(1) → 0 as ℓ → ∞.
Gradient Descent
This algorithm uses the update rule:

w_{t+1} = w_t − 2η(ŷ_t − y_t) x_t

This is the gradient of the Squared Euclidean Distance:

d(w, s) = ½ ‖w − s‖₂²
Exponentiated Gradient
This algorithm uses the update rule:

w_{t+1,i} = w_{t,i} r_{t,i} / Σ_{j=1..N} w_{t,j} r_{t,j}

(for the square loss, r_{t,i} = e^{−2η(ŷ_t − y_t) x_{t,i}})

This is the gradient of the Relative Entropy:

d_re(w, s) = Σ_{i=1..N} w_i ln(w_i / s_i)
Algorithm GDL(s, η)
Parameters:
  L: a loss function from R × R to [0, ∞),
  s: a start vector in R^N, and
  η: a learning rate in [0, ∞).
Initialization: Before the first trial, set w1=s.
Prediction: Upon receiving the t th instance xt, give the prediction ŷt=wt • xt .
Update: Upon receiving the t th outcome yt, update the weights according to the rule
wt+1=wt - η L'yt(ŷt) xt .
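The GD_L algorithm above can be sketched in Python for the square loss L(y, ŷ) = (ŷ − y)², so that L′_y(ŷ) = 2(ŷ − y). The function name and test data are illustrative, not from the paper:

```python
import numpy as np

def gd_square_loss(xs, ys, s, eta):
    """Run GD_L with the square loss over trials (x_t, y_t)."""
    w = np.asarray(s, dtype=float).copy()      # w_1 = s
    total_loss = 0.0
    for x, y in zip(xs, ys):
        y_hat = w @ x                          # prediction: yhat_t = w_t . x_t
        total_loss += (y_hat - y) ** 2
        w = w - 2 * eta * (y_hat - y) * x      # w_{t+1} = w_t - eta L'_{y_t}(yhat_t) x_t
    return total_loss, w
```

Running the function a second time from the returned weights should incur a smaller cumulative loss on consistent data, since the weights have moved toward the target.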
Algorithm EGL(s, η)
Parameters:
  L: a loss function from R × R to [0, ∞),
  s: a start vector with Σ_{i=1..N} s_i = 1, and
  η: a learning rate in [0, ∞).
Initialization: Before the first trial, set w1=s.
Prediction: Upon receiving the t th instance xt, give the prediction ŷt=wt • xt .
Update: Upon receiving the t th outcome yt, update the weights according to the rule
w_{t+1,i} = w_{t,i} r_{t,i} / Σ_{j=1..N} w_{t,j} r_{t,j} , where r_{t,i} = e^{−η L′yt(ŷt) x_{t,i}} .
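A matching Python sketch of EG_L for the square loss, where the multiplicative factors become r_{t,i} = e^{−2η(ŷ_t − y_t) x_{t,i}}; names and test data are illustrative:

```python
import numpy as np

def eg_square_loss(xs, ys, s, eta):
    """Run EG_L with the square loss; weights stay positive and sum to 1."""
    w = np.asarray(s, dtype=float).copy()          # w_1 = s, with sum(s) == 1
    total_loss = 0.0
    for x, y in zip(xs, ys):
        y_hat = w @ x
        total_loss += (y_hat - y) ** 2
        r = np.exp(-2 * eta * (y_hat - y) * x)     # r_{t,i} for the square loss
        w = w * r / np.sum(w * r)                  # multiplicative update, renormalize
    return total_loss, w
```

Note that the normalization keeps the weight vector on the probability simplex throughout, which is why EG alone can only represent positive concepts.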
EG± : EG with negative weights
EG is analogous to the Weighted Majority Algorithm:
  Uses multiplicative update rules
  Is based on minimizing relative entropy
  Unfortunately, it can represent only positive concepts
EG± can represent any concept in the entire sample space.
It has proven relative bounds; absolute bounds are not proven.
Works by splitting the weight vector into positive and negative weights, with separate update rules.
EG± Algorithm:
Update:
EG±
EG: Update rule (square loss)

w_{t+1,i} = w_{t,i} r_{t,i} / Σ_{j=1..N} w_{t,j} r_{t,j} ,  where r_{t,i} = e^{−2η(ŷ_t − y_t) x_{t,i}}

EG±: Update rule (square loss)

w⁺_{t+1,i} = w⁺_{t,i} r⁺_{t,i} / Σ_{j=1..N} (w⁺_{t,j} r⁺_{t,j} + w⁻_{t,j} r⁻_{t,j}) ,  where r⁺_{t,i} = e^{−2η(ŷ_t − y_t) x_{t,i}}

The negative weights w⁻ are updated analogously with r⁻_{t,i} = 1 / r⁺_{t,i}, and the prediction is ŷ_t = (w⁺_t − w⁻_t) • x_t.
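One EG± trial can be sketched as a single step in Python (square loss, with the total weight normalized to 1 rather than to a scale U; names are illustrative):

```python
import numpy as np

def eg_pm_step(w_pos, w_neg, x, y, eta):
    """One EG± trial for the square loss; total weight stays normalized to 1."""
    y_hat = (w_pos - w_neg) @ x                   # prediction uses the difference
    r_pos = np.exp(-2 * eta * (y_hat - y) * x)    # r+_{t,i}
    r_neg = 1.0 / r_pos                           # r-_{t,i} = 1 / r+_{t,i}
    z = np.sum(w_pos * r_pos + w_neg * r_neg)     # shared normalizer over both halves
    return w_pos * r_pos / z, w_neg * r_neg / z
```

Because both halves share one normalizer, the combined weight mass is conserved while the effective weights w⁺ − w⁻ can take either sign.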
Variable Learning Rates
GDV
Weight update rule becomes:

w_{t+1} = w_t − 2η(ŷ_t − y_t) x_t / ‖x_t‖₂²

EGV±
Weight update rule becomes:

r⁺_{t,i} = exp( −2η(ŷ_t − y_t) x_{t,i} / (U ‖x_t‖∞²) ) ,  r⁻_{t,i} = 1 / r⁺_{t,i}
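The GDV update can be sketched as follows (illustrative Python, square loss). A useful sanity check on the scaling: with η = 1/2 the updated weights predict the current instance exactly, since the correction (ŷ − y) x / ‖x‖₂² removes the whole residual along x.

```python
import numpy as np

def gdv_step(w, x, y, eta):
    """GDV update: the gradient step is rescaled by 1/||x_t||_2^2."""
    y_hat = w @ x
    return w - 2 * eta * (y_hat - y) * x / np.dot(x, x)
```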
Approximated EG Algorithms
Use the approximation:

e^{−av} ≈ e^{−av₀} (1 − a(v − v₀))

So the update rule becomes:

w_{t+1,i} = w_{t,i} (1 − η L′yt(ŷt)(x_{t,i} − ŷ_t))
The approximation leads to oscillation of the weight vector for certain weight distributions
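A sketch of the approximated update for the square loss (illustrative Python). The comment notes why no explicit renormalization is needed, and why the update can misbehave:

```python
import numpy as np

def approx_eg_step(w, x, y, eta):
    """Approximated EG update for the square loss.

    Since sum_i w_i (x_i - y_hat) = 0 by definition of y_hat, the weights
    still sum to 1 without renormalizing, but unlike exact EG they are no
    longer guaranteed to stay positive when the step is large.
    """
    y_hat = w @ x
    return w * (1 - eta * 2 * (y_hat - y) * (x - y_hat))
```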
Worst Case Loss Bounds
Gradient Descent
EG
Loss_GD((s, η), S) ≤ (1 + c/2) Loss(u, S) + (1/2 + 1/(2c)) ‖u − s‖₂² X²
Loss_EG((s, η), S) ≤ (1 + c/2) Loss(u, S) + (1/2 + 1/c) R² d_re(u, s)

for c > 0, with learning rate η = 2c / (R² (2 + c)). Here X bounds ‖x_t‖₂ and R bounds the range max_i x_{t,i} − min_i x_{t,i} of each instance.
Worst Case Loss Bounds
EG±
Loss_EG±((U, s, η), S) ≤ (1 + c/2) Loss(u, S) + (2 + 4/c) U² X² d_re(u′/U, s′)

Where R = 2UX and η = 1/(3U²X²); u′ and s′ denote the decompositions of the comparison and start vectors into positive and negative parts.
Other Algorithms
Gradient projection algorithm (GP)
  Has similar bounds to GD
  Uses the constraint that the weights must sum to 1
Exponentiated Gradient algorithm with Unnormalized weights (EGU)
When all outcomes, inputs and comparison vectors are positive, it has the bounds:
Loss_EGU((s, Y, η), S) ≤ (1 + c/2) Loss(u, S) + (2 + 1/c) X Y d_reu(u, s)

where d_reu(u, s) = Σ_{i=1..N} (u_i ln(u_i / s_i) + s_i − u_i) is the unnormalized relative entropy.
Experiments
Have a fixed target concept u ∈ R^N
  u gives the weight of each input variable
Use ℓ instances of input x_t
  drawn from a probability measure on R^N
Random noise is added to the inputs
Run each algorithm on the (same) inputs
Plot cumulative losses for each algorithm
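The setup above can be sketched in Python; all parameter values (N, trial count, noise level, η) are illustrative, not the paper's:

```python
import numpy as np

# Sparse positive target over N inputs, noisy outcomes, and cumulative
# square loss for GD and EG run online on the same trial sequence.
rng = np.random.default_rng(0)
N, trials, eta = 20, 200, 0.01
u = np.zeros(N)
u[:2] = 0.5                                         # only two relevant variables
xs = rng.uniform(-1.0, 1.0, size=(trials, N))
ys = xs @ u + rng.normal(0.0, 0.05, size=trials)    # noisy outcomes

w_gd = np.zeros(N)
w_eg = np.full(N, 1.0 / N)                          # EG start vector: uniform
loss_gd = loss_eg = 0.0
for x, y in zip(xs, ys):
    p = w_gd @ x
    loss_gd += (p - y) ** 2
    w_gd = w_gd - 2 * eta * (p - y) * x             # GD update
    q = w_eg @ x
    loss_eg += (q - y) ** 2
    r = np.exp(-2 * eta * (q - y) * x)              # EG update
    w_eg = w_eg * r / np.sum(w_eg * r)
print(loss_gd, loss_eg)
```

Plotting the two running totals over the trial index reproduces the kind of cumulative-loss curves the slides compare.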
Results
(Plots of cumulative losses for each algorithm; figures not reproduced.)
GD vs. EG
Random errors confuse GD much more.
When the number of relevant variables is constant:
  Loss(GD) grows linearly in N
  Loss(EG) grows logarithmically in N
GD does better when:
  All variables are relevant, and
  Input is consistent (few or no errors)
Conclusion
Worst-case loss bounds exist only for the square loss; loss bounds for the relative entropy loss are still needed.
GD has provably optimal bounds.
Lower bounds for EG and EG± are still required.
EG and EG± perform better in error-prone learning environments.