Exponentiated Gradient versus Gradient Descent for Linear Predictors
Jyrki Kivinen and Manfred Warmuth
Presented By: Maitreyi N
Linear Predictors
A good linear predictor will satisfy the bound:

Loss_L(A, S) = O( inf_{u∈U} Loss_L(u, S) )

The bound can be improved to:

Loss_L(A, S) = (1 + o(1)) · inf_{u∈U} Loss_L(u, S)

where o(1) → 0 as ℓ → ∞.
Gradient Descent
This algorithm uses the update rule:

w_{t+1} = w_t − 2η(ŷ_t − y_t) x_t

This is the gradient of the Squared Euclidean Distance:

d(w, s) = ½ ‖w − s‖₂²
Exponentiated Gradient
This algorithm uses the update rule:

w_{t+1,i} = w_{t,i} r_{t,i} / Σ_{j=1..N} w_{t,j} r_{t,j}

(for the square loss, r_{t,i} = e^{−2η(ŷ_t − y_t) x_{t,i}})

This is the gradient of the Relative Entropy:

d_re(w, s) = Σ_{i=1..N} w_i ln(w_i / s_i)
Algorithm GDL(s, η)
Parameters:
  L: a loss function from R × R to [0, ∞),
  s: a start vector in R^N, and
  η: a learning rate in [0, ∞).
Initialization: Before the first trial, set w1=s.
Prediction: Upon receiving the t th instance xt, give the prediction ŷt=wt • xt .
Update: Upon receiving the t th outcome yt, update the weights according to the rule
wt+1=wt - η L'yt(ŷt) xt .
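The GD_L algorithm above can be sketched in Python for the square loss L(y, ŷ) = (ŷ − y)², so that L′_y(ŷ) = 2(ŷ − y). The function name and test data are illustrative, not from the paper:

```python
import numpy as np

def gd_square_loss(xs, ys, s, eta):
    """Run GD_L with the square loss over trials (x_t, y_t)."""
    w = np.asarray(s, dtype=float).copy()      # w_1 = s
    total_loss = 0.0
    for x, y in zip(xs, ys):
        y_hat = w @ x                          # prediction: yhat_t = w_t . x_t
        total_loss += (y_hat - y) ** 2
        w = w - 2 * eta * (y_hat - y) * x      # w_{t+1} = w_t - eta L'_{y_t}(yhat_t) x_t
    return total_loss, w
```

Running the function a second time from the returned weights should incur a smaller cumulative loss on consistent data, since the weights have moved toward the target.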
Algorithm EGL(s, η)
Parameters:
  L: a loss function from R × R to [0, ∞),
  s: a start vector with Σ_{i=1..N} s_i = 1, and
  η: a learning rate in [0, ∞).
Initialization: Before the first trial, set w1=s.
Prediction: Upon receiving the t th instance xt, give the prediction ŷt=wt • xt .
Update: Upon receiving the t th outcome yt, update the weights according to the rule
w_{t+1,i} = w_{t,i} r_{t,i} / Σ_{j=1..N} w_{t,j} r_{t,j} , where r_{t,i} = e^{−η L′yt(ŷt) x_{t,i}} .
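A matching Python sketch of EG_L for the square loss, where the multiplicative factors become r_{t,i} = e^{−2η(ŷ_t − y_t) x_{t,i}}; names and test data are illustrative:

```python
import numpy as np

def eg_square_loss(xs, ys, s, eta):
    """Run EG_L with the square loss; weights stay positive and sum to 1."""
    w = np.asarray(s, dtype=float).copy()          # w_1 = s, with sum(s) == 1
    total_loss = 0.0
    for x, y in zip(xs, ys):
        y_hat = w @ x
        total_loss += (y_hat - y) ** 2
        r = np.exp(-2 * eta * (y_hat - y) * x)     # r_{t,i} for the square loss
        w = w * r / np.sum(w * r)                  # multiplicative update, renormalize
    return total_loss, w
```

Note that the normalization keeps the weight vector on the probability simplex throughout, which is why EG alone can only represent positive concepts.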
EG± : EG with negative weights
EG is analogous to the Weighted Majority Algorithm:
  Uses multiplicative update rules
  Is based on minimizing relative entropy
  Unfortunately, it can represent only positive concepts
EG± can represent any concept in the entire sample space.
It has proven relative bounds; absolute bounds are not proven.
Works by splitting the weight vector into positive and negative weights, with separate update rules.
EG± Algorithm:
Update:
EG±
EG: Update rule (square loss)

w_{t+1,i} = w_{t,i} r_{t,i} / Σ_{j=1..N} w_{t,j} r_{t,j} ,  where r_{t,i} = e^{−2η(ŷ_t − y_t) x_{t,i}}

EG±: Update rule (square loss)

w⁺_{t+1,i} = w⁺_{t,i} r⁺_{t,i} / Σ_{j=1..N} (w⁺_{t,j} r⁺_{t,j} + w⁻_{t,j} r⁻_{t,j}) ,  where r⁺_{t,i} = e^{−2η(ŷ_t − y_t) x_{t,i}}

The negative weights w⁻ are updated analogously with r⁻_{t,i} = 1 / r⁺_{t,i}, and the prediction is ŷ_t = (w⁺_t − w⁻_t) • x_t.
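One EG± trial can be sketched as a single step in Python (square loss, with the total weight normalized to 1 rather than to a scale U; names are illustrative):

```python
import numpy as np

def eg_pm_step(w_pos, w_neg, x, y, eta):
    """One EG± trial for the square loss; total weight stays normalized to 1."""
    y_hat = (w_pos - w_neg) @ x                   # prediction uses the difference
    r_pos = np.exp(-2 * eta * (y_hat - y) * x)    # r+_{t,i}
    r_neg = 1.0 / r_pos                           # r-_{t,i} = 1 / r+_{t,i}
    z = np.sum(w_pos * r_pos + w_neg * r_neg)     # shared normalizer over both halves
    return w_pos * r_pos / z, w_neg * r_neg / z
```

Because both halves share one normalizer, the combined weight mass is conserved while the effective weights w⁺ − w⁻ can take either sign.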
Variable Learning Rates
GDV
Weight update rule becomes:

w_{t+1} = w_t − 2η(ŷ_t − y_t) x_t / ‖x_t‖₂²

EGV±
Weight update rule becomes:

r⁺_{t,i} = exp( −2η(ŷ_t − y_t) x_{t,i} / (U ‖x_t‖∞²) ) ,  r⁻_{t,i} = 1 / r⁺_{t,i}
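The GDV update can be sketched as follows (illustrative Python, square loss). A useful sanity check on the scaling: with η = 1/2 the updated weights predict the current instance exactly, since the correction (ŷ − y) x / ‖x‖₂² removes the whole residual along x.

```python
import numpy as np

def gdv_step(w, x, y, eta):
    """GDV update: the gradient step is rescaled by 1/||x_t||_2^2."""
    y_hat = w @ x
    return w - 2 * eta * (y_hat - y) * x / np.dot(x, x)
```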
Approximated EG Algorithms
Use the approximation:

e^{−av} ≈ e^{−av₀} (1 − a(v − v₀))

So the update rule becomes:

w_{t+1,i} = w_{t,i} (1 − η L′yt(ŷt)(x_{t,i} − ŷ_t))
The approximation leads to oscillation of the weight vector for certain weight distributions
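A sketch of the approximated update for the square loss (illustrative Python). The comment notes why no explicit renormalization is needed, and why the update can misbehave:

```python
import numpy as np

def approx_eg_step(w, x, y, eta):
    """Approximated EG update for the square loss.

    Since sum_i w_i (x_i - y_hat) = 0 by definition of y_hat, the weights
    still sum to 1 without renormalizing, but unlike exact EG they are no
    longer guaranteed to stay positive when the step is large.
    """
    y_hat = w @ x
    return w * (1 - eta * 2 * (y_hat - y) * (x - y_hat))
```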
Worst Case Loss Bounds
Gradient Descent
EG
Loss_GD((s, η), S) ≤ (1 + c/2) Loss(u, S) + (1/2 + 1/(2c)) ‖u − s‖₂² X²
Loss_EG((s, η), S) ≤ (1 + c/2) Loss(u, S) + (1/2 + 1/c) R² d_re(u, s)

for c > 0, with learning rate η = 2c / (R² (2 + c)). Here X bounds ‖x_t‖₂ and R bounds the range max_i x_{t,i} − min_i x_{t,i} of each instance.
Worst Case Loss Bounds
EG±
Loss_EG±((U, s, η), S) ≤ (1 + c/2) Loss(u, S) + (2 + 4/c) U² X² d_re(u′/U, s′)

Where R = 2UX and η = 1/(3U²X²); u′ and s′ denote the decompositions of the comparison and start vectors into positive and negative parts.
Other Algorithms
Gradient projection algorithm (GP)
  Has similar bounds to GD
  Uses the constraint that the weights must sum to 1
Exponentiated Gradient algorithm with Unnormalized weights (EGU)
When all outcomes, inputs and comparison vectors are positive, it has the bounds:
Loss_EGU((s, Y, η), S) ≤ (1 + c/2) Loss(u, S) + (2 + 1/c) X Y d_reu(u, s)

where d_reu(u, s) = Σ_{i=1..N} (u_i ln(u_i / s_i) + s_i − u_i) is the unnormalized relative entropy.
Experiments
Have a fixed target concept u ∈ R^N
  u gives the weight of each input variable
Use ℓ instances of input x_t
  drawn from a probability measure on R^N
Random noise is added to the inputs
Run each algorithm on the (same) inputs
Plot cumulative losses for each algorithm
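The setup above can be sketched in Python; all parameter values (N, trial count, noise level, η) are illustrative, not the paper's:

```python
import numpy as np

# Sparse positive target over N inputs, noisy outcomes, and cumulative
# square loss for GD and EG run online on the same trial sequence.
rng = np.random.default_rng(0)
N, trials, eta = 20, 200, 0.01
u = np.zeros(N)
u[:2] = 0.5                                         # only two relevant variables
xs = rng.uniform(-1.0, 1.0, size=(trials, N))
ys = xs @ u + rng.normal(0.0, 0.05, size=trials)    # noisy outcomes

w_gd = np.zeros(N)
w_eg = np.full(N, 1.0 / N)                          # EG start vector: uniform
loss_gd = loss_eg = 0.0
for x, y in zip(xs, ys):
    p = w_gd @ x
    loss_gd += (p - y) ** 2
    w_gd = w_gd - 2 * eta * (p - y) * x             # GD update
    q = w_eg @ x
    loss_eg += (q - y) ** 2
    r = np.exp(-2 * eta * (q - y) * x)              # EG update
    w_eg = w_eg * r / np.sum(w_eg * r)
print(loss_gd, loss_eg)
```

Plotting the two running totals over the trial index reproduces the kind of cumulative-loss curves the slides compare.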
Results
(Plots of cumulative losses for each algorithm; figures not reproduced.)
GD vs. EG
Random errors confuse GD much more.
When the number of relevant variables is constant:
  Loss(GD) grows linearly in N
  Loss(EG) grows logarithmically in N
GD does better when:
  All variables are relevant, and
  Input is consistent (few or no errors)
Conclusion
Worst-case loss bounds exist only for the square loss; loss bounds for the relative entropy loss are still needed.
GD has provably optimal bounds.
Lower bounds for EG and EG± are still required.
EG and EG± perform better in error-prone learning environments.