



11-755 Machine Learning for Signal Processing

Regression and Prediction

Class 15.  23 Oct 2012

Instructor: Bhiksha Raj


Matrix Identities

The derivative of a scalar function w.r.t. a vector is a vector; the derivative w.r.t. a matrix is a matrix. For a scalar function f of a vector x:

$$\mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_D \end{bmatrix}^T, \qquad \frac{df(\mathbf{x})}{d\mathbf{x}} = \begin{bmatrix} \dfrac{df}{dx_1} & \dfrac{df}{dx_2} & \cdots & \dfrac{df}{dx_D} \end{bmatrix}$$

Matrix Identities

For a scalar function f of a D×D matrix X, the derivative is the matrix of partial derivatives w.r.t. each entry:

$$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1D}\\ x_{21} & x_{22} & \cdots & x_{2D}\\ \vdots & \vdots & \ddots & \vdots\\ x_{D1} & x_{D2} & \cdots & x_{DD} \end{bmatrix}, \qquad \frac{df(\mathbf{X})}{d\mathbf{X}} = \begin{bmatrix} \dfrac{df}{dx_{11}} & \dfrac{df}{dx_{12}} & \cdots & \dfrac{df}{dx_{1D}}\\ \dfrac{df}{dx_{21}} & \dfrac{df}{dx_{22}} & \cdots & \dfrac{df}{dx_{2D}}\\ \vdots & \vdots & \ddots & \vdots\\ \dfrac{df}{dx_{D1}} & \dfrac{df}{dx_{D2}} & \cdots & \dfrac{df}{dx_{DD}} \end{bmatrix}$$

Matrix Identities

The derivative of a vector function w.r.t. a vector is a matrix. Note the transposition of order: for F mapping a D-dimensional x to an N-dimensional output,

$$\mathbf{F}(\mathbf{x}) = \begin{bmatrix} F_1 & F_2 & \cdots & F_N \end{bmatrix}^T, \qquad \mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_D \end{bmatrix}^T$$

$$\frac{d\mathbf{F}}{d\mathbf{x}} = \begin{bmatrix} \dfrac{dF_1}{dx_1} & \dfrac{dF_2}{dx_1} & \cdots & \dfrac{dF_N}{dx_1}\\ \dfrac{dF_1}{dx_2} & \dfrac{dF_2}{dx_2} & \cdots & \dfrac{dF_N}{dx_2}\\ \vdots & \vdots & \ddots & \vdots\\ \dfrac{dF_1}{dx_D} & \dfrac{dF_2}{dx_D} & \cdots & \dfrac{dF_N}{dx_D} \end{bmatrix} \qquad (D \times N)$$

Derivatives

In general, differentiating an M×N function by a U×V argument results in an M×N×U×V tensor derivative. For example, differentiating an N×1 function by a U×V argument gives an N×U×V (equivalently, a U×V×N) tensor.

Matrix derivative identities

Some basic linear and quadratic identities:

$$\frac{d(\mathbf{X}\mathbf{a})}{d\mathbf{a}} = \mathbf{X} \qquad \text{(X is a matrix, a is a vector; depending on convention the solution may also be } \mathbf{X}^T\text{)}$$

$$d(\mathbf{A}\mathbf{X}) = \mathbf{A}\,(d\mathbf{X}); \qquad d(\mathbf{X}\mathbf{A}) = (d\mathbf{X})\,\mathbf{A} \qquad \text{(A is a matrix)}$$

$$\frac{d(\mathbf{a}^T\mathbf{X}\mathbf{a})}{d\mathbf{a}} = \mathbf{a}^T(\mathbf{X} + \mathbf{X}^T)$$

$$\frac{d\,\mathrm{trace}(\mathbf{X}\mathbf{A}\mathbf{X}^T)}{d\mathbf{X}} = \mathbf{X}\mathbf{A} + \mathbf{X}\mathbf{A}^T$$


A Common Problem

Can you spot the glitches?

How to fix this problem?

"Glitches" in audio must be detected. How? Then what?

Glitches must be "fixed": delete the glitch. This results in a "hole". Fill in the hole. How?

Interpolation..

"Extend" the curve on the left to "predict" the values in the "blank" region: forward prediction. Extend the blue curve on the right leftwards to predict the blank region: backward prediction. How? Regression analysis..

Detecting the Glitch

Regression-based reconstruction can be done anywhere. The reconstructed value will not match the actual value; a large reconstruction error identifies glitches.

What is a regression

Analyzing the relationship between variables. Expressed in many forms (Wikipedia): linear regression, simple regression, ordinary least squares, polynomial regression, general linear model, generalized linear model, discrete choice, logistic regression, multinomial logit, mixed logit, probit, multinomial probit, ….

Generally a tool to predict variables.

Regressions for prediction

y = f(x; Θ) + e. Different possibilities:

y is a scalar; y is real; y is categorical (classification); y is a vector.

x is a vector; x is a set of real-valued variables; x is a set of categorical variables; x is a combination of the two.

f(.) is a linear or affine function; f(.) is a non-linear function; f(.) is a time-series model.


A linear regression

Assumption: the relationship between the variables is linear. A linear trend may be found relating x and y, where y is the dependent variable and x is the explanatory variable. Given x, y can be predicted as an affine function of x.

An imaginary regression.. (http://pages.cs.wisc.edu/~kovar/hall.html)

"Check this shit out (Fig. 1). That's bonafide, 100%-real data, my friends. I took it myself over the course of two weeks. And this was not a leisurely two weeks, either; I busted my ass day and night in order to provide you with nothing but the best data possible. Now, let's look a bit more closely at this data, remembering that it is absolutely first-rate. Do you see the exponential dependence? I sure don't. I see a bunch of crap. Christ, this was such a waste of my time. Banking on my hopes that whoever grades this will just look at the pictures, I drew an exponential through my noise. I believe the apparent legitimacy is enhanced by the fact that I used a complicated computer program to make the fit. I understand this is the same process by which the top quark was discovered."

Linear Regressions

y = Ax + b + e, where e is the prediction error.

Given a "training" set of {x, y} values, estimate A and b:

y1 = Ax1 + b + e1
y2 = Ax2 + b + e2
y3 = Ax3 + b + e3

If A and b are well estimated, the prediction error will be small.

Linear Regression to a scalar

y1 = a^T x1 + b + e1
y2 = a^T x2 + b + e2
y3 = a^T x3 + b + e3

Define:

$$\mathbf{y} = [y_1\; y_2\; y_3\;\cdots], \qquad \mathbf{X} = \begin{bmatrix}\mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3 & \cdots\\ 1 & 1 & 1 & \cdots\end{bmatrix}, \qquad \mathbf{A} = \begin{bmatrix}\mathbf{a}\\ b\end{bmatrix}, \qquad \mathbf{e} = [e_1\; e_2\; e_3\;\cdots]$$

Rewrite:

$$\mathbf{y} = \mathbf{A}^T\mathbf{X} + \mathbf{e}$$

Learning the parameters

Assuming no error, the prediction is ŷ = A^T X.

Given training data (several x, y pairs), we can define a "divergence" D(y, ŷ) that measures how much ŷ differs from y. Ideally, if the model is accurate, this should be small. Estimate A, b to minimize D(y, ŷ).

The prediction error as divergence

y1 = a^T x1 + b + e1
y2 = a^T x2 + b + e2
y3 = a^T x3 + b + e3, i.e. y = A^T X + e.

Define the divergence as the sum of the squared errors in predicting y:

$$D(\mathbf{y},\hat{\mathbf{y}}) = E = e_1^2 + e_2^2 + e_3^2 + \cdots = (y_1 - \mathbf{a}^T\mathbf{x}_1 - b)^2 + (y_2 - \mathbf{a}^T\mathbf{x}_2 - b)^2 + \cdots = \|\mathbf{y} - \mathbf{A}^T\mathbf{X}\|^2$$


Prediction error as divergence

y = a^T x + e, where e is the prediction error. Find the "slope" a such that the total squared length of the error lines is minimized.

Solving a linear regression

Minimize the squared error of y = A^T X + e:

$$E = \|\mathbf{y} - \mathbf{A}^T\mathbf{X}\|^2 = (\mathbf{y} - \mathbf{A}^T\mathbf{X})(\mathbf{y} - \mathbf{A}^T\mathbf{X})^T = \mathbf{y}\mathbf{y}^T - 2\mathbf{y}\mathbf{X}^T\mathbf{A} + \mathbf{A}^T\mathbf{X}\mathbf{X}^T\mathbf{A}$$

Differentiating w.r.t. A and equating to 0:

$$\frac{dE}{d\mathbf{A}} = 2\mathbf{X}\mathbf{X}^T\mathbf{A} - 2\mathbf{X}\mathbf{y}^T = 0 \quad\Rightarrow\quad \mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{y}^T = \mathrm{pinv}(\mathbf{X}^T)\,\mathbf{y}^T$$

An Aside

What happens if we minimize the perpendicular instead?


Regression in multiple dimensions

Also called multiple regression. Here each y_i is a vector:

y1 = A^T x1 + b + e1
y2 = A^T x2 + b + e2
y3 = A^T x3 + b + e3

Equivalent of saying (with y_ij the j-th component of vector y_i, a_j the j-th column of A, and b_j the j-th component of b):

y11 = a1^T x1 + b1 + e11
y12 = a2^T x1 + b2 + e12
y13 = a3^T x1 + b3 + e13

Fundamentally no different from N separate single regressions. But we can use the relationship between the y's to our benefit.

Multiple Regression

Define (a row of ones is appended to X, and b is absorbed into A as an extra row):

$$\mathbf{Y} = [\mathbf{y}_1\; \mathbf{y}_2\; \mathbf{y}_3\;\cdots], \qquad \mathbf{X} = \begin{bmatrix}\mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3 & \cdots\\ 1 & 1 & 1 & \cdots\end{bmatrix}, \qquad \mathbf{E} = [\mathbf{e}_1\; \mathbf{e}_2\; \mathbf{e}_3\;\cdots]$$

so that Y = A^T X + E, and

$$DIV = \sum_i \|\mathbf{y}_i - \mathbf{A}^T\mathbf{x}_i - \mathbf{b}\|^2 = \mathrm{trace}\big((\mathbf{Y} - \mathbf{A}^T\mathbf{X})(\mathbf{Y} - \mathbf{A}^T\mathbf{X})^T\big)$$

Differentiating and equating to 0:

$$\frac{d\,DIV}{d\mathbf{A}} = 2\mathbf{X}\mathbf{X}^T\mathbf{A} - 2\mathbf{X}\mathbf{Y}^T = 0 \quad\Rightarrow\quad \mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T = \mathrm{pinv}(\mathbf{X}^T)\,\mathbf{Y}^T$$

A Different Perspective

y is a noisy reading of A^T x, and the error e is Gaussian:

$$\mathbf{y} = \mathbf{A}^T\mathbf{x} + \mathbf{e}, \qquad \mathbf{e} \sim N(0, \sigma^2\mathbf{I})$$

Estimate A from

$$\mathbf{Y} = [\mathbf{y}_1\;\cdots\;\mathbf{y}_N], \qquad \mathbf{X} = [\mathbf{x}_1\;\cdots\;\mathbf{x}_N]$$


The Likelihood of the data

With y = A^T x + e and e ~ N(0, σ²I), the probability of observing a specific y, given x, for a particular matrix A is

$$P(\mathbf{y}\,|\,\mathbf{x}; \mathbf{A}) = N(\mathbf{A}^T\mathbf{x},\; \sigma^2\mathbf{I})$$

Probability of the collection Y = [y1 … yN], X = [x1 … xN], assuming IID for convenience (not necessary):

$$P(\mathbf{Y}\,|\,\mathbf{X}; \mathbf{A}) = \prod_i N(\mathbf{A}^T\mathbf{x}_i,\; \sigma^2\mathbf{I})$$

A Maximum Likelihood Estimate

$$P(\mathbf{Y}\,|\,\mathbf{X}; \mathbf{A}) = \prod_i \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{y}_i - \mathbf{A}^T\mathbf{x}_i\|^2\right)$$

$$\log P(\mathbf{Y}\,|\,\mathbf{X}; \mathbf{A}) = C - \frac{1}{2\sigma^2}\sum_i \|\mathbf{y}_i - \mathbf{A}^T\mathbf{x}_i\|^2 = C - \frac{1}{2\sigma^2}\,\mathrm{trace}\big((\mathbf{Y} - \mathbf{A}^T\mathbf{X})(\mathbf{Y} - \mathbf{A}^T\mathbf{X})^T\big)$$

Maximizing the log probability is identical to minimizing the trace, i.e. identical to the least squares solution:

$$\mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T = \mathrm{pinv}(\mathbf{X}^T)\,\mathbf{Y}^T$$

Predicting an output

From a collection of training data, we have learned A. Given x for a new instance, but not y, what is y? Simple solution:

$$\hat{\mathbf{y}} = \mathbf{A}^T\mathbf{x}$$

Applying it to our problem

Prediction by regression:

Forward regression: x_t = a1 x_{t-1} + a2 x_{t-2} + … + aK x_{t-K} + e_t

Backward regression: x_t = b1 x_{t+1} + b2 x_{t+2} + … + bK x_{t+K} + e_t

Applying it to our problem

Forward prediction: collect the predicted samples and their K-sample histories into a vector and a matrix,

$$[x_{K+1}\;\; x_{K+2}\;\; \cdots\;\; x_t] = \mathbf{a}^T \begin{bmatrix} x_K & x_{K+1} & \cdots & x_{t-1}\\ x_{K-1} & x_K & \cdots & x_{t-2}\\ \vdots & \vdots & & \vdots\\ x_1 & x_2 & \cdots & x_{t-K} \end{bmatrix} + [e_{K+1}\;\; e_{K+2}\;\; \cdots\;\; e_t]$$

i.e. $\bar{\mathbf{x}} = \mathbf{a}^T\mathbf{X} + \mathbf{e}$, so that

$$\mathbf{a}^T = \bar{\mathbf{x}}\;\mathrm{pinv}(\mathbf{X})$$
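A small numpy sketch of this forward predictor (my own illustration; the order K and the synthetic AR signal are placeholder choices, not values from the lecture):

```python
import numpy as np

def forward_lp_coeffs(x, K):
    """Fit a K-th order forward linear predictor x[t] ~ sum_k a[k] * x[t-k-1]."""
    targets = x[K:]                                                   # samples to be predicted
    # Row k of X holds the lag-(k+1) samples for every target sample.
    X = np.stack([x[K - k - 1 : len(x) - k - 1] for k in range(K)], axis=0)
    a = targets @ np.linalg.pinv(X)                                   # a^T = xbar . pinv(X)
    return a

# Toy usage on a synthetic AR(2) signal
rng = np.random.default_rng(1)
x = np.zeros(500)
for t in range(2, 500):
    x[t] = 1.6 * x[t - 1] - 0.8 * x[t - 2] + 0.05 * rng.normal()
print(forward_lp_coeffs(x, K=2))   # should be close to [1.6, -0.8]
```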

Applying it to our problem

Backward prediction: the same construction, but each sample is predicted from the K samples that follow it,

$$[x_1\;\; x_2\;\; \cdots\;\; x_{t-K}] = \mathbf{b}^T \begin{bmatrix} x_2 & x_3 & \cdots & x_{t-K+1}\\ x_3 & x_4 & \cdots & x_{t-K+2}\\ \vdots & \vdots & & \vdots\\ x_{K+1} & x_{K+2} & \cdots & x_t \end{bmatrix} + [e_1\;\; e_2\;\; \cdots\;\; e_{t-K}]$$

i.e. $\bar{\mathbf{x}} = \mathbf{b}^T\mathbf{X} + \mathbf{e}$, so that

$$\mathbf{b}^T = \bar{\mathbf{x}}\;\mathrm{pinv}(\mathbf{X})$$


Finding the burst

At each time:
Learn a "forward" predictor a_t.
Predict the next sample: x_t^est = Σ_k a_{t,k} x_{t-k}.
Compute the forward error: ferr_t = |x_t - x_t^est|².
Learn a "backward" predictor and compute the backward error berr_t.
Compute the average prediction error over a window, and threshold it.

A sketch of this detector follows.
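A minimal numpy sketch of this detector, assuming forward/backward predictor coefficients fit as on the previous slides and a simple moving-average threshold (the window length and threshold are illustrative choices, not values from the lecture):

```python
import numpy as np

def prediction_error(x, coeffs, direction="forward"):
    """Squared one-step prediction error at every sample where a prediction exists."""
    K = len(coeffs)
    err = np.zeros_like(x)
    if direction == "forward":
        for t in range(K, len(x)):
            err[t] = (x[t] - coeffs @ x[t - K:t][::-1]) ** 2      # uses x[t-1], ..., x[t-K]
    else:  # backward: predict x[t] from the K samples that follow it
        for t in range(len(x) - K):
            err[t] = (x[t] - coeffs @ x[t + 1:t + K + 1]) ** 2
    return err

def detect_glitches(x, a, b, win=50, thresh=5.0):
    """Flag samples whose windowed average prediction error is far above typical levels."""
    ferr = prediction_error(x, a, "forward")
    berr = prediction_error(x, b, "backward")
    avg = np.convolve(0.5 * (ferr + berr), np.ones(win) / win, mode="same")
    return avg > thresh * np.mean(avg)            # boolean glitch mask (illustrative threshold)
```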

Filling the hole

Learn a "forward" predictor at the left edge of the "hole". For each missing sample, predict the next sample x_t^est = Σ_k a_k x_{t-k}, using estimated samples where real samples are not available.

Learn a "backward" predictor at the right edge of the "hole". For each missing sample, predict x_t^est = Σ_k b_k x_{t+k}, again using estimated samples where real samples are not available.

Average the forward and backward predictions. A sketch of the interpolation follows.
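Here is a minimal sketch of the interpolation step, assuming the predictor coefficients a (forward) and b (backward) have already been fit on the clean data at the edges of the hole, and that at least K clean samples exist on each side (function and variable names are my own):

```python
import numpy as np

def fill_hole(x, start, end, a, b):
    """Fill x[start:end] by averaging forward and backward linear predictions."""
    K = len(a)
    y = x.copy()
    fwd = np.zeros(end - start)
    for i in range(end - start):                          # left-to-right, reusing earlier estimates
        hist = np.concatenate([y[:start], fwd[:i]])[-K:]  # last K samples before the current one
        fwd[i] = a @ hist[::-1]                           # a_1*x_{t-1} + ... + a_K*x_{t-K}
    bwd = np.zeros(end - start)
    for i in range(end - start):                          # right-to-left, reusing later estimates
        fut = np.concatenate([bwd[:i][::-1], y[end:]])[:K]  # next K samples after the current one
        bwd[i] = b @ fut                                  # b_1*x_{t+1} + ... + b_K*x_{t+K}
    y[start:end] = 0.5 * (fwd + bwd[::-1])                # average the two predictions
    return y
```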

Reconstruction zoom in

(Figure: distorted signal vs. recovered signal, zoomed in on the reconstruction area; the interpolation result is overlaid on the actual data, with the next glitch marked.)

Incrementally learning the regression

The batch solution

$$\mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T$$

requires knowledge of all (x, y) pairs. Can we learn A incrementally instead, as data comes in? The Widrow-Hoff rule (scalar prediction version):

$$\hat{\mathbf{a}}_{t+1} = \hat{\mathbf{a}}_t + \eta\,\underbrace{\big(y_t - \hat{\mathbf{a}}_t^T\mathbf{x}_t\big)}_{\text{error}}\,\mathbf{x}_t$$

Note the structure: the update is driven by the current prediction error. It can also be done in batch mode! A small sketch follows.
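A minimal sketch of this update, assuming a fixed step size eta (the value is illustrative, not from the lecture):

```python
import numpy as np

def widrow_hoff(X, y, eta=0.01, passes=1):
    """Incremental (LMS) estimate of a in y ~ a^T x, one (x, y) pair at a time."""
    D, N = X.shape
    a = np.zeros(D)
    for _ in range(passes):
        for t in range(N):
            err = y[t] - a @ X[:, t]        # prediction error on the current sample
            a = a + eta * err * X[:, t]     # Widrow-Hoff / LMS update
    return a
```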

Predicting a value

What are we doing exactly? With A = (XX^T)^{-1} X Y^T,

$$\hat{\mathbf{y}} = \mathbf{A}^T\mathbf{x} = \mathbf{Y}\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{x}$$

Let C = XX^T and define the normalized, rotated inputs X̂ = C^{-1/2}X and x̂ = C^{-1/2}x (the rotation is irrelevant). Then

$$\hat{\mathbf{y}} = \mathbf{Y}\hat{\mathbf{X}}^T\hat{\mathbf{x}} = \sum_i \mathbf{y}_i\,(\hat{\mathbf{x}}_i^T\hat{\mathbf{x}})$$

The prediction is a weighted combination of the training outputs.

Relationships are not always linear. How do we model these? Multiple solutions.


Non-linear regression

y = φ(x) + e, where

$$\varphi(\mathbf{x}) = [\varphi_1(\mathbf{x})\;\; \varphi_2(\mathbf{x})\;\; \cdots], \qquad \Phi(\mathbf{X}) = [\varphi(\mathbf{x}_1)\;\; \varphi(\mathbf{x}_2)\;\; \cdots]$$

Y = A^T Φ(X) + e. Replace X with Φ(X) in the earlier equations for the solution:

$$\mathbf{A} = \big(\Phi(\mathbf{X})\Phi(\mathbf{X})^T\big)^{-1}\Phi(\mathbf{X})\,\mathbf{Y}^T$$
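For instance, with polynomial basis functions this is just the linear solution applied to Φ(X). A sketch under that assumption (the basis choice and data are mine, not the lecture's):

```python
import numpy as np

def poly_features(x, degree):
    """phi(x) = [1, x, x^2, ..., x^degree] for a scalar input; one column per sample."""
    return np.vstack([x ** d for d in range(degree + 1)])

def fit_nonlinear(x, y, degree=3):
    Phi = poly_features(x, degree)                     # Phi(X)
    return np.linalg.solve(Phi @ Phi.T, Phi @ y)       # A = (Phi Phi^T)^{-1} Phi y^T

# Toy usage
x = np.linspace(-1, 1, 100)
y = np.sin(3 * x) + 0.05 * np.random.default_rng(2).normal(size=x.size)
A = fit_nonlinear(x, y)
y_hat = A @ poly_features(x, 3)                        # predictions A^T Phi(x)
```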

What we are doing

Finding the optimal combination of various functions. Remind you of something?

Being non-committal: Local Regression

Regression is usually trained over the entire data and must apply everywhere. How about doing this locally? Recall that the global solution is a weighted combination of the training outputs,

$$\hat{\mathbf{y}} = \sum_i (\hat{\mathbf{x}}_i^T\hat{\mathbf{x}})\,\mathbf{y}_i = \sum_i (\mathbf{x}_i^T\mathbf{C}^{-1}\mathbf{x})\,\mathbf{y}_i$$

For any x, replace these weights by a local weighting function of the query point:

$$\hat{\mathbf{y}} = \sum_i d(\mathbf{x}, \mathbf{x}_i)\,\mathbf{y}_i$$

Local Regression

$$\hat{\mathbf{y}} = \sum_i d(\mathbf{x}, \mathbf{x}_i)\,\mathbf{y}_i$$

The resulting regression is dependent on x: the fit error is measured with weights d(x, x_i) that depend on the query point. There is no closed form solution, but it can be highly accurate. But what is d(x, x')?

Kernel Regression

$$\hat{\mathbf{y}} = \frac{\sum_i K_h(\mathbf{x} - \mathbf{x}_i)\,\mathbf{y}_i}{\sum_i K_h(\mathbf{x} - \mathbf{x}_i)}$$

Actually a non-parametric MAP estimator of y. Note: an estimator of y, not of the parameters of a regression. The "Kernel" is the kernel of a Parzen window.

But first.. MAP estimators..
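A minimal sketch of this estimator with a Gaussian Parzen kernel (the bandwidth h is an illustrative choice, and the function name is mine):

```python
import numpy as np

def kernel_regression(x_query, X_train, Y_train, h=0.5):
    """Nadaraya-Watson estimate: kernel-weighted average of the training outputs."""
    # Gaussian kernel K_h(x - x_i), one weight per training point
    d2 = np.sum((X_train - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * h ** 2))
    return (w @ Y_train) / np.sum(w)

# Usage: X_train is (N, D), Y_train is (N,) or (N, M), x_query is (D,)
```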

MAP Estimators

MAP (Maximum A Posteriori): find a "best guess" for y (in a statistical sense), given that we know x:

y = argmax_Y P(Y | x)

ML (Maximum Likelihood): find that value of Y for which the statistical best guess of X would have been the observed X:

y = argmax_Y P(x | Y)

MAP is simpler to visualize.


MAP estimation: Gaussian PDF

Assume X and Y are jointly Gaussian. The parameters of the Gaussian are learned from training data.

Learning the parameters of the Gaussian

$$\mathbf{z} = \begin{bmatrix}\mathbf{y}\\ \mathbf{x}\end{bmatrix}, \qquad \mu_z = \frac{1}{N}\sum_i \mathbf{z}_i, \qquad \mathbf{C}_z = \frac{1}{N}\sum_i (\mathbf{z}_i - \mu_z)(\mathbf{z}_i - \mu_z)^T = \begin{bmatrix}\mathbf{C}_{YY} & \mathbf{C}_{YX}\\ \mathbf{C}_{XY} & \mathbf{C}_{XX}\end{bmatrix}$$

Learning the parameters of the Gaussian

In particular,

$$\mu_X = \frac{1}{N}\sum_i \mathbf{x}_i, \qquad \mathbf{C}_{XY} = \frac{1}{N}\sum_i (\mathbf{x}_i - \mu_X)(\mathbf{y}_i - \mu_Y)^T$$

and similarly for μ_Y, C_XX, C_YY, and C_YX.

MAP Estimator for Gaussian RV

Assume X and Y are jointly Gaussian; the parameters of the Gaussian are learned from training data. Now we are given an X, but no Y. What is Y?

(Figure: level set of the Gaussian, with the observed value x0 marked.)


MAP estimation: the Gaussian at a particular value of X

(Figure sequence: the joint Gaussian P(x, y), the slice of the Gaussian at x = x0, and the most likely value of y on that slice.)

MAP Estimation of a Gaussian RV

Y = argmax_y P(y | X) ???


So what is this value?

Clearly a line. Equation of the line (the scalar version is given; the vector version is identical; the derivation comes a bit later in the program):

$$\hat{y} = \mu_Y + C_{YX}C_{XX}^{-1}(x - \mu_X)$$
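A small numpy sketch of this estimator, assuming the joint mean and covariance are estimated from training data as on the earlier slides (names are mine):

```python
import numpy as np

def gaussian_map_estimate(x, X_train, Y_train):
    """MAP/MMSE estimate of y given x under a jointly Gaussian model of (x, y).
    X_train: (Dx, N), Y_train: (Dy, N), x: (Dx,)."""
    mu_x, mu_y = X_train.mean(axis=1), Y_train.mean(axis=1)
    Xc, Yc = X_train - mu_x[:, None], Y_train - mu_y[:, None]
    N = X_train.shape[1]
    C_xx = Xc @ Xc.T / N
    C_yx = Yc @ Xc.T / N
    # y_hat = mu_y + C_yx C_xx^{-1} (x - mu_x)
    return mu_y + C_yx @ np.linalg.solve(C_xx, x - mu_x)
```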

This is a multiple regression

$$\hat{y} = \mu_Y + C_{YX}C_{XX}^{-1}(x - \mu_X)$$

This is the MAP estimate of y, NOT the regression parameter. What about the ML estimate of y? (Again, the ML estimate of y, not of the regression parameter.)

It's also a minimum-mean-squared error estimate

General principle of MMSE estimation: y is unknown, x is known. We must estimate y such that the expected squared error is minimized:

$$Err = E\big[\|\mathbf{y} - \hat{\mathbf{y}}\|^2 \,\big|\, \mathbf{x}\big]$$

Minimize the above term.

It's also a minimum-mean-squared error estimate

Minimize the error:

$$Err = E\big[(\mathbf{y} - \hat{\mathbf{y}})^T(\mathbf{y} - \hat{\mathbf{y}}) \,\big|\, \mathbf{x}\big] = E\big[\mathbf{y}^T\mathbf{y} \,\big|\, \mathbf{x}\big] - 2\hat{\mathbf{y}}^T E\big[\mathbf{y} \,\big|\, \mathbf{x}\big] + \hat{\mathbf{y}}^T\hat{\mathbf{y}}$$

Differentiating and equating to 0:

$$\frac{d\,Err}{d\hat{\mathbf{y}}} = -2E\big[\mathbf{y} \,\big|\, \mathbf{x}\big] + 2\hat{\mathbf{y}} = 0 \quad\Rightarrow\quad \hat{\mathbf{y}} = E\big[\mathbf{y} \,\big|\, \mathbf{x}\big]$$

The MMSE estimate is the mean of the distribution. For the Gaussian, MAP = MMSE: the most likely value is also the mean value. This would be true of any symmetric distribution.

MMSE estimates for mixture distributions

Let P(y|x) be a mixture density:

$$P(\mathbf{y}\,|\,\mathbf{x}) = \sum_k P(k)\,P(\mathbf{y}\,|\,k,\mathbf{x})$$

The MMSE estimate of y is given by

$$E[\mathbf{y}\,|\,\mathbf{x}] = \int \mathbf{y}\sum_k P(k)\,P(\mathbf{y}\,|\,k,\mathbf{x})\,d\mathbf{y} = \sum_k P(k)\int \mathbf{y}\,P(\mathbf{y}\,|\,k,\mathbf{x})\,d\mathbf{y} = \sum_k P(k)\,E[\mathbf{y}\,|\,k,\mathbf{x}]$$

Just a weighted combination of the MMSE estimates from the component distributions.


MMSE estimates from a Gaussian mixture

Let P(x, y) be a Gaussian mixture: with z = [y; x],

$$P(\mathbf{x}, \mathbf{y}) = P(\mathbf{z}) = \sum_k P(k)\,N(\mathbf{z};\, \mu_k, \Sigma_k)$$

Then P(y|x) is also a Gaussian mixture:

$$P(\mathbf{y}\,|\,\mathbf{x}) = \frac{P(\mathbf{x},\mathbf{y})}{P(\mathbf{x})} = \sum_k \frac{P(k)\,P(\mathbf{x}\,|\,k)\,P(\mathbf{y}\,|\,k,\mathbf{x})}{P(\mathbf{x})} = \sum_k P(k\,|\,\mathbf{x})\,P(\mathbf{y}\,|\,k,\mathbf{x})$$

MMSE estimates from a Gaussian mixture

For each component of the Gaussian mixture,

$$P(\mathbf{x}, \mathbf{y}\,|\,k) = N\!\left(\begin{bmatrix}\mathbf{y}\\ \mathbf{x}\end{bmatrix};\, \begin{bmatrix}\mu_{y,k}\\ \mu_{x,k}\end{bmatrix},\, \begin{bmatrix}\mathbf{C}_{yy,k} & \mathbf{C}_{yx,k}\\ \mathbf{C}_{xy,k} & \mathbf{C}_{xx,k}\end{bmatrix}\right)$$

$$P(\mathbf{y}\,|\,k,\mathbf{x}) = N\!\big(\mathbf{y};\; \mu_{y,k} + \mathbf{C}_{yx,k}\mathbf{C}_{xx,k}^{-1}(\mathbf{x} - \mu_{x,k}),\; \Theta_k\big)$$

$$P(\mathbf{y}\,|\,\mathbf{x}) = \sum_k P(k\,|\,\mathbf{x})\; N\!\big(\mathbf{y};\; \mu_{y,k} + \mathbf{C}_{yx,k}\mathbf{C}_{xx,k}^{-1}(\mathbf{x} - \mu_{x,k}),\; \Theta_k\big)$$

MMSE estimates from a Gaussian mixture

P(y|x) is a mixture density, so E[y|x] is also a mixture:

$$E[\mathbf{y}\,|\,\mathbf{x}] = \sum_k P(k\,|\,\mathbf{x})\,E[\mathbf{y}\,|\,k,\mathbf{x}] = \sum_k P(k\,|\,\mathbf{x})\,\big(\mu_{y,k} + \mathbf{C}_{yx,k}\mathbf{C}_{xx,k}^{-1}(\mathbf{x} - \mu_{x,k})\big)$$

MMSE estimates from a Gaussian mixture

A mixture of estimates from the individual Gaussians.

MMSE with GMM: Voice Transformation

Festvox GMM transformation suite (Toda); awb, bdl, jmk, slt voices.

Voice Morphing

Align training recordings from both speakers (cepstral vector sequences). Learn a GMM on the joint vectors. Given speech from one speaker, find the MMSE estimate of the other. Synthesize from the cepstra.


A problem with regressions

$$\mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T$$

The ML fit is sensitive: the error is squared, so small variations in the data produce large variations in the weights, and outliers affect it adversely. It is also unstable: if the dimension of X ≥ the number of instances, (XX^T) is not invertible.

MAP estimation of weights

y = a^T X + e. Assume the weights are drawn from a Gaussian: P(a) = N(0, σ²I).

Maximum likelihood estimate:

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}} \log P(\mathbf{y}\,|\,\mathbf{X}; \mathbf{a})$$

Maximum a posteriori estimate:

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}} \log P(\mathbf{a}\,|\,\mathbf{y}, \mathbf{X}) = \arg\max_{\mathbf{a}} \log P(\mathbf{y}\,|\,\mathbf{X}, \mathbf{a})\,P(\mathbf{a})$$

MAP estimation of weights

Similar to the ML estimate, with an additional term:

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}} \log P(\mathbf{y}\,|\,\mathbf{X}, \mathbf{a})\,P(\mathbf{a})$$

With P(a) = N(0, σ²I), log P(a) = C − log σ − 0.5σ⁻²‖a‖², and

$$\log P(\mathbf{y}\,|\,\mathbf{X},\mathbf{a}) = C' - \frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T$$

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}}\; C'' - \frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T - 0.5\,\sigma^{-2}\,\mathbf{a}^T\mathbf{a}$$

MAP estimate of weights

Differentiating and equating to 0 (with λ the loading factor set by the prior):

$$\frac{dL}{d\mathbf{a}} = 2\mathbf{X}\mathbf{X}^T\mathbf{a} - 2\mathbf{X}\mathbf{y}^T + 2\lambda\mathbf{I}\,\mathbf{a} = 0 \quad\Rightarrow\quad \mathbf{a} = (\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I})^{-1}\mathbf{X}\mathbf{y}^T$$

This is equivalent to diagonal loading of the correlation matrix: it improves the condition number of the correlation matrix, which can then be inverted with greater stability, and it will not affect the estimation from well-conditioned data. Also called Tikhonov regularization; dual form: ridge regression.

(Not to be confused with the MAP estimate of y.) A small sketch follows.
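A short numpy sketch of this diagonally-loaded solution (the regularization weight lam is an illustrative choice):

```python
import numpy as np

def ridge_regression(X, y, lam=1e-2):
    """MAP / Tikhonov-regularized weights: a = (X X^T + lam*I)^{-1} X y^T."""
    D = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y)
```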

MAP estimate priors

(Figure: left, a Gaussian prior on the weights; right, a Laplacian prior.)

MAP estimation of weights with a Laplacian prior

Assume the weights are drawn from a Laplacian: P(a) = λ⁻¹ exp(−λ⁻¹|a|₁).

Maximum a posteriori estimate:

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}}\; C - (\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T - \lambda^{-1}\|\mathbf{a}\|_1$$

There is no closed form solution; a quadratic programming solution is required. Non-trivial.


MAP estimation of weights with a Laplacian prior

Assume the weights are drawn from a Laplacian: P(a) = λ⁻¹ exp(−λ⁻¹|a|₁). The maximum a posteriori estimate

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}}\; C - (\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T - \lambda^{-1}\|\mathbf{a}\|_1$$

is identical to L1-regularized least-squares estimation.

L1-regularized LSE

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}}\; C - (\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T - \lambda^{-1}\|\mathbf{a}\|_1$$

Equivalently, in the dual formulation:

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}}\; C - (\mathbf{y} - \mathbf{a}^T\mathbf{X})(\mathbf{y} - \mathbf{a}^T\mathbf{X})^T \quad\text{subject to}\quad \|\mathbf{a}\|_1 \le t$$

No closed form solution; quadratic programming solutions are required. This is the "LASSO": Least Absolute Shrinkage and Selection Operator. A small iterative sketch follows.
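One simple way to solve this numerically is iterative soft-thresholding (ISTA); this is my own illustrative sketch, not an algorithm from the lecture (which points to LARS and coordinate descent on the next slide):

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam=0.1, iters=500):
    """Minimize ||y - a^T X||^2 + lam*||a||_1 by iterative soft-thresholding."""
    D = X.shape[0]
    a = np.zeros(D)
    step = 0.5 / np.linalg.norm(X @ X.T, 2)      # 1/L, with L the gradient's Lipschitz constant
    for _ in range(iters):
        grad = 2 * (X @ X.T @ a - X @ y)         # gradient of the squared-error term
        a = soft_threshold(a - step * grad, step * lam)
    return a
```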

LASSO Algorithms

Various convex optimization algorithms: LARS (least angle regression), pathwise coordinate descent, .. Matlab code is available from the web.

Regularized least squares

Regularization results in selection of a suboptimal (in the least-squares sense) solution: one of the loci outside the center. Tikhonov regularization selects the shortest solution; L1 regularization selects the sparsest solution.

(Image credit: Tibshirani.)

LASSO and Compressive Sensing

Given Y and X, estimate a sparse a in Y = Xa.

LASSO: X = explanatory variable, Y = dependent variable, a = weights of regression.

CS: X = measurement matrix, Y = measurement, a = data.

An interesting problem: Predicting War!

Economists measure a number of social indicators for countries weekly: a happiness index, a hunger index, a freedom index, Twitter records.

Question: will there be a revolution or war next week?


An interesting problem: Predicting War!

Issues: dissatisfaction builds up; it is usually not an instantaneous phenomenon. War or rebellion builds up much faster, often in hours. It is important to predict, for security preparedness and because of the economic impact.

Predicting War

Given: a sequence of economic indicators for each week, and a sequence of unrest markers for each week. At the end of each week we know whether war happened that week. Predict the probability of unrest next week; this could be a new unrest or the persistence of a current one.

(Figure: a chain over weeks wk1..wk8, with war/indicator variables W, S per week and observations O1..O8.)

A Step Aside: Predicting Time Series

An HMM is a model for time-series data. How can we use it to predict the future?

Predicting with an HMM

Given: observations O1..Ot and all HMM parameters, learned from some training data. We must estimate the future observation Ot+1. The estimate must consider the entire history (O1..Ot); we have no knowledge of the actual state of the process at any time.

Predicting with an HMM

Given O1..Ot, compute P(O1..Ot, s) using the forward algorithm, which computes α(s, t):

$$\alpha(s, t) = P(O_1, O_2, \ldots, O_t,\; \mathrm{state}(t) = s)$$

$$P(s_t = s\,|\,O_{1..t}) = \frac{P(s_t = s,\, O_{1..t})}{P(O_{1..t})} = \frac{\alpha(s, t)}{\sum_{s'}\alpha(s', t)}$$

Predicting with an HMM

Given P(s_t = s | O_{1..t}) for all s:

$$P(s_{t+1} = s\,|\,O_{1..t}) = \sum_{s'} P(s_t = s'\,|\,O_{1..t})\,P(s\,|\,s')$$

$$P(O_{t+1}, s\,|\,O_{1..t}) = P(O\,|\,s)\,P(s_{t+1} = s\,|\,O_{1..t})$$

$$P(O_{t+1}\,|\,O_{1..t}) = \sum_s P(O_{t+1}, s\,|\,O_{1..t}) = \sum_s P(O\,|\,s)\,P(s_{t+1} = s\,|\,O_{1..t})$$

This is a mixture distribution. A small sketch of the one-step prediction follows.
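A minimal sketch of this one-step prediction for an HMM with known expected emissions per state, assuming the transition matrix and the forward variables at time t are given (names are mine, not from the lecture):

```python
import numpy as np

def predict_next_obs(alpha_t, trans, state_means):
    """MMSE estimate of O_{t+1} from forward variables alpha(s, t).
    alpha_t: (S,) unnormalized forward probabilities at time t
    trans:   (S, S) transition matrix, trans[i, j] = P(s_{t+1}=j | s_t=i)
    state_means: (S, D) expected observation E[O | s] for each state."""
    p_state_t = alpha_t / alpha_t.sum()          # P(s_t = s | O_{1..t})
    p_state_next = p_state_t @ trans             # P(s_{t+1} = s | O_{1..t})
    return p_state_next @ state_means            # sum_s P(s_{t+1}=s | O_{1..t}) E[O|s]
```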


Predicting with an HMM

$$P(O_{t+1}\,|\,O_{1..t}) = \sum_s P(O_{t+1}, s\,|\,O_{1..t}) = \sum_s P(O\,|\,s)\,P(s_{t+1} = s\,|\,O_{1..t})$$

The MMSE estimate of O_{t+1} given O_{1..t}:

$$E[O_{t+1}\,|\,O_{1..t}] = \sum_s P(s_{t+1} = s\,|\,O_{1..t})\,E[O\,|\,s]$$

A weighted sum of the state means.

Predicting with an HMM

MMSE estimate of O_{t+1}: E[O_{t+1} | O_{1..t}] = Σ_s P(s_{t+1} = s | O_{1..t}) E[O|s].

If P(O|s) is a GMM, then E[O|s] = Σ_k w_{k,s} μ_{k,s}, so

$$\hat{O}_{t+1} = \sum_s P(s_{t+1} = s\,|\,O_{1..t}) \sum_k w_{k,s}\,\mu_{k,s}, \qquad P(s_{t+1} = s\,|\,O_{1..t}) = \frac{\sum_{s'}\alpha(s', t)\,P(s\,|\,s')}{\sum_{s'}\alpha(s', t)}$$

Predicting War

Train an HMM on z = [w, s] (war marker and social indicators). After the t-th week, predict the probability distribution

$$P(\mathbf{z}_{t+1}\,|\,\mathbf{z}_1 \ldots \mathbf{z}_t) = P(w, \mathbf{s}\,|\,\mathbf{z}_{1..t})$$

Marginalize out the indicators s (not known for next week):

$$P(w\,|\,\mathbf{z}_{1..t}) = \int P(w, \mathbf{s}\,|\,\mathbf{z}_{1..t})\,d\mathbf{s}$$

War? E[w | z_{1..t}].