Andrew Rosenberg - Lecture 6: Logistic Regression - CSC 84020 - Machine Learning

    Lecture 6: Logistic Regression

    CSC 84020 - Machine Learning

    Andrew Rosenberg

    February 19, 2009

    Last Time

    Regression

    Regularization and Overfitting

    Today

    Logistic Regression

    Classification

    Goal: Identify which of K classes a data point x belongs to.

    Like Regression, Classification is a supervised task.

    For each data point x_i we have a corresponding target (or label, or class) t_i that describes the correct classification of the data point.

    Goal: identify a function y : R^D → C, where t_i ∈ C = {c_0, ..., c_{K-1}}.

    Representations of the target variable

    y : R^D → C, where t_i ∈ C

    For binary (two-way) classification it is convenient to represent t_i as a single scalar variable t_i ∈ {0, 1}.

    This will allow us to interpret t_i as the likelihood that a point x_i is a member of class c_1.

    When hypothesized from a model, this can represent the confidence of the prediction.

    For K > 2 classes, we represent t as a K-element vector where, if a point is a member of class c_j, the j-th element is 1 and all the others are 0. In 5-way classification, a member of class c_2 is

    t = (0, 0, 1, 0, 0)^T

    We may also represent t as a nominal variable when using non-probabilistic models.
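
    A minimal NumPy sketch of this 1-of-K (one-hot) target encoding; the helper name one_hot and the example labels are illustrative choices of mine, not something given in the lecture.

        import numpy as np

        def one_hot(labels, K):
            """Encode integer class labels 0..K-1 as 1-of-K target vectors."""
            t = np.zeros((len(labels), K))
            t[np.arange(len(labels)), labels] = 1.0
            return t

        # A member of class c_2 in 5-way classification:
        print(one_hot([2], K=5))   # [[0. 0. 1. 0. 0.]]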

    Three approaches to Classification

    Generative Approach: highest resource requirements; we need to approximate the joint probability p(x, c_j).

    p(c_j|x) = \frac{p(x|c_j)\, p(c_j)}{p(x)}

    Discriminative approach: moderate resource requirements; typically fewer parameters to approximate than in a generative model.

    p(c_j|x)

    Discriminant function: can be trained probabilistically, but the output does not include confidence information.

    f(x) = c_j

    Discriminant Functions

    Why Discriminant Functions are limiting

    What can Generative and Discriminative approaches do that Discriminant Functions cannot?

    ...Or why we like probabilities

    Minimizing Risk: continuous updating.

    Reject Option: "I don't know."

    Compensating for Priors

    Combining Models

    We'll talk about these more when we discuss Perceptrons and Neural Networks.

    Generative Modeling

    Generative modeling: model the posterior.

    p(c_1|x) = \frac{p(x|c_1)\, p(c_1)}{p(x)}

             = \frac{p(x|c_1)\, p(c_1)}{\sum_j p(x, c_j)}

             = \frac{p(x|c_1)\, p(c_1)}{p(x, c_0) + p(x, c_1)}

             = \frac{p(x|c_1)\, p(c_1)}{p(x|c_0)\, p(c_0) + p(x|c_1)\, p(c_1)}

    Sigmoid function

    The sigmoid¹ function is a squashing function.

    \sigma(x) = \frac{1}{1 + \exp(-x)}

    A squashing function maps the reals to a finite interval.

    σ : R → (0, 1)

    ¹ S-shaped
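
    A quick numerical sketch of this squashing behaviour in NumPy (the function name and test points are mine, not from the slides):

        import numpy as np

        def sigmoid(x):
            """The sigmoid squashes all of R into the open interval (0, 1)."""
            return 1.0 / (1.0 + np.exp(-x))

        xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
        print(sigmoid(xs))    # values strictly between 0 and 1
        print(sigmoid(0.0))   # 0.5, the midpoint of the S-shaped curve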

    Generative Modeling

    p(c_1|x) = \frac{p(x|c_1)\, p(c_1)}{p(x|c_0)\, p(c_0) + p(x|c_1)\, p(c_1)}

             = \frac{1}{\dfrac{p(x|c_0)\, p(c_0)}{p(x|c_1)\, p(c_1)} + 1}

             = \frac{1}{\exp\left( \ln \dfrac{p(x|c_0)\, p(c_0)}{p(x|c_1)\, p(c_1)} \right) + 1}

             = \frac{1}{\exp\left( -\ln \dfrac{p(x|c_1)\, p(c_1)}{p(x|c_0)\, p(c_0)} \right) + 1}

    Let

    a = \ln \frac{p(x|c_1)\, p(c_1)}{p(x|c_0)\, p(c_0)}

    Then

    p(c_1|x) = \frac{1}{1 + \exp(-a)} = \sigma(a)

    Some more vocabulary

    log-odds or log-odds-ratio

    a = \ln \frac{p(x|c_1)\, p(c_1)}{p(x|c_0)\, p(c_0)}

    logit function: the inverse of the sigmoid.

    \sigma = \frac{1}{1 + \exp(-a)} \qquad a = \ln \frac{\sigma}{1 - \sigma}

    Generative Model

    Derive p(c_0|x) with a Gaussian class conditional probability.

    p(x|c_k) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\}

    We'll assume that p(x|c_0) and p(x|c_1) have equal covariance matrices.

    We want to show that p(c_0|x) = σ(w^T x).

    p(c_0|x) = \sigma(a)

    a = \ln \frac{p(x|c_0)\, p(c_0)}{p(x|c_1)\, p(c_1)}

    a = \ln\left\{ \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right\} \right\}
      - \ln\left\{ \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \right\} \right\}
      + \ln \frac{p(c_0)}{p(c_1)}

    Generative model

    p(c_0|x) = \sigma(a)

    Expanding the quadratic forms in a (using the facts that if A is symmetric then A = A^T and x^T A y = y^T A x; see HW):

    a = -\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) + \frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \ln\frac{p(c_0)}{p(c_1)}

      = -\frac{1}{2}\left( x^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0 - x^T \Sigma^{-1} \mu_0 - \mu_0^T \Sigma^{-1} x \right)
        + \frac{1}{2}\left( x^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 - x^T \Sigma^{-1} \mu_1 - \mu_1^T \Sigma^{-1} x \right)
        + \ln\frac{p(c_0)}{p(c_1)}

    The x^T \Sigma^{-1} x terms cancel, and by symmetry x^T \Sigma^{-1} \mu_k = \mu_k^T \Sigma^{-1} x, so

      = -\frac{1}{2}\left( \mu_0^T \Sigma^{-1} \mu_0 - 2\mu_0^T \Sigma^{-1} x \right) + \frac{1}{2}\left( \mu_1^T \Sigma^{-1} \mu_1 - 2\mu_1^T \Sigma^{-1} x \right) + \ln\frac{p(c_0)}{p(c_1)}

      = (\mu_0 - \mu_1)^T \Sigma^{-1} x - \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 + \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \ln\frac{p(c_0)}{p(c_1)}

    So a is a linear function of x:

    a = w^T x + w_0

    p(c_0|x) = \sigma(w^T x + w_0)

    w = \Sigma^{-1}(\mu_0 - \mu_1)

    w_0 = -\frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 + \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \ln\frac{p(c_0)}{p(c_1)}
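
    A small numerical check of this result, assuming shared covariance: compute w and w_0 from μ_0, μ_1, Σ and the priors, then compare σ(w^T x + w_0) against the posterior obtained directly from Bayes' rule with Gaussian class conditionals. The use of scipy and all the variable names and values are my own choices for illustration.

        import numpy as np
        from scipy.stats import multivariate_normal

        # Example parameters (shared covariance), chosen arbitrarily.
        mu0, mu1 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
        Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
        p0, p1 = 0.4, 0.6                        # priors p(c_0), p(c_1)

        Sinv = np.linalg.inv(Sigma)
        w = Sinv @ (mu0 - mu1)
        w0 = -0.5 * mu0 @ Sinv @ mu0 + 0.5 * mu1 @ Sinv @ mu1 + np.log(p0 / p1)

        x = np.array([0.2, -0.7])
        posterior_linear = 1.0 / (1.0 + np.exp(-(w @ x + w0)))   # sigma(w^T x + w_0)

        # Direct Bayes' rule with Gaussian class conditionals.
        lik0 = multivariate_normal.pdf(x, mean=mu0, cov=Sigma)
        lik1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma)
        posterior_bayes = lik0 * p0 / (lik0 * p0 + lik1 * p1)

        print(posterior_linear, posterior_bayes)   # the two should agree up to rounding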

    Maximum Likelihood Solution

    Now we have a way to describe the linear transformation of x that generates a prediction under a Gaussian assumption.

    How do we estimate the parameters?

    Maximize the likelihood function with respect to each parameter.

    p(t, X \mid \pi, \mu_0, \mu_1, \Sigma) = \prod_{n=0}^{N-1} \left( \pi\, \mathcal{N}(x_n|\mu_0, \Sigma) \right)^{t_n} \left( (1 - \pi)\, \mathcal{N}(x_n|\mu_1, \Sigma) \right)^{1 - t_n}

    t_n = 1 for class 0, t_n = 0 for class 1.

    Prior class probabilities: p(c_0) = π, p(c_1) = 1 - π.
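
    A sketch of this likelihood in log form (for numerical stability), again assuming shared-covariance Gaussian class conditionals; scipy supplies the Gaussian density, and the function name and toy data are mine.

        import numpy as np
        from scipy.stats import multivariate_normal

        def log_likelihood(t, X, pi, mu0, mu1, Sigma):
            """ln p(t, X | pi, mu0, mu1, Sigma); t_n = 1 marks class 0."""
            logN0 = multivariate_normal.logpdf(X, mean=mu0, cov=Sigma)
            logN1 = multivariate_normal.logpdf(X, mean=mu1, cov=Sigma)
            return np.sum(t * (np.log(pi) + logN0) + (1 - t) * (np.log(1 - pi) + logN1))

        # Toy data: three points, the first two labelled class 0 (t_n = 1).
        X = np.array([[1.0, 0.2], [0.8, -0.1], [-1.2, 0.4]])
        t = np.array([1.0, 1.0, 0.0])
        print(log_likelihood(t, X, pi=2/3, mu0=np.array([0.9, 0.05]),
                             mu1=np.array([-1.2, 0.4]), Sigma=np.eye(2)))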

    Maximum Likelihood Solution

    Optimize π.

    p(t, X \mid \pi, \mu_0, \mu_1, \Sigma) = \prod_{n=0}^{N-1} \left( \pi\, \mathcal{N}(x_n|\mu_0, \Sigma) \right)^{t_n} \left( (1 - \pi)\, \mathcal{N}(x_n|\mu_1, \Sigma) \right)^{1 - t_n}

    \ln p(t, X \mid \pi, \mu_0, \mu_1, \Sigma) = \ln \prod_{n=0}^{N-1} \left( \pi\, \mathcal{N}(x_n|\mu_0, \Sigma) \right)^{t_n} \left( (1 - \pi)\, \mathcal{N}(x_n|\mu_1, \Sigma) \right)^{1 - t_n}

    = \sum_{n=0}^{N-1} t_n \ln\left( \pi\, \mathcal{N}(x_n|\mu_0, \Sigma) \right) + (1 - t_n) \ln\left( (1 - \pi)\, \mathcal{N}(x_n|\mu_1, \Sigma) \right)

    = \sum_{n=0}^{N-1} t_n \ln \pi + (1 - t_n) \ln(1 - \pi) + \text{const}

    \ln p(t, X \mid \pi, \mu_0, \mu_1, \Sigma) = \sum_{n=0}^{N-1} t_n \ln \pi + (1 - t_n) \ln(1 - \pi) + \text{const}

    \frac{\partial}{\partial \pi} \ln p(t, X \mid \pi, \mu_0, \mu_1, \Sigma) = \frac{1}{\pi} \sum_{n=0}^{N-1} t_n - \frac{1}{1 - \pi} \sum_{n=0}^{N-1} (1 - t_n) = 0

    \frac{1}{\pi} \sum_{n=0}^{N-1} t_n = \frac{1}{1 - \pi} \sum_{n=0}^{N-1} (1 - t_n)

    \frac{1 - \pi}{\pi} \sum_{n=0}^{N-1} t_n = \sum_{n=0}^{N-1} (1 - t_n)

    \frac{1}{\pi} \sum_{n=0}^{N-1} t_n = \sum_{n=0}^{N-1} (1 - t_n) + \sum_{n=0}^{N-1} t_n

    \frac{1}{\pi} \sum_{n=0}^{N-1} t_n = \sum_{n=0}^{N-1} 1 = N

    \pi = \frac{1}{N} \sum_{n=0}^{N-1} t_n = \frac{N_0}{N} = \frac{N_0}{N_0 + N_1}

    Be prepared to maximize μ_0 and Σ for HW.
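
    The closed-form estimate π = N_0 / N takes one line to check (a sketch with made-up labels; t_n = 1 marks class 0, as above):

        import numpy as np

        t = np.array([1, 1, 0, 1, 0, 0, 0, 1])    # t_n = 1 for class 0
        pi_hat = t.mean()                          # (1/N) sum_n t_n = N_0 / N
        print(pi_hat)                              # 0.5 here: 4 of the 8 points are class 0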

    Discriminative Linear Classification

    In the generative case, recall:

    p(t|x) = \frac{p(x|t)\, p(t)}{p(x)}

    Can generate synthetic data from p(x).

    Need to model the joint probability.

    In Discriminative Modeling

    Model p(t|x) directly

    Logistic Regression

    From the generative case we can find that, under some assumptions:

    p(t|x) = y(x) = σ(w^T x)

    In M dimensions this has M parameters. In the generative case there are 2M parameters for the means and M(M+1)/2 for the covariance matrix.²

    The number of parameters grows linearly in M for the discriminative model, but quadratically in M for the generative model.

    So we'd rather optimize this function directly.

    ² Covariance matrices are symmetric.

    Maximum likelihood

    Define the Likelihood.

    p(t|w) = \prod_{n=0}^{N-1} p(c_0|x_n)^{t_n}\, p(c_1|x_n)^{1 - t_n}

    E(w) = -\ln p(t|w) = -\sum_{n=0}^{N-1} \left\{ t_n \ln p(c_0|x_n) + (1 - t_n) \ln p(c_1|x_n) \right\}

    Where y_n = p(c_0|x_n) = σ(a_n), so p(c_1|x_n) = 1 - y_n, and

    E(w) = -\ln p(t|w) = -\sum_{n=0}^{N-1} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}

    This is also the cross-entropy error function.³

    ³ Logistic Regression is also called maximum entropy or maxent.
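
    A direct transcription of E(w) as code, with y_n = σ(w^T x_n); the function names and the toy values are placeholders of mine:

        import numpy as np

        def sigmoid(a):
            return 1.0 / (1.0 + np.exp(-a))

        def cross_entropy_error(w, X, t):
            """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ], with y_n = sigmoid(w^T x_n)."""
            y = sigmoid(X @ w)
            return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

        X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])   # rows are the x_n
        t = np.array([1.0, 0.0, 1.0])
        print(cross_entropy_error(np.array([0.1, -0.2]), X, t))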

    Maximum Likelihood

    E(w) = -\ln p(t|w) = -\sum_{n=0}^{N-1} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}

    Apply the chain rule:

    \nabla_w E = \sum_{n=0}^{N-1} \frac{\partial E}{\partial y_n} \frac{\partial y_n}{\partial a_n} \nabla_w a_n

    Maximum Likelihood

    Derivation of ∂E/∂y_n.

    E(w) = -\sum_{n=0}^{N-1} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}

    \frac{\partial E}{\partial y_n} = \frac{1 - t_n}{1 - y_n} - \frac{t_n}{y_n}

    = \frac{y_n(1 - t_n) - t_n(1 - y_n)}{y_n(1 - y_n)}

    = \frac{y_n - y_n t_n - t_n + y_n t_n}{y_n(1 - y_n)}

    = \frac{y_n - t_n}{y_n(1 - y_n)}
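
    A finite-difference sanity check of this result for a single term of the sum (purely illustrative; the values are arbitrary):

        import numpy as np

        def E_term(y, t):
            # one term of E(w): -(t ln y + (1 - t) ln(1 - y))
            return -(t * np.log(y) + (1 - t) * np.log(1 - y))

        y, t, eps = 0.3, 1.0, 1e-6
        numeric = (E_term(y + eps, t) - E_term(y - eps, t)) / (2 * eps)
        analytic = (y - t) / (y * (1 - y))
        print(numeric, analytic)   # both approximately -3.333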

    Maximum Likelihood

    Derivation of ∂y_n/∂a_n.

    \frac{d\sigma}{da} = \frac{d}{da} \frac{1}{1 + \exp(-a)}

    = \frac{d (1 + \exp(-a))^{-1}}{da}

    = (-1)(1 + \exp(-a))^{-2} (\exp(-a)) (-1)

    = (1 + \exp(-a))^{-2} \exp(-a)

    = \frac{1}{1 + \exp(-a)} \cdot \frac{\exp(-a)}{1 + \exp(-a)}

    = \frac{1}{1 + \exp(-a)} \cdot \frac{1 + \exp(-a) - 1}{1 + \exp(-a)}

    = \frac{1}{1 + \exp(-a)} \left( \frac{1 + \exp(-a)}{1 + \exp(-a)} - \frac{1}{1 + \exp(-a)} \right)

    = \sigma (1 - \sigma)
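
    The same kind of finite-difference check for dσ/da = σ(1 - σ), at an arbitrary point of my choosing:

        import numpy as np

        def sigmoid(a):
            return 1.0 / (1.0 + np.exp(-a))

        a, eps = 0.7, 1e-6
        numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
        analytic = sigmoid(a) * (1 - sigmoid(a))
        print(numeric, analytic)   # both approximately 0.222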

    Maximum Likelihood

    Derivation of ∇_w a_n.

    a_n = w^T x_n

    \nabla_w a_n = x_n

    Maximum Likelihood

    Putting it all together

    \nabla_w E = \sum_{n=0}^{N-1} \frac{\partial E}{\partial y_n} \frac{\partial y_n}{\partial a_n} \nabla_w a_n

    = \sum_{n=0}^{N-1} \frac{y_n - t_n}{y_n(1 - y_n)} \, \big( y_n(1 - y_n) \big) \, x_n

    = \sum_{n=0}^{N-1} (y_n - t_n)\, x_n

    This is the same as the gradient of the sum-of-squares error in linear regression.
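
    A quick check that the assembled gradient matches a finite-difference gradient of E(w); the random data and names here are illustrative only:

        import numpy as np

        def sigmoid(a):
            return 1.0 / (1.0 + np.exp(-a))

        def E(w, X, t):
            y = sigmoid(X @ w)
            return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

        def grad_E(w, X, t):
            """Analytic gradient: sum_n (y_n - t_n) x_n."""
            return X.T @ (sigmoid(X @ w) - t)

        rng = np.random.default_rng(0)
        X = rng.normal(size=(20, 3))
        t = (rng.uniform(size=20) < 0.5).astype(float)
        w = rng.normal(size=3)

        numeric = np.array([(E(w + d, X, t) - E(w - d, X, t)) / 2e-6
                            for d in np.eye(3) * 1e-6])
        print(np.allclose(numeric, grad_E(w, X, t), atol=1e-4))   # True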

    How do we optimize this?

    We know the gradient, but how do we find the optimal value of w?

    \nabla_w E = \sum_{n=0}^{N-1} (y_n - t_n)\, x_n

    Numerical Approximations

    Gradient Descent (equivalently, gradient ascent on the log likelihood)

    w_{n+1} = w_n - \eta \nabla_w E(w_n)

    Guess.

    Jump in the direction of the negative gradient.

    Guess again.
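
    Putting the pieces together, a minimal gradient descent loop for logistic regression (guess, step against the gradient, repeat); the data, learning rate, and iteration count are placeholders, not values from the lecture:

        import numpy as np

        def sigmoid(a):
            return 1.0 / (1.0 + np.exp(-a))

        def fit_logistic(X, t, eta=0.1, n_iters=2000):
            """Minimize the cross-entropy error E(w) by gradient descent."""
            w = np.zeros(X.shape[1])              # initial guess
            for _ in range(n_iters):
                y = sigmoid(X @ w)                # y_n = sigma(w^T x_n)
                grad = X.T @ (y - t)              # sum_n (y_n - t_n) x_n
                w = w - eta * grad                # step along the negative gradient
            return w

        # Toy 1-D problem with a bias feature in the first column.
        X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
        t = np.array([0.0, 0.0, 1.0, 1.0])
        w = fit_logistic(X, t)
        print(w, sigmoid(X @ w))   # predicted probabilities track the targets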

    Example of Gradient Descent

    [Figure: successive gradient descent steps on a one-dimensional function f(x), with the current point marked at each step.]

    x_0 = 5, \quad f'(x_0) = 10, \quad \eta = 0.2

    x_1 = x_0 - \eta f'(x_0) = 5 - 0.2 \cdot 10 = 3

    x_1 = 3, \quad f'(x_1) = 6, \quad \eta = 0.2

    x_2 = x_1 - \eta f'(x_1) = 3 - 0.2 \cdot 6 = 1.8

    x_2 = 1.8, \quad f'(x_2) = 3.6, \quad \eta = 0.2

    x_3 = x_2 - \eta f'(x_2) = 1.8 - 0.2 \cdot 3.6 = 1.08

    x_3 = 1.08, \quad f'(x_3) = 2.16, \quad \eta = 0.2

    x_4 = x_3 - \eta f'(x_3) = 1.08 - 0.2 \cdot 2.16 = 0.648
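
    The steps above can be reproduced in a few lines. The function being minimized is not stated on the slides; f(x) = x^2 (so f'(x) = 2x) is my assumption, chosen because it matches the derivative values 10, 6, 3.6, 2.16 shown:

        # Reproduce the 1-D gradient descent iterates, assuming f(x) = x^2.
        def f_prime(x):
            return 2 * x

        x, eta = 5.0, 0.2
        for _ in range(4):
            x = x - eta * f_prime(x)
            print(x)   # 3.0, 1.8, 1.08, 0.648, matching the worked example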

    Another approach to N-way classification

    In this derivation we used a 1-of-K representation with a K-class discriminant function.

    Another approach is to construct K - 1 binary classifiers.

    Each classifier C_n compares c_n to "not c_n".

    Binary Classifiers are simpler.

    But there are some problems with this approach.
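
    A rough sketch of the idea: train one binary logistic classifier per class against "everything else" and predict the class whose classifier is most confident. The slides describe K - 1 classifiers; for simplicity this sketch trains one per class and picks the argmax, which is one common variant. All code and data here are my own illustration.

        import numpy as np

        def sigmoid(a):
            return 1.0 / (1.0 + np.exp(-a))

        def fit_logistic(X, t, eta=0.1, n_iters=2000):
            w = np.zeros(X.shape[1])
            for _ in range(n_iters):
                w -= eta * (X.T @ (sigmoid(X @ w) - t))   # gradient descent on E(w)
            return w

        def fit_one_vs_rest(X, labels, K):
            """One binary classifier per class: c_k versus 'not c_k'."""
            return [fit_logistic(X, (labels == k).astype(float)) for k in range(K)]

        def predict(ws, X):
            # Pick the class whose binary classifier gives the highest probability.
            scores = np.stack([sigmoid(X @ w) for w in ws], axis=1)
            return scores.argmax(axis=1)

        # Three small clusters (bias feature in the first column).
        X = np.array([[1.0, 0.0, 0.0], [1.0, 0.2, -0.1],
                      [1.0, 2.0, 0.0], [1.0, 2.1, 0.2],
                      [1.0, 1.0, 2.0], [1.0, 0.9, 2.1]])
        labels = np.array([0, 0, 1, 1, 2, 2])
        ws = fit_one_vs_rest(X, labels, K=3)
        print(predict(ws, X))   # expected: [0 0 1 1 2 2]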

    One versus the rest

    K-class discriminant

    Context

    Logistic Regression

    Powerful classification technique.

    Must be approximated; there is no closed-form solution.

    Assumption of linearity.

    Can also be extended with basis functions.

    Also called maximum entropy.

    Bye

    Next

    Graphical Models
