
WINSEM2014-15 CP3599 07-Jan-2015 RM01 LogisticRegression


  • Review of Lecture 8

    Bias and variance

    Expected value of E_out w.r.t. D:

    E_D[E_out(g^(D)(x))] = bias + var

    [Figures: the target f and the hypothesis set H, with bias measuring how far H is from f and var measuring the spread of g^(D)(x) around the average hypothesis ḡ(x), shown against f(x).]

    Learning curves

    How E_in and E_out vary with N

    [Figures: two learning curves of Expected Error vs. Number of Data Points, N (error roughly 0.16 to 0.22 for N from 20 to 80). Left: E_out decomposed into bias + variance. Right: E_out decomposed into in-sample error + generalization error, plotted with E_in.]

    Two readings of the same curve (N relative to the VC dimension). B-V: E_out = bias + variance. VC: E_out = E_in + generalization error.

  • Learning From Data

    Yaser S. Abu-Mostafa

    California Institute of Technology

    Lecture 9: The Linear Model II

    Sponsored by Caltech's Provost Office, E&AS Division, and IST. Tuesday, May 1, 2012

  • Where we are

    Linear classification ✓

    Linear regression ✓

    Logistic regression ?

    Nonlinear transforms ✓

    Learning From Data - Lecture 9 2/24

  • Nonlinear transforms

    x = (x_0, x_1, ..., x_d)  →  z = (z_0, z_1, ..., z_d̃)

    Each z_i = φ_i(x);  z = Φ(x)

    Example: z = (1, x_1, x_2, x_1 x_2, x_1^2, x_2^2)

    Final hypothesis g(x) in X space:

    sign(w̃^T Φ(x))  or  w̃^T Φ(x)

    Learning From Data - Lecture 9 3/24
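
    As a quick illustration (my own sketch, not from the lecture; the names phi and g_x are mine), the example transform and the final hypothesis in X space transcribe directly:

        import numpy as np

        def phi(x):
            # Example transform: x = (x1, x2) maps to the 6-dimensional z above
            x1, x2 = x
            return np.array([1.0, x1, x2, x1 * x2, x1**2, x2**2])

        def g_x(x, w_tilde):
            # Final hypothesis in X space: sign(w~^T Phi(x))
            return np.sign(w_tilde @ phi(x))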

  • The price we pay

    x = (x_0, x_1, ..., x_d)  →  z = (z_0, z_1, ..., z_d̃)

    w  →  w̃

    d_vc = d + 1  →  d_vc ≤ d̃ + 1

    Learning From Data - Lecture 9 4/24

  • Two non-separable cases

    [Figures: two data sets plotted in X space. Left: nearly linearly separable data with a few offending points. Right: data that no line separates; the positive region is roughly circular.]

    Learning From Data - Lecture 9 5/24

  • First case

    [Figure: the first data set, nearly linearly separable.]

    Use a linear model in X; accept E_in > 0

    or

    Insist on E_in = 0; go to a high-dimensional Z

    Learning From Data - Lecture 9 6/24

  • Second case

    [Figure: the second data set; the positive region is a circle centered at the origin.]

    z = (1, x_1, x_2, x_1 x_2, x_1^2, x_2^2)

    Why not: z = (1, x_1^2, x_2^2)

    or better yet: z = (1, x_1^2 + x_2^2)

    or even: z = (x_1^2 + x_2^2 - 0.6)

    Learning From Data - Lecture 9 7/24
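
    To see why even the single feature works, here is a small sketch of my own (data and threshold chosen to match the slide's -0.6): the sign of x_1^2 + x_2^2 - 0.6 alone classifies points inside vs. outside the circle.

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.uniform(-1, 1, size=(1000, 2))        # points in [-1, 1]^2
        y = np.sign(X[:, 0]**2 + X[:, 1]**2 - 0.6)    # target: circular boundary

        z = X[:, 0]**2 + X[:, 1]**2 - 0.6             # the one transformed feature
        print(np.mean(np.sign(z) == y))               # 1.0: separable in Z space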

  • Lesson learned

    Looking at the data before choosing the model can be hazardous to your E_out

    Data snooping

    Learning From Data - Lecture 9 8/24

  • Logistic regression - Outline

    The model

    Error measure

    Learning algorithm

    Learning From Data - Lecture 9 9/24

  • A third linear model

    s = Σ_{i=0}^{d} w_i x_i

    linear classification: h(x) = sign(s)    linear regression: h(x) = s    logistic regression: h(x) = θ(s)

    [Figure: three diagrams, one per model; inputs x_0, x_1, x_2, ..., x_d combine into the signal s, which feeds h(x) through a hard threshold, the identity, or the logistic function respectively.]

    Learning From Data - Lecture 9 10/24

  • The logistic function

    The formula:

    θ(s) = e^s / (1 + e^s)

    [Figure: plot of θ(s), rising from 0 toward 1 as s goes from -4 to 4, crossing 0.5 at s = 0.]

    soft threshold: uncertainty

    sigmoid: flattened out 's'

    Learning From Data - Lecture 9 11/24
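
    The formula transcribes directly (a minimal sketch; for large |s| a numerically stable version would split the computation by the sign of s):

        import numpy as np

        def theta(s):
            # Logistic function: theta(s) = e^s / (1 + e^s)
            return np.exp(s) / (1.0 + np.exp(s))

        print(theta(0.0))    # 0.5: the soft threshold is centered at s = 0
        print(theta(4.0))    # ~0.982, already close to the hard threshold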

  • Probability interpretation

    h(x) = θ(s) is interpreted as a probability

    Example. Prediction of heart attacks

    Input x: cholesterol level, age, weight, etc.

    θ(s): probability of a heart attack

    The signal s = w^T x acts as a "risk score"

    h(x) = θ(s)

    Learning From Data - Lecture 9 12/24

  • Genuine probability

    Data (x, y) with binary y, generated by a noisy target:

    P(y | x) = f(x) for y = +1;  1 - f(x) for y = -1.

    The target f: R^d → [0, 1] is the probability

    Learn g(x) = θ(w^T x) ≈ f(x)

    Learning From Data - Lecture 9 13/24

  • Error measure

    For each (x, y), y is generated by the probability f(x)

    Plausible error measure based on likelihood:

    If h = f, how likely is it to get y from x?

    P(y | x) = h(x) for y = +1;  1 - h(x) for y = -1.

    Learning From Data - Lecture 9 14/24

  • Formula for likelihood

    P(y | x) = h(x) for y = +1;  1 - h(x) for y = -1.

    Substitute h(x) = θ(w^T x), noting θ(-s) = 1 - θ(s):

    [Figure: the plot of θ(s) again, illustrating the symmetry θ(-s) = 1 - θ(s).]

    P(y | x) = θ(y w^T x)

    Likelihood of D = (x_1, y_1), ..., (x_N, y_N) is

    Π_{n=1}^{N} P(y_n | x_n) = Π_{n=1}^{N} θ(y_n w^T x_n)

    Learning From Data - Lecture 9 15/24
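
    Numerically one works with the log of that product; a small sketch of my own (names mine), using log θ(s) = -ln(1 + e^{-s}):

        import numpy as np

        def log_likelihood(w, X, y):
            # log of Prod_n theta(y_n w^T x_n); X is N x (d+1), y has entries +-1
            s = y * (X @ w)                        # signed signals y_n w^T x_n
            return -np.sum(np.log1p(np.exp(-s)))   # sum of log theta(s_n)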

  • Maximizing the likelihood

    Minimize

    -(1/N) ln( Π_{n=1}^{N} θ(y_n w^T x_n) )

    = (1/N) Σ_{n=1}^{N} ln( 1 / θ(y_n w^T x_n) )        [ θ(s) = 1 / (1 + e^{-s}) ]

    E_in(w) = (1/N) Σ_{n=1}^{N} ln(1 + e^{-y_n w^T x_n})

    where e(h(x_n), y_n) = ln(1 + e^{-y_n w^T x_n}) is the "cross-entropy" error

    Learning From Data - Lecture 9 16/24
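
    In code, E_in is just the mean of that pointwise error, i.e. -1/N times the log-likelihood above (again a sketch, names mine):

        import numpy as np

        def cross_entropy_error(w, X, y):
            # E_in(w) = (1/N) sum_n ln(1 + e^{-y_n w^T x_n})
            s = y * (X @ w)
            return np.mean(np.log1p(np.exp(-s)))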

  • Logistic regression - Outline

    The model

    Error measure

    Learning algorithm

    Learning From Data - Lecture 9 17/24

  • How to minimize E_in

    For logistic regression,

    E_in(w) = (1/N) Σ_{n=1}^{N} ln(1 + e^{-y_n w^T x_n})        → iterative solution

    Compare to linear regression:

    E_in(w) = (1/N) Σ_{n=1}^{N} (w^T x_n - y_n)^2        → closed-form solution

    Learning From Data - Lecture 9 18/24
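
    For contrast, the linear-regression minimizer really is one line via the pseudo-inverse (a sketch; no such closed form exists for the logistic E_in, hence the iterative method next):

        import numpy as np

        def linear_regression(X, y):
            # w = X^dagger y minimizes the squared in-sample error
            return np.linalg.pinv(X) @ y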

  • Iterative method: gradient descent

    [Figure: the in-sample error E_in(w) plotted over the weights w, a curved surface descending to a minimum.]

    General method for nonlinear optimization

    Start at w(0); take a step along the steepest slope

    Fixed step size: w(1) = w(0) + η v̂

    What is the direction v̂?

    Learning From Data - Lecture 9 19/24

  • Formula for the direction v̂

    ΔE_in = E_in(w(0) + η v̂) - E_in(w(0))

          = η ∇E_in(w(0))^T v̂ + O(η^2)

          ≥ -η ‖∇E_in(w(0))‖

    Since v̂ is a unit vector,

    v̂ = -∇E_in(w(0)) / ‖∇E_in(w(0))‖

    Learning From Data - Lecture 9 20/24
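
    A quick numerical check of that claim, entirely my own construction: on a random data set, no random unit direction decreases E_in faster than the negative normalized gradient.

        import numpy as np

        rng = np.random.default_rng(1)
        X = rng.normal(size=(100, 3))
        y = np.sign(rng.normal(size=100))

        def E_in(w):
            return np.mean(np.log1p(np.exp(-y * (X @ w))))

        def grad(w):
            # gradient of the cross-entropy error (derived on a later slide)
            return -np.mean((y / (1 + np.exp(y * (X @ w))))[:, None] * X, axis=0)

        w0, eta = np.zeros(3), 0.01
        g = grad(w0)
        v_star = -g / np.linalg.norm(g)

        best_random = np.inf
        for _ in range(1000):
            v = rng.normal(size=3)
            best_random = min(best_random, E_in(w0 + eta * v / np.linalg.norm(v)))

        print(E_in(w0 + eta * v_star) <= best_random)   # expect True for small eta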

  • Fixed-size step?

    How η affects the algorithm:

    [Figures: three gradient-descent runs plotted as in-sample error E_in vs. the weights w, for η too small (tiny steps, slow progress), η too large (overshooting back and forth), and a variable η (large steps where the slope is steep, small steps near the minimum: just right).]

    η should increase with the slope

    Learning From Data - Lecture 9 21/24

  • Easy implementation

    Instead of

    Δw = η v̂ = -η ∇E_in(w(0)) / ‖∇E_in(w(0))‖

    Have

    Δw = -η ∇E_in(w(0))

    Fixed learning rate η

    Learning From Data - Lecture 9 22/24

  • Logistic regression algorithm

    1: Initialize the weights at t = 0 to w(0)

    2: for t = 0, 1, 2, ... do

    3:     Compute the gradient: ∇E_in = -(1/N) Σ_{n=1}^{N} y_n x_n / (1 + e^{y_n w^T(t) x_n})

    4:     Update the weights: w(t+1) = w(t) - η ∇E_in

    5:     Iterate to the next step until it is time to stop

    6: Return the final weights w

    Learning From Data - Lecture 9 23/24
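
    The pseudocode translates almost line for line into Python; a minimal sketch (the lecture leaves the stopping rule open, so the gradient-norm test here is my choice):

        import numpy as np

        def logistic_regression(X, y, eta=0.1, max_iters=10_000, tol=1e-6):
            # X: N x (d+1) inputs (first column all 1s); y: labels in {-1, +1}
            w = np.zeros(X.shape[1])                   # step 1: w(0)
            for t in range(max_iters):                 # step 2
                s = y * (X @ w)
                grad = -np.mean((y / (1 + np.exp(s)))[:, None] * X, axis=0)  # step 3
                w = w - eta * grad                     # step 4
                if np.linalg.norm(grad) < tol:         # step 5: time to stop
                    break
            return w                                   # step 6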

  • Summary of Linear Models

    The credit analysis example, one task per model:

    Perceptron → Approve or Deny: Classification Error; PLA, Pocket, ...

    Linear Regression → Amount of Credit: Squared Error; Pseudo-inverse

    Logistic Regression → Probability of Default: Cross-entropy Error; Gradient descent

    Learning From Data - Lecture 9 24/24