CS546: Machine Learning and Natural Language, Lecture 7: Introduction to Classification: Linear Learning Algorithms (2009)

Linear Functions



Page 1: Linear Functions

1

CS546: Machine Learning and Natural Language

Lecture 7: Introduction to Classification:

Linear Learning Algorithms

2009

Page 2: Linear Functions

2

Linear Functions

f(x) = 1 if w1·x1 + w2·x2 + ... + wn·xn ≥ θ, and 0 otherwise

• Disjunctions:

y = x1 ∨ x3 ∨ x5

y = (1·x1 + 1·x3 + 1·x5 ≥ 1)

• At least m of n:

y = at least 2 of {x1, x3, x5}

y = (1·x1 + 1·x3 + 1·x5 ≥ 2)

• Exclusive-OR: y = (x1 ∧ ¬x2) ∨ (¬x1 ∧ x2)

• Non-trivial DNF: y = (x1 ∧ x2) ∨ (x3 ∧ x4)
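To make these examples concrete, here is a minimal sketch (Python; the function and variable names are illustrative, not from the lecture) of a linear threshold unit evaluated on the disjunction and the at-least-2-of-3 function above:

def ltu(w, theta, x):
    # Linear threshold unit: 1 if w . x >= theta, else 0
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

# y = x1 OR x3 OR x5 over x = (x1, ..., x5): weight 1 on x1, x3, x5 and theta = 1
w_or, theta_or = [1, 0, 1, 0, 1], 1
# y = at least 2 of {x1, x3, x5}: same weights, theta = 2
w_2of3, theta_2of3 = [1, 0, 1, 0, 1], 2

assert ltu(w_or, theta_or, [0, 1, 0, 1, 1]) == 1      # x5 is active
assert ltu(w_or, theta_or, [0, 1, 0, 1, 0]) == 0      # none of x1, x3, x5 active
assert ltu(w_2of3, theta_2of3, [1, 0, 1, 0, 0]) == 1  # two of the three are active
assert ltu(w_2of3, theta_2of3, [1, 0, 0, 0, 0]) == 0  # only one is active

No choice of weights and threshold makes this same form compute exclusive-OR, which is the point of the last two bullets.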

Page 3: Linear Functions

3

Linear Functions

[Figure: points in the plane separated by a linear function; the parallel hyperplanes w·x = 0 and w·x = θ are shown.]

Page 4: Linear Functions

4

Perceptron learning rule

• On-line, mistake-driven algorithm.
• Rosenblatt (1959) suggested that when a target output value is provided for a single neuron with fixed input, it can incrementally change weights and learn to produce the output using the Perceptron learning rule.
• Perceptron == Linear Threshold Unit

[Figure: a linear threshold unit: inputs x1, ..., x6 with weights w1, ..., w6 feed a unit with threshold T, which outputs y.]

Page 5: Linear Functions

5

Perceptron learning rule

• We learn f: X → {-1, +1}, represented as f = sgn(w·x), where X = {0,1}^n or X = R^n and w ∈ R^n.
• Given labeled examples: {(x1, y1), (x2, y2), ..., (xm, ym)}

1. Initialize w = 0 ∈ R^n

2. Cycle through all examples

a. Predict the label of instance x to be y' = sgn(w·x)

b. If y' ≠ y, update the weight vector:

w = w + r y x (r - a constant, learning rate)

Otherwise, if y' = y, leave weights unchanged.
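A minimal sketch of this rule (Python; the epoch count and the fixed learning rate r are illustrative choices):

def sgn(z):
    return 1 if z >= 0 else -1

def perceptron(examples, n, r=1.0, epochs=10):
    # examples: list of (x, y) pairs with x a length-n list and y in {-1, +1}
    w = [0.0] * n                                  # 1. initialize w = 0
    for _ in range(epochs):                        # 2. cycle through all examples
        for x, y in examples:
            y_pred = sgn(sum(wi * xi for wi, xi in zip(w, x)))   # a. predict y' = sgn(w.x)
            if y_pred != y:                        # b. on a mistake: w = w + r*y*x
                w = [wi + r * y * xi for wi, xi in zip(w, x)]
            # otherwise leave the weights unchanged
    return w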

Page 6: Linear Functions

6

Footnote About the Threshold

• On the previous slide, the Perceptron has no threshold.
• But we don't lose generality: the threshold can be folded into the weight vector by adding a constant feature. With x' = (x, 1) and w' = (w, -θ),

w·x ≥ θ is equivalent to w'·x' ≥ 0.
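In code, the trick is a one-line preprocessing step (a sketch, using the (x, 1) / (w, -θ) convention chosen above):

def augment(x, w, theta):
    # Fold the threshold into the weights: w.x >= theta  iff  w_aug . x_aug >= 0
    x_aug = list(x) + [1]          # constant feature x0 = 1
    w_aug = list(w) + [-theta]     # its weight is -theta
    return x_aug, w_aug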

Page 7: Linear Functions

7

Geometric View

Pages 8-10: Linear Functions (figures only)

Page 11: Linear Functions

11

Perceptron learning rule

1. Initialize w = 0 ∈ R^n

2. Cycle through all examples

a. Predict the label of instance x to be y' = sgn(w·x)

b. If y' ≠ y, update the weight vector to

w = w + r y x (r - a constant, learning rate)

Otherwise, if y' = y, leave weights unchanged.

• If x is Boolean, only weights of active features are updated.

w·x ≥ 0 is equivalent to 1/(1 + exp(-w·x)) ≥ 1/2

Page 12: Linear Functions

12

Perceptron Learnability

• Obviously can't learn what it can't represent
  – Only linearly separable functions

• Minsky and Papert (1969) wrote an influential book demonstrating Perceptron's representational limitations
  – Parity functions can't be learned (XOR)
  – In vision, if patterns are represented with local features, can't represent symmetry, connectivity

• Research on Neural Networks stopped for years

• Rosenblatt himself (1959) asked, "What pattern recognition problems can be transformed so as to become linearly separable?"

Page 13: Linear Functions

13

(x1 ∧ x2) ∨ (x3 ∧ x4) becomes linearly separable over the new features y1 = x1 ∧ x2 and y2 = x3 ∧ x4
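A small sketch of this transformation (Python; the feature names y1, y2 follow the slide): once the two conjunctions are features, the DNF is an ordinary disjunction, i.e., a linear threshold function with weights (1, 1) and threshold 1.

def to_conjunction_features(x1, x2, x3, x4):
    # New features: y1 = x1 AND x2, y2 = x3 AND x4
    return x1 & x2, x3 & x4

# (x1 AND x2) OR (x3 AND x4) becomes the linearly separable y1 OR y2:
# predict 1 iff 1*y1 + 1*y2 >= 1
for bits in range(16):
    x = [(bits >> i) & 1 for i in range(4)]
    y1, y2 = to_conjunction_features(*x)
    assert ((y1 + y2) >= 1) == bool((x[0] and x[1]) or (x[2] and x[3]))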

Page 14: Linear Functions

14

Perceptron Convergence

• Perceptron Convergence Theorem: If there exists a set of weights that is consistent with the data (i.e., the data is linearly separable), the perceptron learning algorithm will converge.

-- How long would it take to converge?

• Perceptron Cycling Theorem: If the training data is not linearly separable, the perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop.

-- How to provide robustness, more expressivity?

Page 15: Linear Functions

15

Perceptron: Mistake Bound Theorem

• Maintains a weight vector w ∈ R^N, w_0 = (0, ..., 0).
• Upon receiving an example x ∈ R^N, predicts according to the linear threshold function w·x ≥ 0.

Theorem [Novikoff, 1963]: Let (x_1, y_1), ..., (x_t, y_t) be a sequence of labeled examples with x_i ∈ R^N, ||x_i|| ≤ R and y_i ∈ {-1, +1} for all i. Let u ∈ R^N, γ > 0 be such that ||u|| = 1 and y_i u·x_i ≥ γ for all i (γ is the margin; R²/γ² is the complexity parameter).

Then Perceptron makes at most R²/γ² mistakes on this example sequence.

(see additional notes)

Page 16: Linear Functions

16

Perceptron-Mistake Bound

Proof: Let vk be the hypothesis before the k-th mistake. Assume that the k-th mistake occurs on the input example (xi, yi).

Assumptions: v_1 = 0; ||u|| ≤ 1; y_i u·x_i ≥ γ.

[The slide's derivation steps -- multiply by u, by definition of u, by induction, projection -- lead to the bound K < R²/γ².]
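For reference, here is a reconstruction of the standard argument behind the step names on this slide, in the theorem's notation; the slide itself only shows the labels, so the displayed inequalities are filled in from the usual proof:

\[ v_{k+1}\cdot u \;=\; v_k\cdot u + y_i\,(u\cdot x_i) \;\ge\; v_k\cdot u + \gamma
   \;\Longrightarrow\; v_{K+1}\cdot u \ge K\gamma \quad \text{(by induction, since } v_1 = 0\text{)} \]

\[ \|v_{k+1}\|^2 \;=\; \|v_k\|^2 + 2\,y_i\,(v_k\cdot x_i) + \|x_i\|^2 \;\le\; \|v_k\|^2 + R^2
   \;\Longrightarrow\; \|v_{K+1}\|^2 \le K R^2 \quad \text{(a mistake means } y_i\,v_k\cdot x_i \le 0\text{)} \]

\[ K\gamma \;\le\; v_{K+1}\cdot u \;\le\; \|v_{K+1}\| \;\le\; \sqrt{K}\,R
   \;\Longrightarrow\; K \le R^2/\gamma^2 . \]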

Page 17: Linear Functions

18

Perceptron for Boolean Functions

• How many mistakes will the Perceptron algorithm make when learning a k-disjunction?
• It can make O(n) mistakes on a k-disjunction over n attributes.
• Our bound: R²/γ²

• u: 1/k^(1/2) for the k relevant components, 0 for the others
• γ: difference only in one variable: 1/k^(1/2)
• R: n^(1/2)

Thus, we get: nk. Is it possible to do better?
• This is important if n -- the number of features -- is very large.

Page 18: Linear Functions

19

Winnow Algorithm

• The Winnow Algorithm learns Linear Threshold Functions.

• For the class of disjunction, instead of demotion we can use elimination.

Initialize: θ = n; w_i = 1
Prediction is 1 iff w·x ≥ θ
If no mistake: do nothing
If f(x) = 1 but w·x < θ: w_i ← 2·w_i (if x_i = 1) (promotion)
If f(x) = 0 but w·x ≥ θ: w_i ← w_i/2 (if x_i = 1) (demotion)
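A direct transcription of this rule (Python sketch; names and the epoch-based loop are illustrative):

def winnow(examples, n, epochs=10):
    # examples: list of (x, label) with x in {0,1}^n and label in {0, 1}
    w, theta = [1.0] * n, float(n)      # Initialize: w_i = 1, theta = n
    for _ in range(epochs):
        for x, label in examples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
            if pred == label:
                continue                # no mistake: do nothing
            if label == 1:              # promotion: double the weights of active features
                w = [2 * wi if xi == 1 else wi for wi, xi in zip(w, x)]
            else:                       # demotion: halve the weights of active features
                w = [wi / 2 if xi == 1 else wi for wi, xi in zip(w, x)]
    return w, theta

Running it on a k-disjunction reproduces the behaviour traced on the next slide: only mistakes change w, and only the weights of active features move.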

Page 19: Linear Functions

20

Winnow - Example

Target: f = x1 ∨ x2 ∨ x1023 ∨ x1024   (n = 1024)

Initialize: θ = 1024; w = (1, 1, ..., 1)

⟨(1,1,...,1), +⟩          w·x ≥ θ   ok        w = (1, 1, ..., 1)
⟨(0,0,...,0), -⟩          w·x < θ   ok        w = (1, 1, ..., 1)
⟨(0,0,1,1,1,...,0), -⟩    w·x < θ   ok        w = (1, 1, ..., 1)
⟨(1,0,0,...,0), +⟩        w·x < θ   mistake   w = (2, 1, ..., 1)
⟨(1,0,1,1,0,...,0), +⟩    w·x < θ   mistake   w = (4, 1, 2, 2, 1, ..., 1)
⟨(1,0,1,0,0,...,1), +⟩    w·x < θ   mistake   w = (8, 1, 4, 2, ..., 2)
.......................   (about log(n/2) promotions for each good variable)
w = (512, 1, 256, 256, ..., 256)

Page 20: Linear Functions

21

Winnow - Example

• Notice that the same algorithm will learn a conjunction over these variables (w=(256,256,0,…32,…256,256) )


Page 21: Linear Functions

22

Winnow - Mistake Bound

Claim: Winnow makes O(k log n) mistakes on k-disjunctions

u - # of mistakes on positive examples (promotions)
v - # of mistakes on negative examples (demotions)


Page 22: Linear Functions

23

Claim: Winnow makes O(k log n) mistakes on k-disjunctions

u - # of mistakes on positive examples (promotions)
v - # of mistakes on negative examples (demotions)

1. u < k log(2n)


Page 23: Linear Functions

24

Claim: Winnow makes O(k log n) mistakes on k-disjunctions

u - # of mistakes on positive examples (promotions)
v - # of mistakes on negative examples (demotions)

1. u < k log(2n)
A weight that corresponds to a good variable is only promoted. When these weights reach n, there will be no more mistakes on positive examples.


Page 24: Linear Functions

25

u - # of mistakes on positive examples (promotions)
v - # of mistakes on negative examples (demotions)

2. v < 2(u + 1)


Page 25: Linear Functions

26

u - # of mistakes on positive examples (promotions)
v - # of mistakes on negative examples (demotions)

2. v < 2(u + 1) Total weight: TW=n initially


Page 26: Linear Functions

27

u - # of mistakes on positive examples (promotions)
v - # of mistakes on negative examples (demotions)

2. v < 2(u + 1)
Total weight: TW = n initially
Mistake on positive: TW(t+1) < TW(t) + n


Page 27: Linear Functions

28

u - # of mistakes on positive examples (promotions)
v - # of mistakes on negative examples (demotions)

2. v < 2(u + 1)
Total weight: TW = n initially
Mistake on positive: TW(t+1) < TW(t) + n
Mistake on negative: TW(t+1) < TW(t) - n/2


Page 28: Linear Functions

29

u - # of mistakes on positive examples (promotions)
v - # of mistakes on negative examples (demotions)

2. v < 2(u + 1)
Total weight: TW = n initially
Mistake on positive: TW(t+1) < TW(t) + n
Mistake on negative: TW(t+1) < TW(t) - n/2
Since TW stays positive: 0 < TW < n + u·n - v·n/2, which gives v < 2(u + 1).


Page 29: Linear Functions

30

u - # of mistakes on positive examples (promotions)
v - # of mistakes on negative examples (demotions)

# of mistakes: u + v < 3u + 2 = O(k log n)


Page 30: Linear Functions

31

Winnow - Extensions

• This algorithm learns monotone functions (in the Boolean-algebra sense)
• For the general case:
  - Duplicate variables:
    • For the negation of variable x, introduce a new variable y.
    • Learn monotone functions over 2n variables.
  - Balanced version:
    • Keep two weights for each variable; the effective weight is the difference.

Update Rule:
If f(x) = 1 but (w⁺ - w⁻)·x < θ:  w_i⁺ ← 2·w_i⁺ and w_i⁻ ← w_i⁻/2, where x_i = 1 (promotion)
If f(x) = 0 but (w⁺ - w⁻)·x ≥ θ:  w_i⁺ ← w_i⁺/2 and w_i⁻ ← 2·w_i⁻, where x_i = 1 (demotion)
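The "duplicate variables" step can be read as simple preprocessing (a sketch): map each x ∈ {0,1}^n to (x, 1-x) ∈ {0,1}^(2n), so that negated literals become ordinary monotone variables, and run plain Winnow on the transformed examples.

def add_negations(x):
    # Map x in {0,1}^n to (x, 1-x) in {0,1}^(2n): feature n+i plays the role of "not x_i",
    # so a disjunction with negated literals becomes a monotone disjunction over 2n variables.
    return list(x) + [1 - xi for xi in x]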

Page 31: Linear Functions

32

Winnow - A Robust Variation

• Winnow is robust in the presence of various kinds of noise. (classification noise, attribute noise)

• Importance: sometimes we learn under some distribution but test under a slightly different one. (e.g., natural language applications)

Page 32: Linear Functions

33

Modeling:
• Adversary's turn: may change the target concept by adding or removing some variable from the target disjunction. The cost of each such move is 1.

• Learner's turn: makes a prediction on the example given, and is then told the correct answer (according to the current target function).

• Winnow-R: same as Winnow, only it doesn't let weights go below 1/2.

• Claim: Winnow-R makes O(c log n) mistakes, where c is the cost of the adversary (a generalization of the previous claim).


Page 33: Linear Functions

34

Algorithmic Approaches

• Focus: Two families of algorithms (one on-line representative of each)

• Additive update algorithms: Perceptron

• Multiplicative update algorithms: Winnow

SVM (not on-line, but a close relative of Perceptron)

Close relatives: Boosting; Max Entropy

Which Algorithm to choose?

Page 34: Linear Functions

35

Algorithm Descriptions

• Multiplicative weight update algorithm (Winnow, Littlestone, 1988. Variations exist)

• Additive weight update algorithm (Perceptron, Rosenblatt, 1958.

Variations exist)

Examples: x ∈ {0,1}^n;  Hypothesis: w ∈ R^n
Prediction is 1 iff w·x ≥ θ

Additive weight update (Perceptron):
If Class = 1 but w·x < θ:  w_i ← w_i + 1 (if x_i = 1) (promotion)
If Class = 0 but w·x ≥ θ:  w_i ← w_i - 1 (if x_i = 1) (demotion)

Multiplicative weight update (Winnow):
If Class = 1 but w·x < θ:  w_i ← 2·w_i (if x_i = 1) (promotion)
If Class = 0 but w·x ≥ θ:  w_i ← w_i/2 (if x_i = 1) (demotion)

Page 35: Linear Functions

36

How to Compare?

• Generalization (since the representation is the same): How many examples are needed to get to a given level of accuracy?

• Efficiency: How long does it take to learn a hypothesis and evaluate it (per-example)?

• Robustness; adaptation to a new domain, ...

Page 36: Linear Functions

37

Sentence Representation

S= I don’t know whether to laugh or cry

- Define a set of features: features are relations that hold in

the sentence

- Map a sentence to its feature-based representation

The feature-based representation will give some of the information in the sentence

- Use this as an example for your learning algorithm

Page 37: Linear Functions

38

Sentence Representation

S= I don’t know whether to laugh or cry

- Define a set of features:

features are relations that hold in the sentence

- Conceptually, there are two steps in coming up with a feature-based representation:

1. What are the information sources available? Sensors: words, order of words, properties (?) of words

2. What features to construct based on these?

Why needed?

Page 38: Linear Functions

39

Weather

Whether

(x1 ∧ x2 ∧ x3) ∨ (x1 ∧ x4 ∧ x3) ∨ (x3 ∧ x2 ∧ x5)   →   y1 ∨ y4 ∨ y5

The new discriminator is functionally simpler.

Embedding

Page 39: Linear Functions

40

Domain Characteristics

• The number of potential features is very large

• The instance space is sparse

• Decisions depend on a small set of features (sparse)

• Want to learn from a number of examples that is small relative to the dimensionality

Page 40: Linear Functions

41

Which Algorithm to Choose?

• Generalization

– Multiplicative algorithms:
  • Bounds depend on ||u||_1, the separating hyperplane
  • M = 2 ln n · ||u||_1^2 · max_i ||x(i)||_∞^2 / min_i (u·x(i))^2
  • Advantage with few relevant features in the concept

– Additive algorithms:
  • Bounds depend on ||x|| (Kivinen / Warmuth, '95)
  • M = ||u||_2^2 · max_i ||x(i)||_2^2 / min_i (u·x(i))^2
  • Advantage with few active features per example

The l1 norm: ||x||_1 = Σ_i |x_i|
The l2 norm: ||x||_2 = (Σ_{i=1..n} |x_i|^2)^(1/2)
The lp norm: ||x||_p = (Σ_{i=1..n} |x_i|^p)^(1/p)
The l∞ norm: ||x||_∞ = max_i |x_i|
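As a rough instantiation (assuming, as on the earlier slide, that the target is a disjunction of k of the n Boolean attributes; the particular choices of u below are the usual ones, not stated in the lecture):

\[ \text{Additive: } u_i = 1/\sqrt{k} \text{ on the } k \text{ relevant attributes, so } \|u\|_2 = 1,\;
   \max_i \|x^{(i)}\|_2^2 \le n,\; \min_i (u\cdot x^{(i)})^2 = 1/k
   \;\Rightarrow\; M \approx n\,k . \]

\[ \text{Multiplicative: } u_i = 1 \text{ on the relevant attributes, so } \|u\|_1 = k,\;
   \max_i \|x^{(i)}\|_\infty = 1,\; \min_i u\cdot x^{(i)} \ge 1
   \;\Rightarrow\; M \approx 2\,k^2 \ln n , \]

so the multiplicative bound grows only logarithmically with the dimensionality n, matching the Winnow vs. Perceptron comparison above.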

Page 41: Linear Functions

42

Generalization

• Dominated by the sparseness of the function space: most features are irrelevant. The number of examples required by multiplicative algorithms depends mostly on the number of relevant features (generalization bounds depend on ||w||).

• Lesser issue: sparseness of the feature space: advantage to additive. Generalization depends on ||x|| (Kivinen/Warmuth '95); see additional notes.

Page 42: Linear Functions

43

[Figure: mistake bounds for learning "at least 10 out of a fixed 100 variables are active" as a function of n, the total number of variables (dimensionality). y-axis: # of mistakes to convergence; one curve for Perceptron/SVMs, one for Winnow.]

Page 43: Linear Functions

44

Dual Perceptron:
- We can replace x_i·x_j with K(x_i, x_j), which can be regarded as a dot product in some large (or infinite-dimensional) space.
- K(x, y) can often be computed efficiently without computing the mapping to this space.
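A minimal dual-form sketch (Python; the quadratic kernel is only an example of a K that is cheap to compute yet corresponds to a much larger feature space):

def quad_kernel(x, z):
    # K(x, z) = (1 + x.z)^2: cheap to compute, corresponds to all degree-<=2 monomials of the attributes
    return (1.0 + sum(xi * zi for xi, zi in zip(x, z))) ** 2

def dual_perceptron(examples, kernel, epochs=10):
    # examples: list of (x, y) with y in {-1, +1}; keep one mistake count c[i] per example
    c = [0] * len(examples)
    for _ in range(epochs):
        for j, (xj, yj) in enumerate(examples):
            # score = sum_i c_i y_i K(x_i, x_j): every dot product x_i . x_j is replaced by K
            score = sum(ci * yi * kernel(xi, xj)
                        for ci, (xi, yi) in zip(c, examples))
            if yj * score <= 0:         # mistake: count this example once more
                c[j] += 1
    return c

Prediction on a new x is then sgn(Σ_i c_i y_i K(x_i, x)), which is exactly the f(x) = Σ_i c_i K(x, x_i) form on the next slide, with the labels folded into the coefficients.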

Page 44: Linear Functions

45

Efficiency

• Dominated by the size of the feature space

• Most features are functions (e.g., conjunctions) of raw attributes:
  X = (x1, x2, x3, ..., xn)  →  (χ1(x), χ2(x), χ3(x), ..., χk(x)),   k >> n

• Additive algorithms allow the use of kernels: no need to explicitly generate the complex features

  f(x) = Σ_i c_i K(x, x_i)

• Could be more efficient, since work is done in the original feature space.

• In practice: explicit kernels (feature-space blow-up) are often more efficient.
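For contrast, the "explicit kernel" route materializes the blown-up feature space directly, e.g. all pairwise conjunctions of the raw Boolean attributes (a sketch; one would typically generate only the conjunctions that are actually active in an example):

from itertools import combinations

def conjunction_features(x):
    # Map x in {0,1}^n to x plus all pairwise conjunctions x_i AND x_j
    # (n + n*(n-1)/2 features); any of the linear learners above can then run on this vector.
    return list(x) + [xi & xj for xi, xj in combinations(x, 2)]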

Page 45: Linear Functions

46

Practical Issues and Extensions

• There are many extensions that can be made to these basic algorithms.

• Some are necessary for them to perform well.

• Infinite attribute domain

• Regularization

Page 46: Linear Functions

47

Extensions: Regularization

[Figure: as on the earlier slide, points separated by the hyperplanes w·x = 0 and w·x = θ.]

• In general – regularization is used to bias the learner in the direction of a low-expressivity (low VC dimension) separator

• Thick Separator (Perceptron or Winnow)
  – Promote if: w·x > θ + γ
  – Demote if: w·x < θ - γ

Page 47: Linear Functions

49

SNoW

• A learning architecture that supports several linear update rules (Winnow, Perceptron, naïve Bayes)

• Allows regularization; voted Winnow/Perceptron; pruning; many options

• True multi-class classification
• Variable-size examples; very good support for large-scale domains in terms of the number of examples and the number of features.
• "Explicit" kernels (blowing up the feature space).
• Very efficient (1-2 orders of magnitude faster than SVMs)
• Stand-alone, implemented in LBJ

[Download from: http://L2R.cs.uiuc.edu/~cogcomp ]

Page 48: Linear Functions

50

COLT approach to explaining Learning

• No distributional assumption
• Training distribution is the same as the test distribution
• Generalization bounds depend on this view and affect model selection:

Err_D(h) < Err_TR(h) + P(VC(H), log(1/δ), 1/m)

• This is also called the "Structural Risk Minimization" principle.

Page 49: Linear Functions

51

COLT approach to explaining Learning

• No distributional assumption
• Training distribution is the same as the test distribution
• Generalization bounds depend on this view and affect model selection.

Err_D(h) < Err_TR(h) + P(VC(H), log(1/δ), 1/m)

• As presented, the VC dimension is a combinatorial parameter that is associated with a class of functions.

• We know that the class of linear functions has a lower VC dimension than the class of quadratic functions.

• But, this notion can be refined to depend on a given data set, and this way directly affect the hypothesis chosen for this data set.

Page 50: Linear Functions

52

Data Dependent VC dimension

• Consider the class of linear functions, parameterized by their margin.
• Although both classifiers separate the data, the distance with which the separation is achieved is different:
• Intuitively, we can agree that: Large Margin → Small VC dimension

Page 51: Linear Functions

53

Margin and VC dimension

Page 52: Linear Functions

54

Margin and VC dimension

• Theorem (Vapnik): If H is the space of all linear classifiers in R^n that separate the training data with margin at least γ, then

VC(H) ≤ R²/γ²

• where R is the radius of the smallest sphere (in R^n) that contains the data.

• This is the first observation that will lead to the algorithmic approach.
• The second one is that: Small ||w|| → Large Margin
• Consequently, the algorithm will be: from among all those w's that agree with the data, find the one with the minimal size ||w||.

Page 53: Linear Functions

55

Margin and Weight Vector

Consequently: the algorithm will be: from among all those w’s that agree with the data, find the one with the minimal size ||w||. This leads to the SVM optimization algorithm

Page 54: Linear Functions

56

Key Problems

• Computational Issues
  – A lot of effort has been spent on trying to optimize SVMs.
  – Gradually, algorithms became more on-line and more similar to Perceptron and Stochastic Gradient Descent.
  – Algorithms like SMO have decomposed the quadratic programming problem.
  – More recent algorithms have become almost identical to earlier algorithms we have seen.

• Is it really optimal?
  – Experimental results are very good.
  – Issues with the tradeoff between # of examples and # of features are similar to other linear classifiers.
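To illustrate the convergence noted above between SVM optimization and Perceptron-style online updates, here is a stochastic sub-gradient sketch for the regularized hinge loss (a Pegasos-style update; the regularization constant lam and the 1/(lam*t) step size are illustrative choices, not something specified in the lecture):

import random

def hinge_sgd(examples, n, lam=0.01, epochs=20):
    # Minimize  lam/2 * ||w||^2 + (1/m) * sum_i max(0, 1 - y_i * w.x_i)  by stochastic sub-gradient steps.
    # examples: list of (x, y) with x a length-n list and y in {-1, +1}
    w, t = [0.0] * n, 0
    data = list(examples)
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)                     # decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [(1.0 - eta * lam) * wi for wi in w]  # shrink w: the regularizer's gradient
            if margin < 1:                            # margin violated: Perceptron-like additive step
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

The update shrinks w every step (the ||w||² regularizer) and, exactly like Perceptron, adds y·x only when the example violates the margin requirement.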

Page 55: Linear Functions

57

Support Vector Machines

SVM = Linear Classifier + Regularization + [Kernel Trick].

• This leads to an algorithm: from among all those w's that agree with the data, find the one with the minimal size ||w||:

Minimize ½ ||w||²
Subject to: y (w·x + b) ≥ 1, for all x ∈ S

• This is an optimization problem that can be solved using techniques from optimization theory. By introducing Lagrange multipliers we can rewrite the dual formulation of this optimization problem as:

w = Σ_i α_i y_i x_i

• where the α's are such that the following functional is maximized:

L(α) = -½ Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j) + Σ_i α_i

• The optimum setting of the α's turns out to satisfy:

α_i [y_i (w·x_i + b) - 1] = 0   for all i

Page 56: Linear Functions

58

Support Vector Machines

SVM = Linear Classifier + Regularization + [Kernel Trick].

Minimize ½ ||w||²
Subject to: y (w·x + b) ≥ 1, for all x ∈ S

• The dual formulation of this optimization problem gives:

w = Σ_i α_i y_i x_i

• Optimum setting of the α's: α_i [y_i (w·x_i + b) - 1] = 0 for all i
• That is, α_i ≠ 0 only when y_i (w·x_i + b) - 1 = 0
• Those are the points sitting on the margin, called support vectors
• We get:

f(x; w, b) = w·x + b = Σ_i α_i y_i (x_i·x) + b

• The value of the function depends on the support vectors, and only on their dot product with the point of interest x.

1. Dependence on the dot product leads to the ability to introduce kernels (just like in perceptron)

2. What if the data is not linearly separable?

3. What is the difference from regularized perceptron/Winnow?

Page 57: Linear Functions

59

Summary

• Described examples of linear algorithms:
  – Perceptron, Winnow, SVM
• Additive vs. multiplicative versions
• Basic theory behind these methods
• Robust modifications