
Classification: Linear Models


Oliver Schulte
Machine Learning 726

Model Overview (Parent Node / Child Node)

                     Parent: Discrete                        Parent: Continuous
Child: Discrete      maximum likelihood,                     logit distribution (logistic regression);
                     decision trees                          classifiers: linear discriminant (perceptron),
                                                             support vector machine
Child: Continuous    conditional Gaussian (not discussed)    linear Gaussian (linear regression)

Linear Classification Models

General idea: learn a linear continuous function y of continuous features x, and classify as positive if y crosses a threshold, typically 0. As in linear regression, we can use more complicated features defined by basis functions φ.
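To make the threshold rule concrete, here is a minimal Python sketch; the particular basis functions φ and the weights below are made up for illustration, not taken from the slides.

```python
import numpy as np

def phi(x):
    """Illustrative basis functions: the raw features plus a squared term."""
    return np.array([x[0], x[1], x[0] ** 2])

def classify(w, w0, x):
    """Linear model y(x) = w^T phi(x) + w0; positive class if y > 0."""
    y = w @ phi(x) + w0
    return 1 if y > 0 else -1

w = np.array([0.5, -1.0, 0.2])   # made-up weights, for illustration only
w0 = -0.1
print(classify(w, w0, np.array([1.0, 2.0])))   # prints -1 for these values
```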

Example: Classifying Digits

Classify an input vector as "4" vs. "not 4". Represent the input image as a vector x with 28x28 = 784 numbers. The target is t = 1 for positive, -1 for negative. Given a training set (x_1, t_1), ..., (x_N, t_N), the problem is to find a good linear function y(x), with y: R^784 -> R. Classify x as positive if y(x) > 0, negative otherwise.
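At the digit classifier's dimensionality, the decision rule is just a dot product over the 784 pixel values; a sketch with placeholder (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=784)        # placeholder weights; learning them comes next
image = rng.random((28, 28))    # stand-in for a 28x28 grayscale digit

x = image.reshape(784)          # flatten the image into the input vector x
y = w @ x                       # linear score y(x) = w^T x
t_hat = 1 if y > 0 else -1      # threshold at 0: "4" vs. "not 4"
print(y, t_hat)
```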

Note: we could have chosen other target values, like 1 vs. 0; the 1/-1 choice will turn out to be convenient.

Other Examples

Will the person vote conservative, given age, income, and previous votes? Is the patient at risk of diabetes, given body mass, age, and blood test measurements? Predict earthquake vs. nuclear explosion, given body wave magnitude and surface wave magnitude.

[Diagram: feature nodes Age, Income, Votes pointing to Conservative; Surface Wave Magnitude and Body Wave Magnitude pointing to Disaster Type.]

Linear Separation

Figure (Russell and Norvig, Figure 18.15): white = earthquake, black = nuclear explosion; x1 = surface wave magnitude, x2 = body wave magnitude.

Note: the figure shows events in Asia and the Middle East between 1982 and 1990.

Linear Discriminants

Simple linear model: y(x) = w^T x + w0. We can drop the explicit bias w0 if we add a fixed dummy input x0 = 1. The decision surface is a line (in general, a hyperplane) orthogonal to w. In 2-D, just try a line between the classes!
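A small NumPy sketch of the dummy-input trick for absorbing the bias (the values are arbitrary):

```python
import numpy as np

x = np.array([4.5, 5.9])            # e.g., surface and body wave magnitudes
w = np.array([1.0, -1.0])
w0 = -0.5

y_explicit = w @ x + w0             # y(x) = w^T x + w0

x_aug = np.concatenate(([1.0], x))  # prepend the fixed dummy input x0 = 1
w_aug = np.concatenate(([w0], w))   # the bias becomes the first weight
y_dummy = w_aug @ x_aug             # same value, with no explicit bias term

assert np.isclose(y_explicit, y_dummy)
```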

Note: the weight vector points towards the positive class.

Perceptron Learning

Defining an Error Function

General idea: encode the class label using a real number t, e.g., positive = 1 and negative = 0 (or negative = -1). Measure error by comparing the continuous linear output y with the class label code t.

The Error Function for Linear Discriminants

We could use squared error, as in linear regression. This has various problems (see book), basically due to the fact that 1 and -1 are not real target values. A different criterion was developed for learning perceptrons. Perceptrons are a precursor to neural nets; Rosenblatt built an analog implementation in the 1950s (see Figure 4.8).

The Perceptron Criterion

An example (x_n, t_n) is misclassified if t_n (w^T x_n) <= 0. (Take a moment to verify this.)

Perceptron Error

E(w) = -Σ_{n in M} t_n (w^T x_n),

where M is the set of misclassified inputs, the mistakes. Exercise: find the gradient of the error function with respect to a single input x_n.
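A NumPy sketch of the criterion and its gradient, vectorized over the mistake set M (the function and variable names are mine):

```python
import numpy as np

def perceptron_error_and_grad(w, X, t):
    """Perceptron criterion E(w) = -sum_{n in M} t_n * (w^T x_n), where M is
    the set of misclassified inputs, i.e. those with t_n * (w^T x_n) <= 0.
    Returns the error and its gradient -sum_{n in M} t_n * x_n."""
    scores = X @ w                  # y(x_n) for every example
    mistakes = t * scores <= 0      # membership in the mistake set M
    error = -np.sum(t[mistakes] * scores[mistakes])
    grad = -(t[mistakes, None] * X[mistakes]).sum(axis=0)
    return error, grad
```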

Note (solution): the gradient contribution is 0 if x_n is correctly classified; otherwise it is -x_n t_n (input vector times target scalar). Proof: fix a single w_j and multiply t_n into the dot product. If the output is exactly 0, the algorithm fails; alternatively, assume this does not happen.

Perceptron Learning Algorithm

Use stochastic gradient descent: gradient descent for one example at a time, cycling through the data. Update equation:

w <- w + η t_n x_n   for each misclassified x_n,

where we set η = 1 (without loss of generality in this case). Excel demo.
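A minimal Python sketch of the same loop (assuming the dummy bias input is already included in X, and fixing η = 1 as on the slide):

```python
import numpy as np

def train_perceptron(X, t, max_epochs=100):
    """Cycle through the examples; add t_n * x_n whenever x_n is misclassified."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:    # misclassified (or on the boundary)
                w = w + t_n * x_n       # stochastic gradient step with eta = 1
                mistakes += 1
        if mistakes == 0:               # a full pass with no mistakes: done
            return w
    return w                            # may not converge if classes are not separable

# Tiny separable example; the first column is the dummy input x0 = 1.
X = np.array([[1.0, 2.0, 1.0], [1.0, 1.5, 2.0], [1.0, -1.0, -1.5], [1.0, -2.0, -0.5]])
t = np.array([1, 1, -1, -1])
w = train_perceptron(X, t)
print(w, np.sign(X @ w))                # the signs match t
```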

Note: the arrow shows the negated gradient, indicating the direction of steepest descent along the error surface.

Perceptron Demo

Note: the weight vector is black and points in the direction of the red class. Add the misclassified feature vector to the weight vector to get the new weight vector.

Perceptron Learning Analysis

Theorem: if the classes are linearly separable, the perceptron learning algorithm converges to a weight vector that separates them. Convergence can be slow, and the result is sensitive to initialization.

Nonseparability

Linear discriminants can solve problems only if the classes can be separated by a line (in general, a hyperplane). The canonical example of a non-separable problem is XOR. On such problems the perceptron typically does not converge.
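To see the failure concretely, a sketch that runs the perceptron update on XOR (true encoded as 1, false as 0, as noted below); every pass through the data still contains a mistake, so the updates cycle forever:

```python
import numpy as np

# XOR inputs with a dummy bias input x0 = 1; labels coded as 1 / -1.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([-1, 1, 1, -1])        # positive iff exactly one input is true

w = np.zeros(3)
for _ in range(1000):
    mistakes = 0
    for x_n, t_n in zip(X, t):
        if t_n * (w @ x_n) <= 0:    # misclassified
            w = w + t_n * x_n
            mistakes += 1
    if mistakes == 0:
        break
print(mistakes)                     # still > 0 after 1000 epochs: no separator exists
```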

Note: the XOR encoding uses 1 for true, 0 for false.

Nonseparability: Real-World Example

Figure (Russell and Norvig, Figure 18.15 b): white = earthquake, black = nuclear explosion; x1 = surface wave magnitude, x2 = body wave magnitude. With more actual data points added, the classes are no longer linearly separable.

Responses to Nonseparability

When the classes cannot be separated by a linear discriminant:
- Separate the classes not completely, but well (find an approximate solution): logistic regression, Fisher discriminant (not covered).
- Add hidden features and use a non-linear activation function: neural network, support vector machine.

Logistic Regression

From Values to Probabilities

Key idea: instead of predicting a class label, predict the probability of a class label, e.g., p+ = P(class is positive | features) and p- = P(class is negative | features). A probability is naturally a continuous quantity. How do we turn a real number y into a probability p+?

The Logistic Sigmoid Function

Definition: σ(y) = 1 / (1 + exp(-y)). It squeezes the real line into [0,1]. It is differentiable, with dσ/dy = σ(y) (1 - σ(y)) (nice exercise).
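A quick numerical sanity check of the definition and the derivative identity, in plain NumPy:

```python
import numpy as np

def sigmoid(y):
    """Logistic sigmoid: squeezes the real line into [0, 1]."""
    return 1.0 / (1.0 + np.exp(-y))

# Compare the identity sigma'(y) = sigma(y) * (1 - sigma(y))
# against a central finite-difference approximation.
y = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(y + eps) - sigmoid(y - eps)) / (2 * eps)
analytic = sigmoid(y) * (1 - sigmoid(y))
print(np.allclose(numeric, analytic, atol=1e-8))   # True
```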

Soft Threshold Interpretation

Figure (Russell and Norvig, Figure 18.17): if y > 0, σ(y) goes to 1 very quickly; if y < 0, σ(y) goes to 0 very quickly.