Chapter 4: Fitting a model to data


Page 1:

Chapter 4: Fitting a model to data

Page 2:

Fitting a model to data

• Predictive modeling involves finding a model of the target variable in terms of other descriptive attributes.

• In Chapter 3, we constructed a supervised segmentation model by recursively finding informative attributes on ever-more-precise subsets of the set of all instances, or from the geometric perspective, ever-more-precise sub-regions of the instance space.

• From the data we produced both the structure of the model (the particular tree model that resulted from the tree induction) and the numeric "parameters" of the model (the probability estimates at the leaf nodes).

Page 3:

Fitting a model to data

• An alternative method for learning a predictive model from a dataset is to start by specifying the structure of the model with certain numeric parameters left unspecified.

• Then the data mining procedure calculates the best parameter values given a particular set of training data.

• The goal of data mining is to tune the parameters so that the model fits the data as well as possible.

• This general approach is called parameter learning or parametric modeling.

Page 4:

Fitting a model to data

• Many data mining procedures fall within this general framework. We will illustrate with some of the most common, all of which are based on linear models.

• The crux of this chapter is its fundamental concept: fitting a model to data by finding "optimal" model parameters.

Page 5:

Classification via mathematical function

• Recall the instance-space view of tree models from Chapter 3 (Figure 4-1). It shows the space broken up into regions by horizontal and vertical decision boundaries that partition the instance space into similar regions.

• A main purpose of creating homogeneous regions is so that we can predict the target variable of a new, unseen instance by determining which segment it falls into.

• Action: separate the raw data into similar regions.

Figure 4-1

Page 6:

Classification via mathematical function

Figure 4-2. The raw data points of Figure 4-1, without decision lines.

Page 7:

Classification via mathematical function

• This is called a linear classifier and is essentially a weighted sum of the values of the various attributes.

• Compare the entropy values in Figure 4-3 and Figure 4-1.

• We can separate the instances almost perfectly (by class) if we are allowed to introduce a boundary that is still a straight line, but is not perpendicular to the axes.

Figure 4-3

Page 8:

Linear Discriminant Function

• The line in Figure 4-3 can be expressed mathematically as: Age = (-1.5) * Balance + 60

• Classification function: class(x) = + if 1.0*Age + 1.5*Balance - 60 > 0, and class(x) = - if 1.0*Age + 1.5*Balance - 60 <= 0 (one class on each side of the line).

• A general linear model: f(x) = w0 + w1*x1 + w2*x2 + ...
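As a rough illustration of the general linear model above (my own Python sketch, not code from the textbook; the instance values below are made up), f(x) is just a weighted sum whose sign decides the class:

# Minimal sketch of a linear discriminant: f(x) = w0 + w1*x1 + w2*x2 + ...
def linear_score(x, w0, weights):
    """Compute f(x) = w0 + sum_i w_i * x_i for one instance x."""
    return w0 + sum(w * xi for w, xi in zip(weights, x))

def classify(x, w0, weights):
    """Assign '+' to one side of the boundary f(x) = 0 and '-' to the other."""
    return "+" if linear_score(x, w0, weights) > 0 else "-"

# The boundary Age = -1.5*Balance + 60, rearranged as f(x) = 1.0*Age + 1.5*Balance - 60.
instance = (35, 20)                               # hypothetical (Age, Balance) values
print(linear_score(instance, -60, (1.0, 1.5)))    # a signed, distance-like score
print(classify(instance, -60, (1.0, 1.5)))        # '+' or '-'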

Page 9:

Linear Discriminant Function

• There are actually many different linear discriminants that can separate the classes perfectly. They have different slopes and intercepts, and each represents a different model of the data.

• Which is the "best" line?

Page 10:

Optimizing an objective function

STEP1: Define an objective function that represents our goal.

STEP2: The function can be calculated for a particular set of weights and a particular set of data.

STEP3: Find the optimal value for the weights by maximizing or minimizing the objective function.
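To make these three steps concrete, here is a small sketch (my own illustration, assuming numpy is available, with squared error as an example objective): it evaluates the objective for a given set of weights and then improves the weights by plain gradient descent.

import numpy as np

# Toy training data: 100 instances, 2 attributes, target generated by a known linear rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0

def objective(w, b, X, y):
    """Step 2: the objective value (mean squared error) for one particular set of weights."""
    return np.mean((X @ w + b - y) ** 2)

# Step 3: find good weights by minimizing the objective with gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(500):
    error = X @ w + b - y
    w -= 0.1 * (2 / len(y)) * (X.T @ error)   # gradient of the objective w.r.t. w
    b -= 0.1 * (2 / len(y)) * error.sum()     # gradient w.r.t. the intercept
print(w, b, objective(w, b, X, y))            # w approaches (3, -2), b approaches 1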

Page 11:

An example of mining a linear discriminant from data

Data: sepal width and petal width. Types: Iris Setosa and Iris Versicolor.

Two different separating lines:
a. Logistic regression
b. Support vector machine

The two methods produce different boundaries because they are optimizing different objective functions.

Filled dots: Iris Setosa; circles: Iris Versicolor.
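Something like the figure's two boundary lines can be reproduced with scikit-learn; the sketch below is my own (it assumes scikit-learn is installed), using only the sepal-width and petal-width attributes of the Setosa and Versicolor instances:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

iris = load_iris()
mask = iris.target < 2                 # keep Iris Setosa (0) and Iris Versicolor (1)
X = iris.data[mask][:, [1, 3]]         # sepal width and petal width only
y = iris.target[mask]

logreg = LogisticRegression().fit(X, y)
svm = LinearSVC().fit(X, y)

# Both models are linear, but they optimize different objective functions,
# so the learned weights (and hence the boundary lines) differ.
print("logistic regression:", logreg.coef_, logreg.intercept_)
print("linear SVM:         ", svm.coef_, svm.intercept_)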

Page 12:

Linear models for scoring and ranking instances

• In Chapter 3, we explained how a tree model could produce an estimate of class membership probability.

• We can do this with linear models as well.

Page 13:

Linear models for scoring and ranking instances

• In some applications, we don't need a precise probability estimate. We simply need to rank cases by the likelihood of belonging to one class or the other.

• For example, for targeted marketing we may have a limited budget for targeting prospective customers.

• We need a list of customers ranked by their likelihood of responding to our marketing offers.

Page 14:

Linear discriminant functions can give us such a ranking for free

• See Fig. 4-4 in the textbook (next slide).

• Consider the + instances to be responders and the other instances to be non-responders.

• A linear discriminant function gives a line that separates the two classes.

• If a new customer x happens to be on the line, i.e., f(x) = 0, then we are most uncertain about his/her class.

• But if f(x) is positive and large, then x is far from the boundary and we can be fairly confident that x is a responder.

• Thus f(x) essentially gives a ranking.
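As a sketch of how such a ranking is used in practice (my own illustration on made-up customer data, assuming scikit-learn and numpy), the fitted model's f(x) scores can be sorted directly, with no probability estimates needed:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: each row is a past customer, the target records
# whether that customer responded (1) or not (0).
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 3))
y_train = (X_train @ np.array([1.0, -0.5, 2.0]) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

X_new = rng.normal(size=(5, 3))             # five new prospects to prioritize
scores = model.decision_function(X_new)     # f(x): a signed, distance-like score
ranking = np.argsort(-scores)               # largest f(x) first = most likely responders
print(list(zip(ranking, scores[ranking])))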

Page 15:

Page 16:

Support vector machine

• SVM is a type of linear discriminant.

• Instead of thinking about separating with a line, first fit the fattest bar between the classes, shown by the parallel dashed lines in the figure.

• SVM's objective function incorporates the idea that a wider bar is better.

• Once the widest bar is found, the linear discriminant will be the center line through the bar.

Page 17:

Support vector machine

• The distance between the dashed parallel lines is called the margin around the linear discriminant.

• The objective is to maximize the margin.

• The margin-maximizing boundary gives the maximal leeway for classifying points that fall close to the boundary.

Page 18:

Support vector machine

• Sometimes a single line cannot perfectly separate the data into classes, as the following figure shows.

• SVM's objective function will penalize a training point for being on the wrong side of the decision boundary.

Page 19:

Support vector machine

• If the data are not linearly separable, the best fit is some balance between a fat margin and a low total error penalty.

• The penalty for a misclassified point is proportional to its distance from the margin boundary.

Page 20:

Support vector machine

• The term "loss" is used as a general term for error penalty.

• Support vector machines use hinge loss. The hinge loss only becomes positive when an example is on the wrong side of the boundary and beyond the margin.

• Zero-one loss assigns a loss of zero for a correct decision and one for an incorrect decision.
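Both loss functions can be written in a few lines; the sketch below is my own and uses the common formulation in which an example is scored by y*f(x), with y being +1 or -1 and the hinge penalty starting at a margin of 1 (conventions for exactly where the penalty begins vary slightly between presentations):

def hinge_loss(y, fx):
    """Zero when y*f(x) >= 1 (correct side with room to spare); grows linearly otherwise."""
    return max(0.0, 1.0 - y * fx)

def zero_one_loss(y, fx):
    """Zero for a correct decision, one for an incorrect decision."""
    return 0.0 if y * fx > 0 else 1.0

for y, fx in [(+1, 2.5), (+1, 0.4), (+1, -1.0), (-1, 0.3)]:
    print(y, fx, hinge_loss(y, fx), zero_one_loss(y, fx))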

Page 21:

Class probability estimation and logistic regression

• Linear discriminants can also be used to give estimates of class probability.

• The most common procedure for doing so is called logistic regression.

• Consider the problem of using the basic linear model to estimate the class probability: f(x) = w0 + w1*x1 + w2*x2 + ...

Page 22:

Class probability estimation and logistic regression

•As we discussed, an instance being further from the separating boundary ought to lead to a higher probability of being in one class or the other, and the output of the linear function, f(x), gives the distance from the separating boundary.

• However, this presents a problem: f(x) ranges from −∞ to +∞, whereas a probability should range from zero to one.

Page 23:

Class probability estimation and logistic regression

• Probability ranges from zero to one, odds range from 0 to +∞, and log-odds range from −∞ to +∞.

Table 4-2. Probabilities, odds, and the corresponding log-odds.

Probability   Odds                 Log-odds
0.5           50:50 or 1            0
0.9           90:10 or 9            2.19
0.999         999:1 or 999          6.9
0.01          1:99 or 0.0101       –4.6
0.001         1:999 or 0.001001    –6.9
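The log-odds column can be recomputed directly as the natural logarithm of the odds (a quick check of the table, not code from the textbook):

import math

for p in (0.5, 0.9, 0.999, 0.01, 0.001):
    odds = p / (1 - p)           # odds in favor of the event
    log_odds = math.log(odds)    # natural logarithm of the odds
    print(f"p={p:<6} odds={odds:<10.4f} log-odds={log_odds:.3f}")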

Page 24:

Class probability estimation and logistic regression

• p+(x): the probability that a data item represented by feature vector x belongs to class +.

• The logistic function maps f(x), the distance from the decision boundary, to a probability: p+(x) = 1 / (1 + e^(-f(x))).

Figure: the logistic curve, p+(x) plotted against the distance from the decision boundary.
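A minimal sketch of that mapping (my own illustration): the logistic function squeezes any f(x) into the range (0, 1).

import math

def p_plus(fx):
    """Logistic function: maps the distance-like score f(x) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-fx))

for fx in (-4, -1, 0, 1, 4):
    # far negative -> near 0, on the boundary -> 0.5, far positive -> near 1
    print(fx, round(p_plus(fx), 3))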

Page 25:

Logistic regression vs. Tree induction

•A classification tree uses decision boundaries that are perpendicular to the instance-space axes, whereas the linear classifier can use decision boundaries of any direction or orientation.

Figure 4-1

Figure 4-3

Page 26:

Logistic regression vs. Tree induction

•A classification tree is a “piecewise” classifier that segments the instance space recursively when it has to, using a divide-and-conquer approach. In principle, a classification tree can cut up the instance space arbitrarily finely into very small regions.

•A linear classifier places a single decision surface through the entire space. It has great freedom in the orientation of the surface, but it is limited to a single division into two segments.

Page 27:

Logistic regression vs. Tree induction

• It is usually not easy to determine in advance which of these characteristics is a better match to a given dataset.

• When applied to a business problem, there is a difference in the comprehensibility of the models to stakeholders with different backgrounds.

Page 28:

Logistic regression (LR) vs. Tree induction

• What an LR model is doing can be quite understandable to people with a strong background in statistics, but difficult to understand for those who do not have one.

• A DT (decision tree), if it is not too large, may be considerably more understandable to someone without a strong statistics or mathematics background.

Page 29:

Logistic regression (LR) vs. Tree induction

• Why is this important? For many business problems, the data science team does not have the ultimate say in which models are used or implemented.

• Often there is at least one manager who must "sign off" on the use of a model in practice, and in many cases a set of stakeholders needs to be satisfied with the model.

Page 30:

A real example classified by LR and DT

• See pages 104-107 of the textbook.

Figure 4-11. One of the cell images from which the Wisconsin Breast Cancer dataset was derived. (Image courtesy of Nick Street and Bill Wolberg.)

Page 31:

Breast cancer classification

Table 4-3. The attributes of the Wisconsin Breast Cancer dataset.

Attribute name        Description
RADIUS                Mean of distances from center to points on the perimeter
TEXTURE               Standard deviation of grayscale values
PERIMETER             Perimeter of the mass
AREA                  Area of the mass
SMOOTHNESS            Local variation in radius lengths
COMPACTNESS           Computed as: perimeter^2 / area - 1.0
CONCAVITY             Severity of concave portions of the contour
CONCAVE POINTS        Number of concave portions of the contour
SYMMETRY              A measure of the nuclei's symmetry
FRACTAL DIMENSION     "Coastline approximation" - 1.0
DIAGNOSIS (Target)    Diagnosis of cell sample: malignant or benign

Page 32:

Breast cancer classification

Table 4-4. Linear equation learned by logistic regression on the Wisconsin Breast Cancer dataset (see text and Table 4-3 for a description of the attributes).

Attribute           Weight (learned parameter)
SMOOTHNESS_worst     22.3
CONCAVE_mean         19.47
CONCAVE_worst        11.68
SYMMETRY_worst        4.99
CONCAVITY_worst       2.86
CONCAVITY_mean        2.34
RADIUS_worst          0.25
TEXTURE_worst         0.13
AREA_SE               0.06
TEXTURE_mean          0.03
TEXTURE_SE           –0.29
COMPACTNESS_mean     –7.1
COMPACTNESS_SE      –27.87
w0 (intercept)      –17.7

Page 33:

Breast cancer classification by DT

Figure 4-13. Decision tree learned from the Wisconsin Breast Cancer dataset.
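A comparison in this spirit can be run on scikit-learn's copy of the Wisconsin Breast Cancer data (a sketch under that assumption, not the book's code; because of standardization and solver differences, the learned weights will not match Table 4-4 exactly):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X, y = data.data, data.target

# Logistic regression (features standardized so the solver converges cleanly).
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
weights = logreg.named_steps["logisticregression"].coef_[0]
for name, w in sorted(zip(data.feature_names, weights), key=lambda t: -abs(t[1]))[:5]:
    print(f"{name:25s} {w:+.2f}")          # the most influential attributes

# A small decision tree on the same data, printed as text (cf. Figure 4-13).
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(data.feature_names)))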

Page 34:

Nonlinear functions

Figure 4-12. The Iris dataset with a nonlinear feature. In this figure, logistic regression and support vector machine (both linear models) are provided an additional feature, Sepal width^2, which allows both the freedom to create more complex, nonlinear models (boundaries), as shown.
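The "Sepal width^2" trick amounts to appending a squared column to the feature matrix and refitting the same linear models; the sketch below is my own (assuming numpy and scikit-learn, and reusing the Setosa/Versicolor subset from earlier):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

iris = load_iris()
mask = iris.target < 2
X = iris.data[mask][:, [1, 3]]               # sepal width, petal width
y = iris.target[mask]

# Add the nonlinear feature: the square of sepal width.
X_nonlinear = np.column_stack([X, X[:, 0] ** 2])

# The models are still linear in their inputs, but because one input is a squared
# attribute, the boundary drawn in the original two dimensions is curved.
print(LogisticRegression().fit(X_nonlinear, y).coef_)
print(LinearSVC().fit(X_nonlinear, y).coef_)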

Page 35:

Neural networks (NN)

•NN also implement complex nonlinear numeric functions.

• We can think of an NN as a stack of models. On the bottom of the stack are the original features.

•From these features are learned a variety of relatively simple models.

•Let’s say these are LRs. Then, each subsequent layer in the stack applies a simple model (let’s say, another LR) to the outputs of the next layer down.
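A bare-bones sketch of this stacked view (my own illustration with made-up, untrained weights, assuming numpy): each layer applies a weighted sum followed by the logistic function to the outputs of the layer below.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(inputs, weights, biases):
    """One layer of the stack: several logistic-regression-like units over the same inputs."""
    return sigmoid(inputs @ weights + biases)

x = np.array([0.5, -1.2, 3.0])                  # original features (bottom of the stack)

# Made-up parameters: 3 inputs -> 4 hidden units -> 1 output unit.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

hidden = layer(x, W1, b1)       # simple models built from the original features
output = layer(hidden, W2, b2)  # another "logistic regression" over those outputs
print(output)                   # a value between 0 and 1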

Page 36:

Conclusion

• This chapter introduced a second type of predictive modeling technique, called function fitting or parametric modeling.

• In this case the model is a partially specified equation: a numeric function of the data attributes, with some unspecified numeric parameters.

• The task of the data mining procedure is to "fit" the model to the data by finding the best set of parameters, in some sense of "best."

Page 37:

Conclusion

• There are many varieties of function fitting techniques, but most use the same linear model structure: a simple weighted sum of the attribute values.

• The parameters to be fit by the data mining procedure are the weights on the attributes.

• Linear modeling techniques include linear discriminants such as support vector machines, logistic regression, and traditional linear regression.

Page 38:

Conclusion

• Conceptually, the key difference between these techniques is their answer to a key question: what exactly do we mean by best fitting the data?

• The goodness of fit is described by an "objective function," and each technique uses a different function. The resulting techniques may be quite different.

Page 39:

Conclusion

• We have now seen two very different sorts of data modeling, tree induction and function fitting, and have compared them.

• We have also introduced two criteria by which models can be evaluated: the predictive performance of a model and its intelligibility. It is often advantageous to build different sorts of models from a dataset to gain insight.

Page 40:

Conclusion

• This chapter focused on the fundamental concept of optimizing a model's fit to data.

• However, doing this leads to the most important fundamental problem with data mining: if you look hard enough, you will find structure in a dataset, even if it's just there by chance.

• This tendency is known as overfitting. Recognizing and avoiding overfitting is an important general topic in data science.

Page 41:

Overfitting

• When your learner outputs a classifier that is 100% accurate on the training data but 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.
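That gap is easy to demonstrate by comparing training and test accuracy; in the sketch below (my own, assuming scikit-learn and numpy), an unrestricted decision tree memorizes labels that are pure noise:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data whose labels are pure noise, so there is no real structure to find.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = rng.integers(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)   # unrestricted depth
print("training accuracy:", tree.score(X_train, y_train))   # ~1.0: it "finds" structure...
print("test accuracy:    ", tree.score(X_test, y_test))     # ~0.5: ...that was only chance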