Classification

Dr Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside, CA 92521
eamonn@cs.ucr.edu


Who is smarter, Humans or Pigeons?

Page 2

Section 1.1 (again)

Section 4.1 Section 4.3

Read in Detail

Section 4.2.2

Section 4.34

Glance over

Page 3

Figure: examples of class A and examples of class B.

1) What class is this object?

2) What class is this object?

Page 4

Figure: examples of class A and examples of class B.

1) What class is this object?

2) What class is this object?

Page 5

Figure: examples of class A and examples of class B.

1) What class is this object?

2) What class is this object?

Page 6

The “game” we have just been playing is Supervised Classification.

Why is it useful?

Page 7

Examples of class A: People who contracted disease X.

Examples of class B: People who are disease free.

1) What class is this person? Is this person at risk of getting the disease?

2) What class is this person? Is this person at risk of getting the disease?

Patient records shown on the slide:

Patient temperature   Blood count   Weight
99                    4214          167
98                    3214          179
97                    2763          121
99                    3234          117
97                    0012          190
99                    0114          202
98                    1014          345
99                    1214          190
97                    0118          280
99                    3452          99

Page 8

Figure: examples of class A and examples of class B.

1) What class is this object?

2) What class is this object?

Page 9

Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)

Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)

1) What class is this object? (8, 1.5)

2) What class is this object? (4.5, 7)

Page 10

Classification

There are many classification algorithms; in this class we will consider only…

• Simple Linear Classifier.

• Nearest Neighbor Classifier.

• Decision Tree.

• Naïve Bayes.

Page 11

The classification problem

• The classification algorithm is shown a number of labeled examples from the problem domain of interest. (this collection of labeled data is called the training set).

• The algorithm builds a model that “explains” the labeling of the examples. (this model may or may not be accessible to humans, depending on the algorithm).

• At some future time the algorithm is shown an unlabeled example, and asked to classify it.
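The three steps above can be sketched in a few lines of Python. This is only an illustration of the workflow, not one of the algorithms covered in this class: the `MajorityClassifier` name and its `fit`/`predict` methods are assumptions made for the example, and its "model" is simply the most common label in the training set.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical baseline whose 'model' is just the most common label."""

    def fit(self, examples, labels):
        # Steps 1 and 2: see the labeled training set and build a model from it.
        self.model = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, example):
        # Step 3: classify an unlabeled example using the stored model.
        return self.model


# A toy training set: three examples of class A and one of class B.
training_examples = [(3, 4), (1.5, 5), (2.5, 5), (5, 2)]
training_labels = ["A", "A", "A", "B"]

classifier = MajorityClassifier().fit(training_examples, training_labels)
prediction = classifier.predict((4.5, 7))
print(prediction)
```

Here the model is easy for a human to read (it is a single label); for other algorithms the learned model may be far less accessible.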

Shape Domain (figure): examples of class A and examples of class B. 1) What class is this object? 2) What class is this object?

Cat Domain (figure): examples of class A and examples of class B. 1) What class is this object? 2) What class is this object?

Page 12

Class   Income    Savings   Num_credit_cards   Is_married
A       123,000   34,100    0                  N
B       24,000    -2,000    13                 Y
A       45,200    12,100    3                  N
…       …         …         …                  …
B       423,020   23,440    0                  N
B       14,000    87,000    0                  Y
A       11,200    -2,000    2                  Y

Sample dataset for a credit worthiness problem

?       123,000   34,100    0                  N

What is this instance's class?

The number of rows is the size of the training set, and the number of columns is the dimensionality of the training set. Each row is called an instance (or exemplar); each column is called a feature.
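As a small sketch of this terminology, the rows visible on the slide can be written out in Python. The variable names are chosen for this example only, the elided rows ("…") are left out rather than guessed, and treating the class label as separate from the features (so it does not count toward the dimensionality) is an assumption of the sketch.

```python
# Feature (column) names as they appear on the slide.
feature_names = ["Income", "Savings", "Num_credit_cards", "Is_married"]

# Each row is one instance: (class label, feature values).
# Only the six rows visible on the slide are included.
training_set = [
    ("A", [123_000, 34_100, 0, "N"]),
    ("B", [24_000, -2_000, 13, "Y"]),
    ("A", [45_200, 12_100, 3, "N"]),
    ("B", [423_020, 23_440, 0, "N"]),
    ("B", [14_000, 87_000, 0, "Y"]),
    ("A", [11_200, -2_000, 2, "Y"]),
]

size = len(training_set)             # number of rows (instances)
dimensionality = len(feature_names)  # number of columns (features)

# The unlabeled instance from the slide: same features, class unknown.
query = [123_000, 34_100, 0, "N"]

print(size, dimensionality)
```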

Page 13

Visualizing classification algorithms

We can visualize some classification algorithms in 2D…

Warning: This tends to make the problem look easy...

Figure: the examples of class A and class B and the two unlabeled objects (8, 1.5) and (4.5, 7), plotted as points in a 2D plane with both axes running from 1 to 10.

Class   feature 1 (height1)   feature 2 (height2)
A       3                     4
B       5                     2.5
A       1.5                   5
…       …                     …

Page 14

1) What class is this object? (8, 1.5)

2) What class is this object? (4.5, 7)

A trivial machine learning example represented in 2D Euclidean space (both axes run from 1 to 10). The blue circles and red squares represent the two classes in our training data, and the black shapes are the objects we are trying to classify.

From now on we will only consider the 2D plots when explaining algorithms and problems. We should always remember that these plots are representations of real-world objects.

Page 15

Simple Linear Classifier

A dataset which is not linearly separable

Page 16

Piecewise Linear Classifier

Simple Quadratic Classifier (or some other function)

Page 17

1) What class is this object? (8, 1.5)

2) What class is this object? (4.5, 7)

This example is one for which we know a perfect rule, “above the diagonal is circle class, below the diagonal is square class”. (Don’t forget that for real world problems we can never know a perfect rule, even if there is one).
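The perfect rule quoted above is a linear classifier that can be written in one line. The sketch below assumes that "the diagonal" is the line y = x, with the first number of each pair on the horizontal axis and the second on the vertical axis; the function name is made up for this example.

```python
def diagonal_rule(x, y):
    """Return 'circle' above the diagonal y = x and 'square' below it."""
    # Points exactly on the diagonal are not covered by the slide's rule;
    # sending them to 'square' is an arbitrary choice for this sketch.
    return "circle" if y > x else "square"


# The two unlabeled objects from the slide.
queries = [(8, 1.5), (4.5, 7)]
answers = [diagonal_rule(x, y) for x, y in queries]
print(answers)
```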

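What happens if we learn a piecewise linear classifier or a quadratic classifier on this dataset with a small training dataset?

The more flexible boundary can match the few training examples perfectly while departing from the true diagonal rule elsewhere. This problem is called overfitting.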

Piecewise Linear Classifier

Simple Quadratic Classifier

Page 18

The Nearest Neighbor Algorithm

The nearest neighbor algorithm (NN) works by projecting the item to be classified into the same space as the training data, then finding the labeled exemplar which is closest. The class of that nearest neighbor is then assigned to the item to be classified.

In this example, the item (6, 2) is correctly classified.

In spite of its amazing simplicity, Nearest Neighbor is one of the best algorithms for many problems.

We can use many different distance measures to measure the distance between objects. Typically Euclidean distance is used.
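A minimal sketch of the nearest neighbor algorithm with Euclidean distance follows. It assumes the training data are the class A and class B feature pairs listed on the earlier slide; the data behind the slide's own (6, 2) example are not given in the text, so using these points for it is an assumption. The function name is chosen for this example.

```python
import math

# Training set: (feature pair, class label), taken from the earlier slide.
training_set = [
    ((3, 4), "A"), ((1.5, 5), "A"), ((6, 8), "A"), ((2.5, 5), "A"),
    ((5, 2.5), "B"), ((5, 2), "B"), ((8, 3), "B"), ((4.5, 3), "B"),
]


def nearest_neighbor(query, training_set):
    """Return the label of the training exemplar closest to `query`."""
    # math.dist computes the Euclidean distance between two points.
    _, label = min(training_set, key=lambda pair: math.dist(query, pair[0]))
    return label


queries = [(6, 2), (8, 1.5), (4.5, 7)]
predicted = [nearest_neighbor(q, training_set) for q in queries]
print(predicted)
```

Swapping `math.dist` for another distance function changes the measure without touching the rest of the algorithm.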

Page 19

Evaluation of Classification

• Leave-one-out

• K-fold cross validation
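Both evaluation schemes can be sketched with one function, since leave-one-out is k-fold cross validation with k equal to the number of examples. The sketch below evaluates a 1-nearest-neighbor classifier on the eight labeled points from the earlier slide; the function names and the way examples are assigned to folds (every k-th example) are assumptions of this example, and in practice the data would usually be shuffled first.

```python
import math

points = [(3, 4), (1.5, 5), (6, 8), (2.5, 5), (5, 2.5), (5, 2), (8, 3), (4.5, 3)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]


def nearest_label(query, train_points, train_labels):
    """1-nearest-neighbor prediction with Euclidean distance."""
    best = min(range(len(train_points)),
               key=lambda i: math.dist(query, train_points[i]))
    return train_labels[best]


def cross_validate(points, labels, k):
    """Accuracy of 1-NN under k-fold cross validation.

    Fold f holds out every k-th example starting at index f, so
    k = len(points) holds out exactly one example at a time (leave-one-out).
    """
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, len(points), k))
        train_idx = [i for i in range(len(points)) if i not in test_idx]
        train_points = [points[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        for i in test_idx:
            if nearest_label(points[i], train_points, train_labels) == labels[i]:
                correct += 1
    return correct / len(points)


loo_accuracy = cross_validate(points, labels, k=len(points))
four_fold_accuracy = cross_validate(points, labels, k=4)
print(loo_accuracy, four_fold_accuracy)
```

In both schemes every example is used for testing exactly once, and it is never part of the training data when it is tested.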

Page 20

Discussion of Nearest Neighbor I

• It is sensitive to irrelevant features. One possible solution is to search for good feature subsets.

• It is sensitive to noise. One possible solution is to use k-nearest neighbors (KNN).
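To illustrate the noise point, here is a hedged sketch of KNN as a majority vote among the k closest exemplars. The training data are the points from the earlier slide plus one mislabeled example that is invented for this illustration; the function name and the query are likewise hypothetical.

```python
import math
from collections import Counter

training_set = [
    ((3, 4), "A"), ((1.5, 5), "A"), ((6, 8), "A"), ((2.5, 5), "A"),
    ((5, 2.5), "B"), ((5, 2), "B"), ((8, 3), "B"), ((4.5, 3), "B"),
    ((5.5, 2.2), "A"),  # hypothetical noisy (mislabeled) example, not from the slides
]


def knn(query, training_set, k):
    """Majority vote among the k training exemplars closest to `query`."""
    neighbors = sorted(training_set,
                       key=lambda pair: math.dist(query, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


query = (5.6, 2.0)  # sits in the class B region, right next to the noisy point
label_1nn = knn(query, training_set, k=1)
label_3nn = knn(query, training_set, k=3)
print(label_1nn, label_3nn)
```

With k = 1 the single noisy neighbor decides the answer; with k = 3 it is outvoted by the two genuine class B neighbors. An odd k avoids ties in a two-class problem.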


Suppose there is a disease. Although we don’t know this, it happens that if your blood sugar is over 5.5 you have the disease and below you don’t….
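The effect of an irrelevant feature on this disease example can be sketched as follows. The patient values are hypothetical, chosen only so that the labels follow the rule above (blood sugar over 5.5 means disease); the second measurement is an invented feature with no relation to the disease, and the `features` parameter stands in for a chosen feature subset.

```python
import math

# Hypothetical patients: ((blood sugar, irrelevant measurement), label).
patients = [
    ((3.0, 2.0), "healthy"),
    ((5.0, 9.0), "healthy"),
    ((6.0, 1.0), "disease"),
    ((8.0, 7.0), "disease"),
]


def nearest_label(query, data, features):
    """1-NN using only the feature positions listed in `features`."""
    def distance(point):
        return math.sqrt(sum((query[i] - point[i]) ** 2 for i in features))

    _, label = min(data, key=lambda pair: distance(pair[0]))
    return label


query = (5.2, 1.5)  # blood sugar 5.2, so the true answer is "healthy"
sugar_only = nearest_label(query, patients, features=[0])
with_irrelevant = nearest_label(query, patients, features=[0, 1])
print(sugar_only, with_irrelevant)
```

Using only the relevant feature gives the right answer, while adding the irrelevant one lets an unrelated measurement dominate the distance. Searching over feature subsets amounts to trying different `features` lists and keeping the one that evaluates best.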

Page 21

Discussion of Nearest Neighbor II

• It is sensitive to the units in which the features are measured. One possible solution is to normalize the features.

X axis measured in feet, Y axis measured in dollars

X axis measured in inches, Y axis measured in dollars
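The sensitivity to units, and one way to normalize, can be sketched as below. The data are hypothetical (a length in feet and a price in dollars), and z-normalization (subtract the mean, divide by the standard deviation) is only one possible choice of normalization; the slides do not say which one is intended.

```python
import math
import statistics


def nearest_label(query, data):
    """1-NN with Euclidean distance."""
    _, label = min(data, key=lambda pair: math.dist(query, pair[0]))
    return label


def z_normalize(data, query):
    """Rescale each feature to zero mean and unit standard deviation,
    using statistics computed from the training data only."""
    columns = list(zip(*(point for point, _ in data)))
    means = [statistics.mean(col) for col in columns]
    # A constant feature has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]

    def scale(point):
        return tuple((v - m) / s for v, m, s in zip(point, means, stds))

    return [(scale(p), label) for p, label in data], scale(query)


# Hypothetical objects: (length in feet, price in dollars).
data_feet = [((2.0, 10.0), "A"), ((4.0, 7.0), "B")]
query_feet = (3.8, 10.0)

# The same objects with the length converted to inches.
data_inches = [((x * 12, y), label) for (x, y), label in data_feet]
query_inches = (query_feet[0] * 12, query_feet[1])

raw_feet = nearest_label(query_feet, data_feet)
raw_inches = nearest_label(query_inches, data_inches)

scaled_data, scaled_query = z_normalize(data_feet, query_feet)
norm_feet = nearest_label(scaled_query, scaled_data)
scaled_data, scaled_query = z_normalize(data_inches, query_inches)
norm_inches = nearest_label(scaled_query, scaled_data)

print(raw_feet, raw_inches, norm_feet, norm_inches)
```

On the raw values the predicted class changes when feet are converted to inches, because the length feature suddenly dominates the distance. After normalization the prediction is the same in both unit systems.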

Page 22

Discussion of Nearest Neighbor III

• Scalability

Page 23

A Famous Problem.

R. A. Fisher’s Iris Dataset.

3 classes

50 of each class

The task is to classify Iris plants into one of 3 varieties using the Petal Length and Petal Width.

Iris Setosa Iris Versicolor Iris Virginica