CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen


Page 1

CS 235 Data Mining, Winter 2011
Eamonn Keogh, UCR
TA: Abdullah Mueen

Page 2

Important Note

• All information about grades/homeworks/projects etc. will be given out at the next meeting

• Someone give me a 15 minute warning before the end of this class


Page 3


What Is Data Mining?

• Data mining (knowledge discovery from data)
  – Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

• Alternative names
  – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

• “Data mining” is not
  – Databases
  – (Deductive) query processing
  – Expert systems or small ML/statistical programs

Page 4

What is Data Mining? An example from yesterday's newspapers

Page 5


What is Data Mining? An example from the NBA

• Play-by-play information recorded by teams: who is on the court, who shoots, results

• Coaches want to know what works best: plays that work well against a given team, good/bad player matchups

• Advanced Scout (from IBM Research) is a data mining tool to answer these questions

http://www.nba.com/news_feat/beyond/0126.html

[Figure: bar chart of Overall Shooting Percentage (0 to 60) when Starks + Houston + Ward are playing]

Page 6

What is Data Mining? An example from Keogh/Mueen

[Figure: approximately 14.4 minutes of insect telemetry (0 to 30,000 samples), with repeated pattern instances at 3,664 and 9,036. The apparatus: a Beet Leafhopper (Circulifer tenellus) feeds through its stylet on a plant membrane; a voltage source, an input resistor, and conductive glue attached to the insect allow a voltage reading between the insect and the soil near the plant.]

Page 7

All these examples show…
• Lots of raw data in
• Some data mining
• Facts, rules, patterns out

[Diagram: lots of data → some data mining → some rules, facts, or patterns]

Page 8


adapted from: U. Fayyad et al. (1996), “From Data Mining to Knowledge Discovery: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

Knowledge Discovery in Databases: Process

Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge

[Diagram: raw numbers such as “12.2 3434.00232, 11.2 3454.64555, 23.6 4324.53435” flow through the process and emerge as knowledge: “There exists a planet at…”]

Page 9


Course Outline

1. Introduction: What is data mining?
   – What makes it a new and unique discipline?
   – Relationship between Data Warehousing, On-line Analytical Processing, and Data Mining
   – Data mining tasks: Clustering, Classification, Rule learning, etc.
2. Data mining process
   – Task identification
   – Data preparation/cleansing
3. Association Rule mining
   – Problem description
   – Algorithms
4. Classification
   – Bayesian
   – Nearest Neighbor
   – Linear Classifier
   – Tree-based approaches
5. Prediction
   – Regression
   – Neural Networks
6. Clustering
   – Distance-based approaches
   – Density-based approaches
7. Anomaly Detection
   – Distance based
   – Density based
   – Model based
8. Similarity Search

Page 10


Data Mining: Classification Schemes

• General functionality
  – Descriptive data mining
  – Predictive data mining

• Different views, different classifications
  – Kinds of data to be mined
  – Kinds of knowledge to be discovered
  – Kinds of techniques utilized
  – Kinds of applications adapted

Page 11


Data Mining: History of the Field

• Knowledge Discovery in Databases workshops started in ’89
  – Now a conference under the auspices of ACM SIGKDD
  – IEEE conference series started in 2001

• Key founders / technology contributors:
  – Usama Fayyad, JPL (then Microsoft, then his own company Digimine, then Yahoo! Research Labs, now CEO at Open Insights)
  – Gregory Piatetsky-Shapiro (then GTE, now his own data mining consulting company, Knowledge Stream Partners)
  – Rakesh Agrawal (IBM Research)

The term “data mining” has been around since at least 1983 – as a pejorative term in the statistics community

Page 12


Data Mining: The Big Players

Page 13


A data mining problem…

Wei Wang - School of Life Science, Fudan University, China
Wei Wang - Nonlinear Systems Laboratory, Department of Mechanical Engineering, MIT
Wei Wang - University of Maryland Baltimore County
Wei Wang - University of Naval Engineering
Wei Wang - ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences
Wei Wang - Rutgers University, New Brunswick, NJ, USA
Wei Wang - Purdue University Indianapolis
Wei Wang - INRIA Sophia Antipolis, Sophia Antipolis, France
Wei Wang - Institute of Computational Linguistics, Peking University
Wei Wang - National University of Singapore
Wei Wang - Nanyang Technological University, Singapore
Wei Wang - Computer and Electronics Engineering, University of Nebraska Lincoln, NE, USA
Wei Wang - The University of New South Wales, Australia
Wei Wang - Language Weaver, Inc.
Wei Wang - The Chinese University of Hong Kong, Mechanical and Automation Engineering
Wei Wang - Center for Engineering and Scientific Computation, Zhejiang University, China
Wei Wang - Fudan University, Shanghai, China
Wei Wang - University of North Carolina at Chapel Hill

Page 14


What Can Data Mining Do?

• Classify (categorical, regression)
• Cluster
• Summarize (summary statistics, summary rules)
• Link Analysis / Model Dependencies (association rules)
• Sequence analysis (time-series analysis, sequential associations)
• Detect Deviations

Page 15

Why is Data Mining Hard?

• Scalability
• High Dimensionality
• Heterogeneous and Complex Data
• Data Ownership and Distribution
• Non-traditional Analysis
• Overfitting
• Privacy issues

Page 16

Scale of Data

“The great strength of computers is that they can reliably manipulate vast amounts of data very quickly. Their great weakness is that they don’t have a clue as to what any of that data actually means.” (S. Cass, IEEE Spectrum, Jan 2004)

Organization | Scale of Data
Walmart | ~20 million transactions/day
Google | ~8.2 billion Web pages
Yahoo | ~10 GB Web data/hr
NASA satellites | ~1.2 TB/day
NCBI GenBank | ~22 million genetic sequences
France Telecom | 29.2 TB
UK Land Registry | 18.3 TB
AT&T Corp | 26.2 TB


Page 17


What Can Data Mining Do?


Page 18

Grasshoppers

Katydids

The Classification Problem (informal definition)

Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.

Katydid or Grasshopper?

Page 19

email

Spam

The Classification Problem

Given a collection of annotated data…

Spam or email?

Page 20

Polish

Spanish

The Classification Problem

Given a collection of annotated data…

Spanish or Polish?

Page 21

False Nettle

Stinging Nettle

The Classification Problem

Given a collection of annotated data…

Stinging Nettle or False Nettle?

Page 22

Irish

Greek

The Classification Problem

Given a collection of annotated data…

Greek or Irish?

Gunopulos, Papadopoulos, Kollios, Dardanos, Keogh, Gough, Greenhaugh, Hadleigh, Tsotras

Page 23

Grasshoppers

Katydids

The Classification Problem (informal definition)

Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.

Katydid or Grasshopper?

Page 24

For any domain of interest, we can measure features:

Thorax Length, Abdomen Length, Antennae Length, Mandible Size, Spiracle Diameter, Leg Length, Color {Green, Brown, Gray, Other}, Has Wings?

Page 25

We can store features in a database (My_Collection):

Insect ID | Abdomen Length | Antennae Length | Insect Class
1 | 2.7 | 5.5 | Grasshopper
2 | 8.0 | 9.1 | Katydid
3 | 0.9 | 4.7 | Grasshopper
4 | 1.1 | 3.1 | Grasshopper
5 | 5.4 | 8.5 | Katydid
6 | 2.9 | 1.9 | Grasshopper
7 | 6.1 | 6.6 | Katydid
8 | 0.5 | 1.0 | Grasshopper
9 | 8.3 | 6.6 | Katydid
10 | 8.1 | 4.7 | Katydid
11 | 5.1 | 7.0 | ???

The classification problem can now be expressed as:
• Given a training database (My_Collection), predict the class label of a previously unseen instance (instance 11 above).

Page 26

[Figure: scatter plot of Antennae Length (1 to 10) vs. Abdomen Length (1 to 10), with the Grasshoppers and Katydids forming two groups]

Page 27

[Figure: scatter plot of Antennae Length (1 to 10) vs. Abdomen Length (1 to 10), with the Grasshoppers and Katydids forming two groups]

We will also use this larger dataset as a motivating example…

Each of these data objects is called…
• an exemplar
• a (training) example
• an instance
• a tuple

Page 28

We will return to the previous slide in two minutes. In the meantime, we are going to play a quick game.

I am going to show you some classification problems which were shown to pigeons!

Let us see if you are as smart as a pigeon!


Page 29

Pigeon Problem 1

Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)

Page 30

Pigeon Problem 1

Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)

What class is this object? (8, 1.5)
What about this one, A or B? (4.5, 7)

Page 31

Pigeon Problem 1

Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)

(8, 1.5): This is a B!

Here is the rule. If the left bar is smaller than the right bar, it is an A; otherwise it is a B.

Page 32

Pigeon Problem 2

Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)

(8, 1.5): Even I know this one!
(7, 7): Oh! This one's hard!

Page 33

Pigeon Problem 2

Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)

(7, 7): So this one is an A.

The rule is as follows: if the two bars are of equal size, it is an A; otherwise it is a B.

Page 34

Pigeon Problem 3

Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)

(6, 6): This one is really hard! What is this, A or B?

Page 35

Pigeon Problem 3

Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)

(6, 6): It is a B!

The rule is as follows: if the square of the sum of the two bars is less than or equal to 100, it is an A; otherwise it is a B.
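The three rules can be written directly as one-line classifiers. A sketch in Python (the function names are ours):

```python
# The three "Pigeon Problem" rules as tiny classifiers.
# Inputs are the (left_bar, right_bar) pairs from the slides.

def rule1(left, right):
    """Problem 1: A if the left bar is smaller than the right bar."""
    return "A" if left < right else "B"

def rule2(left, right):
    """Problem 2: A if the two bars are of equal size."""
    return "A" if left == right else "B"

def rule3(left, right):
    """Problem 3: A if the square of the sum of the bars is <= 100."""
    return "A" if (left + right) ** 2 <= 100 else "B"

# The unlabeled examples from the slides:
print(rule1(8, 1.5))   # left > right -> B
print(rule2(7, 7))     # equal bars -> A
print(rule3(6, 6))     # (6+6)^2 = 144 > 100 -> B
```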

Page 36

Why did we spend so much time with this game?

Because we wanted to show that almost all classification problems have a geometric interpretation; check out the next 3 slides…

Page 37

Pigeon Problem 1

Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)

Here is the rule again. If the left bar is smaller than the right bar, it is an A; otherwise it is a B.

[Figure: class A and class B plotted as points, Left Bar vs. Right Bar (1 to 10 on each axis)]

Page 38

Pigeon Problem 2

Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)

[Figure: class A and class B plotted as points, Left Bar vs. Right Bar (1 to 10 on each axis)]

Let me look it up… here it is… the rule is: if the two bars are of equal size, it is an A. Otherwise it is a B.

Page 39

Pigeon Problem 3

Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)

[Figure: class A and class B plotted as points, Left Bar vs. Right Bar (10 to 100 on each axis)]

The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B.

Page 40

[Figure: scatter plot of Antennae Length (1 to 10) vs. Abdomen Length (1 to 10), with the Grasshoppers and Katydids forming two groups]

Page 41

[Figure: the same Antennae Length vs. Abdomen Length plot, with the unseen instance projected into the Katydid/Grasshopper space]

• We can “project” the previously unseen instance into the same space as the database.

• We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space.

Previously unseen instance 11: (5.1, 7.0) = ???

Page 42

Simple Linear Classifier

If the previously unseen instance is above the line, then class is Katydid; else class is Grasshopper.

[Figure: a straight decision line separating the Katydids from the Grasshoppers]

R. A. Fisher (1890-1962)
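One simple way to obtain such a linear decision boundary is to assign an instance to the class whose centroid (mean point) is closer; the boundary is then the perpendicular bisector of the two centroids. A sketch on the My_Collection data (this is an illustration, not Fisher's actual method):

```python
# A minimal linear classifier: compute the centroid of each class and
# label a new instance with the class whose centroid is nearer.  The
# implied decision boundary between the classes is a straight line.

grasshoppers = [(2.7, 5.5), (0.9, 4.7), (1.1, 3.1), (2.9, 1.9), (0.5, 1.0)]
katydids     = [(8.0, 9.1), (5.4, 8.5), (6.1, 6.6), (8.3, 6.6), (8.1, 4.7)]

def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def classify(instance):
    gx, gy = centroid(grasshoppers)
    kx, ky = centroid(katydids)
    x, y = instance
    d_g = (x - gx) ** 2 + (y - gy) ** 2   # squared distance to each centroid
    d_k = (x - kx) ** 2 + (y - ky) ** 2
    return "Katydid" if d_k < d_g else "Grasshopper"

print(classify((5.1, 7.0)))   # the previously unseen instance 11 -> Katydid
```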

Page 43

The simple linear classifier is defined for higher dimensional spaces…

Page 44

… we can visualize it as being an n-dimensional hyperplane

Page 45

It is interesting to think about what would happen in this example if we did not have the 3rd dimension…

Page 46

We can no longer get perfect accuracy with the simple linear classifier…

We could try to solve this problem by using a simple quadratic classifier or a simple cubic classifier…

However, as we will later see, this is probably a bad idea…

Page 47

[Figure: the three Pigeon Problem datasets plotted]

Which of the “Pigeon Problems” can be solved by the Simple Linear Classifier?

1) Perfect
2) Useless
3) Pretty Good

Problems that can be solved by a linear classifier are called linearly separable.

Page 48

A Famous Problem
R. A. Fisher’s Iris Dataset.

• 3 classes

• 50 of each class

The task is to classify Iris plants into one of 3 varieties using the Petal Length and Petal Width.

Iris Setosa Iris Versicolor Iris Virginica

Setosa

Versicolor

Virginica

Page 49

[Figure: petal width vs. petal length for Setosa, Versicolor, and Virginica, with two linear decision boundaries]

We can generalize the piecewise linear classifier to N classes, by fitting N-1 lines. In this case we first learned the line to (perfectly) discriminate between Setosa and Virginica/Versicolor, then we learned to approximately discriminate between Virginica and Versicolor.

If petal width > 3.272 - (0.325 * petal length) then class = Virginica
Elseif petal width…
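The first branch of this rule can be coded directly; the remaining branches are elided on the slide ("Elseif petal width…"), so this sketch only separates Virginica from the other two classes:

```python
# The first branch of the slide's piecewise linear rule for the Iris data.
# The later branches are not shown on the slide, so everything below the
# line is left as "Setosa or Versicolor" here.

def iris_rule(petal_length, petal_width):
    if petal_width > 3.272 - (0.325 * petal_length):
        return "Virginica"
    return "Setosa or Versicolor"   # further tests elided on the slide

print(iris_rule(6.0, 2.0))   # 2.0 > 3.272 - 1.95 = 1.322 -> Virginica
```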

Page 50

We have now seen one classification algorithm, and we are about to see more. How should we compare them?

• Predictive accuracy
• Speed and scalability
  – time to construct the model
  – time to use the model
  – efficiency in disk-resident databases
• Robustness
  – handling noise, missing values, irrelevant features, and streaming data
• Interpretability
  – understanding and insight provided by the model

Page 51

Predictive Accuracy I
• How do we estimate the accuracy of our classifier?

We can use K-fold cross validation

Insect ID | Abdomen Length | Antennae Length | Insect Class
1 | 2.7 | 5.5 | Grasshopper
2 | 8.0 | 9.1 | Katydid
3 | 0.9 | 4.7 | Grasshopper
4 | 1.1 | 3.1 | Grasshopper
5 | 5.4 | 8.5 | Katydid
6 | 2.9 | 1.9 | Grasshopper
7 | 6.1 | 6.6 | Katydid
8 | 0.5 | 1.0 | Grasshopper
9 | 8.3 | 6.6 | Katydid
10 | 8.1 | 4.7 | Katydid

We divide the dataset into K equal-sized sections. The algorithm is tested K times, each time leaving out one of the K sections when building the classifier, but using it to test the classifier instead.


Accuracy = Number of correct classifications / Number of instances in our database

K = 5
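The procedure above can be sketched as follows, on the My_Collection data with a stand-in 1-NN predictor (the helper names are ours):

```python
# A sketch of K-fold cross validation: split the data into K equal-sized
# sections, and K times build the classifier without one section, then
# test it on that held-out section.

data = [(2.7, 5.5, "G"), (8.0, 9.1, "K"), (0.9, 4.7, "G"), (1.1, 3.1, "G"),
        (5.4, 8.5, "K"), (2.9, 1.9, "G"), (6.1, 6.6, "K"), (0.5, 1.0, "G"),
        (8.3, 6.6, "K"), (8.1, 4.7, "K")]

def nn_correct(train, instance):
    """True if a 1-NN classifier built on `train` recovers the held-out label."""
    x, y, label = instance
    nearest = min(train, key=lambda t: (t[0] - x) ** 2 + (t[1] - y) ** 2)
    return nearest[2] == label

def k_fold_accuracy(data, k, correct):
    n = len(data)
    folds = [data[i * n // k:(i + 1) * n // k] for i in range(k)]
    right = 0
    for i, test_fold in enumerate(folds):
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        right += sum(1 for inst in test_fold if correct(train, inst))
    return right / n   # number correct / number of instances

print(k_fold_accuracy(data, 5, nn_correct))   # -> 1.0 on this small dataset
```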

Page 52

Predictive Accuracy II
• Using K-fold cross validation is a good way to set any parameters we may need to adjust in (any) classifier.

• We can do K-fold cross validation for each possible setting, and choose the model with the highest accuracy. Where there is a tie, we choose the simpler model.

• Actually, we should probably penalize the more complex models, even if they are more accurate, since more complex models are more likely to overfit (discussed later).

[Figure: three classifiers of increasing complexity fitted to the same data; Accuracy = 94%, 100%, 100%]

Page 53

Predictive Accuracy III

Accuracy = Number of correct classifications / Number of instances in our database

Accuracy is a single number; we may be better off looking at a confusion matrix, which gives us additional useful information…

True label | Classified as Cat | Classified as Dog | Classified as Pig
Cat | 100 | 0 | 0
Dog | 9 | 90 | 1
Pig | 45 | 45 | 10
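A confusion matrix can be tallied in a few lines. The class names below follow the slide, but the six true/predicted pairs are made up for illustration:

```python
# Tallying a confusion matrix: rows are true labels, columns are the
# labels the classifier assigned.  Illustrative data, not the slide's.

def confusion_matrix(truth, predicted, classes):
    m = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(truth, predicted):
        m[t][p] += 1
    return m

truth = ["Cat", "Cat", "Dog", "Dog", "Pig", "Pig"]
pred  = ["Cat", "Cat", "Dog", "Cat", "Cat", "Dog"]
m = confusion_matrix(truth, pred, ["Cat", "Dog", "Pig"])
print(m["Pig"]["Cat"])   # -> 1: one pig was classified as a cat
```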

Page 54

Speed and Scalability I

We need to consider the time and space requirements for the two distinct phases of classification:

• Time to construct the classifier. In the case of the simple linear classifier, this is the time taken to fit the line, which is linear in the number of instances.

• Time to use the model. In the case of the simple linear classifier, this is the time taken to test which side of the line the unlabeled instance falls on, which can be done in constant time.


As we shall see, some classification algorithms are very efficient in one aspect, and very poor in the other.

Page 55

Speed and Scalability II


For learning with small datasets, this is the whole picture

However, for data mining with massive datasets, it is not so much the (main memory) time complexity that matters; rather, it is how many times we have to scan the database.

This is because for most data mining operations, disk access times completely dominate the CPU times.

For data mining, researchers often report the number of times you must scan the database.

Page 56

Robustness I

We need to consider what happens when we have:

• Noise
For example, a person's age could have been mistyped as 650 instead of 65. How does this affect our classifier? (This is important only for building the classifier; if the instance to be classified is noisy, we can do nothing.)

• Missing values


For example, suppose we want to classify an insect, but we only know the abdomen length (X-axis) and not the antennae length (Y-axis). Can we still classify the instance?

Page 57

Robustness II

We need to consider what happens when we have:

• Irrelevant features
For example, suppose we want to classify people as either Suitable_Grad_Student or Unsuitable_Grad_Student, and it happens that scoring more than 5 on a particular test is a perfect indicator for this problem…


If we also use “hair_length” as a feature, how will this affect our classifier?

Page 58

Robustness III

We need to consider what happens when we have:

• Streaming data
For many real world problems, we don't have a single fixed dataset. Instead, the data continuously arrives, potentially forever… (stock market, weather data, sensor data, etc.)


Can our classifier handle streaming data?

Page 59

Interpretability

Some classifiers offer a bonus feature: the structure of the learned classifier tells us something about the domain.

[Figure: Weight vs. Height, with two linear decision boundaries separating the healthy region from two unhealthy regions]

As a trivial example, if we try to classify people's health risks based on just their height and weight, we could gain the following insight (based on the observation that a single linear classifier does not work well, but two linear classifiers do): there are two ways to be unhealthy, being obese and being too skinny.

Page 60

Nearest Neighbor Classifier

If the nearest instance to the previously unseen instance is a Katydid, then class is Katydid; else class is Grasshopper.

Joe Hodges (1922-2000) and Evelyn Fix (1904-1965)

[Figure: the Katydid/Grasshopper data plotted as Antennae Length vs. Abdomen Length, with the unseen instance assigned the class of its nearest neighbor]
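The nearest neighbor rule applied to the My_Collection table (a sketch; `nearest_neighbor` is our own name):

```python
# Nearest neighbor classification: label the unseen instance with the
# class of the single closest training instance (Euclidean distance;
# squared distance gives the same ordering, so the sqrt is skipped).

training = [(2.7, 5.5, "Grasshopper"), (8.0, 9.1, "Katydid"),
            (0.9, 4.7, "Grasshopper"), (1.1, 3.1, "Grasshopper"),
            (5.4, 8.5, "Katydid"),     (2.9, 1.9, "Grasshopper"),
            (6.1, 6.6, "Katydid"),     (0.5, 1.0, "Grasshopper"),
            (8.3, 6.6, "Katydid"),     (8.1, 4.7, "Katydid")]

def nearest_neighbor(abdomen, antennae):
    _, _, label = min(training,
                      key=lambda t: (t[0] - abdomen) ** 2 + (t[1] - antennae) ** 2)
    return label

print(nearest_neighbor(5.1, 7.0))   # instance 11 -> Katydid
```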

Page 61

This division of space is called the Dirichlet Tessellation (or Voronoi diagram, or Thiessen regions).

We can visualize the nearest neighbor algorithm in terms of a decision surface…

Note that we don't actually have to construct these surfaces; they are simply the implicit boundaries that divide the space into regions “belonging” to each instance.

Page 62

The nearest neighbor algorithm is sensitive to outliers…

The solution is to…

Page 63

We can generalize the nearest neighbor algorithm to the K-nearest neighbor (KNN) algorithm. We measure the distance to the nearest K instances, and let them vote. K is typically chosen to be an odd number.

K = 1    K = 3
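The voting scheme can be sketched as follows (toy data; the helper names are ours):

```python
from collections import Counter

# K-nearest-neighbor voting: find the K closest training instances and
# let them vote.  An odd K avoids ties when there are two classes.
# Points are (x, y, label) triples of made-up data.

def knn(training, point, k=3):
    nearest = sorted(training,
                     key=lambda t: (t[0] - point[0]) ** 2 + (t[1] - point[1]) ** 2)[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

training = [(1, 1, "A"), (2, 1, "A"), (1, 2, "A"),
            (8, 8, "B"), (9, 8, "B"), (2, 8, "B")]
print(knn(training, (2, 2), k=1))   # -> A
print(knn(training, (2, 2), k=3))   # -> A
```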

Page 64

Suppose the following is true: if an insect's antenna is longer than 5.5, it is a Katydid; otherwise it is a Grasshopper.

Using just the antenna length we get perfect classification!

The nearest neighbor algorithm is sensitive to irrelevant features…

Training data


Suppose however, we add in an irrelevant feature, for example the insect's mass.

Using both the antenna length and the insect's mass with the 1-NN algorithm, we get the wrong classification!

Page 65

How do we mitigate the nearest neighbor algorithm's sensitivity to irrelevant features?

• Use more training instances

• Ask an expert what features are relevant to the task

• Use statistical tests to try to determine which features are useful

• Search over feature subsets (in the next slide we will see why this is hard)

Page 66

Why searching over feature subsets is hard

Suppose you have the following classification problem, with 100 features, where it happens that Features 1 and 2 (the X and Y below) give perfect classification, but all 98 of the other features are irrelevant…

Using all 100 features will give poor results, but so will using only Feature 1, and so will using only Feature 2! Of the 2^100 - 1 possible subsets of the features, only one really works.

Only Feature 1

Only Feature 2

Page 67

[Diagram: the lattice of feature subsets of {1, 2, 3, 4}]
{1} {2} {3} {4}
{1,2} {1,3} {1,4} {2,3} {2,4} {3,4}
{1,2,3} {1,2,4} {1,3,4} {2,3,4}
{1,2,3,4}

• Forward Selection
• Backward Elimination
• Bi-directional Search
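Forward selection can be sketched as a greedy walk up this lattice. With a toy score in which only the pair {1, 2} works (our own stand-in, mirroring the example above), the sketch also shows why greedy search can fail:

```python
# Greedy forward selection: start with no features and repeatedly add
# the single feature that most improves a caller-supplied scoring
# function, stopping when no single addition helps.

def forward_selection(all_features, score):
    selected, best = [], score([])
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            return selected
        f = max(candidates, key=lambda f: score(selected + [f]))
        new = score(selected + [f])
        if new <= best:
            return selected        # no single additional feature helps
        selected, best = selected + [f], new

# Toy score: only the pair {1, 2} works, and each of Feature 1 or 2
# alone looks no better than using nothing.
def toy_score(subset):
    return 1.0 if set(subset) == {1, 2} else 0.1

print(forward_selection([1, 2, 3, 4], toy_score))   # -> []: greedy search
# never reaches the good pair, illustrating why subset search is hard
```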

Page 68

The nearest neighbor algorithm is sensitive to the units of measurement.

X axis measured in centimeters; Y axis measured in dollars.

The nearest neighbor to the pink unknown instance is red.

X axis measured in millimeters; Y axis measured in dollars.

The nearest neighbor to the pink unknown instance is blue.

One solution is to normalize the units to pure numbers. Typically the features are Z-normalized to have a mean of zero and a standard deviation of one: X = (X - mean(X)) / std(X)
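The normalization can be sketched as:

```python
# Z-normalizing a feature: subtract the mean and divide by the standard
# deviation, giving unit-free values.

def z_normalize(values):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

cm = [10.0, 20.0, 30.0]        # lengths in centimeters
mm = [100.0, 200.0, 300.0]     # the same lengths in millimeters
same = all(abs(a - b) < 1e-9 for a, b in zip(z_normalize(cm), z_normalize(mm)))
print(same)   # -> True: after normalization the units no longer matter
```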

Page 69

We can speed up the nearest neighbor algorithm by “throwing away” some data. This is called data editing. Note that this can sometimes improve accuracy!

One possible approach. Delete all instances that are surrounded by members of their own class.

We can also speed up classification with indexing

Page 70

Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean Distance, however this need not be the case…

Euclidean distance: D(Q,C) = sqrt( Σ_{i=1..n} (q_i − c_i)² )

Minkowski distance: D(Q,C) = ( Σ_{i=1..n} |q_i − c_i|^p )^(1/p)

Special cases of the Minkowski distance, and further alternatives:

• Manhattan (p = 1)
• Max (p = inf)
• Mahalanobis
• Weighted Euclidean
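A minimal Python sketch of these distance functions, assuming feature vectors of equal length (Mahalanobis and Weighted Euclidean are omitted because they additionally require a covariance matrix or a weight vector):

```python
def minkowski(q, c, p):
    """D(Q,C) = (sum |q_i - c_i|^p)^(1/p)"""
    return sum(abs(qi - ci) ** p for qi, ci in zip(q, c)) ** (1.0 / p)

def manhattan(q, c):            # Minkowski with p = 1
    return minkowski(q, c, 1)

def euclidean(q, c):            # Minkowski with p = 2
    return minkowski(q, c, 2)

def chebyshev(q, c):            # the p -> infinity limit: the "Max" distance
    return max(abs(qi - ci) for qi, ci in zip(q, c))

q, c = (0.0, 0.0), (3.0, 4.0)
print(euclidean(q, c), manhattan(q, c), chebyshev(q, c))  # → 5.0 7.0 4.0
```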

Page 71: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

…In fact, we can use the nearest neighbor algorithm with any distance/similarity function

ID  Name          Class

1   Gunopulos     Greek
2   Papadopoulos  Greek
3   Kollios       Greek
4   Dardanos      Greek
5   Keogh         Irish
6   Gough         Irish
7   Greenhaugh    Irish
8   Hadleigh      Irish

For example, is “Faloutsos” Greek or Irish? We could compare the name “Faloutsos” to a database of names using string edit distance…

edit_distance(Faloutsos, Keogh) = 8edit_distance(Faloutsos, Gunopulos) = 6

Hopefully, the similarity of the name (particularly the suffix) to other Greek names would mean the nearest neighbor is also a Greek name.

Specialized distance measures exist for DNA strings, time series, images, graphs, videos, sets, fingerprints etc…

Page 72: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Peter

Piter

Pioter

Piotr

Substitution (i for e)

Insertion (o)

Deletion (e)

Edit Distance Example

It is possible to transform any string Q into string C using only Substitution, Insertion and Deletion. Assume that each of these operators has a cost associated with it.

The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. Note that for now we have ignored the issue of how we can find this cheapest transformation.

How similar are the names “Peter” and “Piotr”? Assume the following cost function:

Substitution 1 Unit
Insertion 1 Unit
Deletion 1 Unit

D(Peter,Piotr) is 3

[Figure: a tree of related names derived from “Peter”: Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero, Peter]
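The cheapest transformation can be found with the classic dynamic-programming recurrence; with all three operators costing 1 Unit, this is the standard Levenshtein distance, sketched below:

```python
def edit_distance(q, c):
    """Cost of the cheapest sequence of substitutions, insertions and
    deletions (each costing 1 Unit) that transforms string q into c."""
    m, n = len(q), len(c)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                  # delete everything in q
    for j in range(n + 1):
        d[0][j] = j                  # insert everything in c
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if q[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[m][n]

print(edit_distance("Peter", "Piotr"))          # → 3
print(edit_distance("Faloutsos", "Gunopulos"))  # → 6
print(edit_distance("Faloutsos", "Keogh"))      # → 8
```

The three printed values match the worked example above and the distances quoted for the Greek/Irish name table.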

Page 73: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Dear SIR,

I am Mr. John Coleman and my sister is Miss Rose Colemen, we are the children of late Chief Paul Colemen from Sierra Leone. I am writing you in absolute confidence primarily to seek your assistance to transfer our cash of twenty one Million Dollars ($21,000.000.00) now in the custody of a private Security trust firm in Europe the money is in trunk boxes deposited and declared as family valuables by my late father as a matter of fact the company does not know the content as money, although my father made them to under stand that the boxes belongs to his foreign partner.…

Page 74: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

This mail is probably spam. The original message has been attached along with this report, so you can recognize or block similar unwanted mail in future. See http://spamassassin.org/tag/ for more details.

Content analysis details: (12.20 points, 5 required)
NIGERIAN_SUBJECT2 (1.4 points) Subject is indicative of a Nigerian spam
FROM_ENDS_IN_NUMS (0.7 points) From: ends in numbers
MIME_BOUND_MANY_HEX (2.9 points) Spam tool pattern in MIME boundary
URGENT_BIZ (2.7 points) BODY: Contains urgent matter
US_DOLLARS_3 (1.5 points) BODY: Nigerian scam key phrase ($NN,NNN,NNN.NN)
DEAR_SOMETHING (1.8 points) BODY: Contains 'Dear (something)'
BAYES_30 (1.6 points) BODY: Bayesian classifier says spam probability is 30 to 40% [score: 0.3728]

Page 75: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Acknowledgements

Some of the material used in this lecture is drawn from other sources:

• Chris Clifton
• Jiawei Han
• Dr. Hongjun Lu (Hong Kong Univ. of Science and Technology)
• Graduate students from Simon Fraser Univ., Canada, notably Eugene Belchev, Jian Pei, and Osmar R. Zaiane
• Graduate students from Univ. of Illinois at Urbana-Champaign
• Dr. Bhavani Thuraisingham (MITRE Corp. and UT Dallas)

Page 76: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Page 77: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Making Good Figures

• We are going to see many figures this quarter

• I personally feel that making good figures is very important to a paper's chance of acceptance.

Page 78: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

What's wrong with this figure? Let me count the ways… None of the arrows line up with the “circles”. The “circles” are all different sizes and aspect ratios, the (normally invisible) white bounding box around the numbers breaks the arrows in many places. The figure caption has almost no information. Circles are not aligned…

On the right is my redrawing of the figure with PowerPoint. It took me 300 seconds.

This figure is an insult to reviewers. It says, “we expect you to spend an unpaid hour to review our paper, but we don’t think it worthwhile to spend 5 minutes to make clear figures”.

Fig. 1. A sample sequence graph. The line thickness encodes relative entropy

Fig. 1. Sequence graph example

Page 79: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Note that there are figures drawn seven hundred years ago that have much better symmetry and layout. (Peter Damian, Paulus Diaconus, and others, Various saints lives: Netherlands, S. or France, N. W.; 2nd quarter of the 13th century)

Let us see some more examples of poor figures, then see some principles that can help.

Page 80: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

This figure wastes 80% of the space it takes up.

In any case, it could be replaced by a short English sentence: “We found that for selectivity ranging from 0 to 0.05, the four methods did not differ by more than 5%”.

Why did they bother with the legend, since you can’t tell the four lines apart anyway?

Page 81: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

This figure wastes almost a quarter of a page.

The ordering on the X-axis is arbitrary, so the figure could be replaced with the sentence “We found the average performance was 198 with a standard deviation of 11.2”.

The paper in question had 5 similar plots, wasting an entire page.

Page 82: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

The figure below takes up 1/6 of a page, but it only reports 3 numbers.

Page 83: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

The figure below takes up 1/6 of a page, but it only reports 2 numbers!

Actually, it really only reports one number! Only the relative times really matter, so they could have written “We found that FTW is 1007 times faster than the exact calculation, independent of the sequence length”.

Page 84: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Figure 5. Each of our 100 motions plotted as a point in 2 dimensions. The X value is set to the distance to the nearest neighbor from the same class, and the Y value is set to the distance to the nearest neighbor from any other class.

The data is plotted in Figure 5. Note that any correctly classified motions must appear in the upper left (gray) triangle.

It is not obvious from this figure which algorithm is best. The caption has almost zero information. You need to read the text very carefully to understand the figure.

Redesign by Keogh

Both figures below describe the classification of time series motions…

At a glance we can see that the accuracy is very high. We can also see that DTW tends to win when the...

In this region our algorithm wins

In this region DTW wins

Page 85: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

This should be a bar chart, the four items are unrelated.

(In any case this should probably be a table, not a figure.)

Page 86: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

This pie chart takes up a lot of space to communicate two numbers. (Better as a table, or as simple text.)

A Database Architecture For Real-Time Motion Retrieval

People that have heard of Pacman

People that have not heard of Pacman

Page 87: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

• Think about the point you want to make, should it be done with words, a table, or a figure. If a figure, what kind?

• Color helps (but you cannot depend on it)

• Linking helps (sometimes called brushing)

• Direct labeling helps

• Meaningful captions help

• Minimalism helps (Omit needless elements)

• Finally, taking great care, taking pride in your work, helps

Principles to make Good Figures

Page 88: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen
Page 89: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Direct labeling helps

It removes one level of indirection, and allows the figures to be self-explanatory (see Edward Tufte: Visual Explanations, Chapter 4)

Figure 10. Stills from a video sequence; the right hand is tracked, and converted into a time series: A) Hand at rest. B) Hand moving above holster. C) Hand moving down to grasp gun. D) Hand moving to shoulder level. E) Aiming gun.

Page 90: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

[Figure: a pie chart and a scatterplot of the same data; the scatterplot's points fall into the classes Fish, Fowl, Both, and Neither]

Linking helps interpretability I

What is Linking? Linking is connecting the same data in two views by using the same color (or thickness etc). In the figures below, color links the data in the pie chart with data in the scatterplot.

How did we get from here

To here?

It is not clear from the above figure.

See next slide for a suggested fix.

Page 91: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Linking helps interpretability II

In this figure, the color of the arrows inside the fish link to the colors of the arrows on the time series. This tells us exactly how we go from a shape to a time series. Note that there are other links; for example in II, you can tell which fish is which based on color or line thickness.

Minimalism helps: In this case, numbers on the X-axis do not mean anything, so they are deleted.

Page 92: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

© Sinauer

A nice example of linking

Page 93: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

[Figure: Detection Rate vs. False Alarm Rate curves comparing four methods: DROP1, DROP2, ABEL, EBEL]

Direct labeling helps

Note that the line thicknesses differ by powers of 2, so even in a B/W printout you can tell the four lines apart.

• Don’t cover the data with the labels! You are implicitly saying “the results are not that important”.
• Do we need all the numbers to annotate the X and Y axis?
• Can we remove the text “With Ranking”?

Minimalism helps: delete the “with Ranking”, the X-axis numbers, the grid…

Page 94: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Covering the data with the labels is a common sin

Page 95: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

These two images, which are both used to discuss an anomaly detection algorithm, illustrate many of the points discussed in previous slides.

Color helps - Direct labeling helps - Meaningful captions help

The images should be as self contained as possible, to avoid forcing the reader to look back to the text for clarification multiple times.

Note that while Figure 6 uses color to highlight the anomaly, it also uses line thickness (hard to see in PowerPoint), so this figure also works well in B/W printouts.

Page 96: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Sometime between next week and the end of the quarter:

Find a poor figure in a data mining paper.

Create a fixed version of it.

Present one to three slides about it at the beginning of a class.

Before you start to work on the poor figure, run it by me.