CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen


Page 1

CS 235 Data Mining, Winter 2011
Eamonn Keogh, UCR
TA: Abdullah Mueen

Page 2

Important Note

• All information about grades/homeworks/projects etc. will be given out at the next meeting

• Someone give me a 15 minute warning before the end of this class


Page 3


What Is Data Mining?

• Data mining (knowledge discovery from data)
  – Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

• Alternative names
  – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

• “Data mining” is not
  – Databases
  – (Deductive) query processing
  – Expert systems or small ML/statistical programs

Page 4

What is Data Mining? An example from yesterday's newspapers

Page 5


What is Data Mining? An example from the NBA

• Play-by-play information recorded by teams: who is on the court, who shoots, results

• Coaches want to know what works best: plays that work well against a given team, good/bad player matchups

• Advanced Scout (from IBM Research) is a data mining tool to answer these questions

http://www.nba.com/news_feat/beyond/0126.html

[Figure: bar chart of Overall Shooting Percentage (0 to 60) when Starks + Houston + Ward are playing]

Page 6

What is Data Mining? An example from Keogh/Mueen

[Figure: approximately 14.4 minutes of insect telemetry (0 to 30,000 samples), with repeated pattern instances at 3,664 and 9,036. The apparatus: a Beet Leafhopper (Circulifer tenellus) feeds through its stylet on a plant membrane; a voltage source, an input resistor, and conductive glue attached to the insect allow a voltage reading between the insect and the soil near the plant.]

Page 7

All these examples show…
• Lots of raw data in
• Some data mining
• Facts, rules, patterns out

[Diagram: lots of data → some data mining → some rules, facts, or patterns]

Page 8


adapted from: U. Fayyad et al. (1996), “From Data Mining to Knowledge Discovery: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

Knowledge Discovery in Databases: Process

Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge

[Diagram: raw numbers such as “12.2 3434.00232, 11.2 3454.64555, 23.6 4324.53435” flow through the process and emerge as knowledge: “There exists a planet at…”]

Page 9


Course Outline

1. Introduction: What is data mining?
   – What makes it a new and unique discipline?
   – Relationship between Data Warehousing, On-line Analytical Processing, and Data Mining
   – Data mining tasks: Clustering, Classification, Rule learning, etc.
2. Data mining process
   – Task identification
   – Data preparation/cleansing
3. Association Rule mining
   – Problem description
   – Algorithms
4. Classification
   – Bayesian
   – Nearest Neighbor
   – Linear Classifier
   – Tree-based approaches
5. Prediction
   – Regression
   – Neural Networks
6. Clustering
   – Distance-based approaches
   – Density-based approaches
7. Anomaly Detection
   – Distance based
   – Density based
   – Model based
8. Similarity Search

Page 10


Data Mining: Classification Schemes

• General functionality
  – Descriptive data mining
  – Predictive data mining

• Different views, different classifications
  – Kinds of data to be mined
  – Kinds of knowledge to be discovered
  – Kinds of techniques utilized
  – Kinds of applications adapted

Page 11


Data Mining: History of the Field

• Knowledge Discovery in Databases workshops started in ’89
  – Now a conference under the auspices of ACM SIGKDD
  – IEEE conference series started in 2001

• Key founders / technology contributors:
  – Usama Fayyad, JPL (then Microsoft, then his own company Digimine, then Yahoo! Research Labs, now CEO at Open Insights)
  – Gregory Piatetsky-Shapiro (then GTE, now his own data mining consulting company, Knowledge Stream Partners)
  – Rakesh Agrawal (IBM Research)

The term “data mining” has been around since at least 1983 – as a pejorative term in the statistics community

Page 12


Data Mining: The Big Players

Page 13


A data mining problem…

Wei Wang - School of Life Science, Fudan University, China
Wei Wang - Nonlinear Systems Laboratory, Department of Mechanical Engineering, MIT
Wei Wang - University of Maryland Baltimore County
Wei Wang - University of Naval Engineering
Wei Wang - ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences
Wei Wang - Rutgers University, New Brunswick, NJ, USA
Wei Wang - Purdue University Indianapolis
Wei Wang - INRIA Sophia Antipolis, Sophia Antipolis, France
Wei Wang - Institute of Computational Linguistics, Peking University
Wei Wang - National University of Singapore
Wei Wang - Nanyang Technological University, Singapore
Wei Wang - Computer and Electronics Engineering, University of Nebraska Lincoln, NE, USA
Wei Wang - The University of New South Wales, Australia
Wei Wang - Language Weaver, Inc.
Wei Wang - The Chinese University of Hong Kong, Mechanical and Automation Engineering
Wei Wang - Center for Engineering and Scientific Computation, Zhejiang University, China
Wei Wang - Fudan University, Shanghai, China
Wei Wang - University of North Carolina at Chapel Hill

Page 14


What Can Data Mining Do?

• Classify (categorical, regression)
• Cluster
• Summarize (summary statistics, summary rules)
• Link Analysis / Model Dependencies (association rules)
• Sequence analysis (time-series analysis, sequential associations)
• Detect Deviations

Page 15

Why is Data Mining Hard?

• Scalability
• High Dimensionality
• Heterogeneous and Complex Data
• Data Ownership and Distribution
• Non-traditional Analysis
• Overfitting
• Privacy issues

Page 16

Scale of Data

“The great strength of computers is that they can reliably manipulate vast amounts of data very quickly. Their great weakness is that they don’t have a clue as to what any of that data actually means.” (S. Cass, IEEE Spectrum, Jan 2004)

Organization | Scale of Data
Walmart | ~20 million transactions/day
Google | ~8.2 billion Web pages
Yahoo | ~10 GB Web data/hr
NASA satellites | ~1.2 TB/day
NCBI GenBank | ~22 million genetic sequences
France Telecom | 29.2 TB
UK Land Registry | 18.3 TB
AT&T Corp | 26.2 TB


Page 17


What Can Data Mining Do?


Page 18

Grasshoppers

Katydids

The Classification Problem (informal definition)

Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.

Katydid or Grasshopper?

Page 19

email

Spam

The Classification Problem

Given a collection of annotated data…

Spam or email?

Page 20

Polish

Spanish

The Classification Problem

Given a collection of annotated data…

Spanish or Polish?

Page 21

False Nettle

Stinging Nettle

The Classification Problem

Given a collection of annotated data…

Stinging Nettle or False Nettle?

Page 22

Irish

Greek

The Classification Problem

Given a collection of annotated data…

Greek or Irish?

Gunopulos, Papadopoulos, Kollios, Dardanos, Keogh, Gough, Greenhaugh, Hadleigh, Tsotras

Page 23

Grasshoppers

Katydids

The Classification Problem (informal definition)

Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.

Katydid or Grasshopper?

Page 24

For any domain of interest, we can measure features:

Thorax Length, Abdomen Length, Antennae Length, Mandible Size, Spiracle Diameter, Leg Length, Color {Green, Brown, Gray, Other}, Has Wings?

Page 25

We can store features in a database (My_Collection):

Insect ID | Abdomen Length | Antennae Length | Insect Class
1 | 2.7 | 5.5 | Grasshopper
2 | 8.0 | 9.1 | Katydid
3 | 0.9 | 4.7 | Grasshopper
4 | 1.1 | 3.1 | Grasshopper
5 | 5.4 | 8.5 | Katydid
6 | 2.9 | 1.9 | Grasshopper
7 | 6.1 | 6.6 | Katydid
8 | 0.5 | 1.0 | Grasshopper
9 | 8.3 | 6.6 | Katydid
10 | 8.1 | 4.7 | Katydid
11 | 5.1 | 7.0 | ???

The classification problem can now be expressed as:
• Given a training database (My_Collection), predict the class label of a previously unseen instance (instance 11 above).

Page 26

[Figure: scatter plot of Antennae Length (1 to 10) vs. Abdomen Length (1 to 10), with the Grasshoppers and Katydids forming two groups]

Page 27

[Figure: scatter plot of Antennae Length (1 to 10) vs. Abdomen Length (1 to 10), with the Grasshoppers and Katydids forming two groups]

We will also use this larger dataset as a motivating example…

Each of these data objects is called…
• an exemplar
• a (training) example
• an instance
• a tuple

Page 28

We will return to the previous slide in two minutes. In the meantime, we are going to play a quick game.

I am going to show you some classification problems which were shown to pigeons!

Let us see if you are as smart as a pigeon!


Page 29

Pigeon Problem 1

Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)

Page 30

Pigeon Problem 1

Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)

What class is this object? (8, 1.5)
What about this one, A or B? (4.5, 7)

Page 31

Pigeon Problem 1

Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)

(8, 1.5): This is a B!

Here is the rule. If the left bar is smaller than the right bar, it is an A; otherwise it is a B.

Page 32

Pigeon Problem 2

Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)

(8, 1.5): Even I know this one!
(7, 7): Oh! This one's hard!

Page 33

Pigeon Problem 2

Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)

(7, 7): So this one is an A.

The rule is as follows: if the two bars are of equal size, it is an A; otherwise it is a B.

Page 34

Pigeon Problem 3

Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)

(6, 6): This one is really hard! What is this, A or B?

Page 35

Pigeon Problem 3

Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)

(6, 6): It is a B!

The rule is as follows: if the square of the sum of the two bars is less than or equal to 100, it is an A; otherwise it is a B.
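The three rules can be written directly as one-line classifiers. A sketch in Python (the function names are ours):

```python
# The three "Pigeon Problem" rules as tiny classifiers.
# Inputs are the (left_bar, right_bar) pairs from the slides.

def rule1(left, right):
    """Problem 1: A if the left bar is smaller than the right bar."""
    return "A" if left < right else "B"

def rule2(left, right):
    """Problem 2: A if the two bars are of equal size."""
    return "A" if left == right else "B"

def rule3(left, right):
    """Problem 3: A if the square of the sum of the bars is <= 100."""
    return "A" if (left + right) ** 2 <= 100 else "B"

# The unlabeled examples from the slides:
print(rule1(8, 1.5))   # left > right -> B
print(rule2(7, 7))     # equal bars -> A
print(rule3(6, 6))     # (6+6)^2 = 144 > 100 -> B
```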

Page 36

Why did we spend so much time with this game?

Because we wanted to show that almost all classification problems have a geometric interpretation; check out the next 3 slides…

Page 37

Pigeon Problem 1

Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)

Here is the rule again. If the left bar is smaller than the right bar, it is an A; otherwise it is a B.

[Figure: class A and class B plotted as points, Left Bar vs. Right Bar (1 to 10 on each axis)]

Page 38

Pigeon Problem 2

Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)

[Figure: class A and class B plotted as points, Left Bar vs. Right Bar (1 to 10 on each axis)]

Let me look it up… here it is… the rule is: if the two bars are of equal size, it is an A. Otherwise it is a B.

Page 39

Pigeon Problem 3

Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)

[Figure: class A and class B plotted as points, Left Bar vs. Right Bar (10 to 100 on each axis)]

The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B.

Page 40

[Figure: scatter plot of Antennae Length (1 to 10) vs. Abdomen Length (1 to 10), with the Grasshoppers and Katydids forming two groups]

Page 41

[Figure: the same Antennae Length vs. Abdomen Length plot, with the unseen instance projected into the Katydid/Grasshopper space]

• We can “project” the previously unseen instance into the same space as the database.

• We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space.

Previously unseen instance 11: (5.1, 7.0) = ???

Page 42

Simple Linear Classifier

If the previously unseen instance is above the line, then class is Katydid; else class is Grasshopper.

[Figure: a straight decision line separating the Katydids from the Grasshoppers]

R. A. Fisher (1890-1962)
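One simple way to obtain such a linear decision boundary is to assign an instance to the class whose centroid (mean point) is closer; the boundary is then the perpendicular bisector of the two centroids. A sketch on the My_Collection data (this is an illustration, not Fisher's actual method):

```python
# A minimal linear classifier: compute the centroid of each class and
# label a new instance with the class whose centroid is nearer.  The
# implied decision boundary between the classes is a straight line.

grasshoppers = [(2.7, 5.5), (0.9, 4.7), (1.1, 3.1), (2.9, 1.9), (0.5, 1.0)]
katydids     = [(8.0, 9.1), (5.4, 8.5), (6.1, 6.6), (8.3, 6.6), (8.1, 4.7)]

def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def classify(instance):
    gx, gy = centroid(grasshoppers)
    kx, ky = centroid(katydids)
    x, y = instance
    d_g = (x - gx) ** 2 + (y - gy) ** 2   # squared distance to each centroid
    d_k = (x - kx) ** 2 + (y - ky) ** 2
    return "Katydid" if d_k < d_g else "Grasshopper"

print(classify((5.1, 7.0)))   # the previously unseen instance 11 -> Katydid
```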

Page 43

The simple linear classifier is defined for higher dimensional spaces…

Page 44

… we can visualize it as being an n-dimensional hyperplane

Page 45

It is interesting to think about what would happen in this example if we did not have the 3rd dimension…

Page 46

We can no longer get perfect accuracy with the simple linear classifier…

We could try to solve this problem by using a simple quadratic classifier or a simple cubic classifier…

However, as we will later see, this is probably a bad idea…

Page 47

[Figure: the three Pigeon Problem datasets plotted]

Which of the “Pigeon Problems” can be solved by the Simple Linear Classifier?

1) Perfect
2) Useless
3) Pretty Good

Problems that can be solved by a linear classifier are called linearly separable.

Page 48

A Famous Problem
R. A. Fisher’s Iris Dataset.

• 3 classes

• 50 of each class

The task is to classify Iris plants into one of 3 varieties using the Petal Length and Petal Width.

Iris Setosa Iris Versicolor Iris Virginica

Setosa

Versicolor

Virginica

Page 49

[Figure: petal width vs. petal length for Setosa, Versicolor, and Virginica, with two linear decision boundaries]

We can generalize the piecewise linear classifier to N classes, by fitting N-1 lines. In this case we first learned the line to (perfectly) discriminate between Setosa and Virginica/Versicolor, then we learned to approximately discriminate between Virginica and Versicolor.

If petal width > 3.272 - (0.325 * petal length) then class = Virginica
Elseif petal width…
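The first branch of this rule can be coded directly; the remaining branches are elided on the slide ("Elseif petal width…"), so this sketch only separates Virginica from the other two classes:

```python
# The first branch of the slide's piecewise linear rule for the Iris data.
# The later branches are not shown on the slide, so everything below the
# line is left as "Setosa or Versicolor" here.

def iris_rule(petal_length, petal_width):
    if petal_width > 3.272 - (0.325 * petal_length):
        return "Virginica"
    return "Setosa or Versicolor"   # further tests elided on the slide

print(iris_rule(6.0, 2.0))   # 2.0 > 3.272 - 1.95 = 1.322 -> Virginica
```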

Page 50

We have now seen one classification algorithm, and we are about to see more. How should we compare them?

• Predictive accuracy
• Speed and scalability
  – time to construct the model
  – time to use the model
  – efficiency in disk-resident databases
• Robustness
  – handling noise, missing values, irrelevant features, and streaming data
• Interpretability
  – understanding and insight provided by the model

Page 51

Predictive Accuracy I
• How do we estimate the accuracy of our classifier?

We can use K-fold cross validation

Insect ID | Abdomen Length | Antennae Length | Insect Class
1 | 2.7 | 5.5 | Grasshopper
2 | 8.0 | 9.1 | Katydid
3 | 0.9 | 4.7 | Grasshopper
4 | 1.1 | 3.1 | Grasshopper
5 | 5.4 | 8.5 | Katydid
6 | 2.9 | 1.9 | Grasshopper
7 | 6.1 | 6.6 | Katydid
8 | 0.5 | 1.0 | Grasshopper
9 | 8.3 | 6.6 | Katydid
10 | 8.1 | 4.7 | Katydid

We divide the dataset into K equal-sized sections. The algorithm is tested K times, each time leaving out one of the K sections when building the classifier, but using it to test the classifier instead.


Accuracy = Number of correct classifications / Number of instances in our database

K = 5
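The procedure above can be sketched as follows, on the My_Collection data with a stand-in 1-NN predictor (the helper names are ours):

```python
# A sketch of K-fold cross validation: split the data into K equal-sized
# sections, and K times build the classifier without one section, then
# test it on that held-out section.

data = [(2.7, 5.5, "G"), (8.0, 9.1, "K"), (0.9, 4.7, "G"), (1.1, 3.1, "G"),
        (5.4, 8.5, "K"), (2.9, 1.9, "G"), (6.1, 6.6, "K"), (0.5, 1.0, "G"),
        (8.3, 6.6, "K"), (8.1, 4.7, "K")]

def nn_correct(train, instance):
    """True if a 1-NN classifier built on `train` recovers the held-out label."""
    x, y, label = instance
    nearest = min(train, key=lambda t: (t[0] - x) ** 2 + (t[1] - y) ** 2)
    return nearest[2] == label

def k_fold_accuracy(data, k, correct):
    n = len(data)
    folds = [data[i * n // k:(i + 1) * n // k] for i in range(k)]
    right = 0
    for i, test_fold in enumerate(folds):
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        right += sum(1 for inst in test_fold if correct(train, inst))
    return right / n   # number correct / number of instances

print(k_fold_accuracy(data, 5, nn_correct))   # -> 1.0 on this small dataset
```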

Page 52

Predictive Accuracy II
• Using K-fold cross validation is a good way to set any parameters we may need to adjust in (any) classifier.

• We can do K-fold cross validation for each possible setting, and choose the model with the highest accuracy. Where there is a tie, we choose the simpler model.

• Actually, we should probably penalize the more complex models, even if they are more accurate, since more complex models are more likely to overfit (discussed later).

[Figure: three classifiers of increasing complexity fitted to the same data; Accuracy = 94%, 100%, 100%]

Page 53

Predictive Accuracy III

Accuracy = Number of correct classifications / Number of instances in our database

Accuracy is a single number; we may be better off looking at a confusion matrix, which gives us additional useful information…

True label | Classified as Cat | Classified as Dog | Classified as Pig
Cat | 100 | 0 | 0
Dog | 9 | 90 | 1
Pig | 45 | 45 | 10
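A confusion matrix can be tallied in a few lines. The class names below follow the slide, but the six true/predicted pairs are made up for illustration:

```python
# Tallying a confusion matrix: rows are true labels, columns are the
# labels the classifier assigned.  Illustrative data, not the slide's.

def confusion_matrix(truth, predicted, classes):
    m = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(truth, predicted):
        m[t][p] += 1
    return m

truth = ["Cat", "Cat", "Dog", "Dog", "Pig", "Pig"]
pred  = ["Cat", "Cat", "Dog", "Cat", "Cat", "Dog"]
m = confusion_matrix(truth, pred, ["Cat", "Dog", "Pig"])
print(m["Pig"]["Cat"])   # -> 1: one pig was classified as a cat
```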

Page 54

Speed and Scalability I

We need to consider the time and space requirements for the two distinct phases of classification:

• Time to construct the classifier. In the case of the simple linear classifier, this is the time taken to fit the line, which is linear in the number of instances.

• Time to use the model. In the case of the simple linear classifier, this is the time taken to test which side of the line the unlabeled instance falls on, which can be done in constant time.


As we shall see, some classification algorithms are very efficient in one aspect, and very poor in the other.

Page 55

Speed and Scalability II


For learning with small datasets, this is the whole picture

However, for data mining with massive datasets, it is not so much the (main memory) time complexity that matters; rather, it is how many times we have to scan the database.

This is because for most data mining operations, disk access times completely dominate the CPU times.

For data mining, researchers often report the number of times you must scan the database.

Page 56

Robustness I

We need to consider what happens when we have:

• Noise
For example, a person's age could have been mistyped as 650 instead of 65. How does this affect our classifier? (This is important only for building the classifier; if the instance to be classified is noisy, we can do nothing.)

• Missing values


For example, suppose we want to classify an insect, but we only know the abdomen length (X-axis) and not the antennae length (Y-axis). Can we still classify the instance?

Page 57

Robustness II

We need to consider what happens when we have:

• Irrelevant features
For example, suppose we want to classify people as either Suitable_Grad_Student or Unsuitable_Grad_Student, and it happens that scoring more than 5 on a particular test is a perfect indicator for this problem…


If we also use “hair_length” as a feature, how will this affect our classifier?

Page 58

Robustness III

We need to consider what happens when we have:

• Streaming data
For many real world problems, we don't have a single fixed dataset. Instead, the data continuously arrives, potentially forever… (stock market, weather data, sensor data, etc.)


Can our classifier handle streaming data?

Page 59

Interpretability

Some classifiers offer a bonus feature: the structure of the learned classifier tells us something about the domain.

[Figure: Weight vs. Height, with two linear decision boundaries separating the healthy region from two unhealthy regions]

As a trivial example, if we try to classify people's health risks based on just their height and weight, we could gain the following insight (based on the observation that a single linear classifier does not work well, but two linear classifiers do): there are two ways to be unhealthy, being obese and being too skinny.

Page 60

Nearest Neighbor Classifier

If the nearest instance to the previously unseen instance is a Katydid, then class is Katydid; else class is Grasshopper.

Joe Hodges (1922-2000) and Evelyn Fix (1904-1965)

[Figure: the Katydid/Grasshopper data plotted as Antennae Length vs. Abdomen Length, with the unseen instance assigned the class of its nearest neighbor]
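The nearest neighbor rule applied to the My_Collection table (a sketch; `nearest_neighbor` is our own name):

```python
# Nearest neighbor classification: label the unseen instance with the
# class of the single closest training instance (Euclidean distance;
# squared distance gives the same ordering, so the sqrt is skipped).

training = [(2.7, 5.5, "Grasshopper"), (8.0, 9.1, "Katydid"),
            (0.9, 4.7, "Grasshopper"), (1.1, 3.1, "Grasshopper"),
            (5.4, 8.5, "Katydid"),     (2.9, 1.9, "Grasshopper"),
            (6.1, 6.6, "Katydid"),     (0.5, 1.0, "Grasshopper"),
            (8.3, 6.6, "Katydid"),     (8.1, 4.7, "Katydid")]

def nearest_neighbor(abdomen, antennae):
    _, _, label = min(training,
                      key=lambda t: (t[0] - abdomen) ** 2 + (t[1] - antennae) ** 2)
    return label

print(nearest_neighbor(5.1, 7.0))   # instance 11 -> Katydid
```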

Page 61

This division of space is called the Dirichlet Tessellation (or Voronoi diagram, or Thiessen regions).

We can visualize the nearest neighbor algorithm in terms of a decision surface…

Note that we don't actually have to construct these surfaces; they are simply the implicit boundaries that divide the space into regions “belonging” to each instance.

Page 62

The nearest neighbor algorithm is sensitive to outliers…

The solution is to…

Page 63

We can generalize the nearest neighbor algorithm to the K-nearest neighbor (KNN) algorithm. We measure the distance to the nearest K instances, and let them vote. K is typically chosen to be an odd number.

K = 1    K = 3
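The voting scheme can be sketched as follows (toy data; the helper names are ours):

```python
from collections import Counter

# K-nearest-neighbor voting: find the K closest training instances and
# let them vote.  An odd K avoids ties when there are two classes.
# Points are (x, y, label) triples of made-up data.

def knn(training, point, k=3):
    nearest = sorted(training,
                     key=lambda t: (t[0] - point[0]) ** 2 + (t[1] - point[1]) ** 2)[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

training = [(1, 1, "A"), (2, 1, "A"), (1, 2, "A"),
            (8, 8, "B"), (9, 8, "B"), (2, 8, "B")]
print(knn(training, (2, 2), k=1))   # -> A
print(knn(training, (2, 2), k=3))   # -> A
```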

Page 64

Suppose the following is true: if an insect's antenna is longer than 5.5, it is a Katydid; otherwise it is a Grasshopper.

Using just the antenna length we get perfect classification!

The nearest neighbor algorithm is sensitive to irrelevant features…

Training data


Suppose however, we add in an irrelevant feature, for example the insect's mass.

Using both the antenna length and the insect's mass with the 1-NN algorithm, we get the wrong classification!

Page 65

How do we mitigate the nearest neighbor algorithm's sensitivity to irrelevant features?

• Use more training instances

• Ask an expert what features are relevant to the task

• Use statistical tests to try to determine which features are useful

• Search over feature subsets (in the next slide we will see why this is hard)

Page 66

Why searching over feature subsets is hard

Suppose you have the following classification problem, with 100 features, where it happens that Features 1 and 2 (the X and Y below) give perfect classification, but all 98 of the other features are irrelevant…

Using all 100 features will give poor results, but so will using only Feature 1, and so will using only Feature 2! Of the 2^100 - 1 possible subsets of the features, only one really works.

Only Feature 1

Only Feature 2

Page 67

[Diagram: the lattice of feature subsets of {1, 2, 3, 4}]
{1} {2} {3} {4}
{1,2} {1,3} {1,4} {2,3} {2,4} {3,4}
{1,2,3} {1,2,4} {1,3,4} {2,3,4}
{1,2,3,4}

• Forward Selection
• Backward Elimination
• Bi-directional Search
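Forward selection can be sketched as a greedy walk up this lattice. With a toy score in which only the pair {1, 2} works (our own stand-in, mirroring the example above), the sketch also shows why greedy search can fail:

```python
# Greedy forward selection: start with no features and repeatedly add
# the single feature that most improves a caller-supplied scoring
# function, stopping when no single addition helps.

def forward_selection(all_features, score):
    selected, best = [], score([])
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            return selected
        f = max(candidates, key=lambda f: score(selected + [f]))
        new = score(selected + [f])
        if new <= best:
            return selected        # no single additional feature helps
        selected, best = selected + [f], new

# Toy score: only the pair {1, 2} works, and each of Feature 1 or 2
# alone looks no better than using nothing.
def toy_score(subset):
    return 1.0 if set(subset) == {1, 2} else 0.1

print(forward_selection([1, 2, 3, 4], toy_score))   # -> []: greedy search
# never reaches the good pair, illustrating why subset search is hard
```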

Page 68

The nearest neighbor algorithm is sensitive to the units of measurement.

X axis measured in centimeters; Y axis measured in dollars.

The nearest neighbor to the pink unknown instance is red.

X axis measured in millimeters; Y axis measured in dollars.

The nearest neighbor to the pink unknown instance is blue.

One solution is to normalize the units to pure numbers. Typically the features are Z-normalized to have a mean of zero and a standard deviation of one: X = (X - mean(X)) / std(X)
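The normalization can be sketched as:

```python
# Z-normalizing a feature: subtract the mean and divide by the standard
# deviation, giving unit-free values.

def z_normalize(values):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

cm = [10.0, 20.0, 30.0]        # lengths in centimeters
mm = [100.0, 200.0, 300.0]     # the same lengths in millimeters
same = all(abs(a - b) < 1e-9 for a, b in zip(z_normalize(cm), z_normalize(mm)))
print(same)   # -> True: after normalization the units no longer matter
```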

Page 69

We can speed up the nearest neighbor algorithm by “throwing away” some data. This is called data editing. Note that this can sometimes improve accuracy!

One possible approach. Delete all instances that are surrounded by members of their own class.

We can also speed up classification with indexing

Page 70

Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean Distance, however this need not be the case…

Euclidean distance: D(Q,C) = sqrt( Σ_{i=1..n} (q_i − c_i)² )

Minkowski distance: D(Q,C) = ( Σ_{i=1..n} |q_i − c_i|^p )^(1/p)

Special cases of the Minkowski distance, and further alternatives:

• Manhattan (p = 1)
• Max (p = inf)
• Mahalanobis
• Weighted Euclidean
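A minimal Python sketch of these distance functions, assuming feature vectors of equal length (Mahalanobis and Weighted Euclidean are omitted because they additionally require a covariance matrix or a weight vector):

```python
def minkowski(q, c, p):
    """D(Q,C) = (sum |q_i - c_i|^p)^(1/p)"""
    return sum(abs(qi - ci) ** p for qi, ci in zip(q, c)) ** (1.0 / p)

def manhattan(q, c):            # Minkowski with p = 1
    return minkowski(q, c, 1)

def euclidean(q, c):            # Minkowski with p = 2
    return minkowski(q, c, 2)

def chebyshev(q, c):            # the p -> infinity limit: the "Max" distance
    return max(abs(qi - ci) for qi, ci in zip(q, c))

q, c = (0.0, 0.0), (3.0, 4.0)
print(euclidean(q, c), manhattan(q, c), chebyshev(q, c))  # → 5.0 7.0 4.0
```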

Page 71: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

…In fact, we can use the nearest neighbor algorithm with any distance/similarity function

ID  Name          Class

1   Gunopulos     Greek
2   Papadopoulos  Greek
3   Kollios       Greek
4   Dardanos      Greek
5   Keogh         Irish
6   Gough         Irish
7   Greenhaugh    Irish
8   Hadleigh      Irish

For example, is “Faloutsos” Greek or Irish? We could compare the name “Faloutsos” to a database of names using string edit distance…

edit_distance(Faloutsos, Keogh) = 8edit_distance(Faloutsos, Gunopulos) = 6

Hopefully, the similarity of the name (particularly the suffix) to other Greek names would mean the nearest neighbor is also a Greek name.

Specialized distance measures exist for DNA strings, time series, images, graphs, videos, sets, fingerprints etc…

Page 72: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Peter

Piter

Pioter

Piotr

Substitution (i for e)

Insertion (o)

Deletion (e)

Edit Distance Example

It is possible to transform any string Q into string C using only Substitution, Insertion and Deletion. Assume that each of these operators has a cost associated with it.

The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. Note that for now we have ignored the issue of how we can find this cheapest transformation.

How similar are the names “Peter” and “Piotr”? Assume the following cost function:

Substitution 1 Unit
Insertion 1 Unit
Deletion 1 Unit

D(Peter,Piotr) is 3

[Figure: a tree of related names derived from “Peter”: Piotr, Pyotr, Petros, Pietro, Pedro, Pierre, Piero, Peter]
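The cheapest transformation can be found with the classic dynamic-programming recurrence; with all three operators costing 1 Unit, this is the standard Levenshtein distance, sketched below:

```python
def edit_distance(q, c):
    """Cost of the cheapest sequence of substitutions, insertions and
    deletions (each costing 1 Unit) that transforms string q into c."""
    m, n = len(q), len(c)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                  # delete everything in q
    for j in range(n + 1):
        d[0][j] = j                  # insert everything in c
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if q[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[m][n]

print(edit_distance("Peter", "Piotr"))          # → 3
print(edit_distance("Faloutsos", "Gunopulos"))  # → 6
print(edit_distance("Faloutsos", "Keogh"))      # → 8
```

The three printed values match the worked example above and the distances quoted for the Greek/Irish name table.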

Page 73: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Dear SIR,

I am Mr. John Coleman and my sister is Miss Rose Colemen, we are the children of late Chief Paul Colemen from Sierra Leone. I am writing you in absolute confidence primarily to seek your assistance to transfer our cash of twenty one Million Dollars ($21,000.000.00) now in the custody of a private Security trust firm in Europe the money is in trunk boxes deposited and declared as family valuables by my late father as a matter of fact the company does not know the content as money, although my father made them to under stand that the boxes belongs to his foreign partner.…

Page 74: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

This mail is probably spam. The original message has been attached along with this report, so you can recognize or block similar unwanted mail in future. See http://spamassassin.org/tag/ for more details.

Content analysis details: (12.20 points, 5 required)
NIGERIAN_SUBJECT2 (1.4 points) Subject is indicative of a Nigerian spam
FROM_ENDS_IN_NUMS (0.7 points) From: ends in numbers
MIME_BOUND_MANY_HEX (2.9 points) Spam tool pattern in MIME boundary
URGENT_BIZ (2.7 points) BODY: Contains urgent matter
US_DOLLARS_3 (1.5 points) BODY: Nigerian scam key phrase ($NN,NNN,NNN.NN)
DEAR_SOMETHING (1.8 points) BODY: Contains 'Dear (something)'
BAYES_30 (1.6 points) BODY: Bayesian classifier says spam probability is 30 to 40% [score: 0.3728]

Page 75: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Acknowledgements

Some of the material used in this lecture is drawn from other sources:

• Chris Clifton
• Jiawei Han
• Dr. Hongjun Lu (Hong Kong Univ. of Science and Technology)
• Graduate students from Simon Fraser Univ., Canada, notably Eugene Belchev, Jian Pei, and Osmar R. Zaiane
• Graduate students from Univ. of Illinois at Urbana-Champaign
• Dr. Bhavani Thuraisingham (MITRE Corp. and UT Dallas)

Page 76: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Page 77: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Making Good Figures

• We are going to see many figures this quarter

• I personally feel that making good figures is very important to a paper's chance of acceptance.

Page 78: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

What's wrong with this figure? Let me count the ways… None of the arrows line up with the “circles”. The “circles” are all different sizes and aspect ratios, the (normally invisible) white bounding box around the numbers breaks the arrows in many places. The figure caption has almost no information. Circles are not aligned…

On the right is my redrawing of the figure with PowerPoint. It took me 300 seconds.

This figure is an insult to reviewers. It says, “we expect you to spend an unpaid hour to review our paper, but we don’t think it worthwhile to spend 5 minutes to make clear figures”.

Fig. 1. A sample sequence graph. The line thickness encodes relative entropy

Fig. 1. Sequence graph example

Page 79: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Note that there are figures drawn seven hundred years ago that have much better symmetry and layout. (Peter Damian, Paulus Diaconus, and others, Various saints lives: Netherlands, S. or France, N. W.; 2nd quarter of the 13th century)

Let us see some more examples of poor figures, then see some principles that can help.

Page 80: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

This figure wastes 80% of the space it takes up.

In any case, it could be replaced by a short English sentence: “We found that for selectivity ranging from 0 to 0.05, the four methods did not differ by more than 5%”.

Why did they bother with the legend, since you can’t tell the four lines apart anyway?

Page 81: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

This figure wastes almost a quarter of a page.

The ordering on the X-axis is arbitrary, so the figure could be replaced with the sentence “We found the average performance was 198 with a standard deviation of 11.2”.

The paper in question had 5 similar plots, wasting an entire page.

Page 82: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

The figure below takes up 1/6 of a page, but it only reports 3 numbers.

Page 83: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

The figure below takes up 1/6 of a page, but it only reports 2 numbers!

Actually, it really only reports one number! Only the relative times really matter, so they could have written “We found that FTW is 1007 times faster than the exact calculation, independent of the sequence length”.

Page 84: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Figure 5. Each of our 100 motions plotted as a point in 2 dimensions. The X value is set to the distance to the nearest neighbor from the same class, and the Y value is set to the distance to the nearest neighbor from any other class.

The data is plotted in Figure 5. Note that any correctly classified motions must appear in the upper left (gray) triangle.

It is not obvious from this figure which algorithm is best. The caption has almost zero information. You need to read the text very carefully to understand the figure.

Redesign by Keogh

Both figures below describe the classification of time series motions…

At a glance we can see that the accuracy is very high. We can also see that DTW tends to win when the...

In this region our algorithm wins

In this region DTW wins

Page 85: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

This should be a bar chart, the four items are unrelated.

(In any case this should probably be a table, not a figure.)

Page 86: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

This pie chart takes up a lot of space to communicate two numbers. (Better as a table, or as simple text.)

A Database Architecture For Real-Time Motion Retrieval

People that have heard of Pacman

People that have not heard of Pacman

Page 87: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

• Think about the point you want to make, should it be done with words, a table, or a figure. If a figure, what kind?

• Color helps (but you cannot depend on it)

• Linking helps (sometimes called brushing)

• Direct labeling helps

• Meaningful captions help

• Minimalism helps (Omit needless elements)

• Finally, taking great care, taking pride in your work, helps

Principles to make Good Figures

Page 88: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen
Page 89: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Direct labeling helps

It removes one level of indirection, and allows the figures to be self-explanatory (see Edward Tufte: Visual Explanations, Chapter 4)

Figure 10. Stills from a video sequence; the right hand is tracked, and converted into a time series: A) Hand at rest. B) Hand moving above holster. C) Hand moving down to grasp gun. D) Hand moving to shoulder level. E) Aiming gun.

Page 90: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

[Figure: a pie chart and a scatterplot of the same data; the scatterplot's points fall into the classes Fish, Fowl, Both, and Neither]

Linking helps interpretability I

What is Linking? Linking is connecting the same data in two views by using the same color (or thickness etc). In the figures below, color links the data in the pie chart with data in the scatterplot.

How did we get from here

To here?

It is not clear from the above figure.

See next slide for a suggested fix.

Page 91: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Linking helps interpretability II

In this figure, the color of the arrows inside the fish link to the colors of the arrows on the time series. This tells us exactly how we go from a shape to a time series. Note that there are other links; for example in II, you can tell which fish is which based on color or line thickness.

Minimalism helps: In this case, numbers on the X-axis do not mean anything, so they are deleted.

Page 92: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

© Sinauer

A nice example of linking

Page 93: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

[Figure: Detection Rate vs. False Alarm Rate curves comparing four methods: DROP1, DROP2, ABEL, EBEL]

Direct labeling helps

Note that the line thicknesses differ by powers of 2, so even in a B/W printout you can tell the four lines apart.

• Don’t cover the data with the labels! You are implicitly saying “the results are not that important”.
• Do we need all the numbers to annotate the X and Y axis?
• Can we remove the text “With Ranking”?

Minimalism helps: delete the “with Ranking”, the X-axis numbers, the grid…

Page 94: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Covering the data with the labels is a common sin

Page 95: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

These two images, which are both used to discuss an anomaly detection algorithm, illustrate many of the points discussed in previous slides.

Color helps - Direct labeling helps - Meaningful captions help

The images should be as self contained as possible, to avoid forcing the reader to look back to the text for clarification multiple times.

Note that while Figure 6 uses color to highlight the anomaly, it also uses line thickness (hard to see in PowerPoint), so this figure also works well in B/W printouts.

Page 96: CS 235 Data Mining Winter 2011 Eamonn Keogh UCR TA: Abdullah Mueen

Sometime between next week and the end of the quarter:

Find a poor figure in a data mining paper.

Create a fixed version of it.

Present one to three slides about it at the beginning of a class.

Before you start to work on the poor figure, run it by me.