KDD Poster Nurjahan Begum

Observation 1: The convergence of DTW and Euclidean distance results for increasing data sizes.

Observation 2: The increasing effectiveness of lower-bounding pruning for increasing data sizes.

Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy

Nurjahan Begum, Liudmila Ulanova, Jun Wang¹ and Eamonn Keogh

University of California, Riverside; UT Dallas¹

Why is DTW Clustering Hard?

Motivation of DTW Clustering

Density Peaks (DP) Algorithm

Why Existing Work is not the Answer?

TADPole: Our Proposed Algorithm

How ‘good’ are TADPole Clusters?

Case Study 1: Electromagnetic Articulograph

How Effective is TADPole’s Pruning?

[Figure: hashtag time series over 0 to 120 hours for #kanyewest, #Michael, #MichaelJackson and #taylorswift, annotated "Synonym Discovery?", "Association Discovery?" and “I’mma let you finish”.]

[Figure: Bos taurus, Hyperoodon ampullatus and Talpa europaea compared under DTW and under ED; group label: Cetartiodactyla.]

[Figure: 1-NN error rate vs. size of training set (0 to 2000), for Euclidean and DTW.]

[Figure: Rand Index vs. dataset size (0 to 2000), for DTW and Euclidean.]

Neither of these two observations helps!

[Figure: two views of 13 numbered objects, with the annotations "Mislabeled by k-means" and "Outlier".]

Scalability Issue: DTW is not a metric, so it is very difficult to index.

Quality Issue: We need a clustering algorithm that is insensitive to outliers.

3 steps

1. Density Calculation

2. NN within Higher Density List Calculation

3. Cluster Assignment
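
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes a full, symmetric distance matrix is already available, and the rule used to pick cluster centers (the `k` objects with the largest density times NN distance) is an assumption, since the poster does not state it. The names `density_peaks`, `dist`, `dc` and `k` are hypothetical.

```python
def density_peaks(dist, dc, k):
    """Sketch of the three Density Peaks steps on a full distance matrix."""
    n = len(dist)
    # Step 1: local density = number of other objects closer than the cutoff dc.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < dc)
           for i in range(n)]
    # Objects sorted by descending density; index breaks ties.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # Step 2: nearest neighbor within the list of higher-density objects.
    delta = [0.0] * n
    nn = [-1] * n
    for pos, i in enumerate(order):
        if pos == 0:
            delta[i] = max(dist[i])  # densest object has no higher-density NN
            continue
        j = min(order[:pos], key=lambda h: dist[i][h])
        delta[i], nn[i] = dist[i][j], j

    # Step 3: choose centers (assumed rule: largest rho * delta), then each
    # remaining object inherits the label of its higher-density NN.
    centers = sorted(range(n), key=lambda i: (-rho[i] * delta[i], i))[:k]
    label = [-1] * n
    for c, i in enumerate(centers):
        label[i] = c
    for i in order:
        if label[i] == -1:
            label[i] = label[nn[i]]
    return rho, label


# Usage: six points on a line forming two well-separated groups.
pts = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in pts] for a in pts]
rho, label = density_peaks(dist, dc=1.5, k=2)
```

Processing objects in descending density order means that by the time an object is labeled, its higher-density neighbor already has a label, so a single pass suffices.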

[Figure: the 13 numbered objects, their local density values ρ, and a list of "Elements with higher density".]

Item 1’s cluster label = item 3’s cluster label

$$\rho_i = \sum_j \chi(d_{ij} - d_c)$$

where χ(x) = 1 if x < 0 and 0 otherwise, and d_c is the cutoff distance.

Pruning During Local Density Computation

[Figure: four cases, A) to D), for a pair of objects i and j, each showing LBMatrix(i,j), D_ij and UBMatrix(i,j) relative to the cutoff distance d_c. In case A), D_ij = 0.]
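
To make the idea of pruning during the density computation concrete, here is a hedged Python sketch. The poster's text names only LBMatrix, UBMatrix, D_ij and d_c, so the exact rule behind each panel is not recoverable here; the sketch uses the standard bound argument instead: a pair whose lower bound is at least d_c cannot count toward the density, a pair whose upper bound is below d_c certainly counts, and only the remaining pairs need an exact distance. The function `pruned_density` and the callback `exact_dist` are hypothetical names.

```python
def pruned_density(lb, ub, dc, exact_dist):
    """Local densities using lower/upper bounds to skip exact distances.

    lb[i][j] <= true distance <= ub[i][j] is assumed for every pair.
    Returns (rho, number_of_exact_distance_calls).
    """
    n = len(lb)
    rho = [0] * n
    computed = 0
    for i in range(n):
        for j in range(i + 1, n):
            if lb[i][j] >= dc:
                continue              # true distance >= dc: cannot count
            if ub[i][j] < dc:
                within = True         # true distance < dc: counts for sure
            else:
                computed += 1         # bounds are inconclusive
                within = exact_dist(i, j) < dc
            if within:
                rho[i] += 1
                rho[j] += 1
    return rho, computed


# Usage with synthetic bounds around a cheap stand-in distance.
pts = [0, 1, 2, 10, 11, 12]
d = lambda i, j: abs(pts[i] - pts[j])
lb = [[0.5 * d(i, j) for j in range(6)] for i in range(6)]
ub = [[1.2 * d(i, j) for j in range(6)] for i in range(6)]
rho, computed = pruned_density(lb, ub, dc=1.5, exact_dist=d)
```

Because both skip rules follow from valid bounds, the densities match what a brute-force computation would give; this is what makes the pruning admissible.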

Pruning During NN Distance Calculation From Higher Density List

[Figure: four cases, A) to D), for object i and candidates j1 to j4 from its higher density list, each showing LBMatrix(i,jk), the distance Dk and UBMatrix(i,jk).]
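
One plausible way to prune the nearest-neighbor search over the higher density list is sketched below. It is an assumption-laden illustration rather than the authors' procedure: the smallest upper bound over all candidates caps the true NN distance, so any candidate whose lower bound exceeds that cap (or the best exact distance found so far) can be skipped without changing the result. The names `pruned_nn` and `exact_dist` are hypothetical.

```python
def pruned_nn(i, candidates, lb, ub, exact_dist):
    """Nearest neighbor of i among candidates, skipping hopeless ones.

    lb[i][j] <= true distance <= ub[i][j] is assumed for every pair.
    Returns (nearest_candidate, distance, number_of_exact_distance_calls).
    """
    # The true NN distance can never exceed the smallest upper bound.
    best_ub = min(ub[i][j] for j in candidates)
    best_j, best_d, computed = None, float("inf"), 0
    # Visit candidates in ascending lower-bound order so we can stop early.
    for j in sorted(candidates, key=lambda j: lb[i][j]):
        if lb[i][j] > min(best_ub, best_d):
            break                     # every remaining candidate is farther
        computed += 1
        dist_ij = exact_dist(i, j)
        if dist_ij < best_d:
            best_j, best_d = j, dist_ij
    return best_j, best_d, computed


# Usage: object 3 searches among four higher-density candidates.
pts = [0, 1, 2, 10, 11, 12]
d = lambda i, j: abs(pts[i] - pts[j])
lb = [[0.5 * d(i, j) for j in range(6)] for i in range(6)]
ub = [[1.2 * d(i, j) for j in range(6)] for i in range(6)]
result = pruned_nn(3, [1, 4, 0, 2], lb, ub, d)
```

Sorting by lower bound is a design choice that lets the loop terminate at the first candidate that fails the test, since all later candidates have lower bounds at least as large.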

[Figure: number of distance calculations vs. number of objects (0 to 3500), shown as an absolute number (×10^6) and as a percentage, with curves labeled Brute force and TADPole.]

DP: 9 hours. TADPole: 9 minutes.

Distance Computation Ordering: Anytime TADPole

[Figure: Rand Index (0.4 to 1) vs. distance computation percentage (0 to 100%), for Euclidean Distance, Oracle Order, Random Order and TADPole Order. Annotation: "This reflects the 90% of DTW calculations that were admissibly pruned."]

[Figure: zoom-in of the above figure over 0 to 10%, for Oracle Order, Random Order and TADPole Order. Annotation: "This reflects the 10% of DTW calculations that were calculated in anytime ordering."]

[Figure: trajectories plotted on Y and Z axes, and Rand Index (0.84 to 1) vs. distance computation percentage for Euclidean Distance, Oracle Order, Random Order and TADPole Order.]

Pruning: 94%

Case Study 2: Pulsus Dataset

[Figure: PPG examples labeled Suspected Pulsus, Severe Pulsus and Healthy; oximeter diagram showing LED, Photo Detector, Vein and Artery.]

[Figure: panels A) to F) of power spectral density vs. frequency, with Normalized Respiration Rate and Normalized Heart Rate marked, for Patient 639, Patient 523, Patient 618 and Patient 2975918.]

[Figure: PPG traces for Non-Severe Pulsus and Severe Pulsus.]

Reproducibility

All the code and datasets used in this paper are publicly available at: www.cs.ucr.edu/~nbegu001/SpeededClusteringDTW

Pruning: 88%