CS 260 Winter 2014: Eamonn Keogh’s Presentation of Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn Keogh (2012). Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping. SIGKDD 2012. Slides I created for this 260 class have this green background.


Page 1: CS 260 Winter 2014 Eamonn Keogh’s Presentation of


CS 260 Winter 2014

Eamonn Keogh’s Presentation of

Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn Keogh (2012). Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping SIGKDD 2012

Slides I created for this 260 class have this green background

Page 2: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

What is a Time Series?

[Figure: example time series plots. One trace is annotated with the phases "Hand at rest", "Hand moving above holster", "Hand moving down to grasp gun", "Hand moving to shoulder level", and "Shooting"; another spans the years 2000 to 2002 and is labeled "Lance Armstrong?".]

Page 3: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

What is Similarity Search I?

Where is the closest match to Q in T?

[Figure: a short query Q and a long time series T]

Page 4: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

What is Similarity Search II?

Where is the closest match to Q in T?

[Figure: a short query Q and a long time series T]

Page 5: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

What is Similarity Search II?

Note that we must normalize the data.

[Figure: a short query Q and a long time series T]
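The search described on these slides can be sketched as a brute-force scan. This is only a minimal illustration, assuming Euclidean distance and z-normalization of both the query and every window; the function names `znorm` and `nearest_subsequence` are made up for this example and are not taken from the UCR Suite code.

```python
import math

def znorm(x):
    """Z-normalize a sequence to zero mean and unit standard deviation."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0:
        # Assumption for this sketch: a constant window maps to all zeros.
        return [0.0] * n
    return [(v - mean) / std for v in x]

def nearest_subsequence(T, Q):
    """Return (start_index, distance) of the window of T closest to Q."""
    m = len(Q)
    q = znorm(Q)
    best_idx, best_dist = -1, math.inf
    for i in range(len(T) - m + 1):
        c = znorm(T[i:i + m])  # every candidate window is normalized separately
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist
```

Because each window is normalized on its own, a query such as [1, 2, 1] matches the window [10, 20, 10] even though offset and amplitude differ. This naive scan costs O(m) per window, which is the cost the rest of the talk sets out to reduce.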

Page 6: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

What is Indexing I?

Indexing refers to any technique for searching a collection of items without having to examine every object.

Obvious example: search by last name.

Let's look for Poe…

[Figure: an alphabetical index with four bins: A-B-C-D-E-F, G-H-I-J-K-L-M, N-O-P-Q-R-S, T-U-V-W-X-Y-Z]

Page 7: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

What is Indexing II?

It is possible to index almost anything, using Spatial Access Methods (SAMs).

[Figure: T and Q]

Page 8: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

What is Indexing II?

It is possible to index almost anything, using Spatial Access Methods (SAMs).

Page 9: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

What is Dynamic Time Warping?

[Figure: DTW alignment between two time series labeled "Mountain Gorilla (Gorilla gorilla beringei)" and "Lowland Gorilla (Gorilla gorilla graueri)"]

Page 10: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping

Thanawin (Art) Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Qiang Zhu, Brandon Westover, Jesin Zakaria, Eamonn Keogh

Page 11: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

What is a Trillion?

• A trillion is simply one million million.
• Up to 2011 there have been 1,709 papers in this conference. If every such paper were on time series, and each had looked at five hundred million objects, this would still not add up to the size of the data we consider here.
• However, the largest time series data considered in a SIGKDD paper was a “mere” one hundred million objects.

Page 12: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Dynamic Time Warping

[Figure: Q and C with similar but out-of-phase peaks; the alignment between Q and C; and the warping matrix of Q and C with the warping window R]
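For readers who have not seen DTW written out, the following is a minimal dynamic-programming sketch, assuming equal-length series, squared point-to-point costs, and a warping window given as a number of points `r` (a band of half-width `r` around the diagonal). It returns the accumulated squared cost without a final square root; the function name `dtw` and the two-row layout are choices made for this example only.

```python
import math

def dtw(q, c, r):
    """Accumulated squared-cost DTW between equal-length q and c, window r."""
    n = len(q)
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        curr = [math.inf] * (n + 1)
        # Only cells with |i - j| <= r lie inside the warping window.
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[n]
```

With `r = 0` the only path is the diagonal, so the result equals the squared Euclidean distance. For q = [0, 1, 2, 1, 0, 0] and c = [0, 0, 1, 2, 1, 0], whose peaks are one step out of phase, `r = 0` gives 4.0 while `r = 1` lets the peaks align and gives 0.0.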

Page 13: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Motivation

• Similarity search is the bottleneck for most time series data mining algorithms.
• The difficulty of scaling search to large datasets explains why most academic work has considered at most a few million time series objects.

Page 14: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Objective

• Search and mine really big time series.
• Allow us to solve higher-level time series data mining problems, such as motif discovery and clustering, at scales that would otherwise be untenable.

Page 15: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Assumptions (1)

• Time Series Subsequences must be Z-Normalized
  – In order to make meaningful comparisons between two time series, both must be normalized.
  – Offset invariance.
  – Scale/Amplitude invariance.
• Dynamic Time Warping is the Best Measure (for almost everything)
  – Recent empirical evidence strongly suggests that none of the published alternatives routinely beats DTW.

[Figure: time series labeled A, B, and C]

Page 16: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Assumptions (2)

• Arbitrary Query Lengths cannot be Indexed
  – If we are interested in tackling a trillion data objects we clearly cannot fit even a small footprint index in the main memory, much less the much larger index suggested for arbitrary length queries.
• There Exist Data Mining Problems that we are Willing to Wait Some Hours to Answer
  – a team of entomologists has spent three years gathering 0.2 trillion datapoints
  – astronomers have spent billions of dollars to launch a satellite to collect one trillion datapoints of star-light curve data per day
  – a hospital charges $34,000 for a daylong EEG session to collect 0.3 trillion datapoints

Page 17: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Proposed Method: UCR Suite

• An algorithm for nearest neighbor search
• Supports both ED and DTW search
• Combination of various optimizations
  – Known Optimizations
  – New Optimizations

Page 18: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

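Known Optimizations (1)

• Using the Squared Distance

  $ED(Q,C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$

  (the square root can be dropped, since it does not change the ranking of candidates)

• Exploiting Multicores
  – More cores, more speed
• Lower Bounding
  – LB_Yi
  – LB_Kim
  – LB_Keogh

[Figure: LB_Keogh, showing C against the upper envelope U and lower envelope L of Q]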

Page 19: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

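Known Optimizations (2)

• Early Abandoning of ED

  $ED(Q,C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$, compared against bsf (the best-so-far distance)

  [Figure: Q and C, with the annotation "We can early abandon at this point"]

• Early Abandoning of LB_Keogh

  [Figure: C against the envelope of Q. U, L is an envelope of Q]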
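Early abandoning of ED can be sketched in a few lines. This is an illustration only, assuming squared distances throughout (so `bsf` is the best-so-far squared distance) and using a made-up function name; it is not the UCR Suite source.

```python
import math

def early_abandon_sq_ed(q, c, bsf):
    """Squared ED between q and c, or math.inf once the running sum reaches bsf."""
    total = 0.0
    for qi, ci in zip(q, c):
        total += (qi - ci) ** 2
        if total >= bsf:
            # The candidate cannot beat the best-so-far match: stop here.
            return math.inf
    return total
```

The running sum only grows, so the moment it reaches `bsf` the remaining points cannot change the outcome. The same loop works for LB_Keogh if the per-point term is replaced by the distance from `ci` to the envelope.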

Page 20: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

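Known Optimizations (3)

• Early Abandoning of DTW
• Earlier Early Abandoning of DTW using LB_Keogh

[Figure: Q and C with the warping matrix and warping window R; dtw_dist accumulates through the matrix. Stop if dtw_dist ≥ bsf]

[Figure: C against the envelope U, L of Q, with panels K = 0 and K = 11 and the labels "Fully calculated LB_Keogh", "About to begin calculation of DTW", "Partial calculation of DTW", and "Partial truncation of LB_Keogh"]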

Page 21: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Known Optimizations (3)

• Early Abandoning of DTW
• Earlier Early Abandoning of DTW using LB_Keogh

[Figure: Q and C with the warping matrix and warping window R, split into a (partial) dtw_dist part and a (partial) lb_keogh part. Stop if dtw_dist + lb_keogh ≥ bsf]

[Figure: C against the envelope U, L of Q, with panels K = 0 and K = 11 and the labels "Fully calculated LB_Keogh", "About to begin calculation of DTW", "Partial calculation of DTW", and "Partial truncation of LB_Keogh"]
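The stopping rule "dtw_dist + lb_keogh ≥ bsf" can be sketched as follows. This is a simplified illustration under several assumptions: equal-length series, squared costs, the envelope built around `q`, and a suffix sum of per-point LB_Keogh terms standing in for the part of `c` the DTW computation has not reached yet. All function names are hypothetical.

```python
import math

def lb_keogh_terms(q, c, r):
    """Per-point squared distance from each c[i] to the envelope of q."""
    terms = []
    for i in range(len(q)):
        window = q[max(0, i - r): i + r + 1]
        u, l = max(window), min(window)
        if c[i] > u:
            terms.append((c[i] - u) ** 2)
        elif c[i] < l:
            terms.append((c[i] - l) ** 2)
        else:
            terms.append(0.0)
    return terms

def dtw_early_abandon(q, c, r, bsf):
    """Squared-cost DTW, or math.inf if a partial bound already reaches bsf."""
    n = len(q)
    terms = lb_keogh_terms(q, c, r)
    tail = [0.0] * (n + 1)  # tail[k] = sum of terms[k:]
    for k in range(n - 1, -1, -1):
        tail[k] = tail[k + 1] + terms[k]
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (n + 1)
        lo, hi = max(1, i - r), min(n, i + r)
        for j in range(lo, hi + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        # Points of c beyond column hi are untouched so far; their LB_Keogh
        # terms bound the cost still to come from below.
        if min(curr[lo:hi + 1]) + tail[hi] >= bsf:
            return math.inf
        prev = curr
    return prev[n]
```

The bound is valid because any warping path leaving row `i` must still visit every later column of `c`, and each such visit costs at least that column's LB_Keogh term. When `bsf` is infinite the function never abandons and returns the plain banded DTW value.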

Page 22: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

UCR Suite

New Optimizations

Known Optimizations
– Early Abandoning of ED
– Early Abandoning of LB_Keogh
– Early Abandoning of DTW
– Multicores

Page 23: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

UCR Suite: New Optimizations (1)

• Early Abandoning Z-Normalization
  – Do normalization only when needed (just in time).
  – Small but non-trivial.
  – This step can break the O(n) time complexity for ED (and, as we shall see, DTW).
  – Online mean and std calculation is needed.

  $z_i = \frac{x_i - \mu}{\sigma}$
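One way to realize "online mean and std" together with just-in-time normalization is to keep running sums of the values and of their squares as the window slides. The sketch below assumes that approach and a window with non-zero standard deviation; `sliding_mean_std` and `jit_sq_ed` are names invented for this example.

```python
import math

def sliding_mean_std(T, m):
    """Yield (start, mean, std) for each length-m window of T, O(1) per step."""
    s = sum(T[:m])                   # running sum of the window
    s2 = sum(v * v for v in T[:m])   # running sum of squares
    for i in range(len(T) - m + 1):
        mean = s / m
        var = max(s2 / m - mean * mean, 0.0)  # clamp tiny negative round-off
        yield i, mean, math.sqrt(var)
        if i + m < len(T):
            s += T[i + m] - T[i]
            s2 += T[i + m] ** 2 - T[i] ** 2

def jit_sq_ed(q_norm, T, start, mean, std, bsf):
    """Squared ED to a window of T, normalizing each point only when visited."""
    total = 0.0
    for k, qk in enumerate(q_norm):
        z = (T[start + k] - mean) / std  # assumes std > 0
        total += (qk - z) ** 2
        if total >= bsf:
            return math.inf  # abandoned: the remaining points were never normalized
    return total
```

If the distance is abandoned after a few points, the rest of the window is never normalized, which is where the saving comes from. Running sums can accumulate floating-point error over very long series, so a production implementation would need to guard against that.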

Page 24: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

UCR Suite: New Optimizations (2)

• Reordering Early Abandoning
  – We don’t have to compute ED or LB from left to right.
  – Order points by expected contribution.

Idea
  - Order by the absolute height of the query point.
  - This step alone can save about 30%-50% of calculations.

[Figure: Q and C under the "Standard early abandon ordering" (points visited left to right, numbered 1 to 9) and under the "Optimized early abandon ordering" (numbered 1 to 5)]
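The ordering idea can be sketched as below, assuming both series are already z-normalized and distances are squared. The visiting order is computed once per query, by decreasing absolute value, and reused for every candidate. The helper names are illustrative, and the second return value (points visited) is added only to make the effect visible.

```python
import math

def abandon_order(q_norm):
    """Indices of q_norm sorted by decreasing absolute value."""
    return sorted(range(len(q_norm)), key=lambda i: abs(q_norm[i]), reverse=True)

def reordered_sq_ed(q_norm, c_norm, order, bsf):
    """Return (squared ED or math.inf, number of points visited)."""
    total = 0.0
    for steps, i in enumerate(order, start=1):
        total += (q_norm[i] - c_norm[i]) ** 2
        if total >= bsf:
            return math.inf, steps
    return total, len(order)
```

For q = [0.1, -2.0, 0.5, 1.5], an all-zero candidate, and bsf = 4.0, the optimized order visits index 1 first and abandons after one point, whereas left-to-right order needs two. The intuition is that z-normalized candidate values tend to sit near zero, so query points far from zero are the ones most likely to contribute large terms.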

Page 25: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

UCR Suite: New Optimizations (3)

• Reversing the Query/Data Role in LB_Keogh
  – Make LB_Keogh tighter.
  – Much cheaper than DTW.
  – Triple the data.
  – Online envelope calculation.

[Figure: two panels, "Envelope on Q" (C against U and L built around Q) and "Envelope on C" (Q against U and L built around C)]

Page 26: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

UCR Suite: New Optimizations (4)

• Cascading Lower Bounds
  – At least 18 lower bounds of DTW have been proposed.
  – Use only the lower bounds on the skyline.

[Figure: tightness of lower bound (LB/DTW, y-axis from 0 to 1) versus computation cost (x-axis: O(1), O(n), O(nR)). Plotted bounds: LB_Kim, LB_KimFL, LB_Yi, LB_PAA, LB_Ecorner, LB_FTW, LB_KeoghEQ, max(LB_KeoghEQ, LB_KeoghEC), Early_abandoning_DTW, and DTW]
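A cascade in the spirit of this slide can be sketched as follows: try the cheapest bound first and fall through to the next only if it fails to prune. The sketch assumes squared costs, equal-length series of at least two points, and a simplified first/last-point bound standing in for LB_KimFL; the names and the returned stage label are choices for this example rather than the UCR Suite's interface.

```python
import math

def lb_kim_fl(q, c):
    """O(1) bound: every warping path must pair first with first, last with last."""
    return (q[0] - c[0]) ** 2 + (q[-1] - c[-1]) ** 2

def lb_keogh(env_of, other, r):
    """O(n) bound: squared distance from `other` to the envelope of `env_of`."""
    total = 0.0
    for i, x in enumerate(other):
        window = env_of[max(0, i - r): i + r + 1]
        u, l = max(window), min(window)
        if x > u:
            total += (x - u) ** 2
        elif x < l:
            total += (x - l) ** 2
    return total

def dtw(q, c, r):
    """O(nR) exact squared-cost DTW with warping window r."""
    n = len(q)
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (n + 1)
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[n]

def cascade(q, c, r, bsf):
    """Return (distance or math.inf, name of the stage that decided)."""
    if lb_kim_fl(q, c) >= bsf:
        return math.inf, "LB_KimFL"
    if lb_keogh(q, c, r) >= bsf:   # envelope around the query
        return math.inf, "LB_KeoghEQ"
    if lb_keogh(c, q, r) >= bsf:   # roles reversed: envelope around the candidate
        return math.inf, "LB_KeoghEC"
    return dtw(q, c, r), "DTW"
```

Each stage is only worth running because it is both tighter and more expensive than the one before it, which is what being on the skyline means in the plot. The candidate [0, 0, 0, 0, 0] against the query [0, 0, 4, 0, 0] is a case where the envelope on Q prunes nothing but the envelope on C does.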

Page 27: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

UCR Suite

New Optimizations
– Just-in-time Z-normalizations
– Reordering Early Abandoning
– Reversing LB_Keogh
– Cascading Lower Bounds

Known Optimizations
– Early Abandoning of ED
– Early Abandoning of LB_Keogh
– Early Abandoning of DTW
– Multicores

Page 28: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

UCR Suite

New Optimizations
– Just-in-time Z-normalizations
– Reordering Early Abandoning
– Reversing LB_Keogh
– Cascading Lower Bounds

Known Optimizations
– Early Abandoning of ED
– Early Abandoning of LB_Keogh
– Early Abandoning of DTW
– Multicores

State-of-the-art*

*We implemented the State-of-the-art (SOTA) as well as we could. SOTA is simply the UCR Suite without new optimizations.

Page 29: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Experimental Result: Random Walk

• Random Walk: Varying size of the data

            Million (Seconds)   Billion (Minutes)   Trillion (Hours)
UCR-ED      0.034               0.22                3.16
SOTA-ED     0.243               2.40                39.80
UCR-DTW     0.159               1.83                34.09
SOTA-DTW    2.447               38.14               472.80

Code and data are available at: www.cs.ucr.edu/~eamonn/UCRsuite.html

Page 30: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Experimental Result: Random Walk

• Random Walk: Varying size of the query

[Figure: running time in seconds (axis ticks at 100, 1000, 10000) versus Query Length, with curves for Naïve DTW, SOTA DTW, (SOTA ED), and UCR DTW]

For query lengths of 4,096 (rightmost part of this graph) the times are:
  Naïve DTW: 24,286
  SOTA DTW: 5,078
  SOTA ED: 1,850
  OPT DTW: 567

Page 31: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Experimental Result: DNA

• Query: Human Chromosome 2 of length 72,500 bps
• Data: Chimp Genome 2.9 billion bps
• Time: UCR Suite 14.6 hours, SOTA 34.6 days (830 hours)

[Figure: Chromosome 2: BP 5709500:5782000. A tree over Human, Chimp, Gorilla, Orangutan, Gibbon, and Rhesus macaque, with groups labeled Hominini, Homininae, Hominidae, Hominoidea, and Catarrhines]

Page 32: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Experimental Result: EEG

• Data: 0.3 trillion points of brain wave
• Query: Prototypical Epileptic Spike of 7,000 points (0.23 seconds)
• Time: UCR-ED 3.4 hours, SOTA-ED 20.6 days (~500 hours)

Continuous Intracranial EEG

Recorded with platinum-tipped silicon micro-electrode probes inserted 1.0 mm into the cerebral cortex

Recordings made from 96 active electrodes, with data sampled at 30kHz per electrode

[Figure: the query Q, plotted over 0 to 7,000 points]

Page 33: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Experimental Result: ECG

• Data: One year of Electrocardiograms, 8.5 billion data points.
• Query: Idealized Premature Ventricular Contraction (PVC) of length 421 (R=21=5%).

        UCR-ED        SOTA-ED        UCR-DTW        SOTA-DTW
ECG     4.1 minutes   66.6 minutes   18.0 minutes   49.2 hours

[Figure: PVC (aka. skipped beat)]

~30,000X faster than real time!
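The "faster than real time" figure can be checked with a line of arithmetic. The sketch below assumes the claim refers to the UCR-DTW time of 18.0 minutes and that "one year" means 365 days; both are assumptions made here, not statements from the slide.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes of recorded ECG

def times_faster_than_real_time(search_minutes, recorded_minutes=MINUTES_PER_YEAR):
    """Ratio of recording duration to search duration."""
    return recorded_minutes / search_minutes

ratio = times_faster_than_real_time(18.0)  # 29200.0, i.e. roughly 30,000X
```

Under the same assumptions the UCR-ED time of 4.1 minutes would give a ratio of about 128,000, so the quoted figure matches the DTW search.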

Page 34: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Speeding Up Existing Algorithms

• Time Series Shapelets:
  – SOTA 18.9 minutes, UCR Suite 12.5 minutes
• Online Time Series Motifs:
  – SOTA 436 seconds, UCR Suite 156 seconds
• Classification of Historical Musical Scores:
  – SOTA 142.4 hours, UCR Suite 720 minutes
• Classification of Ancient Coins:
  – SOTA 12.8 seconds, UCR Suite 0.8 seconds
• Clustering of Star Light Curves:
  – SOTA 24.8 hours, UCR Suite 2.2 hours

Page 35: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Conclusion

UCR Suite …
• is an ultra-fast algorithm for finding nearest neighbors.
• is the first algorithm that exactly mines a trillion real-valued objects in a day or two with an "off-the-shelf machine".
• uses a combination of various optimizations.
• can be used as a subroutine to speed up other algorithms.
• Probably close to optimal ;-)

Page 36: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Authors’ Photo

[Photos: Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Qiang Zhu, Brandon Westover, Jesin Zakaria, Eamonn Keogh]

Page 37: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Acknowledgements

• NSF grants 0803410 and 0808770
• FAPESP award 2009/06349-0
• Royal Thai Government Scholarship

Page 38: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Paper's Impact

It was the best paper winner at SIGKDD 2012.

It has 37 citations according to Google Scholar. Given that it has been in print only 18 months, this would make it among the most cited papers of that conference, that year.

The work was expanded to a journal paper, which adds a section on uniform scaling.

Page 39: CS 260 Winter 2014 Eamonn Keogh’s Presentation of


Discussion

The paper made use of videos

http://www.youtube.com/watch?v=c7xz9pVr05Q

Page 40: CS 260 Winter 2014 Eamonn Keogh’s Presentation of


Questions

About the paper?

About the presentation of it?

Page 41: CS 260 Winter 2014 Eamonn Keogh’s Presentation of


Backup Slides

Page 42: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

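LB_Keogh

[Figure: C against the envelope U, L of Q; the warping matrix of Q and C with the warping window R]

$LB\_Keogh(Q,C) = \sqrt{\sum_{i=1}^{n} \begin{cases} (c_i - U_i)^2 & \text{if } c_i > U_i \\ (c_i - L_i)^2 & \text{if } c_i < L_i \\ 0 & \text{otherwise} \end{cases}}$

$U_i = \max(q_{i-r} : q_{i+r})$
$L_i = \min(q_{i-r} : q_{i+r})$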

Page 43: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

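Known Optimizations

• Lower Bounding
  – LB_Yi
  – LB_Kim
  – LB_Keogh

[Figure: illustrations of the lower bounds, with the labels max(Q) and min(Q); points A, B, C, D; and C against the envelope U, L of Q]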

Page 44: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

Ordering

[Figure: "Average Number of Point-to-point Distance Calculation" (y-axis, 0 to 35) versus "Data in Progress" (x-axis, 0 to 10 × 10^7), with curves for SOTA-ED and UCR-ED and the annotation "When good candidate is found"]

[Figure: Q and C under the "Standard early abandon ordering" (points numbered 1 to 9) and under the "Optimized early abandon ordering" (numbered 1 to 5)]

This step alone can save about 50% of calculations

Page 45: CS 260 Winter 2014 Eamonn Keogh’s Presentation of

UCR Suite

• New Optimizations
  – Just-in-time Z-normalizations
  – Reordering Early Abandoning
  – Reversing LB_Keogh
  – Cascading Lower Bounds
• Known Optimizations
  – Early Abandoning of ED/LB_Keogh/DTW
  – Use Square Distance
  – Multicores