89
SAX! SAX! A Symbolic Representations A Symbolic Representations of Time Series of Time Series Eamonn Keogh and Jessica Lin Eamonn Keogh and Jessica Lin Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 [email protected]

SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

  • Upload
    tamal

  • View
    193

  • Download
    1

Embed Size (px)

DESCRIPTION

SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 [email protected]. Important! Read This!. - PowerPoint PPT Presentation

Citation preview

Page 1: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

SAX!SAX!

A Symbolic Representations A Symbolic Representations of Time Seriesof Time Series

Eamonn Keogh and Jessica LinEamonn Keogh and Jessica LinComputer Science & Engineering Department

University of California - RiversideRiverside,CA [email protected]

Page 2: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Important! Read This!Important! Read This!These slides are from an early talk about SAX, some slides will make little sense out of context, but are provided here to give a quick intro to the utility of SAX. Read [1] and visit [2] for more details.

You may use these slides for any teaching purpose, so long as they are clearly identified as being created by Jessica Lin and Eamonn Keogh.

You may not use the text and images in a paper or tutorial without express prior permission from Dr. Keogh.

[1] Lin, J., Keogh, E., Lonardi, S. & Chiu, B. (2003). A Symbolic Representation of Time Series, with Implications for Streaming Algorithms. In proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. San Diego, CA. June 13. [2] http://www.cs.ucr.edu/~eamonn/SAX.htm

Page 3: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Outline of TalkOutline of Talk• Prologue: Background on Time Series Data Mining

• The importance of the right representation

• A new symbolic representation, SAX • SAX for anomaly detection

• SAX for motif discovery

• SAX for visualization

• Appendix: SAX for classification, clustering, indexing

Page 4: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

What are Time Series?What are Time Series?

0 50 100 150 200 250 300 350 400 450 50023

24

25

26

27

28

29

25.1750 25.2250 25.2500 25.2500 25.2750 25.3250 25.3500 25.3500 25.4000 25.4000 25.3250 25.2250 25.2000 25.1750

.. .. 24.6250 24.6750 24.6750 24.6250 24.6250 24.6250 24.6750 24.7500

A time series is a collection of observations made sequentially in time.

Page 5: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Time Series are UbiquitousTime Series are Ubiquitous! IPeople measure things...People measure things...

• Schwarzeneggers popularity rating.• Their blood pressure.• The annual rainfall in New Zealand.• The value of their Yahoo stock.• The number of web hits per second.

… … and things change over time.and things change over time.

Thus time series occur in virtually every medical, scientific and Thus time series occur in virtually every medical, scientific and businesses domain.businesses domain.

Page 6: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Image data, may best be thought of as time series…

Page 7: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

0 10 20 30 40 50 60 70 80 90

Hand at rest

Hand moving to shoulder level

Steady pointing

0 10 20 30 40 50 60 70 80 90

Hand at rest

Hand moving above holster

Hand moving down to grasp gun

Hand moving to shoulder level

Steady pointing

Video data, may best be thought of as time series…

Point

Gun-Draw

Page 8: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

What do we want to do with the time series data?What do we want to do with the time series data?

ClusteringClustering ClassificationClassification

Query by Content

Rule Discovery

10

s = 0.5c = 0.3

Motif DiscoveryMotif Discovery

Novelty DetectionNovelty Detection

Page 9: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

All these problems require All these problems require similaritysimilarity matching matching

ClusteringClustering ClassificationClassification

Query by Content

Rule Discovery

10

s = 0.5c = 0.3

Motif DiscoveryMotif Discovery

Novelty DetectionNovelty Detection

Page 10: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Euclidean Distance MetricEuclidean Distance Metric

Q

C

D(Q,C)

n

iii cqCQD

1

2,

Given two time series

Q = q1…qn

and

C = c1…cn

their Euclidean distance is

defined as:

Page 11: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

• Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest

• Approximately solve the problem at hand in main memory

• Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data

The Generic Data Mining AlgorithmThe Generic Data Mining Algorithm

But which approximation should we use?

But which approximation should we use?

Page 12: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Time Series Representations

Data Adaptive Non Data Adaptive

Spectral Wavelets Piecewise Aggregate

Approximation

Piecewise Polynomial

Symbolic Singular Value

Decomposition Random

Mappings

Piecewise Linear

Approximation Adaptive Piecewise Constant

Approximation

Discrete Fourier

Transform

Discrete Cosine

Transform Haar Daubechies

dbn n > 1 Coiflets Symlets

Sorted Coefficients

Orthonormal Bi-Orthonormal

Interpolation Regression

Trees

Natural Language

Strings

0 20 40 60 80 100 120 0 20 40 60 80 100 120 0 20 40 60 80 100 120 0 20 40 60 80 1001200 20 40 60 80 100 120 0 20 40 60 80 100120

DFT DWT SVD APCA PAA PLA

0 20 40 60 80 100120

SYM

UUCUCUCD

UU

CU

CU

DD

Page 13: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

• Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest

• Approximately solve the problem at hand in main memory

• Make (hopefully very few) accesses to the original data on disk to confirm the solution obtained in Step 2, or to modify the solution so it agrees with the solution we would have obtained on the original data

The Generic Data Mining Algorithm (revisited) The Generic Data Mining Algorithm (revisited)

This only works if the approximation

allows lower bounding

This only works if the approximation

allows lower bounding

Page 14: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

M

i iiii svqvsrsr1

21 ))((

DLB(Q’,S’)

DLB(Q’,S’)

S

Q

D(Q,S)

n

iii sq

1

2

D(Q,S)

Exact (Euclidean) distance D(Q,S) Lower bounding distance DLB(Q,S)

Q’

S’

Lower bounding means that for all Q and S, we have…

DLB(Q’,S’) D(Q,S)

What is lower bounding?

Page 15: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

We can live without “trees”, “random mappings” and “natural language”, but it would be nice if we could lower bound strings (symbolic or discrete approximations)…

A lower bounding symbolic approach would allow data miners to…

• Use suffix trees, hashing, markov models etc• Use text processing and bioinformatic algorithms

Time Series Representations

Data Adaptive Non Data Adaptive

Spectral Wavelets Piecewise Aggregate

Approximation

Piecewise Polynomial

Symbolic Singular Value

Decomposition Random

Mappings

Piecewise Linear

Approximation Adaptive Piecewise Constant

Approximation

Discrete Fourier

Transform

Discrete Cosine

Transform Haar Daubechies

dbn n > 1 Coiflets Symlets

Sorted Coefficients

Orthonormal Bi-Orthonormal

Interpolation Regression

Trees

Natural Language

Strings

Page 16: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

We have created the first We have created the first symbolic representation of time symbolic representation of time series, that allows…series, that allows…

• Lower bounding of Euclidean distance• Dimensionality Reduction• Numerosity Reduction

Page 17: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

We call our representation SAXWe call our representation SAXSSymbolic ymbolic AAggregate Approggregate ApproXXimationimation

baabccbc

Page 18: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

How do we obtain SAX?How do we obtain SAX?

0 20 40 60 80 100 120

C

C

0

-

-

0 20 40 60 80 100 120

bbb

a

cc

c

a

baabccbc

First convert the time series to PAA representation, then convert the PAA to symbols

It take linear time

Page 19: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Time series subsequences tend to have a highly Gaussian distribution

-10 0 10

0.001 0.003

0.01 0.02

0.05 0.10

0.25

0.50

0.75

0.90 0.95

0.98 0.99

0.997 0.999

Pro

ba

bili

ty

A normal probability plot of the (cumulative) distribution of

values from subsequences of length 128.

0

-

-

0 20 40 60 80 100

bbb

a

cc

a

Why a Gaussian?Why a Gaussian?

Page 20: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Visual ComparisonVisual Comparison

A raw time series of length 128 is transformed into the word “ffffffeeeddcbaabceedcbaaaaacddee.”– We can use more symbols to represent the time series since each symbol

requires fewer bits than real-numbers (float, double)

-3

-2 -1 0 1 2 3

DFT

PLA

Haar

APCA

a b c d e f

Page 21: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

PAA distance lower-bounds the Euclidean Distance

0 20 40 60 80 100 120

-1.5 -1 -0.5 0 0.5 1 1.5

C

Q

0 20 40 60 80 100 120

-1.5 -1 -0.5 0 0.5 1 1.5

C

Q

= baabccbc C

= babcacca Q

n

iii cqCQD

1

2,

Euclidean Distance

w

i iiwn cqCQDR

1

2),(

w

i iiwn cqdistCQMINDIST

1

2)ˆ,ˆ()ˆ,ˆ(

dist() can be implemented using a table lookup.

Page 22: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

SAX is just as good as other SAX is just as good as other representations, or working representations, or working on the raw data for most on the raw data for most problems problems (Slides shown at the end of this presentation)(Slides shown at the end of this presentation)

Now let us consider SAX Now let us consider SAX for other problems, for other problems, including novelty detection, including novelty detection, visualization and motif visualization and motif discovery discovery

We will start with novelty We will start with novelty detection…detection…

Page 23: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Novelty DetectionNovelty Detection

• Fault detection Fault detection • Interestingness detectionInterestingness detection• Anomaly detectionAnomaly detection• Surprisingness detectionSurprisingness detection

Page 24: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

…note that this problem should not be confused with the relatively simple problem of outlier detection. Remember Hawkins famous definition of an outlier...

…note that this problem should not be confused with the relatively simple problem of outlier detection. Remember Hawkins famous definition of an outlier...

Thanks Doug, the check is in the mail.We are not interested in finding individually surprising datapoints, we are interested in finding surprising patterns.

Thanks Doug, the check is in the mail.We are not interested in finding individually surprising datapoints, we are interested in finding surprising patterns.

... an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated from a different mechanism...

... an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated from a different mechanism...

Douglas M. Hawkins

Page 25: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Lots of good folks have worked on this, and closely related problems. It is referred to as the detection of “Aberrant Behavior1”, “Novelties2”, “Anomalies3”, “Faults4”, “Surprises5”, “Deviants6” ,“Temporal Change7”, and “Outliers8”.

Lots of good folks have worked on this, and closely related problems. It is referred to as the detection of “Aberrant Behavior1”, “Novelties2”, “Anomalies3”, “Faults4”, “Surprises5”, “Deviants6” ,“Temporal Change7”, and “Outliers8”.

1. Brutlag, Kotsakis et. al.2. Daspupta et. al., Borisyuk et.

al.3. Whitehead et. al., Decoste4. Yairi et. al.5. Shahabi, Chakrabarti6. Jagadish et. al.7. Blockeel et. al., Fawcett et. al.8. Hawkins.

Page 26: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

The blue time series at the top is a normal healthy human electrocardiogram with an artificial “flatline” added. The sequence in red at the bottom indicates how surprising local subsections of the time series are under the measure introduced in Shahabi et. al.

The blue time series at the top is a normal healthy human electrocardiogram with an artificial “flatline” added. The sequence in red at the bottom indicates how surprising local subsections of the time series are under the measure introduced in Shahabi et. al.

Arrr... what be wrong with current approaches?Arrr... what be wrong with current approaches?

Page 27: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Our SolutionOur Solution

Based on the following intuition, a pattern is surprising if its frequency of occurrence is greatly different from that which we expected, given previous experience…

This is a nice intuition, but useless unless we can more formally define it, and calculate it efficiently

Page 28: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Note that unlike all previous attempts to solve this problem, our notion surprisingness of a pattern is not tied exclusively to its shape. Instead it depends on the difference between the shape’s expected frequency and its observed frequency. For example consider the familiar head and shoulders pattern shown below...

Note that unlike all previous attempts to solve this problem, our notion surprisingness of a pattern is not tied exclusively to its shape. Instead it depends on the difference between the shape’s expected frequency and its observed frequency. For example consider the familiar head and shoulders pattern shown below...

The existence of this pattern in a stock market time series should not be consider surprising since they are known to occur (even if only by chance). However, if it occurred ten times this year, as opposed to occurring an average of twice a year in previous years, our measure of surprise will flag the shape as being surprising. Cool eh?The pattern would also be surprising if its frequency of occurrence is less than expected. Once again our definition would flag such patterns.

The existence of this pattern in a stock market time series should not be consider surprising since they are known to occur (even if only by chance). However, if it occurred ten times this year, as opposed to occurring an average of twice a year in previous years, our measure of surprise will flag the shape as being surprising. Cool eh?The pattern would also be surprising if its frequency of occurrence is less than expected. Once again our definition would flag such patterns.

Page 29: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Tarzan (R) is a registered trademark of Edgar Rice Burroughs, Inc.

Tarzan (R) is a registered trademark of Edgar Rice Burroughs, Inc.

We call our algorithm… Tarzan!We call our algorithm… Tarzan!

“Tarzan” is not an acronym. It is a pun on the fact that the heart of the algorithm relies comparing two suffix trees, “tree to tree”!

Homer, I hate to be a fuddy-duddy, but could you put on some pants?

“Tarzan” is not an acronym. It is a pun on the fact that the heart of the algorithm relies comparing two suffix trees, “tree to tree”!

Homer, I hate to be a fuddy-duddy, but could you put on some pants?

Page 30: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Definition 1: A time series pattern P, extracted from database X is surprising relative to a database R, if the probability of its occurrence is greatly different to that expected by chance, assuming that R and X are created by the same underlying process.

Definition 1: A time series pattern P, extracted from database X is surprising relative to a database R, if the probability of its occurrence is greatly different to that expected by chance, assuming that R and X are created by the same underlying process.

We begin by defining some terms… Professor Frink?

We begin by defining some terms… Professor Frink?

Page 31: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Definition 1: A time series pattern P, extracted from database X is surprising relative to a database R, if the probability of occurrence is greatly different to that expected by chance, assuming that R and X are created by the same underlying process.

Definition 1: A time series pattern P, extracted from database X is surprising relative to a database R, if the probability of occurrence is greatly different to that expected by chance, assuming that R and X are created by the same underlying process.

But you can never know the probability of a pattern you have never seen!

And probability isn’t even defined for real valued time series!

But you can never know the probability of a pattern you have never seen!

And probability isn’t even defined for real valued time series!

Page 32: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

We need to discretize the time series We need to discretize the time series into symbolic strings… SAX!!into symbolic strings… SAX!!

aaabaabcbabccb

Once we have done this, we can Once we have done this, we can use Markov models to calculate use Markov models to calculate the probability of any pattern, the probability of any pattern, including ones we have never including ones we have never seen beforeseen before

Page 33: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

If x = principalskinner

is {a,c,e,i,k,l,n,p,r,s}|x| is 16

skin is a substring of xprin is a prefix of xner is a suffix of x

If y = in, then fx(y) = 2If y = pal, then fx(y) = 1

principalskinner

If x = principalskinner

is {a,c,e,i,k,l,n,p,r,s}|x| is 16

skin is a substring of xprin is a prefix of xner is a suffix of x

If y = in, then fx(y) = 2If y = pal, then fx(y) = 1

principalskinner

Page 34: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Can we do all this in linear space Can we do all this in linear space and time?and time?

Yes! Some very clever modifications of suffix trees (Mostly due to Stefano Lonardi) let us do this in linear space.

An individual pattern can be tested in constant time!

Page 35: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

We would like to demonstrate two features of our proposed approach

• Sensitivity (High True Positive Rate) The algorithm can find truly surprising patterns in a time series.

• Selectivity (Low False Positive Rate) The algorithm will not find spurious “surprising” patterns in a time series

Experimental EvaluationExperimental EvaluationSensitive

and Selective,

just like me

Sensitive and

Selective,

just like me

Page 36: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

0 200 400 600 800 1000 1200 1400 1600

200 400 600 800 1000 1200 1400 16000

Training data

Test data(subset)

Tarzan’s level of surprise

Experiment 1: Shock ECGExperiment 1: Shock ECG

Page 37: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Experiment 2: Video Experiment 2: Video (Part 1)(Part 1)

0 2000 4000 6000 8000 10000 12000

0 2000 4000 6000 8000 10000 12000

0 2000 4000 6000 8000 10000 12000

Training data

Test data(subset)

Tarzan’s level of surprise

We zoom in on this section in the next slide

Page 38: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Experiment 2: Video Experiment 2: Video (Part 2)(Part 2)

0 100 200 300 400 500 600 700100

150

200

250

300

350

400

Normal sequence

Normal sequenceActor

misses holster

Briefly swings gun at target, but does not aim

Laughing and flailing hand

Page 39: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

We consider a dataset that contains the power demand for a Dutch research facility for the entire year of 1997. The data is sampled over 15 minute averages, and thus contains 35,040 points.

Demand for Power?

Excellent!

Demand for Power?

Excellent!

0 200 400 600 800 1000 1200 1400 1600 1800 2000500

1000

1500

2000

2500

The first 3 weeks of the power demand dataset. Note the repeating pattern of a strong peak for each of the five weekdays, followed by relatively quite weekends

The first 3 weeks of the power demand dataset. Note the repeating pattern of a strong peak for each of the five weekdays, followed by relatively quite weekends

Experiment 3: Power Demand Experiment 3: Power Demand (Part 1)(Part 1)

Page 40: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Tarzan TSA Tree IMM

We used from Monday January 6th to Sunday March 23rd as reference data. This time period is devoid of national holidays. We tested on the remainder of the year.

We will just show the 3 most surprising subsequences found by each algorithm. For each of the 3 approaches we show the entire week (beginning Monday) in which the 3 largest values of surprise fell.

Both TSA-tree and IMM returned sequences that appear to be normal workweeks, however Tarzan returned 3 sequences that correspond to the weeks that contain national holidays in the Netherlands. In particular, from top to bottom, the week spanning both December 25th and 26th and the weeks containing Wednesday April 30th (Koninginnedag, “Queen's Day”) and May 19th (Whit Monday).

Mmm.. anomalous..

Mmm.. anomalous..

Experiment 3: Power Demand Experiment 3: Power Demand (Part 2)(Part 2)

Page 41: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

NASA recently said “TARZAN holds great promise for the future*”.

There is now a journal version of TARZAN (under review), if you would like a copy, just ask.

In the meantime, let us consider motif discovery…

NASA recently said “TARZAN holds great promise for the future*”.

There is now a journal version of TARZAN (under review), if you would like a copy, just ask.

In the meantime, let us consider motif discovery…

* Isaac, D. and Christopher Lynnes, 2003. Automated Data Quality Assessment in the Intelligent Archive, White Paper prepared for the Intelligent Data Understanding program.

Page 42: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

SAX allows Motif SAX allows Motif Discovery!Discovery!

Winding Dataset ( The angular speed of reel 2 )

0 500 1000 150 0 2000 2500

Informally, motifs are reoccurring patterns…

Page 43: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Motif DiscoveryMotif Discovery

Winding Dataset (The angular speed of reel 2)

0 500 1000 1500 2000 2500

0 20 40 60 80 100 120 140 0 20 40 60 80 100 120 140 0 20 40 60 80 100 120 140

A B C

A B C

To find these 3 motifs would require about 6,250,000 calls to the Euclidean distance function.

Page 44: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

· Mining association rules in time series requires the discovery of motifs. These are referred to as primitive shapes and frequent patterns.

· Several time series classification algorithms work by constructing typical prototypes of each class. These prototypes may be considered motifs.

· Many time series anomaly/interestingness detection algorithms essentially consist of modeling normal behavior with a set of typical shapes (which we see as motifs), and detecting future patterns that are dissimilar to all typical shapes.

· In robotics, Oates et al., have introduced a method to allow an autonomous agent to generalize from a set of qualitatively different experiences gleaned from sensors. We see these “experiences” as motifs.

· In medical data mining, Caraca-Valente and Lopez-Chavarrias have introduced a method for characterizing a physiotherapy patient’s recovery based of the discovery of similar patterns. Once again, we see these “similar patterns” as motifs.

• Animation and video capture… (Tanaka and Uehara, Zordan and Celly)

Why Find Motifs?Why Find Motifs?

Page 45: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Definition 1. Match: Given a positive real number R (called range) and a time series T containing a subsequence C beginning at position p and a subsequence M beginning at q, if D(C, M) R, then M is called a matching subsequence of C.

Definition 2. Trivial Match: Given a time series T, containing a subsequence C beginning at position p and a matching subsequence M beginning at q, we say that M is a trivial match to C if either p = q or there does not exist a subsequence M’ beginning at q’ such that D(C, M’) > R, and either q < q’< p or p < q’< q.

Definition 3. K-Motif(n,R): Given a time series T, a subsequence length n and a range R, the most significant motif in T (hereafter called the 1-Motif(n,R)) is the subsequence C1 that has highest count

of non-trivial matches (ties are broken by choosing the motif whose matches have the lower variance). The Kth most significant motif in T (hereafter called the K-Motif(n,R) ) is the subsequence CK that has

the highest count of non-trivial matches, and satisfies D(CK, Ci) > 2R, for all 1 i < K.

0 100 200 3 00 400 500 600 700 800 900 100 0

T

Space Shuttle STS - 57 Telemetry ( Inertial Sensor )

Trivial Matches

C

Page 46: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

OK, we can define motifs, but OK, we can define motifs, but how do we find them?how do we find them?

The obvious brute force search algorithm is just too slow…

Our algorithm is based on a hot idea from bioinformatics, random projection* and the fact that SAX allows use to lower bound discrete representations of time series.

* J Buhler and M Tompa. Finding motifs using random projections. In RECOMB'01. 2001.

Page 47: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

a b a b

c c c c

b a c c

a b a c

1

2

58

985

0 500 1000

a c b a

T ( m= 1000)

a = 3 {a,b,c} n = 16 w = 4

S ^

C1

C1 ^ Assume that we have a time series T of length 1,000, and a motif of length 16, which occurs

twice, at time T1 and

time T58.

A simple worked example of our motif discovery algorithmA simple worked example of our motif discovery algorithm

The next 4 slides

Page 48: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

a b a b 1

c c c c 2

b a c c 3

a b a c 4

1

2

58

985

457

1 58

2 985

1 2 58 985

1

2

58

985

1

1

A mask {1,2} was randomly chosen, so the values in columns {1,2} were used to project matrix into buckets.

Collisions are recorded by incrementing the appropriate location in the collision matrix

Page 49: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

A mask {2,4} was randomly chosen, so the values in columns {2,4} were used to project matrix into buckets.

Once again, collisions are recorded by incrementing the appropriate location in the collision matrix

1 2 58 985

1

2

58

985

2

1

a b a b 1

c c c c 2

b a c c 3

a b a c 4

1

2

58

985

2

1 58

985

Page 50: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

1 2 58 985

1

2

58

985

27

1

0

1 1 E( , , , , ) 1-

2

t i w id

i

k wi ak a w d t

iw a a

We can calculate the expected values in the We can calculate the expected values in the matrix, assuming there are NO patterns…matrix, assuming there are NO patterns…

3 2

1

0

2

2 1

1

3

2

2 1

3Suppose

E(k,a,w,d,t) = 2

Page 51: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

0 200 400 600 800 1000 1200

0 20 40 60 80 100 120

A

B 0 20 40 60 80 100 120

C

D

A Simple ExperimentA Simple Experiment

Lets imbed two motifs into a random walk time series, and see if we can recover them

Page 52: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

A

C

B D

Planted MotifsPlanted Motifs

Page 53: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

0

20

40

60

80

100

120

0

20

40

60

80

100

120

““Real” MotifsReal” Motifs

Page 54: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Some Examples of Real MotifsSome Examples of Real Motifs

0 500 1000 1500 2000

Motor 1 (DC Current)

2500

3500

4500

5500

6500

Astrophysics (Photon Count)

Page 55: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

How Fast can we find Motifs?How Fast can we find Motifs?

Brute Force

TS-P

0

2k

4k

6k

8k

10k

1000 2000 3000 4000 5000

Sec

onds

Length of Time Series

Page 56: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

The Joy of SAXThe Joy of SAX

We can use SAX to create a lite-weight, but incredibly useful tool call time series bitmaps.

Page 57: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

TGGCCGTGCTAGGCCCCACCCCTACCTTGCAGTCCCCGCAAGCTCATCTGCGCGAACCAGAACGCCCACCACCCTTGGGTTGAAATTAAGGAGGCGGTTGGCAGCTTCCCAGGCGCACGTACCTGCGAATAAATAACTGTCCGCACAAGGAGCCCGACGATAGTCGACCCTCTCTAGTCACGACCTACACACAGAACCTGTGCTAGACGCCATGAGATAAGCTAACACAAAAACATTTCCCACTACTGCTGCCCGCGGGCTACCGGCCACCCCTGGCTCAGCCTGGCGAAGCCGCCCTTCA

The DNA of two species…

CCGTGCTAGGGCCACCTACCTTGGTCCGCCGCAAGCTCATCTGCGCGAACCAGAACGCCACCACCTTGGGTTGAAATTAAGGAGGCGGTTGGCAGCTTCCAGGCGCACGTACCTGCGAATAAATAACTGTCCGCACAAGGAGCCGACGATAAAGAAGAGAGTCGACCTCTCTAGTCACGACCTACACACAGAACCTGTGCTAGACGCCATGAGATAAGCTAACA

Page 58: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

C T

A G

C T

A G

C T

A G

C T

A G

C T

A G

C T

A G

0.20 0.24

0.26 0.30

CCGTGCTAGGGCCACCTACCTTGGTCCGCCGCAAGCTCATCTGCGCGAACCAGAACGCCACCACCTTGGGTTGAAATTAAGGAGGCGGTTGGCAGCTTCCAGGCGCACGTACCTGCGAATAAATAACTGTCCGCACAAGGAGCCGACGATAAAGAAGAGAGTCGACCTCTCTAGTCACGACCTACACACAGAACCTGTGCTAGACGCCATGAGATAAGCTAACA

Page 59: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

CCGTGCTAGGGCCACCTACCTTGGTCCGCCGCAAGCTCATCTGCGCGAACCAGAACGCCACCACCTTGGGTTGAAATTAAGGAGGCGGTTGGCAGCTTCCAGGCGCACGTACCTGCGAATAAATAACTGTCCGCACAAGGAGCCGACGATAAAGAAGAGAGTCGACCTCTCTAGTCACGACCTACACACAGAACCTGTGCTAGACGCCATGAGATAAGCTAACA

CC CT TC TT

CA CG TA TC

AC AT GC GT

AA AG GA GG

C T

A G

CCC CCT CTC

CCA CCG CTA

CAC CAT

CAA

CC CT TC TT

CA CG TA TC

AC AT GC GT

AA AG GA GG

CC CT TC TTCC CT TC TT

CA CG TA TCCA CG TA TG

AC AT GC GTAC AT GC GT

AA AG GA GGAA AG GA GG

C T

A G

C T

A G

CCC CCT CTC

CCA CCG CTA

CAC CAT

CAA

CC CT TC TT

CA CG TA TC

AC AT GC GT

AA AG GA GG

CC CT TC TTCC CT TC TT

CA CG TA TCCA CG TA TC

AC AT GC GTAC AT GC GT

AA AG GA GGAA AG GA GG

C T

A G

C T

A G

CCC CCT CTC

CCA CCG CTA

CAC CAT

CAA

CC CT TC TTCC CT TC TT

CA CG TA TCCA CG TA TC

AC AT GC GTAC AT GC GT

AA AG GA GGAA AG GA GG

CC CT TC TTCC CT TC TT

CA CG TA TCCA CG TA TG

AC AT GC GTAC AT GC GT

AA AG GA GGAA AG GA GG

C T

A G

C T

A G

CCC CCT CTC

CCA CCG CTA

CAC CAT

CAA

Page 60: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

CCGTGCTAGGCCCCACCCCTACCTTGCAGTCCCCGCAAGCTCATCTGCGCGAACCAGAACGCCCACCACCCTTGGGTTGAAATTAAGGAGGCGGTTGGCAGCTTCCCAGGCGCACGTACCTGCGAATAAATAACTGTCCGCACAAGGAGCCCGACGATAGTCGACCCTCTCTAGTCACGACCTACACACAGAACCTGTGCTAGACGCCATGAGATAAGCTAACA

CA

AC AT

AA AG

CACA

AC ATAC AT

AA AGAA AG

CA

AC AT

AA AG

CACA

AC ATAC AT

AA AGAA AG

CACA

AC ATAC AT

AA AGAA AG

0.02

CACA

AC ATAC AT

AA AGAA AG

0.04

0.03

0.09 0.04

0.07 0.02

0.11 0.03

0

1

Page 61: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

CCGTGCTAGGCCCCACCCCTACCTTGCAGTCCCCGCAAGCTCATCTGCGCGAACCAGAACGCCCACCACCCTTGGGTTGAAATTAAGGAGGCGGTTGGCAGCTTCCCAGGCGCACGTACCTGCGAATAAATAACTGTCCGCACAAGGAGCCCGACGATAGTCGACCCTCTCTAGTCACGACCTACACACAGAACCTGTGCTAGACGCCATGAGATAAGCTAACA

OK. Given any DNA string I can make a colored bitmap, so what?

Page 62: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin
Page 63: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Two QuestionsTwo Questions

• Can we do something similar for time series?

• Would it be useful?

Page 64: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Can we do make bitmaps for time series?

aa ab ba bb

ac ad bc bd

ca cb da db

cc cd dc dd

a b

c d

aaa aab aba

aac aad abc

aca acb

acc

accbabcdbcabdbcadbacbdbdcadbaacb…

Yes, with SAX!

Time Series Bitmap

Page 65: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

What can we say about this data?

Page 66: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

While they are all examples of EEGs, example_a.dat is from a normal trace, whereas the others contain examples of spike-wave discharges.

Page 67: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

We can further enhance the time series bitmaps by arranging the thumbnails by “cluster”, instead of arranging by date, size, name etc

We can achieve this with MDS.

Page 68: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

A well known dataset Kalpakis_ECG, allegedly contains ECGS…

Page 69: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

A well known dataset Kalpakis_ECG, allegedly contains ECGS

If we view them as time series bitmaps, a handful stand out…

normal1.txt normal10.txtnormal11.txt

normal12.txt

normal13.txt normal2.txt

normal3.txtnormal4.txt

normal5.txt

normal6.txt

normal7.txt

normal8.txt

normal9.txt

normal14.txtnormal15.txt

normal16.txt

normal17.txt

normal18.txt

Page 70: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Some of the data are not heartbeats! They are the action potential of a normal pacemaker cell 0 100 200 300 400 500

0 100 200 300 400 500

initial rapid repolarization

0 100 200 300 400 5000 100 200 300 400 500

ventricular depolarization

initial rapid repolarization

“plateau” stage

repolarizationrecovery phase

0 100 200 300 400 500

normal1.txt normal10.txt normal11.txt

normal12.txt

normal13.txt normal2.txt

normal3.txtnormal4.txt

normal5.txt

normal6.txt

normal7.txt

normal8.txt

normal9.txt

normal14.txtnormal15.txt

normal16.txt

normal17.txt

normal18.txt

Page 71: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

We can test how much useful information is retained in the bitmaps by using only the bitmaps for clustering/classification/anomaly detection

Page 72: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

1

2

3

4

5

11

13

12

14

15

6

9

10

7

8

16

18

17

19

20

1

2

3

4

5

11

13

12

14

15

6

9

10

7

8

16

18

17

19

20

We can test how much useful information is retained in the bitmaps by using only the bitmaps for clustering/classification/anomaly detection

Cluster 1 (datasets 1 ~ 5):

BIDMC Congestive Heart Failure Database (chfdb): record chf02

Start times at 0, 82, 150, 200, 250, respectively

Cluster 2 (datasets 6 ~ 10):

BIDMC Congestive Heart Failure Database (chfdb): record chf15

Start times at 0, 82, 150, 200, 250, respectively

Cluster 3 (datasets 11 ~ 15):

Long Term ST Database (ltstdb): record 20021

Start times at 0, 50, 100, 150, 200, respectively

Cluster 4 (datasets 16 ~ 20):

MIT-BIH Noise Stress Test Database (nstdb): record 118e6

Start times at 0, 50, 100, 150, 200, respectively

Data Key

Page 73: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Here the bitmaps are almost the same.

Here the bitmaps are very different. This is the most unusual section of the time series, and it coincidences with the PVC.

Here is a Premature Ventricular Contraction (PVC)

We can use bitmaps to find anomalies…

Page 74: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Premature ventricular contraction Premature ventricular contractionSupraventricular escape beat

Annotations by

a cardiologist

Page 75: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

• For most classic data mining tasks (classification, clustering and indexing), SAX is at least as good as the raw data, DFT, DWT, SVD etc.

• SAX allows the best anomaly detection algorithm.

• SAX is the engine behind the only realistic time series motif discovery algorithm.

• SAX allows time series visualization.

SAX SummarySAX Summary

Page 76: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

The Last WordThe Last WordThe sun is setting on all The sun is setting on all other symbolic other symbolic representations of time representations of time series, SAX is the series, SAX is the onlyonly way way to goto go

Page 77: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

ConclusionsConclusions• SAX is posed to make major contributions to time series data mining in the next few years.

•A more general conclusion, if you want to solve you data mining problem, think representation, representation, representation.

Page 78: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

The slides that follow demonstrate that SAX is as good as DFT, DWT etc for the classic data mining tasks, this is important, but not very exciting, thus relegated to this appendix.

Page 79: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Experimental Validation

• Clustering– Hierarchical– Partitional

• Classification– Nearest Neighbor– Decision Tree

• Indexing– VA File

• Discrete Data only– Anomaly Detection– Motif Discovery

Page 80: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Clustering

• Hierarchical Clustering– Compute pairwise distance, merge similar

clusters bottom-up– Compared with Euclidean, IMPACTS, and

SDA

Page 81: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Hierarchical Clustering Euclidean

IMPACTS (alphabet=8) SDA

SAX

Hierarchical Clustering

Page 82: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Clustering

• Hierarchical Clustering– Compute pairwise distance, merge similar clusters

bottom-up

– Compared with Euclidean, IMPACTS, and SDA

• Partitional Clustering– K-means

– Optimize the objective function by minimizing the sum of squared intra-cluster errors

– Compared with Raw data

Page 83: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Partitional (K-means) Clustering

Partitional (k-means) Clustering

Number of Iterations

220000 225000 230000 235000 240000 245000 250000 255000 260000 265000

1 2 3 4 5 6 7 8 9 10 11

Raw data Our Symbolic Approach

Obj

ecti

ve F

unct

ion

Raw data

SAX

Page 84: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Classification

• Nearest Neighbor– Leaving-one-out cross validation– Compared with Euclidean Distance, IMPACTS,

SDA, and LP– Datasets: Control Charts & CBF (Cylinder,

Bell, Funnel)

Page 85: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Nearest Neighbor

5 6 7 8 9 10

Impacts SDA Euclidean LP max SAX

0 0.1

0.2 0.3

0.4 0.5

0.6

5 6 7 8 9 10

Control Chart Cylinder - Bell - Funnel

Err

or R

ate

Alphabet Size Alphabet Size

Nearest Neighbor

Page 86: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Classification• Nearest Neighbor

– Leaving-one-out cross validation

– Compared with Euclidean Distance, IMPACTS, SDA, and LP

– Datasets: Control Charts & CBF (Cylinder, Bell, Funnel)

• Decision Tree

– Defined for real data, but attempting to use DT on time series raw data would be a mistake

• High dimensionality/Noise level would result in deep, bushy trees

– Geurts (’01) suggests representng time series as Regression Tree, and training decision tree on it.

0 50 100

Adaptive Piecewise Constant Approximation

Page 87: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Decision (Regression) Tree

Dataset SAX Regression Tree CC 3.04 1.64 2.78 2.11

CBF 0.97 1.41 1.14 1.02

Page 88: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Indexing

• Indexing scheme similar to VA (Vector Approximation) File– Dataset is large and disk-resident– Reduced dimensionality could still be too high

for R-tree to perform well

• Compare with Haar Wavelet

Page 89: SAX! A Symbolic Representations of Time Series Eamonn Keogh and Jessica Lin

Indexing

0 0.1 0.2 0.3 0.4 0.5 0.6

Ballbeam Chaotic Memory Winding Dataset

DWT Haar SAX