Exact Discovery of Time Series Motifs

This document was created to support our paper. It contains additional experiments and details which we could not fit into the paper.


Page 2: Exact Discovery of Time Series Motifs

[Figure: nine panels, one per randomly chosen reference time series, labeled as follows]

b = 1.99, s = 4.8
b = 2.02, s = 3.61
b = 2.53, s = 3.75
b = 2.86, s = 3.11
b = 2.01, s = 5.36
b = 2.42, s = 4.71
b = 2.62, s = 2.98
b = 2.19, s = 3.76
b = 2.69, s = 3.56

In our paper we noted that the choice of a reference point can affect both the quality of the best-so-far discovered and the spread of the candidate time series on the number line. (Recall that we want the former to be small and the latter to be large, in order to cheaply prune away as many Euclidean distance calculations as possible.)

To demonstrate this we conducted a test with 10,000 random walk time series of length 128. We randomly chose 9 time series (top right) as reference points and measured b, the quality of the best-so-far we get from a linear-time search for approximate motifs, and s, the standard deviation of the distances from the 9,999 other time series to the reference time series.

We can see that the best-so-far ranges from 1.9937 to 2.8552, and the standard deviation ranges from 2.9757 to 5.3582 (example continued on next slide).
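The measurement described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the experiment: it uses plain (unnormalized) random walks, it takes the "linear time search" to mean comparing only series that are adjacent when ordered by distance to the reference, and it uses the population standard deviation. The function names are hypothetical.

```python
import math
import random
import statistics


def random_walk(length, rng):
    """Cumulative sum of standard normal steps (an assumed generator)."""
    walk, value = [], 0.0
    for _ in range(length):
        value += rng.gauss(0.0, 1.0)
        walk.append(value)
    return walk


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def score_reference(series, ref_index):
    """Return (b, s) for the reference series at ref_index."""
    ref = series[ref_index]
    # Distance from every other series to the reference, sorted so that
    # the series are laid out along a number line.
    line = sorted(
        (euclidean(ref, ts), i) for i, ts in enumerate(series) if i != ref_index
    )
    # s: spread of the other series along the number line.
    s = statistics.pstdev([d for d, _ in line])
    # b: best-so-far, taken here as the smallest true distance between
    # series that are adjacent on the number line (one linear scan).
    b = min(
        euclidean(series[i], series[j])
        for (_, i), (_, j) in zip(line, line[1:])
    )
    return b, s


if __name__ == "__main__":
    rng = random.Random(0)
    # 1,000 series instead of 10,000 to keep the sketch quick to run.
    data = [random_walk(128, rng) for _ in range(1000)]
    for ref_index in rng.sample(range(len(data)), 9):
        b, s = score_reference(data, ref_index)
        print(f"reference {ref_index}: b = {b:.2f}, s = {s:.2f}")
```

Because b is always the true distance between some real pair, it can never be smaller than the exact motif distance; a good reference point simply gets it close to that value while keeping s large.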

Page 3: Exact Discovery of Time Series Motifs


We can take the largest value from the standard deviations, and the smallest best-so-far to begin pruning.

In the figure to the left, the blue brace is the width of the current best-so-far.

In our example, we have illustrated the location of just 5 time series on the number line; naturally there are really 10,000 such items.

If we consider object 34, we can see that only object 451 might join with it to be better than the current best-so-far. The other objects (7, 44 and 5512) must be further from 34 than the best-so-far, so we do not need to check them.

Later we see that there is no object close enough to 7 that needs to be considered, and finally we find that we must check the true Euclidean distance between 44 and 5512.

In this contrived example, instead of doing all ten pairs (5 * 4 / 2) of Euclidean distance calculations, we only needed to do two. We can easily see that the fraction of Euclidean distance calculations we must perform depends on how widely the items are spread relative to the best-so-far.
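The pruning argument above can be sketched as follows. The positions on the number line are made up for illustration (the slide does not give them), the function name is hypothetical, and the best-so-far is held fixed, whereas a real search would tighten it whenever a closer pair is found.

```python
def candidate_pairs(ref_dist, best_so_far):
    """Pairs that still need a true Euclidean distance calculation.

    ref_dist maps an object id to its distance from the reference point.
    By the triangle inequality, |ref_dist[x] - ref_dist[y]| is a lower
    bound on the true distance between x and y, so any pair whose gap on
    the number line is at least best_so_far can be skipped.
    """
    order = sorted(ref_dist, key=ref_dist.get)
    pairs = []
    for pos, a in enumerate(order):
        for b in order[pos + 1:]:
            if ref_dist[b] - ref_dist[a] >= best_so_far:
                break  # every later object is at least as far along the line
            pairs.append((a, b))
    return pairs


# Hypothetical positions for the five objects on the number line.
positions = {34: 2.0, 451: 3.5, 7: 9.0, 44: 15.0, 5512: 16.0}
to_check = candidate_pairs(positions, best_so_far=2.0)
total = len(positions) * (len(positions) - 1) // 2
print(to_check, f"{len(to_check)} of {total} pairs")
```

With these assumed positions the sketch keeps two of the ten pairs, (34, 451) and (44, 5512), which matches the count on the slide.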

[Figure: three number lines with axes from 0 to 25. The first shows objects 34, 451, 7, 44 and 5512; the second shows 7, 44 and 5512; the third shows 44 and 5512.]

Page 4: Exact Discovery of Time Series Motifs

In our paper we claimed that FLAME does not give exact motifs with respect to the raw time series. Here we make this clearer.

Suppose we have three time series:

A = 9.9, 50.1, 89.9, 49.9
B = 0.1, 59.9, 80.1, 40.1
C = 10.1, 49.9, 90.1, 50.1

Assume they are discretized using the FLAME scheme, with each bucket covering a range of ten (i.e. [0 to 9.999], [10 to 19.999], [20 to 29.999] etc.), and we thus have:

A = A,F,I,E
B = A,F,I,E
C = B,E,J,F

Note that the squared Euclidean distance between A and B is 384.16, but the distance between A and C is only 0.16. In this trivial dataset, A and C are the true motifs.

However, under the FLAME mapping, A and B are identical, but A and C have a distance of 4.
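The numbers in this example can be checked with a short script. It is a sketch under assumptions that the slide only implies: buckets of width ten starting at zero and labeled A, B, C and so on, and a symbol-level distance that counts the positions at which two strings differ (the slide does not say which symbol distance FLAME uses). The function names are hypothetical.

```python
A = [9.9, 50.1, 89.9, 49.9]
B = [0.1, 59.9, 80.1, 40.1]
C = [10.1, 49.9, 90.1, 50.1]


def discretize(series, width=10.0):
    """Map each value to a letter: [0, 10) -> A, [10, 20) -> B, ..."""
    return "".join(chr(ord("A") + int(v // width)) for v in series)


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def symbol_distance(x, y):
    """Assumed symbol-level distance: number of positions that differ."""
    return sum(a != b for a, b in zip(x, y))


for name, series in (("A", A), ("B", B), ("C", C)):
    print(name, discretize(series))
print("raw A-B:", round(squared_euclidean(A, B), 2))
print("raw A-C:", round(squared_euclidean(A, C), 2))
print("symbols A-B:", symbol_distance(discretize(A), discretize(B)))
print("symbols A-C:", symbol_distance(discretize(A), discretize(C)))
```

The raw distances rank C as the nearest neighbor of A, while the discretized strings rank B as the nearest neighbor of A, which is the disagreement the slide describes.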

Note that it might be possible to fix this with a SAX-like lower bound. However, this has not been done, and it would require significant overhead, since many false positives would have to be checked.

Sandeep Tata (2007). Declarative Querying for Biological Sequences. Ph.D. thesis, The University of Michigan. Advisor: Jignesh M. Patel.

Page 5: Exact Discovery of Time Series Motifs

Additional examples of the motif

[Figure: motif instances at 20,925 and 25,473; x-axis 0 to 400, y-axis -3 to 6]

This is from DQmatixD1, row 16

Note that the end of this row is padded with some zeros, so only the first 78,254 datapoints are used

[Figure: motif instances at 9,036 and 3,664; x-axis 0 to 500, y-axis 1 to 3 x 10^4]

This is from DQmatixC, row 19

Note that the end of this row is padded with some zeros, so only the first 33,021 datapoints are used
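Dropping the zero padding mentioned in these notes could look like the sketch below. It assumes the padding consists of exact zeros and that no genuine zero reading sits at the very end of the recording; the function name is hypothetical.

```python
def trim_zero_padding(row):
    """Drop the run of zeros at the end of a row, keeping interior zeros."""
    end = len(row)
    while end > 0 and row[end - 1] == 0:
        end -= 1
    return row[:end]


padded = [3.0, 0.0, 5.0, 2.5, 0.0, 0.0, 0.0]
print(len(trim_zero_padding(padded)), "datapoints kept")
```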

Here is the entire dataset

[Figure: x-axis 0 to 8 x 10^4, y-axis 0 to 4500]

Here is the entire dataset

[Figure: x-axis 0 to 30,000, y-axis 0 to 3 x 10^4]

Approximately 14.4 minutes of insect telemetry

Page 6: Exact Discovery of Time Series Motifs

acetaminophen_s_000058.png, albuterol_s_000042.png

acid_anhydrides_s_000031.png, acid_halide_s_000035.png

african_love_grass_s_000016.png, african_millet_s_000013.png

acyl_halide_s_000091.png, acyl_halide_s_000092.png

aldehyde_s_000059.png, alkene_s_000042.png

arctostaphylos_alpina_s_000039.png, asplenium_platyneuron_s_000035.png

acyl_anhydrides_s_000044.png, alkyl_radical_s_000068.png

Here are the file names of the near-duplicate images discovered by our algorithm.

Page 7: Exact Discovery of Time Series Motifs

• In the following slides we give some more information about the beet leafhopper example in the paper.

Page 8: Exact Discovery of Time Series Motifs

Economic Importance

• Only known vector of beet curly top virus in North America

CDFA, UC IPM Online

Page 9: Exact Discovery of Time Series Motifs

History On Sugar Beets

• First reported in Nebraska in 1888

• Outbreak in 1925 in California resulted in loss of one third of the sugarbeet crop throughout the Sacramento Valley, and in a total loss of all late plantings in both the San Joaquin Valley and southern Salinas Valley (Severin & Schwing, 1926)

• Closure/part time operation of sugarbeet refinery factories, complete abandonment of thousands of acres of planted or prospective land for sugarbeet in Western U.S. (Bennett, 1971)

• So severe in the Salinas Valley that in 1947 a permanent research laboratory of plant pathology, entomology, and plant breeding was established by the USDA to work on controlling BCTV outbreaks (Wisler & Duffus, 2000).

• Resistant varieties became available in 1933 (Owen et al., 1938)

H.H.P. Severin, 1930

Page 10: Exact Discovery of Time Series Motifs

History On Tomatoes

• In the San Joaquin Valley in 1948 and 1950, it was estimated that 80% of the tomato crop was lost or damaged by BCTV (Bennett, 1971)

• Today commercial and recreational growth of tomatoes in the western United States is still limited in many areas by the incidence of BCTV

• Breeding program to develop BCTV resistant tomato varieties was established in Utah in 1930 by the U.S. Department of Agriculture (Martin, 1970)

• Resistant varieties have small fruit of poor quality (Martin, 1970)

• Resistant lines only confer a reduction to the initial infection

– Once resistant varieties are infected, they react in the same way as susceptible varieties (Thomas & Martin, 1971, 1972)

H.H.P. Severin, 1930

Page 11: Exact Discovery of Time Series Motifs

Control Measures

• Biological control

– Not effective due to migratory patterns

• Chemical Control

– Malathion treatments applied to thousands of acres of overwintering areas

– Insecticides on host plants

• Resistant plants

– Increasingly important

[Figure: Breeding Area. Source: esrpweb.csustan.edu/gis/rp/lom.html]

Page 12: Exact Discovery of Time Series Motifs

Resistant Plants

• Develop BCTV resistant plants with horticulturally favorable properties

• Determine mechanisms of resistance

• Resistance in tomatoes

– Appears to be due to change in feeding behavior

• In order to experimentally test if the mechanism of resistance is an effect on vector feeding behavior, we need to develop a methodology to study the feeding behavior of beet leafhopper.

Page 13: Exact Discovery of Time Series Motifs

Electrical Penetration Graph (EPG)

Page 14: Exact Discovery of Time Series Motifs

What EPGs Measure

• Fluctuations in voltage level

– Occur in distinct patterns called waveforms

– Each waveform is associated with a specific feeding behavior

• Before EPGs can be used to study feeding behavior, the waveforms must first be experimentally correlated with specific feeding behaviors

Page 15: Exact Discovery of Time Series Motifs

Example of beet leafhopper EPG recording

[Figure: EPG recording; axes Amplitude (V) versus Time, with a 5 min scale bar]

Page 16: Exact Discovery of Time Series Motifs

In addition to telemetry, we have a video stream we can refer to

Page 17: Exact Discovery of Time Series Motifs

[Figure: four waveform panels, each with a 1 sec scale bar]

Waveforms 5a and 5b (Lei et al., 1999)

Waveforms E(pd), (1) and E(pd), (2) (Lei et al., 1999)

Waveforms E1 and E2 (Prado and Tjallingii, 1994)

Waveforms D2 and D3 (Stafford, unpublished)

Some examples of manually discovered motifs. (Note that entomologists don’t use the term motifs)

Page 18: Exact Discovery of Time Series Motifs

[Figure: raw data; x-axis 0 to 8 x 10^4, y-axis 0 to 4500]

Here is the raw data in which we found the motif shown below

This is from DQmatixD1, row 16

Additional examples of the motif

[Figure: motif instances at 20,925 and 25,473; x-axis 0 to 400, y-axis -3 to 6]