CS 521 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 8: TIME SERIES AND GRAPH MINING


Definition of Time Series Motifs

1. Length of the motif
2. Support of the motif
3. Similarity of the pattern
4. Relative position of the pattern

Given a length, the motif is the most similar (least distant) pair of non-overlapping subsequences.

[Figure: example time series; x-axis from 20 to 200, y-axis from -2 to 2]

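\hat{x}_i = \frac{x_i - \mu_x}{\sigma_x}, \qquad \hat{y}_i = \frac{y_i - \mu_y}{\sigma_y}

d(\hat{x}, \hat{y}) = \sqrt{\sum_i (\hat{x}_i - \hat{y}_i)^2}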
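A minimal Python sketch of the z-normalized Euclidean distance commonly used to compare subsequences in motif discovery: each subsequence is shifted to zero mean and scaled to unit standard deviation, then the ordinary Euclidean distance is taken. The function names `z_normalize` and `znorm_distance` are illustrative, and the use of the population standard deviation is an assumption.

```python
import math


def z_normalize(series):
    """Return (x_i - mean) / std for every x_i, using the population std."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    if std == 0:
        raise ValueError("a constant series cannot be z-normalized")
    return [(v - mean) / std for v in series]


def znorm_distance(x, y):
    """Euclidean distance between the z-normalized versions of x and y."""
    if len(x) != len(y):
        raise ValueError("subsequences must have the same length")
    x_hat, y_hat = z_normalize(x), z_normalize(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_hat, y_hat)))


# Same shape at a different scale and offset: the distance is (numerically) zero.
same_shape = znorm_distance([1, 2, 3], [10, 20, 30])
# Mirrored shape: the distance is clearly non-zero.
mirrored = znorm_distance([1, 2, 3], [3, 2, 1])
```

Normalizing first means two subsequences match on shape rather than on absolute level or amplitude.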

Problem Formulation

The most similar pair of non-overlapping subsequences

[Figure: a time series of 1000 points (time: 1000) and its subsequences, numbered 1, 2, ..., 873]

The closest pair of points in high-dimensional space

Optimal algorithm in two dimensions: Θ(n log n)

For large dimensionality d, the optimal algorithm is effectively Θ(n²d)

Lower Bound

If P, Q and R are three points in a d-space, then d(P,Q) + d(Q,R) ≥ d(P,R)

d(P,Q) ≥ |d(Q,R) - d(P,R)|

A third point R provides a very inexpensive lower bound on the true distance

If the lower bound is larger than the existing best, skip d(P, Q)

d(P,Q) ≥ |d(Q,R) - d(P,R)| ≥ BestPairDistance
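A small Python sketch of this pruning test, assuming Euclidean distance. The points `P`, `Q`, `R` and the helper names `euclidean`, `lower_bound` and `can_skip` are made up for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def lower_bound(dist_p_r, dist_q_r):
    """Triangle-inequality lower bound on d(P, Q) from distances to R."""
    return abs(dist_q_r - dist_p_r)


def can_skip(dist_p_r, dist_q_r, best_pair_distance):
    """True when d(P, Q) cannot beat the current best, so it need not be computed."""
    return lower_bound(dist_p_r, dist_q_r) >= best_pair_distance


# Illustrative points (not from the slides).
P, Q, R = (0.0, 0.0), (3.0, 4.0), (0.0, 10.0)
d_pr = euclidean(P, R)  # computed once per point, reused for every pair
d_qr = euclidean(Q, R)

bound = lower_bound(d_pr, d_qr)
true_distance = euclidean(P, Q)
```

The distances to R are computed once per point, so the bound for any pair costs one subtraction instead of a full d-dimensional distance.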

[Figure: points P and Q with a third point R]

Circular Projection


Pick a reference point r

Circularly project all points onto a line passing through the reference point

Equivalent to computing distance from r and then sorting the points according to distance
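A minimal Python sketch of that equivalence: compute each point's distance to the reference point, then sort point indices by that distance. The function name `order_line` and the sample points are illustrative assumptions.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def order_line(points, ref_index):
    """Return (order, dists): point indices sorted by distance to the
    reference point, and the distance of every point to that reference."""
    r = points[ref_index]
    dists = [euclidean(p, r) for p in points]
    order = sorted(range(len(points)), key=lambda i: dists[i])
    return order, dists


# Illustrative data: four points on a line, with point 0 as the reference.
points = [(0.0, 0.0), (5.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
order, dists = order_line(points, ref_index=0)
```

Building the line takes n distance computations plus one sort.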

[Figure: 24 numbered points circularly projected onto a line through the reference point r]

The Order Line

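[Figure: points P and Q on the order line through r, at distances d(P, r) and d(Q, r) from r; their gap on the line is |d(Q, r) - d(P, r)|]

For k = 1 : n-1

• Compare every pair having k-1 points in between

• Do k scans of the order line, starting with the 1st to kth point

[Figure: scans of the order line for k = 1, k = 2 and k = 3, with BestPairDistance marked]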


Correctness

If we search all offsets k = 1, 2, …, n-1, then all possible pairs are considered.

◦ n(n-1)/2 pairs

For any offset k, if none of the k scans needs an actual distance computation, then no distance computation will be needed for the remaining offsets k+1, …, n-1.
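The search over offsets with this early-termination rule can be sketched in Python as follows. This is a simplified illustration on plain points with Euclidean distance, not the lecture's exact implementation: it ignores the non-overlap constraint on subsequences, uses a single reference point, and walks each offset in one pass (which visits the same pairs as the k interleaved scans). The name `closest_pair` and the returned counter are assumptions.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def closest_pair(points, ref_index=0):
    """Return (best_distance, (i, j), distance_computations)."""
    n = len(points)
    r = points[ref_index]
    dist_to_r = [euclidean(p, r) for p in points]
    # The order line: indices sorted by distance to the reference point.
    order = sorted(range(n), key=lambda i: dist_to_r[i])

    best, best_pair, computed = math.inf, None, 0
    for k in range(1, n):  # offset on the order line
        any_computed = False
        for pos in range(n - k):
            i, j = order[pos], order[pos + k]
            # Lower bound on d(points[i], points[j]); non-negative because
            # the order line is sorted.
            if dist_to_r[j] - dist_to_r[i] >= best:
                continue  # cannot beat the current best: skip
            d = euclidean(points[i], points[j])
            computed += 1
            any_computed = True
            if d < best:
                best, best_pair = d, (min(i, j), max(i, j))
        if not any_computed:
            break  # gaps only grow with larger offsets, so nothing is left
    return best, best_pair, computed
```

Stopping early is safe because the gap between positions pos and pos+k+1 on the sorted line is at least the gap between pos and pos+k, so once every pair at offset k is pruned, every pair at larger offsets is pruned too.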


Graph Similarity

Edit distance/graph isomorphism:

◦ Tree Edit Distance

Feature extraction

◦ In/out degree

◦ Diameter

Iterative methods

◦ SimRank

Diameter

The largest shortest-path distance between any pair of vertices in the graph.

let dist be a |V| × |V| array of minimum distances initialized to ∞ (infinity)
for each vertex v
    dist[v][v] ← 0
for each edge (u,v)
    dist[u][v] ← w(u,v)  // the weight of the edge (u,v)
for k from 1 to |V|
    for i from 1 to |V|
        for j from 1 to |V|
            if dist[i][j] > dist[i][k] + dist[k][j]
                dist[i][j] ← dist[i][k] + dist[k][j]
            end if

http://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm
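A runnable Python sketch of the Floyd–Warshall pseudocode, followed by the diameter as the largest entry of the resulting distance matrix. Vertices are assumed to be numbered 0 to n-1 and edges are directed `(u, v, weight)` triples. The function names and the choice to report an infinite diameter when some vertex is unreachable are assumptions.

```python
import math


def floyd_warshall(n, edges):
    """All-pairs shortest-path distances for vertices 0..n-1."""
    dist = [[math.inf] * n for _ in range(n)]
    for v in range(n):
        dist[v][v] = 0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)  # keep the lightest parallel edge
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][j] > dist[i][k] + dist[k][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def diameter(n, edges):
    """Largest shortest-path distance; math.inf if some pair is unreachable."""
    dist = floyd_warshall(n, edges)
    return max(max(row) for row in dist)


# Illustrative undirected path 0 - 1 - 2 - 3 with unit weights,
# written as two directed edges per link.
path_edges = [(0, 1, 1), (1, 0, 1), (1, 2, 1), (2, 1, 1), (2, 3, 1), (3, 2, 1)]
path_diameter = diameter(4, path_edges)
```

The triple loop costs Θ(|V|³) time, so this feature is practical only for moderately sized graphs.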

SimRank

For a node v in a graph, we denote by I(v) and O(v) the set of in-neighbors and out-neighbors of v, respectively.

http://www-cs-students.stanford.edu/~glenj/simrank.pdf

1. A solution s(*, *) ∈ [0, 1] to the n² SimRank equations always exists and is unique.

2. Symmetric: s(a, b) = s(b, a)

3. Reflexive: s(a, a) = 1
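A minimal Python sketch of the naive SimRank iteration from the linked paper: s(a, a) = 1, s(a, b) = 0 when either node has no in-neighbors, and otherwise s(a, b) is the decay constant C times the average similarity over all pairs of in-neighbors of a and b. The function name `simrank`, the value C = 0.8, the iteration count and the tiny example graph are illustrative assumptions.

```python
def simrank(nodes, edges, C=0.8, iterations=10):
    """Iterate the SimRank equations; edges are directed (source, target) pairs."""
    in_nbrs = {v: [] for v in nodes}
    for u, v in edges:
        in_nbrs[v].append(u)

    # Start from s_0(a, b) = 1 if a == b, else 0.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                elif not in_nbrs[a] or not in_nbrs[b]:
                    new[a][b] = 0.0
                else:
                    total = sum(sim[x][y] for x in in_nbrs[a] for y in in_nbrs[b])
                    new[a][b] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        sim = new
    return sim


# Illustrative graph: u points to both a and b, so a and b are similar
# because they share the in-neighbor u.
scores = simrank(["u", "a", "b"], [("u", "a"), ("u", "b")])
```

Each iteration touches every pair of nodes, so this direct form suits only small graphs.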

Tree Edit Distance

http://grfia.dlsi.ua.es/ml/algorithms/references/editsurvey_bille.pdf

Tree Edit Distance: Applications

Find the most frequent tree structure in a phylogenetic tree.

Match a query subtree with a set of XML documents.

Ranking Nodes

PageRank

PR(A) is the PageRank of page A,

PR(Ti) is the PageRank of pages Ti which link to page A,

C(Ti) is the number of outbound links on page Ti and

d is a damping factor which can be set between 0 and 1.

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Example (with d = 0.5)

PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

These three linear equations can be solved directly. We get the following PageRank values for the individual pages:

PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
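The same values can be approached by iterating the PageRank formula instead of solving the system, as in this Python sketch. The link structure (A links to B and C, B links to C, C links to A) is inferred from the three example equations. The function name `pagerank`, the starting value of 1.0 and the iteration count are assumptions.

```python
def pagerank(out_links, d=0.5, iterations=100):
    """Repeatedly apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T
    that link to A, where C(T) is the number of outbound links of T."""
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            incoming = sum(
                pr[t] / len(out_links[t]) for t in pages if p in out_links[t]
            )
            new[p] = (1 - d) + d * incoming
        pr = new
    return pr


# Link structure inferred from the example equations.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

With d = 0.5 the error roughly halves on every pass, so the iteration settles on 14/13, 10/13 and 15/13 well within 100 passes.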

MATLAB Script

MATLAB script for the example on the previous slide

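% x, y, z stand for PR(A), PR(B), PR(C); requires the Symbolic Math Toolbox
syms x y z;

% The three PageRank equations with d = 0.5
eqn1 = x == 0.5 + 0.5*z;
eqn2 = y == 0.5 + 0.25*x;
eqn3 = z == 0.5 + 0.25*x + 0.5*y;

% Rewrite the linear system in matrix form A*[x; y; z] = B and solve it
[A,B] = equationsToMatrix([eqn1, eqn2, eqn3], [x, y, z]);
X = linsolve(A,B)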

HITS: Hyperlink-Induced Topic Search

http://www.cs.cornell.edu/home/kleinber/auth.pdf

