Detecting Change in Multivariate Data Streams Using Minimum Subgraphs

Robert Koyak

Operations Research Dept.

Naval Postgraduate School

Collaborative work with Dave Ruth, Emily Craparo, and Kevin Wood

Basic Setup

Have observations $y_1, y_2, \ldots, y_N$ assumed to be sampled independently from unknown, multivariate distributions; $F_j$ = distribution of observation $j$

Homogeneity Hypothesis

$H_0: F_1 = F_2 = \cdots = F_N$

Heterogeneity Hypothesis

$H_1$: there exists some $k \in \{2, \ldots, N\}$ such that $F_1 = F_2 = \cdots = F_{k-1}$, $F_k \ne F_{k-1}$, and $\Delta(F_1, F_j) = \max_{k \le r \le j} \Delta(F_1, F_r)$ is strictly positive and nondecreasing for $j \in \{k+1, \ldots, N\}$, where $\Delta$ measures the discrepancy between two distributions

Heterogeneity includes:

• A single change in distribution at a known change point (“two-sample problem”)

• A single change in distribution at an unknown change point

• Directional drift (in mean or other features) that begins at an unknown point in the observation sequence

Distance Matrix

$D = (d_{ij})$: $N \times N$ distance matrix (Euclidean, Manhattan, etc.), with $d_{ij} = d(y_i, y_j) = \|y_i - y_j\|$

Maa, Pearl, and Bartoszynski (1996):

$Y_1, Y_2, Y_3$ independent, $\sim F$

$Z_1, Z_2, Z_3$ independent, $\sim G$

$F = G$ if and only if $d(Y_1, Y_2) \overset{\mathcal{L}}{=} d(Z_1, Z_2) \overset{\mathcal{L}}{=} d(Y_3, Z_3)$
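As a small illustration of the distance matrix that the tests in this talk start from, here is a minimal Python sketch. The function name `distance_matrix` and the toy points are assumptions made for the example; the slides do not prescribe an implementation.

```python
import math

def distance_matrix(points, metric="euclidean"):
    """Return the N x N matrix of pairwise distances as a list of lists."""
    def dist(a, b):
        if metric == "euclidean":
            return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
        if metric == "manhattan":
            return sum(abs(u - v) for u, v in zip(a, b))
        raise ValueError(f"unknown metric: {metric}")

    n = len(points)
    return [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]

# Three toy bivariate observations (hypothetical data).
pts = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(pts)
D_manhattan = distance_matrix(pts, metric="manhattan")
```

Only the matrix is passed on to the graph-fitting step, so any metric suited to the data can be substituted here.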


The distance matrix has the information needed to express departure from the homogeneity hypothesis. For the types of departure we want to detect, this information should be expressed in particular ways. How can we unlock it?


The strategy we will explore is to fit a minimum subgraph (of some type) to the data treated as vertices in a complete, undirected graph. From the subgraph a statistic is derived that is sensitive to the departures from homogeneity that we wish to detect.

A Graph-Theoretic Approach

Complete undirected graph: $G_N = (V_N, E_N)$, $|E_N| = N(N-1)/2$

Subgraph family (e.g. spanning trees, k-factors, Hamiltonian paths or circuits): $\mathcal{G}$, whose members $G = (V, E)$ satisfy $V = V_N$ and $E \subseteq E_N$

Minimum subgraph is defined by

$\hat{G} = (\hat{V}, \hat{E}) = \arg\min_{G \in \mathcal{G}} \sum_{(i,j) \in E} d_{ij}$

The test statistic is $\hat{T} = T(\hat{G})$

Minimum Spanning Trees (MSTs)

• Friedman and Rafsky (1979) used MSTs to define a multivariate extension of the runs test in the context of the two-sample problem

• The test statistic is the number of edges in the MST that join vertices belonging to different samples

• Small values of the statistic are evidence against homogeneity
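The bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it builds the MST with Prim's algorithm on a dense distance matrix, counts the edges joining different samples, and estimates a lower-tail p-value by permuting the sample labels. The function names, the toy data, and the number of permutations are all assumptions.

```python
import random

def mst_edges(D):
    """Prim's algorithm on a dense distance matrix; returns N-1 edges (i, j)."""
    n = len(D)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known link from the tree to each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v] and D[u][v] < best[v]:
                best[v], parent[v] = D[u][v], u
    return edges

def cross_edges(edges, labels):
    """Number of edges whose endpoints carry different sample labels."""
    return sum(labels[i] != labels[j] for i, j in edges)

def permutation_pvalue(edges, labels, n_perm=2000, seed=0):
    """Lower-tail p-value: small cross-edge counts are evidence against homogeneity."""
    rng = random.Random(seed)
    observed = cross_edges(edges, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        hits += cross_edges(edges, shuffled) <= observed
    return observed, hits / n_perm

# Toy univariate example: two well-separated clusters (hypothetical data).
x = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in x] for a in x]
labels = [0, 0, 0, 1, 1, 1]
edges = mst_edges(D)
observed, p_value = permutation_pvalue(edges, labels)
```

The tree is fitted once and only the labels are permuted, because under homogeneity the labels are exchangeable given the pooled observations.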

[Figure: scatter plot of Schuylkill (vertical axis, 1.0 to 1.6) against Philadelphia (horizontal axis, 0.95 to 1.25), points labeled by year 69 to 88, with the MST overlaid]

MST for breast cancer mortality rates, 1969 to 1988 (N = 20), relative to 1968 base. Next, treat Sample 1 as the years 1969–1978 and Sample 2 as the years 1979–1988

[Figure: the MST on the Philadelphia vs. Schuylkill scatter plot, points labeled by year 69 to 88]

There are $\hat{T}_{\mathrm{MST}} = 11$ edges that join vertices in different samples. The p-value, obtained by a permutation test, is about 0.41

Is anything really happening?


Spearman rank correlations vs. time, p-values: Philadelphia .0004 Schuylkill .01

Minimum Non-bipartite Matching (MNBM)

• Also known as unipartite matching, 1-factor

• Rosenbaum (2005) defined a “cross-match” test using MNBM analogous to that of Friedman and Rafsky

• The test statistic is the number of edges in the MNBM that join vertices belonging to different samples

• Small values of the statistic are evidence against homogeneity
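For very small samples the matching and the cross-match count can be illustrated by brute force. The sketch below enumerates every perfect matching, which is only feasible for a handful of points because the number of matchings grows as (N-1)(N-3)...1; practical work uses a blossom-type algorithm instead. All names and data are hypothetical.

```python
def perfect_matchings(vertices):
    """Yield every perfect matching of an even-sized list of vertices."""
    if not vertices:
        yield []
        return
    first, rest = vertices[0], vertices[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail

def minimum_matching(D):
    """Brute-force minimum-weight perfect matching on a dense distance matrix."""
    n = len(D)
    return min(perfect_matchings(list(range(n))),
               key=lambda m: sum(D[i][j] for i, j in m))

def cross_matches(matching, labels):
    """Number of matched pairs whose members belong to different groups."""
    return sum(labels[i] != labels[j] for i, j in matching)

# Toy univariate example with two groups of three (hypothetical data).
x = [0.0, 0.1, 0.25, 5.0, 5.1, 5.3]
D = [[abs(a - b) for b in x] for a in x]
labels = [0, 0, 0, 1, 1, 1]
matching = minimum_matching(D)
n_cross = cross_matches(matching, labels)
```

With an odd group size at least one pair must straddle the groups, so a single cross-match is the smallest value this example can produce.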


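Cross-match test (Rosenbaum)

$N$ even; $n = N/2$ (number of matching edges)

Group 1 has $k$ observations

Group 2 has $N - k$ observations

$M_C$ = number of cross-matches

$M_1$ = number of matches within Group 1, so that $2M_1 + M_C = k$

Under $H_0$,

$P(M_C = r) = 2^r \binom{n}{(k+r)/2} \binom{(k+r)/2}{r} \binom{N}{k}^{-1}$, for $0 \le r \le \min(k, N-k)$ with $(k - r)/2$ a nonnegative integer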
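The exact null distribution of the cross-match count is easy to evaluate directly. The sketch below uses the combinatorial form of Rosenbaum's (2005) result, in which the number of label assignments giving r cross-matches is 2^r n! / (r! a! b!), with a and b the numbers of pairs lying wholly inside Group 1 and Group 2. The function names are assumptions made for the example.

```python
from math import comb, factorial

def crossmatch_pmf(N, k):
    """Null pmf of the number of cross-matches, as {r: probability}.

    N is the (even) total sample size and k the size of Group 1.
    """
    if N % 2:
        raise ValueError("N must be even")
    n = N // 2
    pmf = {}
    for r in range(0, min(k, N - k) + 1):
        if (k - r) % 2:          # r must have the same parity as k
            continue
        within1 = (k - r) // 2   # pairs entirely inside Group 1
        within2 = n - r - within1  # pairs entirely inside Group 2
        count = 2 ** r * factorial(n) // (
            factorial(r) * factorial(within1) * factorial(within2))
        pmf[r] = count / comb(N, k)
    return pmf

def lower_tail(N, k, observed):
    """P(M_C <= observed) under homogeneity."""
    return sum(p for r, p in crossmatch_pmf(N, k).items() if r <= observed)

# Setting of the mortality example: N = 20 years split 10 / 10, 6 cross-matches.
p_value = lower_tail(20, 10, 6)
```

For N = 20, k = 10 and six observed cross-matches this evaluates to roughly 0.87, in line with the p-value reported for the breast cancer mortality data.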

[Figure: scatter plot of Schuylkill (vertical axis, 1.0 to 1.6) against Philadelphia (horizontal axis, 0.95 to 1.25), points labeled by year 69 to 88, with the MNBM overlaid]

MNBM fit to the breast cancer mortality data. Count the number of edges that join vertices in different groups

[Figure: the MNBM on the Philadelphia vs. Schuylkill scatter plot, points labeled by year 69 to 88]

There are $\hat{T}_{\mathrm{CM}} = 6$ edges that join vertices in different samples. The p-value, obtained from the exact null distribution, is about 0.87

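Extensions of the Cross-Match Test

Ruth (2009) and Ruth & Koyak (2011) introduce two extensions of the cross-match test to detect departures from homogeneity in the direction of $H_1$:

(1) An exact, simultaneous cross-match test for an unspecified change-point

$\hat{T}_{\mathrm{SCM}} = \min_{k_0 \le k \le k_1} q_k\big(\hat{T}_{\mathrm{CM}}(k)\big)$, where $\hat{T}_{\mathrm{CM}}(k)$ is the cross-match statistic with the first $k$ observations taken as Group 1

(2) A sum of (vertex) pair maxima test

$\hat{T}_{\mathrm{SPM}} = \sum_{(i,j) \in \hat{E}} \max(i, j) = \tfrac{1}{2} \sum_{(i,j) \in \hat{E}} |i - j| + \tfrac{1}{4} N(N+1)$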
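To make the sum of pair maxima concrete, the following sketch computes it for a matching on time-ordered indices 1..N, checks the equivalent form in terms of index differences, and enumerates all perfect matchings for a small N to obtain the exact null mean and variance. The helper names are hypothetical, and the enumeration is meant only for tiny N.

```python
from fractions import Fraction

def spm(matching):
    """Sum of pair maxima, with observations indexed 1..N in time order."""
    return sum(max(i, j) for i, j in matching)

def spm_via_differences(matching, N):
    """Same statistic written as half the index differences plus N(N+1)/4."""
    return Fraction(sum(abs(i - j) for i, j in matching), 2) + Fraction(N * (N + 1), 4)

def perfect_matchings(vertices):
    """Yield every perfect matching of an even-sized list of vertices."""
    if not vertices:
        yield []
        return
    first, rest = vertices[0], vertices[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail

def null_moments(N):
    """Exact mean and variance of the statistic when all matchings are equally likely."""
    values = [spm(m) for m in perfect_matchings(list(range(1, N + 1)))]
    mean = Fraction(sum(values), len(values))
    var = Fraction(sum(v * v for v in values), len(values)) - mean ** 2
    return mean, var

example = [(1, 4), (2, 3), (5, 6)]
mean6, var6 = null_moments(6)
```

For N = 6 the enumeration gives mean 14, which equals N(N+1)/3, and variance 14/15. Matchings that pair observations close together in time give small values of the statistic, which is the pattern expected after a change or drift.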

[Figure: the MNBM on the Philadelphia vs. Schuylkill scatter plot, points labeled by year 69 to 88]

SCM test has exact p-value of 0.59 for testing against an unspecified change-point. SPM test has approximate p-value of 0.41.

Some Theory

• Friedman & Rafsky’s $\hat{T}_{\mathrm{MST}}$

– Asymptotic normality under H0

– Universal consistency under H1 for the two-sample problem (Henze & Penrose, 1999)

• Rosenbaum’s $\hat{T}_{\mathrm{CM}}$

– Asymptotic normality under H0

– Consistency under restrictive assumptions

• Ruth’s SPM test $\hat{T}_{\mathrm{SPM}}$

– Asymptotic normality under H0

– Consistency remains to be proven

Ensemble Tests

Problem with graph-theoretic tests: a single minimum subgraph contains very limited information about $D$ and as such these tests are not very powerful

Tukey suggested fitting multiple "orthogonal" MSTs in Friedman & Rafsky's test and combining them (in a manner that was not specified)

Two subgraphs are orthogonal if they share no common edges

For MSTs this is problematic: existence of a fixed number of orthogonal MSTs (even two) is not assured!

For MNBMs we are assured at least $N/2$ orthogonal subgraphs (Anderson, 1971) constructed sequentially

[Figure: First MNBM Fit to the Breast Cancer Mortality Data (Schuylkill against Philadelphia, points labeled by year 69 to 88)]

[Figure: First Two MNBMs Fit to the Breast Cancer Mortality Data]

[Figure: First Three MNBMs Fit to the Breast Cancer Mortality Data]

Structure of Ensembles

• Ensemble pairs decompose into Hamiltonian cycles each having an even number of vertices

– Under H0 all 1-factors are equally likely but it is not true that all ensemble 2-factors are equally likely!

– However, conditional on the cyclic structure uniformity is true

– Second-order properties do not depend on the cyclic structure

• Ensemble 3-factors have more complex cyclic behavior and also exhibit triangles

– Prevalence of triangles depends on the dimensionality of the data: lower dimension = more triangles

Ensemble Tests

Ruth (2009) proposed an Ensemble Sum of Pair Maxima (ESPM) test based on fitting a sequence of $n = N/2$ orthogonal MNBMs and taking the cumulative sums of the SPM statistics. The test takes the following form:

$\hat{T}_{\mathrm{ESPM}} = \max_{k \in \{1, \ldots, n\}} c_N^{-1} \left( \mu_{N,k} - \sum_{j=1}^{k} \hat{T}_{\mathrm{SPM}}(j) \right)$

$c_N^2 = N(N+1)(N-1)^2 / 180, \quad \mu_{N,k} = kN(N+1)/3$

Ensemble Tests

(1) Under $H_0$ the process $B_N(t_k) = c_N^{-1} \left( \mu_{N,k} - \sum_{j=1}^{k} \hat{T}_{\mathrm{SPM}}(j) \right)$, $t_k = k/(N-1)$, has the same first two moments as a Brownian bridge

(2) Although the summands individually are asymptotically normal, the same is not true of the process itself!

(3) Unless the dimensionality of the observations is very large, classical Brownian bridge theory (Shorack & Wellner, 1987) produces critical values that violate the nominal level

(4) Ruth (2009) produced critical values for different values of $N$ and dimensionality $d$ using extensive simulations
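The normalized process can be sketched as follows, under stated assumptions: the k-th cumulative sum of SPM statistics is centered at kN(N+1)/3 and scaled by c_N with c_N^2 = N(N+1)(N-1)^2/180, and time runs as t_k = k/(N-1). The sign is chosen here so that unusually small SPM values push the path upward; that orientation and the function name are assumptions of this sketch, not something the slides state.

```python
import math

def normalized_process(spm_values, N):
    """Return [(t_k, B_N(t_k))] for the cumulative sums of SPM statistics.

    spm_values[j-1] is the SPM statistic of the j-th orthogonal matching.
    Orientation (mean minus cumulative sum) is an assumption of this sketch.
    """
    c_N = math.sqrt(N * (N + 1) * (N - 1) ** 2 / 180.0)
    mu = N * (N + 1) / 3.0          # null mean of a single SPM statistic
    path, running = [], 0.0
    for k, value in enumerate(spm_values, start=1):
        running += value
        t_k = k / (N - 1)
        path.append((t_k, (k * mu - running) / c_N))
    return path

# Hypothetical SPM values for N = 20: one below the null mean of 140, one at it.
path = normalized_process([130.0, 140.0], N=20)
```

With this scaling the variance of the first point is (N-2)/(N-1)^2, which equals t(1-t) at t = 1/(N-1), the Brownian bridge variance at that time.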

Simulated critical values for N = 200

100 Simulated $B_N(t_k)$, Bivar. Normal, Homogeneous

Critical (.05) = 1.19

100 Simulated $B_N(t_k)$, Bivar. Normal, Mean Jump

Critical (.05) = 1.19

[Figure: normalized process $B_N(t_k)$ (vertical axis, 0.0 to 2.0) against the number of orthogonal matchings $k$ (horizontal axis, 2 to 10), with the .05 and .01 critical values marked]

$\hat{T}_{\mathrm{ESPM}} = 2.24$ has p-value less than .01

Heterogeneity is signaled when six or more matchings are used

Power simulations, N = 200, jump at observation 101, $\delta$ = norm of mean vector after the jump, nominal .05-level tests

(a) Multivariate normal, mean, p = 5

            Jump                         Drift
 δ     SCM   SPM   ESPM   JJS       SCM   SPM   ESPM   JJS
 0     .05   .06   .04    .05       .05   .04   .06    .07
 .5    .09   .10   .60    .52       .05   .07   .27    .22
 1.0   .33   .41   1.00   1.00      .16   .20   .84    .85

(b) Multivariate normal, mean, p = 20

            Jump                         Drift
 δ     SCM   SPM   ESPM   JJS       SCM   SPM   ESPM   JJS
 0     .05   .05   .05    .03       .05   .05   .05    .04
 .5    .07   .09   .33    .20       .05   .07   .13    .09
 1.0   .16   .22   .95    .95       .09   .11   .56    .49

(c) Multivariate normal, covariance matrix, p = 5

            Jump                         Drift
 δ     SCM   SPM   ESPM   JJS       SCM   SPM   ESPM   JJS
 0     .05   .06   .05    .04       .05   .05   .05    .05
 .5    .42   .51   .97    .15       .20   .27   .52    .27
 1.0   .99   .99   1.00   .24       .77   .79   1.00   .54

(d) Multivariate normal mixture, mean, p = 5

            Jump                         Drift
 δ     SCM   SPM   ESPM   JJS       SCM   SPM   ESPM   JJS
 0     .05   .05   .04    .27       .04   .04   .06    .28
 .5    .08   .09   .56    .38       .07   .07   .21    .33
 1.0   .25   .36   .99    .85       .12   .15   .76    .55

1+ mult.

norm

Graph-theoretic Tests: Some Challenges and Possible Directions

1. Computational

2. Theoretical

3. Alternate graph-theoretic approaches

4. Adaptation to real-world problems


Computational Challenges

Finding an MNBM requires $O(Nm \log(N))$ computation time using the Blossom V algorithm (Kolmogorov, 2009). For the complete graph, $m \sim N^2$. For ensemble tests the order of computation is about $O(N^4 \log(N))$, which is prohibitive with large sample sizes (e.g. $N = 1000$).

Possible strategies:

(1) Use a greedy algorithm

(2) Restrict the edge set ($m \sim N$)

(3) Try something else

Faster Matchings?

Simple greedy heuristics are difficult to extend to multiple matchings

Edge restriction heuristics. Sufficient conditions for a perfect matching to exist ($N$ even) include

-- A regular graph of degree $N/2$

-- A connected, claw-free graph

-- A Delaunay triangulation

Necessary and sufficient conditions: Tutte's Theorem

$\mathrm{odd}(V \setminus S) \le |S|$ for all $S \subseteq V$, where $\mathrm{odd}(V \setminus S)$ is the number of connected components with an odd number of vertices that remain after deleting $S$
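Tutte's condition can be checked by brute force on a small restricted edge set, which is a convenient sanity test before handing the graph to a matching routine. The sketch below tries every vertex subset, so it is exponential in the number of vertices and only illustrative; the function names are hypothetical.

```python
from itertools import combinations

def odd_components(vertices, edges, removed):
    """Count odd-sized connected components after deleting `removed`."""
    alive = set(vertices) - set(removed)
    adj = {v: set() for v in alive}
    for u, v in edges:
        if u in alive and v in alive:
            adj[u].add(v)
            adj[v].add(u)
    seen, odd = set(), 0
    for start in alive:
        if start in seen:
            continue
        # Depth-first search to measure the size of this component.
        stack, size = [start], 0
        seen.add(start)
        while stack:
            u = stack.pop()
            size += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        odd += size % 2
    return odd

def satisfies_tutte(vertices, edges):
    """True if odd(V minus S) <= |S| for every subset S, so a perfect matching exists."""
    return all(odd_components(vertices, edges, S) <= len(S)
               for r in range(len(vertices) + 1)
               for S in combinations(vertices, r))

path_graph = [(0, 1), (1, 2), (2, 3)]   # has the perfect matching {01, 23}
claw = [(0, 1), (0, 2), (0, 3)]         # deleting vertex 0 leaves three odd components
path_ok = satisfies_tutte(range(4), path_graph)
claw_ok = satisfies_tutte(range(4), claw)
```

The claw fails the condition, which is consistent with claw-freeness appearing among the sufficient conditions listed above.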

Are MNBM tests universally consistent?

Asymptotic theory for MNBM is not straightforward even for a single matching, let alone ensembles.

Aldous & Steele (1992) theory for MSTs exploits perturbation localizability of MSTs (not applicable to matchings).

Interesting recent work: "Poisson Matching" (Holroyd et al., 2008)

MNBM is a solution to the integer linear program

Minimize: $f(x) = \sum_{i<j} d_{ij} x_{ij}$

Subject to: $x_{ij} \in \{0, 1\}$, $\sum_{i \ne j} x_{ij} = 1$, $j = 1, \ldots, N$

By replacing the integrality constraints with the interval constraints $0 \le x_{ij} \le 1$ a solution can be obtained using LP. A "relaxed" SPM statistic can be defined by

$\hat{T}_{\mathrm{RSPM}} = \sum_{i<j} \max(i, j)\, \hat{x}_{ij} = \tfrac{1}{2} \sum_{i<j} |i - j|\, \hat{x}_{ij} + \tfrac{1}{4} N(N+1)$

Solutions to RNBM satisfy $\hat{x}_{ij} \in \{0, \tfrac{1}{2}, 1\}$

To fit ensembles, enforce additional constraints on the $x_{ij}$ over a sequence of problems $k = 1, \ldots, n$. There is no assurance that solutions will be "nested", however, which complicates theory

Performance of relaxed MNBM statistics compares favorably with that of regular MNBM

What about nearest neighbors?


Possible Applications

• Process control (off-line, on-line)

• Mechanical prognostics

• Threat detection

• Syndromic surveillance

In high-dimensional problems, it may be useful to couple graph-theoretic methods with methods to reduce dimensionality


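Dimension reduction

Consider the optimization problem

$\min_{w} \sum_{(i,j) \in E} |i - j|\, x_{ij}(w)$

s.t. $x(w) \in \arg\min_{x \in X} \sum_{(i,j) \in E} d_{ij}(w)\, x_{ij}$

$w \in \{0, 1\}^p, \quad \sum_r w_r = p'$

Here $d_{ij}(w)$ is the distance between $y_i$ and $y_j$ computed on the coordinates selected by $w$, and $X$ is the set of feasible matchings. Vector $w$ projects into a low ($p'$)-dimensional space to minimize the sum of pair index differences in the resulting minimum-weight matching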

• Simplification 1: use Manhattan distance: $d_{ij}(w) = \sum_r d_{ijr} w_r$, with $d_{ijr} = |y_{ir} - y_{jr}|$

• Simplification 2: use relaxed matching instead of exact matching; enforce minimum-weight matching using strong duality.

$\min_{w, x, \pi} \sum_{(i,j) \in E} |i - j|\, x_{ij}$

s.t. $Ax = \mathbf{1}, \quad x \ge 0$

$\pi_i + \pi_j \le \sum_r d_{ijr} w_r \quad \forall (i,j) \in E$

$\sum_{(i,j) \in E} \sum_r d_{ijr} w_r x_{ij} = \sum_{v \in V} \pi_v$

$\sum_r w_r = p', \quad w \in \{0, 1\}^p$
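The Manhattan simplification makes the distance linear in the selection vector, which is what allows it to sit inside a linear program. As a minimal sketch with hypothetical names and data, the weighted distance for a 0/1 vector `w` can be computed as follows.

```python
def weighted_manhattan(y, w):
    """Pairwise distances sum_r w[r] * |y[i][r] - y[j][r]| for a 0/1 weight vector w."""
    n = len(y)
    return [[sum(w_r * abs(a - b) for w_r, a, b in zip(w, y[i], y[j]))
             for j in range(n)] for i in range(n)]

# Three observations in p = 3 dimensions; w keeps coordinates 1 and 3 (p' = 2).
y = [(0, 0, 0), (1, 2, 3), (2, 2, 2)]
w = (1, 0, 1)
D_w = weighted_manhattan(y, w)
```

Setting every entry of `w` to 1 recovers the ordinary Manhattan distance matrix, so the unrestricted problem is a special case.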
