33
Mining Unusual Patterns in Data Streams in Multi- Dimensional Space Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj

Mining Unusual Patterns in Data Streams in Multi-Dimensional

  • Upload
    tommy96

  • View
    419

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Mining Unusual Patterns in Data Streams in Multi-Dimensional

Mining Unusual Patterns in Data Streams in Multi-

Dimensional Space

Jiawei Han

Department of Computer Science

University of Illinois at Urbana-Champaign

www.cs.uiuc.edu/~hanj

Page 2: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 2

Outline

Characteristics of data streams

Mining unusual patterns in data streams

Multi-dimensional regression analysis of data

streams

Stream cubing and stream OLAP methods

Mining other kinds of patterns in data streams

Research problems

Page 3: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 3

Data Streams

Data Streams Data streams—continuous, ordered, changing, fast, huge amount

Traditional DBMS—data stored in finite, persistent data setsdata sets

Characteristics Huge volumes of continuous data, possibly infinite Fast changing and requires fast, real-time response Data stream captures nicely our data processing needs of today Random access is expensive—single linear scan algorithm (can

only have one look) Store only the summary of the data seen thus far Most stream data are at pretty low-level or multi-dimensional in

nature, needs multi-level and multi-dimensional processing

Page 4: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 4

Stream Data Applications

Telecommunication calling records Business: credit card transaction flows Network monitoring and traffic engineering Financial market: stock exchange Engineering & industrial processes: power supply &

manufacturing Sensor, monitoring & surveillance: video streams Security monitoring Web logs and Web page click streams Massive data sets (even saved but random access is too

expensive)

Page 5: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 5

Challenges of Stream Data Mining

Multiple, continuous, rapid, time-varying, ordered streams Main memory computation Mining queries are either continuous or ad-hoc Mining queries are often complex

Involving multiple streams, large amount of data, and history Finding patterns, models, anomaly, differences, …

Mining dynamics (changes, trends and evolutions) of data streams

Multi-level/multi-dimensional processing and data mining Most stream data are at pretty low-level or multi-dimensional in

nature

Page 6: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 6

Stream Data Mining Tasks

Multi-dimensional (on-line) analysis of streams Clustering data streams Classification of data streams Mining frequent patterns in data streams Mining sequential patterns in data streams Mining partial periodicity in data streams Mining notable gradients in data streams Mining outliers and unusual patterns in data streams ……

Page 7: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 7

Multi-Dimensional Stream Analysis: Examples

Analysis of Web click streams Raw data at low levels: seconds, web page addresses, user IP

addresses, … Analysts want: changes, trends, unusual patterns, at reasonable

levels of details E.g., Average clicking traffic in North America on sports in the last

15 minutes is 40% higher than that in the last 24 hours.”

Analysis of power consumption streams Raw data: power consumption flow for every household, every

minute Patterns one may find: average hourly power consumption

surges up 30% for manufacturing companies in Chicago in the last 2 hours today than that of the same day a week ago

Page 8: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 8

A Key Step—Stream Data Reduction

Challenges of OLAPing stream data Raw data cannot be stored

Simple aggregates are not powerful enough

History shape and patterns at different levels are desirable: multi-

dimensional regression analysis

Proposal A scalable multi-dimensional stream “data cube” that can

aggregate regression model of stream data efficiently without

accessing the raw data

Stream data compression Compress the stream data to support memory- and time-efficient

multi-dimensional regression analysis

Page 9: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 9

Regression Cube for Time-Series

Initially, one time-series per base cell Too costly to store all these time-series Too costly to compute regression at multi-dimensional space

Regression cube Base cube: only store regression parameters of base cells (e.g.,

2 points vs. 1000 points) All the upper level cuboids can be computed precisely for linear

regression on both standard dimensions and time dimensions For quadratic regression, we need 5 points In general, we need:

where k = 2 for quadratic.

,2

3

2

)1( 2 kkkkk

+=++

Page 10: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 10

Basics of General Linear Regression

n tuples in one cell: (xi , yi), i =1..n, where yi is the measure attribute to be analyzed

For sample i , a vector of k user-defined predictors ui:

The linear regression model:

where η is a k × 1 vector of regression parameters

( )

( )

=

=

−− 1,

1

1

1

0

...

1

...

ki

i

ik

ii

u

u

u

u

u

x

xu

( ) 1,1110 ...| −−+++== kikiiT

ii uuyE ηηηη uu

Page 11: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 11

Theory of General Linear Regression

Collect into the model matrix U

The ordinary least square (OLS) estimate of is the argument that minimizes the residue sum of squares function

Main theorem to determine the OLS regression parameters

iu kn×

=

1,21

1,22221

1,11211

...1.

.

.

.

.

.

.

.

.

....1

...1

knnn

k

k

uuu

uuu

uuu

U

∧η η

)()()( ηηRSS T UyUy −−=η

yUUU TTRSS 1)(0)( −∧

=⇒=∂∂ ηηη

Page 12: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 12

Linearly Compressed Representation (LCR)

Stream data compression for multi-dimensional regression analysis

Define, for i, j = 0,…,k-1:

The linearly compressed representation (LCR) of one cell:

Size of LCR of one cell:

quadratic in k, independent of the number of tuples n in one cell

∑=

=n

hhjhiij uu

1

θ

},1,...,0,|{}1,...,0|{ jikjiki iji ≤−=−=∧

θη

,2

3

2

)1( 2 kkkkk

+=++

Page 13: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 13

Matrix Form of LCR

LCR consists of and , where

and

where

provides OLS regression parameters essential for regression analysis

is an auxiliary matrix that facilitates aggregations of LCR in standard and regression dimensions in a data cube environment

LCR only stores the upper triangle of

),...,,( 110 −

∧∧∧∧= k

T

ηηηη∧η Θ

=

−−

−−− 1,1

1,1

1,0

2,11,10,1

121110

020100

.

.

...

...

...

.

.

.

.

.

....

...

kk

k

k

kkk θ

θθ

θθθ

θθθθθθ

Θ

∧ηΘ

⇒= ΘΘT Θ

Page 14: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 14

Aggregation in Standard Dimensions

Given LCR of m cells that differ in one standard dimension, what is the LCR of the cell aggregated in that dimension? for m base cells

for an aggregated cell

The lossless aggregation formula

),(,...),,(),,( 222111 mmmLCRLCRLCR ΘΘΘ∧∧∧

=== ηηη

),( aaaLCR Θ∧

= η

1

1

ΘΘ =

=∑=

∧∧

a

m

iia ηη

Page 15: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 15

Stock Price Example—Aggregation in Standard Dimensions

Simple linear regression on time series dataCells of two companies

After aggregation:

Page 16: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 16

Aggregation in Regression Dimensions

Given LCR of m cells that differ in one regression dimension, what is the LCR of the cell aggregated in that dimension

for m base cells

for the aggregated cell

The lossless aggregation formula

),(,...),,(),,( 222111 mmmLCRLCRLCR ΘΘΘ∧∧∧

=== ηηη

),( aaaLCR Θ∧

= η

∑∑

=

=

∧−

=

=

=

m

iia

m

iii

m

iia

1

1

1

1

ΘΘ

ΘΘ ηη

Page 17: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 17

Stock Price Example—Aggregation in Time Dimension

Cells of two adjacent time intervals:

After aggregation

Page 18: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 18

Feasibility of Stream Regression Analysis

Efficient storage and scalable (independent of the

number of tuples in data cells)

Lossless aggregation without accessing the raw data

Fast aggregation: computationally efficient

Regression models of data cells at all levels

General results: covered a large and the most popular

class of regression

Including quadratic, polynomial, and nonlinear models

Page 19: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 19

A Stream Cube Architecture

A tilted time frame Different time granularities

second, minute, quarter, hour, day, week, …

Critical layers Minimum interest layer (m-layer) Observation layer (o-layer) User: watches at o-layer and occasionally needs to drill-down

down to m-layer

Partial materialization of stream cubes Full materialization: too space and time consuming No materialization: slow response at query time Partial materialization: what do we mean “partial”?

Page 20: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 20

A Tilted Time-Frame Model

31 days 24 hours 4 qtrs12 months

Time Now

24hrs 4qtrs 15minutes7 days

Time Now

25sec.Up to 7 days:

Up to a year:

Logarithmic (exponential) scale:4t 2t 1t

Time Now

4t8t16t

Page 21: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 21

Benefits of Tilted Time-Frame Model

Each cell stores the measures according to tilt-time-frame

Limited memory space: Impossible to store the history in full scale

Emphasis more on recent data

Most applications emphasize on recent data (slide window)

Natural partition on different time granularities

Putting different weights on remote data

Useful even for uniform weight

Tilted time-frame forms a new time dimension

for mining changes and evolutions

Essential for mining unusual patterns or outliers

Finding those with dramatic changes

E.g., exceptional stocks—not following the trends

Page 22: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 22

Two Critical Layers in the Stream Cube

(*, theme, quarter)

(user-group, URL-group, minute)

m-layer (minimal interest)

(individual-user, URL, second)

(primitive) stream data layer

o-layer (observation)

Page 23: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 23

On-Line Materialization vs. On-Line Computation

On-line materialization

Materialization takes precious resources and time

Only incremental materialization (with slide window)

Only materialize “cuboids” of the critical layers?

Some intermediate cells that should be materialized

Popular path approach vs. exception cell approach

Materialize intermediate cells along the popular paths

Exception cells: how to set up exception thresholds?

Notice exceptions do not have monotonic behaviour

Computation problem

How to compute and store stream cubes efficiently?

How to discover unusual cells between the critical layer?

Page 24: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 24

Stream Cube Structure: from m-layer to o-layer

(A1, *, C1)

(A1, *, C2) (A1, *, C2)

(A1, *, C2)

(A1, B1, C2) (A1, B2, C1)

(A2, *, C2)

(A2, B1, C1)

(A1, B2, C2)

(A2, B1, C2)

A2, B2, C1)

(A2, B2, C2)

Page 25: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 25

Stream Cube Computation

Cube structure from m-layer to o-layer

Three approaches All cuboids approach

Materializing all cells (too much in both space and time)

Exceptional cells approach

Materializing only exceptional cells (saves space but not time

to compute and definition of exception is not flexible)

Popular path approach

Computing and materializing cells only along a popular path

Using H-tree structure to store computed cells (which form the

stream cube—a selectively materialized cube)

Page 26: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 26

An H-Tree Cubing Structure

root

entertainment sports politics

uic uiuc uic uiuc

jeff mary jeff Jim

Q.I.Q.I. Q.I.

Regression:

Sum: xxxxCnt: yyyy

Quant-Info

Observation layer

Minimal int. layer

Page 27: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 27

Partial Materialization Using H-Tree

H-tree: Introduced for computing data cubes and iceberg cubes

J. Han, J. Pei, G. Dong, and K. Wang, “Efficient Computation

of Iceberg Cubes with Complex Measures”, SIGMOD'01

Compressed database, fast cubing, and space preserving in cube

computation

Using H-tree for partial stream cubing Space preserving:

Intermediate aggregates can be computed incrementally and

saved in tree nodes

Facilitate computing other cells and multi-dimensional analysis

H-tree with computed cells can be viewed as “stream cube”

Page 28: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 28

1

10

100

1000

25 50 100 200 400

Size (in K tuples)

Run

tim

e (i

n se

cond

s)

popular-pathexception-cellsall-cubing

0

200

400

600

25 50 100 200 400

Size (in K tuples)

Mem

ory

usag

e (i

n M

B)

popular-path

exception-cells

all-cubing

a) Time vs. m-layer size b) Space vs. m-layer size

Time and Space vs. Number of Tuples at the m-Layer (Dataset D3L3C10T400K)

Page 29: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 29

1

10

100

1000

10000

3 4 5 6 7

Number of levels

Run

tim

e (i

n se

cond

s)

popular-pathexception-cellsall-cubing

1

10

100

1000

10000

3 4 5 6 7

Number of levels

Mem

ory

usag

e (i

n M

B)

popular-pathexception-cellsall-cubing

Time and Space vs. the Number of Levels

a) Time vs. # levels b) Space vs. # levels

Page 30: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 30

Mining Other Unusual Patterns in Stream Data

Clustering and outlier analysis for stream mining

Clustering data streams (Guha, Motwani et al. 2000-2002)

History-sensitive, high-quality incremental clustering

Classification of stream data

Evolution of decision trees: Domingos et al. (2000, 2001)

Incremental integration of new streams in decision-tree induction

Frequent pattern analysis

Approximate frequent patterns (Manku & Motwani VLDB’02)

Evolution and dramatic changes of frequent patterns

Page 31: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 31

Conclusions

Stream data mining: A rich and largely unexplored field Current research focus in database community:

DSMS system architecture, continuous query processing, supporting mechanisms

Stream data mining and stream OLAP analysis Powerful tools for finding general and unusual patterns Effectiveness, efficiency and scalability: lots of open problems

Our philosophy: A multi-dimensional stream analysis framework

Time is a special dimension: tilted time frame What to compute and what to save?—Critical layers Very partial materialization/precomputation: popular path approach Mining dynamics of stream data

Page 32: Mining Unusual Patterns in Data Streams in Multi-Dimensional

May 10, 2010 Mining Unusual Patterns in Data Streams 32

References

B. Babcock, S. Babu, M. Datar, R. Motawani, and J. Widom, “Models and issues in data

stream systems”, PODS'02 (tutorial).

S. Babu and J. Widom, “Continuous queries over data streams”, SIGMOD Record,

30:109--120, 2001.

Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and J. Wang. “Online analytical processing

stream data: Is it feasible?”, DMKD'02.

Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, “Multi-dimensional regression

analysis of time-series data streams”, VLDB'02.

P. Domingos and G. Hulten, “Mining high-speed data streams”, KDD'00.

M. Garofalakis, J. Gehrke, and R. Rastogi, “Querying and mining data streams: You only

get one look”, SIGMOD'02 (tutorial).

J. Gehrke, F. Korn, and D. Srivastava, “On computing correlated aggregates over

continuous data streams”, SIGMOD'01.

S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams”,

FOCS'00.

G. Hulten, L. Spencer, and P. Domingos, “Mining time-changing data streams”, KDD'01.