07/03/06 - Tunisia

Data Mining Research at SMU

Margaret H. Dunham
DBGroup: Yu Meng, Jie Huang, Lin Lu, Donya Quick, Michael Pierce
CSE Department
Southern Methodist University
Dallas, Texas 75275
[email protected]


Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.

DILBERT reprinted by permission of United Feature Syndicate, Inc.


Outline

What is Data Mining?
EMM
– Spatio-temporal modeling
– Rare Event Detection
Bioinformatics
– TCGR: DNA/RNA visualization
– miRNA prediction
Web Usage Mining


Data Mining Definition

Finding hidden information in a database
Fit data to a model
Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning


Query Examples

Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.

Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)
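
To make the contrast concrete, here is a minimal Python sketch of the milk example. The baskets, the item names, and the `min_confidence` parameter are assumptions made up for illustration; they do not come from the slides. The database query returns exactly the records that match, while the mining query estimates which items tend to co-occur with milk.

```python
from collections import Counter

# Hypothetical market baskets; the item names are invented for illustration.
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "eggs"},
    {"milk", "bread", "cereal"},
]

def database_query(baskets, item):
    """Database-style query: return exactly the baskets that contain `item`."""
    return [b for b in baskets if item in b]

def frequent_with(baskets, item, min_confidence):
    """Mining-style query: items whose confidence P(other | item) meets the threshold."""
    with_item = database_query(baskets, item)
    if not with_item:
        return {}
    counts = Counter(other for b in with_item for other in b if other != item)
    return {other: n / len(with_item)
            for other, n in counts.items()
            if n / len(with_item) >= min_confidence}

milk_baskets = database_query(baskets, "milk")
rules = frequent_with(baskets, "milk", min_confidence=0.5)
```

The first call has one exact answer. The second depends on the chosen threshold, which is typical of mining queries.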


Spatiotemporal Environment

Events arriving in a stream
At any time, t, we can view the state of the problem as represented by a vector of n numeric values:

V_t = <S_1t, S_2t, ..., S_nt>

        V_1    V_2    …    V_q
S_1     S_11   S_12   …    S_1q
S_2     S_21   S_22   …    S_2q
…       …      …      …    …
S_n     S_n1   S_n2   …    S_nq

(columns ordered by time)


Technique

Spatiotemporal modeling technique based on Markov models (MM).

However:
– Size of MM depends on size of dataset.
– The required structure of the MM is not known at the model construction time.
– As the real world being modeled by the MM changes, so should the structure of the MM.


Extensible Markov Model (EMM)

Time Varying Discrete First Order Markov Model
Nodes are clusters of real world states.
Learning continues during application phase.
Learning:
– Transition probabilities between nodes
– Node labels (centroid/medoid of cluster)
– Nodes are added and removed as data arrives
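
The learning step can be sketched in a few lines of Python. This is an illustrative toy, not the DBGroup implementation. The class name `EMM`, the Euclidean distance, the threshold value of 2.0, and the running-mean centroid update are all assumptions. Node removal is left out, and the sketch is not meant to reproduce the graph on the EMM Learning slide, which may use a different similarity measure. The input vectors are the ones shown on that slide.

```python
import math

class EMM:
    """Toy Extensible Markov Model: nodes are clusters, links carry transition counts."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed: max Euclidean distance to join a node
        self.centroids = []         # node labels (cluster centroids)
        self.sizes = []             # number of vectors absorbed by each node
        self.links = {}             # (from_node, to_node) -> transition count
        self.current = None         # node that matched the previous input

    def _nearest(self, v):
        """Return (index, distance) of the closest node, or (None, inf) if empty."""
        best, best_d = None, math.inf
        for i, c in enumerate(self.centroids):
            d = math.dist(v, c)
            if d < best_d:
                best, best_d = i, d
        return best, best_d

    def learn(self, v):
        """Absorb one input vector: cluster it, then count the transition."""
        node, d = self._nearest(v)
        if node is None or d > self.threshold:
            # No node is close enough, so the model grows by one node.
            self.centroids.append(list(v))
            self.sizes.append(1)
            node = len(self.centroids) - 1
        else:
            # Update the node label as a running mean of its members.
            n = self.sizes[node]
            self.centroids[node] = [(c * n + x) / (n + 1)
                                    for c, x in zip(self.centroids[node], v)]
            self.sizes[node] = n + 1
        if self.current is not None:
            key = (self.current, node)
            self.links[key] = self.links.get(key, 0) + 1
        self.current = node
        return node

    def transition_prob(self, i, j):
        """Estimated probability of moving from node i to node j."""
        out = sum(c for (a, _), c in self.links.items() if a == i)
        return self.links.get((i, j), 0) / out if out else 0.0

vectors = [
    [18, 10, 3, 3, 1, 0, 0],
    [17, 10, 2, 3, 1, 0, 0],
    [16, 9, 2, 3, 1, 0, 0],
    [14, 8, 2, 3, 1, 0, 0],
    [14, 8, 2, 3, 0, 0, 0],
    [18, 10, 3, 3, 1, 1, 0],
]
emm = EMM(threshold=2.0)
path = [emm.learn(v) for v in vectors]
```

Because clustering and transition counting happen in the same call, the model keeps learning for as long as data arrives. A tighter threshold produces more nodes.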


EMM Learning

Input vectors:

<18,10,3,3,1,0,0>
<17,10,2,3,1,0,0>
<16,9,2,3,1,0,0>
<14,8,2,3,1,0,0>
<14,8,2,3,0,0,0>
<18,10,3,3,1,1,0>

[Figure: the EMM after each input. Nodes N1, N2, and N3 are added as the data arrives, and each link is labeled with its transition probability (for example 1/3, 2/3, 1/2, 1/1).]


Growth of EMM

[Figure: number of states in the model (axis 0 to 800) versus number of input data (total 1574), with one curve per threshold: 0.994, 0.995, 0.996, 0.997, 0.998.]

Servent Data


EMM Performance – Growth Rate

Minnesota Traffic Data


EMM Water Level Prediction – Ouse Data

[Figure: water level (m), axis 0 to 8, versus input time series, comparing RLF Prediction, EMM Prediction, and Observed values.]


Rare Event

Rare – Anomalous – Surprising
Out of the ordinary
Not outlier detection
Ex: Snow in upstate New York is not rare; snow in upstate New York in June is rare
Rare events may change over time
Applications
– Intrusion Detection
– Fraud
– Flooding
– Unusual automobile/network traffic
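
The snow example suggests that rarity depends on context and on what has been seen so far. The Python sketch below illustrates that idea with plain conditional frequencies. It is not the EMM-based detector. The class name `RareEventDetector`, the `min_prob` threshold, and the counts are assumptions chosen for illustration.

```python
from collections import Counter

class RareEventDetector:
    """Flag an event as rare when it is infrequent within its context."""

    def __init__(self, min_prob):
        self.min_prob = min_prob          # assumed cutoff on P(event | context)
        self.pair_counts = Counter()      # (context, event) -> occurrences
        self.context_counts = Counter()   # context -> occurrences

    def learn(self, context, event):
        """Record one observation; the model keeps changing as data arrives."""
        self.pair_counts[(context, event)] += 1
        self.context_counts[context] += 1

    def is_rare(self, context, event):
        """Rare if the context is unseen or the event's share of it is below min_prob."""
        seen = self.context_counts[context]
        if seen == 0:
            return True
        return self.pair_counts[(context, event)] / seen < self.min_prob

detector = RareEventDetector(min_prob=0.05)
for _ in range(20):
    detector.learn("January", "snow")
for _ in range(10):
    detector.learn("January", "clear")
for _ in range(30):
    detector.learn("June", "clear")

january_snow_rare = detector.is_rare("January", "snow")  # share is 20/30
june_snow_rare = detector.is_rare("June", "snow")        # share is 0/30

# Rare events may change over time: repeated June snow raises its share to 5/35.
for _ in range(5):
    detector.learn("June", "snow")
june_snow_rare_later = detector.is_rare("June", "snow")
```

The same event ("snow") is judged differently in each context, and the judgment for June changes once enough June snow has been observed.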


Rare Event in Cisco Data


Chaos Game Representation (CGR)

2D technique to visually see the distribution of subpatterns

Our technique is based on the following:
– Generate totals for each subpattern
– Scale totals to a [0,1] range. (Note scaling can be a problem)
– Convert range to red/blue
  • 0-0.5: White to Blue
  • 0.5-1: Blue to Red

[Figure: CGR square subdivided recursively into quadrants labeled A, C, G, U.]
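
The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions: scaling divides by the largest total (one of several possible choices, which is where scaling can become a problem), the color ramp is linear in RGB, and the two input sequences are invented rather than real miRNA. Placing each pattern into its cell of the CGR square is omitted.

```python
from itertools import product

def pattern_totals(sequences, k, alphabet="ACGU"):
    """Step 1: total occurrences of every length-k subpattern over all sequences."""
    totals = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            sub = seq[i:i + k]
            if sub in totals:  # skip windows containing unexpected symbols
                totals[sub] += 1
    return totals

def scale(totals):
    """Step 2: scale totals to [0, 1] by dividing by the largest total (assumed)."""
    hi = max(totals.values())
    return {p: (n / hi if hi else 0.0) for p, n in totals.items()}

def to_rgb(x):
    """Step 3: map 0-0.5 from white to blue and 0.5-1 from blue to red."""
    if x <= 0.5:
        t = x / 0.5
        fade = round(255 * (1 - t))
        return (fade, fade, 255)
    t = (x - 0.5) / 0.5
    return (round(255 * t), 0, round(255 * (1 - t)))

# Hypothetical RNA fragments, for illustration only.
sequences = ["UUCGUG", "GUGUUC"]
totals = pattern_totals(sequences, k=3)
scaled = scale(totals)
colors = {p: to_rgb(x) for p, x in scaled.items()}
```

With k = 3 there are 64 cells. A pattern that never occurs stays white, and the most frequent pattern is pure red.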


CGR Example

Homo Sapiens – all mature miRNA

Patterns of length 3

UUC

GUG


Temporal CGR (TCGR)

Temporal version of Frequency CGR
– In our context temporal means the starting location of a window

2D Array
– Each row represents counts for a particular window in the sequence
  • First row – first window
  • Last row – last window
  • We start successive windows at the next character location
– Each column represents the counts for the associated pattern in that window
  • Initially we have assumed order of patterns is alphabetic

Size of TCGR depends on sequence length and subpattern length
As sequence lengths vary, we only examine complete windows
We only count patterns completely contained in each window.
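
A minimal Python sketch of the 2D array described above. The function name `tcgr` and its parameters are assumptions. The sequence is the one shown on the TCGR Example slide, and the window and pattern sizes (5 and 2) are the ones used on the mature miRNA slide.

```python
from itertools import product

def tcgr(sequence, window, k, alphabet="acgt"):
    """Return (patterns, rows): one row of pattern counts per complete window."""
    # Columns follow alphabetic pattern order.
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=k)]
    index = {p: j for j, p in enumerate(patterns)}
    rows = []
    # Successive windows start at the next character; only complete windows are used.
    for start in range(len(sequence) - window + 1):
        win = sequence[start:start + window]
        row = [0] * len(patterns)
        # Count only patterns completely contained in this window.
        for i in range(window - k + 1):
            sub = win[i:i + k]
            if sub in index:
                row[index[sub]] += 1
        rows.append(row)
    return patterns, rows

seq = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"
patterns, rows = tcgr(seq, window=5, k=2)
first_window = dict(zip(patterns, rows[0]))
```

The array has `len(seq) - window + 1` rows and `4 ** k` columns, which is why its size depends on both the sequence length and the subpattern length.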


TCGR Example

acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga

Moving Window

Pattern Length: 1 2 3


TCGR – Mature miRNA (Window=5; Pattern=2)

[Figure panels: All Mature, Mus Musculus, Homo Sapiens, C Elegans]


The BIG PICTURE

2003-10-0515:49:20050721435700000026210000000000               0265202652000000000
2003-10-0516:40:49050832595900000872710001142380               0710707107000000000
2003-10-0504:55:10050767799900000191300000670518               0000000000000000000
2003-10-0509:43:10050781766100000603030000000000               0365700469000000000
2003-10-0514:49:360508182420000007066200000000000811a39        0914207107000000000
2003-10-0521:23:57050759031600000465050002794335               1199207107000000000
2003-10-0511:30:16050730512600000465050000195747               1684600597corduroy+coats

CAN’T SEE THE FOREST FOR THE TREES


[Figure: web usage mining process flow, with the following labels]

Web Server
WebLog
Preprocess Web Data: Cleanse, Sessionize, …
URL Abstraction
Cluster Web Sessions
Markov Model per Cluster
User defined beginning/ending Web pages
User Preferred Navigation Trail
Normalized Probability
Significant Usage Pattern
Interests… Motivations…
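
The Markov model and normalized probability steps can be sketched in Python for a single cluster. The slides do not state how the probability is normalized. This sketch assumes the geometric mean of the transition probabilities along a trail, so that long and short trails can be compared against one threshold. The sessions and page labels are invented, and S and E mark the start and end of a session as in the SUP tables.

```python
from collections import defaultdict

def build_markov_model(sessions):
    """First-order Markov model: transition probabilities estimated from sessions."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        trail = ["S", *session, "E"]  # add start and end states
        for a, b in zip(trail, trail[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def normalized_probability(model, trail):
    """Assumed normalization: geometric mean of the trail's transition probabilities."""
    steps = len(trail) - 1
    if steps < 1:
        return 0.0
    prob = 1.0
    for a, b in zip(trail, trail[1:]):
        prob *= model.get(a, {}).get(b, 0.0)  # unseen transition gives 0
    return prob ** (1.0 / steps)

# Hypothetical abstracted sessions from one cluster.
sessions = [
    ["C1", "C2", "P1"],
    ["C1", "C2"],
    ["C1", "P1"],
    ["C1", "C2", "P1"],
]
model = build_markov_model(sessions)
long_trail = normalized_probability(model, ["S", "C1", "C2", "P1", "E"])
short_trail = normalized_probability(model, ["S", "C1", "P1", "E"])
threshold = 0.7  # assumed value
significant = [t for t, p in [("S-C1-C2-P1-E", long_trail), ("S-C1-P1-E", short_trail)]
               if p >= threshold]
```

A trail whose normalized probability meets the threshold would be reported as a significant usage pattern for that cluster. The threshold of 0.7 is arbitrary here.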


Experimental Result

WebKDD’05 25

On average, purchase sessions are longer than sessions without a purchase:

– review the information, compare the price, the quality, etc.
– fill out the billing and shipping information to commit the purchase

[Figure: Average Session Length (Purchase vs. Non-Purchase Clusters). Bar chart of length (axis 0 to 50) for Purchase and Non-Purchase.]


Experimental Result

SUPs in non-purchase cluster

Cluster No.   No. of Sessions   Threshold ()   Average Session Length   No. of States
1             1746              0.3            9.6                      98
2             241               0.37           6.6                      38
3             13                0.3            3.0                      6

SUPs, cluster 1:
1. S-C1-C1-C2-C3-C4-C5-C6-C7-E
2. S-C1-C1-C2-C3-C4-C5-E
3. S-C1-C1-C2-C3-E
4. S-C1-C2-C3-C3-C4-C5-C6-C7-E
5. S-C1-C2-C3-C4-C4-C5-C6-C7-E
…

SUPs, cluster 2:
1. S-P1-P2-P3-P3-E
2. S-P1-P2-P3-P4-P4-P5-E
3. S-P1-P2-P3-P4-P4-E
4. S-P1-P2-P3-P4-P5-P4-E
5. S-P1-P2-P3-P4-P5-P5-E
…

SUPs, cluster 3:
1. S-C1-P1-P2-E
2. S-C1-P1-E
3. S-I1-P1-P1-P2-E
4. S-I1-P1-P1-E
5. S-I1-P1-E
…

Further SUPs shown:
S-C1-C1-C2-C3-C4-C5-C5-I1-E
S-C1-C1-I1-C1-C2-C3-C4-C5-E
S-I1-C1-C2-C3-C4-C5-C6-C7-E

Annotations:
– Interested in gathering information of products in different categories.
– Not serious visitors (the average session length is 3)
– Interested in reviewing general pages (to gather general information).


Experimental Result

Cluster        Cluster No.   No. of Sessions   Average Session Length   No. of States   Threshold ()   Beginning Web page   SUPs in BNF Notation
Non-Purchase   1             1746              9.6                      98              0.3            S                    S-{C}-E
                                                                                        0.25           P86806               P86806-{C}-E
Non-Purchase   2             241               6.6                      38              0.37           S                    S-{P}-[C]-E
                                                                                        0.34           P86806               P86806-[I]-{P}-E
Non-Purchase   3             13                3.0                      6               0.3            S                    S-<C | I>-{P}-E
                                                                                        0.2            P86806               P86806-[{P}- [P86806]]-E
Purchase       1             1858              14.9                     55              0.47           S                    S-[C]-[I]-{P}-E
                                                                                        0.51           P86806               P86806-[I]-{P}-E
Purchase       2             132               39.1                     100             0.457          S                    S -[{{C}|{I}}]-{P}-E
                                                                                        0.434          P86806               P86806-[{C }]-{P}-E
Purchase       3             10                31.6                     47              0.52           S                    S-{P}-[{I}]-[{P}]-{C}-E
                                                                                        0.43           P86806               P86806-[I]-[{P}]-{C}-E

The average length of SUPs is longer in the purchase cluster than in the non-purchase cluster
– review the information, compare among products, and fill out the payment and shipping information

SUPs in the purchase cluster have higher probability than those in the non-purchase cluster.
– have purchase in mind vs. random browsing behavior
