07/03/06 - Tunisia 1
Data Mining Research at SMU
Margaret H. Dunham
DBGroup: Yu Meng, Jie Huang, Lin Lu, Donya Quick, Michael Pierce
CSE Department
Southern Methodist University
Dallas, Texas 75275
07/03/06 - Tunisia 2
Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature Syndicate, Inc.
07/03/06 - Tunisia 3
Outline
• What is Data Mining?
• EMM
  – Spatio-temporal modeling
  – Rare Event Detection
• Bioinformatics
  – TCGR: DNA/RNA visualization
  – miRNA prediction
• Web Usage Mining
07/03/06 - Tunisia 4
Data Mining Definition
• Finding hidden information in a database
• Fit data to a model
• Similar terms:
  – Exploratory data analysis
  – Data driven discovery
  – Deductive learning
07/03/06 - Tunisia 5
Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)
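The contrast between the two kinds of question can be sketched in a few lines. This is only an illustration: the basket data, the function names, and the 0.5 confidence cutoff are all made up, and the co-occurrence count is a simplified stand-in for a real association-rule algorithm.

```python
from collections import Counter

# Hypothetical toy data: each transaction is the set of items in one basket.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"milk", "bread", "cereal"},
]

def baskets_with(item, baskets):
    """Database-style query: return every basket that contains `item`."""
    return [b for b in baskets if item in b]

def frequently_bought_with(item, baskets, min_confidence=0.5):
    """Mining-style question: which items co-occur with `item` often enough?

    The confidence of the rule item -> other is
    (baskets containing both) / (baskets containing item).
    """
    containing = baskets_with(item, baskets)
    if not containing:
        return {}
    counts = Counter(other for b in containing for other in b if other != item)
    return {
        other: n / len(containing)
        for other, n in counts.items()
        if n / len(containing) >= min_confidence
    }

milk_baskets = baskets_with("milk", transactions)        # exact answer to a query
rules = frequently_bought_with("milk", transactions)     # pattern found in the data
```

The query returns rows that satisfy a stated condition, while the mining function reports a relationship that was never stated in the question.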
07/03/06 - Tunisia 7
Spatiotemporal Environment
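• Events arrive in a stream.
• At any time t, we can view the state of the problem as represented by a vector of n numeric values:

V_t = <S_{1t}, S_{2t}, ..., S_{nt}>

Each column of the table is one state vector; columns are ordered by time:

        V_1      V_2      …   V_q
S_1     S_{11}   S_{12}   …   S_{1q}
S_2     S_{21}   S_{22}   …   S_{2q}
…       …        …        …   …
S_n     S_{n1}   S_{n2}   …   S_{nq}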
07/03/06 - Tunisia 8
Technique
• Spatiotemporal modeling technique based on Markov models (MM).
• However:
  – The size of the MM depends on the size of the dataset.
  – The required structure of the MM is not known at model construction time.
  – As the real world being modeled by the MM changes, so should the structure of the MM.
07/03/06 - Tunisia 9
Extensible Markov Model (EMM)
• Time-varying, discrete, first-order Markov model
• Nodes are clusters of real-world states.
• Learning continues during the application phase.
• Learning:
  – Transition probabilities between nodes
  – Node labels (centroid/medoid of cluster)
  – Nodes are added and removed as data arrives
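A minimal sketch of this kind of learning, under stated assumptions: the class name `SimpleEMM`, the Euclidean distance, the 1.5 threshold, and the running-mean centroid are illustrative choices, not the EMM algorithm as published. Node removal is not shown.

```python
import math
from collections import defaultdict

class SimpleEMM:
    """Toy EMM-style model: nodes are cluster centroids, edges hold transition counts."""

    def __init__(self, threshold):
        self.threshold = threshold       # assumed rule: join a node if distance <= threshold
        self.centroids = []              # node label: running mean of the member vectors
        self.sizes = []                  # number of vectors absorbed by each node
        self.counts = defaultdict(int)   # (from_node, to_node) -> observed transitions
        self.current = None              # node of the most recent input

    def _nearest(self, v):
        best, best_d = None, math.inf
        for i, c in enumerate(self.centroids):
            d = math.dist(v, c)
            if d < best_d:
                best, best_d = i, d
        return best, best_d

    def learn(self, v):
        """Absorb one input vector and return the node it was assigned to."""
        node, d = self._nearest(v)
        if node is None or d > self.threshold:
            # No existing node is close enough: add a new node.
            self.centroids.append(list(v))
            self.sizes.append(1)
            node = len(self.centroids) - 1
        else:
            # Update the centroid of the matching node incrementally.
            self.sizes[node] += 1
            n = self.sizes[node]
            self.centroids[node] = [c + (x - c) / n
                                    for c, x in zip(self.centroids[node], v)]
        if self.current is not None:
            self.counts[(self.current, node)] += 1
        self.current = node
        return node

    def transition_probability(self, a, b):
        total = sum(n for (src, _), n in self.counts.items() if src == a)
        return self.counts.get((a, b), 0) / total if total else 0.0

# Example input: 7-dimensional state vectors arriving one at a time.
vectors = [
    (18, 10, 3, 3, 1, 0, 0),
    (17, 10, 2, 3, 1, 0, 0),
    (16, 9, 2, 3, 1, 0, 0),
    (14, 8, 2, 3, 1, 0, 0),
    (14, 8, 2, 3, 0, 0, 0),
    (18, 10, 3, 3, 1, 1, 0),
]
emm = SimpleEMM(threshold=1.5)
visited = [emm.learn(v) for v in vectors]
```

Because the structure is updated on every call to `learn`, the same loop serves for both training and application, which is the property the slide emphasizes.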
07/03/06 - Tunisia 10
EMM Learning
Input vectors:
<18,10,3,3,1,0,0>
<17,10,2,3,1,0,0>
<16,9,2,3,1,0,0>
<14,8,2,3,1,0,0>
<14,8,2,3,0,0,0>
<18,10,3,3,1,1,0>
[Figure: successive snapshots of the EMM as the inputs arrive, with nodes N1, N2, N3 and transition probabilities on the edges.]
07/03/06 - Tunisia 11
Growth of EMM
[Figure: number of states in the model versus number of input data (total 1574), for thresholds 0.994, 0.995, 0.996, 0.997, and 0.998. Servent Data.]
07/03/06 - Tunisia 12
EMM Performance – Growth Rate
Minnesota Traffic Data
07/03/06 - Tunisia 13
EMM Water Level Prediction – Ouse Data
[Figure: water level (m) versus input time series, comparing RLF prediction, EMM prediction, and observed values.]
07/03/06 - Tunisia 14
Rare Event
• Rare, anomalous, surprising
• Out of the ordinary
• Not outlier detection
• Ex: Snow in upstate New York is not rare; snow in upstate New York in June is rare
• Rare events may change over time
• Applications:
  – Intrusion detection
  – Fraud
  – Flooding
  – Unusual automobile/network traffic
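The snow example suggests that rarity depends on context. One way to illustrate that, as an assumption rather than the method used in this work, is to estimate the frequency of an event within its context and compare it to a cutoff. The observations, the function names, and the 0.05 cutoff below are all invented for the example.

```python
from collections import Counter

def conditional_frequency(observations, event, context):
    """Estimate P(event | context) as a relative frequency."""
    in_context = [e for c, e in observations if c == context]
    if not in_context:
        return 0.0
    return Counter(in_context)[event] / len(in_context)

def is_rare(observations, event, context, cutoff=0.05):
    """Flag the event as rare when it is infrequent in the given context."""
    return conditional_frequency(observations, event, context) < cutoff

# Made-up (month, weather) observations, for illustration only.
observations = (
    [("January", "snow")] * 20 + [("January", "clear")] * 10
    + [("June", "clear")] * 29 + [("June", "snow")] * 1
)

snow_in_january = is_rare(observations, "snow", "January")   # frequent in this context
snow_in_june = is_rare(observations, "snow", "June")         # infrequent in this context
```

Re-estimating the frequencies as new observations arrive is one simple way to let the notion of "rare" change over time.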
07/03/06 - Tunisia 15
Rare Event in Cisco Data
07/03/06 - Tunisia 17
Chaos Game Representation (CGR)
• 2D technique to visually see the distribution of subpatterns
• Our technique is based on the following:
  – Generate totals for each subpattern
  – Scale totals to a [0,1] range. (Note: scaling can be a problem)
  – Convert range to red/blue
    • 0-0.5: White to Blue
    • 0.5-1: Blue to Red
[Figure: CGR grid with cells labeled by the nucleotides A, C, G, U.]
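The three steps (count, scale, color) can be sketched as follows. This is a hedged illustration: min-max scaling and linear color interpolation are assumptions, the function names are hypothetical, and the placement of each subpattern in the 2D CGR grid is not shown.

```python
from itertools import product

ALPHABET = "ACGU"

def subpattern_totals(sequences, k):
    """Count every length-k subpattern over all sequences (zero counts included)."""
    totals = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            sub = seq[i:i + k]
            if sub in totals:          # skip subpatterns with unexpected symbols
                totals[sub] += 1
    return totals

def scale_to_unit(totals):
    """Min-max scale the totals to [0, 1] (one possible scaling, assumed here)."""
    lo, hi = min(totals.values()), max(totals.values())
    span = hi - lo
    return {p: (n - lo) / span if span else 0.0 for p, n in totals.items()}

def to_rgb(x):
    """Map 0 -> white, 0.5 -> blue, 1 -> red, interpolating linearly in between."""
    if x <= 0.5:
        t = x / 0.5
        return (round(255 * (1 - t)), round(255 * (1 - t)), 255)
    t = (x - 0.5) / 0.5
    return (round(255 * t), 0, round(255 * (1 - t)))

# Hypothetical input sequences, chosen only to exercise the code.
totals = subpattern_totals(["UUCGUG", "GUGUUC"], k=3)
scaled = scale_to_unit(totals)
colors = {p: to_rgb(x) for p, x in scaled.items()}
```

With min-max scaling a single very frequent subpattern pushes every other cell toward white, which is one way the scaling problem noted on the slide can show up.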
07/03/06 - Tunisia 18
CGR Example
Homo Sapiens – all mature miRNA
Patterns of length 3
UUC
GUG
07/03/06 - Tunisia 19
Temporal CGR (TCGR)
• Temporal version of Frequency CGR
  – In our context, temporal means the starting location of a window
• 2D array
  – Each row represents counts for a particular window in the sequence
    • First row: first window
    • Last row: last window
    • We start successive windows at the next character location
  – Each column represents the counts for the associated pattern in that window
    • Initially we have assumed the order of patterns is alphabetic
• Size of TCGR depends on sequence length and subpattern length
• As sequence lengths vary, we only examine complete windows
• We only count patterns completely contained in each window
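The array described above can be sketched directly from the bullets. The function name `tcgr` and its parameters are hypothetical; the sample sequence is only there to exercise the code, with a window of 5 and patterns of length 2.

```python
from itertools import product

def tcgr(sequence, window, k, alphabet="acgt"):
    """Build a TCGR-style 2D array of counts.

    Row i holds the counts for the window starting at position i.
    Column j holds the counts for the j-th length-k pattern in alphabetic order.
    Only complete windows are used, and only patterns fully inside a window count.
    """
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=k)]
    index = {p: j for j, p in enumerate(patterns)}
    rows = []
    for start in range(len(sequence) - window + 1):    # complete windows only
        win = sequence[start:start + window]
        row = [0] * len(patterns)
        for i in range(window - k + 1):                # patterns inside the window
            sub = win[i:i + k]
            if sub in index:
                row[index[sub]] += 1
        rows.append(row)
    return patterns, rows

seq = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"
patterns, rows = tcgr(seq, window=5, k=2)
```

The result has one row per window start and one column per pattern, so its size depends on both the sequence length and the subpattern length, as the slide notes.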
07/03/06 - Tunisia 20
TCGR Example
acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga
Moving Window
Pattern Length: 1 2 3
07/03/06 - Tunisia 21
TCGR – Mature miRNA (Window=5; Pattern=2)
Panels: All Mature, Mus musculus, Homo sapiens, C. elegans
07/03/06 - Tunisia 23
The BIG PICTURE
2003-10-0515:49:20050721435700000026210000000000 02652026520000000002003-10-0516:40:49050832595900000872710001142380 07107071070000000002003-10-0504:55:10050767799900000191300000670518 00000000000000000002003-10-0509:43:10050781766100000603030000000000 03657004690000000002003-10-0514:49:360508182420000007066200000000000811a39 09142071070000000002003-10-0521:23:57050759031600000465050002794335 11992071070000000002003-10-0511:30:16050730512600000465050000195747 1684600597corduroy+coats
CAN’T SEE THE FOREST FOR THE TREES
[Diagram labels:]
• Web Server, WebLog
• Interests… Motivations…
• Preprocess Web Data: Cleanse, Sessionize, …
• URL Abstraction
• Cluster Web Sessions
• Markov Model per Cluster
• User-defined beginning/ending Web pages
• Normalized Probability
• Significant Usage Pattern: User Preferred Navigation Trail
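The Markov-model stage can be illustrated with a short sketch. Several things here are assumptions: the slides do not say how the probability is normalized, so the geometric mean of the transition probabilities is used as one plausible length normalization; the sessions, the page labels, and the function names are made up; and the 0.3 cutoff is only an example value.

```python
from collections import defaultdict

def transition_probabilities(sessions):
    """First-order Markov model over pages, with S (start) and E (end) markers added."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        path = ["S", *session, "E"]
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {
        a: {b: n / sum(nexts.values()) for b, n in nexts.items()}
        for a, nexts in counts.items()
    }

def normalized_probability(path, probs):
    """Geometric mean of the transition probabilities along `path` (assumed normalization)."""
    steps = list(zip(path, path[1:]))
    if not steps:
        return 0.0
    p = 1.0
    for a, b in steps:
        p *= probs.get(a, {}).get(b, 0.0)
    return p ** (1 / len(steps))

def is_significant(path, probs, threshold):
    """Treat a path as a significant usage pattern if it clears the threshold."""
    return normalized_probability(path, probs) >= threshold

# Hypothetical sessions from one cluster, after URL abstraction.
sessions = [["C1", "C2"], ["C1", "C2"], ["C1", "P1"], ["P1"]]
probs = transition_probabilities(sessions)
trail = ["S", "C1", "C2", "E"]
score = normalized_probability(trail, probs)
significant = is_significant(trail, probs, threshold=0.3)
```

Normalizing by path length keeps long trails from being penalized simply for having more transitions, which matters when session lengths differ widely between clusters.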
07/03/06 - Tunisia 25
Experimental Result
WebKDD’05 25
On average, purchase sessions are longer than sessions without a purchase:
– review the information, compare the price, the quality, etc.
– fill out the billing and shipping information to commit the purchase
[Figure: Average Session Length (Purchase vs. Non-Purchase Clusters); y-axis: Length, 0 to 50.]
07/03/06 - Tunisia 26
Experimental Result: SUPs in non-purchase cluster

Cluster No.   No. of Sessions   Threshold   Average Session Length   No. of States
1             1746              0.3         9.6                      98
2             241               0.37        6.6                      38
3             13                0.3         3.0                      6

SUPs, cluster 1:
1. S-C1-C1-C2-C3-C4-C5-C6-C7-E
2. S-C1-C1-C2-C3-C4-C5-E
3. S-C1-C1-C2-C3-E
4. S-C1-C2-C3-C3-C4-C5-C6-C7-E
5. S-C1-C2-C3-C4-C4-C5-C6-C7-E
…

SUPs, cluster 2:
1. S-P1-P2-P3-P3-E
2. S-P1-P2-P3-P4-P4-P5-E
3. S-P1-P2-P3-P4-P4-E
4. S-P1-P2-P3-P4-P5-P4-E
5. S-P1-P2-P3-P4-P5-P5-E
…

SUPs, cluster 3:
1. S-C1-P1-P2-E
2. S-C1-P1-E
3. S-I1-P1-P1-P2-E
4. S-I1-P1-P1-E
5. S-I1-P1-E
…

Additional SUPs shown on the slide:
S-C1-C1-C2-C3-C4-C5-C5-I1-E
S-C1-C1-I1-C1-C2-C3-C4-C5-E
S-I1-C1-C2-C3-C4-C5-C6-C7-E

Annotations:
• Interested in gathering information of products in different categories.
• Not serious visitors (the average session length is 3).
• Interested in reviewing general pages (to gather general information).
07/03/06 - Tunisia 27
Experimental Result

Cluster        Cluster No.   No. of Sessions   Average Session Length   No. of States   Threshold   Beginning Web page   SUPs in BNF Notation
Non-Purchase   1             1746              9.6                      98              0.3         S                    S-{C}-E
Non-Purchase   1             1746              9.6                      98              0.25        P86806               P86806-{C}-E
Non-Purchase   2             241               6.6                      38              0.37        S                    S-{P}-[C]-E
Non-Purchase   2             241               6.6                      38              0.34        P86806               P86806-[I]-{P}-E
Non-Purchase   3             13                3.0                      6               0.3         S                    S-<C | I>-{P}-E
Non-Purchase   3             13                3.0                      6               0.2         P86806               P86806-[{P}-[P86806]]-E
Purchase       1             1858              14.9                     55              0.47        S                    S-[C]-[I]-{P}-E
Purchase       1             1858              14.9                     55              0.51        P86806               P86806-[I]-{P}-E
Purchase       2             132               39.1                     100             0.457       S                    S-[{{C}|{I}}]-{P}-E
Purchase       2             132               39.1                     100             0.434       P86806               P86806-[{C}]-{P}-E
Purchase       3             10                31.6                     47              0.52        S                    S-{P}-[{I}]-[{P}]-{C}-E
Purchase       3             10                31.6                     47              0.43        P86806               P86806-[I]-[{P}]-{C}-E

• The average length of SUPs is longer in the purchase cluster than in the non-purchase cluster.
  – Review the information, compare among products, and fill out the payment and shipping information.
• SUPs in the purchase cluster have higher probability than those in the non-purchase cluster.
  – Have a purchase in mind vs. random browsing behavior.
07/03/06 - Tunisia 28