1
Association Rules Mining with SQL
Kirsten Nelson, Deepen Manek
November 24, 2003
2
Organization of Presentation
- Overview – Data Mining and RDBMS
- Loosely-coupled data and programs
- Tightly-coupled data and programs
- Architectural approaches
- Methods of writing efficient SQL
  - Candidate generation, pruning, support counting
  - K-way join, SubQuery, GatherJoin, Vertical, Hybrid
- Integrating taxonomies
- Mining sequential patterns
3
Early data mining applications
- Most early mining systems were developed largely on file systems, with specialized data structures and buffer management strategies devised for each
- All data was read into memory before beginning computation
- This limits the amount of data that can be mined
4
Advantages of SQL and RDBMS
- Make use of database indexing and query processing capabilities
- More than a decade has been spent on making these systems robust, portable, scalable, and concurrent
- Exploit underlying SQL parallelization
- For long-running algorithms, use checkpointing and space management
6
Use of Database in Data Mining
- “Loose coupling” of application and data
- How would you write an Apriori program?
  - Use SQL statements in an application
  - Use a cursor interface to read through records sequentially for each pass
- Still two major performance problems:
  - Copying of records from the database to memory
  - Process context switching for each record retrieved
8
Tightly-coupled applications
- Push computations into the database system to avoid performance degradation
- Take advantage of user-defined functions (UDFs)
- Does not require changes to database software
- Two types of UDFs we will use:
  - Ones that are executed only a few times, regardless of the number of rows
  - Ones that are executed once for each selected row
9
Tight-coupling using UDFs
Procedure TightlyCoupledApriori():
begin
  exec sql connect to database;
  exec sql select allocSpace() into :blob from onerecord;
  exec sql select * from sales
    where GenL1(:blob, TID, ITEMID) = 1;
  notDone := true;
  while notDone do {
    exec sql select aprioriGen(:blob) into :blob from onerecord;
    exec sql select * from sales
      where itemCount(:blob, TID, ITEMID) = 1;
    exec sql select GenLk(:blob) into :notDone from onerecord;
  }
  exec sql select getResult(:blob) into :resultBlob from onerecord;
  exec sql select deallocSpace(:blob) from onerecord;
  compute Answer using resultBlob;
end
13
Methodology
- Comparison done with Association Rules against IBM DB2
- Only consider generation of frequent itemsets using the Apriori algorithm
- Five alternatives considered:
  - Loose-coupling through SQL cursor interface – as described earlier
  - UDF tight-coupling – as described earlier
  - Stored-procedure to encapsulate the mining algorithm
  - Cache-mine – caching data and mining on the fly
  - SQL implementations to force processing in the database
- Consider two classes of SQL implementations:
  - SQL-92 – four different implementations
  - SQL-OR (with object-relational extensions) – six implementations
14
Architectural Options
- Stored procedure
  - Apriori algorithm encapsulated as a stored procedure
  - Implication: runs in the same address space as the DBMS
  - Mined results stored back into the DBMS
- Cache-mine
  - Variation of stored-procedure
  - Read the entire data once from the DBMS, temporarily cache the data in a lookaside buffer on a local disk
  - Cached data is discarded when execution completes
  - Disadvantage – requires additional disk space for caching
  - Uses Intelligent Miner’s “space” option
16
Terminology
- T: table of {tid, item} pairs
  - Data is normally sorted by transaction id
- Ck: candidate k-itemsets
  - Obtained by joining and pruning the frequent itemsets from the previous iteration
- Fk: frequent itemsets of length k
  - Obtained from Ck and T
17
Candidate Generation in SQL – join step

Generate Ck from Fk-1 by joining Fk-1 with itself:

insert into Ck
select I1.item1, …, I1.itemk-1, I2.itemk-1
from Fk-1 I1, Fk-1 I2
where I1.item1 = I2.item1 and … and
      I1.itemk-2 = I2.itemk-2 and
      I1.itemk-1 < I2.itemk-1
18
Candidate Generation Example

Table F3 (used as both I1 and I2 in the self-join):

item1  item2  item3
1      2      3
1      2      4
1      3      4
1      3      5
2      3      4

F3 is {{1,2,3},{1,2,4},{1,3,4},{1,3,5},{2,3,4}}; the join produces C4 = {{1,2,3,4},{1,3,4,5}}
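The join step above can be run verbatim for k = 4. A minimal sketch, using an in-memory SQLite database standing in for DB2; table and column names follow the slides, and the data is the example F3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE F3 (item1 INT, item2 INT, item3 INT)")
cur.executemany("INSERT INTO F3 VALUES (?,?,?)",
                [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)])

# Join step for k = 4: equal on the first k-2 items, ordered on the last one.
cur.execute("""
    SELECT I1.item1, I1.item2, I1.item3, I2.item3
    FROM F3 I1, F3 I2
    WHERE I1.item1 = I2.item1
      AND I1.item2 = I2.item2
      AND I1.item3 < I2.item3
""")
rows = sorted(cur.fetchall())
print(rows)   # → [(1, 2, 3, 4), (1, 3, 4, 5)]
```

The `<` on the last item keeps each candidate in lexicographic order and generated exactly once.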
19
Pruning
- Modify the candidate generation algorithm to ensure all k subsets of length (k-1) of each candidate in Ck are in Fk-1
- Do a k-way join, skipping itemn-2 when joining with the nth table (2 < n ≤ k)
- Create a primary index (item1, …, itemk-1) on Fk-1 to efficiently process the k-way join
- For k = 4, this becomes:

insert into C4
select I1.item1, I1.item2, I1.item3, I2.item3
from F3 I1, F3 I2, F3 I3, F3 I4
where I1.item1 = I2.item1 and I1.item2 = I2.item2 and I1.item3 < I2.item3 and
      I1.item2 = I3.item1 and I1.item3 = I3.item2 and I2.item3 = I3.item3 and
      I1.item1 = I4.item1 and I1.item3 = I4.item2 and I2.item3 = I4.item3
20
Pruning Example

Evaluate the join with I3 using the previous example; after pruning, C4 is {1,2,3,4} ({1,3,4,5} is dropped because its subset {3,4,5} is not in F3)

Tables F3 (I1), F3 (I2), and F3 (I3) all contain:

item1  item2  item3
1      2      3
1      2      4
1      3      4
1      3      5
2      3      4
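The full 4-way join with both subset checks (I3 and I4) can be sketched the same way on SQLite; the data is the slide's F3, and the query is the k = 4 pruning statement:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE F3 (item1 INT, item2 INT, item3 INT)")
cur.executemany("INSERT INTO F3 VALUES (?,?,?)",
                [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)])

# I3 checks the subset {item2, item3, item4}; I4 checks {item1, item3, item4}.
cur.execute("""
    SELECT I1.item1, I1.item2, I1.item3, I2.item3
    FROM F3 I1, F3 I2, F3 I3, F3 I4
    WHERE I1.item1 = I2.item1 AND I1.item2 = I2.item2 AND I1.item3 < I2.item3
      AND I1.item2 = I3.item1 AND I1.item3 = I3.item2 AND I2.item3 = I3.item3
      AND I1.item1 = I4.item1 AND I1.item3 = I4.item2 AND I2.item3 = I4.item3
""")
c4 = cur.fetchall()
print(c4)   # → [(1, 2, 3, 4)]  ({1,3,4,5} pruned: {3,4,5} is not in F3)
```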
21
Support counting using SQL
- Two different approaches:
  - Use the SQL-92 standard
    - Use ‘standard’ SQL syntax such as joins and subqueries to find the support of itemsets
  - Use object-relational extensions of SQL (SQL-OR)
    - User-defined functions (UDFs) & table functions
    - Binary Large Objects (BLOBs)
22
Support Counting using SQL-92
- 4 different methods, two of which are detailed in the papers:
  - K-way Joins
  - SubQuery
- Other methods not discussed because of unacceptable performance:
  - 3-way join
  - 2 Group-Bys
23
SQL-92: K-way join
- Obtain Fk by joining Ck with k copies of the (tid, item) table T
- Perform a group by on the itemset

insert into Fk select item1, …, itemk, count(*)
from Ck, T t1, …, T tk
where t1.item = Ck.item1 and … and tk.item = Ck.itemk and
      t1.tid = t2.tid and … and tk-1.tid = tk.tid
group by item1, …, itemk
having count(*) > :minsup
24
K-way join example
- C3 = {B,C,E} and the minimum support required is 2
- The query inserts {B,C,E,2} into F3
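This example can be sketched end to end on SQLite with the class data (tids 10–40, items A–E, as tabulated in the GatherJoin example later). One assumption: the slide writes `having count(*) > :minsup`, but for a support of exactly 2 to qualify, the sketch uses `>=`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE T (tid INT, item TEXT)")
cur.executemany("INSERT INTO T VALUES (?,?)",
    [(10, 'A'), (10, 'C'), (10, 'D'), (20, 'B'), (20, 'C'), (20, 'E'),
     (30, 'A'), (30, 'B'), (30, 'C'), (30, 'E'), (40, 'B'), (40, 'E')])
cur.execute("CREATE TABLE C3 (item1 TEXT, item2 TEXT, item3 TEXT)")
cur.execute("INSERT INTO C3 VALUES ('B','C','E')")

# 3-way join: one copy of T per candidate position, all on the same tid.
cur.execute("""
    SELECT item1, item2, item3, COUNT(*)
    FROM C3, T t1, T t2, T t3
    WHERE t1.item = C3.item1 AND t2.item = C3.item2 AND t3.item = C3.item3
      AND t1.tid = t2.tid AND t2.tid = t3.tid
    GROUP BY item1, item2, item3
    HAVING COUNT(*) >= 2
""")
f3 = cur.fetchall()
print(f3)   # → [('B', 'C', 'E', 2)]  (tids 20 and 30 contain B, C, E)
```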
25
K-way join: Pass-2 optimization
- When calculating C2, no pruning is required after we join F1 with itself
- Don’t calculate and materialize C2 – replace C2 in the 2-way join with a join of F1 with itself

insert into F2 select I1.item1, I2.item1, count(*)
from F1 I1, F1 I2, T t1, T t2
where I1.item1 < I2.item1 and
      t1.item = I1.item1 and t2.item = I2.item1 and
      t1.tid = t2.tid
group by I1.item1, I2.item1
having count(*) > :minsup
26
SQL-92: SubQuery based
- Split support counting into a cascade of k subqueries
- The nth subquery Qn finds all tids that match the distinct itemsets formed by the first n items of Ck

insert into Fk select item1, …, itemk, count(*)
from (Subquery Qk) t
group by item1, item2, …, itemk
having count(*) > :minsup

Subquery Qn (for any n between 1 and k):
select item1, …, itemn, tid
from T tn, (Subquery Qn-1) as rn-1,
     (select distinct item1, …, itemn from Ck) as dn
where rn-1.item1 = dn.item1 and … and rn-1.itemn-1 = dn.itemn-1 and
      rn-1.tid = tn.tid and tn.item = dn.itemn
27
Example of SubQuery based
- Using the previous example from class: C3 = {B,C,E}, minimum support = 2
- Q0: there is no subquery Q0
- Q1 in this case becomes:

select item1, tid
from T t1,
     (select distinct item1 from C3) as d1
where t1.item = d1.item1
28
Example of SubQuery based cnt’d

Q2 becomes:

select item1, item2, tid
from T t2, (Subquery Q1) as r1,
     (select distinct item1, item2 from C3) as d2
where r1.item1 = d2.item1 and r1.tid = t2.tid and t2.item = d2.item2
29
Example of SubQuery based cnt’d

Q3 becomes:

select item1, item2, item3, tid
from T t3, (Subquery Q2) as r2,
     (select distinct item1, item2, item3 from C3) as d3
where r2.item1 = d3.item1 and r2.item2 = d3.item2 and
      r2.tid = t3.tid and t3.item = d3.item3
30
Example of SubQuery based cnt’d

Output of Q3 is:

item1  item2  item3  tid
B      C      E      20
B      C      E      30

The insert statement becomes:

insert into F3 select item1, item2, item3, count(*)
from (Subquery Q3) t
group by item1, item2, item3
having count(*) > :minsup

- Inserts the row {B,C,E,2}
- For Q2, the pass-2 optimization can be used
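The Q1→Q2→Q3 cascade can be assembled as one nested SQLite query over the same class data. A sketch, with the subqueries built as strings (aliases are added explicitly, and `>=` is used so support 2 qualifies, as above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE T (tid INT, item TEXT)")
cur.executemany("INSERT INTO T VALUES (?,?)",
    [(10, 'A'), (10, 'C'), (10, 'D'), (20, 'B'), (20, 'C'), (20, 'E'),
     (30, 'A'), (30, 'B'), (30, 'C'), (30, 'E'), (40, 'B'), (40, 'E')])
cur.execute("CREATE TABLE C3 (item1 TEXT, item2 TEXT, item3 TEXT)")
cur.execute("INSERT INTO C3 VALUES ('B','C','E')")

q1 = """SELECT d1.item1 AS item1, t1.tid AS tid
        FROM T t1, (SELECT DISTINCT item1 FROM C3) d1
        WHERE t1.item = d1.item1"""
q2 = f"""SELECT d2.item1 AS item1, d2.item2 AS item2, t2.tid AS tid
         FROM T t2, ({q1}) r1, (SELECT DISTINCT item1, item2 FROM C3) d2
         WHERE r1.item1 = d2.item1 AND r1.tid = t2.tid AND t2.item = d2.item2"""
q3 = f"""SELECT d3.item1 AS item1, d3.item2 AS item2, d3.item3 AS item3,
                t3.tid AS tid
         FROM T t3, ({q2}) r2, (SELECT DISTINCT item1, item2, item3 FROM C3) d3
         WHERE r2.item1 = d3.item1 AND r2.item2 = d3.item2
           AND r2.tid = t3.tid AND t3.item = d3.item3"""
cur.execute(f"""SELECT item1, item2, item3, COUNT(*) FROM ({q3})
                GROUP BY item1, item2, item3 HAVING COUNT(*) >= 2""")
f3 = cur.fetchall()
print(f3)   # → [('B', 'C', 'E', 2)]
```

Each level narrows the tid set: Q1 keeps tids containing B (20, 30, 40), Q2 keeps those also containing C (20, 30), Q3 those also containing E.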
31
Performance Comparisons of SQL-92 approaches
- Used Version 5 of DB2 UDB and an RS/6000 Model 140
  - 200 MHz CPU, 256 MB main memory, 9 GB of disk space, transfer rate of 8 MB/sec
- Used 4 different datasets based on real-world data
- Built the following indexes, which are not included in any cost calculations:
  - Composite index (item1, …, itemk) on Ck
  - k different indices, one on each of the k items of Ck
  - (item, tid) and (tid, item) indexes on the data table T
32
Performance Comparisons of SQL-92 approaches
- Best performance was obtained by the SubQuery approach
- SubQuery was only comparable to loose-coupling in some cases, failing to complete in others
  - For Dataset-C with a support of 2%, SubQuery outperforms loose-coupling, but when the support decreases to 1%, SubQuery takes 10 times as long to complete
  - Lower support increases the size of Ck and Fk at each step, causing the join to process more rows

Datasets   #Records (millions)  #Transactions (millions)  #Items (thousands)  Avg #Items
Dataset-A  2.5                  0.57                      85                  4.4
Dataset-B  7.5                  2.5                       15.8                2.62
Dataset-C  6.6                  0.21                      15.8                31
Dataset-D  14                   1.44                      480                 9.62
33
Support Counting using SQL with object-relational extensions
- 6 different methods, four of which are detailed in the papers:
  - GatherJoin
  - GatherCount
  - GatherPrune
  - Vertical
- Other methods not discussed because of unacceptable performance:
  - Horizontal
  - SBF
34
SQL Object-Relational Extension: GatherJoin
- Generates all possible k-item combinations of the items contained in a transaction and joins them with Ck
- An index is created on all items of Ck
- Uses the following table functions:
  - Gather: outputs records {tid, item-list}, with item-list being a BLOB or VARCHAR containing all items associated with the tid
  - Comb-K: returns all k-item combinations from the transaction
    - Output has k attributes T_itm1, …, T_itmk
35
GatherJoin

insert into Fk select item1, …, itemk, count(*)
from Ck,
     (select t2.T_itm1, …, t2.T_itmk from T,
      table(Gather(T.tid, T.item)) as t1,
      table(Comb-K(t1.tid, t1.item-list)) as t2)
where t2.T_itm1 = Ck.item1 and … and
      t2.T_itmk = Ck.itemk
group by Ck.item1, …, Ck.itemk
having count(*) > :minsup
36
Example of GatherJoin

t1 (output from Gather) looks like:

Tid  Item-List
10   A,C,D
20   B,C,E
30   A,B,C,E
40   B,E

t2 (generated by Comb-K from t1) will be joined with C3 to obtain F3:
- 1 row from tid 10, 1 row from tid 20, 4 rows from tid 30
- Insert {B,C,E,2}
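What Gather and Comb-K produce can be mimicked in plain Python, with a dict standing in for Gather's {tid, item-list} output and `itertools.combinations` standing in for the Comb-3 table function:

```python
from itertools import combinations

# Gather output: one item-list per tid (the table above).
T = {10: ['A', 'C', 'D'], 20: ['B', 'C', 'E'],
     30: ['A', 'B', 'C', 'E'], 40: ['B', 'E']}
C3 = {('B', 'C', 'E')}
minsup = 2

counts = {}
for tid, items in T.items():
    for combo in combinations(sorted(items), 3):   # Comb-3 per transaction
        if combo in C3:                            # join with C3
            counts[combo] = counts.get(combo, 0) + 1

f3 = {c: n for c, n in counts.items() if n >= minsup}
print(f3)   # → {('B', 'C', 'E'): 2}
```

Tid 10 and 20 each contribute one 3-combination, tid 30 contributes four, and tid 40 (only two items) contributes none, matching the row counts above.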
37
GatherJoin: Pass 2 optimization
- When calculating C2, no pruning is required after we join F1 with itself
- Don’t calculate and materialize C2 – replace C2 with a join to F1 before the table function
- Gather is only passed frequent 1-itemset rows

insert into F2 select I1.item1, I2.item1, count(*)
from F1 I1,
     (select t2.T_itm1, t2.T_itm2 from T,
      table(Gather(T.tid, T.item)) as t1,
      table(Comb-K(t1.tid, t1.item-list)) as t2
      where T.item = I1.item1)
group by t2.T_itm1, t2.T_itm2
having count(*) > :minsup
38
Variations of GatherJoin – GatherCount
- Perform the GROUP BY inside the table function Comb-K for the pass-2 optimization
- Output of the table function Comb-K:
  - Not the candidate frequent itemsets (Ck)
  - But the actual frequent itemsets (Fk) along with the corresponding support
- Uses a 2-dimensional array to store possible frequent itemsets in Comb-K
  - May lead to excessive memory use
39
Variations of GatherJoin – GatherPrune
- Push the join with Ck into the table function Comb-K
- Ck is converted into a BLOB and passed as an argument to the table function
  - The BLOB must be passed on each invocation of Comb-K – once per row of table T
40
SQL Object-Relational Extension: Vertical
- For each item, create a BLOB containing the tids the item belongs to
  - Use the function Gather to generate {item, tid-list} pairs, storing the results in table TidTable
  - Tid-lists are all in the same sorted order
- Use the function Intersect to compare two different tid-lists and extract the common values
- Pass-2 optimization can be used for Vertical
  - Similar to the K-way join method
  - The upcoming example does not show this optimization
41
Vertical

insert into Fk select item1, …, itemk, count(tid-list) as cnt
from (Subquery Qk) t
where cnt > :minsup

Subquery Qn (for any n between 2 and k):
select item1, …, itemn, Intersect(rn-1.tid-list, tn.tid-list) as tid-list
from TidTable tn, (Subquery Qn-1) as rn-1,
     (select distinct item1, …, itemn from Ck) as dn
where rn-1.item1 = dn.item1 and … and
      rn-1.itemn-1 = dn.itemn-1 and
      tn.item = dn.itemn

Subquery Q1: (select * from TidTable)
42
Example of Vertical
- Using the previous example from class: C3 = {B,C,E}, minimum support = 2
- Q1 is TidTable:

Item  Tid-List
A     10,30
B     20,30,40
C     10,20,30
D     10
E     20,30,40
43
Example of Vertical cnt’d

Q2 becomes:

select item1, item2, Intersect(r1.tid-list, t2.tid-list) as tid-list
from TidTable t2, (Subquery Q1) as r1,
     (select distinct item1, item2 from C3) as d2
where r1.item1 = d2.item1 and t2.item = d2.item2
44
Example of Vertical cnt’d

Q3 becomes:

select item1, item2, item3, Intersect(r2.tid-list, t3.tid-list) as tid-list
from TidTable t3, (Subquery Q2) as r2,
     (select distinct item1, item2, item3 from C3) as d3
where r2.item1 = d3.item1 and r2.item2 = d3.item2 and
      t3.item = d3.item3
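The cascade of Intersect calls can be sketched with Python sets standing in for the tid-list BLOBs and the Intersect UDF; the data is the TidTable from the example:

```python
# Python sets stand in for the sorted tid-list BLOBs of the Vertical approach.
tid_table = {'A': {10, 30}, 'B': {20, 30, 40}, 'C': {10, 20, 30},
             'D': {10}, 'E': {20, 30, 40}}
C3 = [('B', 'C', 'E')]
minsup = 2

f3 = []
for cand in C3:
    tids = tid_table[cand[0]]
    for item in cand[1:]:            # cascade of Intersect calls (Q2, Q3, ...)
        tids = tids & tid_table[item]
    if len(tids) >= minsup:          # count(tid-list) vs :minsup
        f3.append((cand, len(tids)))
print(f3)   # → [(('B', 'C', 'E'), 2)]  (B∩C∩E = {20, 30})
```

The real implementation intersects sorted BLOBs with a merge scan rather than hash sets, which is why keeping all tid-lists in the same sorted order matters.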
45
Performance Comparisons using SQL-OR

[Bar charts: running time in seconds for Vert, Gprune, Gjoin, and Gcount on Data Set A (supports 0.5%, 0.35%, 0.2%) and Data Set B (supports 0.1%, 0.03%, 0.01%), broken down into Prep, Pass 1, Pass 2, Pass 3, and Pass 4]

Datasets   #Records (millions)  #Transactions (millions)  #Items (thousands)  Avg #Items
Dataset-A  2.5                  0.57                      85                  4.4
Dataset-B  7.5                  2.5                       15.8                2.62
46
Performance Comparisons using SQL-OR

[Bar charts: running time in seconds for Vert, Gprune, Gjoin, and Gcount on Data Set C (supports 2.0%, 1.0%, 0.25%) and for Vert, Gjoin, and Gcount on Data Set D (supports 0.2%, 0.07%, 0.02%), broken down into Prep, Pass 1, Pass 2, Pass 3, and Pass 4]

Datasets   #Records (millions)  #Transactions (millions)  #Items (thousands)  Avg #Items
Dataset-C  6.6                  0.21                      15.8                31
Dataset-D  14                   1.44                      480                 9.62
47
Performance comparison of SQL object-relational approaches
- Vertical has the best overall performance, sometimes an order of magnitude better than the other 3 approaches
  - The majority of its time is spent transforming the data into {item, tid-list} pairs
  - Vertical spends too much time on the second pass
- The pass-2 optimization has a huge impact on the performance of GatherJoin
  - For Dataset-B with a support of 0.1%, the running time for Pass 2 went from 5.2 hours to 10 minutes
- Comb-K in GatherJoin generates a large number of potential frequent itemsets that we must work with
48
Hybrid approach
- Previous charts and algorithm analysis show:
  - Vertical spends too much time on pass 2 compared to the other algorithms, especially when the support is decreased
  - GatherJoin degrades when the number of frequent items per transaction increases
- To improve performance, use a hybrid algorithm:
  - Use Vertical for most cases
  - When the size of the candidate itemset is too large, GatherJoin is a good option if the number of frequent items per transaction (Nf) is not too large
  - When Nf is large, GatherCount may be the only good option
49
Architecture Comparisons
- Compare five alternatives:
  - Loose-Coupling, Stored-procedure
    - Basically the same except for the address space the program is run in
    - Because of the limited difference in performance, the following charts focus solely on stored procedure
  - Cache-Mine
  - UDF tight-coupling
  - Best SQL approach (Hybrid)
50
Performance Comparisons of Architectures

[Bar charts: running time in seconds for Cache, Sproc, UDF, and SQL on Data Set A (supports 0.5%, 0.35%, 0.2%) and Data Set B (supports 0.1%, 0.03%, 0.01%), broken down into Pass 1, Pass 2, Pass 3, and Pass 4]

Datasets   #Records (millions)  #Transactions (millions)  #Items (thousands)  Avg #Items
Dataset-A  2.5                  0.57                      85                  4.4
Dataset-B  7.5                  2.5                       15.8                2.62
51
Performance Comparisons of Architectures cnt’d

[Bar charts: running time in seconds for Cache, Sproc, UDF, and SQL on Data Set C (supports 2.0%, 1.0%, 0.25%) and Data Set D (supports 0.2%, 0.07%, 0.02%), broken down into Pass 1, Pass 2, Pass 3, and Pass 4]

Datasets   #Records (millions)  #Transactions (millions)  #Items (thousands)  Avg #Items
Dataset-C  6.6                  0.21                      15.8                31
Dataset-D  14                   1.44                      480                 9.62
52
Performance Comparisons of Architectures cnt’d
- Cache-Mine has the best or close to the best performance in all cases
  - A factor of 0.8 to 2 times the speed of the SQL approach
- Stored procedure is the worst
  - The difference between Cache-Mine and stored procedure is directly related to the number of passes through the data
  - Passes increase when the support goes down
  - May need to make multiple passes if all candidates cannot fit in memory
- UDF time per pass decreases 30–50% compared to stored procedure because of the tighter coupling with the DB
53
Performance Comparisons of Architectures cnt’d
- The SQL approach comes in second in performance to Cache-Mine
- Somewhat better than Cache-Mine for high support values
- 1.8 – 3 times better than the Stored-procedure/loose-coupling approach, getting better as the support value decreases
- The cost of converting to Vertical format is less than the cost of converting to binary format in Cache-Mine
- For the second pass through the data, the SQL approach takes much more time than Cache-Mine, particularly when we decrease the support
54
Organization of Presentation Overview – Data Mining and RDBMS Loosely-coupled data and programs Tightly-coupled data and programs Architectural approaches Methods of writing efficient SQL
Candidate generation, pruning, support counting K-way join, SubQuery, GatherJoin, Vertical, Hybrid
Integrating taxonomies Mining sequential patterns
55
Taxonomies – example

Beverages → {Soft Drinks, Alcoholic Drinks}; Snacks → {Pretzels, Chocolate Bar}
Soft Drinks → {Pepsi, Coke}; Alcoholic Drinks → {Beer}

Parent            Child
Beverages         Soft Drinks
Beverages         Alcoholic Drinks
Soft Drinks       Pepsi
Soft Drinks       Coke
Alcoholic Drinks  Beer
Snacks            Pretzels
Snacks            Chocolate Bar

Example rule: Soft Drinks ⇒ Pretzels with 30% confidence, 2% support
56
Taxonomy augmentation
- Algorithms similar to previous slides
- Requires two additions to the algorithm:
  - Pruning itemsets containing an item and its ancestor
  - Pre-computing the ancestors of each item
- Will also consider support counting
57
Pruning items and ancestors
- In the second pass we join F1 with F1 to give C2
- This will give, for example:
  - beverages, pepsi
  - snacks, coke
  - pretzels, chocolate bar
- But (beverages, pepsi) is redundant!
58
Pruning items and ancestors
- The following modification to the SQL statement eliminates such redundant combinations from being selected:

insert into C2
(select I1.item1, I2.item1 from F1 I1, F1 I2
 where I1.item1 < I2.item1)
except
(select ancestor, descendant from Ancestor
 union
 select descendant, ancestor from Ancestor)
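A sketch of this on SQLite, with a small slice of the example taxonomy as F1. One deviation: SQLite is happier with two chained EXCEPTs than with a parenthesized EXCEPT-of-a-UNION, and the two forms are equivalent here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE F1 (item1 TEXT)")
cur.executemany("INSERT INTO F1 VALUES (?)",
                [('beverages',), ('soft drinks',), ('pepsi',),
                 ('snacks',), ('pretzels',)])
cur.execute("CREATE TABLE Ancestor (ancestor TEXT, descendant TEXT)")
cur.executemany("INSERT INTO Ancestor VALUES (?,?)",
                [('beverages', 'soft drinks'), ('beverages', 'pepsi'),
                 ('soft drinks', 'pepsi')])

# Self-join of F1, minus every (ancestor, descendant) pair in either order.
cur.execute("""
    SELECT I1.item1, I2.item1 FROM F1 I1, F1 I2 WHERE I1.item1 < I2.item1
    EXCEPT
    SELECT ancestor, descendant FROM Ancestor
    EXCEPT
    SELECT descendant, ancestor FROM Ancestor
""")
c2 = cur.fetchall()
print(len(c2), ('beverages', 'pepsi') in c2)   # → 7 False
```

Of the 10 ordered pairs from the self-join, the 3 item–ancestor pairs are removed, so no candidate contains an item together with one of its ancestors.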
59
Pre-computing ancestors
- An Ancestor table is created
  - Format: (ancestor, descendant)
- Use the transitive closure operation:

insert into Ancestor
with R-Tax (ancestor, descendant) as
  (select parent, child from Tax
   union all
   select p.ancestor, c.child
   from R-Tax p, Tax c
   where p.descendant = c.parent)
select ancestor, descendant from R-Tax
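The same transitive closure runs on SQLite with `WITH RECURSIVE` standing in for DB2's recursive `WITH`; Tax holds the (parent, child) edges of the example taxonomy:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Tax (parent TEXT, child TEXT)")
cur.executemany("INSERT INTO Tax VALUES (?,?)",
                [('beverages', 'soft drinks'), ('beverages', 'alcoholic drinks'),
                 ('soft drinks', 'pepsi'), ('soft drinks', 'coke'),
                 ('alcoholic drinks', 'beer'),
                 ('snacks', 'pretzels'), ('snacks', 'chocolate bar')])

cur.execute("""
    WITH RECURSIVE RTax (ancestor, descendant) AS (
        SELECT parent, child FROM Tax
        UNION ALL
        SELECT p.ancestor, c.child
        FROM RTax p, Tax c
        WHERE p.descendant = c.parent
    )
    SELECT ancestor, descendant FROM RTax
""")
anc = set(cur.fetchall())
print(('beverages', 'pepsi') in anc)   # → True (a transitive, non-edge pair)
```

The 7 direct edges plus the 3 transitive pairs (beverages→pepsi, beverages→coke, beverages→beer) give 10 Ancestor rows. Since a taxonomy is acyclic, UNION ALL terminates.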
60
Support Counting
- Extensions to handle taxonomies
  - Straightforward, but non-trivial
- Need an extended transaction table
  - For example, if we have {coke, pretzels}, we also add {soft drinks, pretzels}, {beverages, pretzels}, {coke, snacks}, {soft drinks, snacks}, {beverages, snacks}
61
Extended transaction table
- T* can be obtained by the following SQL query:

select item, tid from T
union
select distinct A.ancestor as item, T.tid
from T, Ancestor A
where A.descendant = T.item

- The “select distinct” clause gets rid of duplicates from items with a common ancestor – e.g. we don’t want {beverages, tid} added twice when a transaction contains both pepsi and coke
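A sketch of the T* query on SQLite for a single transaction containing coke, pepsi, and pretzels; the Ancestor rows follow the example taxonomy:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE T (tid INT, item TEXT)")
cur.executemany("INSERT INTO T VALUES (?,?)",
                [(10, 'coke'), (10, 'pretzels'), (10, 'pepsi')])
cur.execute("CREATE TABLE Ancestor (ancestor TEXT, descendant TEXT)")
cur.executemany("INSERT INTO Ancestor VALUES (?,?)",
                [('soft drinks', 'coke'), ('beverages', 'coke'),
                 ('soft drinks', 'pepsi'), ('beverages', 'pepsi'),
                 ('snacks', 'pretzels')])

cur.execute("""
    SELECT item, tid FROM T
    UNION
    SELECT DISTINCT A.ancestor AS item, T.tid
    FROM T, Ancestor A
    WHERE A.descendant = T.item
""")
tstar = cur.fetchall()
print(sorted(tstar))
```

The 3 original rows are extended with 3 ancestor rows (soft drinks, beverages, snacks), not 5: although coke and pepsi both map to soft drinks and beverages, each (ancestor, tid) pair appears only once.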
62
Pipelining of Query
- No need to actually build T*
- Make the following modification to the SQL:

insert into Fk
with T* (tid, item) as (Query for T*)
select item1, …, itemk, count(*)
from Ck, T* t1, …, T* tk
where t1.item = Ck.item1 and … and tk.item = Ck.itemk and
      t1.tid = t2.tid and … and tk-1.tid = tk.tid
group by item1, …, itemk
having count(*) > :minsup
63
Organization of Presentation Overview – Data Mining and RDBMS Loosely-coupled data and programs Tightly-coupled data and programs Architectural approaches Methods of writing efficient SQL
Candidate generation, pruning, support counting K-way join, SubQuery, GatherJoin, Vertical, Hybrid
Integrating taxonomies Mining sequential patterns
64
Sequential patterns
- Similar to papers covered on Nov 17
- Input is sequences of transactions
  - E.g. ((computer, modem), (printer))
- Similar to association rules, but dealing with sequences as opposed to sets
- Can also specify maximum and minimum time gaps, as well as sliding time windows
  - Max-gap, min-gap, window-size
65
Input and output formats
- Input has three columns:
  - Sequence identifier (sid)
  - Transaction time (time)
  - Item identifier (item)
- Output format is a collection of frequent sequences, in a fixed-width table (item1, eno1, …, itemk, enok, len)
  - For smaller lengths, the extra column values are set to NULL
66
GSP algorithm
- Similar to the algorithms shown earlier
- Each Ck has transactions and times, but no length column – it has a fixed length of k
- Candidates are generated in two steps:
  - Join – join Fk-1 with itself
    - Sequence s1 joins with s2 if the subsequence obtained by dropping the first item of s1 is the same as the one obtained by dropping the last item of s2
    - When generating C2, we need to generate sequences where both of the items appear as a single element as well as two separate elements
  - Prune
    - All candidate sequences that have a non-frequent contiguous (k-1)-subsequence are deleted
67
GSP – Join SQL

insert into Ck
select I1.item1, I1.eno1, …, I1.itemk-1, I1.enok-1, I2.itemk-1,
       I1.enok-1 + I2.enok-1 – I2.enok-2
from Fk-1 I1, Fk-1 I2
where I1.item2 = I2.item1 and … and I1.itemk-1 = I2.itemk-2 and
      I1.eno3 – I1.eno2 = I2.eno2 – I2.eno1 and … and
      I1.enok-1 – I1.enok-2 = I2.enok-2 – I2.enok-3
68
GSP – Prune SQL
- Write as a k-way join, similar to before
- There are at most k contiguous subsequences of length (k-1) for which Fk-1 needs to be checked for membership
- Note that not all (k-1)-subsequences need be checked, because of the max-gap constraint between consecutive elements
69
GSP – Support Counting
- In each pass, we use the candidate table Ck and the input data-sequences table D to count the support
- K-way join
  - We use select distinct before the group by to ensure that only distinct data-sequences are counted
  - We have additional predicates between element numbers to handle the special time constraints
70
GSP – Support Counting SQL

(Ck.enoj = Ck.enoi and abs(dj.time – di.time) ≤ window-size) or
(Ck.enoj = Ck.enoi + 1 and dj.time – di.time ≤ max-gap and dj.time – di.time > min-gap) or
(Ck.enoj > Ck.enoi + 1)
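One reading of this predicate as a Python function, for a single pair of matched items; the window-size, min-gap, and max-gap values are invented for the demo, and the non-adjacent branch is left unconstrained here as in the clause above:

```python
# Time-constraint check between the item matched at element eno_i (time t_i)
# and the item matched at element eno_j (time t_j). Demo values, not canonical.
def time_ok(eno_i, t_i, eno_j, t_j, window_size=1, min_gap=0, max_gap=30):
    if eno_j == eno_i:                   # same element: sliding-window rule
        return abs(t_j - t_i) <= window_size
    if eno_j == eno_i + 1:               # consecutive elements: gap bounds
        return min_gap < t_j - t_i <= max_gap
    return True                          # non-adjacent: no constraint here

print(time_ok(1, 10, 2, 25))   # → True  (gap 15 lies in (min-gap, max-gap])
print(time_ok(1, 10, 2, 50))   # → False (gap 40 exceeds max-gap)
```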
71
References
1. Developing Tightly-Coupled Data Mining Applications on a Relational Database System. Rakesh Agrawal, Kyuseok Shim, 1996.
2. Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications. Sunita Sarawagi, Shiby Thomas, Rakesh Agrawal, 1998. (Refers to 1 above.)
3. Mining Generalized Association Rules and Sequential Patterns Using SQL Queries. Shiby Thomas, Sunita Sarawagi, 1998. (Refers to 1 and 2 above.)