V Storage Manager
Shahram Ghandeharizadeh
Computer Science Department
University of Southern California
Traces
Make sure your persistent BDB is configured with 256 MB of memory.
With a trace, say 21, use its “21Objs.Save” to create and populate your persistent database. Subsequently, use its “Trace21.1KGet” to debug your software. Start with 1 thread and expand to 2, 3, and 4.
Try to make your software as efficient as possible. If it is too slow (maybe because of low byte hit rates) then you may not be able to run “Trace21.1MGet”.
Questions
Will there be another release of the workload generator before Friday?
  I do not anticipate one unless there is a bug report.
Is there an obvious item missing from the current workload generator?
  Mandatory: Invocation of the method to report cache and byte hit rates.
  Optional: Dump the content of the cache to analyze the behavior of your cache replacement technique.
Hints
BDB-Disk is a full-fledged storage manager with a buffer pool, locking, crash-recovery, index structures.
  Configure its buffer pool size to be 256 MB.
V Functionalities
(Diagram components: Cache Replacement, BDB-Disk, BDB-Mem)
Hints
Your implementation may need to keep track of different counters. Example: count the number of requests issued (and the number of requests serviced from the main-memory instance of BDB) to compute the cache hit rate.
How to do this with multiple worker threads?
  The Interlocked functions provide a mechanism for synchronizing access to a variable that is shared by multiple threads.
  You may define a “long” variable and use InterlockedIncrement: “long cntr; InterlockedIncrement(&cntr);”
  Make sure to include <windows.h>.
Hints
To compute byte hit rates, you need to maintain two counters and increment them by the size of the referenced object.
Use the “InterlockedExchangeAdd” function to perform an atomic addition of two 32 bit values.
  Example: a = a + b;
  InterlockedExchangeAdd(&a, b);
Other Interlocked methods might be useful to you, such as InterlockedExchangePointer.
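The counter bookkeeping for the cache hit rate and the byte hit rate can be sketched in a portable way. The following is a minimal sketch in Python rather than the Win32 API: a `threading.Lock` stands in for the Interlocked functions, and the names `HitRateCounters`, `record` and `run_workers` are hypothetical, introduced only for illustration.

```python
import threading


class HitRateCounters:
    """Shared counters for cache hit rate and byte hit rate (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for the Interlocked calls
        self.requests = 0              # number of requests issued
        self.hits = 0                  # requests serviced from memory
        self.bytes_requested = 0       # total size of referenced objects
        self.bytes_hit = 0             # size of objects serviced from memory

    def record(self, size, hit):
        """Record one reference to an object of `size` bytes."""
        with self._lock:               # one thread at a time updates the counters
            self.requests += 1
            self.bytes_requested += size
            if hit:
                self.hits += 1
                self.bytes_hit += size

    def cache_hit_rate(self):
        with self._lock:
            return self.hits / self.requests if self.requests else 0.0

    def byte_hit_rate(self):
        with self._lock:
            if not self.bytes_requested:
                return 0.0
            return self.bytes_hit / self.bytes_requested


def run_workers(counters, num_threads, references):
    """Have each of `num_threads` workers record every (size, hit) pair."""
    def worker():
        for size, hit in references:
            counters.record(size, hit)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Both rates need two counters each (a numerator and a denominator), and all four are updated under the same lock so that a reader never sees a hit counted without its request.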
Hints
With invocation of methods, local variables are pushed on the stack of a thread.
  4 different threads invoking a method will have 4 different sets of mutually exclusive local variables as declared by that method.
    void Foo() {
        char res[200];
        int cntr;
        …
    }
A global variable is not part of the stack and must be protected when multiple threads are manipulating it. How?
  Consider making it a variable local to a method. Ask: Does this variable have to be global?
  Use critical sections.
  Manage memory.
Hints
Similarly, memory allocated from the heap (new/malloc) is not a part of the stack and must be managed.
  No memory-leaks.
Hints
Consider an admission control technique.
  Without admission control:
    Every time an object is referenced and it is not in memory then you place it in memory.
  With admission control:
    Every time a disk resident object is referenced, compare its Q value with the minimum Q value to see if it should be admitted into memory.
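The difference between the two policies can be sketched as follows. This is a minimal sketch under assumptions that are not in the slides: the Q value is treated as a number where larger means more valuable, capacity is counted in objects rather than bytes, and the function name `reference` is hypothetical.

```python
def reference(cache, capacity, key, q_value, admission_control=True):
    """Process one reference and return True on a cache hit.

    `cache` maps each memory-resident key to its Q value (assumed numeric,
    larger meaning more valuable). `capacity` is a number of objects.
    """
    if key in cache:
        return True                      # serviced from memory
    if len(cache) < capacity:
        cache[key] = q_value             # free space: always admit
        return False
    victim = min(cache, key=cache.get)   # resident object with the minimum Q
    if admission_control and q_value <= cache[victim]:
        return False                     # not valuable enough: stay on disk
    del cache[victim]                    # evict the minimum and admit
    cache[key] = q_value
    return False
```

With `admission_control=False` every miss displaces the minimum-Q resident object; with it enabled, a low-Q object is read from disk without disturbing the cache.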
Fast Algorithms for Mining Association Rules (by R. Agrawal and R. Srikant)
Shahram Ghandeharizadeh
Computer Science Department
University of Southern California
Terminology
Objective: Discover association rules over basket data.
Example: 98% of customers who purchase tires and auto accessories also get automotive services done.
Motivation: valuable for cross-marketing and attached mailing applications.
  Watch Googlezon, http://www.youtube.com/watch?v=AT9ho2G0N_Y
Requirements:
  Fast algorithms,
  Must manipulate large data sets.
Problem Statement
Terminology
Association rule X => Y has confidence c,
  Out of those transactions that contain X, c% also contain Y.
Association rule X => Y has support s,
  s% of transactions in D contain X and Y.
Note:
  X => A doesn’t mean X+Y => A
    May not have minimum support
  X => A and A => Z doesn’t mean X => Z
    May not have minimum confidence
Example
I = {beer, chips, salsa, nail-polish, toothpaste, toilet-paper}
D = {T1, T2, T3, …, T9999999}
  T1 = {beer, chips, salsa}
  T2 = {beer, toilet-paper}
  T3 = {nail-polish, toothpaste}
TID is the unique identifier for each transaction.
If X = {beer} then both T1 and T2 contain X.
If X = {beer, chips} then T1 contains X.
If X = {beer, nail-polish} then no transaction contains X.
The rule {beer, chips} => {salsa} holds with confidence 90% if 90% of transactions that contain {beer, chips} also contain {salsa}.
  NOTE: {beer, chips} intersect {salsa} is empty, satisfying the constraint of the formal problem specification.
The rule {beer, chips} => {salsa} has support 75% if 75% of transactions contain {beer, chips, salsa}.
Example (Cont…)
Assume:
  1000 transactions {beer, chips, salsa}
  4000 transactions {beer, toilet-paper}
  5000 transactions {nail-polish, tooth-paste, chips, salsa}
What is the confidence in {nail-polish} => {tooth-paste}?
  100% because 5000 out of 5000 transactions that contain {nail-polish} also contain {tooth-paste}.
Example (Cont…)
Assume:
  1000 transactions {beer, chips, salsa}
  4000 transactions {beer, toilet-paper}
  5000 transactions {nail-polish, tooth-paste, chips, salsa}
What is the confidence in {beer} => {salsa}?
  20% because 1000 out of 5000 transactions that contain {beer} also contain {salsa}
Example (Cont…)
Assume:
  1000 transactions {beer, chips, salsa}
  4000 transactions {beer, toilet-paper}
  5000 transactions {nail-polish, tooth-paste, chips, salsa}
What is the confidence in {salsa} => {chips}?
  100% because 6000 out of 6000 transactions that contain {salsa} also contain {chips}
Example (Cont…)
Assume:
  1000 transactions {beer, chips, salsa}
  4000 transactions {beer, toilet-paper}
  5000 transactions {nail-polish, tooth-paste, chips, salsa}
What is the confidence in {salsa} => {nail-polish}?
  5/6 (83.33%) because 5000 out of 6000 transactions that contain {salsa} also contain {nail-polish}
Note:
  Support for {salsa, nail-polish} is 50% (5000 out of 10000)
  Support for {salsa} is 60% (6000 out of 10000)
  Conf = 50% / 60% = 83.33%
Example (Cont…)
Assume:
  1000 transactions {beer, chips, salsa}
  4000 transactions {beer, toilet-paper}
  5000 transactions {nail-polish, tooth-paste, chips, salsa}
What is the confidence in {beer, chips} => {toilet-paper}?
  0% because none of the 1000 transactions that contain {beer, chips} also contain {toilet-paper}.
Example (Cont…)
Assume:
  1000 transactions {beer, chips, salsa}
  4000 transactions {beer, toilet-paper}
  5000 transactions {nail-polish, tooth-paste, chips, salsa}
What is the support in {beer} => {toilet-paper}?
  40% because 4000 transactions (out of 10,000) contain {beer, toilet-paper}
Example (Cont…)
Assume:
  1000 transactions {beer, chips, salsa}
  4000 transactions {beer, toilet-paper}
  5000 transactions {nail-polish, tooth-paste, chips, salsa}
What is the support in {chips} => {salsa}?
  60%, 6000 transactions contain {chips, salsa}.
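The support and confidence questions in these examples can be checked mechanically. The following is a small sketch in Python over the 10,000 assumed transactions; the names `TRANSACTIONS`, `support` and `confidence` are introduced here for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# The 10,000 transactions assumed in the examples, stored as (itemset, count).
TRANSACTIONS = [
    (frozenset({"beer", "chips", "salsa"}), 1000),
    (frozenset({"beer", "toilet-paper"}), 4000),
    (frozenset({"nail-polish", "tooth-paste", "chips", "salsa"}), 5000),
]


def support(items):
    """Fraction of all transactions that contain every item in `items`."""
    items = frozenset(items)
    total = sum(count for _, count in TRANSACTIONS)
    matching = sum(count for t, count in TRANSACTIONS if items <= t)
    return Fraction(matching, total)


def confidence(x, y):
    """Of the transactions that contain X, the fraction that also contain Y."""
    return support(set(x) | set(y)) / support(x)
```

For instance, `confidence({"salsa"}, {"nail-polish"})` is computed as support({salsa, nail-polish}) / support({salsa}), which is the same 50% / 60% ratio worked out by hand.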
Example Queries
Compute all association rules with support and confidence greater than 55%.
  Assume:
    1000 transactions {beer, chips, salsa}
    4000 transactions {beer, toilet-paper}
    5000 transactions {nail-polish, tooth-paste, chips, salsa}
  Answer:
    {chips} => {salsa},
    {salsa} => {chips}
Example Queries
Compute all association rules with support > 30% and confidence greater than 45%.
  Assume:
    1000 transactions {beer, chips, salsa}
    4000 transactions {beer, toilet-paper}
    5000 transactions {nail-polish, tooth-paste, chips, salsa}
  Answer:
    {chips} => {salsa},
    {salsa} => {chips},
    {nail-polish} => {tooth-paste},
    {tooth-paste} => {nail-polish},
    {nail-polish} => {chips},
    {nail-polish} => {salsa}
    …
Divide the Problem into Two
1. Find all sets of items that have support above minimum support.
   Itemsets with minimum support are called large itemsets and all others small itemsets.
   Algorithms: Apriori and AprioriTid.
2. Use large itemsets to generate the desired rules.
   For every large itemset l, find all non-empty subsets of l. Let a denote one subset.
   For every subset a, output a rule of the form a => (l - a) if support(l) / support(a) is at least minconf.
   Say ABCD and AB are large itemsets
   Compute conf = support(ABCD) / support(AB)
   If conf >= minconf, AB => CD holds.
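Step 2 can be sketched directly from its description. The following is a minimal sketch in Python; the function name `generate_rules` and the dictionary layout (large itemset mapped to its support) are assumptions made for illustration. It restricts a to proper subsets so that the consequent l - a is never empty.

```python
from itertools import combinations


def generate_rules(large_itemsets, minconf):
    """Return (antecedent, consequent, confidence) for every rule a => (l - a).

    `large_itemsets` maps each large itemset (a frozenset) to its support.
    Every subset of a large itemset is itself large, so its support is
    expected to be present in the same dictionary.
    """
    rules = []
    for l, support_l in large_itemsets.items():
        for size in range(1, len(l)):            # non-empty proper subsets
            for a in map(frozenset, combinations(sorted(l), size)):
                conf = support_l / large_itemsets[a]   # support(l) / support(a)
                if conf >= minconf:
                    rules.append((a, l - a, conf))
    return rules
```

As a usage example, if {chips}, {salsa} and {chips, salsa} are the large itemsets, each with support 0.6, then calling `generate_rules` with `minconf=0.55` yields the two rules {chips} => {salsa} and {salsa} => {chips}.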
Conquer
Focus on item 1:
  1. Find all sets of items that have support above a pre-specified minimum support.
Example:
  Assume the following database: (table not recoverable from the source)
  Itemsets with minimum support of 2 transactions?
How?
General idea:
  Multiple passes over the data
  First pass – count the support of individual items.
  Subsequent pass
    Generate Candidates using previous pass’s large itemset.
    Go over the data and check the actual support of the candidates.
  Stop when no new large itemsets are found.
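The multi-pass structure can be sketched as a loop. The following is a minimal sketch in Python, not the paper's exact procedure: candidates are built by taking unions of large itemsets from the previous pass (a simplification of apriori-gen), `minsup` is a transaction count, and the function name `apriori` and its helper `count` are chosen here for illustration.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset: support count} for every large itemset.

    `transactions` is a list of sets; `minsup` is a number of transactions.
    """
    def count(candidates):
        # One pass over the data to get the actual support of each candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= minsup}

    # First pass: support of individual items.
    items = {item for t in transactions for item in t}
    large = count({frozenset([item]) for item in items})
    answer = dict(large)
    k = 2
    while large:                      # stop when no new large itemsets are found
        # Candidates: unions of two large (k-1)-itemsets that have k items
        # and whose (k-1)-subsets are all large.
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        candidates = {
            c for c in candidates
            if all(frozenset(s) in large for s in combinations(c, k - 1))
        }
        large = count(candidates)
        answer.update(large)
        k += 1
    return answer
```

On an assumed database of four transactions {1 3 4}, {2 3 5}, {1 2 3 5}, {2 5} with a minimum support of 2 transactions, the loop makes three productive passes and ends with {2 3 5} as the only large 3-itemset.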
How?
Make several passes of DB.
  Pass 1: count item occurrences to determine the large 1-itemsets.
    Notice that {4} is missing!
  Pass 2: Compute the apriori-gen query and count the support for each candidate by making a pass of DB.
      SELECT p.item1, q.item1
      FROM L1 p, L1 q
      WHERE p.item1 < q.item1
    Drop those with support < minsup
  Pass j (j >= 3): Compute candidate set using apriori-gen algorithm
Apriori-gen Algorithm
Intuition: Generate the candidate itemsets to be counted in a pass by using only the itemsets found large in the previous pass.
  How?
Note that when k=2, the apriori-gen join computes a large number of rows: every pair of distinct rows of L1, counted once because of the p.item1 < q.item1 condition. If L1 has 100 rows, the resulting number of rows is 4950 (100 × 99 / 2), which is half of the 9900 (10000 - 100) ordered pairs of distinct rows in the cartesian product.
Apriori-gen Algorithm
Intuition: Generate the candidate itemsets to be counted in a pass by using only the itemsets found large in the previous pass.
  What is the result when k = 3?
  What is the SQL command?
    INSERT INTO Ck
    SELECT p.item1, p.item2, q.item2
    FROM L2 p, L2 q
    WHERE p.item1 = q.item1 AND p.item2 < q.item2
  Result?
    Candidate itemsets: computed by the SQL query.
    Support of each candidate: computed by making a pass on the DB.
Intuition
Any subset of large itemset is large.
Therefore
To find large k-itemset
  Create candidates by combining large (k-1)-itemsets.
  Delete those that contain any subset that is not large.
Assumptions & Definitions
Items in each transaction are kept sorted in their lexicographic order.
Number of items in an itemset is its size.
An itemset of size k is a k-itemset.
Each itemset has a count field to store the support for this itemset.
L_k is the set of large k-itemsets (those with minimum support).
C_k is the set of candidate k-itemsets. Its members are potential members of L_k.
Apriori Algorithm
Important detail:
  With apriori-gen, the join may compute itemsets whose subsets do NOT exist in L_{k-1}. Prune these by deleting every itemset c of C_k such that some (k-1)-subset of c is not in L_{k-1}.
Example:
  Let L3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}
  What is output for C4?
    The join produces { {1 2 3 4}, {1 3 4 5} }
    Subsets of {1 2 3 4} are { {1 2 3}, {2 3 4}, {1 3 4}, {1 2 4} }
    Subsets of {1 3 4 5} are { {1 3 4}, {1 3 5}, {3 4 5}, {1 4 5} }
    {1 3 4 5} is pruned because {3 4 5} and {1 4 5} are not in L3, leaving C4 = { {1 2 3 4} }.
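The join and prune steps of apriori-gen can be sketched as follows. This is a minimal sketch in Python that represents each itemset as a sorted tuple, mirroring the assumption that items are kept in lexicographic order; the function name `apriori_gen` and the tuple representation are choices made here for illustration.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Build candidate k-itemsets from the large (k-1)-itemsets.

    Each itemset is a sorted tuple. Returns a sorted list of candidates.
    """
    large_prev = set(large_prev)
    if not large_prev:
        return []
    size = len(next(iter(large_prev)))       # this is k-1
    # Join step: same first k-2 items, and p's last item < q's last item.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    }
    # Prune step: drop c if some (k-1)-subset of c is not large.
    return sorted(
        c for c in joined
        if all(s in large_prev for s in combinations(c, size))
    )
```

Called on the L3 of the example, the join yields (1, 2, 3, 4) and (1, 3, 4, 5), and the prune step removes the second because (1, 4, 5) is not in L3.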
Correctness
Show that C_k ⊇ L_k

  insert into C_k
  select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
  from L_{k-1} p, L_{k-1} q
  where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

  forall itemsets c ∈ C_k do
    forall (k-1)-subsets s of c do
      if (s ∉ L_{k-1}) then
        delete c from C_k

Join extends L_{k-1} with all items
Apriori removes those whose (k-1)-subsets are not in L_{k-1}
Prevents duplications
Any subset of large itemset must also be large
AIS & SETM
AIS & SETM generate candidate itemsets based on transactions.
  Apriori uses the large itemsets to generate larger itemsets.
Example:
  Let L3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}
  In pass 4, when encountering a transaction with items {1 2 3 4 5}, AIS and SETM generate the following five candidate sets:
    {1 2 3} => {1 2 3 4} and {1 2 3 5}
    {1 2 4} => {1 2 4 5}
    {1 3 4} => {1 3 4 5}
    {2 3 4} => {2 3 4 5}
AprioriTid
Uses the database only once, to count support for 1-itemsets in Pass 1.
Builds a storage set C^k
    Members have the form < TID, {Xk} >
    Xk are the potentially large k-itemsets in transaction TID.
    For k=1, C^1 is the database.
Uses C^k in pass k+1.
Advantages:
    C^k could be smaller than the database.
        If a transaction does not contain a candidate k-itemset, then C^k will not have an entry for this transaction.
    For large k, each entry may be smaller than the transaction.
        The transaction might contain only a few candidates.
How? (Assume minsup = 2)
1. Make a pass of DB and count item occurrences to determine the large 1-itemsets.
2. Construct C^1. Note that C^1 = Database.
4. Compute C2 by invoking apriori-gen.
9. Count the support of the candidates in C2.
10. Compute C^2. Notice what happened to T100.
12. Compute L2: all entries of C2 with support >= 2.
Iter 2, Step 4: Compute C3.
Iter 2, Step 9: Count support.
Iter 2, Step 10: Compute C^3. Transactions 100 and 400 are gone!
Iter 2, Step 12: Generate L3.
Iter 3, Step 4: Generate C4. Since C4 is the empty set, terminate the algorithm.
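The walk-through above can be reproduced with a short Python sketch of AprioriTid. The slide's example database did not survive in these notes, so the four transactions below are an assumption, chosen to be consistent with the remarks above (T100 shrinks in C^2, transactions 100 and 400 disappear from C^3, and C4 is empty). The function names and data layout are likewise illustrative, with itemsets as sorted tuples; the step numbers in the comments are the ones used in the walk-through.

```python
from collections import Counter
from itertools import combinations


def apriori_gen(large_prev):
    """Join + prune over sorted-tuple itemsets."""
    joined = {p + (q[-1],) for p in large_prev for q in large_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined
            if all(s in large_prev for s in combinations(c, len(c) - 1))}


def apriori_tid(db, minsup):
    """Run AprioriTid on db, a dict mapping TID -> set of items.

    Returns (large, cbar): large[k] maps each large k-itemset to its support,
    cbar[k] maps TID -> candidate k-itemsets present in that transaction.
    """
    # Step 1: the only pass over the database finds the large 1-itemsets.
    counts = Counter((item,) for items in db.values() for item in items)
    large = {1: {c: n for c, n in counts.items() if n >= minsup}}
    # Step 2: C^1 is the database, with each item stored as a 1-itemset.
    cbar = {1: {tid: {(item,) for item in items} for tid, items in db.items()}}
    k = 2
    while large[k - 1]:
        candidates = apriori_gen(set(large[k - 1]))  # Step 4
        counts = Counter()
        cbar[k] = {}
        for tid, prev_sets in cbar[k - 1].items():   # scan C^(k-1), not the database
            # A candidate is in the transaction if both (k-1)-itemsets
            # that generated it are in the transaction's previous entry.
            present = {c for c in candidates
                       if c[:-1] in prev_sets and c[:-2] + c[-1:] in prev_sets}
            counts.update(present)                   # Step 9
            if present:                              # Step 10
                cbar[k][tid] = present
        large[k] = {c: n for c, n in counts.items() if n >= minsup}  # Step 12
        k += 1
    return large, cbar


# Assumed example database (not recoverable from the notes).
db = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
large, cbar = apriori_tid(db, minsup=2)
print(large[2])
print(cbar[3])
```

With this database, the entry for T100 in C^2 holds only {1 3}, C^3 keeps entries for transactions 200 and 300 only, and L3 is { {2 3 5} } with support 2.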
Apriori versus AprioriTid
The sizes of the candidate sets, Ck, are smaller with AprioriTid for larger values of k.
[Figure legend: Ck with Apriori & AprioriTid; Lk]
Apriori versus AprioriTid
AprioriTid outperforms Apriori when
    C^k fits in memory, and
    the distribution of the large itemsets has a long tail.
[Figure annotation: AprioriTid jumps because C^k does not fit in memory.]
Execution Time Per Pass
In the earlier passes, Apriori does better than AprioriTid.
AprioriTid is better than Apriori in later passes.
Apriori & AprioriTid
Similarities; both:
    Use the same candidate generation procedure, counting the same itemsets.
    Observe a drop in the number of candidate itemsets in the later passes.
Differences:
    In each pass, Apriori examines every transaction. AprioriTid scans C^k, and the size of C^k becomes smaller than the database size in each pass.
    When C^k fits in main memory, AprioriTid does not incur the cost of writing and reading C^k.
AprioriHybrid
Key idea:
    Use Apriori in the initial passes.
    Switch to AprioriTid when it expects that C^k at the end of the pass will fit in memory.
How to estimate if C^k fits in memory in the next pass?
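One possible answer, sketched here under stated assumptions: at the end of a pass the support counts of the candidates in Ck are known, so the size C^k would have had can be estimated as the sum of those counts plus one word per transaction for the TID. The switch is made only if that estimate fits in memory and the number of large itemsets is already shrinking. The function names, the memory budget in words, and the example counts below are all illustrative; treat this as a heuristic sketch rather than the definitive rule.

```python
def estimated_cbar_size(candidate_counts, num_transactions):
    """Estimate the size of C^k in words.

    Each occurrence of a candidate in a transaction costs one word,
    and each transaction costs one more word for its TID.
    """
    return sum(candidate_counts.values()) + num_transactions


def should_switch(candidate_counts, num_transactions, memory_words,
                  num_large_now, num_large_prev):
    """Switch to AprioriTid if C^k would fit and large itemsets are declining."""
    fits = estimated_cbar_size(candidate_counts, num_transactions) <= memory_words
    # The second condition guards against C^k fitting now but not in the next pass.
    return fits and num_large_now < num_large_prev


# Illustrative support counts for six candidate 2-itemsets over 4 transactions.
counts = {(1, 2): 1, (1, 3): 2, (1, 5): 1, (2, 3): 2, (2, 5): 3, (3, 5): 2}
print(estimated_cbar_size(counts, 4))  # 15
```

An over-estimate is harmless here: it only delays the switch, whereas an under-estimate could trigger the switch while C^k still spills to disk.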
Cost of Switching
Switching in the last pass incurs the cost of constructing C^k+1 without using it.
    In the kth pass, AprioriHybrid incurs the cost of constructing C^k+1.
    If there are no large (k+1)-itemsets (i.e., this is the last pass), the algorithm terminates.
        With Apriori, the algorithm also terminates without making a pass of the transactions.
        AprioriHybrid builds C^k+1 and then terminates.
Comparison
AprioriHybrid is faster if there is a gradual decline in the size of C^k.
[Figure annotation: AprioriHybrid switched in the last pass!]
Comparison (Cont…)
If C^k remains large until nearly the end and then drops abruptly, AprioriHybrid will perform the same as Apriori.
Question
Why is AprioriTid worse than Apriori?
    Is AprioriTid better than Apriori for some experiment reported in this paper? If not, then why?
Answer
Why is AprioriTid worse than Apriori?
    C^k is large in the first few passes, killing the overall execution time.
Characteristics
For a fixed collection of system parameters (e.g., minimum support level):
    Response time increases linearly as a function of the number of transactions.
    With a larger number of items (1000 versus 10,000), the execution time decreases a little as the average support for an item decreases. Fewer itemsets provide faster execution times.
Rest of this Semester
Project is due at midnight on Friday, April 24.
Review for midterm on April 28th. 4 papers:
    Variant indexes.
    Access path selection.
    Overview of query optimization.
    Mining Association Rules.
Midterm 2 on April 30th.
Meeting with the teams during the 1st week of May.
    E-mail to schedule meeting to follow.