Α. Νανόπουλος & Γ. Μανωλόπουλος
Introduction to Data Mining and Data Warehouses
Chapter 8: Association Rules
http://delab.csd.auth.gr/books/grBooks/grBooks.html
Associations
Rules expressing relationships between items
Example
cereal, milk ⇒ fruit
“People who bought cereal and milk also bought fruit.”
Stores might want to offer specials on milk and cereal to get people to buy more fruit.
Market Basket Analysis
Analyze tables of transactions
Can we hypothesize: Chips ⇒ Salsa? Lettuce ⇒ Spinach?
Person Basket
A Chips, Salsa, Cookies, Crackers, Coke, Beer
B Lettuce, Spinach, Oranges, Celery, Apples, Grapes
C Chips, Salsa, Frozen Pizza, Frozen Cake
D Lettuce, Spinach, Milk, Butter
Definitions
Let I = {i1, …, im} be a set of literals called items
Let D be a set of transactions; each T ∈ D satisfies T ⊆ I
Itemset X: any X ⊆ I; k-itemset: |X| = k
X is contained in T if X ⊆ T
Support s(X): the fraction of transactions containing X
s(X) = |{T ∈ D : X ⊆ T}| / |D|
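This definition can be sketched directly in Python; the baskets below are hypothetical, loosely echoing the market-basket slide:

```python
def support(itemset, transactions):
    """s(X): fraction of transactions T with X ⊆ T."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

# Hypothetical baskets, loosely echoing the market-basket slide:
D = [{"chips", "salsa", "cookies"}, {"lettuce", "spinach"},
     {"chips", "salsa", "pizza"}, {"lettuce", "spinach", "milk"}]
print(support({"chips", "salsa"}, D))   # 0.5
```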
Rule Measures: Support and Confidence

Find all rules Beer ⇒ Diaper with minimum confidence and support
- support, s: probability that a transaction contains {Beer, Diaper}
- confidence, c: conditional probability that a transaction containing {Beer} also contains {Diaper}
Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F
With minimum support 50% and minimum confidence 50%, we have:
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
(Venn diagram: customers buying beer, customers buying diaper, and customers buying both.)
Association rules: Problem description
Association rule: X ⇒ Y, where X ⊆ I, Y ⊆ I, X ∩ Y = ∅
Support of the rule = s(X ∪ Y)
Confidence of the rule = s(X ∪ Y) / s(X)
Problem: find all association rules with confidence at least MINCONF and support at least MINSUP
Note: if X ⇒ Y and Y ⇒ Z hold, X ⇒ Z does not necessarily hold,
because X ⇒ Z may not have enough support or confidence
Mining Association Rules—An Example
For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

Frequent Itemset | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%

Min. support 50%, min. confidence 50%
Association Rule Mining
Two basic steps:
1. Find all frequent itemsets (satisfying minimum support)
2. Find all strong association rules:
   - generate candidate rules from the frequent itemsets
   - keep the rules satisfying minimum confidence
Generating Frequent Itemsets: Naive algorithm
n ← |D|
for each subset s of I do
    l ← 0
    for each transaction T in D do
        if s ⊆ T then
            l ← l + 1
    if l/n ≥ minimum support then
        add s to frequent subsets
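A runnable version of this pseudocode, as a sketch (the function name is illustrative; the transactions reuse the A/B/C example from the earlier slide):

```python
from itertools import combinations

def naive_frequent_itemsets(items, transactions, minsup):
    """Enumerate every non-empty subset of the item domain and count it
    against all transactions: O(2^m * n) subset tests."""
    n = len(transactions)
    frequent = []
    for k in range(1, len(items) + 1):
        for s in combinations(sorted(items), k):
            count = sum(1 for t in transactions if set(s) <= t)
            if count / n >= minsup:
                frequent.append(frozenset(s))
    return frequent

D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
res = naive_frequent_itemsets(set().union(*D), D, 0.5)
print(res)   # {A}, {B}, {C} and {A, C}
```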
Analysis of naive algorithm
O(2^m) subsets of I
Scan n transactions for each subset
O(2^m · n) tests of whether s ⊆ T
Growth is exponential in the number of items!
Can we do better?
Reducing Number of Candidates: Apriori
Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent
Apriori principle holds due to the following property of the support measure:
Support of an itemset never exceeds the support of its subsets
This is known as the anti-monotone property of support
∀ X, Y: X ⊆ Y ⇒ s(X) ≥ s(Y)
Illustrating Apriori Principle

(Itemset lattice from the empty set over {A, B, C, D, E} down to ABCDE: once an itemset is found to be infrequent, all of its supersets are pruned from the search space.)
Illustrating Apriori Principle
Minimum support = 3

Items (1-itemsets):
Item | Count
Bread | 4
Coke | 2
Milk | 4
Beer | 3
Diaper | 4
Eggs | 1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
Itemset | Count
{Bread, Milk} | 3
{Bread, Beer} | 2
{Bread, Diaper} | 3
{Milk, Beer} | 2
{Milk, Diaper} | 3
{Beer, Diaper} | 3

Triplets (3-itemsets):
Itemset | Count
{Bread, Milk, Diaper} | 3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13
The Apriori Algorithm
Join step: Ck is generated by joining Lk-1 with itself
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset

Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):
L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order
Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
Example of Generating Candidates
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
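This example can be verified with a small sketch of the generation step (`apriori_gen` is an illustrative name):

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Self-join L_{k-1} on the first k-2 items (join step), then drop
    any candidate with a (k-1)-subset outside L_{k-1} (prune step)."""
    prev = {frozenset(s) for s in L_prev}
    sorted_sets = sorted(sorted(s) for s in L_prev)
    k = len(sorted_sets[0]) + 1
    C = set()
    for i in range(len(sorted_sets)):
        for j in range(i + 1, len(sorted_sets)):
            p, q = sorted_sets[i], sorted_sets[j]
            if p[:-1] == q[:-1] and p[-1] < q[-1]:        # join step
                cand = frozenset(p) | frozenset(q)
                if all(frozenset(s) in prev               # prune step
                       for s in combinations(cand, k - 1)):
                    C.add(cand)
    return C

L3 = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"},
      {"a", "c", "e"}, {"b", "c", "d"}]
print(apriori_gen(L3))   # only abcd survives; acde is pruned (ade not in L3)
```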
How to Count Supports of Candidates?
Candidate counting:
- scan the database of transactions to determine the support of each candidate itemset
- to reduce the number of comparisons, store the candidates in a hash structure
- instead of matching each transaction against every candidate, match it against the candidates in the hashed buckets
TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke
(Diagram: N transactions are matched against a hash structure with k buckets.)
Example
b=2, c=3
The Apriori Algorithm — Example
Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{4} | 1
{5} | 3

L1:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{5} | 3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 counts:
itemset | sup
{1 2} | 1
{1 3} | 2
{1 5} | 1
{2 3} | 2
{2 5} | 3
{3 5} | 2

L2:
itemset | sup
{1 3} | 2
{2 3} | 2
{2 5} | 3
{3 5} | 2

C3: {2 3 5} → Scan D → sup = 2 → L3: {2 3 5}
Another example of Apriori
Minsup = 2
Another example of Apriori
Generating rules (2nd sub-problem)
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f ⇒ L − f satisfies the minimum confidence requirement

If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC ⇒ D, ABD ⇒ C, ACD ⇒ B, BCD ⇒ A, A ⇒ BCD, B ⇒ ACD, C ⇒ ABD, D ⇒ ABC,
AB ⇒ CD, AC ⇒ BD, AD ⇒ BC, BC ⇒ AD, BD ⇒ AC, CD ⇒ AB

If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L ⇒ ∅ and ∅ ⇒ L)
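A sketch of this enumeration, assuming the supports of all subsets are already known (here hard-coded from the earlier A/C example):

```python
from itertools import combinations

def gen_rules(L, supports, minconf):
    """Enumerate every rule f ⇒ L − f over non-empty proper subsets f of L,
    keeping those with confidence s(L) / s(f) ≥ minconf."""
    L = frozenset(L)
    rules = []
    for r in range(1, len(L)):
        for f in combinations(sorted(L), r):
            f = frozenset(f)
            conf = supports[L] / supports[f]
            if conf >= minconf:
                rules.append((f, L - f, conf))
    return rules

# Supports taken from the A/C example slide, as fractions:
supports = {frozenset("A"): 0.75, frozenset("C"): 0.5, frozenset("AC"): 0.5}
for f, rest, conf in gen_rules("AC", supports, 0.5):
    print(set(f), "=>", set(rest), round(conf, 3))
```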
Rule Generation with anti-monotone property
How can rules be generated efficiently from frequent itemsets?
In general, confidence does not have an anti-monotone property:
c(ABC ⇒ D) can be larger or smaller than c(AB ⇒ D)
But the confidence of rules generated from the same itemset is anti-monotone, e.g., for L = {A,B,C,D}:
c(ABC ⇒ D) ≥ c(AB ⇒ CD) ≥ c(A ⇒ BCD)
Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
Rule Generation: example of anti-monotonicity
Lattice of rules for the frequent itemset ABCD:
ABCD ⇒ {}
BCD ⇒ A, ACD ⇒ B, ABD ⇒ C, ABC ⇒ D
BC ⇒ AD, BD ⇒ AC, CD ⇒ AB, AD ⇒ BC, AC ⇒ BD, AB ⇒ CD
D ⇒ ABC, C ⇒ ABD, B ⇒ ACD, A ⇒ BCD

(When a rule is found to have low confidence, every rule below it in the lattice, i.e. every rule obtained by moving further items to the right-hand side, is pruned.)
Methods to Improve Apriori’s Efficiency
- Hash-based itemset counting: a k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Sampling: mine a subset of the given data with a lowered support threshold, plus a method to determine completeness
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
Hash-based method
|C2| = C(|L1|, 2) = O(|L1|^2); |L1| may be thousands, so |C2| is the bottleneck
While counting the support of C1:
- find the 2-itemsets in each transaction
- hash them into buckets, accumulating the bucket counts
- a 2-itemset that falls in a bucket with count less than MINSUP cannot be large
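A rough sketch of the bucket-counting idea (the bucket count of 7 and the helper names are arbitrary choices, and Python's built-in hash stands in for a real hash function):

```python
from itertools import combinations

def pair_bucket_counts(transactions, n_buckets):
    """During the C1-counting scan, hash every 2-itemset of each
    transaction into a bucket and accumulate the bucket counts."""
    buckets = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    return buckets

D = [set("ACD"), set("BCE"), set("ABCE"), set("BE")]
NB = 7
buckets = pair_bucket_counts(D, NB)

def may_be_frequent(pair, minsup_count=2):
    """A pair whose bucket count is below minsup cannot be frequent."""
    return buckets[hash(tuple(sorted(pair))) % NB] >= minsup_count

# {B, E} occurs 3 times, so its bucket holds a count of at least 3:
print(may_be_frequent({"B", "E"}))   # True
```

The guarantee is one-sided: a surviving pair may still turn out infrequent, but no truly frequent pair is ever pruned.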
Example of hash-based method

TID | Items
100 | A C D
200 | B C E
300 | A B C E
400 | B E

C1 counts: {A} (2), {B} (3), {C} (3), {D} (1), {E} (3) → L1 = {A}, {B}, {C}, {E}

Hash buckets of 2-itemsets (bucket count in parentheses):
{C,E} {C,E} {A,D} (3)
{B,C} {B,C} {A,E} (3)
{B,E} {B,E} {B,E} (3)
{A,B} (1)
{A,C} {C,D} {A,C} (3)

L1 × L1: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}

C2 after hash-based pruning ({A,B} falls in a bucket with count 1 < 2):
{A,C}, {A,E}, {B,C}, {B,E}, {C,E}
Transaction trimming
While counting Ck (phase k), an item of a transaction t can be trimmed if it does not appear in at least k of the candidate k-itemsets contained in t
Example: the candidates are ABC, ABD and BCD, while ACD is not a candidate; if t contains A, then A can be trimmed from t, because A cannot participate in forthcoming passes (ABCD will not be generated)
If t is left with no more than k items, discard it
Example of transaction trimming
TID | Items | Trimmed (C2 = {A,C}, {B,C}, {B,E}, {C,E})
100 | A C D | {C} → discarded due to length
200 | B C E | {B, C, E}
300 | A B C E | {B, C, E}
400 | B E | {B, E} → discarded due to length
Partitioning method
- Logically partition the database into a number of non-overlapping partitions, each fitting entirely in main memory
- Find the large itemsets in each partition (local large itemsets); the multiple phases run in main memory, with I/O only to read each partition once
- Determine which of the local large itemsets are also global large itemsets in one scan of the database; no global large itemset is missed
Partition Algorithm
partition_database(D)
n = number of partitions
// Phase I
for i = 1 to n do
    read the i-th partition
    Li = gen_all_large_itemsets(i-th partition)
// Phase II
C = L1 ∪ … ∪ Ln
for i = 1 to n do
    read the i-th partition
    for each candidate c ∈ C update c.count
L = {c | c ∈ C, c.count ≥ MINSUP}
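A compact in-memory rendering of the two phases, assuming a naive local miner per partition (all names are illustrative):

```python
from itertools import combinations

def frequent_in(transactions, minsup):
    """Local large itemsets of one partition (naive; the partition fits in memory)."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    freq = set()
    for k in range(1, len(items) + 1):
        for s in combinations(items, k):
            if sum(1 for t in transactions if set(s) <= t) >= minsup * n:
                freq.add(frozenset(s))
    return freq

def partition_mine(D, n_parts, minsup):
    """Phase I: local large itemsets per partition; Phase II: one full
    scan counts the union of the local results."""
    size = -(-len(D) // n_parts)          # ceil(len(D) / n_parts)
    parts = [D[i:i + size] for i in range(0, len(D), size)]
    C = set().union(*(frequent_in(p, minsup) for p in parts))
    return {c for c in C
            if sum(1 for t in D if c <= t) >= minsup * len(D)}

# The four-transaction example database from the Apriori example slide:
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(len(partition_mine(D, 2, 0.5)))   # 9 global large itemsets
```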
Partitioning example
Why is Partition correct?

We have to show that no large itemset (global large) is missed.
A global large itemset X is not missed if it is a candidate in the 2nd phase, i.e., if it is local large in at least one partition.
Let X be a global large itemset that is not local large in any partition, and let t(X, P) denote the number of occurrences of X in partition P:
t(X, Pi) < MINSUP · |Pi|, for every 1 ≤ i ≤ n
⇒ Σi t(X, Pi) < MINSUP · Σi |Pi|
⇒ t(X, D) < MINSUP · |D|
⇒ s(X) < MINSUP, a contradiction.
Size of candidate set for the 2nd phase
Based on the assumption that the size |C| of the union of all local large itemsets is small enough, comparable to |L| (the global large itemsets).
Since the same MINSUP is used, one can expect |C| close to |L|; the worst case is n · |L|, but in practice many local large itemsets overlap (the same itemset is large in different partitions).
For small n, |C| is very close to |L|.
For larger n, |C| grows, but stays far from the worst case.
Sampling method
Main idea:
- take a random sample s of the database D
- find the large itemsets S of s
- verify which members of S are large in D (one scan)
- identify potentially “missed” large itemsets (possibly a 2nd scan)

Gains:
- mining s is free in terms of I/O
- one scan of D in the best case, two in the worst
Negative border: Bd−(S) is the set of minimal itemsets not in S, i.e., those all of whose proper subsets are in S
How not to miss anything?
The goal is to avoid missing any itemset that is frequent in the full set of baskets.
If X is frequent in D but not frequent in S, then there is some Y ⊆ X with Y ∈ Bd−(S).
We check whether anything in the negative border is actually frequent in D.
Try to choose the minsup used on S so that the probability of failure is low, while the number of itemsets checked on the second pass fits in main memory.
Sampling algorithm
Example
Domain = {A, B, C, D, E, F}
S = {A,B,C}, {A,C,F}, {A,D}, {B,D} (large in the sample)
Bd−(S) = {B,F}, {C,D}, {D,F}, {E}: candidates not found large in the sample
L = {A,B}, {B,F}, {A,C,F}: large in D
{B,F} represents a “miss”: {A,B}, {B,F} and {A,F} generate {A,B,F}, which can possibly be large but was not counted
Is Apriori Fast Enough? — Performance Bottlenecks
The core of the Apriori algorithm:
- use frequent (k − 1)-itemsets to generate candidate frequent k-itemsets
- use database scans and pattern matching to collect the counts of the candidate itemsets

The bottleneck of Apriori: candidate generation
- huge candidate sets: 10^4 frequent 1-itemsets generate 10^7 candidate 2-itemsets; discovering a frequent pattern of size 100, e.g. {a1, a2, …, a100}, requires generating 2^100 ≈ 10^30 candidates
- multiple database scans: (n + 1) scans are needed, where n is the length of the longest pattern
Mining Frequent Patterns Without Candidate Generation
Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure:
- highly condensed, but complete for frequent pattern mining
- avoids costly database scans

Develop an efficient FP-tree-based frequent pattern mining method:
- a divide-and-conquer methodology: decompose mining tasks into smaller ones
- avoid candidate generation: sub-database tests only!
Construct FP-tree from a Transaction DB (min_support = 0.5)

TID | Items bought | (ordered) frequent items
100 | f, a, c, d, g, i, m, p | f, c, a, m, p
200 | a, b, c, f, l, m, o | f, c, a, b, m
300 | b, f, h, j, o | f, b
400 | b, c, k, s, p | c, b, p
500 | a, f, c, e, l, p, m, n | f, c, a, m, p

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

Resulting FP-tree, shown as its root-to-leaf paths (node:count):
f:4 → c:3 → a:3 → m:2 → p:2
f:4 → c:3 → a:3 → b:1 → m:1
f:4 → b:1
c:1 → b:1 → p:1

Steps:
1. Scan DB once, find the frequent 1-itemsets (single item patterns)
2. Order frequent items in frequency-descending order
3. Scan DB again, construct the FP-tree
Mining Frequent Patterns Using FP-tree
General idea (divide-and-conquer): recursively grow frequent patterns using the FP-tree

Method:
- for each item, construct its conditional pattern base, and then its conditional FP-tree
- repeat the process on each newly created conditional FP-tree
- until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
Major Steps to Mine FP-tree
1) Construct the conditional pattern base for each node in the FP-tree
2) Construct the conditional FP-tree from each conditional pattern base
3) Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far
   - if the conditional FP-tree contains a single path, simply enumerate all the patterns
Step 1: From FP-tree to Conditional Pattern Base
- Start from the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item
- Accumulate all the transformed prefix paths of that item to form its conditional pattern base

Conditional pattern bases:
item | conditional pattern base
c | f:3
a | fc:3
b | fca:1, f:1, c:1
m | fca:2, fcab:1
p | fcam:2, cb:1
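For a database this small, the conditional pattern bases can be read directly off the frequency-ordered transactions; a sketch (the function name is illustrative):

```python
from collections import defaultdict

def conditional_pattern_bases(ordered_db):
    """Collect, for every item, the prefix paths that precede it in the
    frequency-ordered transactions; these match the bases read off the FP-tree."""
    bases = defaultdict(list)
    for t in ordered_db:
        for i in range(1, len(t)):
            bases[t[i]].append(tuple(t[:i]))
    return bases

# The five ordered transactions from the FP-tree construction slide:
db = [["f", "c", "a", "m", "p"], ["f", "c", "a", "b", "m"],
      ["f", "b"], ["c", "b", "p"], ["f", "c", "a", "m", "p"]]
cpb = conditional_pattern_bases(db)
print(cpb["m"])   # fca twice and fcab once, matching the table above
```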
Step 2: Construct Conditional FP-tree
For each pattern base:
- accumulate the count of each item in the base
- construct the FP-tree over the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
Mining Frequent Patterns by Creating Conditional Pattern-Bases
Item | Conditional pattern base | Conditional FP-tree
f | Empty | Empty
c | {(f:3)} | {(f:3)}|c
a | {(fc:3)} | {(f:3, c:3)}|a
b | {(fca:1), (f:1), (c:1)} | Empty
m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}|m
p | {(fcam:2), (cb:1)} | {(c:3)}|p
Single FP-tree Path Generation
Suppose an FP-tree T has a single path P.
The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P.

m-conditional FP-tree: {} → f:3 → c:3 → a:3

All frequent patterns concerning m:
m, fm, cm, am, fcm, fam, cam, fcam
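A sketch of this enumeration for the single-path case (the function and its signature are illustrative assumptions):

```python
from itertools import combinations

def single_path_patterns(path_counts, suffix, suffix_count):
    """A single-path conditional FP-tree yields one pattern per subset of
    the path, each joined with the suffix item; the support is the
    minimum count along the chosen items."""
    items = list(path_counts)
    patterns = {}
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            cnt = min([path_counts[i] for i in combo] + [suffix_count])
            patterns[frozenset(combo) | {suffix}] = cnt
    return patterns

# m-conditional FP-tree from the slide: the single path f:3 -> c:3 -> a:3
pats = single_path_patterns({"f": 3, "c": 3, "a": 3}, "m", 3)
print(len(pats))   # 8 patterns: m, fm, cm, am, fcm, fam, cam, fcam
```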
Another example
Minsup = 2
Another example
Another example
Conditional FP-tree | f
Another example
Another example
Combine ef with every subset of dca (including the empty set): ef, def, cef, aef, dcef, daef, caef, dcaef
All with support equal to 2
(similarly for all the other conditional paths of f)
Another example
Benefits of the FP-tree Structure
A performance study shows that FP-growth is an order of magnitude faster than Apriori, and also faster than tree-projection.
Reasoning:
- no candidate generation, no candidate tests
- compact data structure
- no repeated database scans
- the basic operations are counting and FP-tree building
(Chart: run time in seconds vs. support threshold (%) on dataset D1, comparing the FP-growth and Apriori run times.)
Closed Itemset
An itemset is closed if none of its immediate supersets has the same support as the itemset
TID Items
1 ABC
2 ABCD
3 BCE
4 ACDE
5 DE
Minimum support = 2
(Itemset lattice over {A, B, C, D, E}, each node annotated with the TIDs that contain it, e.g. A: 124, B: 123, C: 1234, D: 245, E: 345, AC: 124, CD: 24, ABC: 12, ACDE: 4; a closed itemset is one whose TID-list differs from that of every immediate superset.)
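A brute-force sketch of the closed-itemset definition, checked against this example (the function name is illustrative):

```python
from itertools import combinations

def closed_itemsets(transactions, minsup_count):
    """Count every itemset, then keep X if no immediate superset
    X ∪ {i} has the same support as X."""
    items = sorted(set().union(*transactions))
    sup = {}
    for k in range(1, len(items) + 1):
        for s in combinations(items, k):
            c = sum(1 for t in transactions if set(s) <= t)
            if c >= minsup_count:
                sup[frozenset(s)] = c
    return {X for X, c in sup.items()
            if not any(sup.get(X | {i}) == c for i in items if i not in X)}

# The five transactions from the closed-itemset slide, minimum support 2:
D = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
closed = closed_itemsets(D, 2)
print(len(closed))   # 9 closed frequent itemsets
```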
Drawbacks of support and confidence

Theatre ⇒ Cinema (15%, 75%)
P(Cinema) = 80% > 75%, so the rule's confidence is below the baseline probability of Cinema
Correlation
We need a measure of dependent or correlated events
If corr_{A,B} = 1: independence
If corr_{A,B} > 1: positive correlation
If corr_{A,B} < 1: negative correlation

corr_{A,B} = P(A ∧ B) / (P(A) · P(B))
Example

Corr = 0.15 / (0.2 · 0.8) = 0.9375 < 1
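The computation, as a one-line sketch:

```python
def corr(p_ab, p_a, p_b):
    """corr_{A,B} = P(A ∧ B) / (P(A) · P(B)); 1 means independence."""
    return p_ab / (p_a * p_b)

# Theatre => Cinema slide: support 15%, P(Theatre) = 20%, P(Cinema) = 80%
print(round(corr(0.15, 0.20, 0.80), 4))   # 0.9375, a mild negative correlation
```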
Simpson's Paradox

TV ⇒ Radio (conf = 99/180 = 55%)
No TV ⇒ Radio (conf = 54/120 = 45%)
Corr(TV, Radio) = 1.07
Positive correlation between TV and radio
Simpson's Paradox (minors)

TV ⇒ Radio (conf = 1/10 = 10%)
No TV ⇒ Radio (conf = 4/34 = 11.8%)
Corr(TV, Radio) = 0.88
Simpson's Paradox (adults)

TV ⇒ Radio (conf = 98/170 = 57.7%)
No TV ⇒ Radio (conf = 50/86 = 58.1%)
Corr(TV, Radio) = 0.60
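The reversal across these slides can be verified with simple arithmetic on the given counts:

```python
def conf(hits, total):
    """Confidence of the rule, expressed as hits / total."""
    return hits / total

# Counts from the Simpson's-paradox slides (radio listeners / group size).
# Within each subgroup, TV viewers listen to the radio LESS often...
assert conf(1, 10) < conf(4, 34)        # minors: 10.0% < 11.8%
assert conf(98, 170) < conf(50, 86)     # adults: 57.7% < 58.1%

# ...but in the aggregated data the direction reverses:
assert conf(1 + 98, 10 + 170) > conf(4 + 50, 34 + 86)   # 55% > 45%
print("Simpson's paradox reproduced")
```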