27
4. ASSOCIATION ANALYSIS: BASIC CONCEPTS & ALGORITHMS This chapter presents a methodology known as association analysis, which is useful for discovering interesting relationships hidden in large data sets. The uncovered relationships can be represented in the form of association rules or sets of frequent items. For example, the following rule can be extracted from the data set shown in Table 6.1: Problem Definition Binary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds to a transaction and each column corresponds to an item. An item can be treated as a binary variable whose value is one if the item is present in a transaction and zero otherwise. Because the presence of an item in a transaction is often considered more important than its absence, an item is an asymmetric binary variable. 1

jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

  • Upload
    vodat

  • View
    213

  • Download
    0

Embed Size (px)

Citation preview

Page 1: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

4. ASSOCIATION ANALYSIS: BASIC CONCEPTS & ALGORITHMS

This chapter presents a methodology known as association analysis, which is useful for discovering interesting relationships hidden in large data sets. The uncovered relationships can be represented in the form of association rules or sets of frequent items. For example, the following rule can be extracted from the data set shown in Table 6.1:

Problem Definition

Binary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds to a transaction and each column corresponds to an item. An item can be treated as a binary variable whose value is one if the item is present in a transaction and zero otherwise. Because the presence of an item in a transaction is often considered more important than its absence, an item is an asymmetric binary variable.

1

Page 2: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

2

Page 3: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

6.2 Frequent Itemset Generation

3

Page 4: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation

The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties

Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

4

Page 5: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

5

Page 6: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

Table 5.1

Example 5.3 Apriori. Let’s look at a concrete example, based on the AllElectronics transaction database, D, of Table 5.1. There are nine transactions in this database, that is, |D |= 9. We useFigure 5.2 to illustrate the Apriori algorithm for finding frequent itemsets in D.

6

Page 7: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.

2. Suppose that the minimum support count required is 2, that is, min sup = 2. (Here, we are referring to absolute support because we are using a support count. The corresponding relative support is 2/9 = 22%). The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.

3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join to

generate a candidate set of 2-itemsets, C2. C2 consists of 2-itemsets. Note that no candidates are removed fromC2 during the prune step because each subset of the candidates is also frequent.

4. Next, the transactions in D are scanned and the support count of each candidate itemset inC2 is accumulated, as shown in the middle table of the second row in Figure 5.2.

5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.

7

Page 8: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support (Figure 5.2).

8

Page 9: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

9

Page 10: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

10

Page 11: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

11

Page 12: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

12

Page 13: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

13

Page 14: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

14

Page 15: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

15

Page 16: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

16

Page 17: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

17

Page 18: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

18

Page 19: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

19

Page 20: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

20

Page 21: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

6.6 Mining Frequent Itemsets without Candidate Generation (FP Growth)

As we have seen, in many cases the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good performance gain. However, it can suffer from two nontrivial costs:

21

Page 22: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

22

Page 23: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

23

Page 24: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

24

Page 25: jntuabookblog.files.wordpress.com file · Web viewBinary Representation Market basket data can be represented in a binary format as shown in Table 6.2, where each row corresponds

25